kb-cleaner
Clean and normalize messy knowledge base documents into polished, RAG-friendly markdown — fixes headings, lists, tables, and conversion artifacts, with a conservative typo pass.
Overview
kb-cleaner takes raw or messy knowledge-base documents and produces a polished, cleaned markdown file following a consistent rule set. The output is designed to be both human-readable and friendly to retrieval/RAG pipelines: predictable structure, aggressive cleanup on formatting, conservative edits on meaning.
The skill supports Bahasa Indonesia, Bahasa Melayu, and English content, and is careful about language-specific typo rules — it confirms the source language with you before running the typo pass when the document mixes languages or contains religious, legal, medical, or scientific terms-of-art.
Inputs Handled
| Input | How it’s processed |
|---|---|
| .md | Read directly |
| .pdf | Text extraction via the pdf skill, including OCR of embedded images |
| .pptx | Slide text, speaker notes, and image OCR in slide order via the pptx skill |
| .docx | Converted to markdown via the docx skill |
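The routing in the table can be pictured as a lookup on the file extension. This is a minimal sketch, not the skill's actual internals: `HANDLERS` and `pick_handler` are hypothetical names, and the handler labels simply repeat the table.

```python
from pathlib import Path

# Hypothetical mapping from extension to the handler named in the table above.
HANDLERS = {
    ".md": "read directly",
    ".pdf": "pdf skill",
    ".pptx": "pptx skill",
    ".docx": "docx skill",
}


def pick_handler(filename: str) -> str:
    """Return the handler label for a supported input, else raise ValueError."""
    suffix = Path(filename).suffix.lower()  # treat ".PDF" and ".pdf" alike
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported input type: {suffix or '(none)'}")
    return HANDLERS[suffix]
```

Lower-casing the suffix is a design assumption here; the source does not say how mixed-case extensions are treated.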
Cleaning Passes
Applied in order:
- Case normalization — ALL-UPPERCASE words and headings to proper case, with language-appropriate title-case rules
- Heading hierarchy — enforce H1 title → H2 topics → H3 subtopics, no skipped levels, no invented topics
- Paragraph reflow — remove line breaks that split sentences; strip docx conversion artifacts
- Markdown artifact cleanup — fix `1\.` escapes, backtick-wrapped apostrophes, double spaces, missing spaces after list dots
- List normalization — only `-` and `1.` lists allowed; converts non-standard enumerations (`i)`, `a)`, `(1)`, unicode bullets) and auto-fixes bullets that are clearly sequences (Pertama, Kemudian, Lalu / First, Then, Finally)
- Table normalization — converts HTML/ASCII tables to padded, left-aligned pipe tables; 2-column key-value tables become definition lists for better RAG chunking
- Conservative typo pass — fixes only clear typos; never silently changes proper nouns, terms-of-art, product names/prices, or transliteration variants
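To make two of these passes concrete, here is a rough sketch of markdown artifact cleanup and list normalization using regular expressions. The function names and patterns are illustrative assumptions, not the skill's real rules. The sketch ignores nested lists and does not attempt the sequence-word detection (Pertama, Kemudian, Lalu).

```python
import re

# Hypothetical patterns; the skill's actual rule set may differ.
ENUM_MARKER = re.compile(r"^(\s*)\(?(?:\d+|[a-z]|[ivx]+)\)\s+", re.IGNORECASE)
BULLET_MARKER = re.compile(r"^(\s*)[•◦▪‣]\s*")


def clean_artifacts(line: str) -> str:
    """Apply a few of the markdown artifact fixes to a single line."""
    line = re.sub(r"^(\s*\d+)\\\.", r"\1.", line)  # "1\." -> "1."
    line = re.sub(r"(\w)`'`(\w)", r"\1'\2", line)  # backtick-wrapped apostrophe
    # Add the missing space after a list dot, but only before a letter,
    # so a decimal such as "1.5" at line start is left alone.
    line = re.sub(r"^(\s*\d+\.)(?=[^\W\d])", r"\1 ", line)
    line = re.sub(r"(?<=\S) {2,}", " ", line)  # collapse runs of spaces, keep indent
    return line


def normalize_list(lines: list[str]) -> list[str]:
    """Rewrite i) / a) / (1) markers as 1., 2., ... and unicode bullets as '-'."""
    out, counter = [], 0
    for line in lines:
        enum = ENUM_MARKER.match(line)
        bullet = BULLET_MARKER.match(line)
        if enum:
            counter += 1  # renumber consecutive enumerated items
            line = f"{enum.group(1)}{counter}. {line[enum.end():]}"
        elif bullet:
            counter = 0
            line = f"{bullet.group(1)}- {line[bullet.end():]}"
        else:
            counter = 0  # any other line ends the current numbered run
        out.append(line)
    return out
```

Running `clean_artifacts` before `normalize_list` mirrors the pass order in the list above.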
Every run ends with a verification report: what was restructured, which typos were corrected, and anything flagged but deliberately left unchanged — so you can sanity-check the edits before the file enters a retrieval pipeline.
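One way to picture the three parts of that report is a small data structure. The sketch below is an assumption about shape only: the class name, field names, and rendered layout are hypothetical and not taken from the skill.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationReport:
    """Hypothetical container for the three report sections described above."""

    restructured: list[str] = field(default_factory=list)
    typos_fixed: list[tuple[str, str]] = field(default_factory=list)  # (before, after)
    flagged_unchanged: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render each section as a '-' list, writing '(none)' for empty sections."""
        lines = ["Restructured:"]
        lines += [f"- {item}" for item in self.restructured] or ["- (none)"]
        lines.append("Typos corrected:")
        lines += [f"- {old} -> {new}" for old, new in self.typos_fixed] or ["- (none)"]
        lines.append("Flagged, left unchanged:")
        lines += [f"- {item}" for item in self.flagged_unchanged] or ["- (none)"]
        return "\n".join(lines)
```

Keeping "flagged but unchanged" as its own section matches the conservative typo pass: items the cleaner was unsure about are surfaced for review rather than edited.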
Output
- Filename: `<original name> Cleaned.md`, saved next to the input file
- A non-optional verification report summarizing all changes
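The naming rule can be sketched with `pathlib`. This assumes `<original name>` means the input filename without its extension, which the source does not state explicitly; `cleaned_path` is a hypothetical helper.

```python
from pathlib import Path


def cleaned_path(input_path: str) -> Path:
    """Build '<original name> Cleaned.md' in the same folder as the input."""
    src = Path(input_path)
    # Assumption: the original extension is dropped before appending " Cleaned.md".
    return src.with_name(f"{src.stem} Cleaned.md")
```

For example, `cleaned_path("kb/FAQ Produk.docx")` yields a path named `FAQ Produk Cleaned.md` inside `kb`.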
How to Access
- Download the `kb-cleaner.skill` package from the shared Drive link
- Import it into Claude Skills (or extract it and follow `SKILL.md`)
- Run it on any KB file that needs polishing
Example Prompts
- Clean this KB: [attach messy .docx or .pdf]
- Extract and clean this PDF into a KB — the content is in Bahasa Indonesia.
- Fix the list and table formatting in this markdown file, but don't touch the Arabic transliterations.