kb-cleaner

Clean and normalize messy knowledge base documents into polished, RAG-friendly markdown — fixes headings, lists, tables, and conversion artifacts, with a conservative typo pass.

Overview

kb-cleaner takes raw or messy knowledge-base documents and produces a polished, cleaned markdown file following a consistent rule set. The output is designed to be both human-readable and friendly to retrieval/RAG pipelines: predictable structure, aggressive cleanup on formatting, conservative edits on meaning.

The skill supports Bahasa Indonesia, Bahasa Melayu, and English content, and is careful about language-specific typo rules — it confirms the source language with you before running the typo pass when the document mixes languages or contains religious, legal, medical, or scientific terms-of-art.

Inputs Handled

Input    How it’s processed
.md      Read directly
.pdf     Text extraction via the pdf skill, including OCR of embedded images
.pptx    Slide text, speaker notes, and image OCR in slide order via the pptx skill
.docx    Converted to markdown via the docx skill
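
The routing in the table above can be pictured as a lookup on the file extension. This is only a minimal sketch: `ROUTES` and `pick_route` are hypothetical names, and the route strings are descriptive labels, not real skill invocations.

```python
from pathlib import Path

# Hypothetical mapping from extension to the processing route listed above.
# The values are human-readable labels for illustration, not callable skills.
ROUTES = {
    ".md": "read directly",
    ".pdf": "pdf skill (text extraction + OCR)",
    ".pptx": "pptx skill (slide text, notes, image OCR)",
    ".docx": "docx skill (convert to markdown)",
}


def pick_route(path: str) -> str:
    """Return the route label for a file, matching the extension case-insensitively."""
    ext = Path(path).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"unsupported input type: {ext or '(none)'}")
    return ROUTES[ext]
```

Anything outside the four listed extensions raises an error in this sketch; how the skill itself treats other file types is not stated here.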

Cleaning Passes

Applied in order:

  1. Case normalization — ALL-UPPERCASE words and headings to proper case, with language-appropriate title-case rules
  2. Heading hierarchy — enforce H1 title → H2 topics → H3 subtopics, no skipped levels, no invented topics
  3. Paragraph reflow — remove line breaks that split sentences; strip docx conversion artifacts
  4. Markdown artifact cleanup — fix 1\. escapes, backtick-wrapped apostrophes, double spaces, missing spaces after list dots
  5. List normalization — only - and 1. lists are allowed; converts non-standard enumerations such as i), a), (1), and unicode bullets, and auto-fixes bullets that are clearly sequences (Pertama, Kemudian, Lalu / First, Then, Finally)
  6. Table normalization — converts HTML/ASCII tables to padded, left-aligned pipe tables; 2-column key-value tables become definition lists for better RAG chunking
  7. Conservative typo pass — fixes only clear typos; never silently changes proper nouns, terms-of-art, product names/prices, or transliteration variants
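
To make passes 4 and 5 concrete, here is a rough sketch of what the marker-level fixes might look like with regular expressions. It is an illustration under assumptions, not the skill's actual rules: the function names and patterns are hypothetical, nesting is ignored, and the sequence-word detection (Pertama, Kemudian, Lalu) is not covered.

```python
import re

# Hypothetical patterns; the skill's real rule set may differ.
BULLET = re.compile(r"^(\s*)[•◦▪‣*+]\s+")
ENUM = re.compile(r"^(\s*)\(?([0-9]+|[a-z]|[ivx]+)\)\s+", re.IGNORECASE)


def clean_artifacts(line: str) -> str:
    """Pass 4 (partial): escaped list dots, missing spaces, double spaces."""
    # Undo escaped list dots: "1\." -> "1."
    line = re.sub(r"^(\s*\d+)\\\.", r"\1.", line)
    # Add the missing space after a list dot: "2.Item" -> "2. Item"
    # (a following digit is excluded so "3.14" is left alone)
    line = re.sub(r"^(\s*\d+\.)(?=[^\s\d])", r"\1 ", line)
    # Collapse runs of spaces after the leading indentation
    indent = len(line) - len(line.lstrip(" "))
    return line[:indent] + re.sub(r" {2,}", " ", line[indent:])


def normalize_list(lines: list[str]) -> list[str]:
    """Pass 5 (partial): only "-" and "1." style markers survive."""
    out, counter = [], 0
    for line in lines:
        m = ENUM.match(line)
        if m:
            # "a)", "(1)", "iv)" become sequential "1.", "2.", ...
            counter += 1
            out.append(f"{m.group(1)}{counter}. {line[m.end():]}")
        else:
            # Any other line ends the current numbered run
            counter = 0
            out.append(BULLET.sub(r"\1- ", line))
    return out
```

For example, `normalize_list(["a) one", "b) two", "• three"])` yields `["1. one", "2. two", "- three"]`. A production version would also need to skip fenced code blocks and respect nested indentation.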

Every run ends with a verification report: what was restructured, which typos were corrected, and anything flagged but deliberately left unchanged — so you can sanity-check the edits before the file enters a retrieval pipeline.
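
One way to picture the three parts of that report is as a small record with one bucket per category. The class below is a hypothetical sketch; the field names and the rendered layout are assumptions, not the skill's actual report format.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationReport:
    """Hypothetical container for the three report sections described above."""
    restructured: list[str] = field(default_factory=list)
    typos_fixed: list[tuple[str, str]] = field(default_factory=list)  # (before, after)
    flagged_unchanged: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Each section shows its item count, then one indented line per item.
        out = ["Verification report"]
        out.append(f"Restructured ({len(self.restructured)})")
        out += [f"  - {item}" for item in self.restructured]
        out.append(f"Typos corrected ({len(self.typos_fixed)})")
        out += [f"  - {old} -> {new}" for old, new in self.typos_fixed]
        out.append(f"Flagged, left unchanged ({len(self.flagged_unchanged)})")
        out += [f"  - {item}" for item in self.flagged_unchanged]
        return "\n".join(out)
```

Keeping "flagged but left unchanged" as its own bucket mirrors the conservative typo pass: anything the cleaner was unsure about is surfaced for review rather than silently edited.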

Output

  • Filename: <original name> Cleaned.md, saved next to the input file
  • A mandatory verification report summarizing all changes
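
The naming rule can be sketched with `pathlib`. This assumes "<original name>" means the input filename without its extension; `cleaned_path` is a hypothetical helper, not part of the skill.

```python
from pathlib import Path


def cleaned_path(input_path: str) -> Path:
    """Build "<original name> Cleaned.md" in the same folder as the input.

    Assumption: the original name is the filename minus its extension.
    """
    src = Path(input_path)
    return src.with_name(f"{src.stem} Cleaned.md")
```

Under that assumption, `Refund Policy.docx` maps to `Refund Policy Cleaned.md` in the same directory, and a `.md` input is never overwritten because the output name always differs from the input name.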

How to Access

  1. Download the kb-cleaner.skill package from the shared Drive link
  2. Import it into Claude Skills (or extract it and follow SKILL.md)
  3. Run it on any KB file that needs polishing

Example Prompts

Clean this KB: [attach messy .docx or .pdf]
Extract and clean this PDF into a KB — the content is in Bahasa Indonesia.
Fix the list and table formatting in this markdown file, but don't touch the Arabic transliterations.