---
name: "kb-cleaner"
description: "Clean and normalize messy knowledge base documents into polished, RAG-friendly markdown — fixes headings, lists, tables, and conversion artifacts, with a conservative typo pass."
type: skill
roles: ["Engineering", "Customer Success"]
activities: ["Knowledge Management", "Data Quality"]
tags: ["Knowledge Base", "Cleanup", "Data Quality", "RAG", "Markdown"]
owner: "Yayan Adipraja"
sourceUrl: "https://drive.google.com/file/d/1qe4VotwK5GCCbUkiAbo3q1gELebEPY9b/view"
---

# kb-cleaner

> Clean and normalize messy knowledge base documents into polished, RAG-friendly markdown — fixes headings, lists, tables, and conversion artifacts, with a conservative typo pass.

## Overview

kb-cleaner takes raw or messy knowledge-base documents and produces a polished, cleaned markdown file following a consistent rule set. The output is designed to be both human-readable and friendly to retrieval/RAG pipelines: predictable structure, aggressive cleanup on formatting, conservative edits on meaning.

The skill supports Bahasa Indonesia, Bahasa Melayu, and English content and applies typo rules specific to each language. When a document mixes languages or contains religious, legal, medical, or scientific terms of art, it confirms the source language with you before running the typo pass.

## Inputs Handled

| Input | How it's processed |
| --- | --- |
| `.md` | Read directly |
| `.pdf` | Text extraction via the `pdf` skill, including OCR of embedded images |
| `.pptx` | Slide text, speaker notes, and image OCR in slide order via the `pptx` skill |
| `.docx` | Converted to markdown via the `docx` skill |
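
The routing in the table can be pictured as a lookup on the file extension. This is only an illustrative sketch: `HANDLERS` and `pick_handler` are hypothetical names, not part of the skill package, and the skill's real dispatch logic may differ.

```python
from pathlib import Path

# Mirrors the table above. The route descriptions are labels for illustration,
# not the skill's actual internals.
HANDLERS = {
    ".md": "read directly",
    ".pdf": "pdf skill: text extraction plus OCR of embedded images",
    ".pptx": "pptx skill: slide text, speaker notes, image OCR in slide order",
    ".docx": "docx skill: convert to markdown",
}


def pick_handler(path: str) -> str:
    """Return the processing route for a file based on its extension."""
    suffix = Path(path).suffix.lower()  # treat .PDF and .pdf the same
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported input type: {suffix or '(no extension)'}")
    return HANDLERS[suffix]
```

Rejecting unknown extensions up front is a design choice of this sketch; the source does not say how the skill reacts to other file types.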

## Cleaning Passes

Applied in order:

1. **Case normalization** — ALL-UPPERCASE words and headings to proper case, with language-appropriate title-case rules
2. **Heading hierarchy** — enforce H1 title → H2 topics → H3 subtopics, no skipped levels, no invented topics
3. **Paragraph reflow** — remove line breaks that split sentences; strip docx conversion artifacts
4. **Markdown artifact cleanup** — fix `1\.` escapes, backtick-wrapped apostrophes, double spaces, missing spaces after list dots
5. **List normalization** — only `-` and `1.` lists allowed; converts non-standard enumerations (`i)`, `a)`, `(1)`, unicode bullets) and auto-fixes bullets that are clearly sequences (*Pertama, Kemudian, Lalu* / *First, Then, Finally*)
6. **Table normalization** — converts HTML/ASCII tables to padded, left-aligned pipe tables; 2-column key-value tables become definition lists for better RAG chunking
7. **Conservative typo pass** — fixes only clear typos; never silently changes proper nouns, terms-of-art, product names/prices, or transliteration variants
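
To make passes 4 and 5 more concrete, here is a minimal line-level sketch in Python. It assumes the artifacts look exactly like the examples listed above; the function names and regular expressions are hypothetical, not the rules shipped in `SKILL.md`, and it does not attempt the sequence-word detection (*Pertama, Kemudian, Lalu*).

```python
import re

# Unicode bullets at the start of a line, with optional indentation.
BULLET = re.compile(r"^(\s*)[•·▪◦‣]\s+")
# Non-standard enumerations such as "i)", "a)", "(1)".
ENUM = re.compile(r"^(\s*)\(?(?:[ivx]+|[a-z]|\d+)\)\s+", re.IGNORECASE)


def clean_artifacts(line: str) -> str:
    """Pass 4 sketch: undo common conversion artifacts on a single line."""
    line = re.sub(r"^(\s*\d+)\\\.", r"\1.", line)  # "1\." becomes "1."
    line = line.replace("`'`", "'")                # backtick-wrapped apostrophe
    # Add the missing space after a list dot, but leave decimals like "1.5" alone.
    line = re.sub(r"^(\s*\d+\.)(?=[^\s\d])", r"\1 ", line)
    # Collapse runs of spaces inside the line while keeping leading indentation.
    line = re.sub(r"(?<=\S) {2,}", " ", line)
    return line


def normalize_list_marker(line: str) -> str:
    """Pass 5 sketch: map unicode bullets to "-" and odd enumerations to "1."."""
    line = BULLET.sub(r"\1- ", line)
    line = ENUM.sub(r"\g<1>1. ", line)
    return line


def clean_line(line: str) -> str:
    """Run the two sketched passes in the documented order."""
    return normalize_list_marker(clean_artifacts(line))


print(clean_line("1\\.Install  the app"))  # 1. Install the app
print(clean_line("• Pertama"))             # - Pertama
print(clean_line("(1) Buka aplikasi"))     # 1. Buka aplikasi
```

Every converted enumeration becomes `1.` in this sketch because CommonMark renderers take the start number from the first item and ignore the numbers on later items. A real implementation would also need to skip fenced code blocks and tables, which this sketch does not do.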

Every run ends with a **verification report**: what was restructured, which typos were corrected, and anything flagged but deliberately left unchanged — so you can sanity-check the edits before the file enters a retrieval pipeline.
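
One way to picture the report is as three lists rendered to markdown. The sketch below is an assumption about shape only: the class name, field names, and heading text are hypothetical, and the skill's actual report format may look different.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationReport:
    """Hypothetical container for the three kinds of findings described above."""

    restructured: list[str] = field(default_factory=list)
    typos_corrected: list[tuple[str, str]] = field(default_factory=list)  # (before, after)
    flagged_unchanged: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        def section(title: str, items: list[str]) -> list[str]:
            entries = items or ["None"]  # always show the section, even when empty
            return [f"### {title}", *(f"- {entry}" for entry in entries), ""]

        typos = [f"{before} -> {after}" for before, after in self.typos_corrected]
        lines = ["## Verification Report", ""]
        lines += section("Restructured", self.restructured)
        lines += section("Typos corrected", typos)
        lines += section("Flagged, left unchanged", self.flagged_unchanged)
        return "\n".join(lines).rstrip() + "\n"


# Illustrative entries only, not output from a real run.
report = VerificationReport(
    restructured=["Promoted an ALL-CAPS line to an H2 heading"],
    typos_corrected=[("recieve", "receive")],
)
print(report.to_markdown())
```

Keeping the "flagged, left unchanged" section visible even when it is empty makes it obvious that the check ran and found nothing, rather than being skipped.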

## Output

- Filename: `<original name> Cleaned.md`, saved next to the input file
- A mandatory verification report summarizing all changes
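
The naming rule can be sketched with `pathlib`. This assumes `<original name>` means the input filename without its extension, which the source does not state explicitly; `cleaned_path` is a hypothetical helper name.

```python
from pathlib import Path


def cleaned_path(input_path: str) -> Path:
    """Build the output path: same folder, "<stem> Cleaned.md"."""
    src = Path(input_path)
    # Assumption: the original extension is dropped before " Cleaned.md" is appended.
    return src.with_name(f"{src.stem} Cleaned.md")


print(cleaned_path("docs/Refund Policy.docx").name)  # Refund Policy Cleaned.md
```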

## How to Access

1. Download the `kb-cleaner.skill` package from the [shared Drive link](https://drive.google.com/file/d/1qe4VotwK5GCCbUkiAbo3q1gELebEPY9b/view)
2. Import it into Claude Skills (or extract it and follow `SKILL.md`)
3. Run it on any KB file that needs polishing

## Example Prompts

```
Clean this KB: [attach messy .docx or .pdf]
```

```
Extract and clean this PDF into a KB — the content is in Bahasa Indonesia.
```

```
Fix the list and table formatting in this markdown file,
but don't touch the Arabic transliterations.
```
