supermarkdown

High-performance HTML to Markdown converter with full GitHub Flavored Markdown support. Written in Rust, available for Node.js and as a native Rust crate.

Features

Fast - Written in Rust with O(n) algorithms, significantly faster than JavaScript alternatives
Full GFM Support - Tables with alignment, strikethrough, autolinks, fenced code blocks
Accurate - Handles malformed HTML gracefully via html5ever
Configurable - Multiple heading styles, link styles, custom selectors
Zero Dependencies - Single native binary, no JavaScript runtime overhead
Cross-Platform - Pre-built binaries for Windows, macOS, and Linux (x64 & ARM64)
TypeScript Ready - Full type definitions included
Async Support - Non-blocking conversion for large documents

Installation

Node.js

npm install @vakra-dev/supermarkdown

Rust

cargo add supermarkdown

CLI

Install the CLI binary via cargo:

cargo install supermarkdown-cli

Command Line Usage

The CLI allows you to convert HTML files from the command line or via stdin:

# Convert a file
supermarkdown page.html > page.md

# Pipe HTML from curl
curl -s https://example.com | supermarkdown

# Exclude navigation and ads
supermarkdown --exclude "nav,.ad,#sidebar" page.html

# Use setext-style headings and referenced links
supermarkdown --heading-style setext --link-style referenced page.html

CLI Options

| Option | Description |
| --- | --- |
| `-h, --help` | Print help message |
| `-v, --version` | Print version |
| `--heading-style <STYLE>` | `atx` (default) or `setext` |
| `--link-style <STYLE>` | `inline` (default) or `referenced` |
| `--code-fence <CHAR>` | `` ` `` (default) or `~` |
| `--bullet <CHAR>` | `-` (default), `*`, or `+` |
| `--exclude <SELECTORS>` | CSS selectors to exclude (comma-separated) |

Quick Start

import { convert } from "@vakra-dev/supermarkdown";

const html = `
  <h1>Hello World</h1>
  <p>This is a <strong>test</strong> with a <a href="https://example.com">link</a>.</p>
`;

const markdown = convert(html);
console.log(markdown);
// # Hello World
//
// This is a **test** with a [link](https://example.com).

Common Use Cases

Cleaning Web Scrapes

When scraping websites, HTML often contains navigation, ads, and other non-content elements. Use selectors to extract only what you need:

import { convert } from "@vakra-dev/supermarkdown";

// Raw HTML from a web scrape (any HTTP client works; fetch is shown here)
const scrapedHtml = await fetch("https://example.com/article").then((r) => r.text());

// Clean conversion - remove nav, ads, sidebars
const markdown = convert(scrapedHtml, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    ".sidebar",
    ".advertisement",
    ".cookie-banner",
    ".social-share",
    ".comments",
    "script",
    "style",
  ],
});

Preparing Content for LLMs

When feeding web content to LLMs, you want clean, focused text without HTML artifacts:

import { convert } from "@vakra-dev/supermarkdown";

// Extract just the article content for RAG pipelines
const markdown = convert(html, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    "aside",
    ".related-posts",
    ".author-bio",
  ],
  includeSelectors: ["article", ".post-content", "main"],
});

// Now feed to your LLM (`llm` stands in for whatever client you use)
const response = await llm.chat({
  messages: [
    {
      role: "user",
      content: `Summarize this article:\n\n${markdown}`,
    },
  ],
});

Processing Blog Posts

Convert blog HTML while preserving code blocks and formatting:

import { convert } from "@vakra-dev/supermarkdown";

const blogHtml = `
<article>
  <h1>Getting Started with Rust</h1>
  <p>Rust is a systems programming language focused on safety.</p>
  <pre><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre>
  <p>The <code>println!</code> macro prints to stdout.</p>
</article>
`;

const markdown = convert(blogHtml);
// Output:
// # Getting Started with Rust
//
// Rust is a systems programming language focused on safety.
//
// ```rust
// fn main() {
//     println!("Hello, world!");
// }
// ```
//
// The `println!` macro prints to stdout.

Converting Documentation Pages

Handle tables, definition lists, and nested structures common in docs:

import { convert } from "@vakra-dev/supermarkdown";

const docsHtml = `
<h2>API Reference</h2>
<table>
  <tr><th>Method</th><th>Description</th></tr>
  <tr><td><code>convert()</code></td><td>Sync conversion</td></tr>
  <tr><td><code>convertAsync()</code></td><td>Async conversion</td></tr>
</table>
<dl>
  <dt>headingStyle</dt>
  <dd>ATX (#) or Setext (underlines)</dd>
</dl>
`;

const markdown = convert(docsHtml);
// Output:
// ## API Reference
//
// | Method | Description |
// | --- | --- |
// | `convert()` | Sync conversion |
// | `convertAsync()` | Async conversion |
//
// headingStyle
// :   ATX (#) or Setext (underlines)

Batch Processing

Process multiple documents efficiently with async conversion:

import { convertAsync } from "@vakra-dev/supermarkdown";

const urls = [
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3",
];

// Fetch and convert in parallel
const markdownDocs = await Promise.all(
  urls.map(async (url) => {
    const html = await fetch(url).then((r) => r.text());
    return convertAsync(html, {
      excludeSelectors: ["nav", "footer"],
    });
  })
);
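
`Promise.all` as used above starts every fetch and conversion at once, which can be heavy for long URL lists. The following is a minimal sketch of a bounded-concurrency helper. `mapWithLimit` is a hypothetical name and not part of supermarkdown; the worker is passed in, so the helper assumes nothing about the library beyond `convertAsync` returning a promise.

```javascript
// Hypothetical helper (not part of supermarkdown): run an async worker over
// `items` with at most `limit` calls in flight, preserving input order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results;
}
```

With the batch example above, the worker would be the same fetch-then-`convertAsync` callback passed to `urls.map`, for instance `mapWithLimit(urls, 4, worker)`.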

Usage

Basic Conversion

import { convert } from "@vakra-dev/supermarkdown";

const markdown = convert("<h1>Title</h1><p>Paragraph</p>");

With Options

import { convert } from "@vakra-dev/supermarkdown";

const markdown = convert(html, {
  headingStyle: "setext", // 'atx' (default) or 'setext'
  linkStyle: "referenced", // 'inline' (default) or 'referenced'
  excludeSelectors: ["nav", ".sidebar", "#ads"],
  includeSelectors: [".important"], // Override excludes for specific elements
});

Async Conversion

For large documents, use convertAsync to avoid blocking the main thread:

import { convertAsync } from "@vakra-dev/supermarkdown";

const markdown = await convertAsync(largeHtml);

// Process multiple documents in parallel
const results = await Promise.all([
  convertAsync(html1),
  convertAsync(html2),
  convertAsync(html3),
]);

API Reference

`convert(html, options?)`

Converts HTML to Markdown synchronously.

Parameters:

html (string) - The HTML string to convert
options (object, optional) - Conversion options

Returns: string - The converted Markdown

`convertAsync(html, options?)`

Converts HTML to Markdown asynchronously.

Parameters:

html (string) - The HTML string to convert
options (object, optional) - Conversion options

Returns: Promise<string> - A promise that resolves to the converted Markdown

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `headingStyle` | `'atx'` \| `'setext'` | `'atx'` | ATX uses `#` prefix, Setext uses underlines |
| `linkStyle` | `'inline'` \| `'referenced'` | `'inline'` | Inline: `[text](url)`, Referenced: `[text][1]` |
| `codeFence` | ``'`'`` \| `'~'` | ``'`'`` | Character for fenced code blocks |
| `bulletMarker` | `'-'` \| `'*'` \| `'+'` | `'-'` | Character for unordered list items |
| `baseUrl` | `string` | `undefined` | Base URL for resolving relative links |
| `excludeSelectors` | `string[]` | `[]` | CSS selectors for elements to exclude |
| `includeSelectors` | `string[]` | `[]` | CSS selectors to force keep (overrides excludes) |

Supported Elements

Block Elements

| HTML | Markdown |
| --- | --- |
| `<h1>` - `<h6>` | `#` headings or setext underlines |
| `<p>` | Paragraphs with blank lines |
| `<blockquote>` | `>` quoted blocks (supports nesting) |
| `<ul>`, `<ol>` | `-` or `1.` lists (supports `start` attribute, task lists) |
| `<pre><code>` | Fenced code blocks with language detection |
| `<table>` | GFM tables with alignment and captions |
| `<hr>` | `---` horizontal rules |
| `<dl>`, `<dt>`, `<dd>` | Definition lists |
| `<details>`, `<summary>` | Collapsible sections |
| `<figure>`, `<figcaption>` | Images with captions |

Inline Elements

| HTML | Markdown |
| --- | --- |
| `<a>` | `[text](url)`, `[text][ref]`, or `<url>` (autolink). Falls back to `title`/`aria-label` for empty link text. |
| `<img>` | `![alt](src)`. Base64 `data:` URIs are filtered out. |
| `<strong>`, `<b>` | `**bold**` |
| `<em>`, `<i>` | Italic emphasis |
| `<code>` | `` `code` `` (handles nested backticks) |
| `<del>`, `<s>`, `<strike>` | `~~strikethrough~~` |
| `<sub>` | `<sub>subscript</sub>` |
| `<sup>` | `<sup>superscript</sup>` |
| `<br>` | Line breaks |

HTML Passthrough

Elements without Markdown equivalents are preserved as HTML:

<kbd> - Keyboard input
<mark> - Highlighted text
<abbr> - Abbreviations (preserves title attribute)
<samp> - Sample output
<var> - Variables

Advanced Features

Table Alignment

Extracts alignment from align attribute or text-align style:

<table>
  <tr>
    <th align="left">Left</th>
    <th align="center">Center</th>
    <th align="right">Right</th>
  </tr>
</table>

Output:

| Left | Center | Right |
| :--- | :----: | ----: |

Ordered List Start

Respects the start attribute on ordered lists:

<ol start="5">
  <li>Fifth item</li>
  <li>Sixth item</li>
</ol>

Output:

5. Fifth item
6. Sixth item
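
As a rough illustration of the numbering behaviour above, the sketch below counts up from the `start` value and falls back to 1 when the attribute is missing or not a number. `renderOrderedList` is a hypothetical helper written for this example, not a supermarkdown function.

```javascript
// Illustrative sketch (not supermarkdown's code): number list items beginning
// at the value of the <ol start="..."> attribute, defaulting to 1.
function renderOrderedList(items, startAttr) {
  const parsed = Number.parseInt(startAttr, 10);
  const start = Number.isNaN(parsed) ? 1 : parsed;
  return items.map((text, i) => `${start + i}. ${text}`).join("\n");
}

console.log(renderOrderedList(["Fifth item", "Sixth item"], "5"));
// 5. Fifth item
// 6. Sixth item
```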

Autolinks

When a link's text matches its URL or email, autolink syntax is used:

<a href="https://example.com">https://example.com</a>
<a href="mailto:test@example.com">test@example.com</a>

Output:

<https://example.com>
<test@example.com>
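
The decision described above can be sketched as a small function. This is an assumption about the general shape of the rule, not supermarkdown's implementation, and `renderLink` is a hypothetical name.

```javascript
// Illustrative sketch (not supermarkdown's code): choose autolink syntax when
// the visible text equals the URL, or equals the address of a mailto: link.
function renderLink(text, href) {
  const isUrlMatch = text === href;
  const isEmailMatch = href.startsWith("mailto:") && href.slice(7) === text;
  if (isUrlMatch || isEmailMatch) {
    return `<${text}>`;
  }
  return `[${text}](${href})`;
}

console.log(renderLink("https://example.com", "https://example.com"));
// <https://example.com>
console.log(renderLink("test@example.com", "mailto:test@example.com"));
// <test@example.com>
console.log(renderLink("link", "https://example.com"));
// [link](https://example.com)
```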

Code Block Language Detection

Automatically detects language from class names:

language-* (e.g., language-rust)
lang-* (e.g., lang-python)
highlight-* (e.g., highlight-go)
hljs-* (highlight.js classes, excluding token classes like hljs-keyword)
Bare language names (e.g., javascript, python) as fallback

<pre><code class="language-rust">fn main() {}</code></pre>

Output:

```rust
fn main() {}
```

Code blocks containing backticks automatically use more backticks as delimiters.
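
To make the prefix patterns listed above concrete, here is a minimal sketch that handles only the `language-*`, `lang-*`, and `highlight-*` forms. It is not supermarkdown's code; `detectLanguage` is a hypothetical name, and the `hljs-*` handling and bare-name fallback are deliberately left out because their exact rules are not described here.

```javascript
// Illustrative sketch (not supermarkdown's code): pull a language name out of
// a class attribute using the language-*, lang-*, and highlight-* prefixes.
// The hljs-* handling and bare-name fallback are intentionally left out.
function detectLanguage(classAttr) {
  const prefixes = ["language-", "lang-", "highlight-"];
  for (const token of classAttr.split(/\s+/)) {
    for (const prefix of prefixes) {
      if (token.startsWith(prefix) && token.length > prefix.length) {
        return token.slice(prefix.length);
      }
    }
  }
  return null; // no recognizable language class
}

console.log(detectLanguage("language-rust")); // rust
console.log(detectLanguage("hl lang-python")); // python
console.log(detectLanguage("python-code")); // null
```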

Line Number Handling

Line number gutters are automatically stripped from code blocks. Elements with these class patterns are skipped:

gutter
line-number
line-numbers
lineno
linenumber

URL Encoding

Spaces and parentheses in URLs are automatically percent-encoded:

// <a href="https://example.com/path (1)">link</a>
// → [link](https://example.com/path%20%281%29)
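
The encoding shown above can be reproduced with a short sketch. `encodeLinkDestination` is a hypothetical helper, not a supermarkdown function, and it covers only the spaces and parentheses named above.

```javascript
// Illustrative sketch (not supermarkdown's code): percent-encode the
// characters that would break a Markdown inline link destination.
function encodeLinkDestination(url) {
  return url.replace(/[ ()]/g, (ch) =>
    "%" + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0")
  );
}

console.log(encodeLinkDestination("https://example.com/path (1)"));
// https://example.com/path%20%281%29
```

A targeted replacement is used here because JavaScript's built-in `encodeURI` and `encodeURIComponent` leave parentheses unescaped.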

Selector-Based Filtering

Remove unwanted elements like navigation, ads, or sidebars:

const markdown = convert(html, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    ".sidebar",
    ".advertisement",
    "#cookie-banner",
  ],
  includeSelectors: [".main-content"],
});

Limitations

Some HTML features cannot be fully represented in Markdown:

| Feature | Behavior |
| --- | --- |
| Table colspan/rowspan | Content placed in first cell |
| Nested tables | Inner tables converted inline |
| Form elements | Skipped |
| iframe/video/audio | Skipped (no standard Markdown equivalent) |
| CSS styling | Ignored (except `text-align` for tables) |
| Empty elements | Removed from output |

Edge Cases

supermarkdown handles many edge cases gracefully:

Malformed HTML

Invalid or malformed HTML is parsed via html5ever, which applies browser-like error recovery:

// Missing closing tags, nested issues - all handled
const html = "<p>Unclosed paragraph<div>Mixed<p>nesting</div>";
const markdown = convert(html); // Produces sensible output

Deeply Nested Lists

Nested lists maintain proper indentation:

const html = `
<ul>
  <li>Level 1
    <ul>
      <li>Level 2
        <ul>
          <li>Level 3</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>`;
// Output:
// - Level 1
//   - Level 2
//     - Level 3
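
The indentation rule in the output above (two spaces per nesting level) can be sketched with a small recursive renderer. The `{ text, children }` item shape and the `renderList` name are assumptions made for this example; they are not supermarkdown's internal representation.

```javascript
// Illustrative sketch (not supermarkdown's code): render a nested list where
// each item is { text, children } and every level indents by two spaces.
function renderList(items, depth = 0) {
  const indent = "  ".repeat(depth);
  return items
    .map((item) => {
      const line = `${indent}- ${item.text}`;
      const children = item.children && item.children.length > 0
        ? "\n" + renderList(item.children, depth + 1)
        : "";
      return line + children;
    })
    .join("\n");
}

const tree = [
  { text: "Level 1", children: [
    { text: "Level 2", children: [{ text: "Level 3" }] },
  ] },
];
console.log(renderList(tree));
// - Level 1
//   - Level 2
//     - Level 3
```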

Code Blocks with Backticks

When code contains backticks, the fence automatically uses more backticks:

const html = "<pre><code>Use `backticks` for code</code></pre>";
// Output uses 4 backticks as fence:
// ````
// Use `backticks` for code
// ````

Empty Elements

Empty paragraphs, divs, and spans are stripped to avoid blank lines:

const html = "<p></p><p>Real content</p><p>   </p>";
const markdown = convert(html);
// Output: "Real content" (empty paragraphs removed)

Special Characters in URLs

Spaces, parentheses, and other special characters in URLs are percent-encoded:

const html = '<a href="https://example.com/file (1).pdf">Download</a>';
// Output: [Download](https://example.com/file%20%281%29.pdf)

Tables Without Headers

Tables missing <thead> use the first row as header:

const html = `
<table>
  <tr><td>A</td><td>B</td></tr>
  <tr><td>1</td><td>2</td></tr>
</table>`;
// Output:
// | A | B |
// | --- | --- |
// | 1 | 2 |
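
The first-row-as-header behaviour above can be sketched as follows. `rowsToGfmTable` is a hypothetical helper operating on plain arrays of cell text, not part of supermarkdown.

```javascript
// Illustrative sketch (not supermarkdown's code): turn rows of cell text into
// a GFM table, promoting the first row to the header when none is marked.
function rowsToGfmTable(rows) {
  if (rows.length === 0) return "";
  const [header, ...body] = rows;
  const formatRow = (cells) => `| ${cells.join(" | ")} |`;
  const delimiter = formatRow(header.map(() => "---"));
  return [formatRow(header), delimiter, ...body.map(formatRow)].join("\n");
}

console.log(rowsToGfmTable([["A", "B"], ["1", "2"]]));
// | A | B |
// | --- | --- |
// | 1 | 2 |
```

GFM requires a header row, which is why a table without one needs some row promoted.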

Mixed Content in Lists

List items with mixed block/inline content are handled:

const html = `
<ul>
  <li>Simple item</li>
  <li>
    <p>Paragraph in list</p>
    <pre><code>code block</code></pre>
  </li>
</ul>`;
// Outputs proper markdown with preserved formatting

Troubleshooting

Empty or Minimal Output

Problem: convert() returns an empty string or very little content.

Causes & Solutions:

Content is in excluded elements - Check whether your content sits inside nav, header, or another element matched by your excludeSelectors
```
// Try without selectors first
const markdown = convert(html);
```
JavaScript-rendered content - supermarkdown converts static HTML only. If the page uses client-side rendering, you need to render it first (e.g., with Puppeteer or Playwright)
Content in iframes - iframe content is not extracted. Fetch iframe src separately if needed

Missing Code Block Language

Problem: Code blocks don't have language annotation.

Solution: supermarkdown looks for language-*, lang-*, highlight-*, and hljs-* class patterns, with bare language names (e.g., python) as a fallback. Ensure your HTML uses one of these naming conventions:

<!-- Detected -->
<pre><code class="language-python">...</code></pre>
<pre><code class="lang-js">...</code></pre>

<!-- Not detected -->
<pre><code class="python-code">...</code></pre>

Tables Not Rendering Correctly

Problem: Tables appear as plain text or are malformed.

Causes & Solutions:

Missing table structure - Ensure proper <table>, <tr>, <td> structure
Nested tables - GFM doesn't support nested tables; inner tables are flattened
colspan/rowspan - These are not supported in GFM; content goes in first cell

Links Missing or Broken

Problem: Links don't appear or have wrong URLs.

Solutions:

Relative URLs - Use baseUrl option to resolve relative links:
```
convert(html, { baseUrl: "https://example.com" });
```
Links in excluded elements - Navigation links are often in <nav> which may be excluded
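
To see what a given `baseUrl` should produce for relative links, you can check the resolution with the standard WHATWG `URL` class that ships with Node.js and browsers. This sketch assumes supermarkdown follows the usual URL resolution rules, which is not confirmed here; `resolveHref` is a hypothetical helper.

```javascript
// Illustrative sketch: standard URL resolution with the WHATWG URL class
// (built into Node.js and browsers). How supermarkdown resolves links
// internally is not documented here; this only shows the expected semantics.
function resolveHref(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return href; // leave unparseable values untouched
  }
}

console.log(resolveHref("/docs/intro", "https://example.com/blog/post"));
// https://example.com/docs/intro
console.log(resolveHref("images/a.png", "https://example.com/blog/post"));
// https://example.com/blog/images/a.png
console.log(resolveHref("https://other.org/x", "https://example.com"));
// https://other.org/x
```

Under these rules, page-relative links resolve against the directory of the base URL, so passing the full page URL rather than only the origin may matter.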

Performance Issues with Large Documents

Problem: Conversion is slow for very large HTML files.

Solutions:

Use async - convertAsync() won't block the event loop
Pre-filter HTML - Remove obvious non-content before conversion
Stream processing - For very large docs, consider splitting into sections

Special Characters Appearing Wrong

Problem: Characters like <, >, & appear as entities.

Solution: This is usually correct behavior, since these characters need escaping in Markdown. If you're seeing `&amp;` where you expect `&`, the source HTML may have double-encoded entities.
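
If the source really is double-encoded, one option is to undo a single level of encoding before conversion. The sketch below is a pre-processing idea rather than a supermarkdown feature, and `fixDoubleEncodedEntities` is a hypothetical name. Apply it only when you know the input is double-encoded, since it would also rewrite text that intentionally shows an entity literally.

```javascript
// Illustrative sketch (not part of supermarkdown): undo one level of double
// encoding, e.g. "&amp;lt;" -> "&lt;", before passing HTML to the converter.
function fixDoubleEncodedEntities(html) {
  return html.replace(/&amp;(#\d+|#x[0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/g, "&$1;");
}

console.log(fixDoubleEncodedEntities("Tom &amp;amp; Jerry &amp;lt;3"));
// Tom &amp; Jerry &lt;3
console.log(fixDoubleEncodedEntities("Fish &amp; chips"));
// Fish &amp; chips
```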

Rust Usage

Add to your Cargo.toml:

[dependencies]
supermarkdown = "0.0.6"

use supermarkdown::{convert, convert_with_options, Options, HeadingStyle};

// Basic conversion
let markdown = convert("<h1>Hello</h1>");

// With options
let options = Options::new()
    .heading_style(HeadingStyle::Setext)
    .exclude_selectors(vec!["nav".to_string()]);

let markdown = convert_with_options("<h1>Hello</h1>", &options);

Performance

supermarkdown is designed for high performance:

Single-pass parsing - O(n) HTML traversal
Pre-computed metadata - List indices and CSS selectors computed in one pass
Zero-copy where possible - Minimal string allocations
Native code - No JavaScript runtime overhead

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

# Clone the repository
git clone https://github.com/vakra-dev/supermarkdown.git
cd supermarkdown

# Run tests
cargo test

# Build Node.js bindings
cd crates/supermarkdown-napi
npm install
npm run build

License

MIT License - see LICENSE for details.
