Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

vakra-dev/supermarkdown

Open more actions menu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
44 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

supermarkdown

npm version crates.io License: MIT

High-performance HTML to Markdown converter with full GitHub Flavored Markdown support. Written in Rust, available for Node.js and as a native Rust crate.

Features

  • Fast - Written in Rust with O(n) algorithms, significantly faster than JavaScript alternatives
  • Full GFM Support - Tables with alignment, strikethrough, autolinks, fenced code blocks
  • Accurate - Handles malformed HTML gracefully via html5ever
  • Configurable - Multiple heading styles, link styles, custom selectors
  • Zero Dependencies - Single native binary, no JavaScript runtime overhead
  • Cross-Platform - Pre-built binaries for Windows, macOS, and Linux (x64 & ARM64)
  • TypeScript Ready - Full type definitions included
  • Async Support - Non-blocking conversion for large documents

Installation

Node.js

npm install @vakra-dev/supermarkdown

Rust

cargo add supermarkdown

CLI

Install the CLI binary via cargo:

cargo install supermarkdown-cli

Command Line Usage

The CLI allows you to convert HTML files from the command line or via stdin:

# Convert a file
supermarkdown page.html > page.md

# Pipe HTML from curl
curl -s https://example.com | supermarkdown

# Exclude navigation and ads
supermarkdown --exclude "nav,.ad,#sidebar" page.html

# Use setext-style headings and referenced links
supermarkdown --heading-style setext --link-style referenced page.html

CLI Options

Option Description
-h, --help Print help message
-v, --version Print version
--heading-style <STYLE> atx (default) or setext
--link-style <STYLE> inline (default) or referenced
--code-fence <CHAR> ` (default) or ~
--bullet <CHAR> - (default), *, or +
--exclude <SELECTORS> CSS selectors to exclude (comma-separated)

Quick Start

import { convert } from "@vakra-dev/supermarkdown";

const html = `
  <h1>Hello World</h1>
  <p>This is a <strong>test</strong> with a <a href="https://example.com">link</a>.</p>
`;

const markdown = convert(html);
console.log(markdown);
// # Hello World
//
// This is a **test** with a [link](https://example.com).

Common Use Cases

Cleaning Web Scrapes

When scraping websites, HTML often contains navigation, ads, and other non-content elements. Use selectors to extract only what you need:

import { convert } from "@vakra-dev/supermarkdown";

// Raw HTML from a web scrape
const scrapedHtml = await fetchPage("https://example.com/article");

// Clean conversion - remove nav, ads, sidebars
const markdown = convert(scrapedHtml, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    ".sidebar",
    ".advertisement",
    ".cookie-banner",
    ".social-share",
    ".comments",
    "script",
    "style",
  ],
});

Preparing Content for LLMs

When feeding web content to LLMs, you want clean, focused text without HTML artifacts:

import { convert } from "@vakra-dev/supermarkdown";

// Extract just the article content for RAG pipelines
const markdown = convert(html, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    "aside",
    ".related-posts",
    ".author-bio",
  ],
  includeSelectors: ["article", ".post-content", "main"],
});

// Now feed to your LLM
const response = await llm.chat({
  messages: [
    {
      role: "user",
      content: `Summarize this article:\n\n${markdown}`,
    },
  ],
});

Processing Blog Posts

Convert blog HTML while preserving code blocks and formatting:

import { convert } from "@vakra-dev/supermarkdown";

const blogHtml = `
<article>
  <h1>Getting Started with Rust</h1>
  <p>Rust is a systems programming language focused on safety.</p>
  <pre><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre>
  <p>The <code>println!</code> macro prints to stdout.</p>
</article>
`;

const markdown = convert(blogHtml);
// Output:
// # Getting Started with Rust
//
// Rust is a systems programming language focused on safety.
//
// ```rust
// fn main() {
//     println!("Hello, world!");
// }
// ```
//
// The `println!` macro prints to stdout.

Converting Documentation Pages

Handle tables, definition lists, and nested structures common in docs:

import { convert } from "@vakra-dev/supermarkdown";

const docsHtml = `
<h2>API Reference</h2>
<table>
  <tr><th>Method</th><th>Description</th></tr>
  <tr><td><code>convert()</code></td><td>Sync conversion</td></tr>
  <tr><td><code>convertAsync()</code></td><td>Async conversion</td></tr>
</table>
<dl>
  <dt>headingStyle</dt>
  <dd>ATX (#) or Setext (underlines)</dd>
</dl>
`;

const markdown = convert(docsHtml);
// Output:
// ## API Reference
//
// | Method | Description |
// | --- | --- |
// | `convert()` | Sync conversion |
// | `convertAsync()` | Async conversion |
//
// headingStyle
// :   ATX (#) or Setext (underlines)

Batch Processing

Process multiple documents efficiently with async conversion:

import { convertAsync } from "@vakra-dev/supermarkdown";

const urls = [
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3",
];

// Fetch and convert in parallel
const markdownDocs = await Promise.all(
  urls.map(async (url) => {
    const html = await fetch(url).then((r) => r.text());
    return convertAsync(html, {
      excludeSelectors: ["nav", "footer"],
    });
  })
);

Usage

Basic Conversion

import { convert } from "@vakra-dev/supermarkdown";

const markdown = convert("<h1>Title</h1><p>Paragraph</p>");

With Options

import { convert } from "@vakra-dev/supermarkdown";

const markdown = convert(html, {
  headingStyle: "setext", // 'atx' (default) or 'setext'
  linkStyle: "referenced", // 'inline' (default) or 'referenced'
  excludeSelectors: ["nav", ".sidebar", "#ads"],
  includeSelectors: [".important"], // Override excludes for specific elements
});

Async Conversion

For large documents, use convertAsync to avoid blocking the main thread:

import { convertAsync } from "@vakra-dev/supermarkdown";

const markdown = await convertAsync(largeHtml);

// Process multiple documents in parallel
const results = await Promise.all([
  convertAsync(html1),
  convertAsync(html2),
  convertAsync(html3),
]);

API Reference

convert(html, options?)

Converts HTML to Markdown synchronously.

Parameters:

  • html (string) - The HTML string to convert
  • options (object, optional) - Conversion options

Returns: string - The converted Markdown

convertAsync(html, options?)

Converts HTML to Markdown asynchronously.

Parameters:

  • html (string) - The HTML string to convert
  • options (object, optional) - Conversion options

Returns: Promise - The converted Markdown

Options

Option Type Default Description
headingStyle 'atx' | 'setext' 'atx' ATX uses # prefix, Setext uses underlines
linkStyle 'inline' | 'referenced' 'inline' Inline: [text](url), Referenced: [text][1]
codeFence '`' | '~' '`' Character for fenced code blocks
bulletMarker '-' | '*' | '+' '-' Character for unordered list items
baseUrl string undefined Base URL for resolving relative links
excludeSelectors string[] [] CSS selectors for elements to exclude
includeSelectors string[] [] CSS selectors to force keep (overrides excludes)

Supported Elements

Block Elements

HTML Markdown
<h1> - <h6> # headings or setext underlines
<p> Paragraphs with blank lines
<blockquote> > quoted blocks (supports nesting)
<ul>, <ol> - or 1. lists (supports start attribute, task lists)
<pre><code> Fenced code blocks with language detection
<table> GFM tables with alignment and captions
<hr> --- horizontal rules
<dl>, <dt>, <dd> Definition lists
<details>, <summary> Collapsible sections
<figure>, <figcaption> Images with captions

Inline Elements

HTML Markdown
<a> [text](url), [text][ref], or <url> (autolink). Falls back to title/aria-label for empty link text.
<img> ![alt](src). Base64 data: URIs are filtered out.
<strong>, <b> **bold**
<em>, <i> *italic*
<code> `code` (handles nested backticks)
<del>, <s>, <strike> ~~strikethrough~~
<sub> <sub>subscript</sub>
<sup> <sup>superscript</sup>
<br> Line breaks

HTML Passthrough

Elements without Markdown equivalents are preserved as HTML:

  • <kbd> - Keyboard input
  • <mark> - Highlighted text
  • <abbr> - Abbreviations (preserves title attribute)
  • <samp> - Sample output
  • <var> - Variables

Advanced Features

Table Alignment

Extracts alignment from align attribute or text-align style:

<table>
  <tr>
    <th align="left">Left</th>
    <th align="center">Center</th>
    <th align="right">Right</th>
  </tr>
</table>

Output:

| Left | Center | Right |
| :--- | :----: | ----: |

Ordered List Start

Respects the start attribute on ordered lists:

<ol start="5">
  <li>Fifth item</li>
  <li>Sixth item</li>
</ol>

Output:

5. Fifth item
6. Sixth item

Autolinks

When a link's text matches its URL or email, autolink syntax is used:

<a href="https://example.com">https://example.com</a>
<a href="mailto:test@example.com">test@example.com</a>

Output:

<https://example.com>
<test@example.com>

Code Block Language Detection

Automatically detects language from class names:

  • language-* (e.g., language-rust)
  • lang-* (e.g., lang-python)
  • highlight-* (e.g., highlight-go)
  • hljs-* (highlight.js classes, excluding token classes like hljs-keyword)
  • Bare language names (e.g., javascript, python) as fallback
<pre><code class="language-rust">fn main() {}</code></pre>

Output:

```rust
fn main() {}
```

Code blocks containing backticks automatically use more backticks as delimiters.

Line Number Handling

Line number gutters are automatically stripped from code blocks. Elements with these class patterns are skipped:

  • gutter
  • line-number
  • line-numbers
  • lineno
  • linenumber

URL Encoding

Spaces and parentheses in URLs are automatically percent-encoded:

// <a href="https://example.com/path (1)">link</a>
// → [link](https://example.com/path%20%281%29)

Selector-Based Filtering

Remove unwanted elements like navigation, ads, or sidebars:

const markdown = convert(html, {
  excludeSelectors: [
    "nav",
    "header",
    "footer",
    ".sidebar",
    ".advertisement",
    "#cookie-banner",
  ],
  includeSelectors: [".main-content"],
});

Limitations

Some HTML features cannot be fully represented in Markdown:

Feature Behavior
Table colspan/rowspan Content placed in first cell
Nested tables Inner tables converted inline
Form elements Skipped
iframe/video/audio Skipped (no standard Markdown equivalent)
CSS styling Ignored (except text-align for tables)
Empty elements Removed from output

Edge Cases

supermarkdown handles many edge cases gracefully:

Malformed HTML

Invalid or malformed HTML is parsed via html5ever, which applies browser-like error recovery:

// Missing closing tags, nested issues - all handled
const html = "<p>Unclosed paragraph<div>Mixed<p>nesting</div>";
const markdown = convert(html); // Produces sensible output

Deeply Nested Lists

Nested lists maintain proper indentation:

const html = `
<ul>
  <li>Level 1
    <ul>
      <li>Level 2
        <ul>
          <li>Level 3</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>`;
// Output:
// - Level 1
//   - Level 2
//     - Level 3

Code Blocks with Backticks

When code contains backticks, the fence automatically uses more backticks:

const html = "<pre><code>Use `backticks` for code</code></pre>";
// Output uses 4 backticks as fence:
// ````
// Use `backticks` for code
// ````

Empty Elements

Empty paragraphs, divs, and spans are stripped to avoid blank lines:

const html = "<p></p><p>Real content</p><p>   </p>";
const markdown = convert(html);
// Output: "Real content" (empty paragraphs removed)

Special Characters in URLs

Spaces, parentheses, and other special characters in URLs are percent-encoded:

const html = '<a href="https://example.com/file (1).pdf">Download</a>';
// Output: [Download](https://example.com/file%20%281%29.pdf)

Tables Without Headers

Tables missing <thead> use the first row as header:

const html = `
<table>
  <tr><td>A</td><td>B</td></tr>
  <tr><td>1</td><td>2</td></tr>
</table>`;
// Output:
// | A | B |
// | --- | --- |
// | 1 | 2 |

Mixed Content in Lists

List items with mixed block/inline content are handled:

const html = `
<ul>
  <li>Simple item</li>
  <li>
    <p>Paragraph in list</p>
    <pre><code>code block</code></pre>
  </li>
</ul>`;
// Outputs proper markdown with preserved formatting

Troubleshooting

Empty or Minimal Output

Problem: convert() returns empty string or very little content.

Causes & Solutions:

  1. Content is in excluded elements - Check if your content is inside nav, header, etc. that might match default patterns

    // Try without selectors first
    const markdown = convert(html);
  2. JavaScript-rendered content - supermarkdown converts static HTML only. If the page uses client-side rendering, you need to render it first (e.g., with Puppeteer or Playwright)

  3. Content in iframes - iframe content is not extracted. Fetch iframe src separately if needed

Missing Code Block Language

Problem: Code blocks don't have language annotation.

Solution: supermarkdown looks for language-*, lang-*, or highlight-* class patterns. Ensure your HTML uses standard class naming:

<!-- Detected -->
<pre><code class="language-python">...</code></pre>
<pre><code class="lang-js">...</code></pre>

<!-- Not detected -->
<pre><code class="python-code">...</code></pre>

Tables Not Rendering Correctly

Problem: Tables appear as plain text or are malformed.

Causes & Solutions:

  1. Missing table structure - Ensure proper <table>, <tr>, <td> structure
  2. Nested tables - GFM doesn't support nested tables; inner tables are flattened
  3. colspan/rowspan - These are not supported in GFM; content goes in first cell

Links Missing or Broken

Problem: Links don't appear or have wrong URLs.

Solutions:

  1. Relative URLs - Use baseUrl option to resolve relative links:

    convert(html, { baseUrl: "https://example.com" });
  2. Links in excluded elements - Navigation links are often in <nav> which may be excluded

Performance Issues with Large Documents

Problem: Conversion is slow for very large HTML files.

Solutions:

  1. Use async - convertAsync() won't block the event loop
  2. Pre-filter HTML - Remove obvious non-content before conversion
  3. Stream processing - For very large docs, consider splitting into sections

Special Characters Appearing Wrong

Problem: Characters like <, >, & appear as entities.

Solution: This is usually correct behavior - these characters need escaping in markdown. If you're seeing &amp; where you expect &, the source HTML may have double-encoded entities.

Rust Usage

Add to your Cargo.toml:

[dependencies]
supermarkdown = "0.0.6"
use supermarkdown::{convert, convert_with_options, Options, HeadingStyle};

// Basic conversion
let markdown = convert("<h1>Hello</h1>");

// With options
let options = Options::new()
    .heading_style(HeadingStyle::Setext)
    .exclude_selectors(vec!["nav".to_string()]);

let markdown = convert_with_options("<h1>Hello</h1>", &options);

Performance

supermarkdown is designed for high performance:

  • Single-pass parsing - O(n) HTML traversal
  • Pre-computed metadata - List indices and CSS selectors computed in one pass
  • Zero-copy where possible - Minimal string allocations
  • Native code - No JavaScript runtime overhead

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

# Clone the repository
git clone https://github.com/vakra-dev/supermarkdown.git
cd supermarkdown

# Run tests
cargo test

# Build Node.js bindings
cd crates/supermarkdown-napi
npm install
npm run build

License

MIT License - see LICENSE for details.

Morty Proxy This is a proxified and sanitized view of the page, visit original site.