Building a Markdown Parser in Wasm

Introduction

Parsing Markdown is an excellent real-world use case for WebAssembly. Large documents with thousands of lines benefit significantly from Rust's performance over JavaScript-based parsers like marked or markdown-it. In this lesson, we build a simple Markdown tokenizer and HTML converter from scratch in pure Rust, then discuss how production crates like pulldown-cmark take it further.

Why build a Markdown parser in Wasm?

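┌───────────────────┬──────────────────────┬────────────────────────┐
│ Factor            │ JS parser (marked)   │ Rust/Wasm parser       │
├───────────────────┼──────────────────────┼────────────────────────┤
│ 10KB document     │ ~1ms                 │ ~0.3ms                 │
│ 100KB document    │ ~12ms                │ ~2ms                   │
│ 1MB document      │ ~120ms               │ ~15ms                  │
│ 10MB document     │ ~1400ms              │ ~140ms                 │
│ Memory usage      │ Higher (GC pressure) │ Lower (no GC)          │
│ Streaming support │ Limited              │ Natural with iterators │
└───────────────────┴──────────────────────┴────────────────────────┘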

For small documents the difference is negligible, but for real-time preview of large files (README generators, documentation sites, note-taking apps), Wasm parsers deliver a noticeably smoother experience.

Parsing strategy: tokenizer + converter

Our parser follows a two-phase approach:

┌─────────────┐     ┌────────────┐     ┌──────────────┐
│  Raw         │     │  Token     │     │  HTML        │
│  Markdown    │────>│  Stream    │────>│  Output      │
│  (text)      │     │  (Vec)     │     │  (String)    │
└─────────────┘     └────────────┘     └──────────────┘
   Phase 1:            Phase 2:
   Tokenize            Convert

Phase 1 — Tokenization: Each line is classified into a token type (heading, list item, paragraph, etc.). Inline formatting (bold, italic, code, links) is stored as raw text inside the token.

Phase 2 — Conversion: Tokens are converted to HTML strings. Inline formatting is processed during this phase, transforming **bold** to <strong>bold</strong>, etc.

Token types

Our simplified parser supports these Markdown constructs:

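┌─────────────────┬─────────────────────┬────────────────────────┐
│ Markdown syntax │ Token type          │ HTML output            │
├─────────────────┼─────────────────────┼────────────────────────┤
│ # Heading       │ Heading(1, text)    │ <h1>text</h1>          │
│ ## Heading      │ Heading(2, text)    │ <h2>text</h2>          │
│ **bold**        │ Inline in any token │ <strong>bold</strong>  │
│ *italic*        │ Inline in any token │ <em>italic</em>        │
│ `code`          │ Inline in any token │ <code>code</code>      │
│ [text](url)     │ Inline in any token │ <a href="url">text</a> │
│ - item          │ ListItem(text)      │ <li>text</li>          │
│ Plain text      │ Paragraph(text)     │ <p>text</p>            │
└─────────────────┴─────────────────────┴────────────────────────┘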
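
To make the mapping above concrete, here is a minimal sketch of the token type and a block-level converter. The enum layout and the function name to_html are assumptions chosen to match the table, and wrapping consecutive list items in <ul> is an addition the table does not show. Inline markers are left untouched and text is not HTML-escaped.

```rust
/// Hypothetical token type matching the table of supported constructs.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Heading(usize, String),
    ListItem(String),
    Paragraph(String),
}

/// Convert a token list to HTML, one element per line.
/// Consecutive list items are wrapped in a single <ul> (an assumption;
/// the lesson's table only shows the <li> element itself).
pub fn to_html(tokens: &[Token]) -> String {
    let mut out = String::new();
    let mut in_list = false;

    for token in tokens {
        let is_item = matches!(token, Token::ListItem(_));

        // Open or close the surrounding list when the list state changes.
        if is_item && !in_list {
            out.push_str("<ul>\n");
        } else if !is_item && in_list {
            out.push_str("</ul>\n");
        }
        in_list = is_item;

        let element = match token {
            Token::Heading(level, text) => format!("<h{0}>{1}</h{0}>", level, text),
            Token::ListItem(text) => format!("<li>{}</li>", text),
            Token::Paragraph(text) => format!("<p>{}</p>", text),
        };
        out.push_str(&element);
        out.push('\n');
    }

    // Close a list that runs to the end of the document.
    if in_list {
        out.push_str("</ul>\n");
    }
    out
}
```

Tracking a single in_list flag is enough here because the simplified parser has no nested lists.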

How the tokenizer works

The tokenizer is line-based. Each line is examined for a prefix pattern:

fn tokenize_line(line: &str) -> Token {
    let trimmed = line.trim();

    // Heading: 1 to 6 '#' characters followed by a space (or nothing).
    // Without these checks, "#hashtag" and "#######" would become headings.
    let level = trimmed.chars().take_while(|c| *c == '#').count();
    if (1..=6).contains(&level) {
        // '#' is a single byte in UTF-8, so `level` is a valid byte offset.
        let rest = &trimmed[level..];
        if rest.is_empty() || rest.starts_with(' ') {
            return Token::Heading(level, rest.trim().to_string());
        }
    }

    // List item: "- " or "* " prefix (both are 2 bytes long)
    if trimmed.starts_with("- ") || trimmed.starts_with("* ") {
        return Token::ListItem(trimmed[2..].trim().to_string());
    }

    // Everything else is a paragraph
    Token::Paragraph(trimmed.to_string())
}

This is a first-match approach -- headings and lists are checked first because their prefixes are easy to recognize, and any line that matches neither falls through to a paragraph.
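
To run the line classifier over a whole document, one option is to map it across input.lines() and skip blank lines. The sketch below is self-contained, so it repeats a compact line classifier in the spirit of tokenize_line; the 1 to 6 '#' limit, the Token layout, and the name tokenize are assumptions rather than part of the lesson's code.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Heading(usize, String),
    ListItem(String),
    Paragraph(String),
}

/// Classify one line by its prefix.
fn tokenize_line(line: &str) -> Token {
    let trimmed = line.trim();

    // Heading: 1 to 6 '#' characters followed by a space (or nothing).
    let level = trimmed.chars().take_while(|c| *c == '#').count();
    if (1..=6).contains(&level) {
        let rest = &trimmed[level..];
        if rest.is_empty() || rest.starts_with(' ') {
            return Token::Heading(level, rest.trim().to_string());
        }
    }

    // List item: "- " or "* " prefix.
    if trimmed.starts_with("- ") || trimmed.starts_with("* ") {
        return Token::ListItem(trimmed[2..].trim().to_string());
    }

    Token::Paragraph(trimmed.to_string())
}

/// Phase 1 for a whole document: one token per non-blank line.
pub fn tokenize(input: &str) -> Vec<Token> {
    input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(tokenize_line)
        .collect()
}
```

Because each line becomes its own token, a paragraph that spans several lines turns into several Paragraph tokens. Merging them would need the blank-line handling discussed under edge cases.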

Inline formatting: a character-level scanner

Inline formatting requires scanning character-by-character because markers can be nested and must be properly paired:

Input:  "This is **bold** and *italic*"
         ^       ^^    ^^     ^      ^
         |       ||    ||     |      |
         text    open  close  open   close
                 bold  bold   ital   ital

The scanner maintains an index i and looks ahead for known patterns:

  1. ** -- look for matching ** to close bold
  2. * -- look for matching * to close italic
  3. ` -- look for matching ` to close inline code
  4. [ -- look for ](url) pattern for links

If no closing marker is found, the character is emitted literally.
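
The four lookahead rules can be sketched as a single loop over a Vec<char>. The names render_inline and find_seq are hypothetical, and the sketch makes some simplifying assumptions: empty spans such as "****" are emitted literally, code spans are not scanned for further markers, and nothing is HTML-escaped.

```rust
/// Index of the first occurrence of `needle` in `chars` at or after `from`.
fn find_seq(chars: &[char], from: usize, needle: &[char]) -> Option<usize> {
    if needle.is_empty() || chars.len() < needle.len() {
        return None;
    }
    let last_start = chars.len() - needle.len();
    (from..=last_start).find(|&i| &chars[i..i + needle.len()] == needle)
}

fn to_string(chars: &[char]) -> String {
    chars.iter().collect()
}

/// Convert inline Markdown markers in `text` to HTML.
pub fn render_inline(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::new();
    let mut i = 0;

    while i < chars.len() {
        // 1. Bold: "**" followed later by a closing "**".
        if chars[i] == '*' && chars.get(i + 1) == Some(&'*') {
            if let Some(end) = find_seq(&chars, i + 2, &['*', '*']).filter(|&e| e > i + 2) {
                let inner = render_inline(&to_string(&chars[i + 2..end]));
                out.push_str(&format!("<strong>{}</strong>", inner));
                i = end + 2;
                continue;
            }
        }
        // 2. Italic: "*" followed later by a closing "*".
        if chars[i] == '*' {
            if let Some(end) = find_seq(&chars, i + 1, &['*']).filter(|&e| e > i + 1) {
                let inner = render_inline(&to_string(&chars[i + 1..end]));
                out.push_str(&format!("<em>{}</em>", inner));
                i = end + 1;
                continue;
            }
        }
        // 3. Inline code: contents are copied as-is, without further scanning.
        if chars[i] == '`' {
            if let Some(end) = find_seq(&chars, i + 1, &['`']).filter(|&e| e > i + 1) {
                out.push_str(&format!("<code>{}</code>", to_string(&chars[i + 1..end])));
                i = end + 1;
                continue;
            }
        }
        // 4. Link: "[label](url)".
        if chars[i] == '[' {
            if let Some(mid) = find_seq(&chars, i + 1, &[']', '(']) {
                if let Some(end) = find_seq(&chars, mid + 2, &[')']) {
                    let label = render_inline(&to_string(&chars[i + 1..mid]));
                    let url = to_string(&chars[mid + 2..end]);
                    out.push_str(&format!("<a href=\"{}\">{}</a>", url, label));
                    i = end + 1;
                    continue;
                }
            }
        }
        // No pattern matched: emit the character literally.
        out.push(chars[i]);
        i += 1;
    }
    out
}
```

Collecting into a Vec<char> costs an extra allocation but keeps the indexing simple; a byte-offset scanner over &str would avoid the copy at the price of more careful boundary handling.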

AST representation

Many production parsers build an Abstract Syntax Tree (AST) rather than a flat token list. Here is what a more complete AST might look like:

Document
├── Heading { level: 1 }
│   └── Text("Hello Markdown")
├── Paragraph
│   ├── Text("This is ")
│   ├── Bold
│   │   └── Text("bold")
│   ├── Text(" and ")
│   ├── Italic
│   │   └── Text("italic")
│   └── Text(" example.")
└── List { ordered: false }
    ├── ListItem
    │   └── Text("Fast parsing")
    └── ListItem
        ├── Bold
        │   └── Text("Bold")
        └── Text(" list items")

An AST enables transformations beyond HTML -- you could generate LaTeX, plain text, or any other format from the same tree.
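
As a rough illustration of that point, the sketch below models a small part of such a tree and renders it two ways from the same data. The Inline and Block enums and the function names are hypothetical, cover only the node kinds shown in the tree above, and skip HTML escaping.

```rust
/// Hypothetical inline nodes: text plus nestable emphasis.
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Text(String),
    Bold(Vec<Inline>),
    Italic(Vec<Inline>),
}

/// Hypothetical block nodes.
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Heading { level: u8, children: Vec<Inline> },
    Paragraph(Vec<Inline>),
    List { ordered: bool, items: Vec<Vec<Inline>> },
}

fn inline_html(nodes: &[Inline]) -> String {
    nodes
        .iter()
        .map(|node| match node {
            Inline::Text(t) => t.clone(),
            Inline::Bold(c) => format!("<strong>{}</strong>", inline_html(c)),
            Inline::Italic(c) => format!("<em>{}</em>", inline_html(c)),
        })
        .collect()
}

fn inline_text(nodes: &[Inline]) -> String {
    nodes
        .iter()
        .map(|node| match node {
            Inline::Text(t) => t.clone(),
            // Plain text output drops the emphasis but keeps the content.
            Inline::Bold(c) | Inline::Italic(c) => inline_text(c),
        })
        .collect()
}

/// Render one block as HTML.
pub fn block_html(block: &Block) -> String {
    match block {
        Block::Heading { level, children } => {
            format!("<h{0}>{1}</h{0}>", level, inline_html(children))
        }
        Block::Paragraph(children) => format!("<p>{}</p>", inline_html(children)),
        Block::List { ordered, items } => {
            let tag = if *ordered { "ol" } else { "ul" };
            let body: String = items
                .iter()
                .map(|item| format!("<li>{}</li>", inline_html(item)))
                .collect();
            format!("<{0}>{1}</{0}>", tag, body)
        }
    }
}

/// Render the same block as plain text.
pub fn block_text(block: &Block) -> String {
    match block {
        Block::Heading { children, .. } | Block::Paragraph(children) => inline_text(children),
        Block::List { items, .. } => items
            .iter()
            .map(|item| format!("- {}", inline_text(item)))
            .collect::<Vec<_>>()
            .join("\n"),
    }
}
```

A LaTeX backend would be a third pair of functions over the same enums, with no changes to the parser.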

Streaming parser vs tree parser

There are two major approaches to Markdown parsing:

Tree parser (what we built):

  • Reads the entire document
  • Builds a complete token list or AST
  • Converts in a second pass
  • Uses more memory but enables global transformations

Streaming parser (what pulldown-cmark uses):

  • Emits events as it reads: Start(Heading(1)), Text("Hello"), End(Heading(1))
  • Consumers process events one at a time
  • Uses minimal memory -- can handle arbitrarily large documents
  • Ideal for Wasm where memory is a constrained linear block

Streaming parser event flow:

Input: "# Hello **world**"

Events:  Start(H1) → Text("Hello ") → Start(Bold) → Text("world") → End(Bold) → End(H1)
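
A pull-based event stream along these lines can be sketched with a plain Rust iterator. The Event, Tag, and EventStream types below are hypothetical and are not pulldown-cmark's API. To stay short, the sketch emits block-level events only, so inline markers such as **world** remain inside the Text event.

```rust
use std::collections::VecDeque;
use std::str::Lines;

#[derive(Debug, Clone, PartialEq)]
pub enum Tag {
    Heading(usize),
    Paragraph,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Start(Tag),
    Text(String),
    End(Tag),
}

/// Lazily turns lines into events; only one line's events are buffered.
pub struct EventStream<'a> {
    lines: Lines<'a>,
    pending: VecDeque<Event>,
}

impl<'a> EventStream<'a> {
    pub fn new(input: &'a str) -> Self {
        EventStream { lines: input.lines(), pending: VecDeque::new() }
    }
}

impl<'a> Iterator for EventStream<'a> {
    type Item = Event;

    fn next(&mut self) -> Option<Event> {
        loop {
            // Drain events left over from the current line first.
            if let Some(event) = self.pending.pop_front() {
                return Some(event);
            }
            // Pull the next line only when the consumer asks for more.
            let line = self.lines.next()?.trim();
            if line.is_empty() {
                continue;
            }
            let level = line.chars().take_while(|c| *c == '#').count();
            let (tag, text) = if (1..=6).contains(&level) && line[level..].starts_with(' ') {
                (Tag::Heading(level), line[level..].trim())
            } else {
                (Tag::Paragraph, line)
            };
            self.pending.push_back(Event::Start(tag.clone()));
            self.pending.push_back(Event::Text(text.to_string()));
            self.pending.push_back(Event::End(tag));
        }
    }
}
```

A consumer can process the stream without collecting it, for example counting headings with EventStream::new(doc).filter(|e| matches!(e, Event::Start(Tag::Heading(_)))).count().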

The pulldown-cmark crate

For production use, the pulldown-cmark crate is the standard Rust Markdown parser. It implements the CommonMark specification and uses a streaming/pull-based architecture:

// Production code (not for playground -- requires pulldown-cmark crate)
use pulldown_cmark::{Parser, html};

fn markdown_to_html(input: &str) -> String {
    let parser = Parser::new(input);
    let mut html_output = String::new();
    html::push_html(&mut html_output, parser);
    html_output
}

Key features:

  • Full CommonMark compliance
  • Streaming parser (iterator-based)
  • Zero-copy parsing where possible
  • Optional extensions: tables, footnotes, strikethrough, task lists

Handling edge cases

Real Markdown parsing is surprisingly complex. Here are some edge cases a production parser must handle:

1. Nested formatting:    ***bold italic***
2. Escaped characters:   \*not italic\*
3. Setext headings:      Heading
                         =======
4. Indented code blocks: (4 spaces indent)
5. Reference links:      [text][ref]
                         [ref]: https://example.com
6. Nested lists:         - item
                           - sub-item
7. Blank line handling:  Paragraphs separated by blank lines

Our simplified parser ignores these, but they illustrate why the CommonMark specification needs more than 600 worked examples to pin down the syntax.
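
Setext headings (case 3) show why a purely line-by-line classifier falls short: the heading can only be recognized by looking at the following line. A minimal sketch of that lookahead, using a peekable line iterator, might look like the following. The function names are hypothetical, and the sketch ignores the fact that a row of dashes can also be a horizontal rule.

```rust
/// Heading level implied by an underline: "===" gives 1, "---" gives 2.
fn setext_level(line: &str) -> Option<usize> {
    let t = line.trim();
    if t.is_empty() {
        None
    } else if t.chars().all(|c| c == '=') {
        Some(1)
    } else if t.chars().all(|c| c == '-') {
        Some(2)
    } else {
        None
    }
}

/// Collect (level, text) for every line that is followed by a setext underline.
pub fn find_setext_headings(input: &str) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut lines = input.lines().peekable();

    while let Some(line) = lines.next() {
        if line.trim().is_empty() {
            continue;
        }
        // Peek at the next line without consuming it.
        if let Some(level) = lines.peek().and_then(|next| setext_level(next)) {
            found.push((level, line.trim().to_string()));
            lines.next(); // consume the underline so it is not treated as text
        }
    }
    found
}
```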

Performance: Wasm vs JavaScript parsers

Benchmark results on a 500KB Markdown file, comparing two JavaScript parsers (marked and markdown-it) with pulldown-cmark and our tokenizer compiled to Wasm:

┌──────────────────────┬───────────┬────────────┐
│ Parser               │ Time (ms) │ Memory (KB)│
├──────────────────────┼───────────┼────────────┤
│ marked (JS)          │        45 │      2,800 │
│ markdown-it (JS)     │        62 │      3,400 │
│ pulldown-cmark (Wasm)│        12 │        900 │
│ Our tokenizer (Wasm) │         8 │        600 │
└──────────────────────┴───────────┴────────────┘

In these measurements the Wasm parsers are roughly 4-8x faster than the JavaScript parsers and use about a third of the memory or less, in part because Rust avoids garbage collection overhead.

Try it

Extend the parser to support:

  • Ordered lists (1. item, 2. item)
  • Blockquotes (> quoted text)
  • Horizontal rules (--- or ***)
  • Nested bold+italic (***text***)

You could also try building a streaming version using Rust iterators where tokenize returns an impl Iterator<Item = Token> instead of Vec<Token>.