Regex & Text Processing in Wasm

Why Rust Regex in Wasm?

JavaScript has built-in regex support, so why compile Rust's regex crate to Wasm? The main reasons are match speed, predictable worst-case running time, and Unicode handling:

  Benchmark: Match 10,000 email addresses

  JS RegExp            ████████████████████████  24ms
  Rust regex (Wasm)    ████████  8ms
  Rust regex (native)  ████  4ms

  ← faster                         slower →

Feature	JavaScript RegExp	Rust `regex` crate
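Engine type	Backtracking	Finite automata (NFA simulation and lazy DFA), no backtracking
Catastrophic backtracking	Possible (ReDoS)	Impossible (worst case linear in input length)
Unicode support	Opt-in via the `u` flag (property escapes since ES2018)	On by default (scripts, general categories, case folding)
Named captures	Yes (ES2018)	Yes
Lookahead/lookbehind	Yes (lookbehind since ES2018)	No (unsupported, along with backreferences, to keep matching linear-time)
Compilation speed	Fast	Slower (but cacheable)
Match speed	Good	Often 2-5x faster, depending on pattern and input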

ReDoS Protection

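The Rust regex crate guarantees a worst-case matching time of O(m × n), where m is the size of the compiled pattern and n is the length of the input, so for a fixed pattern the time grows linearly with the input. This makes it safe against Regular Expression Denial of Service (ReDoS), where a crafted input forces a backtracking engine into exponential work: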

  Pattern: (a+)+$     Input: "aaaaaaaaaaaaaaaaX"

  JavaScript RegExp:
    Tries ~2^n ways to split the a's → time doubles with each extra "a"
    This is a ReDoS vulnerability!

  Rust regex crate:
    Linear scan → returns "no match" instantly
    Safe by construction (no backtracking)
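
To see where the difference comes from, here is a toy model of the two strategies using only the standard library. Both functions are hypothetical illustrations, not code from any JavaScript engine or from the regex crate, and the pattern is treated as anchored at the start (^(a+)+$) to keep the sketch short. Each function counts how much work it does on the same input.

```rust
/// Backtracking strategy: try every way of splitting the run of 'a's into
/// `a+` groups. Returns whether the input matches; bumps `steps` once per call.
fn backtrack(input: &[u8], pos: usize, steps: &mut u64) -> bool {
    *steps += 1;
    // Greedily extend one `a+` group as far as possible.
    let mut end = pos;
    while end < input.len() && input[end] == b'a' {
        end += 1;
    }
    // Try the longest group first, then give characters back one at a time.
    let mut group_end = end;
    while group_end > pos {
        if group_end == input.len() {
            return true; // `$` matches here
        }
        if backtrack(input, group_end, steps) {
            return true; // a further `a+` group completed the match
        }
        group_end -= 1;
    }
    false
}

/// Automaton strategy: track a single "inside a run of a's" state and read
/// each byte exactly once. Bumps `steps` once per byte read.
fn linear_scan(input: &[u8], steps: &mut u64) -> bool {
    let mut in_run = false;
    for &byte in input {
        *steps += 1;
        if byte == b'a' {
            in_run = true;
        } else {
            return false; // any other byte kills every live state
        }
    }
    in_run
}

/// Runs both strategies on `n` copies of 'a' followed by 'X' and returns
/// (backtracking steps, linear-scan steps).
fn compare(n: usize) -> (u64, u64) {
    let mut input = vec![b'a'; n];
    input.push(b'X');
    let (mut slow, mut fast) = (0, 0);
    assert!(!backtrack(&input, 0, &mut slow));
    assert!(!linear_scan(&input, &mut fast));
    (slow, fast)
}
```

For the 16 a's in the example above, compare(16) reports 65,536 backtracking calls against 17 bytes read by the linear scan, and every additional "a" doubles the first number while adding one to the second.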

Using the regex Crate in Wasm

use regex::Regex;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct TextProcessor {
    email_re: Regex,
    url_re: Regex,
    phone_re: Regex,
}

#[wasm_bindgen]
impl TextProcessor {
    #[wasm_bindgen(constructor)]
    pub fn new() -> Self {
        TextProcessor {
            email_re: Regex::new(r"[\w.+-]+@[\w-]+\.[\w.]+").unwrap(),
            url_re: Regex::new(r"https?://[\w\-._~:/?#\[\]@!$&'()*+,;=%]+").unwrap(),
            phone_re: Regex::new(r"\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}").unwrap(),
        }
    }

    pub fn find_emails(&self, text: &str) -> Vec<String> {
        self.email_re.find_iter(text)
            .map(|m| m.as_str().to_string())
            .collect()
    }

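    pub fn validate_email(&self, email: &str) -> bool {
        // is_match() is unanchored and would accept "junk a@b.co junk",
        // so require the match to span the whole input
        self.email_re
            .find(email)
            .map_or(false, |m| m.start() == 0 && m.end() == email.len())
    }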

    pub fn redact_phones(&self, text: &str) -> String {
        self.phone_re.replace_all(text, "[REDACTED]").to_string()
    }
}

Compiling regex to Wasm — size considerations

The regex crate typically adds a few hundred kilobytes to your Wasm binary (roughly 200-400 KB, depending on which Unicode tables and performance features are compiled in). To reduce size, disable the default features:

# Cargo.toml
[dependencies]
regex = { version = "1", default-features = false, features = ["std"] }
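# Drops the Unicode tables (saves ~100 KB) and the optional performance features.
# \p{...} is no longer available, and Unicode-aware \w, \d, \s, and \b fail to
# compile: prefix patterns with (?-u) to use their ASCII-only versions.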

Common Regex Patterns for Wasm Apps

  Pattern                              Matches                   Use Case
  ──────────────────────────────────── ───────────────────────── ──────────────────
  [\w.+-]+@[\w-]+\.[\w.]+              user@example.com          Email validation
  https?://[^\s]+                      http://example.com/path   URL extraction
  \d{4}-\d{2}-\d{2}                    2024-01-15                ISO date
  \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}   192.168.1.1               IP address
  #[0-9a-fA-F]{6}                      #ff00aa                   Hex color
  \b[A-Z]{2,}\b                        NASA, FBI                 Acronyms
  (?m)^\s*$                            (empty/whitespace)        Blank lines
  <[^>]+>                              <div class="x">           HTML tags

Text Search and Replace

Efficient search with compiled regex

use lazy_static::lazy_static;
use regex::Regex;

// Compile once, use many times
lazy_static! {
    static ref LOG_PATTERN: Regex = Regex::new(
        r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (ERROR|WARN|INFO|DEBUG): (.+)"
    ).unwrap();
}

fn parse_log(line: &str) -> Option<(String, String, String)> {
    LOG_PATTERN.captures(line).map(|caps| (
        caps[1].to_string(),  // timestamp
        caps[2].to_string(),  // level
        caps[3].to_string(),  // message
    ))
}

Search and replace with captures

use regex::Regex;

fn anonymize_data(text: &str) -> String {
    // Compiled on every call to keep the example short; in real code, hoist
    // these into a struct or a lazily initialized static
    let email_re = Regex::new(r"([\w.+-]+)@([\w-]+\.[\w.]+)").unwrap();
    let phone_re = Regex::new(r"\d{3}[-.]?\d{3}[-.]?\d{4}").unwrap();

    let result = email_re.replace_all(text, "***@$2");  // $2 = domain capture group
    let result = phone_re.replace_all(&result, "***-***-****");
    result.into_owned()
}

CSV Processing in Wasm

CSV parsing is another common workload where Wasm can outperform JavaScript:

  Processing 100,000 CSV rows

  Papa Parse (JS)        ████████████████████████████████  320ms
  Rust csv crate (Wasm)  ████████████  120ms

  ← faster                                    slower →

use csv::ReaderBuilder;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
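pub fn parse_csv(data: &str) -> Result<JsValue, JsError> {
    let mut reader = ReaderBuilder::new()
        .has_headers(true)  // the first row is treated as a header and skipped
        .from_reader(data.as_bytes());

    let mut rows: Vec<Vec<String>> = Vec::new();
    for result in reader.records() {
        // Surface malformed rows as a JS exception instead of silently dropping them
        let record = result.map_err(|e| JsError::new(&e.to_string()))?;
        rows.push(record.iter().map(str::to_string).collect());
    }

    // Convert to a JS array of arrays; report serialization failures to the caller
    serde_wasm_bindgen::to_value(&rows).map_err(|e| JsError::new(&e.to_string()))
}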

Unicode Handling

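Rust's regex crate is Unicode-aware by default and matches on Unicode scalar values rather than on bytes or UTF-16 code units, which matters for international text processing. The table shows how the same characters are counted in each representation: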

  Character          UTF-8 Bytes    Rust chars   JS (UTF-16)
  ────────────────── ────────────── ──────────── ─────────────
  'A'                1 byte         1 char       1 code unit
  'é' (precomposed)  2 bytes        1 char       1 code unit
  Chinese character  3 bytes        1 char       1 code unit
  Emoji              4 bytes        1 char       2 code units!
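
These counts can be checked with the standard library alone. The helper below is a hypothetical name introduced for this example; it returns the three lengths for any string:

```rust
/// Returns (UTF-8 bytes, Rust chars, UTF-16 code units) for a string.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), s.encode_utf16().count())
}
```

For instance, lengths("\u{1F600}") gives (4, 1, 2) for a single emoji. Note that an accented letter written as a base letter plus a combining mark, such as "e\u{301}", gives (3, 2, 2), so the "1 char" column assumes precomposed characters.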

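use regex::Regex;

fn main() {
    // \w and \b are Unicode-aware by default
    let words = Regex::new(r"\b\w+\b").unwrap();
    assert_eq!(words.find_iter("naïve café").count(), 2);

    // Unicode script and property classes
    let han = Regex::new(r"\p{Han}+").unwrap();            // Han script (Chinese characters)
    let cyrillic = Regex::new(r"\p{Cyrillic}+").unwrap();  // Cyrillic script (Russian and others)
    let emoji = Regex::new(r"\p{Emoji}+").unwrap();        // Emoji property; also matches 0-9, '#', '*'
    assert!(han.is_match("漢字") && cyrillic.is_match("Привет") && emoji.is_match("🙂"));

    // Case-insensitive matching uses simple case folding, so "ß" and "ss"
    // are not equivalent and both spellings must be listed
    let street = Regex::new(r"(?i)strasse|straße").unwrap();
    assert!(street.is_match("Hauptstrasse") && street.is_match("HAUPTSTRASSE"));
}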

Performance Optimization Tips

  Optimization                         Impact
  ──────────────────────────────────── ──────────────────────────
  Compile regex once, reuse            10-100x faster
  Use is_match() instead of find()     Faster (stops at first match)
  Anchor patterns (^...$)              Avoids scanning entire text
  Use RegexSet for multiple patterns   1 scan instead of N scans
  Disable Unicode if ASCII-only        Smaller binary, faster
  Use bytes::Regex for raw bytes       Avoids UTF-8 validation

RegexSet: Match multiple patterns in one pass

use regex::RegexSet;

fn main() {
    let set = RegexSet::new([
        r"[\w.+-]+@[\w-]+\.[\w.]+",    // email
        r"https?://[^\s]+",             // URL
        r"\d{3}[-.]?\d{3}[-.]?\d{4}",  // phone
        r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",  // IP
    ]).unwrap();

    let text = "Contact john@example.com or visit https://example.com";
    // The result lists the indices of the patterns that matched
    let matches: Vec<usize> = set.matches(text).into_iter().collect();
    assert_eq!(matches, [0, 1]);  // found email and URL patterns
}

Summary

Rust's regex crate compiled to Wasm can give you 2-5x faster text processing than JavaScript's RegExp, with guaranteed linear-time matching (no ReDoS). The trade-offs are the lack of lookaround and backreferences and roughly 200-400 KB of extra binary size, which can be reduced by disabling Unicode support. Use it for email validation, log parsing, search-and-replace, and other text-heavy workloads, pair it with the csv crate for CSV processing, and rely on its default Unicode support for international text.
