Intermediate data-structures

Regex & Text Processing in Wasm

Why Rust Regex in Wasm?

JavaScript has built-in regex support, so why compile Rust's regex crate to Wasm? Performance, correctness, and features:

  Benchmark: Match 10,000 email addresses

  JS RegExp          ████████████████████████  24ms
  Rust regex (Wasm)  ████████  8ms
  Rust regex (native)████  4ms

  ← faster                         slower →

  Feature                    JavaScript RegExp    Rust regex crate
  ────────────────────────── ──────────────────── ────────────────────────────────
  Engine type                Backtracking         Finite automata (NFA/DFA hybrid)
  Catastrophic backtracking  Possible (ReDoS)     Impossible (linear time)
  Unicode support            Partial (ES2018+)    Full (Unicode categories)
  Named captures             Yes (ES2018)         Yes
  Lookahead/lookbehind       Yes                  No (guarantees O(n))
  Compilation speed          Fast                 Slower (but cacheable)
  Match speed                Good                 Excellent (2-5x faster)

ReDoS Protection

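The Rust regex crate guarantees worst-case matching time that is linear in the input length for any given pattern (O(m × n) for a pattern of size m and an input of size n). It achieves this by running finite automata instead of backtracking, which makes it safe against Regular Expression Denial of Service (ReDoS):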

  Pattern: (a+)+$     Input: "aaaaaaaaaaaaaaaaX"

  JavaScript RegExp:
    Tries ~2^n ways to split the a's → time doubles with each extra 'a'
    This is a ReDoS vulnerability!

  Rust regex crate:
    Linear scan → returns "no match" instantly
    Safe by construction (no backtracking)
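
To see where the 2^n comes from, here is a toy, std-only sketch. It is not how either engine is implemented. `backtrack_steps` is a hypothetical hand-written backtracking matcher for the fixed pattern (a+)+$ anchored at the start of the input, and `linear_steps` is a hypothetical single pass that answers the same question the way an automaton would, looking at each byte once.

```rust
/// Toy backtracking matcher for the fixed pattern (a+)+$, started at `pos`.
/// Counts every attempt in `steps`. Illustration only, not a real engine.
fn backtrack(input: &[u8], pos: usize, steps: &mut u64) -> bool {
    *steps += 1;
    // Number of consecutive 'a' bytes available at `pos`.
    let run = input[pos..].iter().take_while(|&&b| b == b'a').count();
    // Try every length for this (a+) group, longest first (greedy).
    for len in (1..=run).rev() {
        let next = pos + len;
        if next == input.len() {
            return true; // `$` matches at the end of the input
        }
        // Otherwise try to match another (a+) group starting at `next`.
        if backtrack(input, next, steps) {
            return true;
        }
    }
    false
}

/// Returns (matched, attempts) for the backtracking strategy.
fn backtrack_steps(input: &str) -> (bool, u64) {
    let mut steps = 0;
    let matched = backtrack(input.as_bytes(), 0, &mut steps);
    (matched, steps)
}

/// Single pass for the same anchored question: is the input one or more 'a'?
/// Returns (matched, bytes examined).
fn linear_steps(input: &str) -> (bool, u64) {
    let mut steps = 0;
    let mut all_a = !input.is_empty();
    for &b in input.as_bytes() {
        steps += 1;
        if b != b'a' {
            all_a = false;
        }
    }
    (all_a, steps)
}
```

For sixteen a's followed by X, the backtracking version makes 65,536 attempts while the single pass examines 17 bytes, and each extra 'a' doubles the first number.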

Using the regex Crate in Wasm

use regex::Regex;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct TextProcessor {
    email_re: Regex,
    url_re: Regex,
    phone_re: Regex,
}

#[wasm_bindgen]
impl TextProcessor {
    #[wasm_bindgen(constructor)]
    pub fn new() -> Self {
        TextProcessor {
            email_re: Regex::new(r"[\w.+-]+@[\w-]+\.[\w.]+").unwrap(),
            url_re: Regex::new(r"https?://[\w\-._~:/?#\[\]@!$&'()*+,;=%]+").unwrap(),
            phone_re: Regex::new(r"\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}").unwrap(),
        }
    }

    pub fn find_emails(&self, text: &str) -> Vec<String> {
        self.email_re.find_iter(text)
            .map(|m| m.as_str().to_string())
            .collect()
    }

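    pub fn validate_email(&self, email: &str) -> bool {
        // email_re is unanchored: require the match to cover the whole input,
        // otherwise "junk a@b.co junk" would pass validation
        self.email_re
            .find(email)
            .map_or(false, |m| m.start() == 0 && m.end() == email.len())
    }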

    pub fn redact_phones(&self, text: &str) -> String {
        self.phone_re.replace_all(text, "[REDACTED]").to_string()
    }
}

Compiling regex to Wasm — size considerations

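The regex crate adds approximately 200-400 KB to your Wasm binary (depending on Unicode table inclusion). To reduce size, disable the default features:

# Cargo.toml
[dependencies]
regex = { version = "1", default-features = false, features = ["std"] }
# Drops Unicode tables (and the optional perf features): saves ~100 KB, but
# \p{...} is unavailable, and Unicode-aware \w, \d, \s and \b fail to compile.
# Switch those to ASCII mode with (?-u), e.g. (?-u)\w+, or re-enable the
# specific unicode-* features you need.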

Common Regex Patterns for Wasm Apps

  Pattern                                Matches                   Use Case
  ────────────────────────────────────── ───────────────────────── ──────────────────
  [\w.+-]+@[\w-]+\.[\w.]+                user@example.com          Email validation
  https?://[^\s]+                        http://example.com/path   URL extraction
  \d{4}-\d{2}-\d{2}                      2024-01-15                ISO date
  \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}     192.168.1.1               IP address
  #[0-9a-fA-F]{6}                        #ff00aa                   Hex color
  \b[A-Z]{2,}\b                          NASA, FBI                 Acronyms
  ^\s*$                                  (empty/whitespace)        Blank lines
  <[^>]+>                                <div class="x">           HTML tags
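
These patterns check shape, not meaning: the IP address pattern, for instance, also accepts 999.999.999.999. One way to close that gap, sketched here with only the standard library, is to let the regex find candidates and then validate each candidate by parsing it. `is_valid_ipv4` is a hypothetical helper name, not part of any crate.

```rust
use std::net::Ipv4Addr;

/// Returns true only if `candidate` parses as a real IPv4 address
/// (four decimal octets, each in the range 0..=255).
fn is_valid_ipv4(candidate: &str) -> bool {
    candidate.parse::<Ipv4Addr>().is_ok()
}
```

The same two-step approach (cheap pattern first, strict parser second) applies to dates and URLs as well.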

Text Search and Replace

Efficient search with compiled regex

use lazy_static::lazy_static;
use regex::Regex;

// Compile once, use many times
lazy_static! {
    static ref LOG_PATTERN: Regex = Regex::new(
        r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (ERROR|WARN|INFO|DEBUG): (.+)"
    ).unwrap();
}

fn parse_log(line: &str) -> Option<(String, String, String)> {
    LOG_PATTERN.captures(line).map(|caps| (
        caps[1].to_string(),  // timestamp
        caps[2].to_string(),  // level
        caps[3].to_string(),  // message
    ))
}

Search and replace with captures

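use lazy_static::lazy_static;
use regex::Regex;

// Compile once; recompiling on every call would dominate the runtime
lazy_static! {
    static ref EMAIL_RE: Regex = Regex::new(r"([\w.+-]+)@([\w-]+\.[\w.]+)").unwrap();
    static ref PHONE_RE: Regex = Regex::new(r"\d{3}[-.]?\d{3}[-.]?\d{4}").unwrap();
}

fn anonymize_data(text: &str) -> String {
    // ${2} is the second capture group (the domain); braces avoid ambiguity
    let result = EMAIL_RE.replace_all(text, "***@${2}");  // Keep domain
    let result = PHONE_RE.replace_all(&result, "***-***-****");
    result.into_owned()
}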

CSV Processing in Wasm

CSV parsing is another text-heavy task where Wasm can outperform JavaScript significantly. This part of the lesson uses Rust's csv crate rather than regex:

  Processing 100,000 CSV rows

  Papa Parse (JS)     ████████████████████████████████  320ms
  Rust csv crate      ████████████  120ms
  (Wasm)

  ← faster                                    slower →

use csv::ReaderBuilder;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
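pub fn parse_csv(data: &str) -> JsValue {
    let mut reader = ReaderBuilder::new()
        .has_headers(true) // first row is a header, so it is not returned as data
        .from_reader(data.as_bytes());

    let mut rows: Vec<Vec<String>> = Vec::new();
    for result in reader.records() {
        // Malformed rows are skipped silently; report them if your data needs it
        if let Ok(record) = result {
            rows.push(record.iter().map(|s| s.to_string()).collect());
        }
    }

    // Converts to a JS array of arrays; panics if serialization fails
    serde_wasm_bindgen::to_value(&rows).unwrap()
}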
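
A dedicated parser matters because CSV is not just "split on commas". The std-only sketch below shows the failure mode a naive split runs into with quoted fields. Both function names and the sample line are made up for illustration, and the quote-aware version is deliberately minimal, not a substitute for the csv crate.

```rust
/// Naive CSV field splitting: breaks on every comma, ignoring quotes.
fn naive_split(line: &str) -> Vec<&str> {
    line.split(',').collect()
}

/// Minimal quote-aware splitter (illustration only: no escaped quotes and
/// no multi-line fields, both of which a real CSV parser must handle).
fn quote_aware_split(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for ch in line.chars() {
        match ch {
            // Toggle the quoted state and drop the quote character itself.
            '"' => in_quotes = !in_quotes,
            // A comma outside quotes ends the current field.
            ',' if !in_quotes => fields.push(std::mem::take(&mut current)),
            // Everything else, including commas inside quotes, is field data.
            _ => current.push(ch),
        }
    }
    fields.push(current);
    fields
}
```

On the line `1,"Doe, Jane",jane@example.com` the naive split yields four fields, while the quote-aware version yields the intended three.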

Unicode Handling

Rust's regex crate has excellent Unicode support, which matters for international text processing:

  Character             UTF-8 Bytes    Rust char    JS (UTF-16)
  ───────────────────── ────────────── ──────────── ──────────────
  'A'                   1 byte         1 char       1 code unit
  'é' (e with accent)   2 bytes        1 char       1 code unit
  Chinese character     3 bytes        1 char       1 code unit
  Emoji                 4 bytes        1 char       2 code units!

use regex::Regex;

// Unicode-aware word boundary
let re = Regex::new(r"\b\w+\b").unwrap();  // \w and \b are Unicode-aware by default

// Unicode script and property matching
let re = Regex::new(r"\p{Han}+").unwrap();       // Han script (Chinese characters)
let re = Regex::new(r"\p{Cyrillic}+").unwrap();  // Cyrillic script (Russian and others)
let re = Regex::new(r"\p{Emoji}+").unwrap();     // Emoji property (also covers 0-9, # and *)

// Case-insensitive with Unicode: ß does not fold to "ss", so list both spellings
let re = Regex::new(r"(?i)strasse|straße").unwrap();  // Matches STRASSE, Straße, ...
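
The size differences in the table above matter in practice because the regex crate reports match positions as byte offsets into UTF-8 text, while JavaScript string indices count UTF-16 code units. A minimal std-only sketch for checking the sizes and converting an offset before handing it to JavaScript follows. `text_sizes` and `byte_to_utf16_offset` are hypothetical helper names.

```rust
/// Returns (UTF-8 bytes, Rust chars, UTF-16 code units) for a string.
fn text_sizes(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), s.encode_utf16().count())
}

/// Converts a UTF-8 byte offset into the equivalent UTF-16 code unit index.
/// `byte_offset` must fall on a char boundary (regex match offsets always do);
/// slicing at any other position panics.
fn byte_to_utf16_offset(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}
```

For text that stays ASCII-only the two kinds of offset coincide, so the conversion is needed only when non-ASCII input is possible.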

Performance Optimization Tips

  Optimization                              Impact
  ───────────────────────────────────────── ─────────────────────────────────
  Compile regex once, reuse                 10-100x faster than recompiling
  Use is_match() for yes/no checks          Faster (no offsets or captures)
  Anchor patterns (^...$)                   Avoids scanning entire text
  Use RegexSet for multiple patterns        1 scan instead of N scans
  Disable Unicode if ASCII-only             Smaller binary, faster
  Use regex::bytes::Regex for raw bytes     Avoids UTF-8 validation
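
The first tip, compile once and reuse, can also be expressed with the standard library's `OnceLock` instead of `lazy_static`. The sketch below is std-only so that the effect is easy to observe: `CompiledPattern` and `compile` are hypothetical stand-ins for `regex::Regex` and `Regex::new`, and `COMPILE_COUNT` exists only to show how often the expensive step runs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a compiled pattern such as regex::Regex.
struct CompiledPattern {
    source: String,
}

/// Counts how many times the expensive "compile" step actually runs.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for Regex::new: records the call and builds the pattern.
fn compile(source: &str) -> CompiledPattern {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    CompiledPattern { source: source.to_string() }
}

/// Returns the shared pattern, compiling it on first use only.
fn date_pattern() -> &'static CompiledPattern {
    static PATTERN: OnceLock<CompiledPattern> = OnceLock::new();
    PATTERN.get_or_init(|| compile(r"\d{4}-\d{2}-\d{2}"))
}
```

However many times `date_pattern()` is called, `compile` runs once. With the real crate the same shape would hold a `Regex` inside the `OnceLock`.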

RegexSet: Match multiple patterns in one pass

use regex::RegexSet;

let set = RegexSet::new(&[
    r"[\w.+-]+@[\w-]+\.[\w.]+",    // email
    r"https?://[^\s]+",             // URL
    r"\d{3}[-.]?\d{3}[-.]?\d{4}",  // phone
    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",  // IP
]).unwrap();

let text = "Contact john@example.com or visit https://example.com";
let matches: Vec<usize> = set.matches(text).into_iter().collect();
// matches = [0, 1]  → found email and URL patterns

Summary

Rust's regex crate compiled to Wasm can give you 2-5x faster text processing than JavaScript, with guaranteed linear-time matching (no ReDoS) in exchange for giving up lookahead and lookbehind. Use it for email validation, log parsing, and search-and-replace, and pair it with the csv crate for CSV processing. The crate adds ~200-400 KB to your Wasm binary, a cost that text-heavy workloads usually repay in performance and safety. Its Unicode support is on by default, so international text is handled without extra configuration.
