Regex & Text Processing in Wasm
Why Rust Regex in Wasm?
JavaScript has built-in regex support, so why compile Rust's regex crate to Wasm? Performance, correctness, and features:
Benchmark: Match 10,000 email addresses
JS RegExp           ████████████████████████ 24ms
Rust regex (Wasm)   ████████ 8ms
Rust regex (native) ████ 4ms
← faster    slower →

| Feature | JavaScript RegExp | Rust regex crate |
|---|---|---|
| Engine type | Backtracking | Finite automata (NFA/lazy DFA hybrid) |
| Catastrophic backtracking | Possible (ReDoS) | Impossible (linear time) |
| Unicode support | Partial (ES2018+) | Full (Unicode categories) |
| Named captures | Yes (ES2018) | Yes |
| Lookahead/lookbehind | Yes | No (omitted to keep matching linear-time) |
| Compilation speed | Fast | Slower (but cacheable) |
| Match speed | Good | Excellent (2-5x faster) |
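
The "Catastrophic backtracking" row can be made concrete with a small step-counting sketch. It is a simplified model, not how either engine is actually implemented: `backtrack`, `linear`, and `demo` are hypothetical names, both functions are hard-coded for the single pattern `(a+)+$` anchored at the start of the input, and only the standard library is used.

```rust
/// Backtracking-style matcher for the fixed pattern `(a+)+$`, starting at
/// `pos`. Every call is counted so the amount of work becomes visible.
pub fn backtrack(input: &[u8], pos: usize, steps: &mut u64) -> bool {
    *steps += 1;
    // Length of the run of 'a' bytes starting at `pos`.
    let run = input[pos..].iter().take_while(|&&b| b == b'a').count();
    // Greedy `a+`: try the longest group first, then give bytes back.
    for len in (1..=run).rev() {
        let next = pos + len;
        if next == input.len() {
            return true; // `$` matches at the end of the input
        }
        if backtrack(input, next, steps) {
            return true; // another iteration of the outer `+` succeeded
        }
    }
    false
}

/// Automaton-style check for the same anchored pattern: a single pass,
/// one step per byte, and no position is ever revisited.
pub fn linear(input: &[u8], steps: &mut u64) -> bool {
    let mut all_a = !input.is_empty();
    for &b in input {
        *steps += 1;
        all_a &= b == b'a';
    }
    all_a
}

/// Runs both matchers on 16 'a' bytes followed by 'X' and returns
/// (backtracking steps, linear steps).
pub fn demo() -> (u64, u64) {
    let mut input = vec![b'a'; 16];
    input.push(b'X');
    let (mut slow, mut fast) = (0u64, 0u64);
    assert!(!backtrack(&input, 0, &mut slow));
    assert!(!linear(&input, &mut fast));
    (slow, fast) // (65536, 17)
}
```

In this model every additional `a` doubles the backtracking step count, while the linear scan grows by one step per byte. That growth pattern is the difference between the two engine families.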
ReDoS Protection
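The Rust regex crate guarantees matching time that is linear in the length of the input, whatever the pattern: the worst case is O(m·n) for a pattern of size m and an input of length n. This makes it safe against Regular Expression Denial of Service (ReDoS):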
Pattern: (a+)+$    Input: "aaaaaaaaaaaaaaaaX"

JavaScript RegExp:
  Tries up to 2^n ways of splitting the run of n a's
  → time doubles with every extra "a"; a few dozen can hang the thread
  This is a ReDoS vulnerability!

Rust regex crate:
  Linear scan → returns "no match" instantly
  Safe by construction (no backtracking)

Using the regex Crate in Wasm
use regex::Regex;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct TextProcessor {
    email_re: Regex,
    url_re: Regex,
    phone_re: Regex,
}

#[wasm_bindgen]
impl TextProcessor {
    // Compile every pattern once, when JS calls `new TextProcessor()`.
    #[wasm_bindgen(constructor)]
    pub fn new() -> Self {
        TextProcessor {
            email_re: Regex::new(r"[\w.+-]+@[\w-]+\.[\w.]+").unwrap(),
            url_re: Regex::new(r"https?://[\w\-._~:/?#\[\]@!$&'()*+,;=%]+").unwrap(),
            phone_re: Regex::new(r"\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}").unwrap(),
        }
    }

    // Return every email-like substring found in `text`.
    pub fn find_emails(&self, text: &str) -> Vec<String> {
        self.email_re.find_iter(text)
            .map(|m| m.as_str().to_string())
            .collect()
    }

    // True only if the whole string is a single email-like match. A bare
    // `is_match` would also accept text that merely contains an address.
    pub fn validate_email(&self, email: &str) -> bool {
        self.email_re
            .find(email)
            .map_or(false, |m| m.start() == 0 && m.end() == email.len())
    }

    // Replace every phone-like substring with a fixed marker.
    pub fn redact_phones(&self, text: &str) -> String {
        self.phone_re.replace_all(text, "[REDACTED]").to_string()
    }
}

Compiling regex to Wasm: size considerations
The regex crate adds approximately 200-400 KB to your Wasm binary (depending on Unicode table inclusion). To reduce size:
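# Cargo.toml
[dependencies]
regex = { version = "1", default-features = false, features = ["std"] }
# Drops the Unicode tables: saves ~100 KB, but \p{...} and the Unicode-aware
# \w, \d, \s and \b no longer compile (use ASCII forms such as (?-u:\w)).
# It also turns off the default perf-* optimizations, so matching may be slower.

Common Regex Patterns for Wasm Apps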
Pattern                              Matches                    Use Case
───────────────────────────────────  ─────────────────────────  ────────────────
[\w.+-]+@[\w-]+\.[\w.]+              user@example.com           Email validation
https?://[^\s]+                      http://example.com/path    URL extraction
\d{4}-\d{2}-\d{2}                    2024-01-15                 ISO date
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}   192.168.1.1                IP address
#[0-9a-fA-F]{6}                      #ff00aa                    Hex color
\b[A-Z]{2,}\b                        NASA, FBI                  Acronyms
^\s*$                                (empty/whitespace)         Blank lines
<[^>]+>                              <div class="x">            HTML tags

Text Search and Replace
Efficient search with compiled regex
use lazy_static::lazy_static;
use regex::Regex;

// Compile once, use many times
lazy_static! {
    static ref LOG_PATTERN: Regex = Regex::new(
        r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (ERROR|WARN|INFO|DEBUG): (.+)"
    ).unwrap();
}

// Split a log line into (timestamp, level, message); None if it does not match.
fn parse_log(line: &str) -> Option<(String, String, String)> {
    LOG_PATTERN.captures(line).map(|caps| (
        caps[1].to_string(), // timestamp
        caps[2].to_string(), // level
        caps[3].to_string(), // message
    ))
}

Search and replace with captures
use regex::Regex;

fn anonymize_data(text: &str) -> String {
    // Compiled on every call to keep the example short; hoist into statics on hot paths.
    let email_re = Regex::new(r"([\w.+-]+)@([\w-]+\.[\w.]+)").unwrap();
    let phone_re = Regex::new(r"\d{3}[-.]?\d{3}[-.]?\d{4}").unwrap();

    let result = email_re.replace_all(text, "***@$2"); // Keep domain
    let result = phone_re.replace_all(&result, "***-***-****");
    result.into_owned()
}

CSV Processing in Wasm
CSV parsing is a common use case where Wasm outperforms JavaScript significantly:
Processing 100,000 CSV rows
Papa Parse (JS)       ████████████████████████████████ 320ms
Rust csv crate (Wasm) ████████████ 120ms
← faster    slower →

use csv::ReaderBuilder;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn parse_csv(data: &str) -> JsValue {
    let mut reader = ReaderBuilder::new()
        .has_headers(true) // the first row is treated as a header, not as data
        .from_reader(data.as_bytes());

    let mut rows: Vec<Vec<String>> = Vec::new();
    for result in reader.records() {
        // Malformed records are skipped silently.
        if let Ok(record) = result {
            rows.push(record.iter().map(|s| s.to_string()).collect());
        }
    }

    // Hand the rows to JS as an array of string arrays.
    serde_wasm_bindgen::to_value(&rows).unwrap()
}

Unicode Handling
Rust's regex crate has excellent Unicode support, which matters for international text processing:
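
One practical consequence is that string lengths and offsets differ between the Rust and JavaScript sides: Rust strings are UTF-8 and the regex crate reports match positions as byte offsets, while JavaScript indexes strings by UTF-16 code unit. The sketch below is a std-only illustration; `lengths` and `utf16_index` are hypothetical helper names, not part of any crate.

```rust
/// Length of `s` in three units: UTF-8 bytes (what `str::len` and regex
/// match offsets count), Unicode scalar values (`char`s), and UTF-16 code
/// units (what a JavaScript string's `length` counts).
pub fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), s.encode_utf16().count())
}

/// Convert a UTF-8 byte offset into the matching UTF-16 index, for example
/// before handing a match position to JavaScript.
/// Panics if `byte_offset` is not on a char boundary of `s`.
pub fn utf16_index(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}
```

For instance, `lengths("\u{1F600}")` (a single emoji) returns `(4, 1, 2)`, and `utf16_index("\u{1F600}x", 4)` returns `2`, the index JavaScript would use for the `x`.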
Character            UTF-8 Bytes   Rust char   JS (UTF-16)
───────────────────  ────────────  ──────────  ─────────────
'A'                  1 byte        1 char      1 code unit
'é' (e with accent)  2 bytes       1 char      1 code unit
Chinese character    3 bytes       1 char      1 code unit
Emoji                4 bytes       1 char      2 code units!

use regex::Regex;
// Unicode-aware word boundary
let re = Regex::new(r"\b\w+\b").unwrap(); // Works with Unicode!
// Unicode category matching
let re = Regex::new(r"\p{Han}+").unwrap(); // Chinese characters
let re = Regex::new(r"\p{Cyrillic}+").unwrap(); // Russian/Cyrillic
let re = Regex::new(r"\p{Emoji}+").unwrap(); // Emoji
// Case-insensitive with Unicode
let re = Regex::new(r"(?i)strasse|straße").unwrap(); // Matches German; both spellings are listed because (?i) does not fold ß to "ss"

Performance Optimization Tips
Optimization                            Impact
──────────────────────────────────────  ──────────────────────────────────────────
Compile regex once, reuse               10-100x faster
Use is_match() for yes/no checks        Faster than find()/captures() (no offsets or groups)
Anchor patterns (^...$)                 Avoids scanning entire text
Use RegexSet for multiple patterns      1 scan instead of N scans
Disable Unicode if ASCII-only           Smaller binary, faster
Use regex::bytes::Regex for raw bytes   Avoids UTF-8 validation

RegexSet: Match multiple patterns in one pass
use regex::RegexSet;

let set = RegexSet::new(&[
    r"[\w.+-]+@[\w-]+\.[\w.]+",             // email
    r"https?://[^\s]+",                     // URL
    r"\d{3}[-.]?\d{3}[-.]?\d{4}",           // phone
    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",  // IP
]).unwrap();

let text = "Contact john@example.com or visit https://example.com";
let matches: Vec<usize> = set.matches(text).into_iter().collect();
// matches = [0, 1] → found email and URL patterns

Summary
Rust's regex crate compiled to Wasm gives you 2-5x faster text processing than JavaScript with guaranteed linear-time matching (no ReDoS). Use it for email validation, log parsing, CSV processing, search-and-replace, and any text-heavy workload. The crate adds ~200-400 KB to your Wasm binary but pays for itself in performance and safety. Combine it with full Unicode support for international text processing that just works.