A Developer's Complete Guide to Regular Expressions
What regular expressions are and why they matter
A regular expression is a declarative mini-language for describing text patterns. Instead of writing imperative code to parse a string character by character, you declare what you're looking for and the regex engine finds it. This makes regex concise for tasks like extracting emails from a log file, validating form input, splitting on arbitrary delimiters, or replacing text patterns across an entire codebase. Most major programming languages, including JavaScript, Python, Java, Go, Ruby, and PHP, ship a regex engine in their core or standard library, and Rust provides one through the widely used `regex` crate. The core syntax carries over between them, though each flavor differs in supported features, flags, and escaping rules.
The building blocks: literals, metacharacters, and character classes
Most characters in a regex match themselves literally: `hello` matches the string 'hello'. Metacharacters are characters with special meaning: `.` matches any single character except a newline (by default), `^` anchors to the start of the string, and `$` anchors to the end. Character classes let you match any character from a set: `[aeiou]` matches any lowercase vowel, `[a-z]` matches any lowercase ASCII letter, and `[^0-9]` matches any character that is not an ASCII digit. Shorthand classes are more convenient: `\d` is any digit, `\w` is any word character, and `\s` is any whitespace character. In ASCII mode `\d` is equivalent to `[0-9]` and `\w` to `[a-zA-Z0-9_]`; some engines, such as Python 3 on `str` patterns, also match Unicode digits and letters by default. Uppercase versions are negated: `\D` is any non-digit, `\W` is any non-word character, and `\S` is any non-whitespace character.
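As a small illustration of the classes described above, here is a sketch using Python's `re` module. The sample string and variable names are made up for the example, and `re.ASCII` is passed so that `\d` and `\w` behave exactly like `[0-9]` and `[a-zA-Z0-9_]`.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66 shipped to bay_7 at 09:30"

# [aeiou] matches one lowercase vowel at a time.
vowels = re.findall(r"[aeiou]", "regex")

# \d+ matches runs of digits; re.ASCII limits \d to [0-9].
numbers = re.findall(r"\d+", text, flags=re.ASCII)

# \w+ matches runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text, flags=re.ASCII)

# A negated class: runs containing neither digits nor whitespace.
leftovers = re.findall(r"[^0-9\s]+", text)

print(vowels)     # ['e', 'e']
print(numbers)    # ['66', '7', '09', '30']
print(words)      # ['Order', '66', 'shipped', 'to', 'bay_7', 'at', '09', '30']
print(leftovers)  # ['Order', 'shipped', 'to', 'bay_', 'at', ':']
```

Note how `bay_7` is a single `\w+` match because the underscore counts as a word character, while the negated class splits it at the digit.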
Quantifiers: controlling how many times to match
Quantifiers specify how many times the preceding element must appear. `*` means zero or more times. `+` means one or more times. `?` means zero or one time (it makes an element optional). `{n}` means exactly n times. `{n,m}` means between n and m times, inclusive. `{n,}` means n or more times. By default all quantifiers are greedy: they match as many characters as possible while still allowing the overall pattern to match. Add a `?` after any quantifier to make it lazy: `*?`, `+?`, `{n,m}?`. Lazy quantifiers match as few characters as possible and are especially useful when extracting delimited content such as HTML tags or quoted strings.
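The difference between greedy and lazy matching is easiest to see side by side. The following is a minimal sketch in Python; the sample strings are invented for the example.

```python
import re

# Hypothetical sample inputs.
html = "<b>bold</b> and <i>italic</i>"
sentence = 'say "hi" and "bye"'

# Greedy: .+ grabs as much as it can, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first closing '>', giving one match per tag.
lazy = re.findall(r"<.+?>", html)

# Lazy matching also isolates each quoted string.
quoted = re.findall(r'"(.*?)"', sentence)

# {2,4} accepts two to four digits, inclusive.
two_to_four = [s for s in ["1", "12", "1234", "12345"]
               if re.fullmatch(r"\d{2,4}", s)]

print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(lazy)         # ['<b>', '</b>', '<i>', '</i>']
print(quoted)       # ['hi', 'bye']
print(two_to_four)  # ['12', '1234']
```

This kind of tag matching is fine for quick extraction from simple, known input; it is not a substitute for a real HTML parser on arbitrary markup.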
Anchors and word boundaries
Anchors match a position in the string rather than a character. `^` matches the start of the string (or the start of each line with the `m` flag). `$` matches the end of the string (or the end of each line with the `m` flag). `\b` matches a word boundary: the position between a word character and a non-word character, or between a word character and the start or end of the string. Word boundaries are essential for matching whole words: `/\bcat\b/` matches 'cat' in 'the cat sat' but does NOT match 'cat' in 'concatenate' or 'catalog'. `\B` is the inverse: it matches any position that is NOT a word boundary. These zero-width assertions allow precise matching without consuming any characters.
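The anchor and word-boundary behavior described above can be checked with a short Python sketch. The log text is a made-up example; in Python the `m` flag is spelled `re.MULTILINE`.

```python
import re

whole_word = re.compile(r"\bcat\b")

# \b requires a boundary on both sides of 'cat'.
in_sentence = whole_word.search("the cat sat") is not None
in_concatenate = whole_word.search("concatenate") is not None
in_catalog = whole_word.search("catalog") is not None

# \B is the inverse: 'cat' must sit inside a longer word.
inner = re.findall(r"\Bcat\B", "concatenate")

# Hypothetical multi-line log used to show the effect of the m flag.
log = "ERROR disk full\nINFO ok\nERROR timeout"

# Without the flag, ^ matches only at the very start of the string.
start_only = re.findall(r"^ERROR", log)

# With re.MULTILINE, ^ also matches at the start of each line.
each_line = re.findall(r"^ERROR", log, flags=re.MULTILINE)

print(in_sentence, in_concatenate, in_catalog)  # True False False
print(inner)                                    # ['cat']
print(len(start_only), len(each_line))          # 1 2
```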
Groups, alternation, and backreferences
Parentheses group elements and capture the matched text: `(\d{4})-(\d{2})` captures a year and month separately. The `|` operator means alternation ('or'): `cat|dog` matches either 'cat' or 'dog'. Alternation applies to everything on each side, so use groups to scope it: `gr(a|e)y` matches 'gray' or 'grey'. A backreference `\1` refers to the text that capture group 1 actually matched, which is useful for finding repeated words: `(\w+) \1` matches 'the the' or 'is is'. Without word boundaries it also matches inside 'this is', so `\b(\w+) \1\b` is safer. Named groups such as `(?<year>\d{4})` make patterns self-documenting and allow reference by name in replacements; the exact syntax varies by flavor (Python, for example, writes `(?P<year>\d{4})`).
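Here is a short Python sketch of capture groups, scoped alternation, backreferences, and named groups. The input strings are invented for the example. One flavor difference to be aware of: Python's `re` module spells named groups `(?P<name>...)` and refers to them in replacements as `\g<name>`.

```python
import re

# Numbered capture groups: year and month are captured separately.
m = re.search(r"(\d{4})-(\d{2})", "Released 2023-07 in beta")
year_month = m.groups()

# Alternation scoped by a group; group(0) is the whole match.
spellings = [x.group(0) for x in re.finditer(r"gr(a|e)y", "gray or grey")]

# Backreference: \1 must repeat whatever group 1 matched.
# The \b anchors avoid false hits such as 'this is'.
dup = re.search(r"\b(\w+) \1\b", "it is is broken")
repeated_word = dup.group(1)
no_false_hit = re.search(r"\b(\w+) \1\b", "this is fine") is None

# Named groups (Python syntax) used by name in a replacement.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
swapped = pattern.sub(r"\g<month>/\g<year>", "Released 2023-07 in beta")

print(year_month)     # ('2023', '07')
print(spellings)      # ['gray', 'grey']
print(repeated_word)  # is
print(no_false_hit)   # True
print(swapped)        # Released 07/2023 in beta
```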
Lookaheads, lookbehinds, and zero-width assertions
Lookaheads and lookbehinds let you match based on what surrounds a position without including the surrounding text in the match. A positive lookahead `(?=...)` asserts that a pattern follows. A negative lookahead `(?!...)` asserts that a pattern does NOT follow. Lookbehinds `(?<=...)` and `(?<!...)` do the same for what precedes the match position; some engines, such as Python's `re`, require the lookbehind pattern to have a fixed width. Example: `\d+(?= dollars)` matches a number only when followed by ' dollars', but the result contains only the number, not ' dollars'. This is useful for extracting values from structured text without capturing delimiters.
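The four lookaround forms can be compared on a single string. This is a minimal Python sketch with a made-up sentence; the `\b` anchors in the negative cases are an addition to stop the engine from backtracking to a shorter run of digits (for example matching '2' out of '25').

```python
import re

# Hypothetical sample text.
text = "costs 25 dollars or 30 euros, was $40"

# Positive lookahead: digits followed by ' dollars' (not included in the match).
followed = re.findall(r"\d+(?= dollars)", text)

# Negative lookahead: whole numbers NOT followed by ' dollars'.
not_followed = re.findall(r"\b\d+\b(?! dollars)", text)

# Positive lookbehind: digits preceded by a literal '$'.
preceded = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers NOT preceded by '$'.
not_preceded = re.findall(r"(?<!\$)\b\d+", text)

print(followed)      # ['25']
print(not_followed)  # ['30', '40']
print(preceded)      # ['40']
print(not_preceded)  # ['25', '30']
```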
Regex across programming languages
The core regex syntax is similar across JavaScript, Python, Java, Go, and Ruby, but flags and method names differ. Python uses the `re` module: `re.findall()`, `re.sub()`, `re.search()`, and `re.match()` (which matches only at the start of the string). JavaScript exposes regex through string methods such as `.match()`, `.replace()`, and `.matchAll()`, and through `RegExp` methods such as `.test()` and `.exec()`. Go's `regexp` package uses RE2 syntax, which omits lookaheads, lookbehinds, and backreferences in order to guarantee matching in time linear in the size of the input. Ruby has first-class regex literals, like JavaScript. Java has no regex literal, so backslashes must be doubled inside string literals: `"\\d"` to match a digit. PCRE (used by PHP and many other tools) is modeled on Perl's regex syntax and is among the most feature-rich flavors; it supports atomic groups and possessive quantifiers, which JavaScript lacks.
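To make the Python side of this comparison concrete, the sketch below exercises the `re` functions named above on an invented input line. It also shows a common surprise: `re.match()` only tries the pattern at the start of the string, while `re.search()` scans the whole string.

```python
import re

# Hypothetical key=value line used only for illustration.
line = "user=alice id=42 user=bob id=7"

# findall returns the captured group for every match.
ids = re.findall(r"id=(\d+)", line)

# sub replaces every match with the given replacement string.
masked = re.sub(r"id=\d+", "id=?", line)

# match anchors at the start of the string, so this finds nothing...
at_start = re.match(r"id=\d+", line)

# ...while search finds the first occurrence anywhere in the string.
first = re.search(r"id=\d+", line).group(0)

print(ids)       # ['42', '7']
print(masked)    # user=alice id=? user=bob id=?
print(at_start)  # None
print(first)     # id=42
```

The raw-string prefix `r"..."` plays the same role as a regex literal in JavaScript or Ruby: it lets you write `\d` without doubling the backslash as Java requires.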