Regex (Regular Expression)

Introduction

Regex, short for Regular Expression, is a powerful syntax for searching, matching, and manipulating strings based on specific patterns. It is widely used in:

Data validation (e.g., emails, phone numbers)
Search-and-replace operations
Log file parsing
Input sanitization
Web scraping
Syntax highlighting
Compilers and lexers

Regular expressions are supported in almost every modern programming language (Python, JavaScript, Java, Perl, C#, etc.) and command-line tools (grep, sed, awk).

Core Concept

A regular expression is a sequence of characters that defines a search pattern. A pattern is built from:

Literal characters (e.g., hello)
Character classes (e.g., [A-Z])
Quantifiers (e.g., *, +, {n,m})
Anchors (e.g., ^, $)
Groups and alternation (e.g., (abc|def))

Basic Syntax

Symbol	Meaning	Example
`.`	Any character except newline	`a.b` → `acb`
`^`	Start of string	`^abc`
`$`	End of string	`xyz$`
`*`	0 or more repetitions	`a*` matches `""`, `a`, `aaa`
`+`	1 or more repetitions	`a+` matches `a`, `aa`
`?`	0 or 1 occurrence	`a?` matches `""`, `a`
`{n}`	Exactly n repetitions	`a{3}` matches `aaa`
`{n,}`	At least n repetitions	`a{2,}` matches `aa`, `aaa`
`{n,m}`	Between n and m repetitions	`a{2,4}` matches `aa`, `aaa`, `aaaa`
`[]`	Character class	`[aeiou]`
`|`	Alternation (OR)	`abc|def` matches `abc` or `def`
`()`	Grouping	`(ab)+`
`\`	Escape special characters	`\.` matches a literal `.`
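
The rows above can be checked directly in Python. This is a minimal sketch: the helper name `fits` is made up for illustration, and it uses `re.fullmatch` so that a pattern counts as matching only when it covers the entire string.

```python
import re

def fits(pattern: str, text: str) -> bool:
    """Return True only if the whole string fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(fits(r"a.b", "acb"))       # True: "." stands in for "c"
print(fits(r"a*", ""))           # True: zero repetitions are allowed
print(fits(r"a{2,4}", "aaaaa"))  # False: five repetitions exceed the upper bound
print(fits(r"(ab)+", "ababab"))  # True: the group "ab" repeats three times
print(fits(r"a\.b", "acb"))      # False: "\." matches only a literal dot
```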

Character Classes

Pattern	Matches
`[abc]`	a, b, or c
`[^abc]`	Any character except a, b, or c
`[a-z]`	Any lowercase letter
`[A-Z]`	Any uppercase letter
`[0-9]`	Any digit
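`\d`	Digit (`[0-9]`; Unicode-aware engines such as Python 3 also match other decimal digits)
`\D`	Non-digit
`\w`	Word character (`[a-zA-Z0-9_]`; Unicode-aware engines such as Python 3 also match non-ASCII letters and digits)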
`\W`	Non-word character
`\s`	Whitespace
`\S`	Non-whitespace
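
As a short illustration of the classes in this table, the sketch below runs three of them over an invented sample sentence. The variable names and the sample text are assumptions made for the example.

```python
import re

text = "Order 66: 3 apples, 12 pears!"  # invented sample text

digits = re.findall(r"\d+", text)        # runs of digits
words = re.findall(r"[A-Za-z]+", text)   # runs of ASCII letters
non_word = re.findall(r"[^\w\s]", text)  # neither word characters nor whitespace

print(digits)    # ['66', '3', '12']
print(words)     # ['Order', 'apples', 'pears']
print(non_word)  # [':', ',', '!']
```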

Anchors and Boundaries

Anchor	Matches at…
`^`	Start of string (or of each line in multiline mode)
`$`	End of string (or of each line in multiline mode)
`\b`	Word boundary
`\B`	Non-word boundary
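
The difference between `\b` and `\B` is easiest to see side by side. The following sketch uses an invented sample string; the variable names are illustrative only.

```python
import re

text = "cat concatenate cat's bobcat"  # invented sample text

# \bcat\b matches "cat" only as a whole word ("cat" and the "cat" in "cat's").
whole = re.findall(r"\bcat\b", text)
# \Bcat\B matches "cat" only with word characters on both sides ("concatenate").
inner = re.findall(r"\Bcat\B", text)

print(len(whole))  # 2
print(len(inner))  # 1

# ^ and $ pin the match to the ends of the string.
print(bool(re.search(r"^cat", text)))  # True: the text starts with "cat"
print(bool(re.search(r"cat$", text)))  # True: the text ends with "bobcat"
```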

Grouping and Capturing

import re

match = re.match(r"My name is (\w+)", "My name is Alice")
print(match.group(1))  # Alice

Parentheses () capture the matched value.
Use group(1), group(2), etc. to retrieve them.
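
A pattern can hold several groups at once. The sketch below, using an invented address, shows how `group(0)`, numbered groups, and `groups()` relate; since `re.match` returns `None` on failure, the result is checked before use.

```python
import re

m = re.match(r"(\w+)@(\w+)\.(\w+)", "alice@example.com")

if m is not None:
    print(m.group(0))  # alice@example.com (the entire match)
    print(m.group(1))  # alice
    print(m.group(2))  # example
    print(m.groups())  # ('alice', 'example', 'com')
```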

Non-Capturing Groups

(?:abc|def)

The ?: prefix disables capturing. Use it when you need grouping but never refer back to the matched text.
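
One place where the choice is visible is `re.findall`, which returns group contents when the pattern has a capturing group and whole matches otherwise. A minimal sketch with an invented input string:

```python
import re

text = "abc123 def456"  # invented sample text

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"(abc|def)\d+", text)
# Non-capturing group: findall returns each whole match.
whole = re.findall(r"(?:abc|def)\d+", text)

print(captured)  # ['abc', 'def']
print(whole)     # ['abc123', 'def456']

# Non-capturing groups do not take up a group number.
print(re.compile(r"(?:abc|def)(\d+)").groups)  # 1
```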

Lookaheads and Lookbehinds

Pattern	Description
`(?=...)`	Positive lookahead
`(?!...)`	Negative lookahead
`(?<=...)`	Positive lookbehind
`(?<!...)`	Negative lookbehind

Example

re.findall(r"\d+(?= dollars)", "Pay 100 dollars or 50 dollars")
# ['100', '50']

Matches digits only when they are followed by “ dollars”. The lookahead consumes no characters, so “ dollars” is not part of the result.
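
The other three assertions from the table work the same way. The sketch below is illustrative only: the sample text and variable names are invented. Note that Python's `re` module requires lookbehind patterns to have a fixed width.

```python
import re

text = "Price: $100, weight: 20kg, code: 42"  # invented sample text

# Positive lookbehind: digits directly preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Negative lookahead: numbers not followed by "kg". The \d alternative
# stops the engine from settling for a shorter run such as "2".
not_kg = re.findall(r"\d+(?!\d|kg)", text)

print(after_dollar)  # ['100']
print(no_dollar)     # ['20', '42']
print(not_kg)        # ['100', '42']
```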

Common Regex Patterns

Use Case	Pattern
Email	`\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b`
URL	`https?://[^\s]+`
Phone (US)	`\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}`
Date (YYYY-MM-DD)	`\d{4}-\d{2}-\d{2}`
Hex color	`#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})`
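
Patterns like these check the shape of the text, not its meaning. As a hedged sketch, the date pattern from the table accepts `2024-13-45`, so a second check is needed for real validation. The helper `is_valid_date` and the sample strings are assumptions made for this example, and pairing the regex with `datetime.strptime` is one possible approach among several.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # date pattern from the table
URL_RE = re.compile(r"https?://[^\s]+")     # URL pattern from the table

def is_valid_date(text: str) -> bool:
    """Check the shape with the regex, then the calendar with datetime."""
    if DATE_RE.fullmatch(text) is None:
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True

print(DATE_RE.fullmatch("2024-13-45") is not None)  # True: the shape is right
print(is_valid_date("2024-13-45"))                  # False: no month 13
print(is_valid_date("2024-02-29"))                  # True: 2024 is a leap year

urls = URL_RE.findall("See https://regex101.com and http://example.org/a?b=1 today")
print(urls)  # ['https://regex101.com', 'http://example.org/a?b=1']
```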

Regex in Python

Match from start of string

re.match(r"\d+", "123abc")  # matches
re.match(r"\d+", "abc123")  # None

Search anywhere in the string

re.search(r"\d+", "abc123")  # matches 123

Find all matches

re.findall(r"\d+", "a1b22c333")  # ['1', '22', '333']

Replace (Substitute)

re.sub(r"\d+", "#", "a1b22c333")  # "a#b#c#"
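
Because `re.match` and `re.search` return `None` when nothing matches, calling `.group()` on the result without a check raises an `AttributeError`. A minimal sketch of a safe wrapper follows; the function name `first_number` is hypothetical.

```python
import re
from typing import Optional

def first_number(text: str) -> Optional[int]:
    """Return the first run of digits as an int, or None if there is none."""
    m = re.search(r"\d+", text)
    if m is None:
        return None
    return int(m.group())

print(first_number("abc123def45"))  # 123
print(first_number("no digits"))    # None

# A match object also reports where the match was found.
print(re.search(r"\d+", "abc123").span())  # (3, 6)
```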

Regex in JavaScript

let str = "Email: user@example.com";  // placeholder address
let pattern = /\b\w+@\w+\.\w+\b/;
let match = str.match(pattern);
console.log(match[0]);  // "user@example.com"

Flags: /pattern/gi
- g: global
- i: case-insensitive
- m: multiline

Regex Flags (Python)

Flag	Description
`re.I`	Case-insensitive
`re.M`	Multi-line mode
`re.S`	Dot matches newline
`re.X`	Verbose mode (allows comments/spacing)
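
The effect of the first three flags can be seen on a two-line string. This is a small sketch with an invented input; the variable names are illustrative.

```python
import re

text = "first line\nSecond line"  # invented two-line sample

default_starts = re.findall(r"^\w+", text)      # ^ matches only at the start of the string
line_starts = re.findall(r"^\w+", text, re.M)   # ^ also matches after each newline

dot_default = re.search(r"line.Second", text)        # "." does not match "\n"
dot_all = re.search(r"line.Second", text, re.S)      # "." matches "\n" too

any_case = re.findall(r"second", text, re.I)    # case-insensitive

print(default_starts)          # ['first']
print(line_starts)             # ['first', 'Second']
print(dot_default)             # None
print(dot_all is not None)     # True
print(any_case)                # ['Second']
```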

Tools for Testing Regex

Tool	Website
regex101	https://regex101.com
regexr	https://regexr.com
Pythex	https://pythex.org
Debuggex	https://www.debuggex.com

These tools offer live previews, explanations, and syntax highlighting.

Performance Considerations

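Backtracking: Patterns with nested or overlapping quantifiers (e.g., (a+)+ or several .* in a row) can take exponential time on input that almost matches.
Use lazy quantifiers (*?, +?) when a greedy match consumes too much; they change what is matched but still backtrack, so prefer a specific character class (e.g., [^"]*) over .* where possible.
Anchoring your patterns (^, $) limits where the engine attempts a match.
For very large or untrusted input, consider engines that do not backtrack, such as RE2 or Rust's regex crate.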
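
The difference between greedy and lazy matching shows up in what gets matched. The sketch below uses an invented HTML-like string; the negated character class in the last line is an alternative added here for comparison, not something the text above prescribes.

```python
import re

html = "<b>one</b> and <b>two</b>"  # invented sample text

greedy = re.findall(r"<b>.*</b>", html)      # .* runs to the last </b>
lazy = re.findall(r"<b>.*?</b>", html)       # .*? stops at the first </b>
negated = re.findall(r"<b>[^<]*</b>", html)  # never crosses a "<" at all

print(greedy)   # ['<b>one</b> and <b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(negated)  # ['<b>one</b>', '<b>two</b>']
```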

Common Pitfalls

Pitfall	Explanation
Forgetting to escape `.`	It matches any character unless escaped as `\.`
Overuse of `.*`	Greedy matching can consume too much
Not using `^` and `$`	Partial matches can return unexpected results
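Misusing character ranges	`[a-zA-Z]` matches only letters, but `[A-z]` also matches `[`, `\`, `]`, `^`, `_`, and the backtick character
Nested groups confusion	Groups are numbered by the position of their opening parenthesis; use named groups or restructure the pattern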
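
Three of these pitfalls can be reproduced in a few lines. A minimal sketch with invented inputs:

```python
import re

# Unescaped dot: "." accepts any character, not just a literal period.
loose_dot = re.fullmatch(r"1.5", "1x5") is not None
strict_dot = re.fullmatch(r"1\.5", "1x5") is not None
print(loose_dot, strict_dot)  # True False

# [A-z] spans the punctuation that sits between "Z" and "a" in ASCII.
wide_range = re.findall(r"[A-z]", "a_B^1")
letters_only = re.findall(r"[A-Za-z]", "a_B^1")
print(wide_range)    # ['a', '_', 'B', '^']
print(letters_only)  # ['a', 'B']

# Without anchors, a four-digit pattern is found inside a five-digit string.
partial = re.search(r"\d{4}", "12345") is not None
anchored = re.search(r"^\d{4}$", "12345") is not None
print(partial, anchored)  # True False
```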

Best Practices

Test your patterns using regex tools
Use named groups for readability: (?P<name>\w+)
Avoid unnecessary capturing groups; use (?:...) when you don’t need them
Use raw strings in Python (r"pattern") to avoid double escaping
Comment your complex expressions (use re.X)
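
Several of these practices combine naturally. The sketch below is one possible layout, with an invented date string: a raw triple-quoted string, named groups, and `re.X` so that the pattern can carry its own comments.

```python
import re

DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.X,  # verbose mode: whitespace and "#" comments in the pattern are ignored
)

m = DATE_RE.fullmatch("2024-05-17")
if m is not None:
    print(m.group("year"))  # 2024
    print(m.groupdict())    # {'year': '2024', 'month': '05', 'day': '17'}
```

In verbose mode a literal space or `#` in the pattern has to be escaped or placed inside a character class.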

Conclusion

Regex is a highly expressive tool for string processing and validation. Mastering it allows developers to perform complex text operations with minimal code. However, because of its compact syntax and potential pitfalls, it’s important to: