OR HOW I LEARNED TO STOP WORRYING AND LOVE THE REGEX
We've all used regular expressions one way or another, maybe when defining a Django url, when grepping for files or processes, when defining rules in server configurations replacing this with that and so on.
(Throughout this text, I have used this notation to denote regular expressions or other code-related stuff.)
But what are regular expressions?
A regexp? is nothing to worry about.
It is a string composed of literal characters and metacharacters.
foo is a regex and so is [A-Z]+\d+ or /^1?$|^(11+?)\1+$/.
It's a sequence of characters that describes a pattern used to find text in a larger chunk of text, validate that a string complies to some conditions, do replace operations and split.
How it's made
It's all made possible by a regex engine. Such an engine can be found almost everywhere, languages use it either implicitly (such as Perl, javascript), or explicitly (like Python, Java, .NET), using libraries or modules. A great deal of popular tools also include a regex engine, tools like IDEs, grep, SQL conditions, vim? and many others.
Everything has regex.
I bet Mercedes has regex too.
Having said this, a regex is a string that describes a pattern that a regex engine uses in order to find matches in a body of text for the purposes listed above.
Regex engine
Everything started during the 50s, when a mathematician named Klenee conceived a regular language based on a mathematical property called regularity. Regular expression engines that conform to this regularity are DFA (Deterministic Finite Automatons). The name of the father of regular expressions is found in the * metacharacter a.k.a. Klenee star.
All languages or tools that use regex have their own regex engine, which can be of two types:
DFA: It's the oldest and fastest, based on Finite State Automaton, but not suited for advanced stuff. DFA matching is faster and more consistent and implemented in tools like GNU egrep and variations and awk.
NFA or Traditional NFA (Nondeterministic Finite Automaton): In time, people saw potential for creating more powerful patterns, so they wrote their own implementation of NFA engines.
NFA is based on backtracking. Usually, the pattern is compiled into byte-code, similar to machine instructions. The engine executes the code, going from one instruction to another. When one instruction fails, it backtracks, trying another method to match. You know, backtracking.
It's found in implementations of Python, Perl, Tcl, ed, sed, most versions of grep and some versions of egrep and awk.
The two types of engines are also called 'text-directed' and 'regex-directed'.
As an example to illustrate the difference between the two, take the pattern one(self)?(selfsufficient)? against 'oneselfsufficient'.
The NFA engine first matches 'one' and then 'self', leaving the optional 'selfsufficient' out, since it does not match 'sufficient'.
The DFA on the other hand, finds the longest match, 'oneselfsufficient', since it is aware of all possible matches.
A very nice explanation of the algorithms underneath the engines and the conversion from NDA to DFA can be found here, and a place to visualise the NFA and DFA constructs of a regular expressions here.
How to use them?
Regular expressions are like lego bricks.
The basic building blocks are simple, but they can be combined, with the proper know-how in really fancy structures.
The building blocks are:
Line anchors and position matchers - peculiar because they match positions rather than text ^ (string start), $ (string end) ^$ would match an empty line, with nothing on it, not even spaces. \b (word boundary, matches empty string at the beginning of the word). When searching for the word 'cat', by using \bcat\b words like 'application' or 'catholic' would not be matched. \A start of string and \Z end of string
Character classes [ ] Metacharacters change their meaning when used inside character classes [abc] would match 'a', 'b' or 'c' It can also be used with ranges, like [a-z], [0-9], [abcm-z]. The dash is a metacharacter only when used inside a character class, and not if it is the first character in the class. [^abc] negated character class. Match any character that is not 'a', 'b' or 'c'. e.g. [^L] the Christmas regex, get it? (no L) e.g. [0123456789abcdefABCDEF] or [0-9a-fA-F] for searching hex numbers.
Quantifiers They refer to the immediately preceding item and reflect the quantity of matches. + one or more times, but at least once * any number or occurrences, including none or all ? optional item, match one time or zero times. Like *, ? it is always successful. Nice, no? e.g. colou?r to match either 'color' or 'colour'. {m,n} interval quantifier, match at least m times and at most n times. Can also be used as {m,} or {m}.
Alternation | Combines more expressions into one, and tries to match any of the alternatives. e.g. gr(e|a)y to match 'gray' or 'grey'. If used without parentheses, it would match 'gre' or 'ay'.
Any character . . matches any character, usually except newlines (read below more on matching newlines). e.g Trying to match a date like '12-03-1999', but not knowing exactly if the separator is '.', '-' or '/'. For this, writing 12.03.1999 would match the dot, dash or slash, but would also match something like '12X03 1999'. Depending on the data that is being searched, this may or may not be safe enough. 12[-.\\]03[-.\\]1999 would be more accurate, and since this expression uses character classes, the dot here is an actual dot, not a metacharacter.
Text capturing () unnamed capture e.g. ([0-9]+) on '1900 Autumn' will match 1900 (?P<name>) named capture e.g. By using the expression (?P<year>([0-9]+)), one can then get the match value from match.group('year') (?:..) non capturing parentheses e.g. (?:[0-9]+) on '1900 Autumn' will not return anything, being used to increase readability or to group patterns that should not be matched.
Escaping character When wanting to match characters that are also metacharacters, they have to be escaped to have their literal meaning like so: '\.', '\*', '\?' etc. e.g. For matching a string like '1+1=2' the regular expression should be 1\+1=2, otherwise, the + has special meaning.
Character and class shorthands \d matches any digit (equivalent with [0-9]) \D matches any non digit (equivalent with [^\d]) \s matches whitespace characters (equivalent with [\t\r\n\f]) \S matches any non-whitespace characters \w matches any alphanumeric character (equivalent with [a-zA-Z0-9_]) \W matches any non-alphanumeric character (equivalent with [^\w])
Modifiers (Search options) IGNORECASE causes literal characters to match both the upper and the lower case character. MULTILINE used to alter the way newlines are interpreted by the engine, as such the ^ and $ apply to the beginning and end of each line, not only to the entire string. DOTALL allows the . metacharacter to match newlines as well. UNICODE and LOCALE make \w, \W, \b, \B, \s and \S depending on the Unicode character properties database, respectively, to the current locale. Under Python 2, objects use the ASCII character set and the regex processing assumes the pattern and the text are both ASCII. So, without setting the UNICODE flag, the string 'Français' will not be matched by the expression \w+, since 'ç' is not part of the ASCII character set. This is not necessary in Python 3, where unicode is used by default. VERBOSE with this flag enabled, the whitespaces within the pattern are ignored, allowing a more clear expression, split over multiple lines and containing comments.
Regexes start parsing from left to right (not sure about RTL languages), like such:
Using ^.*([0-9][0-9]) on 'CA 9765, USA' would first capture everything because of the .*. Moving on, there is [0-9], so a need for a digit. Since all successful states are saved, it backtracks and gets an 'A', and the match fails; does it again, it gets an 'S', keeps searching until it reaches '5' and the first [0-9] matches.
It repeats the process for the second [0-9].
A regex is naturally greedy. Greedy, but docile.
A greedy quantifier (as all quantifiers are by default) tells the engine to match as many instances as possible. To change this behavior, so to match as few as possible, the expression should be made lazy. So, while the expression A* on string 'AAA' catches everything, A*? catches only the first A.
When to use regex?
By using regular expressions, one is able to match varying sets of characters, and this is the first thing that is not possible by using other available string methods.
Use regular expressions to simplify conditions, to reduce complexity of conditions, to capture groups of text, to have easy to maintain code and to save time.
When not to use regex?
It is not at all a good idea to parse HTML with regular expressions or xml or any other structure for which parsers already exist
When the expression is not readable
When you don't know what you're doing
For ReDoS (regular expression denial of service)
When you have an abusive nature
Python's re module
re is the default module to support regular expressions in Python. Some say it's not so good but it provides most of the functionality one could ever need, so no need to worry.
The module provides several functions, constants and exceptions.
compile(pattern, flags=0) - it compiles a regular expression pattern into a regular expression object, which can be used for matching using the match and search methods. It is useful when the object is going to be reused throughout the program, otherwise, it can be skipped.
match(pattern, string, flags=0) - if zero or more characters from the beginning of the string match, it returns a Match object, otherwise, it returns None. The Match object contains details about the part of the string matched by the regular expression. The flags can be combined if they are "or"ed together.
search(pattern, string, flags=0) - similar to match(), but it scans all the string, not only its beginning.
findall(pattern, string, flags=0) - returns a list of all non-overlapping matches, not just one as with search(), in the order they were found. This method creates the entire list before it can be returned as a result.
finditer(pattern, string, flags=0) - returns an iterator that produces Match instances, not strings as with findall().
string = "AB CJ HD MS"
iterator = re.finditer("\w{2}", string)for match in iterator:
s = match.start()
e = match.end()print 'Found "%s" at %d:%d' % (string[s:e], s, e)#Found "AB" at 0:2#Found "CJ" at 3:5#Found "HD" at 6:8#Found "MS" at 9:11
re.sub(pattern, string, count=0, flags=0) - find all substrings where the pattern matches, and replace them with the new string.
string = 'January & February & March'
string = re.sub('&', '&', string)print string # January & February & March
string2 = re.sub('&', '&', 'January & February & March', 1)print string2 # January & February & March
# Trying to do the same thing in Perl would translate to:
my $string = 'January & February & March';( my $newstring = $string ) =~ s/&/&/g;
// And the same in javascriptvar string = 'January & February & March'
string = string.replace(/&/g, "&") // in those last examples, g stands for global replace
A rather useless but fun example is checking whether a number is prime.
The argument is flatten into unary form, such that number 5 becomes 11111, after which it goups the flattened form into multiple equal groups. If it successfully groups them, the number is not prime.
def is_prime(n):return re.match(r'^1?$|^(11+?)\1+$', "1" * n) == None;
Performance
It is important when writing a pattern to be as specific as possible, in order to bypass possible performance issues.
Most performance problems that appear are due to the regex-directed nature of the NFA engine, based on backtracking, and it is desireable to construct the regex as to avoid backtracking when possible.
When examining efficiency, we can think of the number of individual tests that the engine does in the course of a match. For example, when matching marty against 'smarty', there are 6 tests in the initial attempt to match m against 's', then a against 's', and so on, and zero backtracks. Anyway, all tools have varying optimizations and it depends on the one you use.
Regular expressions are a very useful, cross-language, cross-platform, processing tool that one can use in the command line, in the editor, in code, or to paint city walls.
For further reading, I suggest trying out Jeffrey E.F. Friedl's 'Mastering Regular Expressions', and this place or this one are nice for trying out patterns.
You can find other interesting articles on similar topics here.
Enjoy!
Comments