8 FEB, 2023

Regex, huh?

Regular expressions, usually shortened to regex or re, are a fantastic way to search for patterns of text amongst other text. There are a multitude of ways to configure the search pattern, and several neat little tricks that make them significantly more useful than just another way of doing 'Find & Replace.'

I will be using Python (and its built-in re module) for my examples.

The basics

A search pattern is applied to some other text. The pattern consists of some arrangement of text characters and symbols, which do not always literally mean themselves. The other text is usually somewhat unknown, with some patterns within it that are known.

An example search pattern could be:

r"Hello, world!"

This will exactly match the literal text Hello, world!. We can do so much better than that though! What if we need to match some variable bit of text? Say, a year: four digits in a row, each between zero and nine to be inclusive and 'future-proof.'

r"[0-9]{4}"

r" " Encloses a raw string in Python, the usual way to write a regex
[ ] Encloses a set of characters, where only one is matched
0-9 The digit characters from 0 to 9
{ } Encloses the number of times to match the preceding thing
4 Match four times
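
To see that pattern in action, here's a minimal sketch. The sample sentence is made up for illustration; re.search and re.findall are the standard functions from Python's re module.

```python
import re

# Made-up sample text, just for illustration
text = "Released in 1999, remastered in 2023."

# search() returns the first match, or None if nothing matches
first = re.search(r"[0-9]{4}", text)
print(first.group(0))  # 1999

# findall() returns every non-overlapping match as a list of strings
years = re.findall(r"[0-9]{4}", text)
print(years)  # ['1999', '2023']
```

Note that this pattern will happily match any four digits in a row, including the first four digits of a longer number, so it's a starting point rather than a proper year validator.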

That seems pretty straightforward, and it only gets cooler from here! The one challenge with regular expressions is that whitespace (spaces, new lines) is treated as itself, so adding any for readability would change the search pattern. Hence, search patterns are usually dense lines of characters and symbols, which is why it's super important to break them apart yourself when you're trying to make / understand them. That being said, let's look at a more complex example:

r"^\$title = \"(.*)\";"

r" " Encloses a raw string in Python, the usual way to write a regex
^ Matches the start of the text to be searched*
\$ Matches the literal character $
title =  Matches the literal text title = 
\" Matches the literal character "
( ) Encloses a subgroup of things to match
. Matches one of any character
* Matches the preceding thing any number of times
; Matches the literal character ;

* Or the start of a new line within the text, depending on the options used to match
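
In Python, the option that footnote refers to is the re.MULTILINE flag. Here's a rough sketch of the difference it makes; the three-line content string is a made-up stand-in for a file being searched.

```python
import re

# Made-up content where the line we want is NOT the first line
content = 'first line\n$title = "Regex, huh?";\nlast line'
pattern = r"^\$title = \"(.*)\";"

# By default, ^ only matches at the very start of the whole text
default_match = re.search(pattern, content)
print(default_match)  # None

# With re.MULTILINE, ^ also matches right after each newline
multiline_match = re.search(pattern, content, re.MULTILINE)
print(multiline_match.group(0))  # $title = "Regex, huh?";
```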

You may note a few backslash \ symbols in this one. They are used to match the following character literally, instead of using its special meaning (which is the default).
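
A quick sketch of why the backslash matters, using a made-up price string. Unescaped, $ keeps its special meaning (the end of the text); escaped, it's just a dollar sign. Python's re.escape will add the backslashes for you if you'd rather not do it by hand.

```python
import re

# Made-up sample text
text = "Total: $5.00"

# Unescaped, $ means "end of the text", so nothing can follow it
unescaped = re.search(r"$5", text)
print(unescaped)  # None

# Escaped, \$ matches a literal dollar sign
escaped = re.search(r"\$5", text)
print(escaped.group(0))  # $5

# re.escape() backslash-escapes the special characters in a string
safe = re.escape("$5.00")
print(safe)  # \$5\.00
```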

This example will search through and match something like $title = "Regex, huh?";. I also used parentheses to denote a subgroup - this can be extracted separately!

> re.search(r"^\$title = \"(.*)\";", content).group(0)
'$title = "Regex, huh?";'

> re.search(r"^\$title = \"(.*)\";", content).group(1)
'Regex, huh?'
	

What next?

The best part of regular expressions is that all of these techniques can be combined with each other in endless ways. You can even exclude certain things from being matched within your search pattern, like fetching all the pages of a website except for the index. Anyway, this is the part where you go and learn more for yourself.

If you're curious about that whole exclusion thing, here's a sample:

r"^(?!index)(.*)\.php"

r" " Encloses a raw string in Python, the usual way to write a regex
^ Matches the start of the text to be searched
( ) Encloses a subgroup of things to match
? Directly after an opening parenthesis, marks a special group such as a 'lookahead'*
! Makes the lookahead negative^
index The literal characters index
\.php The literal characters .php

* Needs a determining symbol to follow it

^ Combined with ? this becomes a 'negative lookahead' i.e. "don't match here if this text comes next"
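
Here's a small sketch of that exclusion pattern run against a made-up list of file names. The list and the variable names are mine, purely for illustration.

```python
import re

pattern = r"^(?!index)(.*)\.php"

# Made-up file names to test against
pages = ["index.php", "about.php", "contact.php", "indexes.php", "style.css"]

# Keep only the names the pattern matches
matches = [page for page in pages if re.search(pattern, page)]
print(matches)  # ['about.php', 'contact.php']

# A stricter variation (my own tweak): only reject the exact name index.php
strict = r"^(?!index\.php$)(.*)\.php"
strict_matches = [page for page in pages if re.search(strict, page)]
print(strict_matches)  # ['about.php', 'contact.php', 'indexes.php']
```

Notice that the original pattern also rejects indexes.php, because the lookahead only checks that the text starts with index. Whether that's what you want depends on your file names.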

Tools to help

The best resource for figuring things out is the documentation for your package or programming language. The next best resource is regex101.com. It gives a validator and checker for the common regex flavours used by popular languages, as well as a neat breakdown with detailed descriptions of what each character in the search pattern means.