Regular Expressions Tutorial with Useful Expressions

Ever needed to clean up some HTML content, grab the text between tags, or find all the email addresses in a document? Well, regular expressions are here to the rescue.

You are watching Web Snacks and I’m John Harbison.

Regular expressions, sometimes called regex, are sequences of characters used to match patterns. They are commonly used in many programming languages like Perl, PHP, JavaScript, Ruby and many others, but today we are using them in a text editor. I will be using Sublime Text for my examples.

You can find a full list of editors that use Regex on Wikipedia
https://en.wikipedia.org/wiki/List_of_regular_expression_software

Why would you use Regular Expressions?
Simply because it saves time.
Sure, you could go through a document and click and copy and edit, but it is so much simpler to just have the computer work for you. Instead of manually copying email addresses in the document, you could grab all of them at once. Instead of inserting line breaks one at a time to format the document, you can add all of them programmatically. Regex lets the computer work for you.

To really understand regex you need to understand the syntax – Today though I want to give you some simple pattern matches that are extremely useful and we are going to use Sublime Text to see them in real time.

For an example I’ve got a simple HTML document that has all of the lines joined. You might see some nasty html like this if you open someone else’s code or a site has been minified to eliminate the white space. You may need to edit it so being able to make it more human readable can be a time saver.

So I’ve opened the document in Sublime Text. I hit Command F on Mac or Ctrl F on Windows to open the Find dialogue. On the left of the Find bar there are some buttons that I need to turn on. The first is the .* button, which means we are searching with regular expressions. The next is the “Wrap” button, which allows me to search the entire document at once instead of just searching below or above my cursor. The last is the highlight matches button, which is directly to the left of the search box.

Now I’m ready to work.

The first thing I want to do is split my paragraphs of content. In the search field I will type

</p>

which will match all of the closing paragraph tags. As I type it in, you’ll see that they all become highlighted. I’ll then hit the “Find All” button, which will select all instances of the closing paragraph tag. Hit the forward (right) arrow on the keyboard, which will move the cursor to the end of the selected tags. Hit “enter” twice to insert line breaks in between the paragraphs. This makes the wall of text much more readable.

We are going to continue making the markup readable by putting all of the list items on separate lines. Command F to pull up the find dialogue. Type in the ending list item tag

</li>

Click Find all then hit the forward arrow then enter. Just like we did with the paragraphs, now all of our list items will be easier to read. We can finish up by moving the opening unordered list tags to their own lines as well. Also in Sublime Text you can select multiple lines and hit the tab key to indent lines.
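The same two clean-up passes can be sketched in code. This is only an illustration in Python, not part of the Sublime Text workflow: the sample markup and the function name `split_onto_lines` are made up for the example, and Python’s `re` module is a different regex engine from the one in the editor.

```python
import re


def split_onto_lines(markup: str) -> str:
    """Insert line breaks after closing paragraph and list item tags."""
    # Two line breaks after every </p>, like "Find All", right arrow, Enter twice.
    markup = re.sub(r"</p>", "</p>\n\n", markup)
    # One line break after every </li>.
    markup = re.sub(r"</li>", "</li>\n", markup)
    return markup


# Hypothetical minified markup standing in for the sample document.
minified = "<p>First paragraph.</p><p>Second paragraph.</p><ul><li>One</li><li>Two</li></ul>"
readable = split_onto_lines(minified)
print(readable)
```

Both patterns here are plain text with no special characters, which is why they behave exactly like an ordinary search.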

For these two examples all we did was a straight tag match. Regex will match the plain text you type, character for character. There are some scenarios where you have to add a backslash in front of a character to get it to match, because characters like the period, plus sign and question mark have special meanings, but for the most part if you type it, it will match. For this reason you may have actually used a regular expression matching system in the past and not even known it.
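Here is a small sketch of when a backslash is needed, using Python’s `re` module as a stand-in for the editor. The sample strings are invented for the example.

```python
import re

sample = "5.00 5x00"

# Unescaped, the period is a wildcard, so it also matches the "x".
unescaped = re.findall(r"5.00", sample)

# With a backslash, the period only matches a literal period.
escaped = re.findall(r"5\.00", sample)

# re.escape adds the backslashes for you when the text is meant literally.
literal = "$5.00 (approx.)"
safe_pattern = re.escape(literal)
matches_itself = re.fullmatch(safe_pattern, literal) is not None

print(unescaped)       # ['5.00', '5x00']
print(escaped)         # ['5.00']
print(matches_itself)  # True
```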

But let’s do some other things that might be a little more complicated.

What if we wanted to make a collection of links? For this I’m going to select all of the anchor tags, complete with their linked content, and put them all in a list at the bottom of the page.

Let’s start by adding a header. I’ll add an

<h1>

tag and name it All Links. Now let’s select all of our links. Command F to bring up Find. Now, all anchor tags start with an opening angle bracket and an a, so we will start with

<a 

Now we want everything until the closing

</a>

So if I started writing my expression, I know it would begin with <a and end with </a>. So how do we get all the stuff in the middle? We are going to use a wildcard character, which is a period. If you are watching the screen, you will see that typing

<a.

selects the beginning of every link, and the period grabs the space after the a. The period wildcard only matches one character at a time. The next thing we have to do is tell it we want all of the characters after the a. For that we add a plus sign, which means “one or more” and is called a greedy quantifier. So now our regex looks like this

<a.+ 

If you are typing along with me, you will see that everything from the first opening anchor tag to the end of the line is now selected. So now we need to make it stop its selection at the closing of the anchor tag. You should now have

<a.+</a> 

and it will select your links, as long as each link sits on its own line. I will say that this isn’t a foolproof regex pattern. Because the plus sign is greedy, it grabs as much as it can. If several links sit on the same line, the match runs from the first opening anchor tag all the way to the last closing anchor tag on that line, swallowing all the content in between. To fix this pattern you need to add a question mark after the plus sign. This makes it a lazy quantifier. With the question mark in front of your closing anchor tag, the expression now says: find everything until you get to the first closing anchor tag. Your pattern is lazy and doesn’t want to select any more. So your final expression looks like this

<a.+?</a>

Hit Find All and then Command C to copy them. Put your cursor at the bottom under the header and paste.
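To see the difference between the greedy and lazy versions outside the editor, here is a minimal Python sketch. The sample line and its two links are made up for illustration.

```python
import re

# Hypothetical line with two links on it.
line = '<p>See <a href="/one">one</a> and <a href="/two">two</a>.</p>'

# Greedy: one match that runs from the first <a to the last </a>.
greedy = re.findall(r"<a.+</a>", line)

# Lazy: each match stops at the first </a> it reaches.
lazy = re.findall(r"<a.+?</a>", line)

print(greedy)  # ['<a href="/one">one</a> and <a href="/two">two</a>']
print(lazy)    # ['<a href="/one">one</a>', '<a href="/two">two</a>']

# The lazy matches are what you would paste under the "All Links" header.
all_links = "\n".join(lazy)
print(all_links)
```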

Using the same technique, you could also grab all of the text in between the paragraph tags.

<p.+?</p>

would select all of the paragraphs.

To grab all of the list items, the expression would be

<li.+?</li>
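Since the paragraph and list item patterns have the same shape, they can be built from the tag name. The helper below, `grab_elements`, is a hypothetical name and the markup is invented; it is a sketch of the idea in Python rather than anything the editor does for you.

```python
import re


def grab_elements(tag: str, markup: str) -> list[str]:
    """Return every <tag ...>...</tag> match, tags included."""
    # Same shape as <p.+?</p> and <li.+?</li>: opening tag, lazy wildcard, closing tag.
    pattern = f"<{re.escape(tag)}.+?</{re.escape(tag)}>"
    return re.findall(pattern, markup)


# Hypothetical sample markup.
markup = "<p>Intro</p>\n<ul><li>One</li><li>Two</li></ul>"

paragraphs = grab_elements("p", markup)
items = grab_elements("li", markup)

print(paragraphs)  # ['<p>Intro</p>']
print(items)       # ['<li>One</li>', '<li>Two</li>']
```

One thing to watch for: because the pattern only checks the first letters of the tag, `<p` would also start matching at a `<pre>` tag, so it is worth eyeballing the results.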

As you noticed in the paragraph and list item examples, the match even grabbed the tags. If you didn’t want to grab the tags, you would need to make an expression using the “look behind” and “look ahead” special groups. When you group an expression you put parentheses around the group. The first special group we will look at is the “look behind” group. It basically means “hey, find this pattern, then start matching right after it,” without including the pattern itself in the match. The positive look behind looks like this

(?<=PATTERN)

so for a paragraph tag you’d have

(?<=<p>)

So if you type that into Sublime Text you get … nothing! That is because this group only matches a position, not any characters. We are using it to NOT select the tags; we just want to find them. So we need to add a wildcard and a greedy quantifier to our expression.

(?<=<p>).+

AHA! Now we have Sublime Text grabbing everything after our opening paragraph tags. To have it stop matching before the closing tag requires us to use the “look ahead” special group, which looks like

(?=PATTERN)

So our look ahead group will look like

(?=</p>)

Add it to the end of the expression and, if you are typing this into Sublime Text, you’ll see the matching stop at the closing tag. You’ve just grabbed everything between the tags.
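Here is a short Python sketch of the two special groups working together: `(?<=<p>)` looks behind the match for the opening tag and `(?=</p>)` looks ahead of it for the closing tag. The sample markup is made up, and the lazy `.+?` in the last example is my own variation for the case where several paragraphs share a line.

```python
import re

# Hypothetical markup with one paragraph per line.
markup = "<p>First paragraph.</p>\n<p>Second paragraph.</p>"

# On its own the group matches positions, not characters: two empty matches.
positions = re.findall(r"(?<=<p>)", markup)

# Group + wildcard + group: the text between the tags, without the tags.
inner = re.findall(r"(?<=<p>).+(?=</p>)", markup)

# If paragraphs share a line, greedy .+ runs to the last </p>; lazy .+? does not.
one_line = "<p>a</p><p>b</p>"
greedy_inner = re.findall(r"(?<=<p>).+(?=</p>)", one_line)
lazy_inner = re.findall(r"(?<=<p>).+?(?=</p>)", one_line)

print(positions)     # ['', '']
print(inner)         # ['First paragraph.', 'Second paragraph.']
print(greedy_inner)  # ['a</p><p>b']
print(lazy_inner)    # ['a', 'b']
```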

So far all of the techniques have been extremely useful and you could use them for most cases. The last expression I will show you is fairly complex, but you may need to use something like it since it could be a fairly common scenario. What if you needed to grab all of the email addresses in a document? You can use a regular expression to grab them all. So, drum rolllllllll, here is the expression we are going to use to grab all of the emails.

[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]+

OMG, right! This is a crazy expression, but I’ll talk you through it. When you use square brackets in regex you are declaring a character set. This could be a straight set of characters that you might want to grab: [123] would grab all of the instances of 1, 2, or 3. If you add a hyphen the set becomes a range, so [0-9] will match any digit from 0 to 9. The same goes for letters: [a-z] would give you all lowercase letters. So looking at the first part of our expression
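The character sets described here can be tried one at a time. This is a small Python sketch with an invented sample string, shown only to make the sets and ranges concrete.

```python
import re

# Hypothetical sample text.
sample = "Call 555-0142 ext. 9"

only_123 = re.findall(r"[123]", sample)      # just the characters 1, 2 and 3
digits = re.findall(r"[0-9]", sample)        # any single digit
lowercase = re.findall(r"[a-z]", sample)     # any single lowercase letter
words = re.findall(r"[a-zA-Z0-9]+", sample)  # runs of letters and digits

print(only_123)   # ['1', '2']
print(digits)     # ['5', '5', '5', '0', '1', '4', '2', '9']
print(lowercase)  # ['a', 'l', 'l', 'e', 'x', 't']
print(words)      # ['Call', '555', '0142', 'ext', '9']
```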

[a-zA-Z0-9]

will match all lowercase and uppercase letters and all digits 0-9. So even though it is a bit cryptic, it makes sense. You might be wondering why we wouldn’t just use the wildcard period. That is because we want to validate the email addresses somewhat. If you look at the expression further, you’ll see the underscore, then backslash hyphen and backslash period. We need to add those characters to our matching because email addresses can contain them as well. The hyphen and period have backslashes in front of them so that they are treated as plain characters and not as special regex characters like a range or a wildcard. (Inside square brackets the period is already literal, so its backslash is optional, but it does no harm.)

Going back to the expression, you’ll see the plus symbol, which is the greedy quantifier, meaning it will keep matching those characters over and over until it gets to the @ symbol. So this is how we are going to match an email address. Most emails have letters, numbers and a few symbols, then the @ symbol, then some more letters, numbers and a few symbols, then a period, and then they finish with some letters that run until the next character that isn’t a letter, such as a space. Our final expression for an email is actually three character sets with special characters in between: a character set with a greedy quantifier, an @ symbol, another character set with a greedy quantifier, a backslashed period (which is treated as a literal period), and finally another character set followed by a greedy quantifier. This will select most email addresses.

So will it grab all email addresses? No, because there are all types of crazy email addresses out there. If you search for an email-validating regular expression, you’ll find it is huge and complicated, but this expression will work for most cases.

We can use our expression to grab all of the emails in our HTML file and paste them at the bottom. Command F, paste or type in the expression, and then Find All. Command C to copy them. Place your cursor and paste. Voilà. All the email addresses.
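The same expression works unchanged in code. Below is a minimal Python sketch; the function name `find_emails`, the sample page and the example.com addresses are all made up for illustration.

```python
import re

# The email expression from above, unchanged.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]+")


def find_emails(text: str) -> list[str]:
    """Return every substring that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical page content.
page = (
    '<p>Write to <a href="mailto:jane.doe@example.com">Jane</a> '
    "or sales-team@example.co.uk.</p>"
)

emails = find_emails(page)
print("\n".join(emails))
```

Notice that the period ending the sentence is not swept up with the second address, because the expression has to finish on letters.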

Regular Expressions are awesome and powerful – you’ve seen what they can do in a simple text editor – now imagine what you can do if you start programming with them. They are magic.