How Regular Expressions Can Help You Manipulate Your Data

As a former retail marketing analyst, I used Excel extensively, and I grew to love and respect it even more than before. It’s a powerful tool that’s taken for granted, especially by those who need it. But, and I knew this as I was whipping up Vlookup/Match combinations for my reports, I was barely scratching the surface of what this kind of application was capable of. When I started getting deeper into digital marketing on my own, I realized just how much more I needed to know and understand about spreadsheets.

When it comes to getting the most from a spreadsheet, being agile with regular expressions, RegEx, as it relates to data analysis, is a powerful skill to have. You may already be familiar with regular expressions, but you may not realize how they can work with Google Sheets or even Google Data Studio. If you’re using Excel, it’s important to realize that there is no built-in RegEx functionality, unless you integrate VBA. VBA is a complicated language and using it to integrate regex into your Excel sheets gets messy.

This is one of the many reasons why working with Google is a far easier option. Here’s what this post is going to cover:

  1. Google Sheets Regular Expression Syntax Notations
  2. Essential RegEx Functions

For this post, here’s a sample of the PPC marketing data I’m using from my FauxStore that’s sorted in descending order by Cost:

Screenshot of the data used for Regular Expressions calculations in Google Sheets

Google Sheets Regular Expression Syntax Notations

Dot Asterisk (.*)

Dot (.) matches ANY character, even blanks, while asterisk (*) matches 0 or more times. When combined, dot asterisk (.*), you are able to get ANY combination of characters. Example, .*widget matches any number of characters before the term widget, such as black widget , white widget, etc…

Using .*et, this shows the resulting campaign names that fit the pattern. It applies whether the the .* is before or after the pattern

Results of using dot asterisk regular expression notation

Question Mark (?)

The question mark matches a character 0 or more times. For example, widget? matches any text with widget, widgets, widgets, widgeted…you get the point. In this example, I used rand? as the expression, and it picked up the words containing those characters in that order with different endings!

What happens when we use the question mark in a regular expression

Dot Plus (.+)

Plus (+) matches 1 or more times, and when combined with the dot, it matches 1 or more characters after the word. For example, widget.+ will match widgets but not widget. In this example, of rand.+, it only picks up those with an additional ending:

Example of using dot plus notation

Pipe (|)

Logical OR is designated by the pipe character | and can be used to compare letters or words. For example: widget|sprocket means give me any terms with either widget or sprocket. Here, I wanted to see Competitor or Remarketing campaigns only, so I used the pipe notation. Bear in mind, you can also use it for single letters, which is handy if you know you’re dealing with some names with upper and lower cases of the same word in the sheet.

Using pipe regular expression notation - TeXXic by Transitionyte.com

 

String: Caret(^) & Dollar Sign ($)

^ (which is right above the 6 on  most keyboards) means the first character in a string, while $ means matching the last number in a string. For example, ^W, means choose all strings that begin with the letter ‘W’, while t$ means choose strings that end with the letter ‘t’.

Here I wanted to check on campaigns that begin with A (capital, because if it had been lower case, results would have come up as N/A) and I wanted those that ended with r:

Using caret and dollar sign regular expressions

 

These are just some of the expressions that most use, and that you’ll use when trying to manipulate data in a certain way.

 

Mastering Regular Expressions by Jeffrey E. F. Friedl

If you want to really get a handle of regular expressions

There are quite a few more regular expressions, that you can use, nest, and play around with. You can combine them with spreadsheet functions, which is even more advanced. But for now, these basics provide a great place to start.

Essential RegEx Functions

When you need to clean up categorical data, regular expression functions such as REGEXMATCH, REGEXREPLACE and REGEXEXTRACT are the quickest ways to accomplish this, especially if you have hundreds of lines of data. Imagine if you had to go through and clean all of that line by line, you would go crazy. Regular expressions help you to accomplish those changes in broad strokes accurately.

REGEXMATCH

REGEXMATCH asks the if data with a certain configuration is in the spreadsheet, and it returns a boolean TRUE or FALSE. By itself, that’s somewhat unhelpful, and that’s why REGEXMATCH is used with the Filter () function. Filter() returns a special version of the original range based on criteria specified by REGEXMATCH. This RegEx has a minimum of 2 arguments:

  • Range – the table you want to search through, e.g. A2:F35. Notice, I’m not doing the first row. In some examples, they do include the first row, but the sorting that will take place, doesn’t reprint the header, and renders the same result whether you’re starting with A1 or A2
  • Condition1 – This is the first condition for the sort, such as A1:A35 > 50. You can add as many conditions as you want after this first one. In this case, the first condition will contain the REGEXMATCH function

Here’s an example of what resulted when I filtered my FauxStore costs that were lower than $25 using the following formula: Filter(A1:E29,D1:D29<25)

Results from my quick use of the FILTER function in my FauxStore Sheet

Notice that the headers didn’t get added with this, and that’s fine, because if you’re doing this function, you would have the headers copied already, and the last thing you want to do is to duplicate them while doing this sorting.

As you can see, that one formula gave me this table of costs that were lower than $25.

That second part after the comma is where you’d add the REGEXMATCH function. Say, you wanted to get all the competitor campaign, but I notice that the campaign name isn’t consistent. I have Competitors and Competitor, which indicates this isn’t a clean data table. This is where the  ?  would come in. My RegEx would be Competitors?, which would take care of singular or plural.

The function would look like this: Filter(A1:E29,REGEXMATCH(B1:B29,H5))

The data range is the same. The REGEXMATCH is looking for the expression in the campaign column, and it’s using the H5 as the pattern to search for.

Now, H5 is the separate cell with the regular expression. This is one of the best practices I recommend, because it’s easier to deal with a regular expression in a separate cell that you can reference. If you need to change it, you just need to change the cell and not hunt for it in the function. The results are:

Results of Filter function with REGEXMATCH on my FauxStore

 

REGEXREPLACE

As I mentioned, my FauxStore data is a bit messy. There are differing campaign names that can make analysis a challenge. For example, you can see that I have the words Brand in upper and lower case and I want to make that a more uniform campaign name. This is where Google Sheets’ REGEXREPLACE helps in a more sophisticated way than Excel.

Firstly, REGEXREPLACE has 3 arguments.

  • Argument 1 is the text you want to test,
  • Argument 2 is the regular expression you want to test argument 1 with, and
  • Argument 3 is what you want to replace the original text with.

What I did, again, was put my regular expression syntax in a separate cell, in this case A1. I

Setting up Argument 2 of REGEXREPLACE - TeXXic by Transitionyte.com

start with ‘ . ‘ because it considers even the space before the characters we’re looking for. Then I grouped the OR statement in square brackets to tell it that I’m looking for instances where the word is either brand or Brand. Finally, I make sure that argument 3 has the replacement string encased in quotes with the appropriate spacing so my terms aren’t bunched together. So, my REGEXREPLACE syntax

Syntax of RegExReplace formula - TeXXic by Transitionyte.com

looks like this: REGEXREPLACE(B3,$A$1, Brand”), where I’m referencing the text in B3, testing it against the absolute reference of what I have in A1, and replacing it with that new string.

Here’s a snip of the result for the 5th column, New Campaign Name:

Result of RegexExtract on Data

 

REGEXEXTRACT

Now, I’ve decided to do away with the ‘Brand’ part of the campaign name altogether and just stick to the regular names. With REGEXEXTRACT, you have the part you want to match and the part you want to extract, which is encased in parentheses.

REGEXTRACT only takes 2 arguments, the text to test and your precise regular expression syntax.

regexextract regular expression syntax to use

Initial Regexextract formula to change my campaign name

Here is the regular expression, I’ll be using and the REGEXEXTRACT:

 

Looks simple enough, right? So, here’s the result from that:

REGEXEXTRACT formula error result - #N/A

And there’s a problem.

The trick with REGEXEXTRACT is that the pattern has to exist or it will throw up an #N/A error. While REGEXTRACT(B3,$A$1) is technically fine, you need some logical refinement. To fix this, I added an IF/ISNA logical combination:

regexextract error checking formula to prevent outcome errors

=if(ISNA(REGEXEXTRACT(B3,$A$1)),B3,REGEXEXTRACT(B3,$A$1)) says that if you test this REGEXEXTRACT formula and it throws up an #N/A error, because that pattern is not in the cell, then just return the original value to the new cell. If you don’t get an error when you test the REGEX, then go ahead and make the extraction.

REGEXEXTRACT with logical error checking results

That worked!

There are plenty of ways to work regular expressions whenever you’re dealing with marketing analytics. It just takes practice.