De-mystifying Regular Expressions

Post by oneelephantpickle » Wed Aug 22, 2007 8:45 am

The secret to neatly segmented reports in GA is filtering out the visits you don't need. Creating filters in GA requires a fair knowledge of Regular Expressions (RegEx); GA uses the POSIX flavor. I am making an attempt to explain them here, in their bare minimum, for the benefit of us all. It promises to be tantalising.

Google Analytics says this about the backslash:
\ “Escape any of the above”
What they mean is, you can use a backslash to turn any special character into a not-so-special character. Google makes this hard by using the word “escape,” when they merely mean: use a backslash to take the magic out of a special character and make it an everyday character.
Although the backslash can be used with any special character, I see it used most often with a dot. This is because a dot is both a special character and one that is used on the Internet all the time (example: http://www.myspace.com, where we see it twice). On the Internet (and so, in Google Analytics) we almost always mean dots as regular dots, so each one needs a backslash to keep it a mere dot. Here’s an example: mysite\.com and here’s another one (this time, an IP address): 64\.68\.82\.164
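If you want to see the difference an escaped dot makes, here is a small sketch in Python. GA is not running Python, so this assumes Python's re module treats the backslash and the dot the same way GA does for simple cases like these. The helper name and the test strings are made up for illustration.

```python
import re

unescaped = re.compile(r"mysite.com")   # the dot is still a wildcard
escaped = re.compile(r"mysite\.com")    # the backslash makes it a literal dot

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None

# Hypothetical strings, not taken from any real report.
print(matches(unescaped, "mysite.com"))   # True
print(matches(unescaped, "mysiteXcom"))   # True, the wildcard accepts "X"
print(matches(escaped, "mysite.com"))     # True
print(matches(escaped, "mysiteXcom"))     # False, only a real dot will do
```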

Google Analytics says this about dots:
. “Matches any one character”
(Any one character that comes from where? I asked myself…)
That is exactly what they mean, but it is stated with no context. They mean that you can create a RegEx like this:
.ate
And it will match hate, fate, sate, or any other single character followed by ate. For that matter, it will match 8ate. It won’t match just ate, because it wants one character to stand in for the dot.
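The .ate example can be checked with a few lines of Python. This is only a sketch, assuming Python's re module handles the dot the way GA does; the word "plate" is an extra case added here to show that the match can start in the middle of a longer word.

```python
import re

# ".ate": one wildcard character, then the literal letters "ate".
dot_ate = re.compile(r".ate")

# The words from the post, plus the made-up extra "plate".
samples = ["hate", "fate", "sate", "8ate", "ate", "plate"]

# search() looks for the pattern anywhere in the word.
results = {word: dot_ate.search(word) is not None for word in samples}

for word, matched in results.items():
    print(word, matched)
# Every word prints True except "ate", which has no character before it.
```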


^ “Match to the beginning of the field”
This is exactly what Google Analytics’ Help says about the caret anchor.
Caret anchors are useful in other places besides just URLs. Let’s say you want to create a filter for the entire range of IP addresses in your company. Your IP addresses all start with the two-digit number 64, like 64.xx.xx.xxx, so you wouldn’t want to filter out something that looks like this: 164.xx.xx.xx. To solve that problem, you can use a caret: ^64 etc.
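Here is a rough Python sketch of what the caret buys you in the IP example. The two addresses are hypothetical, and it assumes Python's re module anchors the same way GA does.

```python
import re

# Hypothetical addresses, for illustration only.
addresses = ["64.12.34.56", "164.12.34.56"]

anchored = re.compile(r"^64")    # the field must start with 64
unanchored = re.compile(r"64")   # 64 may appear anywhere in the field

anchored_hits = [ip for ip in addresses if anchored.search(ip)]
unanchored_hits = [ip for ip in addresses if unanchored.search(ip)]

print(anchored_hits)    # ['64.12.34.56']
print(unanchored_hits)  # ['64.12.34.56', '164.12.34.56']
```

Without the caret, the 164 address sneaks in because "64" appears inside it.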

$ “Match to the end of the field”

What they really mean is, don’t match if the string from my website has any characters beyond where I have placed the dollar sign in my Regular Expression. The dollar sign marks the last character that I want to match.
So let’s say that you have some pages that end in htm and others in html. You want to write a Google Analytics Step 1 (part of a goal) for your email sign-up form, but you only want the new .htm version. Your RegEx might look like this:

/email-form\.htm$

The dollar sign tells Google Analytics that if the page on your site has anything after the final “m” in “htm,” it doesn’t count as a match to this expression.
You might have an IP address that needs to be screened out: 12\.34\.56\.78, which matches 12.34.56.78. But you want to be sure that it doesn’t match 12.34.56.789, so you set up your expression to be 12\.34\.56\.78$ . And if you want to be sure that it doesn’t match 512.34.56.78 as well, you should also use the beginning anchor: ^12\.34\.56\.78$
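Both dollar-sign examples can be tried out in Python. Again, this is a sketch that assumes Python's re module behaves like GA for these patterns; the helper name found is made up.

```python
import re

form_page = re.compile(r"/email-form\.htm$")   # nothing allowed after "htm"
exact_ip = re.compile(r"^12\.34\.56\.78$")     # anchored at both ends

def found(pattern, text):
    """Return True if the pattern is found in the text."""
    return pattern.search(text) is not None

print(found(form_page, "/email-form.htm"))    # True
print(found(form_page, "/email-form.html"))   # False, the "l" comes after "htm"
print(found(exact_ip, "12.34.56.78"))         # True
print(found(exact_ip, "12.34.56.789"))        # False, blocked by the $
print(found(exact_ip, "512.34.56.78"))        # False, blocked by the ^
```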

? “Match zero or one of the previous expression”
This time, Google does a pretty good job of meaning what they say.
When they say “the previous expression,” they mean the character that comes right before the question mark. Let me explain it a little more.
Let’s say that you have a website and you only want to look at the referrers that have the word “labor” in their title. But some of those referrers come from non-US countries where they spell it “labour.” You could create a filter like this: labou?r
That way, it will match “labour” (which has one “u,” the previous expression) and “labor” (which has zero of the previous expression, i.e. no “u”).
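A quick way to convince yourself about the question mark is a short Python sketch like this one. It assumes Python's re module treats ? the same way GA does; "labouur" is a made-up misspelling added to show that two u's are one too many.

```python
import re

# "u?" means zero or one "u" between "labo" and "r".
labor = re.compile(r"labou?r")

titles = ["labor", "labour", "labouur"]  # the last one is a made-up misspelling
matched = [t for t in titles if labor.search(t)]

print(matched)  # ['labor', 'labour']
```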

() Parentheses
This was a hard one to screw up, although they have done a good job of screwing up other easy Regular Expressions. Let me explain it with the help of a mathematical expression.

6*(2+3)

is equivalent to 6*2 plus 6*3. In the same way, parentheses in Regular Expressions make sure that the stuff outside of the parentheses gets applied equally to everything inside them.

For example (just remember that the pipe symbol | stands for OR), we can have a regular expression like this:
grand(mother|father)

That will match either grandmother or grandfather.
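To see why the parentheses matter, here is a Python sketch that compares the expression with and without them. It assumes Python's re module groups and alternates the same way GA does; the ungrouped pattern and the extra word "father" are additions for illustration only.

```python
import re

grouped = re.compile(r"grand(mother|father)")   # "grand" applies to both options
ungrouped = re.compile(r"grandmother|father")   # the pipe splits the whole expression

words = ["grandmother", "grandfather", "father"]

grouped_hits = [w for w in words if grouped.search(w)]
ungrouped_hits = [w for w in words if ungrouped.search(w)]

print(grouped_hits)    # ['grandmother', 'grandfather']
print(ungrouped_hits)  # ['grandmother', 'grandfather', 'father']
```

Without the parentheses, plain "father" matches too, because the expression now reads "grandmother OR father".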


[] “Match one item in this list” (that’s how GA defines it)
This is exactly what they mean; it just sounds hard because they don’t tell you how to create the list or how to define an item. Simple explanation: when you use square brackets, each character within the brackets is an item. Look at this sample list with five items in it, each of which happens to be a vowel: [aeiou]. The hard part is understanding that you don’t need anything to separate the characters, and that each item in the list is only one character.
Here’s how someone might use square brackets with Google Analytics. Let’s say you were selling items with part numbers formatted like this: PART1, PART2, etc. You want to know how often someone lands on your site by typing the actual part number into a search engine, but you only care about PART3, PART5 and PART7. So, you could enter PART[357] into the filter box on the top of your Overall Keyword


- “Create a range in a list”
That is how Google Analytics defines dashes. It means that instead of creating a list like this [abcdefghijkl], you can create it like this: [a-l], and it means the same thing: only one letter out of the list gets matched. You can also combine the range method and the brute force, type-them-all-in method and create a list like this: [a-lqtz], which matches any one letter between a and l, or q, or t, or z.
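Square brackets and dashes are easy to test together. The Python sketch below assumes Python's re module reads lists and ranges the same way GA does. The keyword list is made up, and the ^ and $ anchors on the second pattern are added here so that each test string has to be exactly one character.

```python
import re

# PART followed by exactly one of 3, 5 or 7.
part_filter = re.compile(r"PART[357]")
keywords = ["PART1", "PART3", "PART5", "PART7", "PART9"]  # hypothetical search terms
wanted = [k for k in keywords if part_filter.search(k)]
print(wanted)  # ['PART3', 'PART5', 'PART7']

# One letter from a through l, or q, or t, or z.
one_letter = re.compile(r"^[a-lqtz]$")
in_list = {ch: one_letter.search(ch) is not None for ch in "almqtz"}
print(in_list)
# Of the letters tested, only "m" falls outside the list.
```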

+ “Match one or more of the previous items”
The simplest meaning of “previous items” is “previous character.” So, I could look for my name in my Google Analytics search terms by typing this into the “quick filter” box: Nit+in. That will return Nitin or Nitttin, and for that matter, nitttttttin.
Alternatively, you can build a list of previous items by using square brackets, like this: [abc]+ This will return a, ab, cab, c, b, bbbb and the like. This seems a little strange, but if you read the definition (“match one or more of the previous items”), you’ll notice it says nothing about the previous items being in a specific order.
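Both uses of the plus sign can be sketched in Python. This assumes Python's re module treats + the way GA does. Note that Python is case sensitive unless told otherwise, so only capitalized names are tested. The ^ and $ anchors on the second pattern are an addition, so that the whole string has to be built from a, b and c.

```python
import re

# "t+" means one or more "t" between "Ni" and "in".
name = re.compile(r"Nit+in")
terms = ["Nitin", "Nitttin", "Niin"]  # "Niin" is a made-up case with zero t's
name_hits = [t for t in terms if name.search(t)]
print(name_hits)  # ['Nitin', 'Nitttin']

# One or more characters, each taken from the list a, b, c, in any order.
letters = re.compile(r"^[abc]+$")
candidates = ["a", "ab", "cab", "bbbb", "", "abd"]
letter_hits = [c for c in candidates if letters.search(c)]
print(letter_hits)  # ['a', 'ab', 'cab', 'bbbb']
```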

* “Match zero or more of the previous items”
When it comes to stars (or call them asterisks if you like), that’s what Google Analytics says.
Perfectly reasonable, if you know how to create a list of previous items.
If the only special character you are using is the star *, then the previous item is defined as the previous character. For example, let’s say that my company has five-digit part numbers, and I want to know how many people are searching for part number 34. The problem is all those leading zeros: technically, the part number is PN00034. So I could use the little Google Analytics filter box in my search report with a RegEx like this: PN0*34. That will bring me back all the searches for PN034 and PN0034 and PN00034 and PN00000034 and, for that matter, PN34, since using the star means that the previous item doesn’t need to be in the search at all (zero or more of the previous items, it says).
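The part number example works the same way in Python, assuming its re module treats the star like GA does. "PN0134" is a made-up search term added to show something that does not match.

```python
import re

# "0*" means zero or more zeros between "PN" and "34".
part_number = re.compile(r"PN0*34")

searches = ["PN034", "PN0034", "PN00034", "PN00000034", "PN34", "PN0134"]
hits = [s for s in searches if part_number.search(s)]

print(hits)  # ['PN034', 'PN0034', 'PN00034', 'PN00000034', 'PN34']
```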

Post by DJ » Mon Aug 27, 2007 5:23 pm

This is fantastic information!

Why don't you post a blog about this one? This is stuff few people know, but obviously you do!
The Deej