Regular Expression Overview

Introduction

POSIX regular expressions match or capture portions of a field using wildcards and metacharacters, and are often used for text manipulation tasks. Most of the filters included in Urchin use these expressions to match the data and perform an action when a match is found. For instance, an exclude filter excludes the hit if the regular expression in the filter matches the data contained in the field specified by the filter.
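The exclude behavior described above can be sketched in a few lines. This is only an illustration, not Urchin's implementation: the function name apply_exclude_filter, the field name request_uri, and the sample hits are hypothetical, and Python's re module stands in for a POSIX regular expression engine.

```python
import re

def apply_exclude_filter(hits, field, pattern):
    """Return the hits whose `field` value does NOT match `pattern`.

    `hits` is a list of dicts; the structure and field names are
    assumptions made for this sketch.
    """
    regex = re.compile(pattern)
    # re.search looks for a match anywhere in the field value,
    # so an unanchored pattern can match in the middle of the field.
    return [hit for hit in hits if not regex.search(hit.get(field, ""))]

hits = [
    {"request_uri": "/index.html"},
    {"request_uri": "/images/logo.gif"},
    {"request_uri": "/about.html"},
]

# Exclude every hit whose request_uri ends in .gif
kept = apply_exclude_filter(hits, "request_uri", r"\.gif$")
print(kept)
```

Only the two .html hits remain, because the pattern matched the .gif hit and the filter dropped it.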

Regular expressions are text strings that contain characters, numbers, and wildcards. The table below lists common wildcards. To match one of these wildcard characters literally, escape it with a backslash '\'.

Wildcard   Meaning
.          match any single character
*          match zero or more of the previous item
+          match one or more of the previous item
?          match zero or one of the previous item
()         remember contents of parentheses as item
[]         match one item in this list
-          create a range in a list
|          or
^          match to the beginning of the field
$          match to the end of the field
\          escape any of the above
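The wildcards in the table can be tried out with a short script. The sketch below uses Python's re module as a stand-in, on the assumption that it treats these particular metacharacters the same way a POSIX extended regular expression engine does for simple cases like these. The patterns and sample strings are made up for illustration.

```python
import re

# (pattern, field value, whether a match is expected)
examples = [
    (r"a.c", "abc", True),                   # . matches any single character
    (r"a.c", "ac", False),                   # . requires exactly one character
    (r"ab*c", "ac", True),                   # * allows zero occurrences of b
    (r"ab+c", "ac", False),                  # + requires at least one b
    (r"ab?c", "abbc", False),                # ? allows at most one b
    (r"[0-9]", "page7", True),               # [] with - defines a range
    (r"gif|jpg", "photo.jpg", True),         # | means "or"
    (r"^/foo", "/bar/foo", False),           # ^ anchors to the start of the field
    (r"html$", "index.html", True),          # $ anchors to the end of the field
    (r"index\.html", "indexXhtml", False),   # \. matches a literal dot only
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in examples]
expected = [want for _, _, want in examples]
print(results == expected)

# () remembers what it matched so it can be retrieved afterwards
captured = re.search(r"/(foo|bar)/", "/docs/bar/page.html").group(1)
print(captured)
```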

Tips for Regular Expressions

  1. Make the regular expression as simple as possible. Complex expressions take longer to process or match than simple expressions.
  2. Avoid .* where possible, since it matches any sequence of characters and may slow down processing of the expression. For instance, if you need to match index.html, use index\.html, not .*index\.html.*
  3. Try to group patterns together when possible. For instance, if you wish to match a file suffix of .gif, .jpg, or .png, use "\.(gif|jpg|png)" not "\.gif|\.jpg|\.png".
  4. Be sure to escape the regular expression wildcards or metacharacters if you wish to match those literal characters.
  5. Use anchors whenever possible. The anchor characters are ^ and $, which match the beginning and the end of the field, respectively. Using them when possible will speed up processing. For instance, to match the foo directory in /foo/bar, use ^/foo/ instead of /foo/. The ^ forces the expression to match only at the beginning of the field, which improves processing speed.
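Several of the tips above can be checked with a short script. This is a sketch using Python's re module rather than Urchin's own engine, and the sample paths and file names are invented. It shows that the grouped and ungrouped suffix patterns select the same values, that wrapping a pattern in .* does not change what an unanchored search matches, and that adding ^ narrows the match to values that begin with /foo/.

```python
import re

files = ["a.gif", "b.jpg", "c.png", "d.html"]

# Tip 3: the grouped and ungrouped forms select the same values.
grouped = [f for f in files if re.search(r"\.(gif|jpg|png)", f)]
ungrouped = [f for f in files if re.search(r"\.gif|\.jpg|\.png", f)]
print(grouped == ungrouped)

paths = ["/foo/bar", "/bar/foo/baz", "/images/a.gif", "/index.html"]

# Tip 2: surrounding .* adds nothing to an unanchored search.
plain = [p for p in paths if re.search(r"index\.html", p)]
padded = [p for p in paths if re.search(r".*index\.html.*", p)]
print(plain == padded)

# Tip 5: without ^ the pattern also matches /foo/ in the middle of a path.
unanchored = [p for p in paths if re.search(r"/foo/", p)]
anchored = [p for p in paths if re.search(r"^/foo/", p)]
print(unanchored)
print(anchored)
```

Note that ^/foo/ and /foo/ are not interchangeable: the anchored form is both faster to reject non-matching values and stricter about where the match may occur.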