Constructing a filter in Urchin requires careful attention to detail, both for performance and to ensure that the filter behaves predictably. Urchin uses POSIX regular expressions for specifying filter patterns, which are quite robust but can also be rather complex. An incorrectly specified regular expression filter pattern may not only generate the wrong results but may also significantly increase Urchin log processing times, perhaps by as much as an order of magnitude.
Tips for Filtering
The general rule of thumb with filters is "the simpler, the better". Each filter you add increases processing time. Complex filtering patterns also increase processing time.
- Use as simple a pattern as possible that matches only what you are trying to filter. For instance, if you want to filter out "googlebot" in the Browsers, just put "googlebot" as the filter pattern, not ".*googlebot.*".
- Whenever possible, use a single filter with multiple patterns instead of multiple separate filters. As an example, if you want to filter out PDF files, Flash files, and MPEGs from your Top Pages reports, use a filter pattern of "\.pdf|\.swf|\.mpeg" instead of creating a separate filter for each.
- Remember to escape regular expression metacharacters (see below) by preceding them with a backslash if you are using those characters as a literal part of your pattern. For example, the dot "." character is a regular expression metacharacter. Therefore, if you want to filter out an IP address, you need to specify the filter pattern as "127\.0\.0\.1" instead of "127.0.0.1".
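The three tips above can be tried out with a short sketch. It uses Python's `re` module as a stand-in for Urchin's POSIX matcher, on the assumption that Urchin searches for the pattern anywhere in the field (which the "googlebot" tip implies); for these simple patterns the two behave the same. The `matches` helper and all sample values are hypothetical and are not taken from a real log.

```python
import re

def matches(pattern, value):
    """Return True if the pattern is found anywhere in the value (unanchored search)."""
    return re.search(pattern, value) is not None

# Tip 1: the bare pattern finds the same value as the padded ".*...*" form,
# without asking the matcher to scan with two extra wildcards.
agent = "googlebot/2.1 (+http://www.google.com/bot.html)"  # hypothetical value
simple_hit = matches("googlebot", agent)
padded_hit = matches(".*googlebot.*", agent)

# Tip 2: one filter with alternation replaces three separate filters.
media_pattern = r"\.pdf|\.swf|\.mpeg"
pages = ["/docs/guide.pdf", "/intro.swf", "/clips/demo.mpeg", "/index.html"]
kept_pages = [p for p in pages if not matches(media_pattern, p)]

# Tip 3: an unescaped dot matches any character, so "127.0.0.1" also
# matches text that contains no dots at all.
hostname = "host-127-0-0-1.example.net"  # hypothetical value
unescaped_hit = matches("127.0.0.1", hostname)
escaped_hit = matches(r"127\.0\.0\.1", hostname)

print(simple_hit, padded_hit, kept_pages, unescaped_hit, escaped_hit)
```

In this sketch the simple and padded patterns agree, only `/index.html` survives the media filter, and only the unescaped IP pattern matches the hostname.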
Before you begin, you may want to have a copy of your web log file accessible so you can determine exactly how to construct the filter. Why is this important? During report processing, when Urchin invokes the filter you've added, it will attempt to make an exact match between your filter pattern and content found in the log file. So, knowing exactly how the content is displayed in the log file is essential when building the filter.
If you do not have direct access to the log file, you may be able to gather similar information by reviewing content in the report without any filter added. For example, the Pages and Referrals reports often contain enough detail to build filters from. However, some information, such as query strings resulting from dynamically generated web pages, does not appear in the reports and is available only from the log file.
NOTE: Once the filter is applied, it will only affect the reports processed "after" it is added. Any existing data will not be affected by the filter. Applying a filter to existing reports requires clearing that data and reprocessing the log files.
There are 3 primary components to an Urchin filter.
- Filter Type: This determines whether Urchin is to include or exclude specific data from the report, and in some cases (Dynamic URL) how the information will be displayed in the report. If you select "Include Pattern" Urchin will ONLY include log file entries that contain the pattern or patterns you tell it to include in the report.
- Filter Field: Determines which field in the log file Urchin should apply the filter to. If you are unsure which field to select, review the section below titled "Filter Field Definitions by Log Format Type."
- Filter Pattern: Contains the string of characters Urchin will try to match in the log file during processing. Filter patterns applied to the URI-Stem (W3C) or Request (NCSA) field should be written exactly as they appear in the log file, although in some cases you may truncate the pattern. All other patterns are subject to POSIX regular expression rules, which means that certain characters are interpreted differently than they appear. For example, a dot "." means "match any single character." A full list of these metacharacters can be found at the bottom of this article.
Filter Field Definitions by Log Format Type:
NCSA Extended Combined Format (Apache)
|Urchin Report View||Urchin Filter Field|
Windows W3C Format
|Urchin Report View||Urchin Filter Field|
Filtering OUT personal IP address from Windows W3C log files
Filter Type: Exclude Pattern Filter Field: IP Address Filter Pattern: 63\.212\.171\.5
Filtering OUT robots from NCSA or W3C log files
Filter Type: Exclude Pattern Filter Field: User-Agent Filter Pattern: bot|Bot|BOT|Robot|robot
Filtering IN content from support directory from NCSA logs
Filter Type: Include Pattern Filter Field: Request (NCSA) Filter Pattern: /support/
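To see how the three components (type, field, pattern) interact, the example filters above can be simulated on a few parsed log entries. This is only a minimal sketch, not Urchin's implementation: the `apply_filter` function, the dictionary keys, and the sample entries are all hypothetical, and Python's `re.search` stands in for Urchin's POSIX matching.

```python
import re

def apply_filter(entries, filter_type, field, pattern):
    """Keep (include) or drop (exclude) entries whose field contains the pattern."""
    regex = re.compile(pattern)
    if filter_type == "include":
        return [e for e in entries if regex.search(e[field])]
    if filter_type == "exclude":
        return [e for e in entries if not regex.search(e[field])]
    raise ValueError("filter_type must be 'include' or 'exclude'")

# Hypothetical, already-parsed log entries.
entries = [
    {"ip": "63.212.171.5", "user_agent": "Mozilla/4.0", "request": "/support/faq.html"},
    {"ip": "10.0.0.7", "user_agent": "ExampleBot/1.0", "request": "/index.html"},
    {"ip": "10.0.0.8", "user_agent": "Mozilla/5.0", "request": "/support/contact.html"},
    {"ip": "10.0.0.9", "user_agent": "Mozilla/5.0", "request": "/products/list.html"},
]

# The three example filters from this section.
without_personal_ip = apply_filter(entries, "exclude", "ip", r"63\.212\.171\.5")
without_robots = apply_filter(entries, "exclude", "user_agent", "bot|Bot|BOT|Robot|robot")
support_only = apply_filter(entries, "include", "request", "/support/")

print(len(without_personal_ip), len(without_robots), len(support_only))
```

Note that the include filter discards every entry that does not match, which is why only the two `/support/` requests remain in `support_only`.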
POSIX Regular Expression list:
|.||Match any single character|
|*||Match zero or more of the previous item|
|+||Match one or more of the previous item|
|?||Match zero or one of the previous item|
|( )||Remember contents of parentheses as item|
|[ ]||Match one item in this list|
|-||Create a range in a list|
|^||Match to the beginning of the line|
|$||Match to the end of the line|
|\||Escapes any of the above (when preceding wild card)|
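Each metacharacter in the list above can be checked against a small sample string. The sketch below uses Python's `re` module, which agrees with POSIX extended regular expressions for these basic constructs; it assumes an unanchored search, and the `found` helper and all sample strings are hypothetical. The third value in each tuple is the result the pattern is expected to give.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, sample text, expected result)
examples = [
    ("b.t", "bat", True),                            # .  any single character
    ("bo*t", "bt", True),                            # *  zero or more of "o"
    ("bo+t", "bt", False),                           # +  needs at least one "o"
    ("bo+t", "boot", True),
    ("robots?", "robot", True),                      # ?  the "s" is optional
    ("(ro)+bot", "rorobot", True),                   # () treat "ro" as one item
    ("[bB]ot", "Bot", True),                         # [] one item from the list
    ("[0-9]", "page7", True),                        # -  range inside a list
    ("^/support", "/docs/support", False),           # ^  must start the line
    (r"\.html$", "/index.html?x=1", False),          # $  must end the line
    (r"63\.212\.171\.5", "63.212.171.50", True),     # unanchored: also hits .50
    (r"63\.212\.171\.5$", "63.212.171.50", False),   # anchored: exact ending only
    (r"1\.5", "105", False),                         # \  dot is literal here
]

results = [found(pattern, text) for pattern, text, _ in examples]
for (pattern, text, expected), actual in zip(examples, results):
    print(f"{pattern!r:24} {text!r:20} expected={expected} actual={actual}")
```

The two IP address rows show why an end-of-line anchor can matter: without `$`, a pattern written for one address also matches any longer address that begins with the same digits.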