Visitor Identification Methods

Overview

Urchin has five different methods for identifying visitors and sessions, depending on available information. Of these, the patent-pending Urchin Traffic Monitor (UTM) is a highly accurate system that was specifically designed to identify unique visitors, sessions, exact paths, and return frequency behavior. There are a number of visitor loyalty and client reports that are only available when using the UTM System. The UTM System is easy to install and is highly recommended for all businesses.

In addition to the UTM System, Urchin can use IP addresses, User-Agents, Usernames, and Session-IDs to identify sessions. The following table compares the abilities of each of the five identification techniques:

AbilityIP OnlyIP+User-AgentUsername Session ID UTM 
Identifies non-proxied sessions XXX XX
Identifies some proxied sessions XX XX
Uniquely identifies each session XX
Defeats session IP proxying XX
Defeats most provider caching XX
Defeats browser caching X
Uniquely identifies visitors X X
Captures exact path sequence X
Captures visitor loyalty metrics X
Captures browser capabilities X

Data Model

The underlying model within Urchin for handling unique visitors is based on a hierarchical notion of a unique set of visitors interacting with the website through one or more sessions. Each session can contain one or more hits and pageviews. Pageviews are kept in order so that a path through the website for each session is understood. As shown in the diagram, the Visitor represents an individual's interaction with the website over time. Each unique visitor will have one or more sessions, and within each session is zero or more pageviews that comprise the path the visitor took for that session.

Proxying and Caching

In attempting to identify and track unique visitors and sessions, we are basically going against the nature of the web, which is anonymous interaction. Particularly troublesome to tracking visitors are the increasingly common proxying and caching techniques used by service providers and the browsers, themselves. Proxying hides the actual IP address of the visitor and can use one IP address to represent more than one web user. A user's IP address can change between sessions and in some cases multiple IP addresses will be used to represent a cluster of users. Thus, it is possible that one visitor will have different IP addresses for each hit and/or different IP addresses between sessions.

Caching of pages can occur at several locations. Large providers look to decrease the load on their network by caching or remembering commonly viewed pages and images. For example, if thousands of users from a particular provider are viewing the CNN website, the provider may benefit from caching the static pages and images of the website and delivering those pieces to the users from within the provider's network. This has the effect of pages being delivered without the knowledge of the actual website.

Browser caching adds to the question. Most browsers are configured to only check content once per session. If a visitor lands on the home page of a particular website, clicks to a subpage, and then uses the back-button to go back to the home page, the second request of the home page is most likely never sent to the website server, but pulled from the browser's memory. An analysis of paths may result in an incomplete path missing the cached pages.

In the above diagram, the actual path taken through the website by the client is shown at the top, while the apparent path from the server's point of view is shown at the bottom. In this case, before proceeding to Page-3 the user goes back to the Page-1. The server never sees this request and from its point of view it appears the user went directly from Page-2 to Page-3. There may not even be a link from Page-2 to Page-3.

Visitor Identification Methods

As mentioned previously, Urchin has five different methods for identifying visitors, sessions and paths. The more sophisticated methods which can address the above issues may require special configuration of your website. The following descriptions describe the workings of each method in more detail.

1. IP-Only: The IP-Only method is provided for backward compatibility with Urchin 3, and for basic IT reporting where uniquely identifying sessions is not needed. This method uses only the IP Address to identify visitor sessions. Thirty minutes of inactivity will constitute a new session. The only data requirements for using this method is a timestamp and IP Address of the visitor.

2. IP-Agent: The default method, which requires no additional configuration, uses the IP address and user-agent (browser) information to determine unique sessions. A configurable thirty-minute timeout is used to identify the beginning of a new session for a visitor. While this method is still susceptible to proxying and caching, the addition of the user-agent information can help detect multiple users from one IP address. In addition, this method includes a special AOL filter, which attempts to reduce the impact of their round-robin proxying techniques. This method does not require any additional configuration.

3. Usernames: This method is provided for secure sites that require logins such as Intranets and Extranets. Websites that are only partially protected should not use this method. The Username identification is taken directly from the username field in the log file. This information is generally logged if the website is configured to require authentication. This method uses a thirty-minute period of inactivity to separate sessions from the same username.

4. Session ID: The fourth visitor identification method available in Urchin is the Session ID method, which can use pre- existing unique session identifiers to uniquely identify each session. Many content delivery applications and web servers will provide session ids to manage user interaction with the webserver. These session ids are typically located in the URI query or stored in a Cookie. As long as this information is available in the log data, Urchin can be configured to take advantage of these identifiers. Using session ids provides a much more accurate measurement of unique sessions, but still does not identify returning unique visitors. This method is also susceptible to some forms of caching including the above example.

In many cases, the ability to use session ids may already be available, and thus, the time required to configure this feature may be short. For dynamically generated sites, taking advantage of this feature should be straightforward. The result is more accurate visitor session and path analysis.

5. Urchin Traffic Monitor (UTM): The last method for visitor identification available in Urchin is the Urchin Tracking Module. This system was specifically designed to negate the effects of caching and proxying and allow the server to see every unique click from every visitor without significantly increasing the load on the server. The UTM system tracks return visitor behavior, loyalty and frequency of use. The client-side data collection also provides information on browser capabilities.

The UTM is installed by including a small amount of JavaScript code in each of your webpages. This can be done manually or automatically via server side includes and other template systems. Complete details on installing UTM are covered in the articles later in this section.

Once installed, the Urchin Traffic Monitor is triggered each time someone views a page from the website. The UTM Sensor uniquely identifies each visitor and sends one extra hit for each pageview. This additional hit is very lightweight and most systems will not see any additional load. The Urchin engine identifies these extra hits in the normal log file and uses this additional data to create an exact picture of every step taken by the users. This method also identifies visitors and sessions uniquely so that return visitation behavior can be properly analyzed. While this method takes a little extra time to configure, it highly recommended for comprehensive detailed analytics.