Visitor Identification Methods

Overview

Urchin 4 offers four methods for identifying visitor traffic: IP-Only, IP-Agent, Session ID, and the patent-pending Urchin Traffic Monitor (UTM). The UTM System was specifically designed to identify unique visitors, sessions, exact paths, and return frequency. For businesses that want a clear picture of their visitor traffic and behavior, the UTM System is a valuable technology that combines the best of client-side and server-side information.

Visitor Identification Method    Accuracy    Cost    Value
1. IP-Only                       D           A       C
2. IP-Agent                      C           A       B
3. Session ID                    B           B       B
4. Urchin Traffic Monitor        A           B       A

As shown in the above table, the Urchin Traffic Monitor delivers the highest accuracy and value at a reasonable cost of ownership. In contrast, the first method, which is based only on the IP address of the user, is the easiest to set up but has serious accuracy shortcomings. IP-Agent uses both the IP address of the user and the user-agent (browser) information, which is slightly better. The third method can be low cost if your site already creates session ids and writes them to the server logs. It is a significant improvement over the first two, but still has shortcomings in dealing with return visitation and caching. Only the UTM System addresses all of these issues.

Data Model

The underlying model within Urchin for handling unique visitors is based on a hierarchical notion of a unique set of visitors interacting with the website through one or more sessions. Each session can contain one or more hits and pageviews. Pageviews are kept in order so that a path through the website for each session is understood. It may not be possible to uniquely identify all visitors and sessions. For this reason, an identification type is associated with each so that analysis can look at the percentage of visitors that are exactly identified and the percentage of visitors that may not be unique. This will become clearer as we look at the different identification methods.

The above diagram illustrates the data model for each unique visitor. A type is associated with each unique visitor and each session. The unique visitor represents an individual's interaction with the website over time. Each unique visitor has one or more sessions, and each session contains zero or more pageviews that make up the path the visitor took during that session.
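The hierarchy described above can be sketched as a small set of record types. This is an illustration only, not Urchin's internal representation: the names IdType, Session, Visitor, and percent_exact are hypothetical, and the two identification types shown are an assumed simplification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class IdType(Enum):
    """Hypothetical identification types; the real set is not given in the text."""
    EXACT = "exact"              # identified unambiguously
    APPROXIMATE = "approximate"  # may not be unique (for example, inferred from an IP)


@dataclass
class Session:
    id_type: IdType
    # Ordered pageviews form the path for this session; the list may be empty.
    pageviews: List[str] = field(default_factory=list)
    # All requests in the session, including images and other non-page files.
    hits: int = 0


@dataclass
class Visitor:
    id_type: IdType
    sessions: List[Session] = field(default_factory=list)


def percent_exact(visitors: List[Visitor]) -> float:
    """Percentage of visitors whose identification is exact."""
    if not visitors:
        return 0.0
    exact = sum(1 for v in visitors if v.id_type is IdType.EXACT)
    return 100.0 * exact / len(visitors)


visitors = [
    Visitor(IdType.EXACT,
            [Session(IdType.EXACT, ["/", "/products", "/contact"], hits=9)]),
    Visitor(IdType.APPROXIMATE,
            [Session(IdType.APPROXIMATE, ["/"], hits=4)]),
]
print(percent_exact(visitors))  # 50.0
```

Keeping the type on both the visitor and the session lets a report state what share of its figures rest on exact identification.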

Proxying and Caching

In attempting to identify and track unique visitors and sessions, we are going against the nature of the web, which is anonymous interaction. Particularly troublesome are the increasingly common proxying and caching techniques used by service providers and by the browsers themselves. Proxying hides the actual IP address of the visitor and can use one IP address to represent more than one web user. A user's IP address can change between sessions, and in some cases multiple IP addresses are used to represent a cluster of users. Thus, one visitor may have a different IP address for each hit, or different IP addresses between sessions.

Caching of pages can occur at several locations. Large providers look to decrease the load on their network by caching, or remembering, commonly viewed pages and images. For example, if thousands of users from a particular provider are viewing the CNN website, the provider may benefit from caching the static pages and images of the website and delivering them to the users from within the provider's network. As a result, pages are delivered without the knowledge of the actual website.

Browser caching compounds the problem. Most browsers are configured to check content only once per session. If a visitor lands on the home page of a website, clicks to a subpage, and then uses the back button to return to the home page, the second request for the home page is most likely never sent to the web server but is pulled from the browser's memory. A path analysis may therefore show an incomplete path that is missing the cached pages.

In the above diagram, the actual path taken through the website by the client is shown at the top, while the apparent path from the server's point of view is shown at the bottom. In this case, before proceeding to Page-3 the user goes back to Page-1. The server never sees this request, and from its point of view the user appears to have gone directly from Page-2 to Page-3. There may not even be a link from Page-2 to Page-3.
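The back-button example can be reproduced with a few lines of code. The sketch below assumes the simplest possible cache model, in which the browser fetches each page from the server once per session and serves every later view from memory; real browsers are more nuanced. The function name apparent_path is hypothetical.

```python
from typing import List


def apparent_path(actual_path: List[str]) -> List[str]:
    """Return the pages the server sees, assuming the browser serves any
    page it has already fetched in this session from its own cache."""
    cached = set()
    seen_by_server = []
    for page in actual_path:
        if page not in cached:
            # First view in this session: the request reaches the server.
            seen_by_server.append(page)
            cached.add(page)
        # Repeat view: served from the browser cache, invisible to the server.
    return seen_by_server


actual = ["Page-1", "Page-2", "Page-1", "Page-3"]
print(apparent_path(actual))  # ['Page-1', 'Page-2', 'Page-3']
```

The return to Page-1 is missing from the server's view, so the logs suggest a Page-2 to Page-3 transition that the visitor never made.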

Unique Visitor Identification Methods

As mentioned previously, Urchin 4 has four different methods for identifying visitors, sessions, and paths. The more sophisticated methods, which can address the above issues, may require special configuration of your website. The following descriptions will help you assess which method is right for you.

1. IP-Only: The IP-Only method is provided for backward compatibility with Urchin 3. It uses only the IP address to identify visitors. It is susceptible to all forms of caching and proxying, and should only be used to provide a baseline comparison between Urchin 4 and previous versions.

2. IP-Agent: The default method uses the IP address and user-agent (browser) information to determine unique visitors. A configurable timeout, 30 minutes by default, is used to identify the beginning of a new session for a visitor. While this method is still susceptible to proxying and caching, the addition of the user-agent information can help detect multiple users behind one IP address. In addition, this method includes a special AOL filter, which attempts to reduce the impact of AOL's round-robin proxying techniques. Because it requires no additional configuration, this method is the easiest to use. It is good for getting a general idea of traffic and user behavior, but is not reliable for exact measurements.

3. Session ID: The third visitor identification method available in Urchin is the Session ID method, which can use pre-existing session identifiers to uniquely identify each session. Many content delivery applications and web servers provide session ids to manage user interaction with the web server. These session ids are typically located in the URI query or stored in a cookie. As long as this information is available in the log data, Urchin can be configured to take advantage of these identifiers. Using session ids provides a much more accurate measurement of unique sessions, but still does not identify returning unique visitors. This method is also susceptible to some forms of caching, including the browser caching example above.

In many cases, the ability to use session ids may already be available, and thus, the time required to configure this feature may be short. For dynamically generated sites, taking advantage of this feature should be straightforward. The result is more accurate visitor session and path analysis.

4. Urchin Traffic Monitor (UTM): The last method for visitor identification available in Urchin is the patent-pending Urchin Traffic Monitor. This system was specifically designed to negate the effects of caching and proxying and to allow the server to see every unique click from every visitor without significantly increasing the load on the server. In addition, the UTM system tracks visitors between sessions, so you can analyze unique visitor behavior and repeat visit patterns and understand the frequency and nature of your online visitors.

Once installed, the Urchin Traffic Monitor is triggered each time someone views a page from the website. The UTM Sensor uniquely identifies each visitor and sends one extra hit for each pageview. This additional hit is very lightweight, and most systems will not see any noticeable additional load. The Urchin engine identifies these extra hits in the normal log file and uses this additional data to create an exact picture of every step taken by the users. This method also identifies visitors and sessions uniquely so that return visitation behavior can be properly analyzed. While this method takes a little extra time to configure, it is highly recommended for comprehensive, detailed analytics.
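To see why the first three methods produce different session counts from the same log data, consider the following sketch. It is not Urchin's implementation: the Hit record, the key functions, and count_sessions are hypothetical names, the AOL filter is omitted, the session_id field is assumed to have been parsed from the URI query or a cookie beforehand, and the 30-minute timeout is applied to every method purely for simplicity.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Hashable, List, NamedTuple


class Hit(NamedTuple):
    time: datetime
    ip: str
    agent: str
    session_id: str  # assumed to be present on every hit in this sketch
    page: str


def ip_only_key(hit: Hit) -> Hashable:
    return hit.ip


def ip_agent_key(hit: Hit) -> Hashable:
    return (hit.ip, hit.agent)


def session_id_key(hit: Hit) -> Hashable:
    return hit.session_id


def count_sessions(hits: List[Hit],
                   key: Callable[[Hit], Hashable],
                   timeout: timedelta = timedelta(minutes=30)) -> int:
    """Count sessions: a new session starts when a key is seen for the first
    time or after more than `timeout` of inactivity for that key."""
    last_seen: Dict[Hashable, datetime] = {}
    sessions = 0
    for hit in sorted(hits, key=lambda h: h.time):
        k = key(hit)
        previous = last_seen.get(k)
        if previous is None or hit.time - previous > timeout:
            sessions += 1
        last_seen[k] = hit.time
    return sessions


t0 = datetime(2003, 1, 1, 12, 0)
minute = timedelta(minutes=1)
# Four real users: A and B share one proxy IP but use different browsers;
# C and D share another proxy IP and use the same browser.
hits = [
    Hit(t0,              "10.0.0.1", "Mozilla/4.0", "sA", "/"),
    Hit(t0 + 1 * minute, "10.0.0.1", "Opera/6.0",   "sB", "/"),
    Hit(t0 + 2 * minute, "10.0.0.2", "Mozilla/4.0", "sC", "/"),
    Hit(t0 + 3 * minute, "10.0.0.2", "Mozilla/4.0", "sD", "/"),
    Hit(t0 + 4 * minute, "10.0.0.1", "Mozilla/4.0", "sA", "/products"),
]

print(count_sessions(hits, ip_only_key))     # 2
print(count_sessions(hits, ip_agent_key))    # 3
print(count_sessions(hits, session_id_key))  # 4
```

In this sample, IP-Only merges users that share a proxy address, IP-Agent separates them only when their browsers differ, and the session id recovers all four sessions. None of the three keys survives across visits, which is the gap the UTM method is described as closing.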

Feature                               IP-Methods    Session ID    UTM
Identifies non-proxied visitors       X             X             X
Uniquely identifies each session      -             X             X
Defeats session IP proxying           -             X             X
Defeats most provider caching         -             X             X
Defeats browser caching               -             -             X
Uniquely identifies visitors          -             -             X
Handles changing visitor IPs          -             -             X
Captures exact path sequence          -             -             X
Captures return visitor behavior      -             -             X