Reducing Disk Storage for Urchin Profile Monthly Databases

Overview

Urchin reporting data is stored in independent monthly databases for each Profile configured within Urchin. These databases typically reside in the data/reports directory of the Urchin distribution. By default, Urchin keeps an unlimited number of these monthly Profile databases. For most small and medium-sized sites, the storage requirements are modest. Because Urchin reporting does not require access to the raw webserver logs once they have been processed, there is no need to keep the webserver logs for reporting purposes. The processed Urchin monthly databases will be approximately 5-10% of the size of the raw webserver logs that were processed to populate them, and in most cases this represents a very small amount of disk space even if all Urchin databases are kept indefinitely.
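As a rough illustration of the 5-10% figure, the following Python sketch estimates how much disk space the monthly databases might occupy for a given log volume. The function name, the 30-day month, and the example of 200 MB of logs per day are assumptions made for illustration; they are not values taken from Urchin.

```python
def estimate_urchin_storage_mb(daily_log_mb, months,
                               ratio_low=0.05, ratio_high=0.10,
                               days_per_month=30):
    """Return a (low, high) estimate in MB for the Urchin monthly databases.

    Applies the approximate 5-10% ratio of database size to raw log size.
    days_per_month=30 is a simplifying assumption.
    """
    raw_log_mb = daily_log_mb * days_per_month * months
    return raw_log_mb * ratio_low, raw_log_mb * ratio_high


# Hypothetical site: 200 MB of webserver logs per day, 12 months retained.
low, high = estimate_urchin_storage_mb(daily_log_mb=200, months=12)
print(f"Estimated database storage: {low:.0f}-{high:.0f} MB")
```

For this hypothetical site, the estimate works out to roughly 3.6 to 7.2 GB for a full year of reporting data.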

For large sites, however, which produce hundreds or thousands of megabytes worth of webserver logs per day, or hosting providers who have a very large number of Profiles configured, it may be desirable to reduce Urchin's ongoing data storage requirement. This can be accomplished in one of the following ways:

  1. Set the profile to automatically delete the raw tracking data after processing the logs
  2. Set the profile to archive historic data
  3. Limit the number of months of historical reporting data that are retained

Instructions for each of these methods are provided at the end of this article.

Technical Overview of Urchin Database Storage

For each Urchin profile, Urchin maintains a set of database files stored monthly in directories named YYYYMM. Each of these directories contains ~50 files that provide data for the reporting engine. The directory and database files are named after the month for which they store data. The complete list of databases is:

YYYYMM-uhed --> header for the database

YYYYMM-usti --> string index

YYYYMM-ustd --> string data

YYYYMM-udai --> aggregate tables index

YYYYMM-udXX --> aggregate data tables (XX is replaced with the table number from the datamap)

YYYYMM-uvii --> visitor index

YYYYMM-uvid --> visitor data

YYYYMM-used --> session data

YYYYMM-upad --> path data

YYYYMM-utrd --> transaction data (Ecommerce)

YYYYMM-uitd --> item data (Ecommerce)

YYYYMM-ulti --> log tracking index

YYYYMM-ultd --> log tracking data

YYYYMM-utod --> totals data

YYYYMM-uhid --> histogram data

YYYYMM-umad --> visitor matrix data

Each set of databases is complete for the month of data that it contains. Since there is no interdependency between the monthly database sets, archiving and pruning operations can be performed independently on each database set without affecting any other month.
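The naming scheme above can be sketched as a small lookup in Python. This is an illustration only: it assumes file names have the form YYYYMM-uxxx (optionally followed by an extension) and that the XX in udXX is two characters long. The function name describe_database is hypothetical and not part of Urchin.

```python
import re

# Suffix descriptions copied from the list above.
SUFFIXES = {
    "uhed": "header for the database",
    "usti": "string index",
    "ustd": "string data",
    "udai": "aggregate tables index",
    "uvii": "visitor index",
    "uvid": "visitor data",
    "used": "session data",
    "upad": "path data",
    "utrd": "transaction data (Ecommerce)",
    "uitd": "item data (Ecommerce)",
    "ulti": "log tracking index",
    "ultd": "log tracking data",
    "utod": "totals data",
    "uhid": "histogram data",
    "umad": "visitor matrix data",
}

# Assumed name format: six-digit month, a dash, a four-character suffix,
# and optionally an extension.
NAME_RE = re.compile(r"^(\d{6})-(u[a-z0-9]{3})(?:\..*)?$")


def describe_database(filename):
    """Return (month, suffix, description), or None if the name does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    month, suffix = match.groups()
    if suffix in SUFFIXES:
        description = SUFFIXES[suffix]
    elif suffix.startswith("ud"):
        # udXX: aggregate data table, XX is the table number from the datamap.
        description = "aggregate data table " + suffix[2:]
    else:
        description = "unknown"
    return month, suffix, description
```

For example, describe_database("200402-uhed") returns the month "200402", the suffix "uhed", and the description "header for the database".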

Under normal operation, the entire set of monthly database files is retained for each month. However, four of these database files are used only by the Urchin log processing engine. These database files are:

YYYYMM-usti

YYYYMM-udai

YYYYMM-ulti

YYYYMM-ultd

The following database files are used by the Urchin log processing engine and for cross segmentation and visitor drilldown in the reporting. Removing the contents will only affect those reporting features.

YYYYMM-uvii

YYYYMM-uvid

YYYYMM-used

YYYYMM-upad

YYYYMM-utrd

YYYYMM-uitd

These databases contain information about visitors, sessions, paths, transactions and products. They can account for a substantial percentage of the total storage space required for the month, on the order of 10-50%. Setting the Keep Raw Tracking Data option to off in the Storage/DB screen of the Profile configuration can therefore save a significant amount of disk space.
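To check where a particular profile falls in that 10-50% range, a short Python sketch such as the following could measure the share of one month's storage taken by the six files listed above. It assumes the files are named YYYYMM-uxxx and sit together in one directory; the function name raw_tracking_share is hypothetical and not part of Urchin.

```python
from pathlib import Path

# The six visitor, session, path, transaction and item databases listed above.
RAW_TRACKING_SUFFIXES = ("uvii", "uvid", "used", "upad", "utrd", "uitd")


def raw_tracking_share(month_dir):
    """Return the fraction (0.0-1.0) of bytes in month_dir held by raw tracking files."""
    total_bytes = 0
    raw_bytes = 0
    for entry in Path(month_dir).iterdir():
        if not entry.is_file():
            continue
        size = entry.stat().st_size
        total_bytes += size
        # Assumed name format YYYYMM-uxxx: take the four characters after the dash.
        suffix = entry.name.partition("-")[2][:4]
        if suffix in RAW_TRACKING_SUFFIXES:
            raw_bytes += size
    return raw_bytes / total_bytes if total_bytes else 0.0
```

The function only reads file sizes and does not modify anything, so it can be run against a live profile directory before deciding whether to change the Keep Raw Tracking Data setting.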

Disabling the Keep Raw Tracking Data option is recommended only for extremely high-traffic sites where keeping the raw tracking data causes a disk or CPU resource problem.

Further disk space savings can be obtained by compressing historic Urchin monthly databases into ZIP archives. The resulting archives are typically only 20-30% of the size of the uncompressed database set. While the Urchin reporting engine cannot read the ZIP archives directly, it can extract the databases it needs from the ZIP archives on the fly. This is completely transparent to a person viewing Urchin reports, other than a slight delay while the databases are being unpacked. The reporting engine does not remove the databases it has unpacked; this allows quicker access to the data while the person is viewing the Urchin reports. Because the original ZIP archive is left in place, a periodic cleanup operation can simply remove the unpacked databases to reclaim the disk space.
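A periodic cleanup of this kind could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Urchin's own mechanism: the locations of the archive and of the unpacked databases are passed in by the caller because this article does not specify them, and the function name remove_unpacked is hypothetical. A file is removed only if a file of the same name exists inside the ZIP archive, and the sketch defaults to a dry run that deletes nothing.

```python
import zipfile
from pathlib import Path


def remove_unpacked(archive_path, unpack_dir, dry_run=True):
    """Remove files in unpack_dir that are also stored in the ZIP archive.

    Returns the names of the files that were (or, in a dry run, would be)
    removed. The archive itself is never touched.
    """
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        # Compare on base names; directory entries in the archive are skipped.
        archived = {Path(name).name for name in archive.namelist()
                    if not name.endswith("/")}
    removed = []
    for entry in sorted(Path(unpack_dir).iterdir()):
        if not entry.is_file() or entry.resolve() == archive_path.resolve():
            continue
        if entry.name in archived:
            removed.append(entry.name)
            if not dry_run:
                entry.unlink()
    return removed
```

Running it first with dry_run=True and reviewing the returned names is a cautious way to confirm that only unpacked copies would be deleted.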

The last avenue for reducing Urchin storage requirements is to establish a policy for the duration of historical reporting that Urchin is to provide. For instance, in environments where Urchin is provided as a reporting service with a hosting package, it is very common to provide Urchin historical reporting for a period of one year. Because Urchin databases are organized by month, it is very easy for a scheduled script to remove old monthly databases that have aged past a certain threshold. When a historical reporting length policy is implemented, Urchin's data storage requirement will typically stabilize or increase only slightly once the historical retention limit has been reached.

Methods for Reducing Data Storage - How To

Method 1: Delete the Raw Tracking Data after Log Processing

You can configure the profile to delete raw visitor and session information after processing. For large sites, this improves performance and reduces the amount of data stored. Note: When this configuration is selected, a session that spans two days appears as two sessions (one for each day) instead of one. The difference in results will be negligible for most sites.

To configure the profile to delete raw visitor and session information after processing:

  1. In the Admin interface, click Configuration, then Urchin Profiles-->Profiles.
  2. Edit the desired profile.
  3. In the Storage/DB tab, turn the Keep Raw Tracking Data field "off".
  4. Click Update.

Method 2: Auto-Archive Historic Data

You can configure the profile to compress historic monthly data into an archive. Reports can still display the archived data, but no additional hits may be processed for the archived months.

To configure the profile to archive historic data:

  1. In the Admin interface, click Configuration, then Urchin Profiles-->Profiles.
  2. Edit the desired profile.
  3. In the Storage/DB tab, turn the Archive DB field "on".
  4. Specify a number of months for the Archive DB After field.
  5. Click Update.

Method 3: Limit Retention of Databases for Historical Reporting

For each Urchin Profile, remove any databases in the data/reports/profile-name directory whose YYYYMM prefix has aged past the threshold needed for historical reporting. For example, if you wish to retain a one-year reporting history and the current month is February 2004, you would remove any databases named 200301-*data.un* to delete the reporting data from January 2003 for that Urchin profile. Repeat this for every month older than January 2003.
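The retention rule in this example could be automated with a Python sketch along the following lines. It is an illustration under stated assumptions rather than a supported Urchin tool: it treats any file or directory in the profile directory whose name starts with a YYYYMM prefix (followed by a dash or nothing) as a monthly database, and the function names expired_entries and prune are hypothetical. It defaults to a dry run that only reports what would be removed.

```python
import re
import shutil
from pathlib import Path

# Assumed naming: entries begin with YYYYMM, followed by "-" or nothing.
MONTH_PREFIX = re.compile(r"^(\d{4})(\d{2})(?:-|$)")


def expired_entries(profile_dir, current_year, current_month, keep_months=12):
    """Return entries whose YYYYMM prefix is more than keep_months in the past."""
    # Months are counted as year * 12 + (month - 1) so they compare as integers.
    cutoff = current_year * 12 + (current_month - 1) - keep_months
    expired = []
    for entry in sorted(Path(profile_dir).iterdir()):
        match = MONTH_PREFIX.match(entry.name)
        if match is None:
            continue
        year, month = int(match.group(1)), int(match.group(2))
        if 1 <= month <= 12 and year * 12 + (month - 1) < cutoff:
            expired.append(entry)
    return expired


def prune(profile_dir, current_year, current_month, keep_months=12, dry_run=True):
    """Delete expired monthly databases; return the names affected."""
    victims = expired_entries(profile_dir, current_year, current_month, keep_months)
    if not dry_run:
        for entry in victims:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return [entry.name for entry in victims]
```

With the current month set to February 2004 and keep_months=12, this selects everything from January 2003 and earlier, matching the example above. Because the monthly database sets are independent of one another, removing one month does not affect reporting for any other month.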