Reducing Disk Storage for Urchin Profile Monthly Databases

Overview

Urchin reporting data is stored in independent monthly databases for each Profile configured within Urchin. These databases typically reside in the data/reports directory of the Urchin distribution. By default, Urchin keeps an unlimited number of these monthly Profile databases. For most small and medium-sized sites, the storage requirements are modest. Because Urchin reporting does not require access to the raw webserver logs once they have been processed, there is no need to keep the logs for reporting purposes. The processed Urchin monthly databases are approximately 5-10% of the size of the raw webserver logs from which they were built, so in most cases they occupy very little disk space even if all Urchin databases are kept indefinitely.

For large sites, however, which produce hundreds or thousands of megabytes worth of webserver logs per day, or hosting providers who have a very large number of Profiles configured, it may be desirable to reduce Urchin's ongoing data storage requirement. This can be accomplished in one of the following ways:

  1. Set the profile to automatically delete the raw tracking data after processing the logs
  2. Set the profile to archive historic data
  3. Limit the number of months of historical reporting data that are retained

Instructions for each of these methods are provided at the end of this article.

Technical Overview of Urchin Database Storage

For each Urchin profile, Urchin maintains a set of monthly database files that provide data for the reporting engine. The files are named after the month for which they store data. The complete list of files is:

    YYYYMM-hdata.und --> hash table data            
    YYYYMM-hdata.uni --> hash table index                 
    YYYYMM-hdata.uns --> hash table string data  
    YYYYMM-ldata.und --> log tracking data  
    YYYYMM-ldata.uni --> log tracking indexes  
    YYYYMM-pdata.und --> path data  
    YYYYMM-sdata.und --> session data  
    YYYYMM-tdata.und --> totals data  
    YYYYMM-udata.unf --> header for the database
    YYYYMM-vdata.und --> visitor data  
    YYYYMM-vdata.uni --> visitor index
    
Each set of databases is complete for the month of data that it contains. Since there is no interdependency between the monthly database sets, archiving and pruning operations can be performed independently on each database set without affecting any other month.

Under normal operation, the entire set of monthly database files is retained for each month. However, four of these files are used only by the Urchin log processing engine. These files are:

    YYYYMM-pdata.und
    YYYYMM-sdata.und
    YYYYMM-vdata.und
    YYYYMM-vdata.uni
    
These databases contain information about paths, sessions and visitors, and can account for a substantial percentage of the total storage space required for the month, on the order of 10-50%. Setting the Keep Raw Tracking Data option to off in the Storage/DB screen of the Profile configuration can therefore save a significant amount of disk space.
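
To see where a particular profile falls in that 10-50% range before changing the setting, the file sizes can simply be added up. The following Python sketch is an illustration only and is not part of Urchin; the function name raw_tracking_share is hypothetical, and it assumes the monthly files sit directly in the profile's data/reports/profile-name directory under the file names listed above.

```python
from pathlib import Path

# Suffixes of the four raw tracking files named in this article.
RAW_TRACKING_SUFFIXES = ("-pdata.und", "-sdata.und", "-vdata.und", "-vdata.uni")


def raw_tracking_share(profile_dir, month):
    """Return (raw_bytes, total_bytes) for one month, e.g. month="200401".

    Hypothetical helper: it only reads file sizes and changes nothing on disk.
    """
    raw_bytes = 0
    total_bytes = 0
    # Same pattern the article uses for a monthly set: YYYYMM-*data.un*
    for path in Path(profile_dir).glob(f"{month}-*data.un*"):
        if not path.is_file():
            continue
        size = path.stat().st_size
        total_bytes += size
        if path.name.endswith(RAW_TRACKING_SUFFIXES):
            raw_bytes += size
    return raw_bytes, total_bytes
```

For example, raw_tracking_share("data/reports/profile-name", "200401") returns the two byte counts for January 2004; dividing the first by the second gives the fraction of that month's storage that the raw tracking files occupy.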

Important Note: If you plan to upgrade to a future major release of Urchin, this raw tracking data will be used for linking records together. Absence of this data will affect certain new visitor-centric drill down reports that are planned for Urchin. Therefore, disabling the Keep Raw Tracking Data option is recommended only for extremely high-traffic sites where keeping the raw tracking data causes a disk or CPU resource problem.

Further disk space can be saved by compressing historic Urchin monthly databases into ZIP archives. The resulting archives are typically only 20-30% of the size of the uncompressed database set. While the Urchin reporting engine cannot read the ZIP archives directly, it can extract the databases it needs from the ZIP archives on the fly. This is completely transparent to a person viewing Urchin reports, other than a slight delay while the databases are being unpacked. The reporting engine does not remove the databases it has unpacked; this allows quicker access to the data while the person is viewing the Urchin reports. Because the original ZIP archive is left in place, a periodic cleanup operation can simply remove the unpacked databases to reclaim the disk space.
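
The periodic cleanup described above could be sketched in Python as follows. This is a hedged illustration, not an Urchin tool: the function name remove_unpacked_copies is hypothetical, and the sketch assumes that the ZIP archives live in the same profile directory as the unpacked files and store each database under its plain file name. This article does not state either detail, so verify both against a real profile directory first. A file is treated as removable only if a file of the same name is found inside one of the archives.

```python
import zipfile
from pathlib import Path, PurePosixPath


def remove_unpacked_copies(profile_dir, dry_run=True):
    """Remove database files that also exist inside a ZIP archive in profile_dir.

    Hypothetical helper. Returns the paths that were removed, or that would be
    removed when dry_run is True (the default, which deletes nothing).
    """
    profile_dir = Path(profile_dir)

    # Collect the file names stored in every archive (assumed to end in .zip).
    archived_names = set()
    for archive in profile_dir.glob("*.zip"):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                archived_names.add(PurePosixPath(member).name)

    removed = []
    for path in sorted(profile_dir.glob("*data.un*")):
        # Only touch files that can be re-extracted from an archive later.
        if path.is_file() and path.name in archived_names:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running it first with the default dry_run=True lists the candidates without deleting anything. Months that have not been archived are never touched, because none of their files appear in an archive. It is probably safest to schedule the real run for a time when nobody is likely to be viewing reports.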

The last avenue for reducing Urchin storage requirements is to establish a policy for how much historical reporting Urchin is to provide. For instance, in environments where Urchin is provided as a reporting service with a hosting package, it is very common to provide Urchin historical reporting for a period of one year. Because Urchin databases are organized by month, it is very easy for a script to automatically remove monthly databases that have aged past a certain threshold. When a historical reporting length policy is implemented, Urchin's data storage requirement will typically stabilize or increase only slightly once the historical retention limit has been reached.

Methods for Reducing Data Storage - How To

Method 1: Delete the Raw Tracking Data after Log Processing

You can configure the profile to delete raw visitor and session information after processing. For large sites, this improves performance and reduces the amount of data stored. Note: With this configuration, a session that spans two days appears as two sessions (one for each day) instead of one. The difference in results will be negligible for most sites.

To configure the profile to delete raw visitor and session information after processing:

  1. In the Admin interface, click Configuration, then Urchin Profiles-->Profiles.
  2. Edit the desired profile.
  3. In the Storage/DB tab, turn the Keep Raw Tracking Data field "off".
  4. Click Update.

Method 2: Auto-Archive Historic Data

You can configure the profile to compress historic monthly data into an archive. The reports can view the archived data, but no additional hits may be processed for the archived months.

To configure the profile to archive historic data:

  1. In the Admin interface, click Configuration, then Urchin Profiles-->Profiles.
  2. Edit the desired profile.
  3. In the Storage/DB tab, turn the Archive DB field "on".
  4. Specify a number of months for the Archive DB After field.
  5. Click Update.

Method 3: Limit Retention of Databases for Historical Reporting

For each Urchin Profile, remove any databases in the data/reports/profile-name directory whose YYYYMM prefix is older than the threshold needed for historical reporting. For example, if you wish to retain a one-year reporting history and the current month is February 2004, you would remove the databases named 200301-*data.un* to delete the reporting data from January 2003 for that Urchin profile, and then repeat this for every month before January 2003.
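
The retention rule in the example above can be sketched in Python as follows. This is an illustration only, not the Urchin-supplied script: the names oldest_month_to_keep and prune_profile are hypothetical, and the sketch assumes that every monthly file matches the YYYYMM-*data.un* naming shown earlier. It does not match ZIP archives of old months, so those would need separate handling.

```python
import re
from datetime import date
from pathlib import Path

# Matches monthly database files such as 200301-hdata.und or 200301-vdata.uni.
MONTH_FILE = re.compile(r"^(\d{4})(\d{2})-\w*data\.un\w*$")


def oldest_month_to_keep(today, retain_months):
    """Return (year, month) of the oldest month that is still retained."""
    index = today.year * 12 + (today.month - 1) - retain_months
    return index // 12, index % 12 + 1


def prune_profile(profile_dir, retain_months=12, today=None, dry_run=True):
    """Remove monthly database files older than the retention window.

    Hypothetical helper. Returns the paths that were removed, or that would be
    removed when dry_run is True (the default, which deletes nothing).
    """
    cutoff = oldest_month_to_keep(today or date.today(), retain_months)
    removed = []
    for path in sorted(Path(profile_dir).iterdir()):
        match = MONTH_FILE.match(path.name)
        if not match or not path.is_file():
            continue  # leave anything that is not a monthly database file
        file_month = (int(match.group(1)), int(match.group(2)))
        if file_month < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

With today set to a date in February 2004 and retain_months=12, the oldest month kept is February 2003, so the January 2003 files and everything older are selected, which matches the example above. Reviewing the dry-run output before calling the function with dry_run=False is advisable, since deleted months cannot be rebuilt without the original webserver logs.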

For an example of a ready-to-run Perl script that will automatically prune the Urchin databases after a certain period of time, please see the PruneUrchinData script at http://www.urchin.com/support/scripts/purge_udata.pl