Urchin reporting data is stored in independent monthly databases for each Profile configured within Urchin. These databases typically reside in the data/reports directory of the Urchin distribution. By default, Urchin will keep an unlimited number of these monthly Profile databases. For most small and medium sized sites, the storage requirements are fairly modest. Because Urchin reporting does not require access to the raw webserver logs once they've been processed, there is no need to keep the webserver logs. The "crunched" Urchin monthly databases will be approximately 5-10% of the size of the raw webserver logs that were processed to populate the Urchin databases, and in most cases this will represent a very minimal amount of disk space even if all Urchin databases are kept indefinitely.
However, for large sites which produce hundreds or thousands of megabytes worth of webserver logs per day, or hosting providers who have a very large number of Profiles configured, it may be desirable to reduce Urchin's ongoing data storage requirement. This can be done through a number of different methods:
- Make the Urchin databases read-only after a set period of time by removing the database components that are only used by the log processing engine
- Compress monthly databases in a ZIP archive after a set number of months
- Limit the number of months of historical reporting data that are retained
Technical Overview of Urchin Database Storage
For each Urchin profile, Urchin maintains a set of nine monthly databases that provide data for the reporting engine. The databases are named after the month for which they store data. The complete list of databases is:
    YYYYMM-hdata.und
    YYYYMM-hdata.uni
    YYYYMM-hdata.uns
    YYYYMM-pdata.und
    YYYYMM-sdata.und
    YYYYMM-tdata.und
    YYYYMM-udata.unf
    YYYYMM-vdata.und
    YYYYMM-vdata.uni
Each set of databases is complete for the month of data that it contains. Since there is no interdependency between the monthly database sets, archiving and pruning operations can be performed independently on each database set without affecting any other month.
Under normal operation, the entire set of nine monthly database files is retained for each month. However, four of these database files are used only by the Urchin log processing engine. These database files are:
    YYYYMM-pdata.und
    YYYYMM-sdata.und
    YYYYMM-vdata.und
    YYYYMM-vdata.uni
Since they are only accessed during log processing operations, it is generally safe to remove them once all the log processing for that month is complete. These databases contain information about paths, sessions and visitors and can account for a substantial percentage of the total storage space required for the month, on the order of 10-50%. Thus there can be a significant disk space advantage to removing these databases once updates for the month are complete.
Important Note: future major releases of Urchin are likely to use the data from these databases for reporting purposes. Therefore, it is recommended that these databases be retained if you wish to have complete historical reporting for Urchin after a major version upgrade.
Other potential disk space savings can be obtained by compressing the Urchin monthly databases into ZIP archives. The resulting archives are typically only 20-30% the size of the uncompressed database set. While the Urchin reporting engine cannot read the ZIP archives directly, it has the ability to extract the databases it needs from the ZIP archives on the fly. This is completely transparent to a person viewing Urchin reports, other than a slight delay while the databases are being unpacked. The reporting engine does not remove the databases it has unpacked; this allows quicker access to data while the person is viewing the Urchin reports. However, the original ZIP archive is left in place, so a periodic cleanup operation can simply remove the unpacked databases to regain the disk space once again.
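The periodic cleanup mentioned above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name remove_unpacked_copies is hypothetical, the profile directory is assumed to hold both the YYYYMM-archive.zip files and any unpacked databases, and a file is deleted only if a file of the same name is listed inside the archive for that month.

```python
import re
import zipfile
from pathlib import Path

# Archive naming convention required by the reporting engine.
ARCHIVE_RE = re.compile(r"^(\d{6})-archive\.zip$")

def remove_unpacked_copies(profile_dir):
    """Delete database files that also exist inside a YYYYMM-archive.zip.

    Returns the list of removed paths. Database files that are not
    listed in the archive for their month are left untouched.
    """
    profile_dir = Path(profile_dir)
    removed = []
    for archive in sorted(profile_dir.glob("*-archive.zip")):
        match = ARCHIVE_RE.match(archive.name)
        if not match:
            continue
        month = match.group(1)
        # Read only the archive's file listing; nothing is extracted.
        with zipfile.ZipFile(archive) as zf:
            archived_names = {Path(name).name for name in zf.namelist()}
        for db in sorted(profile_dir.glob(f"{month}-*data.un*")):
            if db.name in archived_names:
                db.unlink()
                removed.append(db)
    return removed
```

Checking names against the archive listing is a cautious design choice: a database that was never archived is not removed by mistake. The sketch does not compare file contents, so it should not be run on a month whose databases have been updated since the archive was created.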
The last avenue for reducing Urchin storage requirements is to establish a policy for how much historical reporting Urchin is to provide. For instance, in environments where Urchin is provided as a reporting service with a hosting package, it is very common to provide Urchin historical reporting for a period of one year. Because Urchin databases are organized by month, it is easy for a scheduled script to remove monthly databases that have aged past a certain threshold. When a historical reporting length policy is implemented, Urchin's data storage requirement will typically stabilize or only increase slightly once the historical retention limit has been reached.
Methods for Reducing Data Storage - Technical Details
Method 1: Remove Update-Only Urchin Databases
Warning! This technique should only be used for months that will not require further updates, and where complete historical reporting is not needed after a major Urchin release (see note above). Once these databases have been removed, the Urchin log processing engine will fail if it attempts to process additional data for that month. If the Urchin data for that month later needs updating, all the raw webserver logs from that month will have to be reprocessed. In general, it is unwise to remove these databases unless they are a minimum of two months old.
For each Urchin Profile, remove the following databases in the data/reports/profile-name directory:
    YYYYMM-pdata.und
    YYYYMM-sdata.und
    YYYYMM-vdata.und
    YYYYMM-vdata.uni
where YYYYMM refers to the 4-digit year and 2-digit month for the databases you wish to remove. It is recommended that this operation be done in a regularly scheduled script that runs at least once a month.
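As a rough sketch of such a script for a single profile, the Python below lists the update-only databases that are at least two months old. The function names and the min_age_months parameter are hypothetical; the four file suffixes and the two-month minimum come from the text above. The sketch only returns the candidates so that they can be reviewed before anything is deleted.

```python
from datetime import date
from pathlib import Path

# The four databases used only by the log processing engine.
UPDATE_ONLY_SUFFIXES = ("pdata.und", "sdata.und", "vdata.und", "vdata.uni")

def months_before(today, months):
    """Return the YYYYMM string for the month `months` months before `today`."""
    index = today.year * 12 + (today.month - 1) - months
    return f"{index // 12:04d}{index % 12 + 1:02d}"

def update_only_candidates(profile_dir, today, min_age_months=2):
    """List update-only database files that are at least `min_age_months` old."""
    cutoff = months_before(today, min_age_months)
    found = []
    for path in Path(profile_dir).iterdir():
        prefix, _, rest = path.name.partition("-")
        is_month = len(prefix) == 6 and prefix.isdigit()
        # YYYYMM strings sort chronologically, so string comparison is enough.
        if is_month and rest in UPDATE_ONLY_SUFFIXES and prefix <= cutoff:
            found.append(path)
    return sorted(found)
```

A scheduled job could then call path.unlink() on each returned path, for example with today=date.today(). With a current date in April 2003 and the default of two months, only databases for February 2003 and earlier are returned.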
Method 2: Compress Urchin monthly databases in a ZIP archive
For each Urchin Profile, create a ZIP archive of all the databases in the data/reports/profile-name directory that begin with a YYYYMM prefix. To ensure consistency, this operation should be done using the zip utility supplied in the util directory of the Urchin distribution. The ZIP archive must be named YYYYMM-archive.zip in order for the Urchin reporting engine to access the data within it.
Important Note: If you need to process data for a month that has already been archived, you should unpack the ZIP archive for that month, remove the archive, and then process the corresponding logs. This will ensure that automated cleanup scripts do not replace your recently updated databases with older databases from the ZIP archive.
The following is an example of invoking zip to create an archive of the databases for April 2002:
    zip -q -m 200204-archive.zip 200204-*data.un*
This will create the ZIP archive and automatically delete the database files that were added to the archive. Again, it is recommended that this archiving operation be done at least once a month on a regularly scheduled basis.
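For scripted use, the same invocation could be wrapped in Python along the following lines. This is a sketch under assumptions: the function names are hypothetical, and zip_exe stands for the path to the zip utility in the Urchin util directory, which the caller must supply. The -q and -m options and the archive name are those shown in the example above. Because no shell is involved, the wildcard is expanded in Python.

```python
import subprocess
from pathlib import Path

def build_zip_command(zip_exe, profile_dir, month):
    """Return the zip argument list for one month, or None if no files match."""
    files = sorted(p.name for p in Path(profile_dir).glob(f"{month}-*data.un*"))
    if not files:
        return None
    # Same options as the manual example: quiet, move files into the archive.
    return [str(zip_exe), "-q", "-m", f"{month}-archive.zip", *files]

def archive_month(zip_exe, profile_dir, month):
    """Run the zip utility from inside the profile directory."""
    command = build_zip_command(zip_exe, profile_dir, month)
    if command is not None:
        # cwd keeps the stored names free of directory components.
        subprocess.run(command, cwd=profile_dir, check=True)
    return command
```

Separating the command construction from its execution makes it possible to log or inspect the exact command before it runs, which is useful when the script is first put on a schedule.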
Method 3: Limit Retention of Databases for Historical Reporting
For each Urchin Profile, simply remove any databases in the data/reports/profile-name directory that begin with a YYYYMM prefix that have aged past the threshold needed for historical reporting. For example, if you wish to retain a one-year reporting history and the current month is April 2003, you would remove any databases named 200203-*data.un* to delete the reporting data from March 2002 for that Urchin profile. This would be repeated for all databases older than March 2002.
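The threshold calculation in this example can be sketched in Python as shown below. The function names and the keep_months and include_archives parameters are hypothetical. Whether YYYYMM-archive.zip files created by Method 2 should be removed as well is not stated above, so the sketch leaves them alone unless include_archives is set.

```python
import re
from datetime import date
from pathlib import Path

DB_RE = re.compile(r"^(\d{6})-.*data\.un.*$")
ARCHIVE_RE = re.compile(r"^(\d{6})-archive\.zip$")

def newest_expired_month(today, keep_months=12):
    """Return the most recent YYYYMM that falls outside the retention window."""
    index = today.year * 12 + (today.month - 1) - (keep_months + 1)
    return f"{index // 12:04d}{index % 12 + 1:02d}"

def expired_files(profile_dir, today, keep_months=12, include_archives=False):
    """List files whose month is at or before the newest expired month."""
    cutoff = newest_expired_month(today, keep_months)
    expired = []
    for path in Path(profile_dir).iterdir():
        match = DB_RE.match(path.name)
        if match is None and include_archives:
            match = ARCHIVE_RE.match(path.name)
        # YYYYMM strings sort chronologically, so string comparison is enough.
        if match and match.group(1) <= cutoff:
            expired.append(path)
    return sorted(expired)
```

For a current date in April 2003 and a one-year policy, newest_expired_month returns "200203", so databases for March 2002 and earlier are listed while April 2002 onward is kept, matching the example above.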
- While the methods of reducing Urchin disk storage outlined in this document can be done manually, it is recommended that the archiving/cleanup tasks be run from a script called by an automated scheduler such as cron on UNIX-type operating systems or the Windows Task Scheduler.
- At a minimum, these archiving/cleanup operations should be done once per month. However, since the Urchin reporting engine may unpack archived data on an as-needed basis, a daily schedule is recommended to keep disk space usage in check.
- Be aware that the transition from one month to the next will necessarily require that archiving be done for all Urchin profiles. Since the process of creating ZIP archives is demanding on both CPU and I/O bandwidth resources, it may be wise to run the cleanup script at low priority and possibly design it to stagger the archiving for the Profiles across several days.
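One way to stagger the archiving across several days is to give each profile a fixed slot derived from its name. The Python sketch below is one possible approach, not a feature of Urchin: the function names and the spread_days parameter are hypothetical, and a CRC-32 checksum is used only because it yields the same value for a given name on every run.

```python
import zlib

def stagger_day(profile_name, spread_days=3):
    """Map a profile name to a fixed day offset in the range 0..spread_days-1."""
    return zlib.crc32(profile_name.encode("utf-8")) % spread_days

def profiles_due(profiles, day_of_month, spread_days=3):
    """Return the profiles whose archiving slot falls on this day of the month."""
    if not 1 <= day_of_month <= spread_days:
        return []
    return [name for name in profiles
            if stagger_day(name, spread_days) == day_of_month - 1]
```

A daily job would call profiles_due with the current day of the month and archive only the profiles returned, so that every profile is handled exactly once during the first spread_days days of each month. On UNIX-type systems the job itself could additionally be started through the nice command to lower its priority.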