How sampling works in Google Analytics

Background

Sampling in Google Analytics (GA) or in any analytics software refers to the practice of selecting a subset of data from your traffic and reporting on the trends available in that sample set. Sampling is widely used in statistical analysis because analyzing a subset of data gives similar results to analyzing all of the data. In addition, sampling speeds up processing for reports when the volume of data is so large as to slow down report queries.

Session sampling

How standard reports work

Each property within Google Analytics stores a copy of all the unfiltered data associated with the unique property number. Each reporting view associated with a property creates a set of unsampled, pre-aggregated data tables, which are processed on a daily basis. The suite of standard Google Analytics reports relies on these pre-aggregated tables to deliver unsampled reports in a timely manner.

Apart from the standard reports, users may issue ad-hoc queries to Google Analytics. Common queries include applying Segments to standard reports, applying a secondary dimension, or running a custom report. When the front end issues a query, GA inspects the set of pre-aggregated tables to determine whether the query can be wholly satisfied by existing aggregates. If not, GA goes back to the raw session data and computes the aggregate data on the fly. If the resulting report is sampled, a yellow box at the top of the report states, "This report is based on N sessions."

How ad-hoc reports work

As discussed, in cases where the report query cannot be satisfied by existing aggregates (i.e., pre-aggregated tables), GA goes back to the raw session data to compute the requested information. In order to reduce latency, GA may sample session data for such queries. Specifically, GA inspects the number of sessions for the specified date range at the property level. If the number of sessions to the property in the given date range exceeds 250K sessions (see footnote 1), GA will employ a sampling algorithm which uses a sample set of 250K, proportional to the distribution of sessions by day for the selected date range. Thus, the session sampling rate varies for every query depending on the number of sessions included in the selected date range for the given property. Note that the sample size can be configured to be anywhere from 1K to 500K; the default size is 250K.
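The proportional-by-day behavior described above can be illustrated with a minimal sketch. This is not GA's actual algorithm: the function name, the rounding, and the example dates and counts are all assumptions made for illustration.

```python
def daily_sample_allocation(sessions_by_day, sample_size=250_000):
    """Split a fixed sample budget across days in proportion to daily sessions.

    Hypothetical helper: GA's real sampling algorithm is not public, so this
    only illustrates the arithmetic of a proportional allocation.
    """
    total = sum(sessions_by_day.values())
    if total <= sample_size:
        # At or under the threshold: no sampling, every session is used.
        return dict(sessions_by_day), 1.0
    rate = sample_size / total
    allocation = {day: round(count * rate) for day, count in sessions_by_day.items()}
    return allocation, rate


# Made-up property with 1M sessions over two days.
sessions = {"day 1": 400_000, "day 2": 600_000}
allocation, rate = daily_sample_allocation(sessions)
print(rate)        # 0.25
print(allocation)  # {'day 1': 100000, 'day 2': 150000}
```

Widening the date range raises the total session count while the sample budget stays fixed, so the sampling rate drops.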

Implications for filtered views and Segments

It is important to note that session sampling occurs at the property level, not the view level. For ad-hoc queries, the sample set of 250K sessions (see footnote 2) is determined at the property level, and then the view-level filters are applied. As such, views that are filtered may have fewer sessions included in the sampled calculation. Similarly, Segments are applied after the 250K sessions are sampled, so fewer sessions may be included in the calculation.
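As a rough worked example of this effect, the sketch below estimates how many sampled sessions remain once a view filter is applied. The function name and the 10% view share are hypothetical, and the calculation assumes the filter keeps the same share of sampled sessions as it does of all sessions.

```python
def expected_sessions_in_view(property_sessions, view_share, sample_size=250_000):
    """Estimate how many sampled sessions survive a view filter.

    view_share is the assumed fraction of property sessions that pass the
    view's filters (a made-up input, not something GA reports directly).
    """
    sampled = min(property_sessions, sample_size)
    return round(sampled * view_share)


# A property with 1M sessions in the date range and a view that keeps about 10%.
print(expected_sessions_in_view(1_000_000, 0.10))  # 25000
```

Under these assumptions the filtered view's report rests on about 25K sessions rather than 250K, which is why heavily filtered views tend to be less precise.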

In general, session sampling is a highly effective means of reducing query latency while maintaining a high level of accuracy. In particular, GA's approach to sampling works very well for fast, top N queries and other queries that have a relatively broad, uniform distribution across sessions. Session sampling can be less accurate for 'needle in a haystack' problems, such as single keyword analysis and longtail analysis, or cases with narrow dimension filtering, such as heavily filtered views or conversion analysis where conversions constitute a small fraction of sessions. For those types of analysis, please refer to the section on accessing unsampled reports with GA Premium accounts.

Dimension value aggregates

How standard reports work

As discussed, the pre-aggregated tables per view are processed on a daily basis. These tables report data on all sessions, though there is a limit to the number of rows/distinct values in the pre-aggregated tables (see footnote 3). GA aggregates data when there are more than 75,000 rows of data in a single table in a single day. In other words, when there are more than 75,000 values for a given table, GA will take the top N values (see footnote 4) and create an aggregate entry for the remaining values labeled (other).
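The roll-up into (other) can be sketched as follows. The sketch assumes N equals the row limit and ranks rows by a single metric. The function name and sample pages are hypothetical, and a limit of 2 stands in for 75,000 so the output stays readable.

```python
def aggregate_top_n(rows, limit=75_000):
    """Keep the top `limit` rows by metric value and roll the rest into (other).

    Assumption: N equals the row limit; the source only says "top N".
    """
    if len(rows) <= limit:
        return dict(rows)
    ranked = sorted(rows.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:limit])
    kept["(other)"] = sum(value for _, value in ranked[limit:])
    return kept


# Made-up pageview counts for one day.
pages = {"/a": 50, "/b": 30, "/c": 5, "/d": 2}
print(aggregate_top_n(pages, limit=2))  # {'/a': 50, '/b': 30, '(other)': 7}
```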

Implications for multi-day requests

It is important to note that the top N entries are determined on a per-day basis. For example, if you select any single day in the Pages report, you will see at most 75,000 rows; all other pages are aggregated into the (other) category. Therefore, a page that is grouped in the (other) category one day may not necessarily be grouped in the (other) category another day. So when running a report for a multi-day date range, you may run into inconsistencies, because some pages (or other dimension values) in the longtail may fall into the (other) bucket on some days and appear as their own rows on others.
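To make the multi-day inconsistency concrete, the sketch below applies a per-day top-N cut to two made-up days and then sums the daily tables. The helper name, the pages, and the limit of 2 (standing in for 75,000) are all assumptions for illustration.

```python
from collections import Counter


def top_n_with_other(rows, limit):
    """Per-day cut: keep the top `limit` rows, fold the remainder into (other)."""
    ranked = sorted(rows.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:limit])
    overflow = sum(value for _, value in ranked[limit:])
    if overflow:
        kept["(other)"] = overflow
    return kept


# Made-up pageviews: /b is in the top 2 on day 1 only, /c on day 2 only.
day1 = {"/a": 50, "/b": 30, "/c": 5}
day2 = {"/a": 40, "/c": 35, "/b": 3}

report = Counter()
for day in (day1, day2):
    report.update(top_n_with_other(day, limit=2))

print(dict(report))
# /b shows 30 although its true two-day total is 33;
# /c shows 35 although its true total is 40;
# the missing 8 pageviews sit in (other).
```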

Additionally, for any date range, Google Analytics returns a maximum of 1M rows for the report. Rows in excess of 1M are rolled up into an (other) entry.

Because dimension values (e.g., unique URLs and campaign keywords) often repeat across given days, this threshold typically only affects sites or apps with a lot of unique content and/or keywords.

Learn more about how data is aggregated under (other).

How ad-hoc reports work

In cases where the user query cannot be satisfied by existing aggregates (i.e., pre-aggregated tables), GA goes back to the raw session data to compute the requested information. In that situation, GA will return a maximum of 1M unique dimension values included in the sample set for the query.

Other reports

Sampling and multi-channel funnel reports

The multi-channel funnel reports are based on 1M conversions. If the number of conversions exceeds 1M for the given date range, then GA samples up to 1M conversions at the view level. Note that sampling occurs at the view, not property, level for MCF reports.

Additionally, the maximum number of unique conversion paths is 200K per day. All other conversion paths are aggregated under (other).

Sampling and flow visualization reports

Flow visualization reports (including Visitors Flow and Goal Flow) are generated off a subset of 100K sessions for a given date range. Similar to standard report session sampling, the 100K sessions are sampled at the property level. Therefore, applying view filters or Segments can further reduce the sample set size.

For this reason, the figures in the flow visualization reports, including entrance, exit, and conversion rates, may differ from the results in the standard content and conversion reports, which are based on a different sample set.

Data collection sampling

If your website or app has many millions of pageviews per month, you might consider configuring your tracking code to sample your data. For information on how to do this, follow the instructions in the Developer Guide for your specific environment.

By sampling the hits for your site or app, you will get reliable report results while staying within the hit limits for your account. The limit on the number of hits for a standard GA account is 10M hits/month. For Premium accounts, the hit limit is 1B+ hits/month. When data collection sampling is implemented, hits are discarded on the client-side and are never collected or processed by Google Analytics. Therefore, discarded hits cannot be recovered through Premium unsampled reports. Also, unlike session sampling, Google Analytics does not extrapolate the report results based on the data collection sample rate. That said, an added benefit of data collection sampling will be that report response times may be faster with less data in the account.
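Because GA does not extrapolate for data collection sampling, reported totals reflect only the collected share of traffic. The sketch below shows one way to scale a reported figure back up by hand. The function name and the numbers are hypothetical, and the result is only a rough estimate that assumes the sampled users are representative.

```python
def estimate_true_total(reported_total, sample_rate_percent):
    """Scale a reported metric up by the data collection sample rate.

    Hypothetical helper: GA does not do this for you, and the estimate is
    only as good as the assumption that sampled users are representative.
    """
    if not 0 < sample_rate_percent <= 100:
        raise ValueError("sample_rate_percent must be in (0, 100]")
    return reported_total * 100 / sample_rate_percent


# With a 10% collection sample rate, 120,000 reported pageviews
# suggest roughly 1.2M actual pageviews.
print(estimate_true_total(120_000, 10))  # 1200000.0
```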

Data collection sampling occurs consistently across users. Therefore, once a user has been selected for data collection, all sessions (including future sessions) for the user will send data to GA. For mobile applications, this means that downloads of the application that have been selected for data collection will send all data to GA, while other instances of the application will not send any hits.
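One common way to get this kind of per-user consistency is to derive the decision deterministically from a stable user identifier. The sketch below illustrates the idea only: the hash function, bucket count, and function name are assumptions, not GA's actual implementation.

```python
import hashlib


def is_user_sampled_in(client_id, sample_rate_percent):
    """Decide deterministically whether a user's hits are collected.

    Illustrative scheme: hash the identifier into one of 10,000 buckets and
    keep the user if the bucket falls below the configured rate.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_rate_percent * 100


# The same (made-up) identifier always gets the same answer,
# so a selected user stays selected across sessions.
first = is_user_sampled_in("client-123", 25)
second = is_user_sampled_in("client-123", 25)
print(first == second)  # True
```

Because the decision depends only on the identifier and the rate, no per-user state needs to be stored to keep sessions consistent.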

Note that even if the data for your site is not sampled when it is collected, certain types of reports will encounter other types of sampling, including session sampling and dimension value aggregation, based on the nature of the query. See How ad-hoc reports work for session sampling.

1 See adjusting the sample size. Sample size is adjustable from 1K to 500K sessions.

2 See adjusting the sample size.

3 Tables may correspond to a single report or multiple reports. Tables may contain a single dimension (e.g., Keyword) or multiple dimensions (e.g., Ad Group and Campaign). At the most granular level, reports will contain at most 75K rows of data. Higher levels in the table hierarchy, such as ad group, may contain fewer than 75K rows.

4 As determined by the relevant metric for the report/table (e.g. # sessions, # events, # page views, # transactions).