How sampling works

Sampling in Google Analytics is the practice of selecting a subset of data from your traffic and reporting on the trends available in that sample set. Sampling is widely used in statistical analysis because analyzing a representative subset of data typically gives results similar to analyzing all of the data. In addition, sampling speeds up report processing when the volume of data is large enough to slow down report queries.

Session sampling

How standard reports work

Each property within Analytics stores a copy of all the unfiltered data associated with the unique property number. Each reporting view associated with a property creates a set of unsampled, pre-aggregated data tables, which are processed on a daily basis. These pre-aggregated tables are used to quickly display unsampled reports.

Apart from the standard reports, users may issue ad-hoc queries to Analytics. Common queries include applying segments to standard reports, applying a secondary dimension, or running a custom report. When the front end issues a query, Analytics inspects the set of pre-aggregated tables to determine whether the query can be wholly satisfied by existing aggregates. If not, Analytics goes back to the raw session data and computes the aggregate data on the fly. If the resulting report is sampled, a message appears at the top of the report, below and to the right of the report title, which says "This report is based on N sessions."
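The decision described above can be sketched in a few lines. This is only an illustration: the table catalogue, the function name, and the matching rule (a stored table must contain every requested dimension, and any segment forces a fallback to raw data) are assumptions made for the example, not the actual rules Analytics applies.

```python
# Hypothetical catalogue: each pre-aggregated table is described by the
# set of dimensions it stores. These names are invented for illustration.
PREAGGREGATED_TABLES = [
    frozenset({"page"}),
    frozenset({"source", "medium"}),
]


def can_use_preaggregates(dimensions, has_segment, tables=PREAGGREGATED_TABLES):
    """Return True if a stored table could answer the query on its own.

    Assumed rule: a segment always requires the raw session data, and
    otherwise one table must contain every requested dimension.
    """
    if has_segment:
        return False
    requested = frozenset(dimensions)
    return any(requested <= table for table in tables)


# A standard Pages report is served from the aggregates (no sampling).
standard_report = can_use_preaggregates(["page"], has_segment=False)
# Adding a secondary dimension or a segment falls back to raw sessions.
with_secondary = can_use_preaggregates(["page", "source"], has_segment=False)
with_segment = can_use_preaggregates(["page"], has_segment=True)
```

Only the queries that fall back to raw session data are candidates for session sampling.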

How ad-hoc reports work

If Analytics needs to compute aggregate data on the fly to satisfy the report query, it may sample the raw session data in order to reduce latency. Specifically, Analytics inspects the number of sessions for the specified date range at the property level. If the number of sessions in the property over the given date range exceeds 500k sessions (25M for Premium; see note 1), Analytics employs a sampling algorithm that uses a sample set proportional to the distribution of sessions by day for the selected date range. Thus, the session sampling rate varies for every query depending on the number of sessions included in the selected date range for the given property.
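As a rough illustration of how the sampling rate depends on the date range, the sketch below assumes that the sample set is capped at the threshold itself (500k sessions) and that each day contributes in proportion to its share of sessions. The cap-equals-threshold assumption and the function names are illustrative; the exact algorithm is not described in this article.

```python
def sampling_rate(total_sessions, threshold=500_000):
    """Fraction of sessions kept; 1.0 means the query is not sampled."""
    if total_sessions <= threshold:
        return 1.0
    return threshold / total_sessions


def sample_sizes_by_day(sessions_by_day, threshold=500_000):
    """Split the sample across days in proportion to each day's sessions."""
    total = sum(sessions_by_day.values())
    rate = sampling_rate(total, threshold)
    return {day: round(count * rate) for day, count in sessions_by_day.items()}


# Two days with 1M sessions in total: the assumed rate is 0.5, so the
# sample holds 300,000 sessions from the first day and 200,000 from the second.
sizes = sample_sizes_by_day({"2014-05-01": 600_000, "2014-05-02": 400_000})
```

Widening the date range raises the session total, which lowers the rate for that query; narrowing it below the threshold removes sampling entirely.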

Implications for filtered views and segments

Session sampling occurs at the property level, not the view level. For ad-hoc queries, the sample set is determined at the property level, and then the view-level filters are applied. As such, views that are filtered may have fewer sessions included in the sampled calculation. Similarly, Segments are applied after sampling, so fewer sessions may be included in the calculation.

Google Analytics Premium: For Google Analytics Premium, sampling occurs at the view level. As a result, view filters do not affect the sample size. However, Segments are applied after sampling, so fewer sessions may be included in the calculation.
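The effect of where sampling happens can be estimated with simple arithmetic. The sketch below assumes a uniform random sample capped at the threshold and returns expected values only; the function and parameter names are hypothetical.

```python
def sessions_in_calculation(property_sessions, view_share, segment_share=1.0,
                            threshold=500_000, sample_at_view_level=False):
    """Expected number of sampled sessions that reach the final calculation.

    view_share: fraction of property sessions that pass the view filters.
    segment_share: fraction of the view's sessions that match the segment.
    """
    if sample_at_view_level:
        # Premium behavior: filter first, then sample the view's sessions.
        sampled = min(property_sessions * view_share, threshold)
    else:
        # Standard behavior: sample the property, then apply view filters.
        sampled = min(property_sessions, threshold) * view_share
    # Segments are applied after sampling in both cases.
    return sampled * segment_share


# Property with 2M sessions, view filters keep 10% of them.
standard = sessions_in_calculation(2_000_000, 0.1)
view_level = sessions_in_calculation(2_000_000, 0.1, sample_at_view_level=True)
```

Under these assumptions the filtered view is calculated from about 50,000 sessions with property-level sampling, versus about 200,000 (the whole view, unsampled) when the same threshold is applied at the view level.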

Generally, session sampling reduces query latency while maintaining a high level of accuracy. Analytics sampling works well for fast, top N queries and other queries that have a relatively broad, uniform distribution across sessions. Session sampling can be less accurate for 'needle in a haystack' problems, such as single keyword analysis and long tail analysis. It is also less accurate in situations that involve narrow dimension filtering, such as heavily filtered views or conversion analysis where conversions constitute a small fraction of sessions. For those types of analysis, refer to Unsampled reports for one-time report needs and Custom Tables for ongoing unsampled data needs on a specific data set, both available to Google Analytics Premium accounts.
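To see why rare events are harder to estimate from a sample, consider the textbook standard error of a proportion under simple random sampling. This is a general statistical approximation, not a description of how Analytics computes or reports accuracy, and it ignores the finite population correction.

```python
import math


def relative_standard_error(p, n):
    """Relative standard error of a proportion p estimated from n sampled
    sessions, assuming simple random sampling: sqrt((1 - p) / (n * p))."""
    return math.sqrt((1 - p) / (n * p))


n = 500_000
# A broad metric: 20% of sessions match.
broad = relative_standard_error(0.2, n)
# A needle in a haystack: 0.01% of sessions match (about 50 in the sample).
rare = relative_standard_error(0.0001, n)
```

With 500,000 sampled sessions, the broad metric has a relative error of roughly 0.3%, while the rare one is closer to 14%, which is why narrow filters and low-volume conversions are better served by unsampled data.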

Dimension value aggregates

How standard reports work

The pre-aggregated tables per view are processed on a daily basis. These tables report data on all sessions, though there is a limit to the number of rows (distinct values) in each pre-aggregated table (see note 2). When a single table has more than 50k rows (75k for Premium) for a single day, Analytics keeps the top N values (see note 3) and creates an aggregate entry, labeled (other), for the remaining values.
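A minimal sketch of this roll-up follows. It assumes that N equals the row limit and that rows are ranked by a single metric; both are simplifications, and the function name is made up for the example.

```python
def aggregate_top_n(counts, limit):
    """Keep the `limit` largest values and fold the rest into "(other)".

    counts maps a dimension value (e.g. a page path) to its metric for
    one day. Tables at or under the limit are returned unchanged.
    """
    if len(counts) <= limit:
        return dict(counts)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:limit])
    kept["(other)"] = sum(value for _, value in ranked[limit:])
    return kept


# With a limit of 2 instead of 50k, the two smallest pages are merged.
daily_table = aggregate_top_n({"/a": 5, "/b": 3, "/c": 1, "/d": 1}, limit=2)
```

The totals are preserved: the (other) row carries the combined metric of every value that did not make the cut.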

Implications for multi-day requests

The top N entries are determined on a per-day basis. For example, if you select any single day in the Pages report, you will see at most 50k rows (75k for Premium); all other pages are aggregated into the (other) category. A page that is grouped into (other) on one day is not necessarily grouped into (other) on another day. So, when you run a report for a multi-day date range, there may be inconsistencies, because a page (or other dimension value) in the long tail may be counted in the (other) bucket on some days and in its own row on others.
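The inconsistency is easiest to see with a toy example. The sketch below applies a per-day top-N roll-up and then sums the daily tables; the limit of 2, the page names, and the helper functions are all invented for illustration.

```python
from collections import Counter

OTHER = "(other)"


def aggregate_top_n(counts, limit):
    """Keep the `limit` largest values for one day; fold the rest into (other)."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:limit])
    remainder = sum(value for _, value in ranked[limit:])
    if remainder:
        kept[OTHER] = remainder
    return kept


def multi_day_report(daily_counts, limit):
    """Sum the per-day tables after each has been rolled up separately."""
    report = Counter()
    for counts in daily_counts:
        report.update(aggregate_top_n(counts, limit))
    return dict(report)


day1 = {"/a": 10, "/b": 8, "/c": 5}   # /c falls into (other) on day 1
day2 = {"/a": 10, "/c": 9, "/b": 2}   # /b falls into (other) on day 2
report = multi_day_report([day1, day2], limit=2)
```

Here the report shows 8 pageviews for /b and 9 for /c, although the true two-day totals are 10 and 14; the missing 7 sit in the (other) row.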

Because dimension values (e.g., unique URLs and campaign keywords) often repeat across given days, this threshold typically only affects sites and apps that have many distinct pages/screens and/or keywords.

Learn more about how data is aggregated under (other).

How ad-hoc reports work

In cases where the user query cannot be satisfied by existing aggregates (i.e., pre-aggregated tables), Analytics goes back to the raw session data to compute the requested information. In this case, Analytics returns a maximum of 1M unique dimension values included in the sample set for the query.

Other reports

Sampling and Multi-Channel Funnel reports

The Multi-Channel Funnel reports are based on a maximum of 1M conversions. If the number of conversions exceeds 1M for the active date range, Analytics samples up to 1M conversions. Note that this sampling occurs at the view level, not the property level.

The maximum number of unique conversion paths is 200K per day. All other conversion paths are aggregated under (other).

Sampling and flow visualization reports

Flow visualization reports (including Visitors Flow and Goal Flow) are generated from a subset of 100K sessions for the active date range. Similar to standard report session sampling, the 100K sessions are sampled at the property level. Therefore, applying view filters or Segments can further reduce the sample set size.

For this reason, the figures in flow visualization reports, including entrance, exit, and conversion rates, may differ from the results in the standard content and conversion reports, which are based on a different sample set.

Data collection sampling

If your site or app has many millions of pageviews per month, you might consider configuring your tracking code to sample your data. For information on how to do this, follow the instructions in the Developer Guide for your specific environment.

By sampling the hits for your site or app, you will get reliable report results while staying within the hit limits for your account. The limit for a standard Analytics account is 10M hits/month. For Premium accounts, the hit limit is 1B+ hits/month. When data collection sampling is implemented, hits are discarded on the client side and are never collected or processed by Analytics. Therefore, discarded hits cannot be recovered through Premium unsampled reports. Also, unlike session sampling, Analytics does not extrapolate the report results based on the data collection sample rate. An added benefit of data collection sampling is that reports may load faster because there is less data in the account.
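Because Analytics does not extrapolate for data collection sampling, any scaling has to be done by hand. The sketch below shows one way to pick a collection rate that fits a monthly hit limit and to scale a reported count back up. The helper names are hypothetical, the rate is expressed as a percentage by choice, and the scaling is only meaningful for additive counts (sessions, pageviews), not for ratios such as bounce rate.

```python
def collection_sample_rate(expected_monthly_hits, hit_limit=10_000_000):
    """Largest percentage of traffic to collect without exceeding hit_limit."""
    return min(100.0, 100.0 * hit_limit / expected_monthly_hits)


def estimate_actual(reported_value, sample_rate_percent):
    """Scale a reported count back up to an estimate of the true count."""
    return reported_value * 100.0 / sample_rate_percent


# A site expecting 40M hits/month on a standard account (10M limit).
rate = collection_sample_rate(40_000_000)           # 25.0 (percent)
estimated_pageviews = estimate_actual(1_000, rate)  # 4000.0
```

Since whole users are kept or dropped, the collected hit volume will only approximately match the chosen percentage.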

Data collection sampling is applied consistently per user. Once a user has been selected for data collection, all sessions (including future sessions) for that user send data to Analytics. For mobile applications, this means that downloads of the application that have been selected for data collection send all of their data to Analytics, while other instances of the application do not send any hits.
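One common way to get this kind of per-user consistency is to derive the decision from a stable identifier rather than from a fresh random number on every hit. The sketch below illustrates the idea with a hash of a client ID; the hashing scheme, bucket count, and function name are assumptions for the example and are not taken from the Analytics tracking libraries.

```python
import hashlib


def user_is_sampled(client_id, sample_rate_percent):
    """Decide once per user whether hits are sent.

    The same client_id always maps to the same bucket, so a selected
    user stays selected for every current and future session.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sample_rate_percent * 100


# The decision is repeatable for a given user and rate.
first_visit = user_is_sampled("client-123", 25.0)
later_visit = user_is_sampled("client-123", 25.0)
```

Both calls return the same value, so a user's sessions are either all collected or all discarded.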

Note that even if the data for your site is not sampled when it is collected, certain types of reports will encounter other types of sampling, including session sampling and dimension value aggregation, based on the nature of the query. See How ad-hoc reports work for session sampling.

1 See adjusting the sample size.

2 Tables may correspond to a single report or multiple reports. Tables may contain a single dimension (e.g., Keyword) or multiple dimensions (e.g., Ad Group and Campaign). At the most granular level, reports will contain at most 50k rows of data (75k for Premium).

3 As determined by the relevant metric for the report/table (e.g. # sessions, # events, # page views, # transactions).
