Search Protocol Reference

Appendices

Appendix A: Estimated vs. Actual Number of Results


The Google Search Appliance does not guarantee the ability to return a particular number of results for any given search query. The total count of results is an estimate of the actual number of results for the search request. This section covers issues relating to this topic.

In search appliance software version 6.2 and later, the estimated number of results differs depending on whether filtering is enabled.

  • When filtering is not enabled, you see the estimated total number of results.
  • When filtering is enabled, for all but the last page of results you see the estimated total number of results. If you have requested the last page of results, then you see the total number of filtered results, which is likely to be much smaller than the estimated total number of results.

You can use the rc search parameter to request an accurate result count for up to 1 million documents, but doing so might introduce high latency.

Counting Results in Secure Search

The total count of search results is not provided when a secure search is performed, regardless of which output format, XML or HTML, is used. A secure search request includes the parameter access=a or access=s.

How the Google Search Appliance Determines the Number of Results to Return

When search results are returned, the number of results is determined by one of the following conditions:

  • If the Google Search Appliance has results to satisfy the search request, then the requested number of results are returned.
  • If the Google Search Appliance has fewer results than the number requested in the search request, the last page of results is returned. The last page is determined by dividing the total number of results into pages based on the number of results requested.
  • If no results are found, then an empty result set is returned.
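
The page arithmetic in the second condition can be sketched in Python as follows. This is only an illustration under stated assumptions: result offsets are taken to be zero-based, and the function and parameter names are hypothetical, not part of the search protocol.

```python
def last_page_start(total_results: int, page_size: int) -> int:
    """Return the assumed zero-based offset of the first result on the last page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    if total_results <= 0:
        return 0  # no results: the result set is empty
    # Divide the results into pages of page_size and take the start of the final page.
    return ((total_results - 1) // page_size) * page_size


def effective_start(requested_start: int, total_results: int, page_size: int) -> int:
    """Clamp a requested offset so that it never points past the last page."""
    return min(requested_start, last_page_start(total_results, page_size))
```

For example, with 25 actual results and 10 results per page, a request that starts at offset 50 would be served from offset 20 in this model.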

To determine if a results page is the last page of available results, check for any of the following conditions:

  • The first result number returned does not match the first result number requested.
  • The number of results returned is less than the number of results requested.
  • The results returned do not contain a link to the next result set.
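
The three checks above can be combined into a single test. The following Python sketch assumes that the caller has already extracted the relevant values from the response; the function and parameter names are hypothetical.

```python
def is_last_page(
    requested_start: int,
    requested_count: int,
    returned_start: int,
    returned_count: int,
    has_next_link: bool,
) -> bool:
    """Return True if any of the documented last-page conditions holds."""
    return (
        returned_start != requested_start      # first result number does not match
        or returned_count < requested_count    # fewer results than requested
        or not has_next_link                   # no link to the next result set
    )
```

Both start values must use the same numbering base for the first comparison to be meaningful.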

Navigation

When the total number of results returned is an estimate, the navigation structure for search results is based on this estimate. Google recommends two approaches for generating a navigation scheme for your search results:

  1. Only provide the search user with the ability to navigate to the previous results page and the next results page. The output format can be configured to provide links to the previous and next result set when appropriate.
  2. Provide the search user with the ability to jump to any search page within the estimated number of results. If the user requests a results page beyond the point at which results are actually available, the last results page is returned. The navigation structure is updated when the last page is displayed. This is the behavior you see in the default output of the Google Search Appliance.

Automatic Filtering

When the automatic filtering feature is active, the number of results returned is significantly reduced. Automatic filtering reduces undesirable results such as duplicate entries. You can disable this feature using the instructions in Automatic Filtering.

Filtered search results are identified in the returned results. For example, the <FI/> XML tag is present in XML search results where automatic document filtering occurs.

Google recommends that the search results page displays a message on the last page similar to the following, when automatic filtering occurs:

In order to show you the most relevant results, we have omitted some entries very similar to the search results already displayed. If you like, you can repeat the search with the omitted results included.

This is the behavior you see in the default output format of the Google Search Appliance.

The phrase “repeat the search with the omitted results included” in the message should be a hypertext link that submits the same search again with the parameter filter=0. Google finds that this method of informing users about automatic document filtering is effective. This method is used on the Google Internet search site.
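
One way to build the target of that link is to take the current search URL and set filter=0. The following Python sketch uses only the standard library; the host name in the usage example is a placeholder, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filtering_disabled(search_url: str) -> str:
    """Return the same search URL with the filter parameter set to 0."""
    parts = urlsplit(search_url)
    # Drop any existing filter parameter, keep everything else in order.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "filter"
    ]
    params.append(("filter", "0"))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

For example, with_filtering_disabled("http://gsa.example.com/search?q=report&start=20") yields "http://gsa.example.com/search?q=report&start=20&filter=0". Note that this sketch decodes and re-encodes parameter values using Python's rules, which may not match the encoding conventions in Appendix B for every parameter.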

If you are using OneBox modules to provide additional query results to your users, note that the results served through a OneBox module are reported separately. The number of OneBox results is not added to the number of standard results.

Appendix B: URL Encoding


Some characters are not safe to use in a URL without first being encoded. Because a Google Search Appliance request is made by using an HTTP URL, the search request must follow URL conventions, including character encoding, where necessary.

The HTTP URL syntax specifies that only alphanumeric characters, the special characters $-_.+!*'(), and the reserved characters ;/?:@=& can be used as values within an HTTP URL request. Since reserved characters are used by the search engine to decode the URL, and some special characters are used to request search features, all non-alphanumeric characters used as a value to an input parameter must be URL-encoded.

To URL-encode a string, replace each non-alphanumeric character with its hexadecimal ASCII value, in the format of a percent sign (%) character followed by two hexadecimal digits. Such an ASCII value may be referred to as an escape code. Spaces can be replaced by the plus sign (+) character for query parameters except when requesting search results by meta name or values.

If you are using the search box on the search appliance, single-encode the special characters $-.+!*'().

If you are using special characters in a search query, double-encode the special characters $-.+!*'().

Underscores (_) do not need to be URL-encoded in the search box or in a search query.

Some input parameters require that the values passed to Google search are double-URL-encoded. This requirement means that you must apply the URL encoding to the string twice in succession to generate the final value. See the input parameter descriptions (Search Parameters) for more information.

Special characters in a query are the ones described as query term separators (see Special Characters: Query Term Separators) and the ones that appear in meta tag names and values. Special characters within the document content are not indexed, so they are not searchable. For example, an indexed document containing a paragraph ending with “the *end” is not searchable using the query “%2Aend” in the GSA search box. Only ‘end’ is indexed.

For more information about URL encoding, see the W3C (http://www.w3.org/TR/html401/interact/forms.html#form-content-type) and IETF (http://www.ietf.org/rfc/rfc1738.txt) websites.

Examples

Original String                          URL-Encoded String
chicken -teriyaki                        chicken+%2Dteriyaki
admission form site:www.stanford.edu     admission+form+site%3Awww.stanford.edu

Original String                          Doubly URL-Encoded String
William Shakespeare                      William%2BShakespeare
admission form site:www.stanford.edu     admission%2Bform%2Bsite%253Awww.stanford.edu
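
The encoding rule and the examples above can be reproduced with a short Python sketch. It is an illustration, not the appliance's own implementation: the function names and the safe parameter are hypothetical. By default only underscores are left unencoded. The site: examples above leave periods unencoded, so the sketch passes safe="_." to match them. Spaces are always written as plus signs here, which does not apply when requesting results by meta name or values.

```python
def url_encode(value: str, safe: str = "_") -> str:
    """Percent-encode every character that is not alphanumeric or listed in safe."""
    encoded = []
    for byte in value.encode("utf-8"):
        char = chr(byte)
        if byte < 128 and (char.isalnum() or char in safe):
            encoded.append(char)            # ASCII letters, digits, and safe characters
        elif char == " ":
            encoded.append("+")             # spaces become plus signs
        else:
            encoded.append(f"%{byte:02X}")  # percent sign plus two hexadecimal digits
    return "".join(encoded)


def double_url_encode(value: str, safe: str = "_") -> str:
    """Apply url_encode twice in succession."""
    return url_encode(url_encode(value, safe), safe)
```

With these definitions, url_encode("chicken -teriyaki") returns "chicken+%2Dteriyaki" and double_url_encode("William Shakespeare") returns "William%2BShakespeare". The second pass turns each plus sign into %2B and each percent sign into %25, which is why %3A becomes %253A.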

Appendix C: Date Formatting


The search appliance recognizes dates in most reasonable formats. However, dates that only mention the year (YY or YYYY), such as 2008, are not used. For dates in the format month year, the date is assumed to be the first of the month. The search appliance currently recognizes most Latin1 month names, but not Chinese, Japanese, or Korean month names.

Format   Description                                                Example
YYYY     All digits in a year                                       2008
YY       Last two digits of a year                                  08
YR       All four digits or only the last two digits of the year    YY, YYYY
M        Month represented by one or two digits                     9 or 09
D        Day of the month represented by one or two digits          7 or 07
MM       Month represented by two digits                            04
DD       Day of the month represented by two digits                 07
WK       Day of the week                                            Monday or Mon
MON      Month                                                      March or Mar
O        The relationship of local time to Universal Time (UT)      (see below)

O is used in a standard date format that follows ISO/IEC 8824.

O is denoted by a plus sign (+), a minus sign (-), or the letter Z. A plus sign indicates that the local time is ahead of UT; a minus sign, behind UT; and the letter Z, equal to UT.

For example, Pacific Standard Time would be a minus sign because it is behind UT.

Acceptable Date Formats

The following table lists date formats that you can use with the Google Search Appliance.

Format           Separator          Example
YYYY-M-D         Hyphen             2008-2-27
YYYY-D-M         Hyphen             2008-27-2
YYYY.M.D         Period             2008.2.27
YYYY.D.M         Period             2008.27.2
YYYY/M/D         Slash              2008/2/27
YYYY/D/M         Slash              2008/27/2
D-M-YYYY         Hyphen             20-2-2008
M-D-YYYY         Hyphen             2-23-2008
D.M.YYYY         Period             20.2.2008
M.D.YYYY         Period             2.23.2008
D/M/YYYY         Slash              20/2/2008
M/D/YYYY         Slash              2/23/2008
YY-MM-DD         Hyphen             09-04-27
DD-MM-YY         Hyphen             27-04-09
MM-DD-YY         Hyphen             04-27-09
YY.MM.DD         Period             09.04.27
DD.MM.YY         Period             27.04.09
MM.DD.YY         Period             04.27.09
YY/MM/DD         Slash              09/04/27
DD/MM/YY         Slash              27/04/09
MM/DD/YY         Slash              04/27/09
WK, D MON, YR    Comma              Tue, 3 March, 2009
WK, MON D, YR    Comma              Tue, March 3, 2009
D MON, YR        Space and comma    2 Jan, 09
MON YYYY         Space              March 2009
MON D, YR        Space and comma    Mar 03, 09
MON YY           Space              Mar 09
YYYYMMDDHHmm     (none)             200903211642 (see Note 1 below)
YYYYMMDDHH       (none)             2009082116
YYYYMMDD         (none)             20090323
YYYYMM           (none)             200903
YYYY             (none)             2009
DDMMYYYY         (none)             23032009
MMDDYYYY         (none)             03232009
YYMMDD           (none)             090225
DDMMYY           (none)             150209
MMDDYY           (none)             021509

Date Formatting Notes

  1. The YYYYMMDDHH and YYYYMMDDHHmm patterns for specifying dates are supported; however, the search appliance has no notion of sorting search results based on the difference in time of day between document dates. For example, if one document has a meta tag with a value of 200910212150 and a second document has a value of 200910210900, the search appliance discards both dates and sets the document dates to their modification times (because the YYYYMMDDHHmm format does not get parsed).
  2. Use meta tags with dates in the ISO-8601 format (YYYY-MM-DD) to avoid the confusion caused by multiple dates and multiple formats in the title or text of the documents.
  3. The date of each file is returned in the date field of the results. This cannot be turned off, but you can choose not to display it on the front end to your users. To learn more about sorting by date, see Sorting.
  4. If no date is found for a file, it is indexed without date data. Results that do not contain date data are displayed at the end of the results with dates, sorted by relevance.
  5. If you have documents that contain exceptions to the default dates rule, enter the specific URL or pattern for the file and place these rules at the top of your list. The rules are handled in the order in which they are specified in the rule list. The first rule containing a valid date for the document determines the date of the document.
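
The confusion mentioned in note 2 can be illustrated with a short Python sketch. It uses Python's own datetime.strptime, not the search appliance's parser, and the format table and function name are hypothetical. It shows how one string can be valid under more than one of the accepted formats, while an ISO-8601 date has a single reading.

```python
from datetime import date, datetime

# A few of the accepted formats, expressed as Python strptime patterns.
FORMATS = {
    "D/M/YYYY": "%d/%m/%Y",
    "M/D/YYYY": "%m/%d/%Y",
    "YYYY-MM-DD": "%Y-%m-%d",
}


def interpretations(text: str) -> dict:
    """Return every format under which text parses, with the resulting date."""
    results = {}
    for name, pattern in FORMATS.items():
        try:
            results[name] = datetime.strptime(text, pattern).date()
        except ValueError:
            continue  # text is not valid in this format
    return results
```

Here interpretations("2/3/2008") returns two different dates (2 March 2008 and 3 February 2008), whereas interpretations("2008-02-27") returns only one.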

To specify rules for dates of documents:

  1. Click Crawl and Index > Document Dates.
  2. In the Host or URL Pattern column, enter the host or pattern to which the rule will apply.
  3. Use the drop-down list in the Locate Date In column to select the location of the date for the documents in the specified URL pattern.
  4. If you select Meta Tag, specify the name of the meta tag in the Meta Tag Name column.
  5. To add more rules, click the Add More Lines button.
  6. After all the rules are specified, click the Save Changes button.

Examples of Rules

Rule #   Host or URL Pattern       Date Located In   Meta Tag Name
1        www.foo.com/example/      Title
2        www.foo2.com/archives/    URL
3        www.foo.com/              Meta Tag          publication_date
4        www.foo2.com/             Body
5        /                         Last Modified

Because the document http://www.foo.com/example/foo.html matches the URL pattern in rule 1, the search appliance first checks for the date in the title of the document. The URL doesn’t match rule 2, so the search appliance skips that rule and checks against rule 3. If the search appliance is unable to find a valid date in the title, it looks for the date in the meta tag named publication_date according to rule 3. If the search appliance is unable to find a valid date in the meta tag, it falls back to the last-modified date reported by the HTTP server, according to rule 5.

The search appliance extracts the date from the http://www.foo2.com/archives/20040605/abc.html URL.

Because the document http://www.foo.com/foo.html does not match the URL pattern in rule 1, the search appliance looks for the date in the meta tag, according to rule 3 and defaults to rule 5 if the search appliance cannot find a valid date in rule 3.

For the document http://www.foo2.com/foo.html, the search appliance looks for the date in the body and defaults to the last-modified date.

For the document http://www.foo3.com/foo.html, the search appliance looks for the date only on the last-modified header as it only matches the URL pattern of rule 5.
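
The rule ordering in these examples can be modeled with a short Python sketch. It is a simplification for illustration only: URL patterns are treated as plain prefixes of the URL without its scheme, which is an assumption and not the appliance's full pattern syntax, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# The example rules, in the order in which they are specified.
RULES: List[Tuple[str, str]] = [
    ("www.foo.com/example/", "Title"),
    ("www.foo2.com/archives/", "URL"),
    ("www.foo.com/", "Meta Tag"),
    ("www.foo2.com/", "Body"),
    ("/", "Last Modified"),
]


def matching_locations(url: str) -> List[str]:
    """Return the date locations to try for url, in rule order."""
    stripped = url.split("://", 1)[-1]  # drop the scheme, if any
    return [
        location
        for pattern, location in RULES
        if pattern == "/" or stripped.startswith(pattern)  # "/" matches every URL
    ]


def resolve_date(url: str, dates_found: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the date from the first matching rule that yields a valid date."""
    for location in matching_locations(url):
        value = dates_found.get(location)
        if value:
            return value
    return None
```

For http://www.foo.com/example/foo.html, matching_locations returns Title, Meta Tag, and Last Modified, in that order. If no date is found in the title, resolve_date moves on to the publication_date meta tag and then to the last-modified date.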

Appendix D: Compressed Results


The Google Search Appliance supports serving compressed results.

The search appliance serves compressed results to browsers that support compression. The browser must send the following HTTP header to the search appliance:

Accept-Encoding: gzip

The search appliance will then serve compressed results. The browser uncompresses the results.

This applies to both XML and XSLT-transformed results. If the Accept-Encoding: gzip header is not present, the results are not compressed.
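
A client other than a browser can request compressed results in the same way. The following Python sketch uses only the standard library; the function names are hypothetical, and the host name in the usage note is a placeholder. It assumes that a compressed response is labeled with a Content-Encoding: gzip response header.

```python
import gzip
import urllib.request
from typing import Optional


def build_search_request(url: str) -> urllib.request.Request:
    """Create a request that tells the server gzip-compressed results are accepted."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Uncompress the response body if it was served with gzip encoding."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body  # results were not compressed
```

To use the sketch, pass the request to urllib.request.urlopen, read the response body, and call decode_body with the value of the response's Content-Encoding header, for a URL such as http://gsa.example.com/search?q=test.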
