Administering Crawl

Introduction

Crawling is the process where the Google Search Appliance discovers enterprise content to index. This chapter provides an overview of how the Google Search Appliance crawls public content.

For information about specific feature limitations, see Specifications and Usage Limits.


What Is Search Appliance Crawling?


Before anyone can use the Google Search Appliance to search your enterprise content, the search appliance must build the search index, which enables search queries to be quickly matched to results. To build the search index, the search appliance must browse, or “crawl,” your enterprise content, as illustrated in the following example.

The administration at Missitucky University plans to offer its staff, faculty, and students simple, fast, and secure search across all their content using the Google Search Appliance. To achieve this goal, the search appliance must crawl their content, starting at the Missitucky University Web site’s home page.

Missitucky University has a Web site that provides categories of information such as Admissions, Class Schedules, Events, and News Stories. The Web site’s home page lists hyperlinks to other URLs for pages in each of these categories. For example, the News Stories hyperlink on the home page points to a URL for a page that contains hyperlinks to all recent news stories. Similarly, each news story contains hyperlinks that point to other URLs.

The relations among the hyperlinks within the Missitucky University Web site constitute a virtual web, or pathway that connects the URLs to each other. Starting at the home page and following this pathway, the search appliance can crawl from URL to URL, browsing content as it goes.

Crawling Missitucky University’s content actually begins with a list of URLs (“start URLs”) where the search appliance should start browsing; in this example, the first start URL is the Missitucky University home page.

The search appliance visits the Missitucky University home page, then it:

  1. Identifies all the hyperlinks on the page. These hyperlinks are known as “newly discovered URLs.”
  2. Adds the hyperlinks to a list of URLs to visit. The list is known as the “crawl queue.”
  3. Visits the next URL in the crawl queue.

By repeating these steps for each URL in the crawl queue, the search appliance can crawl all of Missitucky University’s content. As a result, the search appliance gathers the information that it needs to build the search index, and ultimately, to serve search results to end users.
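The three-step loop above, together with a visited set to avoid refetching the same page, can be sketched as follows. The site map and the link-extraction function are hypothetical stand-ins for real fetching and HTML parsing:

```python
from collections import deque

def crawl(start_urls, fetch_links):
    """Breadth-first crawl sketch: fetch_links(url) returns the
    hyperlinks found on a page (a stand-in for fetching and parsing)."""
    crawl_queue = deque(start_urls)   # the list of URLs to visit
    visited = set()
    while crawl_queue:
        url = crawl_queue.popleft()   # visit the next URL in the crawl queue
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url): # newly discovered URLs
            if link not in visited:
                crawl_queue.append(link)
    return visited

# Toy site: the home page links to two category pages, one of which links on.
site = {
    "/": ["/news", "/events"],
    "/news": ["/news/story1"],
    "/events": [],
    "/news/story1": [],
}
print(sorted(crawl(["/"], lambda u: site.get(u, []))))
# ['/', '/events', '/news', '/news/story1']
```

Starting from the home page alone, every linked page is eventually discovered, which is how the search appliance gathers the material for its index.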

Because Missitucky University’s content changes constantly, the search appliance continuously crawls it to keep the search index and the search results up-to-date.

Crawl Modes

The Google Search Appliance supports two modes of crawling:

  • Continuous crawl
  • Scheduled crawl

For information about choosing a crawl mode and starting a crawl, see Selecting a Crawl Mode.

Continuous Crawl

In continuous crawl mode, the search appliance is crawling your enterprise content at all times, ensuring that newly added or updated content is added to the index as quickly as possible. After the Google Search Appliance is installed, it defaults to continuous crawl mode and establishes the default collection (see Default Collection).

The search appliance does not recrawl any URLs until all new URLs have been discovered or the license limit has been reached (see What Is the Search Appliance License Limit?). A URL in the index is recrawled even if there are no longer any links to that URL from other pages in the index.

Scheduled Crawl

In scheduled crawl mode, the Google Search Appliance crawls your enterprise content at a scheduled time.


What Content Can Be Crawled?


The Google Search Appliance can crawl and index content that is stored in the following types of sources:

  • Public Web servers
  • Secure Web servers
  • Compressed files

Crawling FTP is not supported on the Google Search Appliance.

Public Web Content

Public Web content is available to all users. The Google Search Appliance can crawl and index both public and secure enterprise content that resides on a variety of Web servers, including these:

  • Apache HTTP server
  • BroadVision Web server
  • Sun Java System Web server
  • Microsoft Commerce server
  • Lotus Domino Enterprise server
  • IBM WebSphere server
  • BEA WebLogic server
  • Oracle server

Secure Web Content

Secure Web content is protected by authentication mechanisms and is available only to users who are members of certain authorized groups. The Google Search Appliance can crawl and index secure content protected by:

  • Basic authentication
  • NTLM authentication

The search appliance can crawl and index content protected by forms-based single sign-on systems.

For HTTPS websites, the Google Search Appliance uses a serving certificate as a client certificate when crawling. You can upload a new serving certificate using the Admin Console. Some Web servers do not accept client certificates unless they are signed by trusted Certificate Authorities.

Compressed Files

The Google Search Appliance supports crawling and indexing compressed files in the following formats: .zip, .tar, .tar.gz, and .tgz.

For more information, refer to Crawling and Indexing Compressed Files.


What Content Is Not Crawled?


The Google Search Appliance does not crawl or index enterprise content that is excluded by these mechanisms:

  • Crawl patterns
  • robots.txt
  • nofollow Robots META tag

Also, the Google Search Appliance cannot:

  • Follow any links that appear within an HTML area tag.
  • Discover unlinked URLs. However, you can enable them for crawling.
  • Crawl any content residing in the 192.168.255 subnet, because this subnet is used for internal configuration.

The following sections describe all these exclusions.

Content Prohibited by Crawl Patterns

A Google Search Appliance administrator can prohibit the crawler from following and indexing particular URLs. For example, any URL that should not appear in search results or be counted as part of the search appliance license limit should be excluded from crawling. For more information, refer to Configuring a Crawl.

Content Prohibited by a robots.txt File

To prohibit any crawler from accessing all or some of the content on an HTTP or HTTPS site, a content server administrator or webmaster typically adds a robots.txt file to the root directory of the content server or Web site. This file tells the crawlers to ignore all or some files and directories on the server or site. Documents crawled using other protocols, such as SMB, are not affected by the restrictions of robots.txt. For the Google Search Appliance to be able to access the robots.txt file, the file must be public. For examples of robots.txt files, see Using robots.txt to Control Access to a Content Server.

The Google Search Appliance crawler always obeys the rules in robots.txt. You cannot override this feature. Before crawling HTTP or HTTPS URLs on a host, a Google Search Appliance fetches the robots.txt file. For example, before crawling any URLs on http://www.mycompany.com/ or https://www.mycompany.com/, the search appliance fetches http://www.mycompany.com/robots.txt.

When the search appliance requests the robots.txt file, the host returns an HTTP response that determines whether or not the search appliance can crawl the site. The Google Search Appliance crawler responds to each type of HTTP response as follows:

  • 200 OK (file returned): The search appliance crawler obeys exclusions specified by robots.txt when fetching URLs on the site.
  • 404 Not Found (no file returned): The search appliance crawler assumes that there are no exclusions to crawling the site and proceeds to fetch URLs.
  • Any other response: The search appliance crawler assumes that it is not permitted to crawl the site and does not fetch URLs.

When crawling, the search appliance caches robots.txt files and refetches a robots.txt file if 30 minutes have passed since the previous fetch. If changes to a robots.txt file prohibit access to documents that have already been indexed, those documents are removed from the index. If the search appliance can no longer access robots.txt on a particular site, all the URLs on that site are removed from the index.
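The fetch-obey-cache behavior can be sketched with Python's standard urllib.robotparser. The 30-minute refetch interval follows the description above; the per-host cache and the crawler name are illustrative assumptions, not the appliance's internals:

```python
import urllib.robotparser

ROBOTS_TTL = 30 * 60  # refetch robots.txt after 30 minutes
_cache = {}           # host -> (parser, time of last fetch)

def robots_for(host, fetch_robots, now):
    """Return a cached RobotFileParser for `host`, refetching the
    robots.txt text (via fetch_robots) once the 30-minute TTL expires."""
    entry = _cache.get(host)
    if entry is None or now - entry[1] > ROBOTS_TTL:
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(fetch_robots(host).splitlines())
        _cache[host] = (parser, now)
    return _cache[host][0]

# Example robots.txt that excludes /private/ for all crawlers.
robots_txt = "User-agent: *\nDisallow: /private/\n"
rp = robots_for("www.mycompany.com", lambda h: robots_txt, now=0.0)
print(rp.can_fetch("gsa-crawler", "http://www.mycompany.com/index.html"))  # True
print(rp.can_fetch("gsa-crawler", "http://www.mycompany.com/private/x"))   # False
```

A second lookup within the TTL returns the cached parser without refetching; after 30 minutes, the file is fetched again and any new exclusions take effect.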

For detailed information about HTTP status codes, visit http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.

Content Excluded by the nofollow Robots META Tag

The Google Search Appliance does not crawl a Web page if it has been marked with the nofollow Robots META tag (see Using Robots meta Tags to Control Access to a Web Page).

Links within the area Tag

The Google Search Appliance does not crawl links that are embedded within an area tag. The HTML area tag is used to define a mouse-sensitive region on a page, which can contain a hyperlink. When the user moves the pointer into a region defined by an area tag, the arrow pointer changes to a hand and the URL of the associated hyperlink appears at the bottom of the window.

For example, the following HTML defines a region that contains a link:

<map name="n5BDE56.Body.1.4A70">
  <area shape="rect" coords="0,116,311,138" id="TechInfoCenter"
        href="http://www.bbb.com/main/help/ourcampaign/ourcampaign.html">
</map>

When the search appliance crawler follows newly discovered links, it does not follow the link (http://www.bbb.com/main/help/ourcampaign/ourcampaign.html) within this area tag.

Unlinked URLs

Because the Google Search Appliance crawler discovers new content by following links within documents, it cannot find a URL that is not linked from another document through this process.

You can enable the search appliance crawler to discover any unlinked URLs in your enterprise content by:

  • Adding unlinked URLs to the crawl path.
  • Using a jump page (see Ensuring that Unlinked URLs Are Crawled), which is a page that can provide links to pages that are not linked to from any other pages. List unlinked URLs on a jump page and add the URL of the jump page to the crawl path.
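A jump page is ordinary HTML. In this hypothetical example, the two target pages are not linked from anywhere else, and the jump page's own URL would be added to the crawl path:

```html
<!-- jumppage.html: add this page's URL to the crawl path -->
<html><body>
  <a href="http://intranet.example.com/reports/q1.html">Q1 report</a>
  <a href="http://intranet.example.com/reports/q2.html">Q2 report</a>
</body></html>
```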


Configuring the Crawl Path and Preparing the Content


Before crawling starts, the Google Search Appliance administrator configures the crawl path (see Configuring a Crawl), which includes URLs where crawling should start, as well as URL patterns that the crawler should follow and should not follow. Other information that webmasters, content owners, and search appliance administrators typically prepare before crawling starts includes:

  • Robots exclusion protocol (robots.txt) for each content server that it crawls
  • Robots META tags embedded in the header of an HTML document
  • googleon/googleoff tags embedded in the body of an HTML document
  • Jump pages


How Does the Search Appliance Crawl?


This section describes how the Google Search Appliance crawls Web and network file share content as it applies to both scheduled crawl and continuous crawl modes.

About the Diagrams in this Section

This section contains data flow diagrams, used to illustrate how the Google Search Appliance crawls enterprise content. The diagrams use the following symbols:

  • Start state or stop state (for example, start crawl or end crawl)
  • Process (for example, follow links within the document)
  • Data store, which can be a database, file system, or any other type of data store (for example, the crawl queue)
  • Data flow among processes, data stores, and external interactors (for example, URLs)
  • External input or terminator, which can be a process in another diagram (for example, delete URL)
  • Callout to a diagram element

Crawl Overview

The following diagram provides an overview of these major crawling processes:

  • Starting the crawl and populating the crawl queue
  • Attempting to fetch a URL and index the document
  • Following links within the document

The sections following the diagram provide details about each of these major processes.

Starting the Crawl and Populating the Crawl Queue

The crawl queue is a list of URLs that the Google Search Appliance will crawl. The search appliance associates each URL in the crawl queue with a priority, typically based on estimated Enterprise PageRank. Enterprise PageRank is a measure of the relative importance of a Web page within the set of your enterprise content. It is calculated using a link-analysis algorithm similar to the one used to calculate PageRank on google.com.

The order in which the Google Search Appliance crawls URLs is determined by the crawl queue. The following priorities are assigned to URLs in the crawl queue, from highest to lowest:

  • Start URLs (highest priority): fixed priority.
  • New URLs that have never been crawled: estimated Enterprise PageRank.
  • Newly discovered URLs: for a new crawl, estimated Enterprise PageRank; for a recrawl, estimated Enterprise PageRank and a factor that ensures that new documents are crawled before previously indexed content.
  • URLs that are already in the index (lowest priority): Enterprise PageRank, the time of the last crawl, and estimated change frequency.

By crawling URLs in this priority, the search appliance ensures that the freshest, most relevant enterprise content appears in the index.
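The crawl queue behaves like a priority queue: the highest-priority URL is always fetched next. A minimal sketch with Python's heapq; the priority values here are illustrative, not the appliance's actual scores:

```python
import heapq

class CrawlQueue:
    """Min-heap keyed on negative priority, so the highest-priority
    URL comes out first; a counter breaks ties in insertion order."""
    def __init__(self):
        self._heap = []
        self._count = 0

    def add(self, url, priority):
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2]

q = CrawlQueue()
q.add("http://www.mycompany.com/old-page.html", priority=0.2)  # already indexed
q.add("http://www.mycompany.com/", priority=1.0)               # start URL
q.add("http://www.mycompany.com/new-page.html", priority=0.6)  # newly discovered
print(q.next_url())  # the start URL comes out first
```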

After configuring the crawl path and preparing content for crawling, the search appliance administrator starts a continuous or scheduled crawl (see Selecting a Crawl Mode). The following diagram provides an overview of starting the crawl and populating the crawl queue.

When crawling begins, the search appliance populates the crawl queue with URLs:

  • For a new crawl, the crawl queue contains the start URLs that the search appliance administrator has configured.
  • For a recrawl, the crawl queue contains the start URLs that the search appliance administrator has configured, plus the complete set of URLs contained in the current index.

Attempting to Fetch a URL and Indexing the Document

The Google Search Appliance crawler attempts to fetch the URL with the highest priority in the crawl queue. The following diagram provides an overview of this process.

If the search appliance successfully fetches a URL, it downloads the document. If you have enabled and configured infinite space detection, the search appliance uses the checksum to test if there are already 20 documents with the same checksum in the index (20 is the default value, but you can change it when you configure infinite space detection). If there are 20 documents with the same checksum in the index, the document is considered a duplicate and discarded (in Index Diagnostics, the document is shown as “Considered Duplicate”). If there are fewer than 20 documents with the same checksum in the index, the search appliance caches the document for indexing. For more information, refer to Enabling Infinite Space Detection.
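The duplicate test under infinite space detection can be sketched as follows. The threshold of 20 follows the description above; the choice of checksum function is an illustrative assumption:

```python
import hashlib
from collections import Counter

DUPLICATE_THRESHOLD = 20  # default; configurable with infinite space detection

checksum_counts = Counter()  # checksum -> documents already indexed with it

def should_index(content: bytes) -> bool:
    """Return False ("Considered Duplicate") once the threshold is reached."""
    digest = hashlib.sha1(content).hexdigest()
    if checksum_counts[digest] >= DUPLICATE_THRESHOLD:
        return False
    checksum_counts[digest] += 1
    return True

page = b"<html>same boilerplate page</html>"
results = [should_index(page) for _ in range(25)]
print(results.count(True))  # 20: the remaining 5 are considered duplicates
```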

Generally, if the search appliance fails to fetch a URL, it deletes the URL from the crawl queue. Depending on several factors, the search appliance may take further action when it fails to fetch a URL.

When fetching documents from a slow server, the search appliance paces the process so that it does not cause server problems. The search appliance administrator can also adjust the number of concurrent connections to a server by configuring the web server host load schedule (see Configuring Web Server Host Load Schedules).

Determining Document Changes with If-Modified-Since Headers and the Content Checksum

During the recrawl of an indexed document, the Google Search Appliance sends the If-Modified-Since header based on the last crawl date of the document. If the web server returns a 304 Not Modified response, the appliance does not further process the document. If the web server returns content, the Google Search Appliance uses the Last-Modified header, if present, to detect change. If the Last-Modified header is not present, the search appliance computes the checksum of the newly downloaded content and compares it to the checksum of the previous content. If the checksum is the same, then the appliance does not further process the document.

To detect changes to a cached document when recrawling it, the search appliance:

  1. Downloads the document.
  2. Computes a checksum of the file.
  3. Compares the checksum to the checksum that was stored in the index the last time the document was indexed.
  4. If the checksum has not changed, the search appliance stops processing the document and retains the cached document.

If the checksum has changed since the last modification time, the search appliance determines the size of the file (see File Type and Size), modifies the file as necessary, follows newly discovered links within the document (see Following Links within the Document), and indexes the document.
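The change-detection steps above can be sketched as follows. Here fetch is a hypothetical stand-in for the HTTP client; the 304 handling and the checksum comparison mirror the description:

```python
import hashlib

def recrawl(url, last_crawl_date, old_checksum, fetch):
    """Return (changed, checksum). `fetch` takes a URL and an
    If-Modified-Since date and returns (status, body)."""
    status, body = fetch(url, if_modified_since=last_crawl_date)
    if status == 304:                  # Not Modified: stop processing
        return False, old_checksum
    new_checksum = hashlib.sha1(body).hexdigest()
    if new_checksum == old_checksum:   # same content: keep cached document
        return False, old_checksum
    return True, new_checksum          # changed: reindex, follow new links

# Toy server that ignores If-Modified-Since and always returns new content.
def fetch(url, if_modified_since):
    return 200, b"updated body"

old = hashlib.sha1(b"original body").hexdigest()
changed, new = recrawl("http://www.mycompany.com/a.html",
                       "Mon, 01 Jan 2024 00:00:00 GMT", old, fetch)
print(changed)  # True: the checksum differs from the indexed version
```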

File Type and Size

When the Google Search Appliance fetches a document, it determines the type and size of the file. The search appliance attempts to determine the type of the file by first examining the Content-Type header. Provided that the Content-Type header is present at crawl time, the search appliance crawls and indexes files where the content type does not match the file extension. For example, an HTML file saved with a PDF extension is correctly crawled and indexed as an HTML file.

If the search appliance cannot determine the content type from the Content-Type header, it examines the file extension by parsing the URL.
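This header-first, extension-fallback logic can be sketched as follows; the extension-to-type mapping is a hypothetical illustration, not the appliance's actual table:

```python
import os
from urllib.parse import urlparse

# Minimal extension map for illustration only.
EXTENSION_TYPES = {".html": "text/html", ".pdf": "application/pdf",
                   ".txt": "text/plain"}

def document_type(url, content_type_header=None):
    """Prefer the Content-Type header; fall back to the URL's extension."""
    if content_type_header:
        return content_type_header.split(";")[0].strip()
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return EXTENSION_TYPES.get(ext, "application/octet-stream")

# An HTML file saved with a .pdf extension is still treated as HTML
# when the server sends the Content-Type header.
print(document_type("http://host/report.pdf", "text/html; charset=utf-8"))  # text/html
print(document_type("http://host/report.pdf"))  # application/pdf (fallback)
```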

As a search appliance administrator, you can change the maximum file size for the downloader to use when crawling documents. By default, the maximum file sizes are:

  • 20MB for text or HTML documents
  • 100MB for all other document types

To change the maximum file size, enter new values on the Content Sources > Web Crawl > Host Load Schedule page. For more information about setting the maximum file size to download, click Admin Console Help > Content Sources > Web Crawl > Host Load Schedule.

If the document is:

  • A text or HTML document that is larger than the maximum file size, the search appliance truncates the file and discards the remainder of the file
  • Any other type of document that does not exceed the maximum file size, the search appliance converts the document to HTML
  • Any other type of document that is larger than the maximum file size, the search appliance discards it completely

By default, the search appliance indexes up to 2.5MB of each text or HTML document, including documents that have been truncated or converted to HTML. You can change the default by entering a new amount of up to 10MB. For more information, refer to Changing the Amount of Each Document that Is Indexed.
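The download and indexing size rules above can be sketched as follows, under the simplifying assumption that conversion to HTML preserves approximate size:

```python
MAX_TEXT_HTML = 20 * 1024 * 1024      # default max download, text/HTML
MAX_OTHER = 100 * 1024 * 1024         # default max download, other types
INDEX_LIMIT = int(2.5 * 1024 * 1024)  # default amount indexed per document

def bytes_indexed(size: int, is_text_or_html: bool) -> int:
    """Return how many bytes of a document end up indexed (0 = discarded)."""
    if is_text_or_html:
        size = min(size, MAX_TEXT_HTML)  # truncated at the max file size
        return min(size, INDEX_LIMIT)
    if size > MAX_OTHER:
        return 0                         # discarded completely
    # Other types are converted to HTML, then the index limit applies.
    return min(size, INDEX_LIMIT)
```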

Compressed document types, such as Microsoft Office 2007, might not be converted properly if the uncompressed file size is greater than the maximum file size. In these cases, you see a conversion error message on the Index > Diagnostics > Index Diagnostics page.

LINK Tags in HTML Headers

The search appliance indexes LINK tags in HTML headers. However, it strips these headers from cached HTML pages to avoid cross-site scripting (XSS) attacks.

Following Links within the Document

For each document that it indexes, the Google Search Appliance follows newly discovered URLs (HTML links) within that document. When following URLs, the search appliance observes the index limit that is set on the Index > Index Settings page in the Admin Console. For example, if the index limit is 5MB, the search appliance only follows URLs within the first 5MB of a document. There is no limit to the number of URLs that can be followed from one document.

Before following a newly discovered link, the search appliance checks the URL against:

  • The robots.txt file for the site
  • Follow and crawl URL patterns
  • Do not crawl URL patterns

If the URL passes these checks, the search appliance adds the URL to the crawl queue, and eventually crawls it. If the URL does not pass these checks, the search appliance deletes it from the crawl queue. The following diagram provides an overview of this process.
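The three checks above can be sketched as a single gate; fnmatch-style wildcard patterns are an illustrative assumption, not the appliance's actual pattern syntax:

```python
from fnmatch import fnmatch

def passes_checks(url, robots_allows, follow_patterns, do_not_crawl_patterns):
    """Decide whether a newly discovered URL joins the crawl queue."""
    if not robots_allows(url):                              # robots.txt check
        return False
    if not any(fnmatch(url, p) for p in follow_patterns):   # follow patterns
        return False
    if any(fnmatch(url, p) for p in do_not_crawl_patterns): # do-not-crawl patterns
        return False
    return True

follow = ["http://www.mycompany.com/*"]
block = ["*/private/*", "*.cgi"]
print(passes_checks("http://www.mycompany.com/news.html",
                    lambda u: True, follow, block))          # True
print(passes_checks("http://www.mycompany.com/private/x.html",
                    lambda u: True, follow, block))          # False
```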

The search appliance crawler only follows HTML links in the following format:

<a href="/page2.html">link to page 2</a>

It follows HTML links in PDF files, Word documents, and Shockwave documents. The search appliance also supports JavaScript crawling (see JavaScript Crawling) and can detect links and content generated dynamically through JavaScript execution.


When Does Crawling End?


The Google Search Appliance administrator can end a continuous crawl by pausing it (see Stopping, Pausing, or Resuming a Crawl).

The search appliance administrator can configure a scheduled crawl to end at a specified time. A scheduled crawl also ends when the license limit is reached (see What Is the Search Appliance License Limit?). In detail, a scheduled crawl ends under any of the following conditions:

  • Scheduled end time: Crawling stops at its scheduled end time.
  • Crawl to completion: There are no more URLs in the crawl queue. The search appliance crawler has discovered and attempted to fetch all reachable content that matches the configured URL patterns.
  • The license limit is reached: The search appliance license limits the maximum number of URLs in the index. When the search appliance reaches this limit, it stops crawling new URLs and removes the excess URLs (see Are Documents Removed From the Index?) from the crawl queue.


When Is New Content Available in Search Results?


For both scheduled crawls and continuous crawls, documents usually appear in search results approximately 30 minutes after they are crawled. This period can increase if the system is under a heavy load, or if there are many non-HTML documents (see Non-HTML Content).

For a recrawl, if an older version of a document is cached in the index from a previous crawl, the search results refer to the cached document until the new version is available.


How Are URLs Scheduled for Recrawl?


The search appliance determines the priority of URLs for recrawl using the following rules, listed in order from highest to lowest priority:

  1. URLs that are designated for recrawl by the administrator; for example, when you request that a certain URL pattern be crawled by using the Content Sources > Web Crawl > Start and Block URLs, Content Sources > Web Crawl > Freshness Tuning, or Index > Diagnostics > Index Diagnostics page in the Admin Console, or when URLs are sent in web feeds with the crawl-immediately attribute for the record set to true.
  2. URLs that are set to crawl frequently on the Content Sources > Web Crawl > Freshness Tuning page and have not been crawled in the last 23 hours.
  3. URLs that have not been crawled yet.
  4. URLs that have already been crawled. The priority of a crawled URL is based mostly on the number of links from a start URL. The last crawl date and the frequency with which the URL changes also contribute to its priority: URLs with a crawl date further in the past and URLs that change more frequently get higher priority.

Other factors also contribute to whether a URL is recrawled; for example, how fast the host responds and whether the URL returned an error on the last crawl attempt.

If you need to give URLs high priority, you can change their priority in the following ways:

  • You can submit a recrawl request by using the Content Sources > Web Crawl > Start and Block URLs, Content Sources > Web Crawl > Freshness Tuning, or Index > Diagnostics > Index Diagnostics pages, which gives the URLs the highest priority possible.
  • You can submit a web feed, which makes the URL’s priority identical to an uncrawled URL’s priority.
  • You can add a URL to the Crawl Frequently list on the Content Sources > Web Crawl > Freshness Tuning page, which ensures that the URL gets crawled about every 24 hours.

To see how often a URL has been recrawled in the past, as well as the status of the URL, you can view the crawl history of a single URL by using the Index > Diagnostics > Index Diagnostics page in the Admin Console.


How Are Network Connectivity Issues Handled?


When crawling, the Google Search Appliance tests network connectivity by attempting to fetch every start URL every 30 minutes. If approximately 10% of the start URLs return HTTP 200 (OK) responses, the search appliance assumes that there are no network connectivity issues. If less than 10% return OK responses, the search appliance assumes that there are network connectivity issues with a content server and slows down or stops.

During a temporary network outage, slowing or stopping a crawl prevents the search appliance from removing URLs that it cannot reach from the index. The crawl speeds up or restarts when the start URL connectivity test returns an HTTP 200 response.
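The connectivity heuristic can be sketched as follows; probe stands in for an HTTP request against a start URL, and the 10% threshold follows the description above:

```python
def network_ok(start_urls, probe, threshold=0.10):
    """Return True when at least ~10% of start URLs answer HTTP 200."""
    if not start_urls:
        return True
    ok = sum(1 for url in start_urls if probe(url) == 200)
    return ok / len(start_urls) >= threshold

# 1 of 20 start URLs reachable: below 10%, so crawling slows or stops.
responses = {f"http://host{i}/": (200 if i == 0 else 500) for i in range(20)}
print(network_ok(list(responses), responses.get))  # False
```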


What Is the Search Appliance License Limit?


Your Google Search Appliance license determines the number of documents that can appear in your index, as listed in the following table.

Search Appliance Model    Maximum License Limit
GB-7007                   10 million
GB-9009                   30 million
G100                      20 million
G500                      100 million

Google Search Appliance License Limit

For a Google Search Appliance, between 500,000 and 100 million documents can appear in the index, depending on your model and license.

For example, if the license limit is 10 million, the search appliance crawler attempts to put up to 10 million documents in the index. During a recrawl, when the crawler discovers a new URL, it must decide whether to crawl the document.

When the search appliance reaches its limit, it stops crawling new URLs, and removes documents from the index to bring the total number of documents to the license limit.

Google recommends managing crawl patterns on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console to ensure that the total number of URLs that match the crawl patterns remains at or below the license limit.

When Is a Document Counted as Part of the License Limit?

Generally, when the Google Search Appliance successfully fetches a document, the document is counted as part of the license limit. If the search appliance does not successfully fetch a document, the document is not counted. In detail:

  • The search appliance fetches a URL without errors, including HTTP responses 200 (success), 302 (redirect, URL moved temporarily), and 304 (not modified): The URL is counted as part of the license limit.
  • The search appliance receives a 301 (redirect, URL moved permanently) when it attempts to fetch a document, and then fetches the URL without error at its destination: The destination URL is counted as part of the license limit, but the source URL is not.
  • The search appliance cannot fetch a URL and instead receives an HTTP error response, such as 404 (document not found) or 500 (temporary server error): The URL is not counted as part of the license limit.
  • The search appliance fetches two URLs that contain exactly the same content without errors: Both URLs are counted as part of the license limit, but the one with the lower Enterprise PageRank is automatically filtered out of search results. This automatic filtering cannot be overridden.
  • The search appliance fetches a document from a file share: The document is counted as part of the license limit.
  • The SharePoint connector indexes a folder: Each folder is indexed as a document and counted as part of the license limit.

If there are one or more robots meta tags embedded in the head of a document, they can affect whether the document is counted as part of the license limit. For more information about this topic, see Using Robots meta Tags to Control Access to a Web Page.

To view license information for your Google Search Appliance, use the Administration > License page. For more information about this page, click Admin Console Help > Administration > License in the Admin Console.

License Expiration and Grace Period

Google Search Appliance licensing has a grace period, which starts when the license expires and lasts for 30 days. During the 30-day grace period, the search appliance continues to crawl, index, and serve documents. At the end of the grace period, it stops crawling, indexing, and serving.

If you have configured your search appliance to receive email notifications, you will receive daily emails during the grace period. The emails notify you that your search appliance license has expired and that it will stop crawling, indexing, and serving in n days, where n is the number of days left in your grace period.

At the end of the grace period, the search appliance will send one email stating that the license has completely expired, the grace period has ended, and the software has stopped crawling, indexing, and serving. The Admin Console on the search appliance will still be accessible at the end of the grace period.

To configure your search appliance to receive email notifications, use the Administration > System Settings page. For more information about this page, click Admin Console Help > Administration > System Settings in the Admin Console.


How Many URLs Can Be Crawled?


There is a maximum number of URLs that the Google Search Appliance crawler can store for crawling. The maximum number depends on the search appliance model and license limit, as listed in the following table.

Search Appliance Model    Maximum License Limit    Maximum Number of URLs that Match Crawl Patterns
GB-7007                   10 million               ~13.6 million
GB-9009                   30 million               ~40 million
G100                      20 million               ~133 million
G500                      100 million              ~666 million

If the Google Search Appliance has reached the maximum number of URLs that can be crawled, this number appears in URLs Found That Match Crawl Patterns on the Content Sources > Diagnostics > Crawl Status page in the Admin Console.

Once the maximum number is reached, a new URL is considered for crawling only if it has a higher priority than the least important known URL. In this instance, the higher priority URL is crawled and the lower priority URL is discarded.
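This admission rule can be sketched as a bounded priority structure: once full, a new URL displaces the least important known URL only if it outranks it. The priority values are illustrative:

```python
import heapq

class BoundedCrawlSet:
    """Keep at most `capacity` URLs; once full, a new URL is admitted
    only if it outranks the least important URL, which is discarded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (priority, url): root = least important

    def offer(self, url, priority):
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (priority, url))
            return True
        if priority > self._heap[0][0]:
            heapq.heapreplace(self._heap, (priority, url))  # drop least important
            return True
        return False  # lower priority than everything known: discarded

s = BoundedCrawlSet(capacity=2)
s.offer("http://a/", 0.9)
s.offer("http://b/", 0.5)
print(s.offer("http://c/", 0.7))  # True: replaces http://b/
print(s.offer("http://d/", 0.1))  # False: discarded
```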

For an overview of the priorities assigned to URLs in the crawl queue, see Starting the Crawl and Populating the Crawl Queue.


How Are Document Dates Handled?


To enable search results to be sorted and presented based on dates, the Google Search Appliance extracts dates from documents according to rules configured by the search appliance administrator (see Defining Document Date Rules).

In Google Search Appliance software version 4.4.68 and later, document dates are extracted from Web pages when the document is indexed.

The search appliance extracts the first date for a document with a matching URL pattern that fits the date format associated with the rule. If a date is written in an ambiguous format, the search appliance assumes that it matches the most common format among URLs that match each rule for each domain that is crawled. For this purpose, a domain is one level above the top level. For example, mycompany.com is a domain, but intranet.mycompany.com is not a domain.

The search appliance periodically runs a process that calculates which of the supported date formats is the most common for a rule and a domain. After calculating the statistics for each rule and domain, the process may modify the dates in the index. The process first runs 12 hours after the search appliance is installed, and thereafter, every seven days. The process also runs each time you change the document date rules.
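The "most common format" calculation can be sketched as follows; the two candidate formats and the parsing are a simplification to the ambiguous numeric date orders, not the appliance's full set of supported formats:

```python
from collections import Counter
from datetime import datetime

CANDIDATES = ["%m/%d/%Y", "%d/%m/%Y"]  # two ambiguous numeric date orders

def most_common_format(date_strings):
    """Pick the candidate format that successfully parses the most dates,
    as the periodic statistics process does per rule and domain."""
    counts = Counter()
    for fmt in CANDIDATES:
        for s in date_strings:
            try:
                datetime.strptime(s, fmt)
                counts[fmt] += 1
            except ValueError:
                pass
    return counts.most_common(1)[0][0]

# 13/05/2024 only parses day-first, so day-first wins for this domain.
print(most_common_format(["13/05/2024", "22/07/2024", "01/02/2024"]))  # %d/%m/%Y
```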

The search appliance will not change which date is most common for a rule until after the process has run. Regardless of how often the process runs, the search appliance will not change the date format more than once a day. The search appliance will not change the date format unless 5,000 documents have been crawled since the process last ran.

If you import a configuration file with new document dates after the process has first run, then you may have to wait at least seven days for the dates to be extracted correctly. The reason is that the date formats associated with the new rules are not calculated until the process runs. If no dates were found the first time the process ran, then no dates are extracted until the process runs again.

If no date is found, the search appliance indexes the document without a date.

Normally, document dates appear in search results about 30 minutes after they are extracted. In larger indexes, the process can take several hours to complete because the process may have to look at the contents of every document.

Back to top

Are Documents Removed From the Index?


The Google Search Appliance index includes all the documents it has crawled. These documents remain in the index and the search appliance continues to crawl them until one of the following conditions is true:

  • The search appliance administrator resets the index.
  • The search appliance removes the document from the index during the document removal process.

The search appliance administrator can also manually remove documents from the index (see Removing Documents from the Index).

Removing all links to a document in the index does not remove the document from the index.

Document Removal Process

The following conditions cause documents to be removed from the index:

  • The license limit is exceeded. The limit on the number of URLs in the index is the value of Maximum number of pages overall on the Administration > License page.

  • The crawl pattern is changed. To determine which content should be included in the index, the search appliance uses the start URLs, follow and crawl patterns, and do not crawl patterns specified on the Content Sources > Web Crawl > Start and Block URLs page. If these URL patterns are modified, the search appliance examines each document in the index to determine whether it should be retained or removed. If a URL does not match any follow and crawl patterns, or if it matches any do not crawl patterns, it is removed from the index. Document URLs disappear from search results between 15 minutes and six hours after the pattern changes, depending on system load.

  • The robots.txt file is changed. If the robots.txt file for a content server or web site is changed to prohibit search appliance crawler access, URLs for the server or site are removed from the index.

  • Authentication fails (401). If the search appliance receives three successive 401 (authentication failure) errors from the Web server when attempting to fetch a document, the document is removed from the index after the third failed attempt.

  • The document is not found (404). If the search appliance receives a 404 (Document not found) error from the Web server when attempting to fetch a document, the document is removed from the index.

  • The document is indexed, but removed from the content server. See What Happens When Documents Are Removed from Content Servers?
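For example, a robots.txt rule like the following blocks the search appliance crawler (assuming the crawler's default user agent name, gsa-crawler, has not been changed), causing indexed URLs from that site to be removed:

```
User-agent: gsa-crawler
Disallow: /
```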

Back to top

What Happens When Documents Are Removed from Content Servers?


During the recrawl of an indexed document, the search appliance sends an If-Modified-Since header based on the last crawl date of the document. Even if a document has been removed from a content server, the search appliance makes several attempts to recrawl the URL before removing the document from the index.
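A conditional recrawl request of this kind can be sketched as follows. This is a simplified illustration, not the appliance's actual logic; the function names and the action labels are made up:

```python
from email.utils import formatdate

def conditional_fetch_headers(last_crawl_timestamp):
    """Build the conditional-request header a recrawl would carry,
    given the Unix timestamp of the document's last crawl."""
    return {"If-Modified-Since": formatdate(last_crawl_timestamp, usegmt=True)}

def interpret(status_code):
    """Map a server response code to a crawl action."""
    if status_code == 304:   # Not Modified: keep the indexed copy as-is
        return "unchanged"
    if status_code == 200:   # fresh content returned: reindex the document
        return "reindex"
    return "retry-later"     # errors feed the recrawl-attempt schedule
```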

When a document is removed from the index, it disappears from the search results. However, the search appliance maintains the document in its internal status table. For this reason, the URL might still appear in Index Diagnostics.

The following list shows the timing of recrawl attempts and removal of documents from the index in different scenarios.

  • The search appliance encounters a server error during crawling, such as a timeout (500 error code) or forbidden (403 error code) response.
    First recrawl attempt: 1 day
    Second recrawl attempt: 3 days
    Third recrawl attempt: 1 week
    Fourth recrawl attempt: 3 weeks
    The document is removed if the search appliance encounters the error a fourth time.

  • The search appliance finds the server unreachable during crawling, which might be caused by network issues, such as DNS server failures.
    First recrawl attempt: 5 hours
    Second recrawl attempt: 1 day
    Third recrawl attempt: 5 days
    Fourth recrawl attempt: 3 weeks
    The document is removed if the search appliance encounters the error a fourth time.

  • The search appliance is blocked by a robots meta tag.
    First recrawl attempt: 5 days
    Second recrawl attempt: 15 days
    Third recrawl attempt: 1 month
    The document is removed if the search appliance encounters the error a third time.

  • The search appliance encounters garbage data, that is, data that is similar to other documents but is not marked as a duplicate.
    First recrawl attempt: 1 day
    Second recrawl attempt: 1 week
    Third recrawl attempt: 1 month
    Fourth recrawl attempt: 3 months
    The document is removed if the search appliance encounters the error a fourth time.
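These schedules can be summarized in a simple lookup table. The error-class names below are illustrative, not the appliance's own terminology:

```python
# Recrawl schedules per error class, as intervals after each failure.
# After the last listed attempt fails, the document is removed from the index.
RECRAWL_SCHEDULE = {
    "server_error": ["1 day", "3 days", "1 week", "3 weeks"],
    "unreachable":  ["5 hours", "1 day", "5 days", "3 weeks"],
    "robots_meta":  ["5 days", "15 days", "1 month"],
    "garbage_data": ["1 day", "1 week", "1 month", "3 months"],
}

def attempts_before_removal(error_class):
    """Number of failed recrawl attempts tolerated before removal."""
    return len(RECRAWL_SCHEDULE[error_class])
```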

Back to top
