
Administering Crawl

Monitoring and Troubleshooting Crawls

Crawling is the process where the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators how to monitor a crawl. It also describes how to troubleshoot some common problems that may occur during a crawl.


Using the Admin Console to Monitor a Crawl


The Admin Console provides Reports pages that enable you to monitor crawling. The following table describes monitoring tasks that you can perform using these pages and the Admin Console page to use for each task.

Task: Monitor crawling status
Admin Console Page: Content Sources > Diagnostics > Crawl Status

While the Google Search Appliance is crawling, you can view summary information about events of the past 24 hours using the Content Sources > Diagnostics > Crawl Status page.

You can also use this page to stop a scheduled crawl, or to pause or restart a continuous crawl (see Stopping, Pausing, or Resuming a Crawl).

Task: Monitor crawling history
Admin Console Page: Index > Diagnostics > Index Diagnostics

While the Google Search Appliance is crawling, you can view its history using the Index > Diagnostics > Index Diagnostics page. Index diagnostics, as well as search logs and search reports, are organized by collection (see Using Collections).

When the Index > Diagnostics > Index Diagnostics page first appears, it shows the crawl history for the current domain. It shows each URL that has been fetched and timestamps for the last 10 fetches. If the fetch was not successful, an error message is also listed.

From the domain level, you can navigate to lower levels that show the history for a particular host, directory, or URL. At each level, the Index > Diagnostics > Index Diagnostics page displays information that is pertinent to the selected level.

At the URL level, the Index > Diagnostics > Index Diagnostics page shows summary information as well as a detailed Crawl History.

In release 7.6.250, you can find the crawl diagnostics for a document by using the Search for URLs button.

You can also use this page to submit a URL for recrawl (see Submitting a URL to Be Recrawled).

Task: Take a snapshot of the crawl queue
Admin Console Page: Content Sources > Diagnostics > Crawl Queue

At any time while the Google Search Appliance is crawling, you can define and view a snapshot of the crawl queue using the Content Sources > Diagnostics > Crawl Queue page. A crawl queue snapshot displays the URLs that are waiting to be crawled as of the moment of the snapshot.

For each URL, the snapshot shows:

  • Enterprise PageRank
  • Last crawled time
  • Next scheduled crawl time
  • Change interval

Task: View information about crawled files
Admin Console Page: Index > Diagnostics > Content Statistics

At any time while the Google Search Appliance is crawling, you can view summary information about files that have been crawled using the Index > Diagnostics > Content Statistics page. You can also use this page to export the summary information to a comma-separated values file.

Crawl Status Messages

In the Crawl History for a specific URL on the Index > Diagnostics > Index Diagnostics page, the Crawl Status column lists various messages, as described in the following table.

Crawl Status Message

Description

Crawled: New Document

The Google Search Appliance successfully fetched this URL.

Crawled: Cached Version

The Google Search Appliance crawled the cached version of the document. The search appliance sent an If-Modified-Since field in the HTTP request header and received a 304 response, indicating that the document is unchanged since the last crawl.

Retrying URL: Connection Timed Out

The Google Search Appliance set up a connection to the Web server and sent its request, but the Web server did not respond, or the HTTP transaction did not complete, within three minutes.

Retrying URL: Host Unreachable while trying to fetch robots.txt

The Google Search Appliance could not connect to a Web server when trying to fetch robots.txt.

Retrying URL: Network unreachable during fetch

The Google Search Appliance could not connect to a Web server due to a networking issue.

Retrying URL: Received 500 server error

The Google Search Appliance received a 500 status message from the Web server, indicating that there was an internal error on the server.

Excluded: Document not found (404)

The Google Search Appliance did not successfully fetch this URL. The Web server responded with a 404 status, which indicates that the document was not found. If a URL gets a status 404 when it is recrawled, it is removed from the index within 30 minutes.

Cookie Server Failed

The Google Search Appliance did not successfully fetch a cookie using the cookie rule. Before crawling any Web pages that match patterns defined for Forms Authentication, the search appliance executes the cookie rules.

Error: Permanent DNS failure

The Google Search Appliance cannot resolve the host. A possible reason is a change in your DNS servers while the appliance is still trying to access the previously cached IP address.

The crawler caches the results of DNS queries for a long time, regardless of the TTL values specified in the DNS response. A workaround is to save and then revert a pattern change on the Content Sources > Web Crawl > Proxy Servers page. Saving changes on this page causes internal processes to restart and flushes the DNS cache.
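
Before applying this workaround, you can confirm what the host currently resolves to from a machine on the same network as the search appliance; if the result differs from the IP address the appliance is using, stale DNS caching is the likely cause. For example, assuming the affected host is myserver.com (a hypothetical name):

# Either command prints the IP address that the host name currently resolves to
nslookup myserver.com
dig +short myserver.com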


Network Connectivity Test of Start URLs Failed


When crawling, the Google Search Appliance tests network connectivity by attempting to fetch every start URL every 30 minutes. If fewer than 10% of the start URLs return OK responses, the search appliance assumes that there are network connectivity issues with a content server, slows down or stops the crawl, and displays the following message: “Crawl has stopped because network connectivity test of Start URLs failed.” The crawl restarts when the start URL connectivity test returns an HTTP 200 response.
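
You can also check an individual start URL manually from a machine on the same network as the search appliance. A minimal check, assuming curl is available and http://myserver.com/ stands in for one of your start URLs (a hypothetical example):

# Print only the HTTP status code; 200 means the start URL is reachable
curl -s -o /dev/null -w "%{http_code}\n" http://myserver.com/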


Slow Crawl Rate


The Content Sources > Diagnostics > Crawl Status page in the Admin Console displays the Current Crawling Rate, which is the number of URLs being crawled per second. Slow crawling may be caused by the following factors:

  • Non-HTML content
  • Complex content
  • Host load
  • Network problems
  • Slow Web servers
  • Query load

These factors are described in the following sections.

Non-HTML Content

The Google Search Appliance converts non-HTML documents, such as PDF files and Microsoft Office documents, to HTML before indexing them. This is a CPU-intensive process that can take up to five seconds per document. If more than 100 documents are queued up for conversion to HTML, the search appliance stops fetching more URLs.

You can see the HTML that is produced by this process by clicking the cached link for a document in the search results.

If the search appliance is crawling a single UNIX/Linux Web server, you can run the tail command-line utility on the server access logs to see what was recently crawled. The tail utility copies the last part of a file. You can also run the tcpdump command to create a dump of network traffic that you can use to analyze a crawl.
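
For example, on a Web server running Apache, you might watch the access log and capture crawler traffic as follows (the log path, the network interface eth0, and the search appliance IP address 10.1.1.5 are hypothetical; adjust them for your environment):

# Follow the access log to see requests as the crawler makes them
tail -f /var/log/apache2/access.log

# Capture HTTP traffic from the search appliance for later analysis (typically requires root)
tcpdump -i eth0 -w crawl.pcap host 10.1.1.5 and port 80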

If the search appliance is crawling multiple Web servers, it can crawl through a proxy, so that you can examine a single set of proxy logs to see what is being crawled.

Complex Content

Crawling many complex documents can cause a slow crawl rate.

To ensure that static complex documents are not recrawled as often as dynamic documents, add the URL patterns to the Crawl Infrequently URLs on the Content Sources > Web Crawl > Freshness Tuning page (see Freshness Tuning).

Host Load

If the Google Search Appliance crawler receives many temporary server errors (500 status codes) when crawling a host, crawling slows down.

To speed up crawling, you may need to increase the value of concurrent connections to the Web server by using the Content Sources > Web Crawl > Host Load Schedule page (see Configuring Web Server Host Load Schedules).

Network Problems

Network problems, such as latency, packet loss, or reduced bandwidth, can be caused by several factors, including:

  • Hardware errors on a network device
  • A switch port set to a wrong speed or duplex
  • A saturated CPU on a network device

To find out what is causing a network problem, you can run tests from a device on the same network as the search appliance.

Use the wget program (available on most operating systems) to retrieve some large files from the Web server, with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network problems.
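
For example, assuming a large document at http://myserver.com/largefile.pdf (a hypothetical URL), run the same download with crawling running and with crawling paused, and compare the elapsed time and transfer rate that wget reports:

# Download the file without saving it; note the reported time and rate
wget -O /dev/null http://myserver.com/largefile.pdf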

Run the traceroute network tool from a device on the same network as the search appliance and the Web server. If your network does not permit Internet Control Message Protocol (ICMP), then you can use tcptraceroute. You should run the traceroute with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network performance problems.
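
For example, assuming the Web server host is myserver.com (a hypothetical name):

# Trace the route to the Web server
traceroute myserver.com

# If ICMP is blocked, trace using TCP packets to the HTTP port instead
tcptraceroute myserver.com 80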

Packet loss is another indicator of a problem. You can narrow down the network hop that is causing the problem by seeing if there is a jump in the times taken at one point on the route.

Slow Web Servers

If response times are slow, you may have a slow Web server. To find out if your Web server is slow, use the wget command to retrieve some large files from the Web server. If it takes approximately the same time using wget as it does while crawling, you may have a slow Web server.
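
For example, assuming curl is available and http://myserver.com/largefile.pdf is a representative document (a hypothetical URL), you can also measure the total response time directly:

# Print the total transfer time in seconds without saving the document
curl -s -o /dev/null -w "time_total: %{time_total}s\n" http://myserver.com/largefile.pdf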

You can also log in to a Web server to determine whether there are any internal bottlenecks.

If you have a slow host, the search appliance crawler fetches lower-priority URLs from other hosts while continuing to crawl the slower host.

Query Load

The crawl processes on the search appliance are run at a lower priority than the processes that serve results. If the search appliance is heavily loaded serving search queries, the crawl rate drops.


Wait Times


During continuous crawling, you may find that the Google Search Appliance is not recrawling URLs as quickly as specified by scheduled crawl times in the crawl queue snapshot. The amount of time that a URL has been in the crawl queue past its scheduled recrawl time is the URL’s “wait time.”

Wait times can occur when your enterprise content includes:

  • Large numbers of documents
  • Large PDF files or Microsoft Office documents
  • Many frequently changing URLs
  • New content with high Enterprise PageRank

If the search appliance crawler needs four hours to catch up to the URLs in the crawl queue whose scheduled crawl time has already passed, the wait time for crawling the URLs is four hours. In extreme cases, wait times can be several days. The search appliance cannot recrawl a URL more frequently than the wait time.

It is not possible for an administrator to view the maximum wait time for URLs in the crawl queue or to view the number of URLs in the queue whose scheduled crawl time has passed. However, you can use the Content Sources > Diagnostics > Crawl Queue page to create a crawl queue snapshot, which shows:

  • Last time a URL was crawled
  • Next scheduled crawl time for a URL


Errors from Web Servers


If the Google Search Appliance receives an error when fetching a URL, it records the error in Index > Diagnostics > Index Diagnostics. By default, the search appliance takes action based on whether the error is permanent or temporary:

  • Permanent errors--Permanent errors occur when the document is no longer reachable using the URL. When the search appliance encounters a permanent error, it removes the document from the crawl queue; however, the URL is not removed from the index.
  • Temporary errors--Temporary errors occur when the URL is unavailable because of a temporary move or a temporary user or server error. When the search appliance encounters a temporary error, it retains the document in the crawl queue and the index, and schedules a series of retries after certain time intervals, known as “backoff” intervals, before removing the URL from the index. The search appliance maintains an error count for each URL, and the time interval between retries increases as the error count rises. The maximum backoff interval is three weeks.

You can either use the search appliance default settings for index removal and backoff intervals, or configure the following options for the selected error state:

  • Immediate Index Removal--Select this option to immediately remove the URL from the index.
  • Number of Failures for Index Removal--Use this option to specify the number of times the search appliance is to retry fetching a URL.
  • Successive Backoff Intervals (hours)--Use this option to specify the number of hours between backoff intervals.

To configure settings, use the options in the Configure Backoff Retries and Remove Index Information section of the Content Sources > Web Crawl > Crawl Schedule page in the Admin Console. For more information about configuring settings, click Admin Console Help > Content Sources > Web Crawl > Crawl Schedule.

The following list shows permanent and temporary Web server errors. For detailed information about HTTP status codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.

  • 301 (Permanent)--Redirect, URL moved permanently.
  • 302 (Temporary)--Redirect, URL moved temporarily.
  • 401 (Temporary)--Authentication required.
  • 404 (Temporary)--Document not found. URLs that get a 404 status response when they are recrawled are removed from the index within 30 minutes.
  • 500 (Temporary)--Temporary server error.
  • 501 (Permanent)--Not implemented.

In addition, the search appliance crawler refrains from visiting Web pages that have noindex and nofollow Robots META tags. For URLs excluded by Robots META tags, the maximum retry interval is one month.

You can view errors for a specific URL in the Crawl Status column on the Index > Diagnostics > Index Diagnostics page.

URL Moved Permanently Redirect (301)

When the Google Search Appliance crawls a URL that has moved permanently, the Web server returns a 301 status. For example, the search appliance crawls the old address, http://myserver.com/301-source.html, and is redirected to the new address, http://myserver.com/301-destination.html. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL displays “Source page of permanent redirect” for the source URL and “Crawled: New Document” for the destination URL.

In search results, the URL of the 301 redirect appears as the URL of the destination page.

For example, if a user searches for info:http://myserver.com/301-source.html, the results display http://myserver.com/301-destination.html.

To enable search results to display a 301 redirect, ensure that start and follow URL patterns on the Content Sources > Web Crawl > Start and Block URLs page match both the source page and the destination page.
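
You can also verify the redirect itself from a machine on the same network as the search appliance; for example, assuming curl is available:

# Fetch the response headers only; look for a 301 status and a Location header
# that points to http://myserver.com/301-destination.html
curl -sI http://myserver.com/301-source.html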

URL Moved Temporarily Redirect (302)

When the Google Search Appliance crawls a URL that has moved temporarily, the Web server returns a 302 status. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL shows the following value for the source page:

  • Crawled: New Document

There is no entry for the destination page in a 302 redirect.

In search results, the URL of the 302 redirect appears as the URL of the source page.

If the redirect destination URL does not match a Follow pattern, or matches a Do Not Follow Pattern, on the Content Sources > Web Crawl > Start and Block URLs page, the document is not immediately deleted from search results. On the Index > Diagnostics > Index Diagnostics page, the Crawl Status of the URL shows the following value for the source page:

  • Excluded: In "Do Not Crawl" URLs.

The search appliance attempts to recrawl the document according to the default backoff intervals of 24 and 168 hours, and the document is deleted after three failures.

A META tag that specifies http-equiv="refresh" is handled as a 302 redirect.
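
For example, a page whose head section contains a tag like the following (the destination URL is hypothetical) is treated as a 302 redirect to that URL:

<meta http-equiv="refresh" content="0; url=http://myserver.com/destination.html">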

Authentication Required (401) or Document Not Found (404) for SMB File Share Crawls

When the Google Search Appliance attempts to crawl content on SMB-based file systems, the Web server might return a 401 or 404 status. If this happens, take the following actions:

  • Ensure that the URL patterns entered on the Content Sources > Web Crawl > Start and Block URLs page are in the format smb://<hostname>.<domain>/<share>/
  • Ensure that you have entered the appropriate patterns for authentication on the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.
  • If the document that returns the error requires authentication, ensure that:
    • The authentication rule is appropriately configured with a URL pattern for this document or set of documents
    • You have provided the proper user name, domain, and password for the document
    • There are no special characters in the password. If the password includes special characters, try setting a password without special characters to see whether that resolves the issue

On the file share server, ensure that the directories or files you have configured for crawling are not empty. Also, on the file share server (in the configuration panel), verify that:

  • The file share is not part of a Distributed File System (DFS) configuration
  • Basic Authentication or NTLM is used as the authentication protocol
  • Permissions are set properly (the configured user has read access to this share, and the share’s settings allow permissions that include listing its files)
  • For a Windows file share, the read permissions are set specifically for the configured user on the Security tab of the share’s Properties dialog

Also, ensure that the file server accepts inbound TCP connections on ports 139 and 445. These ports on the file share need to be accessible by the search appliance. You can verify whether the ports are open by using the nmap command on a machine on the same subnet as the search appliance. Run the following command:

nmap <fileshare host> -p 139,445

The response needs to be “open” for both. If the nmap command is not available on the machine you are using, you can use the telnet command for each of the ports individually. Run the following commands:

telnet <fileshare-host> 139
telnet <fileshare-host> 445

A connection should be established rather than refused.

If the search appliance is crawling a Windows file share, verify that NTLMv2 is enabled on the Windows file share by following section 10 in Microsoft Support’s document (http://support.microsoft.com/kb/823659). Note that NTLMv1 is insecure and is not supported.

You can also use a script on the Google Search Appliance Admin Toolkit project page for additional diagnostics outside the search appliance. To access the script, visit http://gsa-admin-toolkit.googlecode.com/svn/trunk/smbcrawler.py.

Cyclic Redirects

A cyclic redirect is a request for a URL in which the response is a redirect back to the same URL with a new cookie. The search appliance detects cyclic redirects and sets the appropriate cookie.


URL Rewrite Rules


In certain cases, you may notice URLs in the Admin Console that differ slightly from the URLs in your environment. The reason for this is that the Google Search Appliance automatically rewrites or rejects a URL if the URL matches certain patterns. The search appliance rewrites or rejects URLs for the following reasons:

  • To avoid crawling duplicate content
  • To avoid crawling URLs that cause a state change (such as changing or deleting a value) in the Web server
  • To reject URLs that are binary files

Before rewriting a URL, the search appliance crawler attempts to match it against each of the patterns described for:

  • BroadVision Web Server
  • Sun Java System Web Server
  • Microsoft Commerce Server
  • Servers that Run Java Servlet Containers
  • Lotus Domino Enterprise Server
  • ColdFusion Application Server
  • Index Pages

If the URL matches one of the patterns, it is rewritten or rejected before it is fetched.

BroadVision Web Server

In URLs for BroadVision Web server, the Google Search Appliance removes the BV_SessionID and BV_EngineID parameters before fetching URLs.

For example, before the rewrite, this is the URL:

http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp?BV_SessionID=NNNN0974886399.1076010447NNNN&BV_EngineID=ccceadcjdhdfelgcefe4ecefedghhdfjk.0

After the rewrite, this is the URL:

http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp

Sun Java System Web Server

In URLs for Sun Java System Web Server, the Google Search Appliance removes the GXHC_qx_session_id parameter before fetching URLs.

Microsoft Commerce Server

In URLs for Microsoft Commerce Server, the Google Search Appliance removes the shopperID parameter before fetching URLs.

For example, before the rewrite, this is the URL:

http://www.shoprogers.com/homeen.asp?shopperID=PBA1XEW6H5458NRV2VGQ909

After the rewrite, this is the URL:

http://www.shoprogers.com/homeen.asp

Servers that Run Java Servlet Containers

In URLs for servers that run Java servlet containers, the Google Search Appliance removes jsessionid, $jsessionid$, and $sessionid$ parameters before fetching URLs.

Lotus Domino Enterprise Server

Lotus Domino Enterprise URL patterns are case-sensitive and are normally recognized by the presence of .nsf in the URL along with a well-known command such as “OpenDocument” or “ReadForm.” If your Lotus Domino Enterprise URL does not match any of the cases below, it does not trigger the rewrite or reject rules.

The Google Search Appliance rejects URL patterns that contain:

  • The Collapse parameter
  • SearchView, SearchSite, or SearchDomain
  • The Navigate parameter and either To=Prev or To=Next
  • ExpandSection or ExpandOutline parameters, unless they represent a single-section expansion
  • $OLEOBJINFO or FieldElemFormat
  • CreateDocument, DeleteDocument, SaveDocument, or EditDocument
  • OpenAgent, OpenHelp, OpenAbout, or OpenIcon
  • ReadViewEntries

The search appliance rewrites:

  • OpenDocument URLs
  • URLs with # suffixes
  • Multiple versions of the same URL

The following sections provide details about search appliance rewrite rules for Lotus Domino Enterprise server.

OpenDocument URLs

The Google Search Appliance rewrites OpenDocument URLs to substitute a 0 for the view name. This is a method for accessing the document regardless of view, and stops the search appliance crawler from fetching multiple views of the same document.

The syntax for this type of URL is http://Host/Database/View/DocumentID?OpenDocument. The search appliance rewrites this as http://Host/Database/0/DocumentID?OpenDocument.

For example, before the rewrite, this is the URL:

http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf/8d7955daacc5bdbd852567a1005ae562/c8dac6f3fef2f475852567a6005fb38f

After the rewrite, this is the URL:

http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf/0/c8dac6f3fef2f475852567a6005fb38f?OpenDocument

URLs with # Suffixes

The Google Search Appliance removes suffixes that begin with # from URLs that have no parameters.

Multiple Versions of the Same URL

The Google Search Appliance converts a URL that has multiple possible representations into one standard, or canonical URL. The search appliance does this conversion so that it does not fetch multiple versions of the same URL with differing order of parameters. The search appliance’s canonical URL has the following syntax for the parameters that follow the question mark:

  • ?Command&Start=&Count=&Expand&...
  • ?Command&Start=&Count=&ExpandView&...

To convert a URL to a canonical URL, the search appliance makes the following changes:

  • Rewrites the “!” character that is used to mark the beginning of the parameters to “?”
  • Rewrites the Expand= parameter to ExpandView. If there is no number argument to Expand, it is not modified.
  • Rejects URLs with more than one Expand parameter.
  • Places parameters in the following order: Start, Count, Expand, followed by any other parameters.
  • If the URL contains a Start parameter, but no Count parameter, adds Count=1000.
  • If the URL contains Count=1000, but no Start parameter, adds Start=1.
  • If the URL contains the ExpandView parameter, and has a Start parameter but no Count parameter, sets Start=1&Count=1000.
  • Removes additional parameters after a command except Expand/ExpandView, Count, or Start.

For example, before the rewrite, this is the URL:

http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Count=30&Expand=3

After the rewrite, this is the URL:

http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Start=1&Count=1000&ExpandView

ColdFusion Application Server

In URLs for ColdFusion application server, the Google Search Appliance removes CFID and CFTOKEN parameters before fetching URLs.

Index Pages

In URLs for index pages, the Google Search Appliance removes index.htm or index.html from the end of URLs before fetching them. It also automatically removes them from Start URLs that you enter on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.

For example, before the rewrite, this is the URL:

http://www.google.com/index.html

After the rewrite, this is the URL:

http://www.google.com/

