Administering Crawl

Crawl Quick Reference

Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter provides reference information about crawl administration tasks.

Crawling and Indexing Features


The following table lists Google Search Appliance crawl and index features. For each feature, the table lists the page in the Admin Console where you can use the feature and a reference to a section in this document that describes it.

| Feature | Admin Console Page | Reference |
|---|---|---|
| Always force recrawl URLs | Content Sources > Web Crawl > Freshness Tuning | Freshness Tuning |
| Content statistics | Index > Diagnostics > Content Statistics | Using the Admin Console to Monitor a Crawl |
| Continuous crawl | Content Sources > Web Crawl > Crawl Schedule | Selecting a Crawl Mode |
| Coverage tuning | Content Sources > Web Crawl > Coverage Tuning | Coverage Tuning |
| Index diagnostics | Index > Diagnostics > Index Diagnostics | Using the Admin Console to Monitor a Crawl |
| Crawl frequently URLs | Content Sources > Web Crawl > Freshness Tuning | Freshness Tuning |
| Crawl infrequently URLs | Content Sources > Web Crawl > Freshness Tuning | Freshness Tuning |
| Crawl modes | Content Sources > Web Crawl > Crawl Schedule | Selecting a Crawl Mode |
| Crawl queue snapshots | Content Sources > Diagnostics > Crawl Queue | Using the Admin Console to Monitor a Crawl |
| Crawl schedule | Content Sources > Web Crawl > Crawl Schedule | Scheduling a Crawl |
| Crawl status | Content Sources > Diagnostics > Crawl Status | Using the Admin Console to Monitor a Crawl |
| Crawl URLs | Content Sources > Web Crawl > Start and Block URLs | Configuring a Crawl |
| Do not follow patterns | Content Sources > Web Crawl > Start and Block URLs | Configuring a Crawl |
| Document dates | Index > Document Dates | Defining Document Date Rules |
| Duplicate hosts | Content Sources > Web Crawl > Duplicate Hosts | Preventing Crawling of Duplicate Hosts |
| Entity recognition | Index > Entity Recognition | Discovering and Indexing Entities |
| Follow patterns | Content Sources > Web Crawl > Start and Block URLs | Configuring a Crawl |
| Freshness tuning | Content Sources > Web Crawl > Freshness Tuning | Freshness Tuning |
| Host load exceptions | Content Sources > Web Crawl > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Host load schedule | Content Sources > Web Crawl > Host Load Schedule | Configuring Web Server Host Load Schedules |
| HTTP headers | Content Sources > Web Crawl > HTTP Headers | Identifying the User Agent |
| Index limits | Index > Index Settings | Changing the Amount of Each Document that Is Indexed |
| Infinite space detection | Content Sources > Web Crawl > Duplicate Hosts | Enabling Infinite Space Detection |
| Maximum number of URLs to crawl | Content Sources > Web Crawl > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Metadata indexing | Index > Index Settings | Configuring Metadata Indexing |
| Proxy servers | Content Sources > Web Crawl > Proxy Servers | Crawling over Proxy Servers |
| Recrawl URLs | Content Sources > Web Crawl > Freshness Tuning | Freshness Tuning |
| Scheduled crawl | Content Sources > Web Crawl > Crawl Schedule | Selecting a Crawl Mode |
| Start crawling from the following URLs | Content Sources > Web Crawl > Start and Block URLs | Configuring a Crawl |
| Web server host load | Content Sources > Web Crawl > Host Load Schedule | Configuring Web Server Host Load Schedules |


Crawling and Indexing Administration Tasks


The following table lists Google Search Appliance crawl and index administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.

| Task | Reference | Admin Console Page |
|---|---|---|
| Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages | Preparing Data for a Crawl | |
| Set up the crawl path: start URLs, follow patterns, do not follow patterns | Configuring a Crawl | Content Sources > Web Crawl > Start and Block URLs |
| Test URL patterns in the crawl path | Testing Your URL Patterns | Content Sources > Web Crawl > Start and Block URLs |
| Select a crawl mode: continuous crawl or scheduled crawl | Selecting a Crawl Mode | Content Sources > Web Crawl > Crawl Schedule |
| Schedule a crawl | Scheduling a Crawl | Content Sources > Web Crawl > Crawl Schedule |
| Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl | Freshness Tuning | Content Sources > Web Crawl > Freshness Tuning |
| Pause or restart a continuous crawl | Stopping, Pausing, or Resuming a Crawl | Content Sources > Diagnostics > Crawl Status |
| Stop a scheduled crawl | Stopping, Pausing, or Resuming a Crawl | Content Sources > Diagnostics > Crawl Status |
| Submit a URL to be recrawled | Freshness Tuning | Content Sources > Web Crawl > Freshness Tuning |
| Submit a URL to be recrawled | Submitting a URL to Be Recrawled | Index > Diagnostics > Index Diagnostics |
| Change the amount of each document that is indexed | Changing the Amount of Each Document that Is Indexed | Index > Index Settings |
| Configure metadata indexing | Configuring Metadata Indexing | Index > Index Settings |
| Set up entity recognition | Discovering and Indexing Entities | Index > Entity Recognition |
| Control the number of URLs the search appliance crawls for a site | Coverage Tuning | Content Sources > Web Crawl > Coverage Tuning |
| Set up proxies for Web servers | Crawling over Proxy Servers | Content Sources > Web Crawl > Proxy Servers |
| Locate or change the user-agent name | Identifying the User Agent | Content Sources > Web Crawl > HTTP Headers |
| Enter additional HTTP headers for the search appliance crawler to use | Identifying the User Agent | Content Sources > Web Crawl > HTTP Headers |
| Prevent recrawling of content that resides on duplicate hosts | Preventing Crawling of Duplicate Hosts | Content Sources > Web Crawl > Duplicate Hosts |
| Prevent crawling of duplicate content to avoid infinite space indexing | Enabling Infinite Space Detection | Content Sources > Web Crawl > Duplicate Hosts |
| Define rules for the search appliance crawler to use as it indexes documents | Defining Document Date Rules | Index > Document Dates |
| Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling | Configuring Web Server Host Load Schedules | Content Sources > Web Crawl > Host Load Schedule |
| View the current crawl mode and summary information about events of the past 24 hours in a crawl | Using the Admin Console to Monitor a Crawl | Content Sources > Diagnostics > Crawl Status |
| View crawl history for all hosts, a specific host, or a specific file | Using the Admin Console to Monitor a Crawl | Index > Diagnostics > Index Diagnostics |
| Define and view a snapshot of uncrawled URLs in the crawl queue | Using the Admin Console to Monitor a Crawl | Content Sources > Diagnostics > Crawl Queue |
| View summary information about files that have been crawled | Using the Admin Console to Monitor a Crawl | Index > Diagnostics > Content Statistics |
| View current license information | What Is the Search Appliance License Limit? | Administration > License |


Admin Console Basic Crawl Pages


The following table lists Google Search Appliance Admin Console pages that are used to administer a basic crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.

| Admin Console Page | Reference |
|---|---|
| Content Sources > Web Crawl > Start and Block URLs | Configuring a Crawl |
| Content Sources > Web Crawl > Crawl Schedule | Selecting a Crawl Mode; Scheduling a Crawl |
| Content Sources > Web Crawl > Proxy Servers | Crawling over Proxy Servers |
| Content Sources > Web Crawl > HTTP Headers | Identifying the User Agent |
| Content Sources > Web Crawl > Duplicate Hosts | Preventing Crawling of Duplicate Hosts |
| Index > Document Dates | Defining Document Date Rules |
| Content Sources > Web Crawl > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Content Sources > Web Crawl > Coverage Tuning | Coverage Tuning |
| Content Sources > Web Crawl > Freshness Tuning | Freshness Tuning |
| Index > Index Settings | Changing the Amount of Each Document that Is Indexed; Configuring Metadata Indexing |
| Index > Entity Recognition | Discovering and Indexing Entities |
| Content Sources > Diagnostics > Crawl Status | Using the Admin Console to Monitor a Crawl |
| Index > Diagnostics > Index Diagnostics | Using the Admin Console to Monitor a Crawl |
| Content Sources > Diagnostics > Crawl Queue | Using the Admin Console to Monitor a Crawl |
| Index > Diagnostics > Content Statistics | Using the Admin Console to Monitor a Crawl |
