Running a Crawl
- Selecting a Crawl Mode
- Scheduling a Crawl
- Stopping, Pausing, or Resuming a Crawl
- Submitting a URL to Be Recrawled
- Starting a Database Crawl
Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter explains how search appliance administrators start a crawl.
Selecting a Crawl Mode
If you select scheduled crawl mode, you must specify a start time and a duration for the crawl (see Scheduling a Crawl). If you select and save Continuous crawl mode, crawling starts and a link to the Freshness Tuning page appears (see Freshness Tuning).
For complete information about the Content Sources > Web Crawl > Crawl Schedule page, click Admin Console Help > Content Sources > Web Crawl > Crawl Schedule in the Admin Console.
Scheduling a Crawl
In scheduled crawl mode, the search appliance starts crawling according to a schedule that you specify on the Content Sources > Web Crawl > Crawl Schedule page in the Admin Console. On this page, you can specify:
- The day, hour, and minute when crawling should start
- The maximum duration for crawling
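As a rough illustration of how a start time and a maximum duration together define a crawl window, the following Python sketch models one schedule entry. The class name `CrawlWindow`, its fields, and its methods are hypothetical and are not part of the search appliance or the Admin Console. The sketch assumes a weekly schedule and a maximum duration shorter than seven days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CrawlWindow:
    """Hypothetical model of one scheduled-crawl entry (not a GSA API)."""
    weekday: int             # 0 = Monday ... 6 = Sunday (assumed weekly schedule)
    hour: int
    minute: int
    max_duration: timedelta  # assumed to be shorter than 7 days

    def most_recent_start(self, now: datetime) -> datetime:
        """Return the latest scheduled start time that is not after `now`."""
        start = now.replace(hour=self.hour, minute=self.minute,
                            second=0, microsecond=0)
        # Step back to the scheduled weekday within the current week.
        start -= timedelta(days=(now.weekday() - self.weekday) % 7)
        if start > now:
            # Today's start has not been reached yet, so use last week's.
            start -= timedelta(days=7)
        return start

    def is_active(self, now: datetime) -> bool:
        """True if `now` falls inside the window opened by the latest start."""
        return now - self.most_recent_start(now) < self.max_duration


# Example: crawl starts Monday at 22:00 and may run for at most 6 hours.
window = CrawlWindow(weekday=0, hour=22, minute=0,
                     max_duration=timedelta(hours=6))
print(window.is_active(datetime(2024, 1, 2, 3, 0)))  # Tuesday 03:00 -> True
print(window.is_active(datetime(2024, 1, 2, 5, 0)))  # Tuesday 05:00 -> False
```

Modeling the schedule as a start point plus a duration, rather than a start and an end time, mirrors the two settings listed above and handles windows that run past midnight.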
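Stopping, Pausing, or Resuming a Crawl
Using the Content Sources > Diagnostics > Crawl Status page in the Admin Console, you can:
- Stop crawling (scheduled crawl mode)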
- Pause crawling (continuous crawl mode)
- Resume crawling (continuous crawl mode)
When you stop crawling:
- The documents that were crawled remain in the index
- The index contains some old documents and some newly crawled documents
When you pause crawling, the Google Search Appliance stops only the crawling of documents in the index. Connectivity tests for Start URLs still run every 30 minutes, so you may notice this activity in access logs.
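To check whether requests seen during a pause match the 30-minute connectivity tests, one option is to compare the gaps between log timestamps. The Python sketch below is only an illustration. The function name and the tolerance are hypothetical, and it assumes you have already extracted the request times for a single Start URL from your own web server's access log.

```python
from datetime import datetime, timedelta

TEST_INTERVAL = timedelta(minutes=30)  # interval stated for connectivity tests


def looks_like_connectivity_tests(
    times: list[datetime],
    tolerance: timedelta = timedelta(minutes=2),  # hypothetical slack
) -> bool:
    """Return True if consecutive requests are roughly 30 minutes apart.

    `times` holds request timestamps for a single Start URL, taken from
    your own access log (parsing is omitted because log formats vary).
    """
    ordered = sorted(times)
    if len(ordered) < 2:
        return False  # not enough requests to judge the spacing
    gaps = (later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return all(abs(gap - TEST_INTERVAL) <= tolerance for gap in gaps)
```

A result of `False` suggests the requests come from something other than the periodic tests, such as a crawl that was not actually paused.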
For complete information about the Content Sources > Diagnostics > Crawl Status page, click Admin Console Help > Content Sources > Diagnostics > Crawl Status in the Admin Console.
Submitting a URL to Be Recrawled
Occasionally, you may want a recently changed URL to be recrawled sooner than the Google Search Appliance has scheduled it for recrawling (see How Are URLs Scheduled for Recrawl?). Provided that the URL has been previously crawled, you can submit it for immediate recrawling from the Admin Console using one of the following methods:
- Selecting Recrawl from the Actions menu for a start URL or follow pattern on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console
- Using the Recrawl these URL Patterns box on the Content Sources > Web Crawl > Freshness Tuning page in the Admin Console (see Freshness Tuning)
- Clicking Recrawl this URL in a detail view of a URL on the Index > Diagnostics > Index Diagnostics page in the Admin Console (see Using the Admin Console to Monitor a Crawl)
URLs that you submit for recrawling are treated the same way as new, uncrawled URLs in the crawl queue. They are scheduled to be crawled in order of Enterprise PageRank, and before any URLs that the search appliance has automatically scheduled for recrawling.
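The ordering described above can be sketched in Python as a two-level sort. This is a simplified model only: the `QueuedUrl` type, its field names, and the PageRank values are hypothetical, and the sketch does not represent the search appliance's actual scheduler. It assumes that a higher Enterprise PageRank is crawled first and that automatically scheduled recrawls keep whatever order they already have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedUrl:
    """Hypothetical crawl-queue entry (illustration only)."""
    url: str
    enterprise_pagerank: float
    submitted: bool  # True for new, uncrawled, or manually submitted URLs


def crawl_order(queue: list[QueuedUrl]) -> list[QueuedUrl]:
    """Place submitted URLs first (highest PageRank first), then the rest."""
    submitted = sorted(
        (entry for entry in queue if entry.submitted),
        key=lambda entry: entry.enterprise_pagerank,
        reverse=True,
    )
    # Assumption: automatically scheduled recrawls keep their existing order.
    automatic = [entry for entry in queue if not entry.submitted]
    return submitted + automatic


queue = [
    QueuedUrl("http://intranet.example.com/old", 9.0, submitted=False),
    QueuedUrl("http://intranet.example.com/faq", 1.0, submitted=True),
    QueuedUrl("http://intranet.example.com/home", 5.0, submitted=True),
]
for entry in crawl_order(queue):
    print(entry.url)
```

In this model a submitted URL with a low Enterprise PageRank is still crawled before an automatically scheduled URL with a high one, which matches the behavior described in the text.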
How quickly the search appliance can actually crawl these URLs depends on several other factors, such as network latency, content server responsiveness, and the number of documents already in the queue. To check progress, use the Content Sources > Diagnostics > Crawl Queue page (see Using the Admin Console to Monitor a Crawl), where you can observe the crawler backlog and confirm that no content server is acting as a bottleneck in the crawl.
Starting a Database Crawl
In GSA release 7.4, the on-board database crawler is deprecated. For more information, see Deprecation Notices.
The process of crawling a database is called “synchronizing” a database. After you configure database crawling (see Configuring Database Crawl), you can start synchronizing a database by using the Content Sources > Databases page in the Admin Console.
To synchronize a database:
- Click Content Sources > Databases.
- In the Current Databases section of the page, click the Sync link next to the database that you want to synchronize.
The database synchronization runs until it is complete.