- Identifying the User Agent
- Coverage Tuning
- Freshness Tuning
- Changing the Amount of Each Document that Is Indexed
- Configuring Metadata Indexing
- Crawling over Proxy Servers
- Preventing Crawling of Duplicate Hosts
- Preventing Crawling of Duplicate URLs
- Enabling Infinite Space Detection
- Configuring Web Server Host Load Schedules
- Removing Documents from the Index
- Using Collections
- Discovering and Indexing Entities
- Wildcard Indexing
Crawling is the process by which the Google Search Appliance discovers enterprise content to index. The information in this chapter extends beyond basic crawling.
Web servers see various client applications, including Web browsers and the Google Search Appliance crawler, as “user agents.” When the search appliance crawler visits a Web server, the crawler identifies itself to the server by its User-Agent identifier, which is sent as part of the HTTP request.
The User-Agent identifier includes all of the following elements:
- A unique identifier that is assigned for each search appliance
- A user agent name
- An email address that is associated with the search appliance
The default user agent name for the Google Search Appliance is “gsa-crawler.” In a Web server’s logs, the server administrator can identify each visit by the search appliance crawler to a Web server by this user agent name.
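As a rough illustration of how a server administrator might pick out crawler visits, the following Python sketch filters access-log lines by the default user agent name. The log layout, the sample identifier values, and the function name crawler_visits are assumptions made for this example; only the name gsa-crawler comes from the text above.

```python
def crawler_visits(log_lines, agent_name="gsa-crawler"):
    """Return the log lines whose text mentions the given user agent name."""
    needle = agent_name.lower()
    return [line for line in log_lines if needle in line.lower()]


# Hypothetical access-log lines; the layout and identifier values are illustrative only.
sample_log = [
    '10.0.0.5 - - "GET /index.html HTTP/1.1" 200 "gsa-crawler (example-id; admin@example.com)"',
    '10.0.0.9 - - "GET /index.html HTTP/1.1" 200 "Mozilla/5.0"',
    '10.0.0.5 - - "GET /docs/a.html HTTP/1.1" 200 "gsa-crawler (example-id; admin@example.com)"',
]

visits = crawler_visits(sample_log)
print(len(visits))
```

If the user agent name has been changed in the Admin Console, pass the new name as agent_name.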
By using the Content Sources > Web Crawl > HTTP Headers page in the Admin Console, you can view or change the User-Agent name or enter additional HTTP headers for the search appliance crawler to send.
Including an email address in the User-Agent identifier enables a webmaster to contact the Google Search Appliance administrator in case the site is adversely affected by crawling that is too rapid, or if the webmaster does not want certain pages crawled at all. The email address is a required element of the search appliance User-Agent identifier.
For complete information about the Content Sources > Web Crawl > HTTP Headers page, click Admin Console Help > Content Sources > Web Crawl > HTTP Headers in the Admin Console.
You can control the number of URLs the search appliance crawls for a site by using the Content Sources > Web Crawl > Coverage Tuning page in the Admin Console. To tune crawl coverage, enter a URL pattern and set the maximum number of URLs to crawl for it. The URL patterns you provide must conform to the Rules for Valid URL Patterns in "Administering Crawl."
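The effect of a coverage limit can be pictured with a small Python sketch. It assumes, purely for illustration, that each pattern is a plain URL prefix and that URLs matching no pattern are unlimited; the function name apply_coverage_limits and the sample URLs are hypothetical, and the appliance's real pattern syntax is richer than this.

```python
def apply_coverage_limits(urls, limits):
    """Keep at most limits[pattern] URLs for each prefix pattern.

    URLs that match no pattern are kept without a limit (an assumption
    of this sketch, not a statement about the appliance).
    """
    counts = {pattern: 0 for pattern in limits}
    kept = []
    for url in urls:
        pattern = next((p for p in limits if url.startswith(p)), None)
        if pattern is None:
            kept.append(url)
        elif counts[pattern] < limits[pattern]:
            counts[pattern] += 1
            kept.append(url)
    return kept


discovered = [
    "http://www.example.com/archive/1.html",
    "http://www.example.com/archive/2.html",
    "http://www.example.com/archive/3.html",
    "http://www.example.com/docs/a.html",
]

kept = apply_coverage_limits(discovered, {"http://www.example.com/archive/": 2})
```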
For complete information about the Content Sources > Web Crawl > Coverage Tuning page, click Admin Console Help > Content Sources > Web Crawl > Coverage Tuning in the Admin Console.
You can improve the performance of a continuous crawl using URL patterns on the Content Sources > Web Crawl > Freshness Tuning page in the Admin Console. The Content Sources > Web Crawl > Freshness Tuning page provides four categories of crawl behaviors, as described in the following table. To apply a crawl behavior, specify URL patterns for the behavior.
Crawl Frequently
Use Crawl Frequently patterns for URLs that are dynamic and change frequently. You can use the Crawl Frequently patterns to give hints to the search appliance crawler during the early stages of crawling, before the search appliance has a history of how frequently URLs actually change.
Any URL that matches one of the Crawl Frequently patterns is scheduled to be recrawled at least once every day. The minimum wait time (see Wait Times) is 15 minutes, but if you have too many URLs in Crawl Frequently patterns, the wait time increases.
Crawl Infrequently
Use Crawl Infrequently patterns for URLs that are relatively static and do not change frequently. You can use this feature for Web pages that do not change and do not need to be recrawled. You can also use it for Web pages where a small part of the content changes frequently, but the important parts of the content do not change.
You can set the crawler to crawl these URLs once a week, once a month, or no more than once every 3 months, regardless of their Enterprise PageRank or how frequently they change.
Always Force Recrawl
Use Always Force Recrawl patterns to prevent the search appliance from crawling a URL from cache (see Determining Document Changes with If-Modified-Since Headers and the Content Checksum).
Recrawl these URL Patterns
Use Recrawl these URL Patterns to submit a URL to be recrawled. URLs that you enter here are recrawled as soon as possible.
For complete information about the Content Sources > Web Crawl > Freshness Tuning page, click Admin Console Help > Content Sources > Web Crawl > Freshness Tuning in the Admin Console.
By default, the search appliance indexes up to 2.5MB of each text or HTML document, including documents that have been truncated or converted to HTML. After indexing, the search appliance caches the indexed portion of the document and discards the rest.
You can change the default by entering a new amount of up to 10MB in Index Limits on the Index > Index Settings page.
For complete information about changing index settings on this page, click Admin Console Help > Index > Index Settings in the Admin Console.
The search appliance has default settings for indexing metadata, including which metadata names are to be indexed, as well as how to handle multivalued metadata and date fields. You can customize the default settings or add an indexing configuration for a specific attribute by using the Index > Index Settings page. By using this page you can perform the following tasks:
- Including or excluding metadata names in dynamic navigation
- Specifying multivalued separators
- Specifying a date format for metadata date fields
For complete information about configuring metadata indexing, click Admin Console Help > Index > Index Settings in the Admin Console.
You might know which indexed metadata names you want to use in dynamic navigation. In this case, you can create a whitelist of names to be used by entering an RE2 regular expression that includes those names in Regular Expression and checking Include.
If you know which indexed metadata names you do not want to use in dynamic navigation, you can create a blacklist of names by entering an RE2 regular expression that includes those names in Regular Expression and selecting Exclude. Although blacklisted names do not appear in dynamic navigation options, these names are still indexed and can be searched by using the inmeta, requiredfields, and partialfields query parameters.
This option is required for dynamic navigation. For information about dynamic navigation, click Admin Console Help > Search > Search Features > Dynamic Navigation.
By default, the regular expression is ".*" and Include is selected, that is, index all metadata names and use all the names in dynamic navigation.
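The Include and Exclude options can be sketched in Python as a filter over metadata names. This is only an approximation: Python's re module is not RE2 (the simple patterns used here behave the same in both), and full-string matching, the function name names_for_navigation, and the sample metadata names are assumptions of the example.

```python
import re


def names_for_navigation(names, pattern, include=True):
    """Filter metadata names by a regular expression.

    include=True keeps the names that match (whitelist);
    include=False keeps the names that do not match (blacklist).
    Full-string matching is an assumption of this sketch.
    """
    regex = re.compile(pattern)
    return [n for n in names if bool(regex.fullmatch(n)) == include]


# Hypothetical indexed metadata names.
indexed_names = ["author", "department", "releasedOn", "keywords"]

whitelist = names_for_navigation(indexed_names, r"author|department", include=True)
blacklist = names_for_navigation(indexed_names, r"keywords", include=False)
default = names_for_navigation(indexed_names, r".*", include=True)
```

With the default expression ".*" and Include selected, every name passes the filter, which corresponds to using all indexed names in dynamic navigation.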
For complete information about creating a whitelist or blacklist of metadata names, click Admin Console Help > Index > Index Settings in the Admin Console.
A metadata attribute can have multiple values, indicated either by multiple meta tags or by multiple values within a single meta tag, as shown in the following example:
<meta name="authors" content="S. Jones, A. Garcia">
In this example, the two values (S. Jones, A. Garcia) are separated by a comma.
By using the Multivalued Separator options, you can specify multivalued separators for the default metadata indexing configuration or for a specific metadata name. Any string except an empty string is a valid multivalued separator. An empty string causes the multiple values to be treated as a single value.
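A minimal Python sketch of the separator rule, using the authors example above. The function name split_values is hypothetical, and trimming the whitespace around each value is a choice made for this sketch rather than documented appliance behavior.

```python
def split_values(content, separator):
    """Split a meta tag's content into values using a multivalued separator."""
    if separator == "":
        # An empty separator means the content is treated as a single value.
        return [content]
    # Stripping surrounding whitespace is a choice made for this sketch.
    return [part.strip() for part in content.split(separator)]


authors = split_values("S. Jones, A. Garcia", ",")
single = split_values("S. Jones, A. Garcia", "")
```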
For complete information about specifying multivalued separators, click Admin Console Help > Index > Index Settings in the Admin Console.
By using the Date Format menus, you can specify a date format for metadata date fields. The following example shows a date field:
<meta name="releasedOn" content="20120714">
To specify a date format for either the default metadata indexing configuration or for a specific metadata name, select a value from the menu.
The search appliance tries to parse dates that it discovers according to the format that you select for a specific configuration or, if you do not add a specific configuration, according to the default date format. If a date that the search appliance discovers in the metadata does not match the selected format, the search appliance determines whether it can parse the value as any other date format.
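The "selected format first, then any other format" behavior can be sketched in Python as follows. The list of fallback formats and the function name parse_metadata_date are assumptions for illustration; the formats the appliance actually tries are not listed in this chapter.

```python
from datetime import datetime

# Hypothetical fallback formats; the formats the appliance tries are not listed here.
FALLBACK_FORMATS = ["%Y%m%d", "%Y-%m-%d", "%m/%d/%Y"]


def parse_metadata_date(value, selected_format):
    """Try the selected format first, then each fallback format in turn."""
    for fmt in [selected_format] + FALLBACK_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # No format matched.


released = parse_metadata_date("20120714", "%Y%m%d")
fallback = parse_metadata_date("2012-07-14", "%Y%m%d")
unparsed = parse_metadata_date("not a date", "%Y%m%d")
```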
For complete information about specifying a date format, click Admin Console Help > Index > Index Settings in the Admin Console.
If you want the Google Search Appliance to crawl outside your internal network and include the crawled data in your index, use the Content Sources > Web Crawl > Proxy Servers page in the Admin Console. For complete information about the Content Sources > Web Crawl > Proxy Servers page, click Admin Console Help > Content Sources > Web Crawl > Proxy Servers in the Admin Console.
Many organizations have mirrored servers or duplicate hosts for such purposes as production, testing, and load balancing. Mirrored servers also arise when multiple aliases are used or when a Web site has changed names, which usually occurs when companies or departments merge.
Disadvantages of allowing the Google Search Appliance to recrawl content on mirrored servers include:
- Increasing the time it takes for the search appliance to crawl content.
- Indexing the same content twice, which wastes license capacity because both versions count towards the license limit.
- Decreasing the relevance of search results, because the search appliance cannot discover accurate information about the link structure of crawled documents.
To prevent crawling of duplicate hosts, you can specify one or more “canonical,” or standard, hosts using the Content Sources > Web Crawl > Duplicate Hosts page.
For complete information about the Content Sources > Web Crawl > Duplicate Hosts page, click Admin Console Help > Content Sources > Web Crawl > Duplicate Hosts in the Admin Console.
Starting in release 7.6.250, you can use the "canonical" link tag to reduce the number of duplicate URLs that the search appliance crawls. HTML files that have the canonical element (<link rel="canonical" />) are treated as duplicate pages and do not show up in search results.
If an HTML file contains other redirects, the canonical element needs to be the first tag inside the <head> tag.
For information about adding the canonical <link> tag to HTML documents, see the Google Webmaster Central Blog.
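To check whether a page carries a canonical link element before relying on this behavior, a short Python sketch such as the following may help. It uses only the standard library HTML parser; the class and function names and the sample page are hypothetical, and the sketch only detects the element rather than reproducing how the appliance treats it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


# Hypothetical page content.
page = (
    '<html><head><link rel="canonical" href="http://www.example.com/product" />'
    "</head><body></body></html>"
)
canonical_url = find_canonical(page)
no_canonical = find_canonical("<html><head></head><body></body></html>")
```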
In “infinite space,” the search appliance repeatedly crawls similar URLs with the same content while useful content goes uncrawled. For example, the search appliance might start crawling infinite space if a page that it fetches contains a link back to itself with a different URL. The search appliance keeps crawling this page because, each time, the URL contains progressively more query parameters or a longer path. When a URL is in infinite space, the search appliance does not crawl links in the content.
By enabling infinite space detection, you can prevent crawling of duplicate content to avoid infinite space indexing.
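The kind of URL growth described above can be illustrated with a simple heuristic in Python. This is not the appliance's detection algorithm, which is not described here; the thresholds, the function name looks_like_infinite_space, and the sample URLs are all assumptions chosen for the sketch.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def looks_like_infinite_space(url, max_segment_repeats=3, max_query_params=10):
    """Flag URLs whose path segments repeat or whose query string keeps growing."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    most_repeats = max(Counter(segments).values(), default=0)
    query_params = parse_qsl(parts.query, keep_blank_values=True)
    return most_repeats > max_segment_repeats or len(query_params) > max_query_params


normal = looks_like_infinite_space("http://www.example.com/docs/guide/index.html")
looping = looks_like_infinite_space("http://www.example.com/cal/next/next/next/next/next")
```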
To enable infinite space detection, use the Content Sources > Web Crawl > Duplicate Hosts page.
For complete information about the Content Sources > Web Crawl > Duplicate Hosts page, click Admin Console Help > Content Sources > Web Crawl > Duplicate Hosts in the Admin Console.
A Web server can handle several concurrent requests from the search appliance. The number of concurrent requests is known as the Web server’s “host load.” If the Google Search Appliance is crawling through a proxy, the host load limits the maximum number of concurrent connections that can be made through the proxy. The default number of concurrent requests is 4.0.
Increasing the host load can speed up the crawl rate, but it also puts more load on your Web servers. It is recommended that you experiment with the host load settings at off-peak times or in controlled environments so that you can monitor the effect they have on your Web servers.
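To make the idea of a concurrency cap concrete, the following Python sketch limits the number of simultaneous fetches with a thread pool. It treats the host load as a whole-number cap, which is a simplification, and it uses a stand-in fetch object (FakeFetcher, a hypothetical name) that records peak concurrency instead of touching the network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

HOST_LOAD = 4  # Mirrors the default of 4.0 concurrent requests mentioned above.


def crawl_with_host_load(urls, fetch, host_load=HOST_LOAD):
    """Fetch every URL while never running more than host_load fetches at once."""
    with ThreadPoolExecutor(max_workers=host_load) as pool:
        return list(pool.map(fetch, urls))


class FakeFetcher:
    """Stand-in for a real fetch; records the peak number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, url):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # Simulate the time a request takes.
        with self._lock:
            self._active -= 1
        return url


fetcher = FakeFetcher()
urls = [f"http://www.example.com/page{i}" for i in range(20)]
results = crawl_with_host_load(urls, fetcher)
```

Raising host_load lets more fetches overlap, which shortens the run but increases the number of requests the server handles at the same moment.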
To configure a Web Server Host Load schedule, use the Content Sources > Web Crawl > Host Load Schedule page. You can also use this page to configure exceptions to the web server host load.
Regarding file system crawling: if you have configured the search appliance to crawl documents from an SMB file system, the crawl follows only the configurable default value of Web Server Host Load (4.0 by default). It does not follow any Exceptions to Web Server Host Load that are defined specifically for the SMB host. Because of a design constraint, set the default Web Server Host Load value to 8.0 or below; a higher value may affect the performance of your file system crawling.
For complete information about the Content Sources > Web Crawl > Host Load Schedule page, click Admin Console Help > Content Sources > Web Crawl > Host Load Schedule in the Admin Console.
Collections are subsets of the index used to serve different search results to different users. For example, a collection can be organized by geography, product, job function, and so on. Collections can overlap, so one document can be relevant to several different collections, depending on its content. Collections also allow users to search targeted content more quickly and efficiently than searching the entire index.
For information about using the Index > Collections page to create and manage collections, click Admin Console Help > Index > Collections in the Admin Console.
During initial crawling, the Google Search Appliance establishes the default_collection, which contains all crawled content. You can redefine the default_collection but it is not advisable to do this because index diagnostics are organized by collection. Troubleshooting using the Index > Diagnostics > Index Diagnostics page becomes much harder if you cannot see all URLs crawled.
Documents that are added to the index receive a tag for each collection whose URL patterns they match. If you change the URL patterns for a collection, the search appliance immediately starts a process that runs across all the crawled URLs and retags them according to the change in the URL patterns. This process usually completes in a few minutes but can take up to an hour for heavily-loaded appliances. Search results for the collection are corrected after the process finishes.
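The tagging step can be pictured with a short Python sketch. It assumes, for illustration only, that each collection is defined by plain URL prefixes; the real URL pattern syntax is richer, and the collection names, URLs, and the function name tag_collections are hypothetical.

```python
def tag_collections(url, collections):
    """Return the names of all collections whose patterns match the URL."""
    return sorted(
        name
        for name, prefixes in collections.items()
        if any(url.startswith(prefix) for prefix in prefixes)
    )


# Hypothetical collection definitions; prefixes stand in for URL patterns.
collections = {
    "default_collection": ["http://"],
    "europe": ["http://www.example.com/eu/"],
    "products": [
        "http://www.example.com/eu/products/",
        "http://www.example.com/us/products/",
    ],
}

url = "http://www.example.com/eu/products/widget.html"
tags_before = tag_collections(url, collections)

# Changing a collection's patterns changes the tags on the next retagging pass.
collections["europe"] = ["http://www.example.com/emea/"]
tags_after = tag_collections(url, collections)
```

Because collections can overlap, the same document receives several tags in the first pass and loses only the tag for the collection whose patterns changed.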
- Logical Redirects by Assignments to window.location
- Links and Content Added by document.write and document.writeln Functions
- Links that are Generated by Event Handlers
- Links with an onclick Return Value
The search appliance only executes scripts embedded inside a document. The search appliance does not support:
- DOM tracking to support calls, such as
- External scripts execution
- AJAX execution
The search appliance crawls links specified by a logical redirect by assignment to window.location, which makes the web browser load a new document by using a specific URL.
The following code example shows a logical redirect by assignment to window.location.
The search appliance crawls links and indexes content that is added to a document by document.write and document.writeln functions. These functions generate document content while the document is being parsed by the browser.
The following code example shows links added to a document by
The search appliance crawls links that are generated by event handlers, such as
The following code example shows links generated by event handlers in an
anchor and a
The search appliance crawls links with an onclick return value other than false. If the onclick script returns false, the URL is not crawled. The following code example shows both situations.
<HTML>
<HEAD></HEAD>
<BODY>
<a href="http://bad.com" onclick="return false;">This link will not be crawled</a>
<a href="http://good.com" onclick="return true;">This link will be crawled</a>
</BODY>
</HTML>
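The onclick rule can be approximated in a few lines of Python. This sketch does not execute JavaScript as the appliance does; it only recognizes the literal handler "return false", and the class name OnclickLinkFilter is hypothetical.

```python
from html.parser import HTMLParser


class OnclickLinkFilter(HTMLParser):
    """Sort anchor URLs by whether their onclick handler is a literal 'return false'."""

    def __init__(self):
        super().__init__()
        self.crawlable = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalize the handler text; only the literal form is recognized here.
        onclick = (attributes.get("onclick") or "").replace(" ", "").rstrip(";").lower()
        if onclick == "returnfalse":
            self.skipped.append(href)
        else:
            self.crawlable.append(href)


page = """<HTML><HEAD></HEAD><BODY>
<a href="http://bad.com" onclick="return false;">This link will not be crawled</a>
<a href="http://good.com" onclick="return true;">This link will be crawled</a>
</BODY></HTML>"""

link_filter = OnclickLinkFilter()
link_filter.feed(page)
```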
Any content added to the document by document.write/writeln calls (as shown in the following example) will be indexed as a part of the original document.
For example, suppose that your search appliance crawls and indexes multiple content sources, but only one of these sources has robust metadata. By using entity recognition, you can enrich the metadata-poor content sources with discovered entities and discover new, interesting entities in the source with robust metadata.
After you configure and enable entity recognition, the search appliance automatically discovers specific entities in your content sources during indexing, annotates them, and stores them in the index. Once the entities are indexed, you can enhance keyword search by adding the entities in dynamic navigation, which uses metadata in documents and entities discovered by entity recognition to enable users to browse search results by using specific attributes. To add the entities to dynamic navigation, use the Search > Search Features > Dynamic Navigation page.
Additionally, by default, entity recognition extracts and stores full URLs in the index. This includes both document URLs and plain text URLs that appear in documents. So you can match specific URLs with entity recognition and add them to dynamic navigation, enabling users to browse search results by full or partial URL. For details about this scenario, see Use Case: Matching URLs for Dynamic Navigation.
The Index > Entity Recognition page enables you to specify the entities that you want the search appliance to discover in your documents. If you want to identify terms that should not be stored in the index, you can upload the terms in an entity blacklist file.
Before you can specify entities on the Index > Entity Recognition page, you must define each entity by creating dictionaries of terms and regular expressions. Dictionaries for terms are required for entity recognition. Dictionaries enable entity recognition to annotate entities, that is, to discover specific entities in the content and annotate them as entities.
Generally, with dictionaries, you define an entity with lists of terms and regular expressions. For example, the entity "Capital" might be defined by a dictionary that contains a list of country capitals: Abu Dhabi, Abuja, Accra, Addis Ababa, and so on. After you create a dictionary, you can upload it to the search appliance.
Entity recognition accepts dictionaries in either TXT or XML format.
Optionally, you can also create composite entities that run on the annotated terms. Like dictionaries, composite entities define entities, but composite entities enable the search appliance to discover more complex terms. In a composite entity, you can define an entity with a sequence of terms. Because composite entities run on annotated terms, all the words in a sequence must be tagged with an entity and so depend on dictionaries.
For example, suppose that you want to define a composite entity that detects full names, that is, combinations of titles, names, middlenames, and surnames. First, you need to define four dictionary-based entities, Title, Name, Middlename, and Surname, and provide a dictionary for each one. Then you define the composite entity, FullName, which detects full names.
A composite entity is written as an LL(1) grammar.
The search appliance provides sample dictionaries and composite entities, as shown on the Index > Entity Recognition page.
If entity recognition matches text to an entity term in a dictionary, it will not match the same text again to a different entity term in the same dictionary.
For example, if you set up these entity terms in a dictionary:
...
Work
Google Cloud
...
For a document containing "Google Cloud," entity recognition matches "Google Cloud" but that same text is not matched again.
For a document containing "Google Cloud - Enterprise Solutions to Work the Way You Live," the same dictionary will match both "Google Cloud" and "Work" because "Work" is repeated in the text.
Google recommends that you perform these tasks for setting up entity recognition, in the following order:
- Creating dictionaries and, optionally, composite entities.
- Adding new entities by adding dictionaries and, optionally, composite entities.
- Enabling entity recognition.
This use case describes matching URLs with entity recognition and using them to enrich dynamic navigation options. It also shows you how to define the name of the dynamic navigation options that display, either by explicitly specifying the name or by capturing the name from the URL.
This use case assumes you have already enabled entity recognition on your GSA and added entities to dynamic navigation. Having seen how easy this feature makes browsing results, your users also want to be able to browse by URLs. These URLs include:
They want dynamic navigation results to include just the domains "services," "policies," and so on. You can achieve this goal by performing the following steps:
- Creating an XML dictionary that defines the entity
- Adding the entity and dictionary to entity recognition
- Adding the entity to dynamic navigation
Creating an XML Dictionary that Defines an Entity for Matching URLs
The following example shows an XML dictionary for entity recognition that matches URLs. In this example, the names displayed for the dynamic navigation options are defined using the name element:
<?xml version="1.0"?>
<instances>
  <instance>
    <name>services</name>
    <pattern>http://.*/services.*</pattern>
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
  <instance>
    <name>policies</name>
    <pattern>http://.*/policies/.*</pattern>
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
  <instance>
    <name>history</name>
    <pattern>http://.*/history/.*</pattern>
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
</instances>
Creating an XML Dictionary that Defines an Entity for Capturing the Name from the URL
The following example shows an XML dictionary that matches URLs and captures the name of the dynamic navigation options by using the group term in the regular expression pattern:
<?xml version="1.0"?>
<instances>
  <instance>
    <name> Anything - will not be used </name>
    <pattern> http://www.mycompany.com/(\w+)/[^\s]+ </pattern>
    <store_regex_or_name> regex_tagged_as_first_group </store_regex_or_name>
  </instance>
</instances>
There are two important things to note about this example:
- The regular expression has a (\w+) term. The term is in parentheses, which define a capturing group. The \w means that this expression captures any word characters (≡ [0-9A-Za-z_]).
- The <store_regex_or_name> is set to regex_tagged_as_first_group. This indicates that if the pattern has a match, the text matched by the capturing group will be used as the name for the entity.
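The two storage modes can be tried out offline with a small Python sketch that reads a dictionary in the format shown above and reports the value that would be stored for a URL. How the appliance applies patterns (for example, search versus full match) is not stated here, so the use of re.search is an assumption, as are the function name entity_values and the test URLs.

```python
import re
import xml.etree.ElementTree as ET

# A dictionary combining one instance of each storage mode from the examples above.
DICTIONARY_XML = r"""<?xml version="1.0"?>
<instances>
  <instance>
    <name>services</name>
    <pattern>http://.*/services.*</pattern>
    <store_regex_or_name>name</store_regex_or_name>
  </instance>
  <instance>
    <name> Anything - will not be used </name>
    <pattern> http://www.mycompany.com/(\w+)/[^\s]+ </pattern>
    <store_regex_or_name> regex_tagged_as_first_group </store_regex_or_name>
  </instance>
</instances>"""


def entity_values(url, dictionary_xml):
    """Return the value stored for each dictionary instance that matches the URL."""
    values = []
    for instance in ET.fromstring(dictionary_xml).findall("instance"):
        name = instance.findtext("name", default="").strip()
        mode = instance.findtext("store_regex_or_name", default="").strip()
        for pattern in instance.findall("pattern"):
            match = re.search(pattern.text.strip(), url)
            if not match:
                continue
            if mode == "regex_tagged_as_first_group":
                # Store the text captured by the first group.
                values.append(match.group(1))
            else:
                # Store the fixed name given in the <name> element.
                values.append(name)
    return values


by_name = entity_values("http://intranet.example.com/services/it.html", DICTIONARY_XML)
by_group = entity_values("http://www.mycompany.com/policies/travel.html", DICTIONARY_XML)
```

Capturing the name from the URL avoids listing every section by hand, at the cost of less control over the labels that appear in dynamic navigation.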
Adding the Entity to Entity Recognition
Add a new entity, which is defined by the dictionary:
- Click Index > Entity Recognition > Simple Entities.
- On the Simple Entities tab, enter the name of the entity in the Entity name field, for example “type-of-doc.”
- Click Choose File to navigate to the dictionary file in its location and select it.
- Under Case sensitive?, click Yes.
- Under Transient?, click No.
- Click Upload.
- (Optional) Click Entity Diagnostics to test that everything works.
Adding the Entity to Dynamic Navigation
To show URLs as dynamic navigation options, add the entity:
- Click Search > Search Features > Dynamic Navigation.
- Under Existing Configurations, click Add.
- In the Name box, type a name for the new configuration, for example “domains.”
- Under Attributes, click Add Entity.
- In the Display Label box, enter the name you want to appear in the search results, for example “TypeOfUrl.” This name can be different from the name of the entity.
- From the Attribute Name drop-down menu, select the name of the entity that you created, for example “type-of-doc.”
- From the Type drop-down menu, select STRING.
- Select options for sorting entities in the dynamic navigation panel.
- Click OK.
Viewing URLs in the Search Results
After you perform the steps described in the preceding sections, your users will be able to view URLs in the dynamic navigation options, as shown in the following figure.
Note that dynamic navigation only displays the entities of the documents in the result set (the first 30K documents). If documents that contain entities are not in the result set, their entities are not displayed.
However, take note that entity recognition only runs on documents that are added to the index after you enable entity recognition. Documents already in the index are not affected. To run entity recognition on documents already in the index, force the search appliance to recrawl URL patterns by using the Index > Diagnostics > Index Diagnostics page.
This use case describes how you can test your entity recognition configuration on an indexed document that is not in HTML format. To run entity diagnostics on HTML documents, use the Index > Entity Recognition > Entity Diagnostics page in the Admin Console.
Note: In release 7.6.250, entity diagnostics supports all file types.
Testing Entity Recognition on a Cached Non-HTML Document
- Click Index > Diagnostics > Index Diagnostics.
- Click List format.
- Under All hosts, click the URL of the document that you want to test.
- Under More information about this page, click Open in entity diagnostics, as shown below.
Entity diagnostics runs on the cached version of the document and displays the entities found in the document, as shown below.
The search appliance can get entities from metadata content. Choose to get entities from metadata when adding an entity definition:
- Click Index > Entity Recognition.
- Add a new entity by completing the fields.
- From the application area drop-down list, select All or Metadata.
- Click Upload.
Here is an example:
<?xml version="1.0"?>
<instances>
  <instance>
    <name>writer</name>
    <pattern>author=(.*)</pattern>
    <pattern>creator=(.*)</pattern>
    <store_regex_or_name>regex_tagged_as_first_group</store_regex_or_name>
  </instance>
</instances>
Note that entity recognition works on metatags; it does not work on already recognized entities.
Wildcard search enables your users to enter queries that contain substitution patterns rather than exact spellings of terms. Wildcard indexing makes words in your content available for wildcard search.
To disable or enable wildcard indexing or to change the type of wildcard indexing, use the Index > Index Settings page in the Admin Console. For more information about wildcard indexing, click Admin Console Help > Index > Index Settings.
By default, wildcard search is enabled for each front end of the search appliance. You can disable or enable wildcard search for one or more front ends by using the Filters tab of the Search > Search Features > Front Ends page. Take note that wildcard search is not supported with Chinese, Japanese, Korean, or Thai. For more information about wildcard search, click Admin Console Help > Search > Search Features > Front Ends > Filters.