Configuring Distributed Crawling and Serving

This guide contains the information you need to use distributed crawling and serving, a feature of the Google Search Appliance. Distributed crawling and serving is a scalability feature in which several search appliances are configured to behave as though they are a single search appliance. This greatly increases the number of documents that can be crawled and served and greatly simplifies search appliance administration. Use distributed crawling and serving when you need to index content that exceeds the license limit of an individual search appliance.

This document is for you if you are a search appliance administrator, network administrator, or another person who configures search appliances or networks. You need to be familiar with the Google Search Appliance and how to configure crawl, serve, and other features.

On the Admin Console, distributed crawling and serving is configured under GSAn > Configuration.

Introduction to Distributed Crawling and Serving


Distributed crawling and serving is a Google Search Appliance feature that expands the search appliance’s capacity: several search appliances are configured to act as though they are a single search appliance, which greatly increases the number of documents that can be crawled and served. After distributed crawling is enabled, all crawling, indexing, and serving are configured on one search appliance, called the admin master.

For example, if you have four search appliances that are each licensed to crawl 10 million documents, the search appliances can crawl a total of 40 million documents after you create a distributed crawling configuration that includes all four search appliances.
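The capacity arithmetic is simply the per-appliance license limit multiplied by the number of primary nodes (replicas mirror existing shards, so they do not add capacity). A minimal Python sketch, with illustrative numbers rather than values from any real license:

    # Illustrative license math; check Administration > License for real limits.
    per_appliance_limit = 10_000_000   # documents licensed per appliance (example)
    num_primary_nodes = 4              # primaries add capacity; replicas do not

    total_capacity = per_appliance_limit * num_primary_nodes
    print(f"Total crawlable documents: {total_capacity:,}")  # 40,000,000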

In this release, you can serve from the master and nonmaster nodes.

After distributed crawling and serving is configured, the indexes on all search appliances are balanced to distribute the documents evenly among the search appliances.

All search appliances in distributed crawling configurations must be the same search appliance model; for example, all must be model GB-7007 or all must be model G500. You cannot have a GB-7007 and a G500 in the same distributed crawling and serving configuration.

All search appliances must be on the same software version as well. For example, you cannot have one search appliance in the configuration on version 6.8 and another on version 7.0. When you update from one software version to the next, ensure that you update all search appliances in the configuration.

All search appliances must be in the same data center. Distributed crawling requires high bandwidth between the search appliances, and works best when latency is low.

You can use GSA mirroring with a distributed crawling and serving configuration. If a master or nonmaster primary node in the distributed crawling configuration fails, you can promote the mirror node to function as a primary node in the distributed crawling and serving configuration.

Limitations

For information about distributed crawling and serving limitations, see Specifications and Usage Limits.

Back to top

Distributed Crawling Overview


Consider a distributed crawling configuration with four search appliances. Each search appliance is designated as a particular shard in the configuration. Shard 0 is the master search appliance, and the shard number is incremented by 1 for each additional search appliance. The distributed crawling configuration is created on the master and the settings are exported in a configuration file, which is then uploaded to Shard 1, Shard 2, and Shard 3. After the configuration file is uploaded, all search appliance features are configured on the master. The indexes on all of the nodes are synchronized when the master node takes control of the nonmaster nodes. The crawl is distributed among the search appliances and a single index is created. Each search appliance is considered a primary (non-replica) search appliance. All of the search appliances can serve results, and the results for a search query are identical regardless of which search appliance serves them.

After the distributed crawl configuration is set up, the four search appliances behave as if they are a single search appliance. Crawling, serving, collections, front ends, and other features are configured on Shard 0, the master node of the configuration. Feeds are sent only to the admin master. The crawl process is automatically distributed among the four search appliances, and any of the nodes can serve results. Each search appliance in the distributed crawl configuration communicates with all of the other search appliances.

After the configuration is set up, you can add nodes on the Admin Console and the index is automatically redistributed among the existing and new nodes. You can delete nodes by disabling distributed crawling and serving, resetting the index on each search appliance, reconfiguring distributed crawling and serving, and then reindexing the content.

Serving from Master and Nonmaster Nodes

In this release, you can serve results from both the master and nonmaster nodes in distributed crawling and serving configurations, whether or not you have replicas configured and regardless of whether the mirroring configuration is active-active or active-passive.

If you are using a load balancer, a client creates a separate session for each node that it uses. In some cases, this might slow down initial searches because of the overhead added by user authentication requests. You can minimize this issue by using a sticky load balancer that can preserve user sessions for periods of five minutes or more. In the absence of a sticky load balancer, search users might have to log in N times, where N is the number of search appliances in the configuration.
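As an illustration, a reverse proxy in front of the nodes can provide that stickiness. The following nginx sketch is an assumption, not something the appliance ships with, and the host names are placeholders; ip_hash pins each client IP to one node, so a user authenticates once rather than once per node:

    upstream gsa_nodes {
        ip_hash;                              # pin each client IP to one node
        server gsa-shard0.example.com:443;    # placeholder host names
        server gsa-shard1.example.com:443;
        server gsa-shard2.example.com:443;
        server gsa-shard3.example.com:443;
    }

    server {
        listen 443 ssl;
        # ssl_certificate and ssl_certificate_key directives omitted for brevity
        location / {
            proxy_pass https://gsa_nodes;
        }
    }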

Back to top

About Security


The Google Search Appliance uses secret tokens and private IP addresses to enforce security within a distributed crawling configuration.

The search appliances in a distributed crawling configuration authenticate each other using shared secret tokens that you provide during configuration. The shared secret tokens must consist only of printable ASCII characters.
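For example, a token that satisfies this constraint can be generated with a short script; this helper is hypothetical, not part of the appliance software:

    # Hypothetical helper: generate a shared secret token from printable ASCII.
    # Letters and digits keep the token safe to paste into the Admin Console.
    import secrets
    import string

    def make_secret_token(length: int = 32) -> str:
        alphabet = string.ascii_letters + string.digits
        return "".join(secrets.choice(alphabet) for _ in range(length))

    print(make_secret_token())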

There are no restrictions on the public IP addresses assigned to the search appliances in the configuration beyond a requirement that each search appliance must be able to reach every other search appliance’s public IP address on UDP port 500 and on IP protocol number 51 (IPsec AH). Both are used by IPsec, the security protocol for communications among the appliances in the configuration.
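On any firewall between the appliances, that requirement translates into permitting IKE (UDP 500) and the Authentication Header protocol in both directions. A hypothetical iptables sketch, with documentation addresses standing in for your appliances:

    # Hypothetical rules on an intervening firewall; 203.0.113.x are placeholders.
    # IKE key exchange between the two appliances:
    iptables -A FORWARD -s 203.0.113.10 -d 203.0.113.20 -p udp --dport 500 -j ACCEPT
    iptables -A FORWARD -s 203.0.113.20 -d 203.0.113.10 -p udp --dport 500 -j ACCEPT
    # IPsec Authentication Header (IP protocol 51):
    iptables -A FORWARD -s 203.0.113.10 -d 203.0.113.20 -p 51 -j ACCEPT
    iptables -A FORWARD -s 203.0.113.20 -d 203.0.113.10 -p 51 -j ACCEPT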

Certain communications among the search appliances in a distributed crawling configuration are conducted over a virtual private network, including search requests, search credentials transmitted as sessions, and search results that include snippets, whether the results are authorized or not authorized. When you set up a distributed crawling configuration, you must assign the private IP addresses and secret tokens to each machine in the configuration.

The following guidelines apply to the private network IP addresses that you assign in a distributed crawling configuration (a validation sketch follows the list):

  • You can assign or change the private IP addresses at any time.
  • The private IP addresses must be different from the IP addresses that will be crawled on your internal network. For example, if you use 10.0.0.0/8 for your intranet, then you should choose the private IP addresses from the 192.168.0.0/24 network. If the 192.168.0.0/24 network is also in use, try 192.168.1.0/24 or the 172.16.0.0/12 range.
  • The private IP addresses must conform to the private address space as defined in RFC 1918 and must not overlap with any other private address space used on your network.
  • The private network addresses cannot be in the range spanning subnet /16 to /8.
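A short script can check a candidate network against the first three guidelines; this is an illustrative sketch, not a GSA-provided tool:

    # Illustrative check: the candidate private network must be RFC 1918 space
    # and must not overlap networks already in use on your intranet.
    import ipaddress

    RFC1918 = [
        ipaddress.ip_network("10.0.0.0/8"),
        ipaddress.ip_network("172.16.0.0/12"),
        ipaddress.ip_network("192.168.0.0/16"),
    ]

    def usable(candidate: str, networks_in_use: list[str]) -> bool:
        net = ipaddress.ip_network(candidate)
        if not any(net.subnet_of(block) for block in RFC1918):
            return False  # outside RFC 1918 private address space
        return not any(net.overlaps(ipaddress.ip_network(n))
                       for n in networks_in_use)

    # Intranet crawls 10.0.0.0/8, so pick the private addresses elsewhere:
    print(usable("192.168.0.0/24", ["10.0.0.0/8"]))  # True
    print(usable("10.1.0.0/24", ["10.0.0.0/8"]))     # False: overlaps intranet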

Back to top

Before You Configure Distributed Crawling and Serving


This section provides a checklist of information you need to collect and decisions you need to make before you configure distributed crawling and serving.

For each task, record your values as you complete it.

  • Determine which Google Search Appliances will participate in the configuration. Any Google Search Appliance model running software version 6.0 or later can participate, but all search appliances must be the same model running the same software version.

  • Determine the appliance IDs of the participating search appliances. The appliance IDs can be found on the Admin Console under Administration > License or by right-clicking the About link on any Admin Console page and choosing Open link in new tab.

  • Determine the host names or public IP addresses of the search appliances in the configuration. The host names or IP addresses are required during the initial configuration process.

  • Determine the virtual private network IP addresses for the search appliances. The network IP addresses are used for private communication among the search appliances in the configuration. They must conform to the private address space as defined in RFC 1918 and must not overlap with any other private address space in use on your network.

  • Determine which search appliance is the master search appliance in the configuration. Crawl, search, and index are all configured on the master search appliance.

  • Determine the secret tokens that the search appliances will use to recognize each other within the configuration. The nodes in the configuration use the secret tokens to authenticate to each other. A secret token must include only printable ASCII characters. Each search appliance in a distributed crawling configuration has its own associated secret token, which you specify on the GSAn > Host Configuration page.

  • Determine whether the master node is crawling or has an index from which it is serving. Do not start the crawl on the node before configuring distributed crawling and serving.

  • Determine whether the search appliances in the configuration crawled substantially similar bodies of documents. If they did, the indexes are substantially similar, and rebalancing the index after you set up the distributed crawling and serving configuration will be inefficient. In this situation, Google recommends that you reset the index on the nonmaster nodes before you set up the configuration.

  • Configure feeds only on the master. Feeds can only be indexed on the master.

  • If you are using Kerberos, ensure that you configure Kerberos on the master and nonmaster nodes. Kerberos keytab files are unique and cannot be used on more than one search appliance, so you must generate and import a different Kerberos keytab file for each search appliance. When you configure Kerberos on a nonmaster node, use a different Mechanism Name from the one used for the master; the nonmaster node’s Mechanism Name is then synchronized automatically to match the master’s. A keytab-generation sketch follows this checklist.

  • If you are using SSL certificates, ensure that you install them on the master and nonmaster nodes.
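For example, on an Active Directory domain controller you might generate a distinct keytab for each appliance with ktpass; the principal names, service accounts, and realm below are placeholders, and the HTTP/ service-principal form is an assumption to adapt to your environment:

    rem Hypothetical ktpass invocations, one keytab per appliance.
    ktpass /princ HTTP/gsa-shard0.example.com@EXAMPLE.COM /mapuser svc-gsa0@EXAMPLE.COM ^
        /ptype KRB5_NT_PRINCIPAL /crypto AES256-SHA1 /pass * /out gsa-shard0.keytab
    ktpass /princ HTTP/gsa-shard1.example.com@EXAMPLE.COM /mapuser svc-gsa1@EXAMPLE.COM ^
        /ptype KRB5_NT_PRINCIPAL /crypto AES256-SHA1 /pass * /out gsa-shard1.keytab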


Back to top

Configuring Distributed Crawling and Serving


Observe the following precautions when configuring distributed crawling:

  • Do not configure both a unified environment and distributed crawling.
  • Feeds must be configured only on the admin master search appliance.

If the search appliances you are using in the distributed crawling and serving configuration crawled similar document bodies, Google recommends that you reset the indexes on the nonmaster search appliances before configuring distributed crawling and serving.

To configure distributed crawling and serving:

  1. Log in to the Admin Console of the machine intended to be the master search appliance.
  2. If the crawl is currently running or if the search appliance already has an index from which it is serving, click Content Sources > Diagnostics > Crawl Status > Pause Crawl.
  3. Click GSAn > Configuration.
  4. Type the number of shards in the Number of shards field. A shard in the distributed crawling configuration comprises a primary search appliance and, optionally, one or more search appliances (replicas) in a mirroring configuration.
  5. Type the total number of nodes (search appliances) to be configured in the Number of nodes field. This number includes the primary search appliances, as well as replica search appliances to be configured.
  6. Under Distributed Crawling & Serving Administration, click Enable. A configuration form is displayed, listing each shard in the configuration by number. The master node is shard 0. Each additional shard is assigned a number incremented by 1. If there are four search appliances in the configuration, the shards are assigned numbers 0, 1, 2, and 3.
  7. If you previously saved a configuration that you want to reapply, load the saved configuration file using the Import/Export GSAn Configuration field and skip to step 21.
  8. Click the View/Edit link corresponding to the master shard. You see a screen that says There is no node in this shard. Add a node to this shard.
  9. Click Add. A form appears on which you enter information about the new node.
  10. On the drop-down list, choose Primary.
  11. Type in the node’s GSA Appliance ID.
  12. Type in the Appliance hostname or the IP address of the search appliance.
  13. Type in the Admin username for the search appliance.
  14. Type in the Password for the Admin username.
  15. Type in the Network IP of the search appliance.
  16. Type in the Secret token of this search appliance.
  17. If Admin NIC is enabled on the search appliance that you are adding, click Admin NIC enabled on remote node? and type the IP address of the search appliance in IP Address.
  18. Click Save.
  19. Click the GSAn Configuration link.
  20. Repeat steps 8 through 18 on the current search appliance for each of the other shards in the distributed crawling configuration. When you are finished, each shard in the configuration is defined. Do not proceed to step 21 until all nodes are configured.
  21. When all nodes are configured, click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.
  22. Optionally, click Export and save the distributed crawling configuration file to your local computer.
  23. On the admin master node, click Content Sources > Diagnostics > Crawl Status and restart the crawl.

Adding a Node to an Existing Configuration

Use these instructions to add a node to an existing distributed crawling and serving configuration.

To add a node:

  1. Log in to the Admin Console of the master search appliance.
  2. If the crawl is currently running or if the search appliance already has an index from which it is serving, click Content Sources > Diagnostics > Crawl Status > Pause Crawl.
  3. Click GSAn > Configuration.
  4. Click the View/Edit link corresponding to the shard in which the new node is to be added.
  5. Click Add. A form appears on which you enter information about the new node.
  6. On the drop-down list, choose Secondary.
  7. Type in the node’s GSA Appliance ID.
  8. Type in the Appliance hostname or the IP address of the search appliance.
  9. Type in the Admin username for the search appliance.
  10. Type in the Password for the Admin username.
  11. Type in the Network IP of the search appliance.
  12. Type in the Secret token of this search appliance.
  13. If Admin NIC is enabled on the node, click Admin NIC enabled on remote node? and type the IP address of the node in IP Address.
  14. Click Save.
  15. Click the GSAn Configuration link.
  16. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.
  17. Optionally, click Export and save the distributed crawling configuration file to your local computer.
  18. On the admin master node, click Content Sources > Diagnostics > Crawl Status > Resume Crawl.

Adding a Shard to an Existing Configuration

Use these instructions to add a shard to an existing distributed crawling and serving configuration.

To add a shard:

  1. Log in to the Admin Console of the master search appliance.
  2. If the crawl is currently running or if the search appliance already has an index from which it is serving, click Content Sources > Diagnostics > Crawl Status > Pause Crawl.
  3. Click GSAn > Configuration.
  4. Click the Add Shard link and click the View/Edit link corresponding to the newly added shard.
  5. Click Add. A form appears on which you enter information about the new node.
  6. On the drop-down list, choose Primary.
  7. Type in the node’s GSA Appliance ID.
  8. Type in the Appliance hostname or the IP address of the search appliance.
  9. Type in the Admin username for the search appliance.
  10. Type in the Password for the Admin username.
  11. Type in the Network IP of the search appliance.
  12. Type in the Secret token of this search appliance.
  13. If Admin NIC is enabled on the shard that you are adding, click Admin NIC enabled on remote node? and type the IP address of the shard in IP Address.
  14. Click Save.
  15. Click the GSAn Configuration link.
  16. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.
  17. Optionally, click Export and save the distributed crawling configuration file to your local computer.
  18. On the admin master node, click Content Sources > Diagnostics > Crawl Status > Resume Crawl.

Deleting a Node from an Existing Configuration

  1. Log in to the Admin Console of the master node.
  2. If the crawl is currently running, click Content Sources > Diagnostics > Crawl Status > Pause Crawl.
  3. Click Index > Reset Index and click Reset the Index Now.
  4. Log in to each node and reset the index on each node.
  5. On the master node, click GSAn > Configuration.
  6. Click the Edit link for the shard configuration that contains the node that you want to delete.
  7. Delete the node.
  8. Click Save.
  9. Click the GSAn Configuration link.
  10. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.
  11. Optionally, click Export and save the distributed crawling configuration file to your local computer.
  12. On the admin master node, click Content Sources > Diagnostics > Crawl Status and restart the crawl.

Recovering When a Node Fails


In a distributed crawling and serving configuration, crawling is divided among the different nodes. For example, if node 1 in a three-node configuration discovers a URL that node 2 should crawl, node 1 forwards the URL to node 2.
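The appliance’s actual partition function is internal, but the routing decision can be pictured as a deterministic hash from URL to shard, so every node independently agrees on which node owns a URL. A sketch, with the hash rule assumed purely for illustration:

    # Illustrative only: the GSA's real partition function is not documented.
    # A deterministic hash maps each discovered URL to exactly one shard,
    # so any node can decide locally whether to crawl or forward a URL.
    import hashlib

    NUM_SHARDS = 3

    def owning_shard(url: str) -> int:
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % NUM_SHARDS

    discovered = "http://intranet.example.com/docs/policy.html"
    shard = owning_shard(discovered)
    # The discovering node crawls the URL if it owns this shard;
    # otherwise it forwards the URL to the owning node.
    print(f"URL belongs to shard {shard}")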

When a node in the distributed crawling and serving configuration fails, crawling continues on the running nodes unless one of the running nodes discovers a URL that the failed node should crawl. At this point, all crawling stops until the failed node is running again and the link can be forwarded for crawling.

Back to top

Recovering from Node Failure When GSA Mirroring is Enabled

When a primary search appliance fails in a distributed crawling configuration and GSA mirroring is enabled, promote a mirror node to primary and update the other search appliances in the configuration by importing a new GSAn configuration file.

If the primary Google Search Appliance fails and a replica search appliance is promoted to be the primary, do not directly add the former primary node back as the primary, because this will cause problems in the mirroring configuration. If you need to use the former primary search appliance as the primary, add it as a replica of the new primary first. Wait until all index and configuration data are fully synchronized with the new primary node, and then you can add the search appliance as the primary again.

When the Failed Node is the Master Node

To recover from a node failure when GSA mirroring is enabled and the failed node is the master node:

  1. On all nodes, log in to the Admin Consoles and click Content Sources > Diagnostics > Crawl Status > Pause Crawl.
  2. On all nodes, click GSAn > Configuration and click Disable GSAn.
  3. Log in to the Admin Console of a former replica node to promote it to be the new master node.
  4. Reconfigure GSAn distributed crawling and serving by selecting the previous nonmaster nodes as nonmasters of this new master node.
  5. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.

When the Failed Node is Not the Master Node

To recover from a node failure when GSA mirroring is enabled and the failed node is a primary search appliance but not the master:

  1. Log in to the Admin Console of the master node in the distributed crawling and serving configuration.
  2. Click GSAn > Configuration.
  3. Click the Edit link for the shard configuration that contains the failed node.
  4. Delete the failed node.
  5. Add a replica to replace the failed primary.
  6. Click Save.
  7. Remove the new primary search appliance from the list of replica search appliances.
  8. Click the GSAn Configuration link.
  9. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.

Recovering from Node Failure When GSA Mirroring is Not Enabled

To recover from a node failure when GSA mirroring is not enabled, you must add a new Google Search Appliance to the configuration. If you do not have an additional search appliance, delete and recreate the distributed crawling and serving configuration and recrawl the content.

When the Failed Node is the Master Node

To recover from a node failure when GSA mirroring is not enabled and the failed node is the master node:

  1. To promote a nonmaster node to be the new master node, log in to the Admin Console of that node and click Content Sources > Diagnostics > Crawl Status > Pause Crawl.
  2. On all nodes, click GSAn > Configuration and click Disable GSAn.
  3. Reconfigure GSAn distributed crawling and serving, adding the new node as a nonmaster node.
  4. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.

When the Failed Node is Not the Master Node

To recover from a node failure when GSA mirroring is not enabled and the failed node is not the master node:

  1. Log in to the Admin Console of the master search appliance in the distributed crawling configuration.
  2. Click GSAn > Configuration.
  3. Edit the shard containing the failed node.
  4. Delete the failed node.
  5. Click Save.
  6. Add the new search appliance to the configuration.
  7. Click Save.
  8. Click the GSAn Configuration link.
  9. Click Apply Configuration. This broadcasts the configuration data to all appliances in the GSAn network. Note that document serving will be interrupted briefly on the master node after you click Apply Configuration.

Back to top
