Designing a search solution

This document provides information on best practices for designing an enterprise-class solution using a Google Search Appliance.

About this document

The information in this document is for customers who want to deploy the Google Search Appliance models GB-1001, GB-5005, GB-7007, and GB-8008. Google provides a Planning guide and an Installation guide that explain how to set up and configure the search appliance. This document covers the things that need to be done "off the box." You can use it in conjunction with the planning and installation guides as a checklist to be sure that you have followed best practices in your deployment.

The information in this document came from the experience of Google's support organization in helping customers and is intended to cover situations outside the scope of the product documentation. For example, it is highly recommended that you set up external monitoring, and there are some issues that search appliance administrators need to understand when doing so. By following the recommendations below, you can avoid some of the pitfalls we have seen at customer sites, which have led to prolonged interactions with support.

Setting up monitoring

The search appliance does extensive monitoring of its internal processes and has the ability to fix itself when it detects a problem. However, there are cases in which the internal monitoring on the search appliance will not pick up a failure:

  • If the problem is "off the box." For example, serving could be down due to a DNS problem, a switch problem, network connectivity problem, or a problem in the portal.
  • If the problem cannot be picked up by internal monitoring. For example, a customer may consider it a serving failure if crawl problems for a specific server prevent its content from getting into the index.
  • If the internal monitoring on the search appliance is generating a false positive. Serving could be interrupted without process failures, for example due to high load.

For this reason, customers should implement external monitoring to check that serving, crawling and indexing are all working. In many cases, a customer will have existing monitoring software that can do HTTP-level monitoring to verify that serving is up. However, there are some specific issues described below that need to be taken into account, so it may be necessary to use a customized solution for the external monitoring.

Monitoring serving

Serving results on the search appliance could fail for the following reasons:

  • A problem outside the appliance, such as a network failure.
  • A problem on the appliance, that has not been detected by the appliance's internal monitoring. For example, high query load on the appliance may be causing slow responses.
  • A problem that is specific to a particular environment, such as the appliance's crawling credentials no longer fetching the correct content.

Search appliance administrators should monitor serving of results in order to detect possible failures. With good monitoring, you will be able to failover more quickly to a hot backup in the event of a serving failure. You will make it easier for Google support to assist you in finding the root cause of a failure if you can get good data on the extent of a problem.

Some best practices for monitoring serving are described below:

  • Send HTTP and/or HTTPS search queries directly to the appliance. If you normally retrieve search results through a portal, you should separately monitor the health of your portal.
  • Check that the HTTP response code is 200 and also that results are returned.
  • If a non-200 response is returned then log the detailed error along with the timestamp.
  • Send a unique search term to the appliance every time you send a monitoring query. If Google support needs to investigate a failure, this makes it easier to trace a query term through the logs.
  • Append a custom CGI parameter to your monitoring queries so that you can easily remove them from your search reports.
  • The timestamp on your monitoring system should be synchronized with the system clock on the appliance, by using an NTP server to set both clocks. This makes it easier to investigate slow responses.
  • Run queries that do not consume significant resources on the appliance. Keep the num parameter low and do not perform date sorts. Sending one query per minute is usually sufficient to reveal problems in a timely manner.
  • Keep a history of response times for your monitoring queries so that you can detect periods of slower than usual responses.
  • If you have specific requirements that determine whether or not search is working, monitor for those conditions. For example, if all your results are secure, send your credentials with the monitoring query; if date sort is critical, check that the date attribute is set in the results XML.
  • Send a notification if a failure has occurred. You can also perform automatic failover if needed.

An example script for monitoring serving is available in the Google Search Appliance Admin Toolkit.
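
The practices above can be sketched as a small probe in Python. The appliance hostname, front end, collection, and query term below are placeholders for your own deployment; here the unique value is carried in a custom CGI parameter (so the query still returns results and can still be traced in the logs), though you could instead vary the term itself.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

def build_probe_url(appliance, term):
    """Build one low-cost monitoring query URL."""
    params = urllib.parse.urlencode({
        "q": term,                       # a term known to match your index
        "num": "1",                      # keep num low to limit load
        "output": "xml_no_dtd",
        "client": "default_frontend",    # assumed front end name
        "site": "default_collection",    # assumed collection name
        "probe": str(int(time.time())),  # unique custom parameter for tracing
    })
    return "%s/search?%s" % (appliance, params)

def classify(status, body):
    """OK only if the status is 200 and a results element came back."""
    if status != 200:
        return "FAIL non-200 status %d" % status
    if b"<RES" not in body:
        return "FAIL 200 but no results"
    return "OK"

def probe(appliance="http://search.example.com", term="intranet"):
    url = build_probe_url(appliance, term)
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            verdict = classify(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        verdict = classify(err.code, err.read())
    # Log the verdict, a timestamp, and the latency for later analysis.
    print("%s %s latency=%.2fs" % (time.ctime(), verdict, time.time() - start))
```

Run `probe()` once per minute from your scheduler and keep the printed latency history, as recommended above.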

It is not possible to monitor the CPU and disk activity on the appliance. The best way to avoid overloading the appliance with too many concurrent queries is to follow the instructions in the section on managing high load.

Monitoring crawling and indexing

In many cases, if an appliance does not have fresh content in the index, it is considered to be a critical failure. Therefore, appliance administrators should monitor that documents are being crawled and indexed on schedule. The appliance does provide Crawl Diagnostics in the Admin Console, but it is advisable to have monitoring off the box to check for failures that were not detected by the appliance itself.

Here is one way to do this by monitoring the cached copy of a document:

  • Create a document on your web server that displays the current timestamp each time it is fetched.
  • This document should contain at least 20 words. The search appliance does not calculate the change interval for very small documents, so a very small document will not be recrawled frequently.
  • Configure your appliance to crawl this document. Add the URL to Crawl Frequently patterns in Freshness tuning to ensure that it gets crawled at least once per day.
  • The maximum frequency with which the URL on your web server can be recrawled is every 15 minutes, but recrawl may be less frequent if there is a queue of other URLs on the appliance. It can take up to a week for the appliance to learn the change interval of your URL, so you will need to wait that long for the appliance to reach its maximum recrawl rate.
  • Periodically send a search query for the cached copy of the document with the timestamp. You can get the cached copy of a document in the index using the cache special query term.
  • Calculate the difference between the current time and the time the document was crawled, according to the timestamp in the cached content.
  • Record a history of the time differences and trigger an error if the cached content in the index is stale.
  • This monitoring strategy is not sufficient to detect all crawl and index failures in clusters (GB-5005 and GB-8008 models). Please contact Google support for advice on configuring monitoring of crawling and indexing if you have a cluster.

An example script for checking the timestamp in the cached copy is available in the Google Search Appliance Admin Toolkit.
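
The steps above can be sketched as follows. The appliance host and the monitored URL are placeholders, and the timestamp format ("crawled-at:" followed by epoch seconds) is an assumed convention that your web server page would have to emit; only the cache special query term is a real appliance feature.

```python
import re
import time
import urllib.parse
import urllib.request

STALE_AFTER_SECONDS = 24 * 3600  # alert if the cached copy is over a day old

def cache_query_url(appliance, monitored_url):
    """Build the special query that returns the cached copy of a document."""
    q = urllib.parse.urlencode({"q": "cache:" + monitored_url,
                                "output": "xml_no_dtd"})
    return "%s/search?%s" % (appliance, q)

def staleness(cached_body, now):
    """Seconds since the timestamp embedded in the cached page.

    Assumes the monitored document prints 'crawled-at: <epoch seconds>'.
    """
    match = re.search(rb"crawled-at: (\d+)", cached_body)
    if match is None:
        raise ValueError("no timestamp found in cached copy")
    return now - int(match.group(1))

def check(appliance="http://search.example.com",
          monitored_url="http://www.example.com/timestamp.html"):
    url = cache_query_url(appliance, monitored_url)
    with urllib.request.urlopen(url, timeout=30) as resp:
        age = staleness(resp.read(), time.time())
    state = "STALE" if age > STALE_AFTER_SECONDS else "OK"
    print("%s: cached copy is %.1f hours old" % (state, age / 3600))
```

Record each computed age so that you have the history of time differences described above.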

Network Security

Appliance administrators will need to ensure that the appliance meets their organization's policies for network security.

Firewall configuration

If you isolate the appliance behind a firewall, you can selectively block access. Here are some reasons you may want to do this:

  • Block access to the Admin Console on port 8000, so that users can only get to the Admin Console on port 8443 (which uses HTTPS).
  • Restrict access to the appliance based on end users' IP address.
  • Prevent a Denial Of Service attack.
  • Restrict users from accessing the appliance except through a web application, sometimes known as a "portal".

In order to configure the firewall, you will need to know what ports are used by the network interface during normal operations. See the Planning Guide for a list of the ports used by the search appliance.

Designing a search application

You can permit your users to directly connect to the Google Search Appliance to retrieve search results. In some circumstances, however, you may find advantages to placing a system in front of the appliance.

This system can provide additional functions that are not part of search, yet may be considered useful when running a network service. Below are two benefits that the additional system can provide: error handling and managing high load.

Your application should be designed so that you do not add too much latency to the user experience. For example, if your application sends multiple queries in parallel to the appliance in order to satisfy a single search request from a user, you should have a strategy to ensure that you can respond quickly even if one query in a batch is slow.
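
One such strategy, sketched in Python under assumed names: fan the queries out in parallel, wait only up to a fixed latency budget, and respond with whatever subset finished in time.

```python
import concurrent.futures

def fan_out(queries, fetch, budget_seconds=2.0):
    """Run all queries in parallel; return whatever finished within budget.

    `fetch` is a placeholder for your function that sends one query to the
    appliance and returns its results. Slow or failed queries are dropped,
    so the user still gets a fast, if partial, response.
    """
    pool = concurrent.futures.ThreadPoolExecutor(
        max_workers=max(len(queries), 1))
    futures = {pool.submit(fetch, q): q for q in queries}
    done, not_done = concurrent.futures.wait(futures, timeout=budget_seconds)
    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    for future in not_done:
        future.cancel()          # best effort; a running query may continue
    pool.shutdown(wait=False)    # do not block on stragglers
    return results
```

Whether partial results are acceptable is a product decision; the point is that one slow query in a batch should not hold the whole response hostage.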

Error handling

The search appliance is designed to correct its own problems. In rare cases, however, users can get an error from a search request. You can control how these errors are presented to the user with a script that runs on your portal: users send a search request to the script, the script formats the request and sends it to the appliance, and the search appliance sends the response back to the web server, which can process the results before sending them to the user. Here are some example strategies for handling errors in such a script.

  • If the HTTP status code of the response is 200 no error has occurred. Send the results back to the user.
  • If the HTTP status code is 500 then an unexpected error has occurred. The script can retry the search request or send an error to the user.
  • If the HTTP status code is 404 the user has requested a URL that does not exist. Send an appropriate error message to the user.
  • Set a timeout in your script. If the appliance does not respond within the specified time, the script can attempt to ping the appliance. If ping fails, then send an error to the user. If ping succeeds, then retry the search request once more. If that fails, send an error to the user.

A benefit of handling errors in your application is that you will have real-time statistics on the number and type of errors that you are getting, and do not have to rely on exporting reports from the appliance.
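
The strategies listed above might be sketched as follows. The error strings and the retry policy are illustrative choices, and a TCP connect stands in for ping, since sending raw ICMP usually requires elevated privileges.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request

def reachable(host, port=80, timeout=3):
    """A TCP connect stands in for ping (raw ICMP needs root privileges)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def query_with_fallback(url, timeout=10):
    """Return result XML, or an error string to show the user."""
    def search():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    try:
        return search()                       # status 200: pass results through
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return "error: the requested page does not exist"
        if err.code == 500:
            try:
                return search()               # one retry on unexpected error
            except Exception:
                return "error: search is temporarily unavailable"
        return "error: unexpected status %d" % err.code
    except (socket.timeout, urllib.error.URLError):
        host = urllib.parse.urlsplit(url).hostname
        if not reachable(host):
            return "error: the search appliance is unreachable"
        try:
            return search()                   # appliance is up; retry once
        except Exception:
            return "error: search is temporarily unavailable"
```

In a real portal you would render these error strings into your results page template and also count them for the real-time statistics mentioned above.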

Managing high load

You can avoid errors or timeouts from exceeding the capacity of the search appliance by limiting the number of concurrent requests that your application sends. Google cannot give a value for the maximum throughput of a search appliance in queries per second, because it will be different for every customer, depending on index size and type of queries. However, Google can tell you the maximum number of queries that each appliance model can process concurrently.

Model     Max. concurrent requests
GB-1001    5
GB-7007   50
GB-9009   50
G100      50
G500      50

If you send more than the maximum number of concurrent requests to the search appliance, it will queue requests until a processing thread becomes available. If too many requests are queued, the search appliance will immediately return a 503 "Service Unavailable" error rather than add a new request to the queue. The search appliance can also return a 500 or 504 error response if a processing thread is unable to respond with results within a certain time. The internal timeout period on the search appliance before a 500/504 error is thrown can vary depending on the state of the response.

Your application can limit the number of requests sent to the search appliance so that you do not exceed the number of available processing threads, making it unlikely that you will exceed the capacity of your appliance. All queries for a search appliance would then need to be passed through a reverse proxy that can track the number of currently active queries.

Google recommends that search applications be designed to respond as fast as possible to user queries. If you find that your search requests are getting queued by the reverse proxy before being sent to the search appliance, consider deploying additional search appliances or making your queries run more efficiently.

An example script for queueing connections to the search appliance is available in the Google Search Appliance Admin Toolkit.
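
The core of such a proxy is just a counted gate in front of the appliance. A minimal sketch, assuming a limit of 50 (matching the larger models in the table above; adjust for your model) and omitting the surrounding HTTP plumbing:

```python
import threading

class ApplianceThrottle:
    """Allow at most `limit` concurrent requests through to the appliance."""

    def __init__(self, limit=50):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, fetch, timeout=30):
        """Wait up to `timeout` seconds for a free slot, then run `fetch`.

        `fetch` is a placeholder for your function that sends one query to
        the appliance. Raising here lets the caller return its own error
        page instead of letting the appliance reply with a 503.
        """
        if not self._slots.acquire(timeout=timeout):
            raise RuntimeError("appliance busy: request queued too long")
        try:
            return fetch()
        finally:
            self._slots.release()
```

Each proxy worker would call `throttle.run(send_query)`; requests beyond the limit block briefly in the semaphore queue, which is exactly the behavior recommended above.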

System testing

Performance testing

You should run tests to be sure that you will have acceptable performance under production loads. You should be sure that the search appliance will handle short term spikes in load that may occur infrequently. The performance of the search appliance can vary greatly depending on factors such as index size, document size, document type and search parameters. The assumptions that you use when running your tests will have a big impact on the results that you see. Factors that can reduce serving performance include:

  • Larger number of results per query, set by the num parameter. Note that the query will not be slower if num is 100 but only 10 results actually match the search term.
  • Larger value for the start parameter. If start is set to 100 then the search appliance has to process results 1-99 before finding the results that need to be returned.
  • Lots of documents being filtered from the results, if the value of the filter parameter is enabled.
  • Date sorts, particularly where mode is set to R.
  • Large index sizes. The index size can be related to the number of distinct terms in documents, as well as the total number and size of documents. Note that there can be a significant drop in performance for a small increase in index size, if the increase causes the search appliance to start going to disk rather than memory to respond to queries.
  • Slow onebox responses.
  • Expensive Remove URL patterns, which get matched at serving time.
  • Slow authorization responses for secure searches.
  • Setting the proxyreload parameter to 1 which refreshes the stylesheet cache on each request.
  • Using results biasing.
  • Search appliance features that generate additional search queries, which are described in a later section.
  • The search appliance model that you are using. Be careful not to make assumptions about performance from tests on a different model than the one you will use in production. A five-node cluster does not have five times the throughput of a GB-1001.
  • Crawling or feeding at a high rate, particularly if the documents require conversion to HTML.
  • Using the query expansion feature.

Ideally, therefore, you should run your tests with the same corpus of documents that you will be using in production. You should also crawl or feed documents as normal while running your tests. It is important to use realistic search queries for your load tests. You can get a list of query terms from your legacy search solution, if one is available. You should also pay particular attention to the query parameters that you send to the appliance, since these can have a big effect on performance. For example, if you expect frequent date sorts or queries that return a large number of results, you should be sure to include these in your tests.

When measuring serving performance you should consider both throughput and latency since both will have an impact on the experience of your users. Throughput and latency are closely related. The search appliance has a fixed number of threads for processing requests. If your load testing script uses the same number of threads to send queries to the search appliance, you can calculate the maximum throughput given the average latency or vice versa. For example, suppose you have five threads continually sending queries to a GB-1001 and you see a throughput of 1200 queries per minute. Your average latency would then be 0.25 seconds per query (5 × 60 / 1200). In many cases, it will be more meaningful to know the median latency or the maximum latency seen by the fastest 90 per cent of all queries.
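
This relationship is an instance of Little's law (concurrency = throughput × latency), and can be checked with the same figures:

```python
# Little's law: concurrency = throughput * latency, so with all five
# load-test threads kept busy against the appliance:
threads = 5
throughput_qps = 1200 / 60           # 1200 queries per minute = 20 qps
latency = threads / throughput_qps   # average seconds per query
print(latency)  # 0.25
```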

An example script for load testing is available in the Google Search Appliance Admin Toolkit.
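
A bare-bones load test along the lines described above might look like this sketch; the appliance host, query terms, and thread count are placeholders, and the thread count should match the processing-thread count of your model.

```python
import statistics
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def timed_query(appliance, term):
    """Send one query and return its latency in seconds."""
    url = "%s/search?%s" % (appliance, urllib.parse.urlencode(
        {"q": term, "output": "xml_no_dtd"}))
    start = time.time()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    return time.time() - start

def summarize(latencies):
    """Median and p90 are usually more meaningful than the mean."""
    ordered = sorted(latencies)
    p90_index = min(len(ordered) - 1, int(len(ordered) * 0.9))
    return {"median": statistics.median(ordered), "p90": ordered[p90_index]}

def load_test(appliance, terms, threads=5):
    """Send `terms` from `threads` concurrent workers and summarize latency."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(lambda t: timed_query(appliance, t), terms))
    return summarize(latencies)
```

Feed `terms` from your legacy search logs, as suggested above, and include the expensive parameters (num, sort) that you expect in production.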

Search Appliance Features that Generate Additional Search Queries

The search appliance features described below generate search queries in addition to a user's original search query. Each of these additional queries consumes a separate processing thread.

OneBox modules that Use an Internal Provider

When the search appliance invokes a OneBox module that uses an internal provider, the search appliance performs a query against the collection defined in the OneBox module definition. The query is submitted through the same front end as the query submitted by the user, and consumes the same resources. If the search appliance invokes several OneBox modules that use internal providers, a query is generated for each of them.

Dynamic Result Clusters

To generate a dynamic result cluster, the search appliance submits a query. The search appliance uses the results from the query to build the dynamic result cluster.

Query Suggestions

Query suggestions generate a query every time the search appliance provides a new suggestion list. Each query takes one processing thread.

When query suggestions are enabled, the user's keystrokes for a query term are queued before they are sent to the back end. The default idling interval for fast typers is 300 milliseconds. With this interval, the appliance can potentially receive one query for each keystroke, thereby consuming more than one processing thread.

The idling interval for fast typers is specified in the XSLT stylesheet for the front end. If the search appliance experiences capacity issues, you might try increasing the value from 300 milliseconds to 2000 milliseconds.

To change the value of the idling interval for fast typers:

  1. Open the XSLT stylesheet for the front end in an editor.
  2. Scroll to Idling interval for fast typers and change the value of var ss_wait_millisec from 300 to 2000, as shown in the following code example.
      /**
       * Idling interval for fast typers.
       * @type {number}
       */
      var ss_wait_millisec = 2000;
  3. Save your changes.

Relevance testing

For some of your most popular queries, find out what users believe to be the most relevant result. What is the actual top result on the search appliance? Note that relevance ranking should be assessed by users, because search administrators, in our experience, can occasionally have a different viewpoint on the most relevant result. To understand why a document is considered relevant for a search query, you can look at the context of your search term in the document. If the search terms are in a header or title, for example, then the document is likely to be more relevant. Note that the terms in your query may be expanded by the search appliance to include related terms. When assessing relevance, you should also consider the PageRank of a document. If the document has a lot of inbound links from well-linked pages, then its relevance ranking will be boosted.

Feature testing

If your users depend on special features of the search appliance then your testing should test those features. Some examples of special features that you may require:

  • Sort by date
  • Crawling of large documents
  • Secure crawling and serving
  • Filtering by metadata

High availability

All search appliances are susceptible to hardware and software failure. Even GB-5005 and GB-8008 clusters have single points of failure in their design. For example, the power supplies, switch, and load balancer are not redundant. Therefore, it is necessary to plan for failover in the event of a failure. There are several possible strategies, depending on how critical your search application is to the business.

You can configure redundant systems and failover to new search appliances if you suffer a failure. Normally, it is sufficient to have enough redundancy to handle a single search appliance failure. For example, if your peak load can be handled by three search appliances, you would only need one additional search appliance for failover. However, in some cases, you may also want to protect yourself against network-level failures and locate your redundant systems in a different data center. In this case, if your peak load could be handled by three search appliances, then you would need an additional three search appliances to provide redundancy in a separate data center.

If the search application is not critical to the business, then you could consider alternative failover strategies. For example, if your content is hosted on a public web site you could failover to Google Site Search. In some cases, it may be possible to accept search outages and therefore you can simply display an error message on your search form page.

In cases where search is critical to the business, high availability can be provided by a load balancer or DNS switchover. For more information on how to set this up, see Configuring Search Appliances for Load Balancing or Failover. Note that load balancers may be used to provide additional capacity as well as failover capabilities.

Planning for problem recovery

If you take steps at the beginning to prepare for potential problems, you will find it easier to recover if a problem occurs. Some things that you should do in the deployment phase in order to resolve future problems more efficiently:

  • Configure remote access for Google support to access your search appliances. You should verify that potential remote support access is in compliance with local network security policies and will work in the search appliance's intended production environment.

  • Ensure that you have access to the web logs for all web servers that are crawled so that you can easily resolve crawling and authorization problems. If you cannot access the web logs, you may want to set up a crawl proxy that lets you see the exact requests and responses during a crawl.

Troubleshooting tools

You should have access to the following tools in order to troubleshoot problems on the search appliance.

  • Firefox LiveHTTPHeaders or another way to see the HTTP headers sent to and received from the search appliance.
  • Tcpdump, Wireshark or another method for packet capture between the search appliance and web servers to resolve crawling problems.
  • A selection of tools that were specifically written to help diagnose problems on the search appliance are available in the Google Search Appliance Admin Toolkit.

Managing multiple appliances

Some tips for managing multiple appliances:

  • Keep the configuration of each appliance in a version control system.
  • You can automate Admin Console tasks with a script. An example is available in the Google Search Appliance Admin Toolkit.
  • Ensure that you have a strategy for updating to new software versions so that you can take advantage of bug fixes as soon as possible. If you run into a bug that you consider a Severity 0 problem in your environment, Google support is most likely to be able to provide a workaround or fix on the most recent software version. To update quickly, you will need a method for acceptance testing of new versions that does not consume too many resources. If you are unable to update quickly, fixing a problem will likely take longer.

Working around search appliance limitations

Because search appliance administrators cannot get shell access to the search appliance, they will not be able to perform the following tasks:

  • Make configuration changes
  • Use diagnostic tools
  • View error logs

This section discusses some specific limitations that administrators may encounter and how they can work around these limitations.

Limitation: Cannot set static routes or modify the MTU on the search appliance.
Workaround: When a specific network configuration change needs to be made on the search appliance, a possible workaround is to place the search appliance behind a device that makes the change for it. For example, if you need to crawl a content server that requires a specific MTU, you can crawl through a proxy that handles the correct MTU.

Limitation: Cannot see performance bottlenecks by monitoring CPU load and disk activity on the search appliance.
Workaround: Because the search appliance does not allow you to monitor CPU load or disk activity, it is difficult to know when you are exceeding its capacity. The best way to ensure that you do not overload the appliance is to use the suggestions in the section on Managing high load.

Limitation: Cannot view the detailed response from the content server to the requests from the crawler on the search appliance.
Workaround: In some cases, an error shown on the Status and Reporting > Crawl Diagnostics page in the Admin Console will not give sufficient detail to troubleshoot the root cause of a crawling problem. In these cases, it is helpful to have access to the content server so that you can look at the error logs or take a packet trace. If it is not possible to get access to the content server, you can crawl through a proxy.

Limitation: Cannot determine if critical processes are failing on the search appliance.
Workaround: The search appliance monitors its internal processes and automatically corrects problems. In rare cases, the internal monitoring will not detect a problem. The best way to catch these problems is to have extensive external monitoring of crawling, indexing, and serving, as described in the section on Setting up monitoring.