Deployment Scenario Handbook

Basic Search on a Public Website

Scenario overview


Acme Inc. is a large multinational producer of consumer electronics with a large external web presence. Its web content includes general corporate information as well as specific marketing material for each of its product units. The company also runs support forums for its products. In this scenario, Acme Inc. wants to use the Google Search Appliance (GSA) to drive a general search box that searches across all content, as well as specific search boxes for each product unit. All of Acme Inc.’s external web properties are public; no content is restricted to specific users or groups.

Requirements


  • Index all web-exposed, public content.
  • Provide a general search box, which would return results across all indexed content, including content across all product units.
  • Provide specific search boxes, which would return results specific to a particular product unit.
  • Style the search box and result page according to Acme Inc. corporate branding standards.
  • Index the Acme Inc. support forum every hour, because its content changes rapidly and new content is added frequently.
  • Handle 20 queries per second at peak load time with high availability in case of a GSA issue/outage.
  • Do not mix content from different languages in search results.

Assumptions


  • There are distinct pages for Acme Inc.’s web content in each language.
  • There are existing web properties, on which search boxes will be placed.

Key considerations


  • Ensure there is enough capacity to handle 20 queries per second at peak load times.
  • Decide whether to present results directly from the search appliance or by means of a web application presentation layer.
  • Use reporting or analytics to gauge user interaction with search materials.

Recommended approach


Google’s recommended approach for implementing basic search on a public website covers the following areas:

Deployment architecture

To account for load and failover, Acme Inc. will use a total of three GSAs in a production configuration. Two of the three will run as active-active search appliances to provide adequate capacity. The third search appliance will be used as a hot backup for failover.

Acme Inc. will configure all three GSAs for mirroring, with one acting as the primary search appliance, on which all configuration changes should be made. To achieve a highly available, active-active configuration, they will deploy a load balancer in front of the GSAs. The load balancer will serve both of the following functions:

  • Actively distribute search query traffic evenly across the two active-active GSAs.
  • Ping the two active-active GSAs, failing over to the hot backup unit if one of the active units does not return the expected response.
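The two load balancer functions above can be sketched as a small routing rule. This is only an illustration of the logic, not a configuration for any particular load balancer product; the host names and the `is_healthy` callback are hypothetical stand-ins for the real appliances and the real health check.

```python
from typing import Callable, Sequence


def route(query_index: int,
          active: Sequence[str],
          backup: str,
          is_healthy: Callable[[str], bool]) -> str:
    """Pick the appliance that should serve the query with this index.

    Queries alternate evenly across the active units (round robin).
    If the unit whose turn it is fails its health check, the hot
    backup takes that unit's share of the traffic.
    """
    candidate = active[query_index % len(active)]
    return candidate if is_healthy(candidate) else backup


# Hypothetical host names, for illustration only.
ACTIVE = ["gsa1.internal.example", "gsa2.internal.example"]
BACKUP = "gsa3.internal.example"
```

With both active units healthy, successive queries alternate between the first and second appliance. If the second unit stops answering the health check, its half of the traffic moves to the backup while the first unit keeps serving its own share.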

Because the GSA is being deployed in an existing web application, Google recommends processing requests to and responses from the GSA through a web application presentation layer. In this case, the GSA is used as a service: the web application layer sends queries to the GSA and parses the resulting XML response, formatting the page according to marketing and branding guidelines. Because the search page won’t be served directly from the GSA, the GSA should not be exposed to the public; it should sit behind network firewalls on Acme Inc.’s network.
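As a rough sketch of what the presentation layer does, the following Python builds a search request and extracts the fields needed to render a results page. The host name, collection name, and front end name are hypothetical. The request parameters (`q`, `site`, `client`, `output`) and the result elements (`RES`, `R`, `U`, `T`, `S`) follow the GSA search protocol as commonly documented, but the exact set should be checked against the Search Protocol Reference and the Results DTD for the deployed software version.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, collection: str,
                     front_end: str = "default_frontend") -> str:
    """Build the URL the web application requests from the GSA."""
    params = {
        "q": query,              # the user's search terms
        "site": collection,      # collection(s) to search
        "client": front_end,     # front end whose settings apply
        "output": "xml_no_dtd",  # ask for raw XML results
    }
    return f"{base_url}/search?{urlencode(params)}"


def parse_results(xml_text: str) -> list:
    """Extract URL, title, and snippet from each result in the response.

    Titles and snippets may carry escaped highlighting markup; the
    presentation layer decides how to render or strip it.
    """
    root = ET.fromstring(xml_text)
    return [
        {
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        }
        for r in root.iterfind("./RES/R")
    ]
```

The web application would fetch the URL from inside the firewall, pass the response body to `parse_results`, and render the resulting list with its own branded templates. A response with no `RES` element yields an empty list, which the page can show as "no results".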

Crawl and index configuration

Acme Inc. will configure collections for each language set of the web properties. In this way, the site parameter can be used to distinguish queries meant for a particular language, depending on which page the user initiated the search from. Because each product unit wants to have search over its own documents only, Acme Inc. will also configure collections for each product unit.

Acme Inc. will configure start URLs for top-level pages. For content that changes frequently, they can use crawler frequency settings to make sure the content gets crawled at least once a day. For tighter control over specific crawl times, such as the hourly refresh required for the support forum, they can use the Admin API or a web feed to put specific pages into the crawl queue at multiple points during the day.
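To illustrate the web feed option, the sketch below builds a feed document listing URLs to recrawl. The forum URLs are hypothetical, and the element names and values (`gsafeed`, `datasource` set to `web`, `feedtype`, `record`) are recalled from the GSA Feeds Protocol and should be verified against the Feeds Protocol Developer's Guide before use. Submitting the feed to the appliance and scheduling the job (for example, hourly) are left out.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_web_feed(urls: Iterable[str],
                   feedtype: str = "metadata-and-url") -> str:
    """Return a feed document asking the GSA to queue the given URLs.

    The datasource and feedtype values are assumptions to confirm
    against the Feeds Protocol documentation.
    """
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = "web"
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(feed, "group")
    for url in urls:
        # One record per page that should enter the crawl queue.
        ET.SubElement(group, "record", url=url, mimetype="text/html")
    return ET.tostring(feed, encoding="unicode")


# Hypothetical forum pages, for illustration only.
FORUM_URLS = [
    "http://forums.acme.example/latest",
    "http://forums.acme.example/unanswered",
]
```

Running `build_web_feed(FORUM_URLS)` from a job that fires every hour would be one way to meet the hourly indexing requirement for the support forum, under the assumptions stated above.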

Front end configuration

Each search box deployed on the Acme Inc. web properties will have a set of query parameters tied to it. These parameters will be sent down with the query to the GSA to shape the type of results that appear in the search results page.

For example, a search box deployed on an English product page should pass the collections for English and for that specific product unit. Consult the Results DTD to see which XML elements the GSA uses to return each piece of information. The front end should parse these elements and display them on the page accordingly.
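One way to tie parameters to each search box is a small helper that derives them from the page's language and product unit. The collection naming scheme (`lang_en`, `unit_tv`) and the front end name are hypothetical. Combining collections with `.` in the `site` parameter to require membership in both is how the search protocol is generally described; treat that operator as an assumption to confirm in the Search Protocol Reference.

```python
from typing import Optional


def search_box_params(language: str,
                      product_unit: Optional[str] = None,
                      front_end: str = "acme_frontend") -> dict:
    """Return the fixed query parameters for one search box.

    A general search box restricts results by language only; a product
    search box additionally restricts them to that unit's collection.
    """
    site = f"lang_{language}"
    if product_unit is not None:
        # "." is assumed to mean "in both collections" (logical AND).
        site = f"{site}.unit_{product_unit}"
    return {"site": site, "client": front_end, "output": "xml_no_dtd"}
```

For instance, the general box on an English page would use `search_box_params("en")`, while the box on the English television product page would use `search_box_params("en", "tv")`. The user's query text is then added as the `q` parameter at request time.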

Administrative items

Acme Inc. will use the Advanced Search Reporting feature to create reports about what users are searching for and what they are clicking in search results pages. These reports should be generated and analyzed frequently, as they are a good indicator of general search satisfaction.

Alternative approach


Instead of using a web application layer in front of the GSA, Acme Inc. could expose search on the GSA directly, customizing the stylesheet for a front end accordingly. Although this approach is more difficult to customize fully, it might require less development effort and make it easier to take advantage of new, out-of-the-box front end features that become available on the GSA.

With this approach, make sure there isn’t any secure content marked as “public” in the GSA index, because users will be able to run queries directly on the search appliance. A reverse proxy can be used to restrict access to the GSA by whitelisting the URL patterns that can be submitted.

To keep the number of collections defined on the GSA at a reasonable level (below 200), an alternative to separate collections for product units is to index a specific metadata value for each product unit along with the content. This metadata value would then be applied as a filter on queries for a given product unit, so that the GSA retrieves only content applicable to that unit.
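The metadata alternative can be sketched as a filter added to the request parameters. The meta tag name `product_unit` is hypothetical, and the `requiredfields` parameter with a `name:value` pair is recalled from the GSA search protocol; its exact syntax and encoding rules should be checked in the Search Protocol Reference.

```python
def with_product_unit_filter(params: dict, unit: str,
                             meta_name: str = "product_unit") -> dict:
    """Return a copy of params restricted to one product unit.

    Only documents indexed with the matching metadata value are
    expected to be returned; the language collection stays in "site".
    """
    filtered = dict(params)  # leave the caller's dict untouched
    filtered["requiredfields"] = f"{meta_name}:{unit}"
    return filtered
```

With this variant, a product search box keeps only the language collection in `site` and calls, for example, `with_product_unit_filter({"q": "remote", "site": "lang_en"}, "tv")`, so no per-unit collections need to be defined.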

Project task overview


The following table lists the project tasks and activities for implementing basic search on a public website.

Task                            Activities
Plan deployment architecture      • Configure appliances and set up mirroring
                                  • Configure load balancer in front of the GSAs
                                  • Set up perimeter security around GSAs
Configure crawl and index         • Configure collections identified for languages and product units
                                  • Identify frequently changing content and ensure it gets indexed one or more times a day
Configure front end               • Parse response XML from GSA and display results in accordance with company UI guidelines

Long term enhancements


  • Tweak search and features based on reports showing user search patterns.
  • Identify content for KeyMatches.
  • Enable more complex synonym lists.
  • Enable dynamic navigation for metadata-driven facet navigation.