
Getting the Most from Your Google Search Appliance

Crawling and Indexing

After the Google Search Appliance has been set up (see Setting Up a Search Appliance), you can configure the search appliance to crawl the content sources that you identified during the planning phase, as described in Planning.

Crawl is the process by which the Google Search Appliance discovers enterprise content and creates a master index. The resulting index consists of all of the words, phrases, and metadata in the crawled documents. When users search for information, their queries are executed against the index rather than the actual documents. Searching against content that is already indexed in the appliance is not interrupted, even as new content continues to be indexed.

The Google Search Appliance can crawl:

  • Public content on web sites and file systems
  • Controlled-access (secure) content

The Google Search Appliance is also capable of indexing:

  • Content in non-web repositories, by using connectors
  • Hard-to-find content, by using feeds
  • Entities in documents

This section briefly describes how the Google Search Appliance indexes each type of content.

Crawling Public Content

Public content is not restricted in any way; users don’t need credentials to view it. Some of the most common forms of public content include:

  • Employee portals
  • Frequently Asked Questions
  • Employee policies
  • Benefits information
  • Product documentation
  • Marketing literature

The Google Search Appliance supports crawling of many file formats, including word processing documents, spreadsheets, presentations, and others.

The Google Search Appliance crawls content on web sites or file systems according to crawl patterns that you specify by using the Admin Console. As the search appliance crawls public content sources, it indexes documents that it finds. To find more documents, the crawler follows links within the documents that it indexes. The search appliance does not crawl content that you exclude from the index.

The following figure provides an overview of crawling public content.


What Content Is Not Crawled?

The Google Search Appliance does not crawl unlinked URLs or links that are embedded within an area tag. Also, the search appliance does not crawl or index content that is excluded by these mechanisms:

  • Do Not Follow Patterns that you specify by using the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console
  • robots.txt file--The Google Search Appliance always obeys the rules in robots.txt (see "Content Prohibited by a robots.txt File" in Administering Crawl), and it is not possible to override this behavior. Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content (see "Identifying the User Agent" in Administering Crawl)
  • nofollow robots META tags that appear in content sources

Typically, webmasters, content owners, and search appliance administrators create robots.txt files and add META tags to documents before a search appliance starts crawling.
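For example, a robots.txt file on a content server might grant the search appliance's crawler access while excluding all other robots. The user agent name shown below (gsa-crawler) is the appliance's default; verify the actual name configured on your appliance, as described in "Identifying the User Agent" in Administering Crawl:

```
# Allow the search appliance to crawl everything except /private/
User-agent: gsa-crawler
Disallow: /private/

# Exclude all other robots
User-agent: *
Disallow: /
```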

Configuring Crawl of Public Content

To configure a search appliance to crawl a content source, you specify top-level URLs and directory addresses and links that the search appliance should follow by using the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. In addition to specifying start URLs, you can also specify URLs that the search appliance should not follow and crawl.

By default, the search appliance crawls in continuous crawl mode. This means that after the Google Search Appliance creates the index, it continuously crawls content sources, looking for new or modified content, and updates the index to ensure that it contains the freshest listings. The search appliance can also crawl content according to a schedule.

Configure continuous crawl by performing the following steps with the Admin Console:

  1. Specifying where to start the crawl by listing top-level URLs and directory addresses in the Start URLs section on the Content Sources > Web Crawl > Start and Block URLs page, shown in the following figure.
  2. Specifying links for the search appliance to follow and index by listing patterns in the Follow Patterns section.
  3. Listing any URLs that you don’t want the search appliance to crawl in the Do Not Follow Patterns section.
  4. Saving the URL patterns.

After you save the URL patterns, the search appliance begins crawling in continuous mode.
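As an illustration, the URL patterns for a small intranet crawl might look like the following sketch. The hostnames and paths are placeholders; see "Constructing URL Patterns" in Administering Crawl for the exact pattern syntax your appliance version supports:

```
# Start URLs: where the crawl begins
http://intranet.example.com/

# Follow Patterns: links matching these prefixes are crawled and indexed
http://intranet.example.com/
http://docs.example.com/products/

# Do Not Follow Patterns: links matching these patterns are excluded
http://intranet.example.com/archive/
.avi$
```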

If you prefer to have the search appliance crawl according to scheduled times, you must also perform the additional following tasks by using the Content Sources > Web Crawl > Crawl Schedule page in the Admin Console:

  1. Selecting scheduled crawl mode.
  2. Creating a crawl schedule.
  3. Saving the crawl schedule.

To schedule crawl times for a specific host, you can change the host load and times on the Content Sources > Web Crawl > Host Load Schedule page. If you set a host load of 0 for a time period, the crawler does not crawl that host during that period.

If you want a document added to the crawl queue right away, enter its URL in Re-Crawl These URL Patterns on the Content Sources > Web Crawl > Freshness Tuning page.

Learn More about Public Crawl

For in-depth information about public crawl, configuring a search appliance to crawl, and starting a crawl, refer to the introduction in Administering Crawl.

For a complete list of file types that the search appliance can crawl, refer to Indexable File Formats.

Crawling and Serving Controlled-Access Content

Controlled-access content is secure content--it is restricted so that not all users have access to it. For access to controlled-access content, users need authorization.

A search appliance discovers and indexes controlled-access content in the same way that it indexes all other content: by performing a crawl through the content sources. However, the search appliance requires access credentials to discover and index controlled-access content. Once you set up the search appliance with access credentials, it maintains a copy of all crawled content in the index.

The following figure provides an overview of crawling controlled-access content.

The following table lists the access-control methods that the search appliance supports and whether the methods are supported for crawl, serve, or both.

Method                                         Crawl   Serve

HTTP Basic                                       X       X
NTLM HTTP                                        X       X
LDAP (Lightweight Directory Access Protocol)             X
Forms Authentication                             X       X
X.509 Certificates                               X       X
Integrated Windows Authentication/Kerberos       X       X
SAML Service Provider Interfaces (SPIs)                  X
Connectors                                       X       X
Access Control Lists (ACLs)                              X


Configuring Crawl of Controlled-Access Content

If the content files you want crawled and indexed are in a location that requires a login, create a special user account on your network for the search appliance. When you configure crawl in the Admin Console, provide the user name and password for that account. The search appliance presents those credentials before crawling files in that location.

Configure a search appliance to crawl controlled-access content by performing the following steps with the Admin Console:

  1. Configuring the crawl as described in Configuring Crawl of Public Content, but also providing the search appliance with URL patterns that match the controlled content.
  2. Specifying access credentials for each URL pattern by using the appropriate Admin Console pages. The means by which you provide these credentials is different for each kind of authentication:
    • For HTTP Basic and NTLM HTTP, use the Content Sources > Web Crawl > Secure Crawl > Crawler Access page
    • For HTTPS web sites, the search appliance uses a serving certificate as a client certificate when crawling. Upload a new certificate by using the Administration > Certificate Authorities page

The following figure shows the Content Sources > Web Crawl > Secure Crawl > Crawler Access page.

Managing Serve of Controlled-Access Content

When a user issues a search request for controlled-access content, the search appliance verifies the user’s identity and determines whether the user is authorized to view the content. This check is performed before the search appliance displays any content in search results. By performing these access-control checks in real time, the Google Search Appliance ensures that users see only results they are authorized to view.

A search appliance can use the following methods to establish the user’s identity:

  • HTML Forms-based Authentication
  • HTTP Basic or NTLM HTTP
  • Client Certificates
  • IWA (Integrated Windows Authentication)/Kerberos authentication against a domain controller
  • The SAML Authentication and Authorization Service Provider Interface (SPI)
  • Connectors
  • LDAP

Once the user’s identity has been established, a search appliance attempts to determine whether the user has access to the secure content that matches their search. The search appliance performs authorization checks by applying flexible authorization rules. You can configure rules for:

  • Cache
  • Connectors
  • Deny
  • Headrequest
  • Policy Access Control List (ACL)
  • SAML
  • Per-URL ACL

The search appliance applies the rules in the order in which they appear in the authorization routing table on the Search > Secure Search > Flexible Authorization page.

If the authorization check is successful, the secure content that matches the search query is included in the user’s search results.

Configuring Serve of Controlled-Access Content

The process for configuring serve of controlled-access content depends on the security method you want to use, as described in the following list:

  • To configure a search appliance to perform forms authentication, use the Search > Secure Search > Universal Login Auth Mechanisms > Cookie page.
  • To configure a search appliance to perform HTTP Basic or NTLM HTTP authentication, use the Search > Secure Search > Universal Login Auth Mechanisms > HTTP page.
  • To configure the search appliance to require X.509 Certificate Authentication for search requests from users, use the Search > Secure Search > Universal Login Auth Mechanisms > Client Certificate page.
  • To enable the search appliance to use IWA/Kerberos authentication during secure serve, use the Search > Secure Search > Universal Login Auth Mechanisms > Kerberos page.
  • To configure the search appliance to use the Authentication SPI, use the Search > Secure Search > Universal Login Auth Mechanisms > SAML page.
  • To configure the search appliance to use connectors, use the Search > Secure Search > Universal Login Auth Mechanisms > Connectors page.
  • To enable the search appliance to authenticate credentials against an LDAP server, use the Search > Secure Search > Universal Login Auth Mechanisms > LDAP page in the Admin Console.
  • To configure the search appliance to use the Authorization SPI, use the Search > Secure Search > Access Control page.
  • To configure flexible authorization rules, use the Search > Secure Search > Flexible Authorization page.
  • To enable a “trusted application,” such as a portal page, to send credentials that it has validated to the search appliance, use the Search > Secure Search > Trusted Applications page.

Learn More about Controlled-Access Content

For complete information about configuring a search appliance to crawl and serve controlled-access content, refer to Managing Search for Controlled-Access Content.

Indexing Content in Non-Web Repositories

If your organization has content that is stored in non-web repositories, such as Enterprise Content Management (ECM) systems, you can enable the Google Search Appliance to index and serve this content by using the connector framework.

The Google Search Appliance provides the indexing capabilities for the following content management systems and sources:

  • Microsoft SharePoint Portal Server
  • Microsoft SharePoint Services
  • EMC Documentum
  • Open Text Livelink Enterprise Server
  • IBM FileNet Content Manager
  • LDAP
  • Lotus Notes
  • File systems
  • Active Directory groups

Also, Google partners have developed connectors for other non-web repositories. For information about these connectors, visit the Google Solutions Marketplace (http://www.google.com/enterprise/marketplace/).

The connector manager is the central part of the connector framework for the Google Search Appliance. It manages the creation, instantiation, scheduling, and monitoring of the connectors that supply content and provide authentication and authorization services to the search appliance. Connectors run on connector managers, which reside in servlet containers installed on computers on your network. All Google-supported connectors are certified on Apache Tomcat.

When connecting to a document repository through an enterprise connector, the Google Search Appliance uses a process called “traversal.” During traversal, the connector issues queries to the repository to retrieve document data to feed to the Google Search Appliance for indexing. The connector manager formats the content and any associated metadata for a feed to the Google Search Appliance, which then creates an index of the documents.
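The traversal loop described above can be sketched as follows. This is an illustrative sketch only, not connector framework code: the Repository class is a hypothetical stand-in for a real ECM client library, and the point is the pattern of querying in batches from a checkpoint that advances as documents are retrieved.

```python
# Sketch of the "traversal" pattern: a connector repeatedly asks the
# repository for documents changed since its last checkpoint and hands
# them off for indexing. Repository is a hypothetical stand-in for a
# real ECM client library.

class Repository:
    """Hypothetical repository client holding (timestamp, doc_id, content) rows."""
    def __init__(self, docs):
        self._docs = docs

    def query_since(self, checkpoint, batch_size):
        # Return up to batch_size documents modified after the checkpoint,
        # oldest first.
        changed = [d for d in self._docs if d[0] > checkpoint]
        return sorted(changed)[:batch_size]

def traverse(repo, checkpoint=0, batch_size=2):
    """Yield batches of changed documents, advancing the checkpoint."""
    while True:
        batch = repo.query_since(checkpoint, batch_size)
        if not batch:
            break  # repository fully traversed
        yield [(doc_id, content) for _, doc_id, content in batch]
        checkpoint = batch[-1][0]  # resume after the newest document seen

repo = Repository([(1, "doc-a", "..."), (2, "doc-b", "..."), (3, "doc-c", "...")])
batches = list(traverse(repo))
# batches → [[("doc-a", "..."), ("doc-b", "...")], [("doc-c", "...")]]
```

In the real framework, each batch would be formatted by the connector manager as a feed and pushed to the search appliance for indexing.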

The following figure provides an overview of indexing content in non-web repositories.

You can also create a custom connector for the Google Search Appliance, as described in Developing Custom Connectors.


Serving Results from a Content Management System

For public content in a repository, searches work the same way as they do with web and file-system content. The Google Search Appliance searches its index and returns relevant result sets to the user without any involvement by the connector.

To authorize access to private or protected content from a repository, the Google Search Appliance creates a connector instance at query time. The connector instance forwards authentication credentials to the repository for authorization checking. The connector manager recognizes identities passed from basic authentication, SAML authentication (see Authentication SPI), and client certificates. If a SAML authentication provider is set up to support single sign-on (SSO), the connector manager also recognizes identities passed from the SSO provider.

Obtaining the Connector Manager and Connectors

To run a connector, you need the software for the connector manager and the connector. The following table lists methods for obtaining the software components that you need to use connectors, as well as the support provided for each component.

  • Component: Source code for the connector manager and connectors
    Obtain by: Downloading the code from the Google Search Appliance Connector Manager project (http://code.google.com/p/googlesearchapplianceconnectors/).
    Support: The open-source software is for the development of third-party connectors. Developers using the resources provided in this project can create connectors for virtually any type of document-based repository. Google does not support the open-source software or changes you make to it.

  • Component: An installer package that deploys Apache Tomcat, a connector manager, and a particular connector type
    Obtain by: Downloading the package from the Google Cloud Support web site.
    Support: Google supports the installer and the software packaged with the installer.

Configuring a Connector

Before you configure a connector, install the following software components:

  • The appropriate Java Development Kit (JDK) for the content management system
  • Apache Tomcat
  • Native client libraries required by the content management system

The specific process that you follow for configuring a connector depends on the type of connector. Generally, you can configure a connector by performing the following steps:

  1. Installing a connector on a host running Apache Tomcat.
  2. Registering a connector manager by using the Content Sources > Connector Managers page in the Admin Console.
  3. Adding a connector by using the Content Sources > Connectors page, shown in the following figure.
  4. Configuring crawl patterns by using the Content Sources > Web Crawl > Start and Block URLs page.
  5. If required by the connector, configuring feeds by using the Content Sources > Feeds page.
  6. If required by the connector, configuring secure crawling of the content management system by using the Admin Console page that is appropriate for the specific connector.
  7. Restarting the connector.
  8. Verifying that the search appliance is indexing URLs from the connector by using the Index > Diagnostics > Index Diagnostics page.

Learn More about Connectors

For in-depth information about connectors, refer to the Google Search Appliance connector documents.


Indexing Hard-to-Find Content


During crawl, the search appliance finds most of the content that it indexes by following links within documents. However, many organizations have content that cannot be found this way because it is not linked from other documents. If your organization has content that cannot be found through links on crawled web pages, you can ensure that the Google Search Appliance indexes it by using Feeds. Feeds are also useful for the following types of content:

  • Documents that should be crawled at specific times that are different from those set in the crawl schedule
  • Documents that could be crawled, but can be uploaded much more quickly by using feeds

You can also use feeds to delete data from the index on the search appliance.

The Google Search Appliance supports two types of feeds, as described in the following list:

  • Web feed--A web feed does not provide content to the Google Search Appliance. Instead, it provides a list of URLs, optionally with metadata. The crawler queues the URLs listed in the web feed and fetches content for each document listed in the feed. Web feeds are incremental. The search appliance recrawls web feeds periodically, based on the crawl settings for your search appliance.
  • Content feed--A content feed provides both URLs and their content to the search appliance, and may include metadata. A content feed can be either full or incremental. The search appliance only crawls content feeds when they are pushed.
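For illustration, a minimal content feed might look like the following fragment. The datasource name, URL, and document content are placeholders; the full feed schema (including the gsafeed DTD and the available record attributes) is defined in the Feeds Protocol Developer's Guide:

```
<?xml version="1.0" encoding="UTF-8"?>
<gsafeed>
  <header>
    <datasource>intranet</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://intranet.example.com/policies/travel.html"
            mimetype="text/html" action="add">
      <content><![CDATA[<html><body>Travel policy ...</body></html>]]></content>
    </record>
  </group>
</gsafeed>
```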

The following figure provides an overview of indexing hard-to-find content by using feeds.


Pushing a Feed to the Search Appliance

To push a content feed to the search appliance, you must provide the following components:

  • Feed--An XML document that tells the search appliance about the contents that you want to push
  • Feed client--An application or web page that pushes the feed to a feeder process on the search appliance

You can use one of the feed clients described in the Feeds Protocol Developer’s Guide or write your own. For information about writing a feed client, refer to Writing Applications with the Feeds Protocol.

URL Patterns and Trusted IP lists that you define with the Admin Console ensure that your index only lists content from desirable sources. When pushing URLs with a feed, you must verify that the Admin Console will accept the feed and allow your content through to the index. For a feed to succeed, it must be fed from a trusted IP address and at least one URL in the feed must pass the rules defined in the Admin Console.

Push a content feed to the search appliance by performing the following steps:

  1. Adding the URL for the document defined in the Feed Client to crawl patterns by using the Content Sources > Web Crawl > Start and Block URLs page. URLs specified in the feed will only be crawled if they pass through the patterns specified on this page.
  2. Configuring the search appliance to accept the feed by using the Content Sources > Feeds page, shown in the following figure. To prevent unauthorized additions to your index, feeds are only accepted from machines that are specified on this page.
  3. Running the feed client script.
  4. Monitoring the feed by using the Admin Console.
  5. Checking for search results from the feed within 30 minutes of running the feed client script.
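The feed client in the steps above can be a simple HTTP POST. The following sketch makes some assumptions worth stating: the appliance hostname and datasource name are placeholders, and the feeder process is assumed to listen on port 19900 at /xmlfeed and to read the form fields "datasource", "feedtype", and "data", as described in the Feeds Protocol Developer's Guide. Consult that guide for the authoritative client requirements.

```python
# Minimal feed-client sketch. Hostname and datasource are placeholders;
# the port, path, and form-field names follow the Feeds Protocol
# Developer's Guide.
import urllib.request

def build_feed_request(appliance_host, datasource, feedtype, feed_xml):
    """Build the multipart/form-data POST that pushes a feed."""
    boundary = "----gsa-feed-client"
    parts = []
    for name, value in (("datasource", datasource),
                        ("feedtype", feedtype),
                        ("data", feed_xml)):
        parts.append("--%s\r\n"
                     "Content-Disposition: form-data; name=\"%s\"\r\n\r\n"
                     "%s\r\n" % (boundary, name, value))
    parts.append("--%s--\r\n" % boundary)
    body = "".join(parts).encode("utf-8")
    return urllib.request.Request(
        "http://%s:19900/xmlfeed" % appliance_host,
        data=body,
        headers={"Content-Type":
                 "multipart/form-data; boundary=%s" % boundary})

req = build_feed_request("search.example.com", "intranet",
                         "incremental", "<gsafeed>...</gsafeed>")
# urllib.request.urlopen(req) would push the feed; a success response
# means the feed was accepted, not that every URL passed the patterns.
```

Remember that the request must originate from a trusted IP address listed on the Content Sources > Feeds page, or the appliance rejects the feed.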

Learn More about Feeds

For complete documentation on feeds, refer to the Feeds Protocol Developer’s Guide.


Indexing Entities


The Google Search Appliance can discover interesting entities in documents with missing or poor metadata and store these entities in the search index. Once the entities are indexed, you can enhance keyword search by adding the entities to dynamic navigation.

To specify the entities that you want the search appliance to discover in your documents, use the Index > Entity Recognition page.

Learn More about Entity Recognition

For a comprehensive description of the Entity Recognition feature, click Admin Console Help > Index > Entity Recognition.


Testing Indexed Content


Once the content has been crawled and indexed, you can ensure that it is searchable by using the Test Center. The Test Center enables you to test search across the indexed content, limit it to specific collections (see Segmenting the Index) or specific front ends (see Using Front Ends), and verify that the correct content is indexed and that the results are what you expect.

You can find a link to the Test Center at the upper right side of the Admin Console. When you click the Test Center link, a new browser window opens and displays the Test Center page, as shown in the following figure.

