Google Search Appliance software versions 7.0 and later
Connector Manager version 3.0.4 and later
- New Features
- Upgrading From Previous Versions
- Getting Started
- Configuring a Connector on the Admin Console
- Security Concepts, Tips, and Settings
- Understanding Serving
- Understanding the Lister/Retriever Model
- Troubleshooting the Google Search Appliance Connector for File Systems
- Uninstalling Connectors and Connector Managers
This document is for File Systems administrators and administrators who install and configure the Google Search Appliance. If you are not familiar with the system that the connector will traverse and index, work closely with your system administrators to determine the correct values for installing and configuring the connector.
The Google Search Appliance (GSA) Connector for File Systems release 3 is a major re-engineering of the File System Connector implementation. The connector no longer uses the Diffing Connector model. It now uses a Lister/Retriever model. In the past, the connector compared what had changed between crawls to gather new information for indexing. Now, the connector sends links (lists) to the GSA, which the GSA can then crawl using standard HTTP. The most significant improvements are:
- Faster feed and indexing performance.
- Support for ACL inheritance and ACL deny access for SMB file systems (with GSA 7.0 and higher).
- Coordination with the Traversal Schedule.
- Support for HTTPS requests from the Google Search Appliance.
- Support for policy ACLs.
- Administrative access to the GSA that the Connector is feeding.
- Access to the computer running the Connector and its Tomcat application server. You will need sufficient access rights to start and stop Tomcat and modify files in its deployment directory.
- A binary distribution of the Connector Manager version 3 release, available on the Connector Manager Downloads page.
- A binary distribution of the File System Connector version 3 release, available on the File System Connector Downloads page.
- Familiarity with the command line environment of the deployment computer (cmd.exe on Windows or a shell environment on Unix/Linux).
- Knowledge of the appropriate rules for using late binding, if you wish to use late binding. The new file system connector doesn't use googleconnector:// URLs. This means that it isn't matched in the default Connector Flexible Authorization rule.
For detailed instructions, please see Instructions for manually upgrading to File System Connector version 3 in Release Notes.
- SMB 1.0 for Windows
- CIFS, which is a variety of SMB, on Windows
- Windows file shares with offboard servers
- DFS with stand-alone root and domain root
- Windows Clusters on Microsoft Cluster Service using the Shared Nothing model. The File Connector is not supported with the Shared Disk and Mirrored Disk models.
- Samba on UNIX and Linux
- Files and folders
- General properties
- Document-level ALLOW ACLs
- Document-level DENY ACLs (only with GSA 7.0 and higher)
- User name and password credentials
- Directory level ACLs
- Policy ACLs
- NFS v2 and v3, but the support does not include ACLs.
- Support for ACL inheritance and deny for SMB file systems (only with GSA 7.0 and higher)
- SMB 2.0, 2.2, or 3.0
- NFS 4
- Custom properties (extended attributes)
- Windows Workgroups
- Kerberos or X.509 certificates
For the on-board connector, ensure that the GSA can connect to the file share using the following ports. For an off-board connector, ensure that the host of the Connector Manager can connect to your file share through these ports. These settings apply to Windows file shares.
- On Windows, you must be an administrator.
- On Linux, you must have sufficient rights to execute the installer file. You can be a root or nonroot user.
On the File Share Server, if you are the user whose credentials are entered on the Google Search Appliance for performing the crawl, you must also be a member of one of the following groups on the file share server. Otherwise, the search appliance cannot extract ACLs from the documents:
- Windows file shares must be accessible directly using the CIFS protocol.
- Linux file shares must be accessible using the SMB protocol using Samba or other daemons.
The file connector indexes documents that match the included directory and file patterns that you enter on the Admin Console on the connector configuration form. Before you configure the file connector, determine which directories and files you want indexed and which you want excluded using the connector’s settings for “Include Patterns” and “Exclude Patterns”. Editing these settings is explained in Getting Started. This section contains tips about what to include or exclude. Configuring these settings wisely will cause the connector to index only those files that matter to searchers, enhancing performance, improving result relevance, and reducing troubleshooting efforts.
If you want to index only a few document types (for example, MS Office documents or PDFs), “Include Patterns” is the easiest way to configure these settings. However, “Exclude Patterns” tends to be simpler to express, as shown in these examples:
"contains:/." excludes files or directories whose first character is "." (hidden files in Unix parlance).
"~$" excludes files that end in the "~" character, because these are backup files created by certain text editors.
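The interaction between include and exclude patterns can be sketched in a few lines of Python. This is a simplified illustration, not the connector's own matcher: the functions `matches` and `should_index` and the sample `smb://` paths are hypothetical, and the sketch assumes that a file is indexed only when it matches at least one include pattern and no exclude pattern. Escape handling inside patterns is not modeled.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Approximate one pattern against a path (illustrative only)."""
    if pattern.startswith("contains:"):
        # Substring match anywhere in the path.
        return pattern[len("contains:"):] in path
    if pattern.startswith("regexpIgnoreCase:"):
        body = pattern[len("regexpIgnoreCase:"):]
        return re.search(body, path, re.IGNORECASE) is not None
    if pattern.startswith("regexp:"):
        return re.search(pattern[len("regexp:"):], path) is not None
    if pattern.endswith("$"):
        # A bare pattern ending in "$" is treated here as a suffix match.
        return path.endswith(pattern[:-1])
    return pattern in path


def should_index(path: str, include: list, exclude: list) -> bool:
    """Assumed rule: at least one include matches and no exclude matches."""
    included = any(matches(p, path) for p in include)
    excluded = any(matches(p, path) for p in exclude)
    return included and not excluded


include_patterns = ["regexp:.*"]
exclude_patterns = ["contains:/.", "~$"]

for candidate in (
    "smb://fs.example.com/share/report.doc",
    "smb://fs.example.com/share/.profile",
    "smb://fs.example.com/share/notes.txt~",
):
    print(candidate, should_index(candidate, include_patterns, exclude_patterns))
```

Running candidate paths through a sketch like this before saving the configuration can help catch an exclude pattern that is broader than intended.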
But more important than deciding which file types to include or exclude is deciding which directories to include or exclude. At times, customers see log files full of warnings about traversing inappropriate content - either the traversal user does not have permission to access large bodies of content, or there are obvious directories that should not be included. For example:
Use the Add Connector page in the Google Search Appliance Admin Console to create and configure a file connector instance. The Add Connector page prompts you to enter values for all required configuration parameters.
- On the Google Search Appliance Admin Console, click Connector Administration > Connectors.
- Click Add New Connector.
- On the Type drop-down list, select File Connector Type.
- Click Get Configuration Form.
- In the Start paths field, type the root directories from which you want traversal to start. You must specify the host name as a fully qualified domain name (FQDN). For example:
- To add more rows, click Add another row.
- In the Include patterns field, type any file patterns you want included in the traversal. For example, entering the pattern regexp:.* instructs the connector to traverse all documents under the Start path. Please see Determining what to Index for guidelines and tips.
- To add more rows, click Add another row.
- In the optional Exclude patterns field, type any file patterns you want excluded from traversal. For example, entering the pattern regexpIgnoreCase:\\.ppt$ instructs the connector to ignore all PowerPoint presentation documents.
- To add more rows, click Add another row.
- In the optional Domain field, type in the domain to be traversed.
- In the Username field, type the user name of a user who has access to the file system.
- In the Password field, type the password for the file system user.
- In Full Traversal interval (days) enter the number of days appropriate for the amount and file types of the content. Refer to Full and Incremental Traversals for details.
- In the Advanced properties selection, choose Show to fine tune security and related settings.
To learn about these parameters, see Configuring ACE Formats using the Advanced Configuration properties.
- Enter values for Global and Local Namespace.
- In the Traversal Rate section, type the number of documents per minute that you want traversed.
- In the Retry Delay section, specify the number of minutes to wait between the time when a traversal is completed and the time when the next traversal starts.
- To disable traversal, check Disable traversal.
- In the Connector Schedule section, indicate the hours between which you want the repository traversed.
Note that a connector scheduled to run from 12 a.m. to 12 a.m. always runs. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance’s standard crawl function.
- Click Add Line to Schedule for each additional traversal period you want to schedule.
- Click Save Configuration.
Clicking Save Configuration runs a connectivity test. If there are any errors, you will see them displayed on the Admin Console. Correct the errors and click Save Configuration again. If the connector is configured correctly, the new connector name will appear on the Connectors list.
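Because the Start paths field requires a fully qualified host name, it can be useful to check candidate start paths before entering them. The following is a minimal sketch under stated assumptions: start paths are assumed to be URL-style strings such as `smb://host/share/`, the helper name `host_is_fully_qualified` is hypothetical, and "fully qualified" is approximated as "the host name contains at least one dot and is not a bare IP address".

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_fully_qualified(start_path: str) -> bool:
    """Return True if the host part of a URL-style start path looks like an FQDN."""
    host = urlsplit(start_path).hostname
    if not host:
        return False
    try:
        # A literal IP address is not a fully qualified host name.
        ipaddress.ip_address(host)
        return False
    except ValueError:
        pass
    labels = host.split(".")
    # Require at least two non-empty labels, for example "fs" and "example.com".
    return len(labels) >= 2 and all(labels)


for path in ("smb://fs.example.com/share/", "smb://fileserver/share/"):
    print(path, host_is_fully_qualified(path))
```

The check is only a heuristic: it does not confirm that the name resolves in DNS, which the connectivity test run by Save Configuration still has to establish.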
Access Control Lists (ACLs) may be inherited from a parent folder or file share. This reduces the number of files/directories that must be re-indexed as a result of an ACL change to a folder far up the directory hierarchy.
The default ACL format changed between file system connector version 2.x and version 3.0. The new default user ACL format is domain\user. The new default group ACL format is domain\group. These new defaults are the most appropriate formats to use with GSA version 7.0 and later. The previous ACL formats, user and group (without the domain), are unlikely to work with GSA version 7.0. If the authentication mechanism for this connector returns a domain element, then the user and group ACLs must also include the domain.
If you previously configured the user and group ACL formats explicitly in the Advanced Configuration, that explicit configuration is still honored (although you may wish to change it if you are upgrading the GSA as well as the connector).
If you have not explicitly configured the user and group ACL formats, the new default format will be used. If, when using file system connector version 3.0, you wish to use the unadorned “user” and “group” formats, you must explicitly enable those formats in the Advanced Configuration.
- Locate the aceSecurityLevel parameter that you want to change.
- Uncomment the parameter if it is commented out.
- Change the value of the parameters as desired.
To use late binding, configure a flexible authorization rule for the connector's URLs and route to it using Connector Authorization. If you have multiple File System connectors, you must add multiple flex authorization rules with a specific URL pattern.
Please see the Google Search Appliance Connector documentation page for updated details.
As the Microsoft knowledge base explains, the size of an Access Control List (ACL) depends on the number and size of its Access Control Entries (ACEs). The maximum size of an ACL is 64K, which is about 1,820 ACEs. If this limit is approached, the File System Connector reports "failure to traverse" warning messages and cannot feed to the GSA.
The Google Search Appliance Connector for File System 3.0 is aware of the Traversal Schedule, including scheduled traversal intervals, Retry Delay, and run-once traversals (Retry Delay of -1). The Retry Delay governs the delay between traversals of the repository.
Previous versions of this connector ignored the Traversal Schedule, replicating certain functionality with advanced configuration options. Since the connector is now aware of the Traversal Schedule, the following advanced configurations have been deprecated and will be ignored if set:
Here are some important tips and concepts about full and incremental traversals. The Full Traversal Interval governs the interval between full, rather than incremental, traversals of the repository. Most traversals are incremental traversals and look only for items that have changed since the last traversal, based upon the file's last-modified timestamp.
- Some traversals are full traversals, which feed all appropriate contents of the repository to the Search Appliance.
- You may wish to configure the forced Full Traversal Interval for the needs of your organization. File and directory adds and copies and changes to file contents are detected during the connector's incremental traversals. However, moved or renamed files and changes to ACLs and other metadata may only be detected during full traversals.
- Frequent full traversals may overwhelm the Search Appliance, bogging down its feed processing. Long full traversal intervals increase the time it takes for the Search Appliance to notice certain types of changes.
- A full traversal may also be triggered manually at any time by resetting the connector in the Search Appliance Admin Console.
- A full traversal is automatically triggered if you change the connector's configuration or schedule, or restart the Tomcat web application server.
- If you have a large number of files in your repository (more than 1 million), the default Retry Delay and Full Traversal Interval values are likely too small. Consider Retry Delay values measured in hours (for example, 4 hours = 240 minutes).
- Consider a Full Traversal Interval that is at least 2 days for each million documents fed.
- At this time, the connector ignores the Traversal Rate configuration.
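The sizing guidance above (at least 2 days of Full Traversal Interval for each million documents fed) is simple arithmetic, sketched below. The function name is hypothetical, and the one-day floor for small repositories is an assumption of this sketch rather than a documented minimum.

```python
import math


def min_full_traversal_interval_days(document_count: int) -> int:
    """Rule of thumb: at least 2 days per million documents fed."""
    days = math.ceil(2 * document_count / 1_000_000)
    # Assumed floor of one day for small repositories.
    return max(1, days)


print(min_full_traversal_interval_days(3_000_000))  # 6
print(min_full_traversal_interval_days(500_000))    # 1
```

Treat the result as a lower bound to start from, then adjust it after observing feed processing on the search appliance.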
Typically, each instance of the Connector for File Systems can traverse hundreds of thousands of documents per day. However, if you have a large number of PowerPoint documents, the traversal rate is much lower. Google recommends that you set up multiple File System connector instances to ensure a high traversal speed during the discovery phase of the traversal. This is easier to configure if your content repository can be subdivided so that there are multiple start paths.
When you schedule connector instances, the performance of the repository is a significant consideration. Depending on the number of traversals and the size of the documents retrieved for indexing, the use of connectors may degrade repository performance. Monitoring and performance-tuning the repository server is especially important when you deploy a new connector or document repository.
Note that a connector scheduled to run from 12 a.m. to 12 a.m. never stops. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance’s standard crawl function.
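The schedule rule in this note can be expressed as a small predicate. This is an illustrative sketch only: the function name and the representation of times as whole hours from 0 (12 a.m.) to 23 are assumptions, and windows that cross midnight are not modeled.

```python
def window_is_active(start_hour: int, end_hour: int, current_hour: int) -> bool:
    """Decide whether a traversal window covers the current hour.

    Hours are integers where 0 means 12 a.m. (hypothetical representation).
    """
    if start_hour == end_hour:
        # 12 a.m. to 12 a.m. never stops; any other identical pair never runs.
        return start_hour == 0
    return start_hour <= current_hour < end_hour


print(window_is_active(0, 0, 15))   # True: 12 a.m. to 12 a.m.
print(window_is_active(8, 8, 8))    # False: same start and end time
print(window_is_active(8, 20, 12))  # True: noon falls inside 8 a.m. to 8 p.m.
```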
- When to run the traversal process
- How long to run the traversal process
A connector instance cannot self-modify its traversal schedule. Therefore, you must monitor the performance of both the Google Search Appliance and the content management system regularly, and make manual adjustments to the traversal schedules of connectors to optimize performance. You can tune scheduling for optimal performance in these ways:
- Create a schedule that minimizes the number of concurrent traversal processes that are running.
- Restrict the times at which those processes run. For example, if the content management system is executing a resource-intensive job, the connector might run slowly. Schedule the connector to run at times when demand on the content management system is light.
The connector manager interrupts a connector that takes too long to process a batch of documents. The default duration after which the connector manager interrupts the connector is 1800 seconds, or 30 minutes. The duration is set by the value of the traversal.time.limit property in the applicationContext.properties file. If you want a shorter duration, you can change the value of this property:
- Stop Apache Tomcat.
- Open the applicationContext.properties file in a text editor. The top of the file contains comments with explanatory text. Do not uncomment any of the explanatory text, including the commented example for this property.
- Examine the file to see whether there is a
- Save the file.
- Restart Tomcat.
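Before and after editing the file, it can help to confirm which value is actually in effect. The sketch below reads properties-style text and reports the active traversal.time.limit, ignoring commented lines. It assumes the common `key=value` properties syntax with `#` or `!` comment lines and falls back to the documented default of 1800 seconds; the function name and the sample file contents are hypothetical.

```python
DEFAULT_TIME_LIMIT_SECONDS = 1800  # documented default (30 minutes)


def effective_time_limit(properties_text: str) -> int:
    """Return the active traversal.time.limit value, in seconds."""
    value = DEFAULT_TIME_LIMIT_SECONDS
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments, including commented-out examples.
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, separator, raw_value = line.partition("=")
        if separator and key.strip() == "traversal.time.limit":
            value = int(raw_value.strip())  # a later entry overrides an earlier one
    return value


sample = """\
# Explanatory comment: traversal.time.limit=1800
traversal.time.limit=900
"""
print(effective_time_limit(sample))  # 900
```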
The search appliance Admin Console enables you to modify the connector retry delay, which is the time period that elapses between when one traversal is completed and the next starts. For example, you might want the connector to traverse the repository every hour between 8 a.m. and 8 p.m. or every two hours from midnight to 9 a.m.
The Retry Delay determines how long the connector waits after completing a traversal before starting a new traversal. The Full Traversal interval (a separate configuration) determines whether the next traversal will be a Full Traversal or an Incremental traversal.
- When you manually edit the connector’s properties file or one of the configuration files (connectorInstance.xml). Alternatively, for edits to the connectorInstance.xml file only, you can apply the changes on the Admin Console, without restarting the connector service. Click the Edit link for the connector instance, then click Save Configuration.
- When you install a connector or connector manager JAR file.
To locate particular information or documents in the repository, a user opens a browser window and navigates to a search page. The search page can be the default search page available on the Google Search Appliance or it can be a customized search page. The user types a search term in the search box and clicks Search.
When the Google Search Appliance finds all the documents that match the search request, it presents the user with a pop-up window and asks for the user’s user name and password. The connector manager passes the search results and the user credentials to the repository server. The repository server authenticates the user, evaluates the permissions for each document returned by the user’s search, determines which documents the user is authorized to view, and returns that information to the connector manager.
The Google Search Appliance displays a results page listing the documents the user is authorized to view. When the user clicks a link on the results page, a web client window opens in which the user can view the document or its metadata, depending on how the connector is configured. If the user does not have an open session to the repository, the web client asks for the user’s login credentials before displaying the document.
The Google Search Appliance (GSA) Connector for File System 3.0 uses a Lister/Retriever model to feed documents to the GSA. Rather than pushing a Content feed to the GSA, the Lister pushes a Metadata-and-URL feed, where the URL (referred to as the Content URL) points back to the connector's document content Retriever.
- Since the document content is no longer contained within the feed itself, the feeds are much smaller.
- The responsibility for document change detection moves from the connector to the search appliance, which uses the HTTP “If-Modified-Since” request header to conditionally retrieve changed content. This change simplifies the connector implementation considerably, removing its dependence on the “Diffing Connector” infrastructure.
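The conditional retrieval described above follows standard HTTP semantics: the server answers 304 (Not Modified) when the document has not changed since the date in the request header, and 200 with the content otherwise. The sketch below illustrates that decision using only the Python standard library. It is not the connector's Retriever code; the function name `conditional_status` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 if the document is unchanged since the header date, else 200."""
    if if_modified_since is None:
        return 200  # unconditional request: send the content
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the content
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


modified = datetime(2013, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(modified, usegmt=True)
print(conditional_status(modified, header))                       # 304
print(conditional_status(modified + timedelta(hours=1), header))  # 200
```

With this model, unchanged files cost the search appliance one small request and response each, which is why recrawls are cheaper than re-feeding content.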
- Verbose connector logs. See Logging for information on changing the default logging level. If you are reporting a problem to Support, it is ideal if you can reproduce the problem with the logging level set to ALL. However, log files with entries made when the problem occurred are also helpful.
- Connector configuration files.
- Feed record and metadata log file. See Logging Feed Record and Metadata Information to a Text File for information on generating this log file.
Logging is a useful technique for recording information about how your installation is operating. You can use the information logged for troubleshooting the operations of the connector, the Google Search
When using Connector Manager 3.0, you can adjust the logging level on the Admin Console by using the Connector Log Level settings. However, this change affects only the currently running process and reverts to the default when the connector manager is restarted.
The output from the FileHandler appears in the connectors_root_dir/connector_name/Tomcat/logs directory. The output is written to the google-connectors.sequence.log file, where sequence is a series of numbers starting with 0 and incremented by 1 on each occurrence (0, 1, 2, 3...n). The first three log file names would be google-connectors.0.log, google-connectors.1.log, and google-connectors.2.log.
If the Apache Tomcat instance where the Connector Manager is installed is not started or if the location you type in is incorrect or invalid, a message is displayed on the Connector Manager Administration page of the Admin Console saying “The appliance could not connect to the connector manager as specified in the location. Make sure that the URL is correct, or try again later.”
This means that the CATALINA_HOME environment variable is not set correctly on the Tomcat host. Examine the Tomcat startup script or .bashrc and ensure that CATALINA_HOME points to the correct Tomcat installation.
When a connector manager is unavailable, the Admin Console displays a circular red indicator next to the connector manager name. If you try to add a connector to an unavailable connector manager, you see the following error message:
- Tomcat is not running on the registered host and port.
- The connector manager host is unreachable.
- The Tomcat Remote Address Filter is rejecting access.
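The first two causes can be checked from another machine with a plain TCP connection test, sketched below. The function name is hypothetical, and the host and port you pass must be the ones registered for your connector manager. Note that a successful TCP connection does not rule out the third cause: a Remote Address Filter rejects requests at the HTTP level, after the connection is established.

```python
import socket


def tomcat_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures, timeouts.
        return False
```

For example, `tomcat_port_reachable("connectors.example.com", 8080)` (a placeholder host and port) returns False if Tomcat is stopped or the host cannot be reached.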
You delete a connector instance only on the Admin Console of the Google Search Appliance. When you delete the instance, you delete the configuration information for the instance. The connector manager no longer creates and runs the instance.
Each connector instance is listed on the Admin Console in the Connector Administration > Connectors section. The indicator light is either green or red. Green indicates the existence of the connector instance.
- Log in to the Admin Console as an administrator.
- Click Connector Administration > Connectors.
- Click the Edit link for the correct connector.
- Click the Delete link on the line for the correct connector instance.
- Click OK.
Before you unregister a connector manager, you must delete all connector instances associated with that connector manager. If you have a large number of connector instances, you can first stop the Tomcat instance where the connector manager is running, then unregister the connector manager.
- Log in to the Admin Console as an administrator.
- Click Connector Administration > Connector Managers.
- Locate the connector manager you want to delete.
- Click the Unregister link on the line for the correct connector manager.
- Click OK.
- On Windows, click Start > All Programs > Google Search Appliance Connector version_number > Uninstall.
- On Linux, click the appropriate shortcut.