Configuring the Connector for SharePoint with Content Feeds (Deprecated)

Connector software version 3.0
Connector Manager version 3.0
Installer version 3.0



This document contains the information you need to install the Google Search Appliance Connector for SharePoint and configure the Google Search Appliance and the connector to traverse, index, and search content in a SharePoint content repository.

This document is for SharePoint administrators and administrators who install and configure the Google Search Appliance. If you are not familiar with the system that the connector will traverse and index, work closely with your system administrators to determine the correct values for installing and configuring the connector.

Use this document in conjunction with the following related documents:

Find connector documentation, including books referred to above, at Connector documentation. Google Search Appliance documentation can be found in the Google Search Appliance Help Center product documentation.


Introducing the Google Search Appliance Connector for SharePoint

The Google Search Appliance Connector for SharePoint enables the Google Search Appliance to traverse documents and attachments on SharePoint sites. This document provides installation and configuration instructions for deploying the SharePoint connector with a content feed, which can be used only with Microsoft SharePoint 2007 and 2010, and Microsoft SharePoint Foundation 2010 and above.

The Google Search Appliance can crawl SharePoint sites directly as a web site, but not all content from SharePoint sites can be crawled directly. SharePoint uses JavaScript extensively to page links for a list, display pages of an item, handle subsites, and perform other tasks. SharePoint also uses JavaScript for actions such as deleting or changing items. To keep your content safe, the Google Search Appliance does not execute JavaScript in a web page.

If you plan to configure the SharePoint connector with a metadata-and-URL feed, read Configuring the Connector for SharePoint with Metadata-and-URL Feeds in Connector documentation.

For a general overview of how the connector manager and connectors work, see Introducing the Connectors in Connector documentation.


Preinstalled SharePoint Connector

Google Search Appliance software version 7.0 and later includes a connector manager and SharePoint connector installed on the search appliance itself. You can use the preinstalled connector manager and SharePoint connector or create an installation on a freestanding host. If you use the preinstalled software, you do not need to configure a connector manager or connector on a standalone host. Note, however, that the preinstalled connector is limited to crawling and indexing approximately 500,000 documents.

If you are using the preinstalled connector, read the sections of this document on Preparing Microsoft SharePoint Server for the Connector, Traversing Multiple Site Collections with Google Services for SharePoint, Installing Google Services for SharePoint, and Preparing the Google Search Appliance for the Connector. You do not need to read the sections on supported operating systems, Java versions, and Apache Tomcat versions.

The section Installing the Google Search Appliance Connector for SharePoint contains instructions for installing on a standalone host. If you are using the preinstalled connector, skip that section and read Configuring the Preinstalled Connector Manager and SharePoint Connector instead, as well as the balance of this document.


Components in the Google Search Appliance Connector Installation

A typical connector installation consists of these components:

  • What you are searching--a content repository. This consists of the content management system server, the content files, and the supporting database in which metadata is stored, if any. A Google Search Appliance can index multiple repositories.
  • A connector for each repository you index.
  • The content management system web client, installed on any platform supported by the content management system
  • The content management system’s native API, which is typically installed on the connector manager host
  • Any other supporting software components of the content management system
  • An LDAP server or other external mechanism used for user authentication
  • Java Development Kit (JDK) or Java Runtime Environment version 1.6
  • Google Search Appliance Connector installation, which consists of Apache Tomcat, the connector manager, and the connector for your content management system. These components are installed using a Google-provided installer
  • A Google Search Appliance

If you are using the preinstalled connector manager and SharePoint connector, you do not need a stand-alone host for the connector manager and connector or any of the components installed on that host.


Supported SharePoint Versions

This connector is supported on the following SharePoint versions:

  • Microsoft SharePoint Server 2010
  • Microsoft SharePoint Foundation 2010
  • Microsoft Office SharePoint Server 2007 (MOSS 2007)

Supported Operating Systems

The connector manager and Google Search Appliance Connector for SharePoint are supported on the following 32-bit operating systems:

  • Microsoft Windows Server 2008 SP2 Standard Edition
  • Red Hat Linux 5

The connector manager and Google Search Appliance Connector for SharePoint are supported on the following 64-bit operating systems:

  • Microsoft Windows Server 2008 SP2 Standard Edition
  • Microsoft Windows Server 2008 R2 Enterprise x64 Edition

The connector manager and the Google Search Appliance Connector for SharePoint are supported in virtualization environments. Google does not provide support for specific virtualization environments or for issues that are specific to virtualization.


Supported Authentication Mechanisms

The Google Search Appliance Connector for SharePoint supports the following user authentication mechanisms:

  • Basic authentication
  • NTLM authentication
  • Kerberos authentication

Supported Java Version

The Google Search Appliance Connector for SharePoint requires Java Runtime Environment (JRE) 6 or 7. JRE 5 is no longer supported.


Apache Tomcat Version

The installer installs a connector manager, a connector type, and Apache Tomcat 6.0.18. Tomcat 5.5.23 is also supported for this release of the connector and connector manager.


Before You Install the Connector

Before installing the connector using the installer, ensure that Java Runtime Environment (JRE) version 6 or 7 is installed on the host where you are installing the connector.


Port Requirements

The following three port requirements must be met by both content feed connectors and metadata-and-URL feed connectors:

  1. The Google Search Appliance must be able to connect to the connector manager. If you are running an external connector manager, ensure that the correct port is open on your network for inbound traffic. The default connector manager port is 8080.
  2. The connector manager must also be able to connect to the correct ports for traversing any SharePoint WebApps that you plan to index. Ensure that those ports are open.
  3. Ensure that the connector manager is able to connect to the search appliance on port 19900, which is used for feeds.

In addition, if you are running a metadata-and-URL feed connector, ensure that your firewall permits the search appliance to connect to the SharePoint Server on all ports on which the SharePoint WebApps that will be indexed are running.
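
The port requirements above can be spot-checked before installation. The following is a minimal sketch and is not part of the connector software: the helper name can_connect and the host names in the usage comments are assumptions made for illustration. Run it on the host that originates each connection (for example, on the connector manager host to test the feed port on the search appliance).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage on the connector manager host:
#   can_connect("gsa.example.com", 19900)      # feed port on the search appliance
#   can_connect("sharepoint.example.com", 80)  # a SharePoint web application port
```

A successful TCP connection shows only that the port is reachable. It does not confirm that the service behind the port is configured correctly.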


Preparing Microsoft SharePoint Server for the Connector

This section contains instructions for configuring SharePoint for the connector.


Configuring SharePoint 2007 and 2010 to Use Fully-Qualified Domain Names

This section applies only to Microsoft Office SharePoint Server 2007 or 2010 and Microsoft Windows SharePoint Services 3.0.

The Google Search Appliance can display cached copies of documents only if URLs contain fully qualified host names. By default, Microsoft SharePoint Server 2007 and 2010 use short names for user access to SharePoint sites. If your SharePoint sites are configured with short names, URLs are sent to the Google Search Appliance with short names and the search appliance is unable to display cached copies. This section tells you how to configure MOSS to use fully-qualified host names.

In addition, SharePoint bulk authorization requires you to use fully-qualified domain names.

If you do not configure SharePoint 2007 or 2010 to use fully-qualified domain names, create a front end that does not include a link for cached documents.

To configure SharePoint sites to use fully-qualified domain names:

  1. Open the MOSS Central Administration tool from the Start menu.
  2. Navigate to Central Administration > Operations > Alternate Access Mappings. The Alternate Access Mappings dialog box displays several internal URLs for the SharePoint site and the admin site. The default settings are short URLs. If you type a fully qualified host name in the browser bar, you are redirected to a short name. For example, if you type http://moss_host1.yourdomain.com/, you are redirected to http://moss_host1/Default.aspx.
  3. Click a shortened URL.
  4. Edit the URL so that it is a fully-qualified domain name. For example, change http://moss_host1/ to http://moss_host1.yourdomain.com/.
  5. Click OK.
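
To audit a list of site URLs for short names before or after making this change, a small check such as the following can help. This is an illustrative sketch only: the function name is an assumption, and the test it applies (the host contains at least one dot and is not an IP address) is a simple heuristic rather than a rule taken from SharePoint or the search appliance.

```python
import ipaddress
from urllib.parse import urlsplit


def uses_fully_qualified_host(url):
    """Return True if the URL's host looks like a dotted DNS name, not a short name."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return False  # an IP address is not a fully-qualified domain name
    except ValueError:
        pass
    return "." in host.strip(".")


print(uses_fully_qualified_host("http://moss_host1.yourdomain.com/"))  # True
print(uses_fully_qualified_host("http://moss_host1/Default.aspx"))     # False
```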

Required User Credentials for Connector Traversal and Indexing

The SharePoint connector and the Google Search Appliance require user credentials for traversal and indexing. Google recommends that you use a single user account for both. The connector instance uses the credentials to obtain document content from SharePoint. You provide the credentials on the connector configuration page.

You must provide a domain name or host name in order to enable Windows NT LAN Manager (NTLM) or HTTP Basic authentication. However, for HTTP Basic authentication, if the user belongs to the local machine on which SharePoint is installed, the user does not have to be a domain user.

Microsoft Office SharePoint Server 2007 or 2010 and Microsoft Windows SharePoint Services 3.0

When you configure the SharePoint connector to index content from Microsoft Office SharePoint Server 2007, Windows SharePoint Services 3.0, SharePoint Server 2010, SharePoint Foundation 2010, or SharePoint Server 2013, Google recommends that you use a user account that has "Full Read" permissions at the web application policy level. Configuring the connector with an account that has permissions other than "Full Read" might impair connector functionality.

To index unpublished or expired content from SharePoint, configure the connector to use a user account with "Full Control" permission at the web application policy level, or add the user as a "Site Collection Administrator" for each site collection that you want to index. Alternatively, you can use one of the following permission levels at the individual site collection level:

  • Full control
  • Design
  • Manage hierarchy
  • Approve
  • Contribute

The permissions above are listed in decreasing order of privileges. For the connector, a user account with Contribute permissions is sufficient.

Permissions Needed for User Profile Search

In order to use the new User Profile search, the SharePoint traversal user ID must be given admin and Full Control rights to the User Profile Service Application. Please refer to Restrict or enable access to a service application (SharePoint 2013) for more information about these rights.

If you have a web application that is configured for NTLM claims authentication, be sure to remove anonymous access from the service application at the web application level. This is required by SharePoint, because additional security must be provided to access the UserProfileService in order to index SharePoint Profiles.


Traversing Multiple Site Collections with Google Services for SharePoint

The default web services provided with SharePoint are limited to a single site collection. Clients communicating with a site collection are unable to discover information about other site collections. Google Services for SharePoint provides custom web services which, when deployed on the SharePoint server, enable the discovery of all the sites in the SharePoint installation. Google Services for SharePoint also enables the connector to crawl all the sites in the farm. You can specify any site URL as the Crawl URL from which Google Services for SharePoint can be accessed. At present, this functionality is supported only for MOSS 2007 or 2010 and WSS 3.0. Lastly, if you are integrating the Google Search Box for SharePoint on a SharePoint site that is protected by NTLM, you must install Google Services for SharePoint and the SAML Bridge (Windows Integrated Authentication).

Google Services for SharePoint is a requirement for using content feeds. You must install Google Services for SharePoint before you install a connector that uses content feeds.

Where you deploy Google Services for SharePoint depends on how you have the SharePoint installation configured.

  • If you have a single SharePoint server, deploy Google Services for SharePoint only on that SharePoint server. The Crawl URL for the connector should be for a site collection or site which is hosted on the SharePoint server where Google Services is deployed.
  • If you have a load-balanced farm configuration with multiple web front ends and the connector goes through a virtual host configured through the load balancer to communicate with the web front ends, install Google Services on all web front ends.
  • If you have a load-balanced farm configuration with multiple web front ends and the connector communicates with a dedicated web front end, install Google Services on the dedicated web front end.
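
The three deployment cases above can be summarized as a small decision function. This is only a sketch of the rules as stated in this section: the function name, parameter names, and host names are hypothetical.

```python
def google_services_hosts(front_ends, via_load_balancer, dedicated_front_end=None):
    """Return the SharePoint hosts on which to deploy Google Services for SharePoint.

    front_ends          -- list of web front end host names in the installation
    via_load_balancer   -- True if the connector reaches the farm through a
                           virtual host configured on the load balancer
    dedicated_front_end -- the single front end the connector talks to, if any
    """
    if len(front_ends) == 1:
        return list(front_ends)          # single SharePoint server
    if via_load_balancer:
        return list(front_ends)          # install on all web front ends
    if dedicated_front_end is None:
        raise ValueError("name the dedicated web front end that the connector uses")
    return [dedicated_front_end]         # install only on the dedicated front end


print(google_services_hosts(["wfe1", "wfe2"], via_load_balancer=True))
print(google_services_hosts(["wfe1", "wfe2"], via_load_balancer=False,
                            dedicated_front_end="wfe2"))
```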

During the process of discovering all the sites in the SharePoint installation, the SharePoint connector does not take into account URL mappings defined under Central Administration > Operations > Alternate Access Mappings on SharePoint. This means that the connector discovers only the original default URLs of sites and not any external URLs that might have been mapped to the original default URL.

Take the above into account when you configure the included and excluded patterns on the Connector Configuration form if you expect some sites to be discovered by Google Services for SharePoint. Use SharePoint Site Alias Mapping (see SharePoint Site Alias Mapping) so that these sites are indexed using the mapped URL name and not the URL used by the connector while crawling.


Installing Google Services for SharePoint

Google Services for SharePoint is available only if you are running Microsoft Office SharePoint Server 2007 or 2010 (MOSS 2007 or 2010) or Microsoft Windows SharePoint Services 3.0 (WSS 3.0). Install Google Services for SharePoint on all SharePoint hosts that will be traversed by the Google Connector for SharePoint. Ensure that the Microsoft .NET Framework version 2.0 is installed on the hosts.

Google Services for SharePoint is available in separate installers for 32-bit and 64-bit environments. Google Services for SharePoint is packaged in the Google Search Appliance Resource Kit for SharePoint installer, which you obtain from the connector download site. All connector software is linked from that page.

During the installation process, you must provide an unused port number between 1024 and 65535. This port is used by Windows Integrated Authentication (SAML Bridge). Google Services for SharePoint uses the same port as your SharePoint Site.

The Google Services for SharePoint URL is constructed from the Start URL for SharePoint. You do not need to provide a Google Services URL.

Google Services for SharePoint contains the following components:

  • The bulk authorization web service, which is used to perform search-time bulk authorization for the SharePoint documents fed to the GSA.
    • GSBulkAuthorization.asmx
    • GSBulkAuthorizationdisco.aspx
    • GSBulkAuthorizationwsdl.aspx
  • The site discovery web service, which is used to get all the top level URLs of all the site collections for a given SharePoint installation.
    • GSSiteDiscovery.asmx
    • GSSiteDiscoverydisco.aspx
    • GSSiteDiscoverywsdl.aspx
  • The ACL web service, which is used to get all of the Access Control Lists associated with each SharePoint URL being crawled.
    • GssAcl.asmx
    • GssAcldisco.aspx
    • GssAclwsdl.aspx
  • The check connectivity application, which is required to check the connectivity of the SharePoint web services for Google Connectors.
    • Verify Installation.exe
    • Verify Installation.exe.config
    • Verify Installation.InstallState

To install Google Services for SharePoint:

  1. Log in to the SharePoint host as a user with sufficient privileges to install software on the host.
  2. If a previous version of Google Services for SharePoint is installed, uninstall it.
  3. Use a browser to navigate to the connector download site on github.com and download the Google Search Appliance Resource Kit for SharePoint installer to the SharePoint host.
  4. Navigate to the location where you saved the installer file.
  5. Double-click the Google Search Appliance Resource Kit for SharePoint_x86(64).msi file.
  6. Click Next. The License Agreement screen is displayed.
  7. Accept the License Agreement and click Next. The Setup Type screen is displayed.
  8. Select Custom and click Next. The Custom Setup screen is displayed.
  9. Select the components that you want to install and click Next.
  10. Click Install. The Port Number Configuration screen is displayed. This is the port number for the SAML Bridge.
  11. Enter the port number and click OK. The port must be an unused port between 1024 and 65535.
  12. Click Next. The installer installs the software.
  13. Read the optional verification dialog box, which you can use to ensure that the web services are installed correctly and reachable.
    The dialog box contains the following parameters:
    • Local SharePoint Web Site URL: The SharePoint web site URL for sites hosted on the machine where the custom web services are installed. Enter a valid SharePoint web site URL.
    • Username: The name of the SharePoint user who has access to the web site, for example, Administrator.
    • Password: The password for the SharePoint user.
    • Domain: The domain of the user.
  14. Click Close.

Verifying Connectivity and the Installation

You can check connectivity and verify the installation for Google Services for SharePoint after the installation is complete. There are three increasingly complex ways to do this.

First, click Start > Programs > Google Services for SharePoint and click Verify Installation.exe.

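Next, if the connectivity test fails, navigate to the following URLs directly from your browser. In place of FQDN/localhost/IP_address, use the SharePoint host's fully-qualified domain name, localhost, or the host's IP address: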

http://FQDN/localhost/IP_address:port_number/_vti_bin/Lists.asmx
http://FQDN/localhost/IP_address:port_number/_vti_bin/GSBulkAuthorization.asmx

The URLs return a list of functions supported by the web services.
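
If you test several hosts, it can be convenient to generate the two URLs rather than type them. The following sketch simply assembles the URLs shown above from a host and a port. The function name and the example host are assumptions, and the host can be a fully-qualified domain name, localhost, or an IP address.

```python
def verification_urls(host, port):
    """Build the two web service URLs used to check connectivity in a browser."""
    base = f"http://{host}:{port}/_vti_bin"
    return [
        f"{base}/Lists.asmx",                # a default SharePoint web service
        f"{base}/GSBulkAuthorization.asmx",  # installed by Google Services for SharePoint
    ]


for url in verification_urls("moss_host1.yourdomain.com", 80):
    print(url)
```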

Lastly, you can use tools such as SOAP-UI to debug the Google Services for SharePoint web services when the Verify Installation utility fails.

To use SOAP-UI to debug the web services:

  1. Open SOAP-UI.
  2. Create a new project.
  3. When you are asked for a new web service endpoint or WSDL URL, provide a URL with the following format:

    http://SharePoint_Site_URL/_vti_bin/GSBulkAuthorization.asmx?WSDL

    Before you continue, ensure that SharePoint_Site_URL opens the site's HTML user interface as the user sees it, and that the complete URL returns a WSDL (XML file) in the browser.

  4. If required, enter the appropriate credentials. SOAP-UI creates the project with sample web server requests.
  5. In the top left project explorer tree, expand the checkConnectivity node.
  6. Open the sample checkConnectivity request.
  7. Enter the Domain, Username, and Password.
  8. Click Play, which sends the request.
  9. Examine the web services response in the right-side panel.
  10. Check whether the response contains any HTTP or server-side errors.
  11. If there are no errors, try to configure the connector.

Solving Connectivity Warnings

You might see the following warning in the log:

Jul 15, 2010 3:36:46 PM
com.google.enterprise.connector.sharepoint.wsclient.GSBulkAuthorizationWS checkConnectivity WARNING: Can not connect to GSBulkAuthorization web service.

The log entry indicates that the web service GSBulkAuthorization is unreachable. Use the connectivity tests described in Verifying Connectivity and the Installation to determine the source of the problem and correct it.
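
To find this warning in a large log file, you can search for the message text. The sketch below is an illustration only: it assumes that the warning text appears on a single line, as in the entry shown above, and the function name and sample lines are hypothetical.

```python
WARNING_TEXT = "Can not connect to GSBulkAuthorization web service"


def find_connectivity_warnings(lines):
    """Return (line_number, text) pairs for lines that contain the warning."""
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if WARNING_TEXT in line
    ]


sample = [
    "INFO: Traversal started.",
    "WARNING: Can not connect to GSBulkAuthorization web service.",
]
for number, text in find_connectivity_warnings(sample):
    print(number, text)
```

To scan a real log, pass an open file object instead of the sample list, because iterating over a file yields its lines.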


Preparing the Google Search Appliance for the Connector

Use the instructions in the following sections to set up the Google Search Appliance for the SharePoint connector.


Configuring Crawl and Feeds for the Connector

When you use the SharePoint connector with a content feed, you must make an addition to the Follow and Crawl URLs defined in the Admin Console. The Google Search Appliance rejects content in the repository without the addition. You must make a similar addition on the Crawler Access page.

In the URL patterns shown in this section, replace connector_name with the actual name of the connector instance that sends the documents. During search, the search appliance forwards requests for the authorization of search results to the connector instance whose name is embedded in the document URL. If no running connector has that name, the search appliance drops those search results.

To configure crawl URLs for the connector:

  1. On the Admin Console, navigate to the Crawl and Index > Crawl URLs page.

    Admin Console Crawl URLs page

  2. In the Follow and Only Crawl URLs with the Following Patterns box, add the following statement:

    ^googleconnector:

  3. Save the configuration.

The Google Search Appliance must be configured to accept feeds from the connector host.

To enable feeds:

  1. Click Crawl and Index > Feeds.
  2. In the List of Trusted IP Addresses section, select Trust feeds from all IP addresses or Only trust feeds from these IP addresses.
  3. If you selected Only trust feeds from these IP addresses in step 2, type in the trusted IP addresses of the connector host.
  4. Click Save Settings.
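
Before saving a restricted list, you may want to confirm that the connector host's address is actually in it. The following sketch compares exact addresses only. The function name and the addresses are hypothetical, and the sketch makes no claim about how the search appliance itself parses the trusted IP address field.

```python
import ipaddress


def is_trusted_feed_source(connector_ip, trusted_ips):
    """Return True if connector_ip equals one of the addresses in trusted_ips."""
    source = ipaddress.ip_address(connector_ip)
    return any(source == ipaddress.ip_address(ip.strip()) for ip in trusted_ips)


print(is_trusted_feed_source("192.0.2.10", ["192.0.2.10", "192.0.2.11"]))  # True
print(is_trusted_feed_source("192.0.2.99", ["192.0.2.10", "192.0.2.11"]))  # False
```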

Making Content Public

If you want to make the content located at a particular URL public, follow these instructions.

To make content public:

  1. Navigate to the Crawl and Index > Crawler Access page.

    Admin Console Crawler Access page

  2. Type the following statement in the For URLs Matching Pattern field for each of your connectors, where connector_name is the name of the particular connector:

    ^googleconnector://connector_name.localhost/

  3. Type in the User Name and Password required for accessing the URLs.
  4. Confirm the Password.
  5. To make the content for a particular URL pattern public, check Make Public.
  6. To add an additional URL pattern, click Add More Rows and repeat steps 2 through 5.
  7. Click Save Crawler Access Configuration.
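
When you manage several connectors, generating the per-connector pattern avoids typing mistakes. The sketch below builds the pattern shown in step 2 and checks a URL against it. It assumes that a leading ^ means "the URL begins with", and the connector name and document URLs are hypothetical examples.

```python
def crawler_access_pattern(connector_name):
    """Build the Crawler Access URL pattern for one connector instance."""
    return f"^googleconnector://{connector_name}.localhost/"


def matches_prefix_pattern(url, pattern):
    """Check a URL against a pattern, treating a leading ^ as 'begins with'."""
    if not pattern.startswith("^"):
        raise ValueError("this sketch handles only patterns that start with ^")
    return url.startswith(pattern[1:])


pattern = crawler_access_pattern("sp_docs")
print(pattern)  # ^googleconnector://sp_docs.localhost/
print(matches_prefix_pattern("googleconnector://sp_docs.localhost/doc?docid=1", pattern))  # True
print(matches_prefix_pattern("googleconnector://other.localhost/doc?docid=1", pattern))    # False
```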

Configuring Authorization

To use a content feed, configure the search appliance so that the connector itself performs authorization. Ensure that you select the By Connector option under Authorization Handling on the connector configuration page.

Authorization by connector option


Configuring the Preinstalled Connector Manager and SharePoint Connector

Google Search Appliance software versions 7.0 and later include a connector manager and SharePoint connector installed on the search appliance. This section provides instructions for configuring the preinstalled connector manager and SharePoint connector. If you are installing the connector and connector manager on a stand-alone host, use the instructions in Installing the Google Search Appliance Connector for SharePoint.

The preinstalled SharePoint connector is subject to some limitations:

  • The preinstalled connector supports up to 500,000 documents only. Do not attempt to crawl and index more than 500,000 documents with the preinstalled connector.
  • You cannot include or exclude certain metadata because this is configurable only for connectors that are installed on a separate host. The Google Search Appliance does not have a user interface for making these changes.
  • You cannot change logging levels.
  • You cannot enable feed logs.

If you need these features, use an external connector installed on a separate host.

The Google Search Appliance software version determines which SharePoint connector version is installed on the search appliance.

To export the logs and configuration file for the preinstalled connector manager and SharePoint connector, ensure that your firewall does not block port 7843 on the search appliance.

To configure the preinstalled connector manager and SharePoint connector:

  1. Log in to the Google Search Appliance using an account with administrator privileges.
  2. Click Connector Managers.
  3. Click ConnectorManager0.

    This is the preinstalled connector manager.

  4. Click Add New Connector.
  5. Type in a name for the new connector.

    The connector name must be all lower case. The connector name can have a maximum of 64 alphanumeric characters, and can include underscores (_) and hyphens (-). The name cannot begin with a number or a hyphen.

  6. Click Get Configuration Form.

    The connector configuration form is displayed.

  7. Use the table in Configuring a Connector Instance to complete the configuration form for the new connector.
  8. Type in a new Traversal Rate value or accept the default value of 200 documents per minute.
  9. Type in a new Retry Delay value, in minutes, or accept the default value of 5 minutes.
  10. To disable connector traversal of the SharePoint repository, check the Disable Traversal checkbox.
  11. Set the times when you want the connector to traverse the SharePoint repository.

    Note that a connector scheduled to run from 12 a.m. to 12 a.m. always runs. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance’s standard crawl function.

  12. Click Save Configuration.
  13. Proceed to the section Verifying That the Connector is Working.
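
The naming rule in step 5 and the scheduling note in step 11 can be checked before you submit the form. The following is a sketch under stated assumptions, not the validation that the Admin Console performs: it treats the 64-character limit as the total length of the name, represents schedule times as hours from 0 to 23 with 0 meaning 12 a.m., and uses hypothetical function names.

```python
import re

# Lower-case letters, digits, underscores, and hyphens; at most 64 characters;
# the first character cannot be a digit or a hyphen.
_NAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,63}")


def is_valid_connector_name(name):
    """Return True if name satisfies the connector naming rule in step 5."""
    return _NAME_RE.fullmatch(name) is not None


def schedule_ever_runs(start_hour, end_hour):
    """Return False for a schedule that never runs.

    A schedule from 12 a.m. to 12 a.m. (0 to 0) always runs. Any other
    schedule whose start and end times are equal never runs.
    """
    if start_hour == end_hour:
        return start_hour == 0
    return True


print(is_valid_connector_name("sp_docs"))  # True
print(is_valid_connector_name("1docs"))    # False
print(schedule_ever_runs(0, 0))            # True
print(schedule_ever_runs(9, 9))            # False
```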

Installing the Google Search Appliance Connector for SharePoint

This section describes the installation process for the Google Search Appliance Connector for SharePoint. You install the connector using an installer that installs Apache Tomcat, a connector manager, and the connector on a host computer.

If you are using the SharePoint connector and the connector manager that are preinstalled on the Google Search Appliance, skip these instructions. If you are installing the connector and connector manager on a free-standing host, that host is not required to be the SharePoint Server host.

The instructions that follow are in two parts. In the first part, you download and uncompress the installer package. In the second, you install the software on the connector host.

To download and uncompress the installation package:

  1. Log in to the host using an account with sufficient privileges to install the software.
  2. Start a web browser.
  3. Navigate to the connector download site on github.com.
  4. Download the correct software distribution package to the host where you are installing the software.
  5. Uncompress the package.
  6. If you are on Windows, skip step 7 and go to the instructions immediately below for installing Tomcat, a connector manager, and the connector.
  7. If you are on Linux, follow these instructions.
    1. Open a terminal window and go to the base directory of the GCI.bin file in the extracted folder.
    2. To run the installer in graphical mode, execute the following command:

      ./GCI.bin LAX_VM path_to_java

      For example: ./GCI.bin LAX_VM /usr/java/j2sdk1.5.2_x/bin/java

    3. To run the installer in console mode, execute the command in step 2 above with the -i console argument appended.
    4. Go to the following instructions and proceed from Step 2.
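
If you script the Linux installation, the two command forms described above can be assembled as argument lists. This is a minimal sketch: the function name is an assumption, the Java path is the example path from this section, and the sketch only builds the command without running it.

```python
def installer_command(java_path, console=False, installer="./GCI.bin"):
    """Build the installer command line as a list of arguments.

    java_path -- path to the java executable that the installer should use
    console   -- append "-i console" to run in console mode instead of graphical mode
    """
    command = [installer, "LAX_VM", java_path]
    if console:
        command += ["-i", "console"]
    return command


print(" ".join(installer_command("/usr/java/j2sdk1.5.2_x/bin/java")))
print(" ".join(installer_command("/usr/java/j2sdk1.5.2_x/bin/java", console=True)))
```

The returned list can be passed to subprocess.run from the directory that contains GCI.bin.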

To install Apache Tomcat, a connector manager, and the Google Search Appliance Connector for SharePoint:

  1. Double-click the distribution file to start the installer.

    You will see an introductory panel.

  2. Click Next.

    The License Agreement panel appears.

  3. Indicate whether you accept or decline the terms of the license and click Next:
    • To accept the license, click I accept the terms of the License Agreement.
    • To decline the terms, click I do NOT accept the terms of the License Agreement.
  4. On the Select Connector panel, select the correct connector and click Next.
  5. On the Install Connector panel, choose Install new Google Connector and click Next.
  6. On the Connector Configuration panel, enter the name you want to assign the connector and a port number that is not already used by another application.

    If you are creating multiple installations of the connector, ensure that you do not use consecutive port numbers. Each connector installation requires two consecutive port numbers for use by Tomcat. For example, if ConnectorInstall1 is installed on port 8080, do not use port 8081 for ConnectorInstall2. In addition, do not use the AJP Connector port (port 8009) listed in the Tomcat server.xml file. In installations where SSL is supported, do not use the SSL port.

  7. Enter the Google Search Appliance IP Address, which is the IP address to which the connector sends feeds.

    Entering the address ensures that only the search appliance can communicate with the connector manager.

  8. If you do not want the connector service to start automatically, uncheck the Start SharePoint connector Service after Installation check box.
  9. If you do not want to register the connector manager on the search appliance during this installation process, uncheck the Register Connector Manager with GSA checkbox.
  10. Click Next.
  11. On the Choose Java Runtime Environment panel, choose the correct JRE for the connector to use, or click Search for Others if the correct JRE is not in the list.
  12. Click Next.
  13. On the Choose Install Folder panel, click Next to accept the default location or click Choose to navigate to a different folder, then click Next.

    The default location is the installation folder chosen in the previous step.

  14. On the Choose Shortcut Folder panel, indicate where you want icons created for the connector and click Next.
  15. Read the information on the Pre-Install\Update Summary panel and click Install.

    An informational panel indicates that the connector installation is in progress. The Register Connector Manager on the GSA panel is displayed.

  16. Type the search appliance administrator user name in the GSA UserID field.
  17. Type the password for the administrator in the GSA Password field.
  18. Type the search appliance port number in the GSA Port field.
  19. Type in the Connector Manager Name and Description.
  20. Click Next.

    The installer indicates whether the installation process succeeded or failed and displays information about connector manager connectivity status, the connector manager URL, search appliance status, and the search appliance display URL.

  21. Click Done.
  22. To start the connector service, click Yes.

    Apache Tomcat starts and deploys the connector manager and connector.

  23. If the Start SharePoint connector Service after Installation check box was left unchecked, start the connector service:
    • On Windows, click Start > Programs > Googleconnectors > connector_name > Start SharePoint connector Service.
    • On Linux, to start the connector as a console, open a terminal window and navigate to the installation location. Use the following command:

      ./Start_SharePoint_Connector_Console

  24. If you did not register the connector manager from the connector installer, continue with the instructions in Registering a Connector Manager on the Admin Console. If you registered the connector manager from the connector installer, continue with the instructions in this document for Configuring a Connector on the Admin Console.
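
The port guidance in step 6 lends itself to a quick check when you plan several installations on one host. The sketch below assumes that each installation occupies the port you enter plus the next consecutive port, which matches the 8080 and 8081 example in step 6. The function name is hypothetical. The AJP port 8009 is reserved by default, and you can add your SSL port to the reserved tuple.

```python
def port_conflicts(base_ports, reserved=(8009,)):
    """Return a list of problems found in the planned connector ports.

    base_ports -- the port entered for each connector installation
    reserved   -- ports that must not be used (AJP port by default)
    """
    problems = []
    owner = {}  # maps each claimed port to the installation's base port
    for base in base_ports:
        for port in (base, base + 1):
            if port in reserved:
                problems.append(f"{base}: port {port} is reserved")
            elif port in owner:
                problems.append(f"{base}: port {port} is already used by the installation on {owner[port]}")
            else:
                owner[port] = base
    return problems


print(port_conflicts([8080, 8081]))  # one conflict on port 8081
print(port_conflicts([8080, 8090]))  # []
```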

Registering a Connector Manager on the Admin Console

This section describes how to register a connector manager on the Admin Console.

If you registered the connector manager from the connector installer during the installation process, skip this section. If you are using the SharePoint connector and connector manager that are preinstalled on the search appliance, skip this section.

To register a connector manager on the Admin Console:

  1. Use a browser to log in as an administrator to the Admin Console on the target Google Search Appliance.
  2. Click Connector Administration > Connector Managers. If any connector managers are configured, a list of existing connector managers is displayed.
  3. In the Manager Name field, type a name to identify the new connector manager on the Admin Console.
  4. In the Description field, type a description of the new connector manager.
  5. In the Service URL field, type the URL to the Tomcat instance where the connector manager is running.

    This is the root access URL for the connector manager. Ensure that the location you enter is a fully-qualified host name or an IP address. For example, use http://example.com:8080/connector-manager, not http://example:8080/connector-manager.

    If the Service URL contains a host name ending in .local or .domain, you see the error Invalid connector manager URL. Use the IP address of the host instead.

    For example, if the connector manager is located in the $CATALINA_HOME/webapps/connector-manager/ directory of a Tomcat server running on the host example.com, its location is

    http://example.com:8080/connector-manager

    The following values are used in this example:

    • http://example.com—The host name of the computer on which Tomcat runs. This must be a fully-qualified domain name.
    • 8080—The default http port on which Tomcat serves web applications. The value is configurable. See the Apache Tomcat documentation for further information on changing the value.
    • /connector-manager—The name or context of the web application.

    If access from the Google Search Appliance to Apache Tomcat is through a proxy server, the URL in the Service URL field must include the proxy redirect. For example:

    http://proxy.myexample.com:81/tomcat/connector-manager

  6. Click Save. The Admin Console displays a message saying New Connector Manager successfully added. The new connector manager appears in the list of connector managers. If the connector manager is running and Google Search Appliance can connect to it, a green dot appears in the Status column next to its name.
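The Service URL rules in step 5 can be checked before you type the value into the Admin Console. The following is a minimal sketch in Python, not part of the connector or the Admin Console; the function name service_url_problems is hypothetical, and the checks cover only the rules stated above (a fully-qualified host name or an IP address, and no host name ending in .local or .domain).

```python
import ipaddress
from urllib.parse import urlsplit


def service_url_problems(url):
    """Return a list of problems found in a connector manager Service URL."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return []  # an IP address is acceptable as-is
    except ValueError:
        pass  # not an IP address, so treat it as a host name
    problems = []
    if "." not in host:
        problems.append("host name is not fully qualified")
    if host.endswith((".local", ".domain")):
        problems.append("host name ends in .local or .domain; use the IP address")
    return problems


print(service_url_problems("http://example.com:8080/connector-manager"))
print(service_url_problems("http://example:8080/connector-manager"))
```

The first call reports no problems; the second reports that the host name is not fully qualified.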

Configuring the Connector on the Admin Console

This section describes tasks you must perform on the Google Search Appliance Admin Console to configure the connector and the crawl patterns required by the connector.


Configuring a Connector Instance

You can define a connector instance for each SharePoint subsite, or the main top-level site. A SharePoint connector instance traverses the site specified in its SharePoint URL, including any subsites that are located under that site.

To configure an instance of a SharePoint connector:

Admin Console Add Connector Configuration form

  1. Open the Admin Console.
  2. Click Connector Administration > Connectors.

    Admin Console Add Connector page

  3. Select the appropriate Connector Manager from the list.
  4. Click Add New Connector to create a new SharePoint Connector instance.

    Specify a Connector Name. Each connector instance added to a particular connector manager or Google Search Appliance must have a unique name. The connector name must consist of no more than 64 alphanumeric characters. All alphabetical characters must be lower-case. Connector names may include underscores (_) and hyphens (-), but they cannot begin with a hyphen.

  5. Choose sharepoint-connector from the Type drop-down box.
  6. Click Get Configuration Form.
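The connector name rules in step 4 can be expressed as a single regular expression. This is a sketch only: the helper name is_valid_connector_name is hypothetical, and it assumes that an empty name is invalid and that a name may begin with an underscore, since the text above prohibits only a leading hyphen.

```python
import re

# Up to 64 characters drawn from lower-case letters, digits, underscores, and
# hyphens; the first character may not be a hyphen.
_NAME_RE = re.compile(r"[a-z0-9_][a-z0-9_-]{0,63}")


def is_valid_connector_name(name):
    """Check a proposed connector name against the documented naming rules."""
    return _NAME_RE.fullmatch(name) is not None


print(is_valid_connector_name("sharepoint_hr-1"))  # True
print(is_valid_connector_name("-sharepoint"))      # False: begins with a hyphen
print(is_valid_connector_name("SharePoint"))       # False: upper-case letters
```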

The following table describes most of the fields that you must complete to configure a SharePoint connector.

 
Name Description Values and Usage
Crawl URL The URL for the SharePoint site that you want to traverse. This is the starting point for the Google SharePoint Connector to start its traversal. The Google Search Appliance traverses this site and any subsites found under it. This is the Crawl URL you designate on the Connector Configuration page, not the Google Search Appliance Crawl Configuration page.

By default, one connector will only be able to traverse a single SharePoint site collection or a subset of it. To traverse more than one site collection, you must define multiple connector instances.

Google Services for SharePoint can discover all the site collections on a SharePoint server or in a SharePoint farm. When Google Services for SharePoint is deployed on the SharePoint server, only one connector instance is required for crawling and indexing all the content. No extra configuration is required with the connector. The connector checks for the presence of the Google Services for SharePoint on the SharePoint server and uses it if it is deployed.

The connector automatically detects the SharePoint installation type (2007, 2010) from the Crawl URL that you provide.

The URL must contain a fully qualified domain name. The following URLs are acceptable:

  • The root URL of the site, for example, http://www.abc.com.
  • Top-level of site, for example, http://www.abc.com/sites/whatever.
  • URLs starting with https, for example, https://www.abc.com/sites/secret.

We recommend that you do not have two connector instances accessing the same SharePoint Crawl URL.

Kerberos KDC Hostname

Note: Kerberos authentication is not available with the preinstalled SharePoint connector.

The fully-qualified domain name or the IP address of the Kerberos Key Distribution Center (KDC) server. When you provide this, the authentication method is Kerberos. If the field is blank, the authentication method at crawl time is Basic or NTLM. You see an error message if the value provided is not the fully-qualified domain name or IP address of the KDC server.
Domain A valid domain name. Optional field under some circumstances. The Windows domain name if a domain account will be used for the connector. If you are using a local (machine) user, provide the machine name or IP address. This field is optional if the credentials used are local to the SharePoint machine and the authentication scheme being used is HTTP Basic. In all other cases, you must give a correct value for Domain.
Username and password A valid username and password on the SharePoint Server’s domain. The user must have Site Collection Administrator privileges in SharePoint. When the Google Search Appliance authentication method is set to HTTP Basic and Domain User credentials are used, type the Username in the format Username@Domainname.
MySite URL (MOSS 2007 only) URL for the SharePoint MySite that you want to traverse with this connector instance. Optional field for MOSS 2007. The Google Search Appliance Connector for SharePoint uses the MySite base URL and the credentials you provide to determine the complete MySite URL, then crawls MySite and feeds metadata and URLs to the Google Search Appliance for indexing.

For example, if the MySite URL is: http://server.domain/personal/administrator/default.aspx, enter http://server.domain.

On MOSS 2010, do not put a value in this field. Instead, install Google Services for SharePoint (GSS), which is part of the Google Resource Kit for SharePoint. GSS provides the ability to crawl MySite. This works around Issue 169, which is described on the project hosting site.

Include URLs Matching the Following Patterns URL patterns that limit the sites that the connector traverses when it follows links and discovers SharePoint sites Enter regular expressions. Each URL must be on a new line.

The patterns must include the Crawl URL and must include the MySite URL if you specified MySite. The connector uses these patterns as boundaries when it discovers and traverses SharePoint sites. The connector might discover other sites linked to the SharePoint site defined with the Crawl URL, so the URL patterns you enter here must be broad enough to include those other sites.

Although these URL patterns can be regular expressions, the format is slightly different from the regular expression patterns used throughout the Google Search Appliance Admin Console. The regular expression patterns elsewhere, such as on the Crawl and Index page, are Google regular expressions, while the patterns on the SharePoint connector page use GNU Regexp.

Do Not Include URLs Matching the Following Patterns URL patterns that exclude particular parts of SharePoint sites that the connector discovers when it follows links during traversal Optional field. If used, enter regular expressions. Each URL must be on a new line.

The connector uses these patterns to exclude particular sections of the SharePoint sites that are discovered during traversal. See Traversal for information on the complete process. Because the SharePoint connector relies on a metadata and URL feed, the Google Search Appliance crawls and indexes SharePoint sites after the URLs are retrieved during traversal.

SharePoint Crawling Options Check the option(s) you prefer. This determines whether search visibility options are fetched at the site and list level and used before content is fetched Use SharePoint search visibility options, or Feed Unpublished Content. See the discussion below the table for more information about visibility options.
Authorization Handling Do you want to handle authorization through the search appliance or through the connector? Do you want to use Authorization by ACL? This option determines whether to handle authorization through the search appliance or through the connector. Head Request means that the Google Search Appliance will authorize; Connector means that the connector will authorize. There is a new setting, Authorization by ACL, with Connector 3.0. This is discussed at length in Guide to the 3.0 SharePoint Connector.

To use content feed, you must select By Connector.

Username Format in ACE This feature is provided to allow for namespace support. Use the arrow keys to choose between:
  • domain\username
  • username@domain
  • username
Groupname Format in ACE This feature is provided to allow for namespace support for groups. Use the arrow keys to choose between:
  • domain\groupname
  • groupname@domain
  • groupname
LDAP Server Host yourdomain.com Enter the LDAP server host
Port Number 389 (default) Enter the port number of the LDAP server.
Search Base See example dc=yourdomain,dc=com
Authentication Type Simple Choices are Simple or Anonymous.
Connection Method Standard Choices are Standard or SSL.
User Groups Cache Size 1000 (default)  
User Groups Cache Refresh Interval time in seconds 7200 (default)  
User Profile Crawl Option To scan user profiles so that experts can be located. Select option “Index only user profiles.” Default value is “Do not index user profiles”.
Collection Name for User Profiles Relevant only for User Profiles. Specify the collection name that the connector will create in GSA and use for indexing user profiles. This collection is later used in the expert search configuration. An example could be expert_collection.
GSA Admin Username Relevant only for User Profiles. Specify the GSA Admin username. This will be used by the connector to create the collection that holds the user profile URLs.
GSA Admin Password Relevant for User Profiles. Enter the GSA Admin password. This will be used by the connector to create the collection that holds the user profile URLs.
Advanced properties “userProfileFullTraversalInterval” Duration for updating user profiles. You can configure the duration in number of days to perform automatic full traversal of user profiles using advanced configuration ‘userProfileFullTraversalInterval.’ Default value is 7. When userProfileFullTraversalInterval=0, there will be a full traversal of User Profiles during each traversal cycle. When userProfileFullTraversalInterval <0, incremental traversal will always occur and there will not be automatic full traversals.
Global Namespace Specify the namespace to be used for global usernames and groups in fed ACLs.
Local Namespace Specify the namespace to be used for local usernames and groups in fed ACLs.

Tip: Be sure to reselect this setting whenever changing any other values on this configuration page. Choose a setting appropriate for this connector.

Traversal Rate Default is 500 per minute.  
Retry Delay Default is 5 minutes.  
Disable Traversal Use to disable traversal. A checkbox. If checked, traversal is disabled.
Connector Schedule Default is 12 am and 12 am. Change the schedule if you wish to traverse only during off-peak hours, or in a way that best reflects your organization’s needs and data update schedule. Click Add Line to Schedule as needed.
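
The userProfileFullTraversalInterval row in the table above describes three cases (zero, negative, and a positive number of days). The following Python sketch restates those cases; it is an illustration and not connector code. The function name and the days_since_full parameter are hypothetical, and the comparison used for the positive case (a full traversal once the interval has elapsed) is an assumption.

```python
def user_profile_traversal_mode(interval_days, days_since_full):
    """Return "full" or "incremental" for the next user profile traversal."""
    if interval_days < 0:
        # Negative value: incremental traversal always, never an automatic full one.
        return "incremental"
    if interval_days == 0:
        # Zero: a full traversal during each traversal cycle.
        return "full"
    # Positive value (default 7): assumed to trigger a full traversal once
    # that many days have passed since the last full traversal.
    return "full" if days_since_full >= interval_days else "incremental"


print(user_profile_traversal_mode(7, 3))   # incremental
print(user_profile_traversal_mode(0, 0))   # full
print(user_profile_traversal_mode(-1, 30)) # incremental
```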
Using SharePoint Visibility Options

Google Services for SharePoint must be installed to use this feature. In addition, you must configure SharePoint so that different types of pages participate:

  • On the List/Library Settings > Advanced Settings page, under Search, set the radio button Allow items from this document library to appear in search results? to Yes.
  • For sites and other default aspx pages, participation in the search results can be controlled from Site Settings > Search Visibility.
  • For site and page level inclusion/exclusion, appropriate search scopes can be created in SharePoint.
  • Under Search Visibility, in the section Indexing Site Content, set the radio button Allow this web to appear in search results? to Yes.
  • Under Search Visibility, in the section Indexing ASPX Page Content, set the radio button for ASPX page content to Do not index ASPX pages if this site contains fine-grained web permissions.
  • You can also set crawl rules in the MOSS enterprise search settings that behave similarly to the search appliance include and exclude URL patterns.

Scheduling the Connector

To ensure that the content is traversed on the schedule you require, complete the connector schedule page.

Note that a connector scheduled to run from 12 a.m. to 12 a.m. always runs. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance’s standard crawl function.
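
The rule about identical start and end times is easy to get wrong, so here it is restated as a small Python sketch. The function name is hypothetical, hours are assumed to be given on a 24-hour clock with 0 standing for 12 a.m., and the sketch answers only whether a schedule line can ever run, not when.

```python
def schedule_ever_runs(start_hour, end_hour):
    """Return False for a schedule line that can never run."""
    if start_hour == end_hour:
        # Only 12 a.m. to 12 a.m. (0 to 0) means "always"; any other
        # identical pair, such as 9 to 9, never runs.
        return start_hour == 0
    return True


print(schedule_ever_runs(0, 0))   # True
print(schedule_ever_runs(9, 9))   # False
print(schedule_ever_runs(20, 6))  # True
```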


Verifying That the Connector is Working

After you save the connector configuration form, wait a few minutes and then verify on the Feeds page of the Admin Console that the Google Search Appliance is receiving feeds. Ensure that the following entry is present on the Crawl Diagnostics page:

connector_instance_name.localhost

By clicking the above entry and successive links, you can verify all the document IDs that have been sent to the search appliance by the connector named connector_instance_name during the content feed. The document IDs displayed under Crawl Diagnostics are in the following format:

doc?docid=Parent_List_URL%7COriginal_DocID

where Original_DocID is the actual ID of the document that SharePoint understands.

With the above rules, the complete document ID that is sent to the search appliance is in this format:

googleconnector://connector_instance_name.localhost/doc?docid=Parent_List_URL%7COriginal_DocID

It might take some time to update the entries on the Feeds and Crawl Diagnostics pages, depending on the status of the search appliance.
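
When checking entries on the Crawl Diagnostics page, it can help to split a complete document ID back into its parts. The following Python sketch does this for the format shown above. The function name is hypothetical, the sample values are invented for illustration, and splitting on the last | character is an assumption (it presumes the original SharePoint document ID does not itself contain a | character).

```python
from urllib.parse import unquote

PREFIX = "googleconnector://"
HOST_SUFFIX = ".localhost"


def parse_feed_doc_id(doc_id):
    """Split a fed document ID into (connector name, parent list URL, original ID)."""
    if not doc_id.startswith(PREFIX):
        raise ValueError("not a googleconnector:// document ID")
    host, sep, tail = doc_id[len(PREFIX):].partition("/doc?docid=")
    if not sep or not host.endswith(HOST_SUFFIX):
        raise ValueError("unexpected document ID layout")
    connector_name = host[:-len(HOST_SUFFIX)]
    # %7C is the URL-encoded | that separates the two parts of the docid.
    parent_list_url, sep, original_id = unquote(tail).rpartition("|")
    if not sep:
        raise ValueError("docid has no | separator")
    return connector_name, parent_list_url, original_id


print(parse_feed_doc_id(
    "googleconnector://sp1.localhost/doc?docid=http://sp.example.com/Lists/Tasks%7C42"
))
```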


Embedding the Google Search Box for SharePoint

You can embed the Google Search Box for SharePoint on your SharePoint sites to give users the ability to search indexed content directly from SharePoint. The Google Search Box has the look and feel of a native SharePoint application, and replaces the SharePoint search box. For information on deploying the Google Search Box for SharePoint, read Configuring the Google Search Box for SharePoint in Connector documentation.


SharePoint Site Alias Mapping

SharePoint Site Alias Mapping defines the mapping of source URL patterns to the corresponding alias value that is used to rewrite URLs before they are sent to the Google Search Appliance.

The Site Alias Mapping feature is useful when the base URLs of the documents included in the feeds sent to the Google Search Appliance need to be different from the URLs discovered by the connector during traversal.

If Site Alias Mapping is specified, the alias becomes the URL used by the search appliance to crawl and index the content. The Site Alias Mapping is also the display URL in search results.

Alias mapping behaves differently, depending on which connector version you have.

  • In versions before 2.6.2, alias mapping was applied by default over both the display URL and the record URL.
  • In versions 2.6.2, 2.6.4 and 2.6.6, alias mapping was applied only over the display URL.
  • In versions 2.6.8 and later, alias mapping can be applied selectively to either or both of the display URL and record URL. The default is to apply alias mapping over the display URL only.

To add multiple alias mappings, click the Add More button on the connector configuration page, then specify the values for Source URL and Replace with Alias.

The SharePoint Site Alias is similar to the Alternative Access Mappings in SharePoint. Both features allow multiple entry points to a particular web application; for example, a SharePoint installation can be used internally by one group of users and externally by partners and other trusted individuals. The entry points for the internal and external users are different URLs. In such a case, the connector uses the internal URL to traverse the SharePoint content, but the Google Search Appliance uses the external URL to crawl and serve the content. Users also access the search results using the external URL.

SharePoint connectors after the 2.0 release support multiple aliasing, which is an extension to the Alias-Host and Alias-Port feature that was provided in the earlier releases. The feature allows complete support for aliasing where administrators can specify multiple URL patterns and their alias values. When documents are sent to the Connector Manager, if the document URL matches any of the source URL patterns that have been specified at the time of connector configuration, the corresponding alias value is used to rewrite the document URL. This is particularly useful in load balanced environments where one or more load balanced servers are used to manage traffic. The connector rewrites the URL with the URL pointing to the load balanced front end server (given as Replace with Alias) so that all Google Search Appliance traffic and the documents displayed in search results have the URL pointing to the load balanced front end server.

For a connector-discovered URL to match a given source URL pattern, the following constraints apply:

  1. The protocol must be the same.
  2. The host must be the same.
  3. The port must be the same. If no port is specified, the default port is used.
  4. If the source pattern contains a path, the URL to be matched must contain exactly the same path as its starting path value.

The user is allowed to specify a regular expression, allowing the port number to be ignored completely in the source URL. The character ^ is used to indicate this. This enables the administrator to define just one alias pattern and have it applied over URLs that differ only in their port numbers. If a port number is specified in the source URL pattern along with ^, the ^ symbol is ignored.

If there is an entry only for Source URL without a value for Replace with alias, the connector ignores the entry. You must provide both a Source URL pattern and the corresponding Alias. This feature in the connector requires JavaScript to be enabled on the Internet browser.

The following examples illustrate how a URL is matched and re-written using the alias value.

Example 1

Source URL: http://MyCompany/dev/
Replace with Alias: http://MyCompany:2020/

 
Connector-Discovered URL URL Sent to Search Appliance After Aliasing is Applied
http://MyCompany/dev/test/ http://MyCompany:2020/test/
http://MyCompany:80/dev/test/ http://MyCompany:2020/test/
http://MyCompany:8080/dev/test/ Does not match and is not re-written.

Example 2

Source URL: ^http://MyCompany/dev/
Replace with Alias: http://MyCompany:2020/

 
Connector-Discovered URL URL Sent to Search Appliance After Aliasing is Applied
http://MyCompany/dev/test/ http://MyCompany:2020/test/
http://MyCompany:80/dev/test/ http://MyCompany:2020/test/
http://MyCompany:8080/dev/test/ http://MyCompany:2020/test/
http://MyCompany:8080/test/ Does not match and is not re-written.

Example 3

Source URL: ^http://MyCompany:8080/dev/
Replace with Alias: http://MyCompany:2020/

 
Connector-Discovered URL URL Sent to Search Appliance After Aliasing is Applied
http://MyCompany/dev/test/ Does not match and is not re-written.
http://MyCompany:80/dev/test/ Does not match and is not re-written.
http://MyCompany:8080/dev/test/ http://MyCompany:2020/test/
http://MyCompany:8080/test/ Does not match and is not re-written.
http://MyCompany:2020/test/ Does not match and is not re-written.

Example 4

Source URL: ^http://MyCompany/dev/
Replace with Alias: https://loadBalancedServer.mycompany:4343/

 
Connector-Discovered URL URL Sent to Search Appliance After Aliasing is Applied
http://MyCompany/dev/test/ https://loadBalancedServer.mycompany:4343/test/
http://MyCompany:80/dev/test/ https://loadBalancedServer.mycompany:4343/test/
http://MyCompany:8080/dev/test/ https://loadBalancedServer.mycompany:4343/test/
http://MyCompany:8080/test/ Does not match and is not re-written.
http://MyCompany:2020/test/ Does not match and is not re-written.
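
The matching constraints and the ^ rule can be summarized in a short Python sketch. This is an illustration of the rules described in this section, not the connector's implementation; the function name apply_alias is hypothetical, and the default ports (80 for http, 443 for https) are an assumption about what "the default port" means.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed defaults


def apply_alias(url, source, alias):
    """Rewrite url with alias if it matches the source pattern; else return it unchanged."""
    ignore_port = source.startswith("^")
    src = urlsplit(source[1:] if ignore_port else source)
    found = urlsplit(url)

    # Protocol and host must be the same.
    if found.scheme != src.scheme or found.hostname != src.hostname:
        return url
    # Ports must match, unless ^ is given and the pattern names no port.
    # A port written in the pattern takes precedence over a leading ^.
    if src.port is not None or not ignore_port:
        src_port = src.port or DEFAULT_PORTS.get(src.scheme)
        found_port = found.port or DEFAULT_PORTS.get(found.scheme)
        if src_port != found_port:
            return url
    # The pattern's path must be the starting path of the URL.
    if not found.path.startswith(src.path):
        return url

    remainder = found.path[len(src.path):]
    if found.query:
        remainder += "?" + found.query
    return alias.rstrip("/") + "/" + remainder.lstrip("/")


print(apply_alias("http://MyCompany:8080/dev/test/",
                  "^http://MyCompany/dev/", "http://MyCompany:2020/"))
```

The call above corresponds to the third row of Example 2 and prints http://MyCompany:2020/test/.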

Using Connector URL Patterns to Exclude SharePoint Index and List Pages

You might want to index documents only in your SharePoint installation, excluding SharePoint index and list pages, which have the extension aspx. Index and list pages do not typically contain much content and indexing them consumes your search appliance license unnecessarily.

To exclude aspx pages, but include child documents that you want to index, ensure the following:

  • The URLs of these unwanted pages are excluded appropriately by exclusion filters of the connector, but they are not included by the inclusion filters.
  • The URLs of the child documents are not inadvertently excluded by the exclusion filters, and they are included in the inclusion filters.

Here are paired include and exclude patterns that work correctly to include child content files while excluding the aspx pages.

In the following example, all URLs except those ending in aspx are included. All documents under http://sharepoint.example.com/ except any aspx pages are fed to the search appliance.

  • Include pattern: http://sharepoint.example.com/
  • Exclude pattern: aspx$

In the following example, all documents under http://sharepoint.example.com/testSite/ are discovered and fed to the search appliance, but AllItems.aspx is excluded.

  • Include pattern: http://sharepoint.example.com/testSite/
  • Exclude pattern: http://sharepoint.example.com/testSite/forms/AllItems.aspx

In the following example, all documents under the testSite hierarchy, including AllItems.aspx pages, are excluded.

  • Include pattern: http://sharepoint.example.com/
  • Exclude pattern: http://sharepoint.example.com/testSite/
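
To try out a pair of include and exclude patterns before saving them, you can approximate the filtering offline. The sketch below uses Python's re module, which is not the GNU Regexp library that the connector uses, so treat the results as an approximation. The function name is_fed is hypothetical, and the sketch assumes that a pattern matches when it is found anywhere in the URL.

```python
import re


def is_fed(url, include_patterns, exclude_patterns):
    """Return True if url matches an include pattern and no exclude pattern."""
    included = any(re.search(p, url) for p in include_patterns)
    excluded = any(re.search(p, url) for p in exclude_patterns)
    return included and not excluded


include = ["http://sharepoint.example.com/"]
exclude = ["aspx$"]
print(is_fed("http://sharepoint.example.com/testSite/report.doc", include, exclude))
print(is_fed("http://sharepoint.example.com/testSite/forms/AllItems.aspx", include, exclude))
```

The first URL is fed; the second is excluded because it ends in aspx.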

Traversal

The Google Search Appliance locates web and file system content for indexing through a process called crawl or crawling.

The Google Search Appliance locates content in a content repository using a process called traversal. Traversal is a process in which the connector issues queries to the repository to retrieve content files and the metadata associated with each content file. The content files and metadata are then fed to the Google Search Appliance as a content feed. For more information about content feeds, see the Feeds Protocol Developer’s Guide in GSA product documentation.

In the initial traversal of a repository, the files are retrieved by last-modified date, starting with the oldest documents in the repository. After the initial traversal, files are retrieved when they are added to a repository or modified.

This SharePoint connector is based on a content feed. The connector sends content files and related metadata as a feed to the Google Search Appliance, and the appliance indexes the content.

The connector identifies sites on other hosts as SharePoint sites by calling the appropriate web services. When you configure the connector, you provide URL patterns that define locations the connector must traverse and locations the connector is prohibited from traversing. Use these patterns to include your company’s domains and to exclude sites you do not control or do not want traversed.

The default web services provided with SharePoint are limited to a single site collection. If you need to traverse multiple site collections, see Traversing Multiple Site Collections with Google Services for SharePoint.


Objects That Are Indexed

By default, the SharePoint connector feeds all document versions to the search appliance.

The SharePoint connector can traverse different types of content, depending on the SharePoint version.

The SharePoint connector crawls and indexes documents that are not published. Unpublished documents are searchable. They are subject to the same access control rules as other objects. When a published document becomes unpublished, it is still indexed and still subject to the same access control rules.

Microsoft Office SharePoint Server 2007 or 2010 and Microsoft Windows SharePoint Services 3.0

Under Microsoft Office SharePoint Server 2007 or 2010 and Microsoft Windows SharePoint Services 3.0, the connector can traverse the following types of content:

  • Public and Personal Sites
  • Folder
  • File
  • Document Library
  • Discussion Boards
  • Site
  • Web
  • Attachment
  • Links
  • Wikipage Library
  • Tasks
  • Calendar
  • Contacts
  • Announcements
  • Issues
  • Surveys
  • Custom Lists
  • Picture Library
  • Unknown Document Type
  • Project Tasks
  • Administrative Tasks
  • Report Library
  • Translation Management Library
  • Data Connection Library
  • Slide Library
  • Form Library
  • Alerts

How Metadata is Handled

The list of metadata that the SharePoint connector sends to the Google Search Appliance is governed by the SharePoint web service response to a search request. For some attributes, the web service usually returns multiple copies of the same metadata with different names. When the connector parses the web service response, it does not remove such duplicate metadata. Because the duplicates are not removed, the same metadata is indexed with different names.

In addition, if documents are sent by a metadata-and-URL connector, the search appliance directly fetches some document-specific metadata. These properties might or might not have already been sent by the connector. Therefore, the list of metadata indexed by the search appliance is the union of metadata sent by the connector and metadata discovered by the search appliance.

For some properties, such as the author property, the SharePoint connector ensures that the metadata is present for every document. The connector always appends its own list of these properties for every document. The metadata values are retrieved in various ways. For example, for the author property, the connector sends a property named SharePoint:author. This property must be present for every document that the connector feeds to the search appliance. The connector examines the following fields, which are returned in the web service response, to obtain the value that will be assigned to the SharePoint:author property:

  1. ows_Editor
  2. ows_Author

    • ows_Author is picked only if ows_Editor is not found.
    • ows_Editor indicates the user who last modified the document. It is similar to Modified_By.
    • ows_Author indicates the user who created the document. It is similar to Created_by.

If neither field is found, a default value, No Author, is used.
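
The fallback order for the author value can be sketched as follows. This is an illustration in Python rather than connector code; the function name is hypothetical, the web service response is represented as a plain dictionary, and treating an empty value the same as a missing field is an assumption.

```python
def sharepoint_author(fields):
    """Pick the SharePoint:author value from web service response fields."""
    # ows_Editor is preferred; ows_Author is used only if ows_Editor is absent.
    for key in ("ows_Editor", "ows_Author"):
        value = fields.get(key)
        if value:
            return value
    return "No Author"


print(sharepoint_author({"ows_Editor": "jsmith", "ows_Author": "adoe"}))  # jsmith
print(sharepoint_author({"ows_Author": "adoe"}))                          # adoe
print(sharepoint_author({}))                                              # No Author
```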


How the Traversal Rate Affects Connector Behavior

When you configure a connector instance on the Google Search Appliance Admin Console, you set a traversal rate. The value indicates how many documents per minute the connector traverses in the repository. The default value is 500 documents per minute.

You can set the traversal rate to values higher or lower than the default of 500 documents per minute. The connectors and connector manager are capable of faster traversal rates.

  • To reduce resource consumption in the repository, lower the traversal rate.
  • To increase indexing speed, raise the traversal rate.

If the traversal rate is set to 100 and the connector traverses 100 documents in less than one minute, the traversal process pauses. When the full minute elapses, the traversal process resumes.
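
The pause described above amounts to simple arithmetic on the current minute. The following Python sketch shows the calculation; the function and parameter names are hypothetical, and it models only the behavior described in this section, not the connector manager's actual scheduler.

```python
def seconds_to_pause(docs_this_minute, traversal_rate, elapsed_seconds):
    """Return how long traversal pauses before the current minute elapses."""
    if docs_this_minute >= traversal_rate and elapsed_seconds < 60:
        # The per-minute quota is used up, so wait for the rest of the minute.
        return 60 - elapsed_seconds
    return 0


# Rate of 100, and 100 documents traversed in the first 20 seconds:
print(seconds_to_pause(100, 100, 20))  # 40
# Only 50 documents traversed so far, so traversal continues:
print(seconds_to_pause(50, 100, 20))   # 0
```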


Creating and Tuning Connector Schedules

When you schedule connector instances, the performance of the repository is a significant consideration. Depending on the number of traversals and the size of the documents retrieved for indexing, the use of connectors may degrade repository performance. Monitoring and performance-tuning the repository server is especially important when you deploy a new connector or document repository.

Note that a connector scheduled to run from 12 a.m. to 12 a.m. always runs. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance’s standard crawl function.

When you determine the connector schedule, take the following factors into account:

  • When to run the traversal process

    You might add a connector instance to run in off-peak hours to spread out the initial index creation during times of low demand on the repository.

  • How long to run the traversal process

    You might add a connector instance with a very brief schedule to perform predeployment testing, and experiment to see the effects of lengthening the schedule.

A connector instance cannot self-modify its traversal schedule. Therefore, you must monitor the performance of both the Google Search Appliance and the content management system regularly, and make manual adjustments to the traversal schedules of connectors to optimize performance. You can tune scheduling for optimal performance in these ways:

  • Create a schedule that minimizes the number of concurrent traversal processes that are running.
  • Restrict the times at which those processes run. For example, if the content management system is executing a resource-intensive job, the connector might run slowly. Schedule the connector to run at times when demand on the content management system is light.

Additionally, the connector manager interrupts a connector that takes too long to process a batch of documents. The default duration after which the connector manager interrupts the connector is 1800 seconds, or 30 minutes. The duration is set by the value of the traversal.time.limit property in the applicationContext.properties file. If you want a shorter duration, you can change the value of traversal.time.limit.

To change the default value of the traversal.time.limit property:

  1. Stop Apache Tomcat.
  2. Open the applicationContext.properties file in a text editor. The top of the file contains comments with explanatory text. Do not uncomment any of the explanatory text, including the example for traversal.time.limit.
  3. Examine the file to see whether there is a traversal.time.limit entry.
    • If there is an entry, modify the duration.
    • If there is no entry, add one to the end of the file:

      traversal.time.limit=duration_in_seconds

  4. Save the file.
  5. Restart Tomcat.
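
Steps 2 through 4 can also be scripted. The following Python sketch edits the text of a properties file in memory: it leaves comment lines untouched, modifies an existing traversal.time.limit entry, or appends one if none exists. The function name is hypothetical, other.property is an invented placeholder entry, and the sketch handles only the simple key=value form, not every syntax that Java properties files allow. Stop Tomcat before changing the real file and restart it afterward, as described in the steps.

```python
KEY = "traversal.time.limit"


def set_traversal_time_limit(properties_text, seconds):
    """Return properties_text with traversal.time.limit set to the given value."""
    new_line = f"{KEY}={seconds}"
    lines = properties_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # explanatory comments stay commented out
        if stripped.split("=", 1)[0].strip() == KEY:
            lines[i] = new_line  # modify the existing entry
            break
    else:
        lines.append(new_line)  # no entry found: add one at the end of the file
    return "\n".join(lines) + "\n"


sample = "# traversal.time.limit=1800\nother.property=1\n"
print(set_traversal_time_limit(sample, 900))
```
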
Changing the Connector Retry Delay and Schedule

In connector manager 3.0 and search appliance software version 7.0 and later, the search appliance Admin Console enables you to modify the connector retry delay, which is the time period that elapses between when one traversal is completed and the next starts. For example, you might want the connector to traverse the repository every hour between 8 a.m. and 8 p.m. or every two hours from midnight to 9 a.m.

The default retry delay is 5 minutes.

To change the traversal schedule, set the start and end times for traversal using the Connector Schedule drop-down menus.
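
To reason about how a retry delay and a schedule window interact, the following Python sketch models them. It is a simplified model under stated assumptions, not the connector manager's actual scheduling logic: the function names are hypothetical, the window is expressed in whole hours, and the start and end hours are assumed to differ.

```python
from datetime import datetime, timedelta


def in_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """True if hour is in [start_hour, end_hour); the window may wrap midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour


def next_traversal_start(finished: datetime, retry_delay: timedelta,
                         start_hour: int, end_hour: int) -> datetime:
    """Earliest start after the retry delay that lies inside the window."""
    candidate = finished + retry_delay
    if in_window(candidate.hour, start_hour, end_hour):
        return candidate
    # Outside the window: wait until the window next opens.
    opening = candidate.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    if opening <= candidate:
        opening += timedelta(days=1)
    return opening


# A traversal that finishes at 19:30 with a one-hour delay and an
# 8 a.m. to 8 p.m. window has to wait for the next morning.
print(next_traversal_start(datetime(2024, 1, 1, 19, 30), timedelta(hours=1), 8, 20))
```
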


Resetting Traversal

If traversal has stopped or no new documents are being fed to the search appliance, you can reset the connector traversal process. When you reset the traversal, the content is traversed in full from the beginning point and the index is recreated.

In search appliance software version 7.0 and later, use the Reset link for the connector instance on the Admin Console > Connectors page. On search appliances running software versions earlier than 7.0, use the following instructions from a browser. If you are using the SharePoint connector and connector manager installed on the search appliance, use the Reset link.

To reset the traversal, open a browser and enter a URL in the following format, where connector_manager_host_address is the location of the connector manager and connector_name is the name of the connector whose traversal you are restarting:

http://connector_manager_host_address:8080/connector-manager/restartConnectorTraversal?ConnectorName=connector_name

For example, if the host address is www.example.com and the connector is named our_connector:

http://www.example.com:8080/connector-manager/restartConnectorTraversal?ConnectorName=our_connector

The URLs are case-sensitive. After you submit the command, you see a response in the browser window. Some browsers display only a zero (0). Other browsers display a full XML document. A 0 response indicates success. A nonzero response indicates a failure.

<CmResponse>
  <StatusId>0</StatusId>
</CmResponse>
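
If you script the reset instead of using a browser, the URL format and the response check can be sketched as follows. The helper names build_reset_url and reset_succeeded are hypothetical. The sketch only builds the URL and interprets a response body; it does not send the request.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def build_reset_url(host: str, connector_name: str, port: int = 8080) -> str:
    """Build the restartConnectorTraversal URL for one connector."""
    name = quote(connector_name, safe="")
    return (f"http://{host}:{port}/connector-manager/"
            f"restartConnectorTraversal?ConnectorName={name}")


def reset_succeeded(response_body: str) -> bool:
    """True if the response reports StatusId 0, as XML or as a bare 0."""
    body = response_body.strip()
    if body == "0":
        return True
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    status = root.findtext("StatusId")
    return status is not None and status.strip() == "0"


print(build_reset_url("www.example.com", "our_connector"))
print(reset_succeeded("<CmResponse>\n  <StatusId>0</StatusId>\n</CmResponse>"))
```

The URL could then be fetched with a standard HTTP client such as urllib.request.urlopen, subject to the host restrictions described below.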

Note that with the default Connector Manager v2.x configuration, connector_manager_host_address must be localhost (or more specifically, 127.0.0.1), and the request must originate from the machine on which the Connector Manager is running. If direct access to the Connector Manager machine is inconvenient, Connector Administrators may wish to add administration machines to the list of IP addresses allowed by the RemoteAddrValve.


When to Delete Feeds

Under the following circumstances, Google recommends that you delete connector feeds. This recommendation applies only to content-feed-based connectors.

  • When you reindex content and the expected new document set leaves out documents or metadata that were previously indexed.
  • When you delete a connector instance.

When you are reindexing the content, follow this general procedure:

  1. On the Admin Console > Connector Administration > Add Connector page, check Disable Traversal.

    Traversal is enabled by default.

  2. Make any required updates to the connector configuration.
  3. Delete the feed.
  4. Monitor the Crawl Diagnostics page in the Admin Console.
  5. When the indexed documents are removed from the index, navigate to the Connector Administration > Connectors page and click the Reset link for the connector.
  6. On the Admin Console > Connector Administration > Add Connector page, enable traversal by unchecking Disable Traversal.

If you are deleting a connector instance, we recommend that you separately delete the feed. Otherwise, content indexed by the connector is not removed from the index and public content indexed by the connector continues to appear in search results. Secure content does not appear in search results because the authorization check fails.


When to Restart the Connector Service

Restarting the connector service means restarting Apache Tomcat. Restart the connector service only under the following circumstances:

  • When you manually edit the connector’s properties file or one of the configuration files (applicationContext.xml, applicationContext.properties, logging.properties, or connectorInstance.xml). Alternatively, for edits to the connectorInstance.xml file only, you can apply the changes on the Admin Console, without restarting the connector service. Click the Edit link for the connector instance, then click Save Configuration.
  • When you install a connector or connector manager JAR file.

If you are using the SharePoint connector and connector manager installed on the search appliance, this section does not apply.


Serving

The following sections describe how the connector serving process works and how serve-time security is maintained.


About Serving

Using the Google Search Appliance and Google Search Appliance Connector for SharePoint to search a SharePoint content repository is similar to using Google.com to search the web.

To locate particular information or documents in the repository, a user opens a browser window and navigates to a search page. The search page can be the default search page available on the Google Search Appliance or it can be a customized search page. The user types a search term in the search box and clicks Search.

The Google Search Appliance searches its index for documents and metadata containing the user’s search term.

When the Google Search Appliance finds all the documents that match the search request, it presents the user with a pop-up window and asks for the user’s user name and password. The connector manager passes the search results and the user credentials to the repository server. The repository server authenticates the user, evaluates the permissions for each document returned by the user’s search, determines which documents the user is authorized to view, and returns that information to the connector manager.

The Google Search Appliance displays a results page listing the documents the user is authorized to view. When the user clicks a link on the results page, a web client window opens in which the user can view the document or its metadata, depending on how the connector is configured. If the user does not have an open session to the repository, the web client asks for the user’s login credentials before displaying the document.


How Security is Supported

Serve Time User Authorization and Document Access Control

At serve time, the Google Search Appliance supports document-level authorization of each search user. Content in a SharePoint repository can be served as secure or public content.

The value of the Make Public checkbox on the Admin Console determines whether content is secure or public.

  • When you select the Make Public checkbox, content from the specified URL is made available as public content. Any user can search and view this content.
  • When you clear the Make Public checkbox, content from the specified URL is made available as secure content. A user performing a secure search is prompted to enter a username and password. The Google Search Appliance uses NTLM or HTTP Basic to verify the user’s credentials and authorize the user to view the content.

You can provide single sign-on (SSO) capabilities. How you provide SSO depends on which software version is running on your Google Search Appliance.

Silent Authentication

By default, a Google Search Appliance user who searches for and views secure content must enter credentials. However, intranet users who logged on to a Windows domain before performing a search expect their Windows credentials to be passed automatically to other services, including the search appliance. This is called silent authentication.

You can configure silent authentication for the SharePoint connector in the following ways:

  • On Google Search Appliance software version 5.2 and later, the search appliance supports Kerberos authentication during serving.
  • On Google Search Appliance software version 5.0.x with a metadata-and-URL feed, use the Google SAML Bridge for Windows. For more information, see Enabling Windows Integrated Authentication in GSA product documentation.

Feeding Access Control Lists to the Google Search Appliance

The SharePoint connector can feed access control lists (ACLs) associated with a document to the search appliance as document metadata. The ACLs are then used to authorize users to see search results. The connector also keeps track of security changes that might affect the ACLs of SharePoint entities. In such cases, the entities are traversed and fed to the search appliance again. ACL change detection is a separate process from change detection for content and regular metadata.

You must deploy Google Services for SharePoint to use the ACL feature. By default, this feature is disabled.

To use the ACL indexing feature, the user who is configuring the connector must either have Full Read permissions at the web application level or be a SharePoint Site Collection Administrator. To enable ACL indexing for connectors, select the Authorization by ACL checkbox on the connector configuration page in the Admin Console.

Authorization by ACL checkbox

After the connector is restarted, the connector sends document ACLs as part of the metadata. The ACLs are fed as per-URL ACLs, which are shown on the Admin Console under Status and Reports > Crawl Diagnostics.

Admin Console Crawl Diagnostics page

Note: The connector does not send ACLs for documents it traversed before you enabled the feature.

For More Security Information

For more information on authentication and authorization with connectors, see the chapters on “Crawl, Index, and Serve,” “Use Cases with Public and Secure Serve for Multiple Authentication Mechanisms,” and “Cookie-Based Authentication Scenarios” in Managing Search for Controlled-Access Content in GSA product documentation.


Uninstalling Connectors and Connector Managers


Deleting a Connector Instance from the Admin Console

You delete a connector instance only on the Admin Console of the Google Search Appliance. When you delete the instance, you delete the configuration information for the instance. The connector manager no longer creates and runs the instance.

Each connector instance is listed on the Admin Console in the Connector Administration > Connectors section. The indicator light is either green or red. Green indicates the existence of the connector instance.

To delete a connector instance:

  1. Log in to the Admin Console as an administrator.
  2. Click Connector Administration > Connectors.
  3. Click the Edit link for the correct connector.
  4. Check the Disable Traversal checkbox for the connector you are deleting.
  5. Click Save Configuration.
  6. On the Connector Administration > Connectors page, locate the connector instance you want to delete.
  7. Click the Delete link on the line for the correct connector instance.
  8. Click OK.

Deleting a Connector Manager

To delete a connector manager, you must first unregister the connector manager from the Admin Console, then uninstall the connector manager on the Tomcat host.

Before you unregister a connector manager, you must delete all connector instances associated with that connector manager. If you have a large number of connector instances, you can first stop the Tomcat instance where the connector manager is running, then unregister the connector manager.

It is also possible to uninstall the connector manager on the Tomcat host, then unregister the connector manager on the Admin Console.

Unregistering a Connector Manager from the Admin Console

To unregister a connector manager from the Admin Console:

  1. Log in to the Admin Console as an administrator.
  2. Click Connector Administration > Connector Managers.
  3. Locate the connector manager you want to delete.
  4. Click the Unregister link on the line for the correct connector manager.
  5. Click OK.

Uninstalling a Connector Manager

To uninstall a connector manager from the Tomcat host, do one of the following:

  • On Windows, click Start > All Programs > Google Search Appliance Connector version_number > Uninstall.
  • On Linux, click the appropriate shortcut.

To manually delete a connector manager on the Apache Tomcat host:

  1. Log in to the Apache Tomcat host as the installation owner (the user who installed Tomcat).
  2. Shut down Tomcat.
  3. Navigate to the $CATALINA_HOME/webapps directory.
  4. Delete the connector-manager.war file.
  5. Delete the $CATALINA_HOME/webapps/connector-manager directory.
  6. Restart Tomcat.

Troubleshooting the Google Search Appliance Connector for SharePoint

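This section provides information on the following topics:

  • Diagnosing Connector Problems
  • Logging
  • Error Messages
  • Logging Feed Record and Metadata Information to a Text File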

If you have a problem that requires you to file a ticket with Google Cloud Support, be prepared to provide Support with the following information:

  • Verbose connector logs. See Logging for information on changing the default logging level. If you are reporting a problem to Support, it is ideal if you can reproduce the problem with the logging level set to ALL. However, log files with entries made when the problem occurred are also helpful.
  • Connector configuration files.
  • Feed record and metadata log file. See Logging Feed Record and Metadata Information to a Text File for information on generating this log file.

Diagnosing Connector Problems

If you create a connector instance and no search results are returned, use the following checklist to help diagnose the problem.

 
  • Problem: The connector has not traversed any documents.

    How to diagnose: View the Admin Console Feeds page or Crawl Diagnostics page to confirm. View the connector logs to help determine the specific reason.

  • Problem: The search appliance has not accepted the feed.

    How to diagnose: View the Admin Console Feeds page to determine whether the search appliance is accepting feeds.

  • Problem: The connector has not traversed the designated test documents.

    How to diagnose: View the Admin Console Crawl Diagnostics page. Examine the connector logs and look for the end of a traversal or for errors associated with specific documents. Lastly, enable the teedFeedFile and reset the traversal.

  • Problem: The search appliance has not indexed the documents.

    How to diagnose: This can be difficult to determine, but the Crawl Diagnostics page tells you which content files have not been indexed. Usually, you must wait until the content is indexed. This failure is more common with metadata-and-URL feed connectors.

    With content feed connectors, a document can appear on the Crawl Diagnostics page almost immediately, sometimes before the feed appears on the Feeds page. However, the document does not appear in search results for another 5 to 15 minutes. If a document does not appear on the Crawl Diagnostics page, it has not been indexed and probably has not been traversed.

  • Problem: Secure documents were not included in test searches.

    How to diagnose: Ensure that a secure search was performed.

  • Problem: There were authentication failures.

    How to diagnose: Depending on the search appliance version, examine the Security Manager log or the connector logs.

  • Problem: There were authorization failures.

    How to diagnose: Examine the authorization log on the search appliance Access Control page or the connector logs. For metadata-and-URL feeds or policy ACLs, this is where you will find the information you need. For connector authorization, the connector log has more details about failures than the search appliance authorization log.

When you examine the connector logs, error messages labeled SEVERE or Exception are good starting points. For authorization issues, search the logs for the user name of the users who experienced authorization failures.
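
A first pass over a large log can be automated. The following Python sketch uses a hypothetical helper, find_problem_lines, to keep only lines that mention SEVERE or Exception, plus lines that mention a given user name. The sample lines marked as hypothetical are invented for illustration and are not actual connector output.

```python
def find_problem_lines(log_lines, user_name=None):
    """Return log lines that mention SEVERE or Exception, or the given user."""
    hits = []
    for line in log_lines:
        if "SEVERE" in line or "Exception" in line:
            hits.append(line)
        elif user_name and user_name.lower() in line.lower():
            hits.append(line)
    return hits


sample_log = [
    "INFO: Traversal batch completed.",  # hypothetical line
    "SEVERE: Feed Exception during traversal.",
    "com.google.enterprise.connector.pusher.FeedException: Connection refused: connect",
    "WARNING: authorization failed for user jdoe",  # hypothetical line
]
for line in find_problem_lines(sample_log, user_name="jdoe"):
    print(line)
```
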


Logging

Logging is a useful technique for recording information about how your installation is operating. You can use the information logged for troubleshooting the operations of the connector, the Google Search Appliance, and SharePoint.

The connector manager and connectors use the java.util.logging package for logging. The installer installs a logging mechanism for the connector and starts the logging process automatically. The default logging configuration is defined in the logging.properties file.

To customize the configuration, navigate to connectors_root_dir/connector_name/Tomcat/webapps/connector-manager/WEB-INF/classes and edit the logging.properties file there.

The following line in the file sets the default logging level for the SharePoint connector:

.level=INFO

The default logging level for most packages and output destinations (handlers) is INFO. To enable debugging at a finer level of granularity, you can change the default connector manager logging level to ALL or FINER. For example, you might change the logging level as follows:

.level = ALL

The possible values of the level property are OFF, SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, and ALL. The default level is INFO.
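
The levels in this list run from least to most verbose. In java.util.logging, a message is written only if its level is at least as severe as the configured level. The following Python sketch models that rule and builds a logging.properties line. The helper names level_line and is_logged are hypothetical, and the code is an illustration, not connector code.

```python
# Level names in the order listed above: least verbose first.
LEVELS = ["OFF", "SEVERE", "WARNING", "INFO", "CONFIG",
          "FINE", "FINER", "FINEST", "ALL"]


def level_line(level: str, logger: str = "") -> str:
    """Build a logging.properties line; an empty logger name is the root logger."""
    name = level.strip().upper()
    if name not in LEVELS:
        raise ValueError(f"unknown logging level: {level!r}")
    return f"{logger}.level={name}"


def is_logged(message_level: str, configured_level: str) -> bool:
    """True if a message at message_level passes the configured level."""
    msg = LEVELS.index(message_level.upper())
    cfg = LEVELS.index(configured_level.upper())
    # OFF and ALL are thresholds, not levels that messages are logged at.
    return 0 < msg < len(LEVELS) - 1 and msg <= cfg


print(level_line("ALL"))                     # root logger entry
print(level_line("all", "httpclient.wire"))  # named logger entry
print(is_logged("FINE", "INFO"))
```
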

You can also adjust the logging level on the Admin Console. However, this change affects only the currently running process and reverts to the default when the connector manager restarts.

The output from the FileHandler appears in the connectors_root_dir/connector_name/Tomcat/logs directory. The output appears in the google-connectors.sequence.log file, where sequence is a series of numbers starting with 0 and incremented by 1 on each occurrence (0, 1, 2, 3...n). The first three log file names would be google-connectors.0.log, google-connectors.1.log, and google-connectors.2.log.
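
When you collect these files for analysis, sort them by sequence number rather than by name, because a plain text sort places google-connectors.10.log before google-connectors.2.log. A minimal Python sketch, using a hypothetical helper named ordered_logs:

```python
import re

# Matches google-connectors.<sequence>.log and captures the sequence number.
LOG_NAME = re.compile(r"^google-connectors\.(\d+)\.log$")


def ordered_logs(file_names):
    """Return connector log file names sorted by numeric sequence."""
    numbered = []
    for name in file_names:
        match = LOG_NAME.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


names = ["google-connectors.10.log", "google-connectors.2.log",
         "catalina.out", "google-connectors.0.log"]
print(ordered_logs(names))
```
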

To log all HTTP communications between the connector and the SharePoint server, use the httpclient.wire log. Set this log in the logging.properties file only to debug problems, because a very large amount of data is logged, some of it in binary format.

The default level is SEVERE:

httpclient.wire.level=SEVERE

Change the level to ALL:

httpclient.wire.level=ALL

After editing the logging.properties file, restart Tomcat.

In addition, enable logging for the content management system’s native API on the Apache Tomcat host and, if relevant, on the repository server host.

Logging Excluded URLs

In the SharePoint Connector 2.0 release, the connector writes all URLs that it excludes during traversal to a file called excluded_url%g.txt. The file is saved in a directory called excluded-URLs. This file is created on a per-connector instance basis and is located under $CATALINA_HOME/webapps/connector-manager/WEB-INF/connectors/sharepoint-connector/connector_name/excluded-URLs. The connector also logs the exact cause of exclusion with each URL. This is helpful if you find that the connector is not sending feeds for particular URLs. If some required URLs are also being excluded, you might have to change the inclusion and exclusion patterns in the connector configuration.


Error Messages

This section describes some commonly encountered error messages and their likely solutions.

Search Appliance Unable to Connect to the Connector Manager

If the Apache Tomcat instance where the connector manager is installed is not started or if the location you type in is incorrect or invalid, a message is displayed on the Connector Manager Administration page of the Admin Console saying “The appliance could not connect to the connector manager as specified in the location. Make sure that the URL is correct, or try again later.”

Unable to connect to the connection manager error

HTTP 404 Error When Registering a Connector Manager

When you are registering a new connector manager, you might see the following error message:

The HTTP response failed with the following code: 404. No external connector managers registered.

This means that the CATALINA_HOME environment variable is not set correctly on the Tomcat host. Examine the Tomcat startup script or .bashrc and ensure that CATALINA_HOME points to the correct Tomcat installation.

HTTP 401 Error When Creating a Connector

When you create the connector, you might see the following error:

Cannot connect to the given SharePoint Site URL with the supplied Domain/Username/Password. Reason:(401) Unauthorized

To resolve the error:

  1. Check that the username and password are correct. Configure the crawler access under Crawl and Index > Crawler Access, then perform a manual fetch under Status and Reports > Real-time Diagnostics in the Admin Console to verify connectivity and validate the credentials. If the fetch returns a 401, confirm the username and password again. If it returns an HTTP status of 200, check the connector log as described in the next step.
  2. Check the connector log. If you see the following error, check that the user has contribute access.

    Aug 23, 2011 11:18:56 AM com.google.enterprise.connector.sharepoint.wsclient.WebsWS checkConnectivity
    WARNING: Unable to connect.
    AxisFault
    faultCode: {http://xml.apache.org/axis/}HTTP
    faultSubcode:
    faultString: (401)Unauthorized
    faultActor:
    faultNode:
    faultDetail:
    {}:return code: 401
    401 UNAUTHORIZED
    {http://xml.apache.org/axis/}HttpErrorCode:401

Feed Exception During Traversal

You might see the following error message if you installed a connector manually or you are using a connector manager earlier than version 2.0:

SEVERE: Feed Exception during traversal.
com.google.enterprise.connector.pusher.FeedException: Connection refused: connect

This happens when the connector service is reinstalled to a new location (whether or not it is the same version) but is not reregistered on the Admin Console. The connector service points at localhost by default, rather than pointing to the search appliance. In this situation, the connectors are unable to feed documents to the search appliance.

To fix this issue:

  1. Log in to the Admin Console and navigate to the Connector Managers page.
  2. Click the Edit link for your connector manager.
  3. Click the Save button.

Alternatively, you can manually edit the applicationContext.properties file in the Tomcat/webapps/connector-manager/WEB-INF directory by changing localhost to the IP address of the GSA in the following line:

gsa.feed.host=localhost

If you manually edit the file, you must restart Tomcat after you save your changes.

Error Message When Trying to Add a Connector to an Unavailable Connector Manager

When a connector manager is unavailable, the Admin Console displays a circular red indicator next to the connector manager name. If you try to add a connector to an unavailable connector manager, you see the following error message:

The appliance encountered an error while trying to make the following servlet call: getConnectorList

The connector manager might be unavailable for one of the following reasons:

  • Tomcat is not running on the registered host and port.
  • The connector manager host is unreachable.
  • The Tomcat Remote Address Filter is rejecting access.

Check each condition and correct any problems.

Error When Using a Self-Signed Certificate

The SharePoint connector supports discovering and crawling content protected by a certificate. If you have a self-signed certificate that is not from a trusted authority, you see the following error when you create a connector instance:

sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

To fix this error, import the certificate to the JDK keystore.
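
One way to perform the import is the JDK keytool utility. The following Python sketch only assembles the command line. The helper name and all of the example paths, the alias, and the password are hypothetical placeholders. The -importcert option is available in JDK 6 and later; older JDKs use -import instead.

```python
def keytool_import_command(cert_file, alias, keystore, storepass):
    """Build a keytool command that imports a certificate as trusted."""
    return [
        "keytool", "-importcert", "-trustcacerts",
        "-alias", alias,
        "-file", cert_file,
        "-keystore", keystore,
        "-storepass", storepass,
    ]


# Placeholder values: substitute the certificate file, keystore path,
# and keystore password used by the JDK that runs Tomcat.
command = keytool_import_command(
    "sharepoint.cer", "sharepoint", "/path/to/cacerts", "keystore_password")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on the Tomcat host, or the printed command can be run in a shell.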

Crawl URL does not match “Include URL” patterns or matches “Do Not Include URL” patterns.

You see this message when a user-provided Crawl URL does not match patterns specified under “Include URLs Matching the Following Patterns” or matches patterns specified under “Do Not Include URLs Matching the Following Patterns”. The administrator should provide non-conflicting patterns for Include URLs Matching the Following Patterns and Do Not Include URLs Matching the Following Patterns.
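
The rule behind this message can be sketched as follows: a Crawl URL is accepted only if it matches at least one include pattern and no exclude pattern. The sketch uses simple prefix matching as a stand-in for the search appliance URL pattern syntax, which is richer. The function name and the example URLs are hypothetical.

```python
def crawl_url_accepted(url, include_patterns, exclude_patterns):
    """Accept a URL only if an include pattern matches and no exclude pattern does.

    Prefix matching is an assumption made for illustration only.
    """
    included = any(url.startswith(p) for p in include_patterns)
    excluded = any(url.startswith(p) for p in exclude_patterns)
    return included and not excluded


crawl_url = "http://sp.example.com/sites/hr/"
# Conflicting patterns: the exclude pattern also covers the Crawl URL.
print(crawl_url_accepted(crawl_url,
                         ["http://sp.example.com/"],
                         ["http://sp.example.com/sites/hr/"]))
```
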

Required field not specified.

Fields marked with an asterisk (*) on the Configuring Connector Instances form are required. You must provide appropriate values for these fields.

The Crawl URL must contain a fully qualified domain name. Please check the Crawl URL value.

You must provide the appropriate SharePoint site URL, with a fully qualified domain name, in the SharePoint Site URL field on the Configuring Connector Instances form.
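
You can screen a URL before entering it. The following Python sketch is a heuristic only and is not the connector's own validation: it treats a host name as fully qualified when it has at least two non-empty, dot-separated labels and is not a purely numeric IP address. The function name and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def has_fully_qualified_host(url: str) -> bool:
    """Heuristic check that the URL host looks like a fully qualified domain name."""
    host = urlparse(url).hostname
    if not host:
        return False
    labels = host.rstrip(".").split(".")
    if len(labels) < 2 or not all(labels):
        return False
    # Reject dotted numeric addresses such as 10.0.0.1.
    return not all(label.isdigit() for label in labels)


print(has_fully_qualified_host("http://sharepoint/sites/hr"))
print(has_fully_qualified_host("http://sharepoint.example.com:8080/sites/hr"))
```
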

Cannot connect to the given SharePoint Site URL with the supplied Domain/Username/Password. Please re-enter.

This error means that the connectivity test between the connector and SharePoint failed. One possible cause is that you did not provide appropriate values for all the mandatory fields when you configured the connector instances. Another possible cause is that you did not copy the catalina.jar file to the $CATALINA_HOME/shared/lib/ directory during manual installation of the SharePoint connector. For more details on the problem, look for the call to checkConnectivity() in the Tomcat log file. For example, here is the error you would see if you did not copy the catalina.jar file correctly:

Feb 9, 2009 3:04:24 PM com.google.enterprise.connector.sharepoint.SharepointConnectorType checkConnectivity
WARNING: checkConnectivity():java.lang.NoClassDefFoundError: org/apache/catalina/util/URLEncoder

Note: All other error messages are available in the Tomcat log file.

Crawl Diagnostics Error Message

If there is no robots.txt file or if the robots.txt file is not correctly defined in SharePoint, you see an error message:

Retrying URL: Host unreachable while trying to fetch robots.txt.

To correct the error:

  1. Check whether the robots.txt file exists in the SharePoint root directory.
  2. If there is no robots.txt file there, create one.
  3. Ensure that the robots.txt file is correctly excluded from SharePoint’s managed path.
  4. Ensure that the path to the robots.txt file is defined correctly on the Crawler Access page on the Admin Console.

ProcessNode Error

You might see the following error message on the Crawl Diagnostics page in the Admin Console, where URL is the URL to a graphic file:

ProcessNode: Not match URL patterns, skipping record with URL: URL

Ensure that you have modified the crawl patterns correctly. For information on crawl patterns, see SharePoint Site Alias Mapping.


Logging Feed Record and Metadata Information to a Text File

You can log all URLs and metadata fed to a Google Search Appliance without recording all content. There are two ways to implement this logging technique.


Using the feedLoggingLevel Property

To use the feedLoggingLevel property to log URLs and metadata:

  1. Log on to the Apache Tomcat host with the user account under which Tomcat runs.
  2. Shut down the Tomcat instance that hosts the connector manager.
  3. Navigate to the webapps/connector-manager/WEB-INF/ directory.
  4. Open the applicationContext.properties file in a text editor.
  5. Set the feedLoggingLevel property to the value ALL:

    feedLoggingLevel=ALL

  6. Save the applicationContext.properties file.
  7. Restart Tomcat.

    The logging information is recorded in the $CATALINA_BASE/logs/google-connectors.feed%g.log files, where %g is a generation number used to distinguish among rotated logs.


Using a logging.properties Configuration File

To use a logging.properties configuration file to log URLs and metadata:

  1. Log on to the Apache Tomcat host with the user account under which Tomcat runs.
  2. Shut down the Tomcat instance that hosts the connector manager.
  3. Navigate to the logging.properties file.
    • If you installed the connector using the installer, the file is in the connector_directory/Tomcat/webapps/connector-manager/WEB-INF/classes/ directory.
    • If you installed the connector manually, navigate to the location where you created a logging.properties file. The logging.properties file is probably in the $CATALINA_HOME/webapps/connector-manager/WEB-INF/classes directory. If not, copy the logging.properties file from the $JAVA_HOME/lib/ directory to the $CATALINA_HOME/webapps/connector-manager/WEB-INF/classes directory. You might have to create the /classes directory manually.
  4. Open the logging.properties file in a text editor.
  5. Add the following line to the file:

    com.google.enterprise.connector.pusher.DocPusher.FEED_WRAPPER.FEED.level=FINER

  6. Save the logging.properties file.
  7. Restart Tomcat.

    The logging information is recorded in connector_directory/Tomcat/logs/google-connectors.feed%g.log, where %g is a generation number used to distinguish among rotated logs.


Related Documentation

For more information on the connector manager, see Introducing the Connectors in Connector documentation. For information on developing connectors, see:
