
How to use the headrequestor process

Google Search Appliance (GB-1001, GB-7007, and GB-9009) software version 4.6.2.S.18 and later
Posted May, 2010

This document describes the headrequestor, a process on the search appliance that checks whether a user is authorized to view a secure search result.



Headrequestor Overview

The headrequestor is the process on the search appliance that checks whether or not a user is authorized to view a secure search result. The search appliance will use the headrequestor under any of the following conditions:

  • You configured Forms Authentication (called cookie-based authentication in release 6.2 and later) on the search appliance.
  • You configured the search appliance to crawl content using NTLM or Basic Authentication.
  • You configured the search appliance to use the Authorization SPI.

To understand how the headrequestor works, you need some background on how the search appliance handles a search request:

  1. The user sends a search request to the search appliance.

    The user will request a certain number of results, determined by the value of the num parameter, which defaults to 10. In addition, the user specifies a start parameter, which defaults to 0. If the start parameter is larger than zero, for example 50, the search appliance will need to fetch 50 + num results.
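
    For example, a query for the third page of 10 results might look like the following, where the hostname, collection, and front end names are placeholders. Because start=20, the search appliance must fetch start + num = 30 results internally before it can return results 21 through 30:

    http://appliance-hostname/search?q=sample+query&num=10&start=20&site=default_collection&client=default_frontend&output=xml_no_dtd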

  2. The search appliance gets authentication credentials from the user.

    If some or all of the search results are marked as secure, the search appliance will need to get authentication credentials from the user.

    If the search appliance is configured to serve secure results using Basic Auth or NTLM, it will send a 401 Unauthorized response to the user. The user's web browser will pop up a dialog window asking for the user to enter her username and password. The search appliance will not store the user's password. Each subsequent time the user enters a search request, the username/password will be passed in an encoded format to the search appliance through the Authorization HTTP header. The search appliance can serve secure results using HTTPS on port 443, so the user's password is sent securely to the search appliance.
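
    As a rough illustration (the credentials and hostname are placeholders), a command-line client can exercise the same mechanism by sending a secure query with Basic Auth credentials over HTTPS:

    curl -u username:password "https://appliance-hostname/search?q=sample+query&site=default_collection&client=default_frontend&access=a"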

    If you have configured Forms Authentication (called cookie-based authentication in release 6.2 and later), the user will be redirected to a login form. The exact mechanism for how the user's authentication cookie gets transferred to the search appliance depends on whether you configure Cookie Forwarding, User Impersonation or an External Login Server.

    If the search appliance is configured to use the Authorization SPI, the user can pass her identity using a client certificate or through the Authentication SPI.

  3. The search appliance generates the search results.

    The search appliance generates a list of the most relevant documents for the query. The number of documents is always 1000, unless there aren't 1000 relevant documents in the index, in which case the number of documents is the total number of relevant results.

  4. The search appliance generates the URLs and snippets for the top search results.

    The search appliance will initially generate URLs and snippets for slightly more results than the user requested. For example, if the user requested 10 results, the search appliance will generate the top 15 URLs/snippets. This is done for performance reasons: generating URLs/snippets is expensive, and some documents are likely to be filtered in the next step, so the search appliance makes allowances for filtering to avoid repeating the URL/snippet generation stage.

  5. The search appliance applies filtering to the results.

    By default, the search appliance filters results that have duplicate snippets or duplicate paths. Users can disable this filtering with the filter parameter. In addition, the search appliance will filter any documents with URLs that match patterns in Remove URLs for the frontend specified.
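
    For example, appending filter=0 to a query turns off both duplicate snippet and duplicate path filtering (the hostname, collection, and front end names are placeholders):

    http://appliance-hostname/search?q=sample+query&site=default_collection&client=default_frontend&filter=0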

    Filtering at serving time is more expensive than applying filters at indexing time. For example, documents that contain a robots noindex meta tag are not returned in the search results and therefore do not need to be filtered at serving time.

  6. The search appliance determines whether more results are needed. The method for how this is determined varies by version and is described below.

    For 4.6.2.x and earlier, if the search appliance does not have sufficient results for the user after filtering, it will generate another set of URLs/snippets. The number that it will generate can depend on many factors. In general it is the following:

    ( Num of results still needed + 1 ) * ( 1 / Percent valid docs ) + 7

    For example, if we tried to get 15 documents, and 10 were filtered, we will need 5 more. The percentage of valid documents is 5/15, so the total number of URLs/snippets that will be generated will be ( 5 + 1 ) * ( 15 / 5 ) + 7 = 25.

    For 4.6.4 and later, if the search appliance does not have sufficient results for the user after filtering, it will generate another set of URLs/snippets. The number of URLs in the set will be 30% more than needed plus 1.

    For example, if we tried to get 15 documents, and 9 were filtered, we will need 6 more. The total number of URLs/snippets that will be generated will be ( 6 * 1.3 ) + 1 = 8.8, rounded up to 9.

  7. Check that the user is authorized to view each document in the results.

    If the user is running a secure search, the search appliance will check that the user is authorized to view each document in the results.

    The search appliance marks documents in the index that are secure. Any document that is crawled with Basic Auth or NTLM is marked as secure: the crawler on the search appliance requests Basic Auth or NTLM URLs without sending its credentials and, if it gets a 401 response, sends appropriate credentials by matching the URL against patterns in Crawler Access. The URL is only marked as secure if it gave a 401 response. Any URL that matches a Forms Authentication pattern (called cookie-based authentication in release 6.2 and later) is also marked as secure.

    If the search appliance is configured to use the Authorization SPI, you must also configure the search appliance to mark documents as secure using Crawler Access or Forms Authentication (called cookie-based authentication in release 6.2 and later) patterns. If you are using the Authorization SPI, the search appliance can authenticate users with client certificates or the Authentication SPI, eliminating the need to get credentials with a Basic Auth login dialog or a Forms Auth login form.

    The search appliance will send every secure URL that it has obtained from the above steps to the headrequestor process in a single batch.

    If a URL is protected by NTLM or Basic Auth, the search appliance sends a HEAD request to the web server with the user's Basic Auth or NTLM credentials. A typical Basic Auth HEAD request looks like this:

    HEAD /path/to/file.html HTTP/1.0
    Host: hostname
    Connection: Keep-Alive
    User-Agent: gsa-crawler
    Authorization: Basic base64-encoded-credentials

    A HEAD request using NTLM requires a challenge and response, so it involves two HTTP requests and responses. Here is an example of the HTTP headers for each of the three stages of an NTLM request: the initial request, the challenge from the server, and the response from the client.

    HEAD /test1/ HTTP/1.0
    Connection: Keep-Alive
    Host: ntlmserver:8888
    Authorization: NTLM TlRMTVNTUAABAAAAA7IAAAYABgAlAAAABQAFACAAAABURVNUMVpFQUxPVA==
    
    HTTP/1.1 401 Access Denied
    Server: Microsoft-IIS/5.0
    Date: Fri, 11 Oct 2002 17:07:43 GMT
    WWW-Authenticate: NTLM TlRMTVNTUAACAAAAAAAAADAAAAABggAAg4oSng5+tKUAAAAAAAAAAAAAAAAwAAAA
    Connection: keep-alive
    Content-Length: 3245
    Content-Type: text/html
    
    HEAD /test1/ HTTP/1.0
    Connection: Keep-Alive
    Host: ntlmserver:8888
    Authorization: NTLM TlRMTVNTUAADAAAAGAAYAHIAAAAYABgAigAAAAwADABAAAAAHAAcAEwAAAAKAAoAaAAAAAAAAA
    CiAAAAAYIAAFoARQBBAEwATwBUAGkAaQBzAC0AZQBuAHQAZQByAHAAcgBpAHMAZQBUAEUAUwBUADEACEvmEYgvvUlIkhJC+
    fXM59kBexzXKC382THVxiD3mOKu64xGDo7/EKFCgB3Drs5b

    The Google Search Appliance uses HTTP/1.0 only, so your web servers must support HTTP/1.0 keep-alive.

    If your web server advertises that it supports both Basic Auth and NTLM in its WWW-Authenticate headers, then the search appliance will use Basic. An example of these headers is below:

    WWW-Authenticate: Basic Authentication
    WWW-Authenticate: NTLM
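
    You can check which schemes a web server advertises by sending it a request without credentials and examining the headers of the 401 response, for example with curl (the hostname and path are placeholders):

    curl -I http://webserver-hostname/path/to/file.html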

    If a URL is protected by Forms Authentication (called cookie-based authentication in release 6.2 and later), the search appliance sends a GET request to the web server with the user's cookie. The GET request includes a Range header which, if supported by the web server, means that at most one byte of content is returned in the body of the response. A typical GET request looks like this:

    GET /path/to/file.html HTTP/1.0
    Cookie: SMSESSION=cookie-value
    Range: bytes=0-0
    Host: hostname
    Connection: Keep-Alive
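
    If you want to test this request by hand, a rough equivalent with curl would be the following, where the cookie value, hostname, and path are placeholders:

    curl --http1.0 -D - -o /dev/null -H "Cookie: SMSESSION=cookie-value" -H "Range: bytes=0-0" http://hostname/path/to/file.html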

    If the search appliance is configured to use the Authorization SPI, the headrequestor will use the authz checker process to send a SAML request to the Access Connector URL that is configured in the Admin Console. If the SAML response is indeterminate -- i.e. neither Permit nor Deny -- then the search appliance will also try sending a HEAD or GET request from the headrequestor process, if it has Basic Auth, NTLM or Forms Authentication (called cookie-based authentication in release 6.2 and later) credentials.

    If the search appliance doesn't get sufficient authorized URLs back from the batch sent to the headrequestor it will rerun step #6 above to generate a new batch of URLs to send to the headrequestor.


HEAD and GET Requests Sent by the Headrequestor

The host load settings are used to determine how many simultaneous authorization requests to send to each web server. The default host load setting is 4, meaning that the search appliance, by default, will not send more than 4 concurrent requests to each web server from the headrequestor. The headrequestor doesn't support host load exceptions on a per-host basis.

The user is authorized to see that URL in the results if the web server returns a 200, 204 or 206 HTTP status code.

Whether the head requestor follows 301 and 302 redirects depends on the authentication method the search appliance is using.

  • The search appliance follows 301 and 302 redirects under Basic Authentication and NTLM.
  • The search appliance does not follow 301 and 302 redirects under single sign-on and assumes that users do not have access to the content.

The headrequestor always sends an HTTP/1.0 keep-alive header so that it can follow redirects without opening a new TCP connection. If there is no redirect, the search appliance closes the TCP connection once it has received the number of bytes specified by the Content-Length header in the HTTP response. If the web server doesn't send a Content-Length header, the web server itself closes the connection.

If the headrequestor gets a 2XX response from the target of the redirect then it will assume the user is authorized to view that URL.

A 401, 403, or any other response to the headrequestor will cause the URL not to be displayed in that user's search results. Note that an initial 401 response is expected when using NTLM because the search appliance needs to receive a challenge from the web server.

By default, the request from the search appliance's headrequestor will time out after 2.5 seconds. You can configure the request timeout in the Admin Console. The request timeout period includes the DNS lookup, if needed, as well as the web server response time.
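
A quick way to check whether a web server answers a single authorization request within the timeout is to time a Basic Auth HEAD request from a machine close to the search appliance, for example (the credentials, hostname, and path are placeholders):

time curl -s -o /dev/null -I -u username:password http://webserver-hostname/path/to/file.html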

When a head request times out, the search appliance tries to terminate the network connection normally by sending a FIN packet. Most web servers will not close the connection on their end, but will continue to respond after the search appliance has sent a FIN. The headrequestor ignores these responses because the TCP connection on the search appliance is closed. The search appliance will send a RST packet to the web server when the web server tries to respond to the timed-out request. In these cases, the TCP connection on the web server will be closed when the web server completes its response.

If you are running a script on your web server that doesn't exit, the web server will not close the connection after sending the response. In this case, the search appliance will send a FIN after the request timeout period. The web server will try to respond to this packet and the search appliance will then send a RST. In these cases, the TCP connection on the web server will be closed after the request timeout, which defaults to 2.5 seconds.

If a request from the headrequestor times out, the search appliance can retry up to two times before it stops trying to check authorization for that URL.

The default batch timeout is 5 seconds. If the search appliance doesn't have sufficient results, it will send another batch to the headrequestor. The headrequestor will not return until all URLs in the batch have been tested. The search appliance will return results after 30 seconds, even if the headrequestor is still running. The batch timeout is configurable in the Admin Console, with a maximum permitted value of 25 seconds.

Here is an example to show how long it will take for a user to see a response to a secure query. Let's assume a query has 100 results and we want to display the first 10 on the first results page. Let's also assume that five of the first 10 are secure. The search appliance will try to send simultaneous headrequests for about 10 secure results so that it doesn't need to send a second batch of requests if the user is not authorized to view any of the first five results. The headrequestor receives the entire batch of 10 URLs, but it will send no more than 4 requests concurrently to the web server.

The search appliance will send a second batch of headrequests if it doesn't obtain five secure results that the user is authorized to view in the first batch of headrequests. Each batch can take up to 5 seconds. The search appliance continues to send batches of headrequests until it gets 10 good results or until 30 seconds have passed.

By default, the search appliance caches the results of the headrequestor for one hour. It caches up to 10,000 entries. The least recently used entries are purged to make room for new ones. You can flush the cache or set the timeout for cache entries in the Admin Console in 4.x versions (go to Admin Console > Serving > Authorization for 4.x or Admin Console > Serving > Access Control for 4.6.x). Appliances running version 3.4.14 and earlier do not have this capability in the Admin Console; for these versions, the only way to flush the cache is to reboot the search appliance.

The headrequestor can be configured to add a host to a list of unreachable hosts if requests to that host get 100 request timeouts within 120 seconds. You can specify how long a host will remain on the unreachable list. The headrequestor will not send requests for any user to an unreachable host. This can be used to protect hosts that can be overwhelmed by the headrequestor.


Using Headrequestor Deny Rules

This section applies only to Google Search Appliance software version 6.4 and later.

The Head Requestor Deny Rules page on the Admin Console enables you to identify URLs where content servers deny users access with codes other than HTTP code 401 and define for the search appliance the access-denied responses to expect from the content server. For example, a content server might send an HTTP code 200 instead of a 401 code, or the access-denied response might be in the body of content returned by the content server.

If the content server uses the standard HTTP 401 code, you do not need to configure head requestor deny rules.

If the content server response matches any of the rules configured for its URL, the response is considered an access-denied response. Keep in mind that content matching is literal, so, for example, if the server returns its access-denied message in a different language, the configured content must match that language.

Before you set any Head Requestor Deny Rules, check the status codes returned by content servers in your installation when a user is denied access to a page. If a content server does not return the HTTP 401 status code, determine the URL or URL pattern for the content server.

To set a head requestor deny rule:

  1. Log in to the Admin Console.

  2. Click Serving > Head Requestor Deny Rules.

  3. In the URL Pattern field, type the URL or URL pattern identifying a content server that issues a code other than HTTP 401 when it denies access to a user.

  4. Click Create New Deny Rule.

  5. Select the type of request the search appliance sends to the content server. If you are using content as a deny rule, you must select one of the GET request options.

    • A HEAD request retrieves document headers only and does not detect deny codes in the document content.

    • A GET request of a particular length. Select this type of request when the access-denied content or code is known to be within the number of bytes you enter in the field.

    • A GET request that retrieves the entire response.

  6. Type the Status Codes that the search appliance should interpret as denying access.

  7. Click Add Status Code to add additional expected status codes.

  8. To add a Header Name and Header Value the search appliance should interpret as denying access, click Add Header, then type in the name and value.

  9. To add content that the search appliance should interpret as denying access, click Add Content, then type the expected content returned by the web server to indicate that access was denied, for example "Not Authorized" or "Access Denied" (an example of such a response appears after these steps). The content is case-sensitive and doesn't have to be the entire error page or HTML source; only the error message text is necessary.

  10. Click Save.
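
For example, a content server that denies access with a normal HTTP 200 status and an error message in the page body might return a response like the following. A HEAD rule cannot detect this, but a GET rule with "Access Denied" configured as deny content would match it:

HTTP/1.0 200 OK
Content-Type: text/html

<html><body><h1>Access Denied</h1></body></html>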

Troubleshooting the Headrequestor

If secure documents do not appear in the search results, it may be because the headrequestor is not getting an authorized response. Here are some ways to verify what responses the headrequestor is getting.

  1. Sniff the network packets using tcpdump, Ethereal or equivalent. You can run this from the web server or from a network device that has access to packets sent by the search appliance. You can pause the crawl to ensure that requests sent by the crawler are not mixed in with headrequestor requests.
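
    For example, the following tcpdump command, run on the web server, captures only traffic to and from the search appliance (the interface name and hostname are placeholders):

    tcpdump -i eth0 port 80 and host search-appliance-hostname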

  2. If your web server records the response time for each request, you can look in your web server access logs.

    If your web server uses the W3C log format, you should include the "time-taken" field as part of your web server log format. Details for this log format are available at: http://www.w3.org/TR/WD-logfile.html
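
    For example, a W3C extended log format Fields directive that records the response time for each request might look like this (the set of fields shown is illustrative):

    #Fields: date time c-ip cs-method cs-uri-stem sc-status time-taken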

  3. You can mimic the head requests yourself and send them to the web server. This can be helpful to determine whether or not a web server is responding slowly. Below is an example of a shell command that mimics the headrequestor. You can time how long it takes to execute this command:
    nc hostname 80 <<EOF
    GET /path/to/file.html HTTP/1.0
    Cookie: SMSESSION=cookie-value
    Range: bytes=0-0
    Host: hostname
    Connection: Keep-Alive
    
    EOF
  4. Use a transparent proxy that shows timestamps of request and response.

Determining if the Headrequestor is Receiving Timeouts from the Web Server

Here is one method to determine whether the headrequestor's requests to the web server are timing out. The information below is designed for search administrators who are familiar with network troubleshooting tools such as tcpdump.

You can run these commands from a Unix/Linux system or from Windows with Cygwin installed. You may have to modify the commands slightly due to slight differences between various operating systems.

First, generate the authentication credentials. For example, you can use the following command to generate a Basic Authorization HTTP header:

$ echo -n "username:password" | uuencode -m foo
begin-base64 640 foo
dXNlcm5hbWU6cGFzc3dvcmQ=
====
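
If uuencode is not available on your system, the base64 command (where available) produces the same encoded value:

$ echo -n "username:password" | base64
dXNlcm5hbWU6cGFzc3dvcmQ=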

The search appliance caches results of requests for up to one hour. Therefore, you should select a username that has not made any queries to the search appliance for the past hour.

Next run a series of search queries against the search appliance using these credentials. The later queries may include results that have already been cached by the head requestor. In our tests, we have found that there is not a significant number of cache hits if you do a series of 100 queries on a corpus of 80,000 documents.

Run the search queries against an appliance that isn't answering any queries from other clients and which has a paused crawl. Be sure to run this script from a client that has a fast network connection to the search appliance.

Create a file, named qterms, containing one URL-escaped query term per line, then run the following script:

# Replace appliance-hostname with the hostname of your search appliance, and
# the Authorization value with the base64-encoded credentials generated above.
# Note: the parsing below relies on a time command (such as /usr/bin/time)
# whose output includes the word "elapsed"; with other time commands you may
# need to adjust the grep and sed commands.
for q in `cat qterms`
do
   echo -n "$q "
   time wget -o /dev/null -q --header="Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=" -O /dev/null \
   "http://appliance-hostname/search?q=$q&site=my_collection&output=xml_no_dtd&client=my_collection&num=10&access=a&filter=0" 2>&1 \
   | grep elapsed | sed -e 's/.* \(.*\)elapsed.*/\1/g'
done

The output shows the query term and how long it took to return. It will look something like this:
pager 0:01.58
%22cash+cow%22 0:01.65
swot+analysis 0:01.60
computer 0:01.60
phones 0:00.10
paradigm+shift 0:01.63

Note that the behavior of the time command is often system dependent so you may need to alter the substitutions necessary to correctly display the elapsed time.

While this script is running, you can sniff the connection between the search appliance and the web server to see the requests. Assuming you only have one web server, you can run the following command:

tcpdump -i eth0 -w /tmp/dump.out port 80 and host web-server-hostname

The above command will generate a lot of data. You can usually terminate it after a few seconds, once it has captured sufficient packets to analyze. The following script reads the capture file and summarizes each connection:

#!/bin/sh
# Summarize each TCP connection seen in the tcpdump capture: the client port,
# the number of packets, the times of the first and last packets relative to
# the start of the capture, and the total duration of the connection.

tcpdumpfile=/tmp/dump.out
echo "Port  Packets  First packet  Last packet  Total time"
first_packet_time=""
i=0
# Find the client-side ports by looking at the destination field of packets
# sent back by the web server (lines whose destination is not the http port).
for port in `tcpdump -r $tcpdumpfile | cut -d ">" -f 2 | grep -v http | cut -b 15-19 | sort | uniq`
do
   packets=`tcpdump -r $tcpdumpfile | grep -c "$port"`
   # Timestamps of the first and last packets for this connection.
   first=`tcpdump -r $tcpdumpfile | grep "$port" | cut -d " " -f 1 | head -1`
   last=`tcpdump -r $tcpdumpfile | grep "$port" | cut -d " " -f 1 | tail -1`
   first_time=`echo $first | cut -d "." -f 1`
   first_secs=`date -d $first_time +%s`
   first_frac=`echo $first | cut -d "." -f 2`
   last_time=`echo $last | cut -d "." -f 1`
   last_secs=`date -d $last_time +%s`
   last_frac=`echo $last | cut -d "." -f 2`
   # Remember the very first packet time so later times can be shown relative
   # to the start of the capture.
   if [ -z "$first_packet_time" ]; then
       first_packet_time=$first_secs.$first_frac
   fi
   first_display=`echo $first_secs.$first_frac - $first_packet_time | bc`
   last_display=`echo $last_secs.$last_frac - $first_packet_time | bc`
   total=`echo $last_secs.$last_frac - $first_secs.$first_frac | bc`
   printf "%-7d %5d   %11.2f  %11.2f %11.2f\n" $port $packets $first_display $last_display $total
   i=`expr $i + 1`
done
echo -e "\nTotal number of connections: $i"

Note that your version of tcpdump may give slightly different output, which would require you to modify the separators used in the cut command.

The output shows a single record for each head request: the client port of the connection, the number of packets in the connection, the number of seconds after the start of the capture at which the first and last packets of the connection were seen, and the total time between the first and last packets in the connection.

If you see just one packet in a request then the TCP handshake has failed. If you see approximately 7 packets then it is likely that the search appliance sent the request but the web server didn't respond.

Here is some example output for a single search query against a slow web server. We are using the default host load of 4 with a request timeout of 5 seconds.

Port  Packets  First packet  Last packet  Total time
54083      11          0.00        11.78       11.78
54084      11          0.00         7.66        7.66
54085      11          0.00        11.60       11.60
54086      11          0.00        11.12       11.12
54089       8          5.01        10.02        5.01
54090      13          5.01        14.68        9.67
54091      12          5.01        14.77        9.76
...
54452       7        125.40       130.41        5.01
54456       5        130.33       130.55        0.22
54457       5        130.33       130.55        0.22
54458       5        130.39       130.55        0.16
54459       5        130.41       130.55        0.14

Total number of connections: 120

The above analysis does not give any information on the number of cache hits, the number of requests that were not needed for displaying results, or the queueing due to host load limitations.

If you are getting lots of unexplained timeouts, you should check that there are no network errors, such as excessive collisions, that could indicate a possible speed/duplex mismatch.


Known Issues and Feature Requests

  • #205765 - The headrequestor continues to send requests to the web server after the search appliance has returned a search results page.
  • #205528 - Due to the keep-alive, the headrequestor does not close the TCP connection immediately unless the web server sends a content-length header in its response.
  • #205036 - Headrequestor doesn't stop sending requests when it already has enough results for the search results page, because it needs to complete the full batch.
  • #181680 - You cannot specify whether to send a GET or HEAD for the headrequestor requests.
  • #199534 - Need to display the headrequestor logs in the Admin Console so that customers can debug headrequestor problems without tcpdump.
  • #203614 - Headrequestor doesn't follow redirects that don't include a hostname in the Location header.
  • #203149 - The authz checker process cannot send more than one request concurrently.
  • #205987 - Don't timeout head requests since the web servers generally ignore this and serve up a response.