Administering Crawl

Constructing URL Patterns

A URL pattern is a set of ordered characters to which the Google Search Appliance matches actual URLs that the crawler discovers. You can specify URL patterns for which your index should include matching URLs and URL patterns for which your index should exclude matching URLs. This document explains how to construct a URL pattern.


Introduction


A URL pattern is a set of ordered characters that is modeled after an actual URL. The URL pattern is used to match one or more specific URLs. An exception pattern starts with a hyphen (-).

URL patterns specified in the Start and Block URLs page control the URLs that the search appliance includes in the index. To configure the crawl, use the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console to enter URLs and URL patterns in the following boxes:

  • Start URLs
  • Follow Patterns
  • Do Not Follow Patterns

The search appliance starts crawling from the URLs listed in the Start URLs text box. Each URL that the search appliance encounters is compared with URL patterns listed in the Follow Patterns and Do Not Follow Patterns text boxes.

A URL is included in the index when all of the following are true:

  • The URL is reachable through the URLs specified in the Start URLs field.
  • The URL matches at least one pattern in the Follow Patterns field.
  • The URL does not match an exception pattern in the Follow Patterns field.
  • The URL meets one of the following criteria:
    • The URL does not match a pattern in the Do Not Follow Patterns field.
    • The URL matches an exception in the Do Not Follow Patterns field.
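The inclusion conditions above can be summarized as a small boolean function. This is only an illustrative sketch in Python, not the search appliance's implementation: the function name and its five parameters are hypothetical, and each parameter is assumed to hold the answer to one of the checks listed above. Exclusion through robots.txt or robots meta tags is not modeled.

```python
def is_indexed(reachable, follow_match, follow_exception,
               do_not_follow_match, do_not_follow_exception):
    """Return True when a URL satisfies every inclusion condition.

    All parameters are booleans (hypothetical names):
      reachable               - URL is reachable from the Start URLs
      follow_match            - URL matches a pattern in Follow Patterns
      follow_exception        - URL matches an exception in Follow Patterns
      do_not_follow_match     - URL matches a pattern in Do Not Follow Patterns
      do_not_follow_exception - URL matches an exception in Do Not Follow Patterns
    """
    # The last condition is satisfied in either of two ways.
    passes_do_not_follow = (not do_not_follow_match) or do_not_follow_exception
    return (reachable
            and follow_match
            and not follow_exception
            and passes_do_not_follow)
```

For instance, a URL that matches a Do Not Follow pattern is still included when it also matches an exception in that box, provided the other conditions hold.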

Alternatively, URLs can be excluded from an index through the use of a robots.txt file or robots meta tags.

For complete information about the Start and Block URLs page, in the Admin Console, click Admin Console Help > Content Sources > Web Crawl > Start and Block URLs.


Rules for Valid URL Patterns


When specifying the URLs that should or should not be crawled on your site or when building URL-based collections, your URL patterns must conform to the valid pattern types in the following list. Each pattern type is followed by examples and an explanation of what each example matches.

Any substring of a URL that includes the host/path separating slash

  • http://www.google.com/
    Any page on www.google.com using the HTTP protocol.
  • www.google.com/
    Any page on www.google.com using any supported protocol.
  • google.com/
    Any page in the google.com domain.

Any suffix of a string. You specify the suffix with the $ at the end of the string.

  • home.html$
    All pages ending with home.html.
  • .pdf$
    All pages with the extension .pdf.

Any prefix of a string. You specify the prefix with the ^ at the beginning of the string. A prefix can be used in combination with the suffix for exact string matches. For example, ^candy cane$ matches the exact string “candy cane”.

  • ^http://
    Any page using the HTTP protocol.
  • ^https://
    Any page using the HTTPS protocol.
  • ^http://www.google.com/page.html$
    Only the specified page.

An arbitrary substring of a URL. These patterns are specified using the prefix “contains:”.

  • contains:coffee
    Any URL that contains “coffee”.
  • contains:beans.com
    Any URL that contains “beans.com”, such as http://blog.beans.com/ or http://www.page.com/?goto=beans.com/images

Exceptions denoted by a - (minus) sign

  • candy.com/
    -www.candy.com/
    Means that “www.chocolate.candy.com” is a match, but “www.candy.com” is not a match.

Regular expressions from the GNU Regular Expression library. In the search appliance, regular expressions:

  1. Are case sensitive unless you specify regexpIgnoreCase:
  2. Must use two escape characters (a double backslash “\\”) when reserved characters are added to the regular expression.

regexp: and regexpCase: are equivalent.

  • regexp:-sid=[0-9A-Z]+/
  • regexp:http://www\\.example\\.google\\.com/.*/images/
  • regexpCase:http://www\\.example\\.google\\.com/.*/images/
  • regexpIgnoreCase:http://www\\.Example\\.Google\\.com/.*/IMAGES/

Reference: the GNU Regular Expression library.

Comments

  • #this is a comment
    Empty lines and comments starting with # are permissible. These comments are removed from the URL pattern and ignored.


Comments in URL Patterns


A line that starts with a # (pound) character is treated as a comment, as shown in the following example.

#This is a comment.

Case Sensitivity


URL patterns are case sensitive. The following table uses www.example.com/ to illustrate an example that does not match the URL pattern, and another example that does match the pattern.

URL Pattern

www.example.com/mypage

Non-matching URL

http://www.example.com/MYPAGE.html

Matching URL

http://www.example.com/mypage.html

The Google Search Appliance treats URLs as case-sensitive, because URLs that differ only by case can legitimately be different pages. The hostname part of the URL, however, is case-insensitive. To capture URLs with variable case, use a regular expression. For more information about regular expressions, see Google Regular Expressions.
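The distinction between a case-insensitive hostname and a case-sensitive path can be illustrated with a short Python sketch. It assumes, purely for illustration, that a simple pattern behaves like a substring test against the URL after the hostname has been lowercased. The function names are hypothetical and this is not the appliance's actual matching code.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host_case(url):
    """Lowercase only the network location; path and query keep their case."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))


def matches_substring(pattern, url):
    """Assumed approximation: case-sensitive substring test on the normalized URL."""
    return pattern in normalize_host_case(url)
```

Under these assumptions, the pattern www.example.com/mypage matches http://WWW.Example.com/mypage.html but not http://www.example.com/MYPAGE.html.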


Simple URL Patterns


The following notation is used throughout this document:

  • Brackets < > denote variable strings in the expression format.
  • The slash (/) at the end of the site name is required.

Format

<site>/

Example

www.example.com/

Matching domains

To match URLs from all sites in the same domain, specify the domain name. The following example matches all sites in the domain example.com.

Format

<domain>/

Example

example.com/

Matching URLs

www.example.com
support.example.com
sales.example.com

Matching directories

To describe URLs that are in a specific directory or in one of its sub-directories, specify the directory and any sub-directory in the pattern.

The following example matches all URLs in the products directory and all sub-directories under products on the site sales.example.com.

Format

<site>/<directory>/

Example

sales.example.com/products/

Matching URLs

sales.example.com/products/about.html
http://www.sales.example.com/products/cost/priceList.html

The following example matches all URLs in the products directory and all sub-directories under products on all sites in the example.com domain.

Format

<domain>/<directory>/

Example

example.com/products/

Matching URLs

accounting.example.com/products/prices.htm
example.com/products/expensive.htm

The following example matches all URLs in an images directory or sub-directory, on any site.

If one of the pages on a site links to another external site or domain, this example would also match the /images/ directories of those external sites.

Format

/<directory>/

Example

/images/

Matching URLs

www.example1423.com/images/myVacation/
www.EXAMPLE.com/images/tomato.jpg
sales.example.com/images/

Matching files

To match a specific file, specify its name in the pattern and add the dollar ($) character to the end of the pattern. Each of the following examples matches only URLs that end with the specified file name.

Format

<site>/<directory>/<file>$

Example

www.example.com/products/foo.html$

 

Format

<domain>/<directory>/<file>$

Example

example.com/products/foo.html$

 

Format

/<directory>/<file>$

Example

/products/foo.html$

 

Format

/<file>$

Example

/mypage.html$

Without the dollar ($) character at the end of the pattern, the URL pattern may match more than one page.

Format

/<directory>/<file>

Example

/products/mypage.html

Matching URLs

/products/mypage.html
/products/mypage.htmlx

Matching protocols

To match URLs that are accessible by a specific protocol, specify the protocol in the pattern. The following example matches HTTPS URLs that contain the products directory.

Format

<protocol>://<site>/<path>/

Example

https://www.example.com/products/

Matching ports

To match URLs that are accessible by means of a specific port, specify the port number in the pattern. If you don’t specify a port, the search appliance matches any URLs with the site regardless of the port.

  • These examples match www.example.com/foo on any port: www.example.com:*/foo or www.example.com/foo
  • This example matches host www.example.com on port 8888: www.example.com:8888/

If you explicitly include a port number, the pattern matches only URLs that explicitly include the port number, even if you use the default port. For example, a URL pattern that includes www.example.com:80/products/ does not match www.example.com/products/.

Using the prefix option

To match the beginning of a URL, add the caret (^) character to the start of the pattern. Avoid using a caret followed by only a protocol, because such a pattern could match most of the Internet.

Format

^<protocol>://<site>/<directory>/

Example

^http://www.example.com/products/

 

Format

^<protocol>://<site>/

Example

^http://www.example.com/

 

Format

^<protocol>

Example

^https

 

Format

^<protocol>://<partial_site>

Example

^http://www.example

Matching URLs

http://www.example.com/
http://www.example.de/
http://www.example.co.jp/

Using the suffix option

To match the end of a URL, add the dollar ($) character to the end of the pattern.

The following example matches http://www.example.com/mypage.jhtml, but not http://www.example.com/mypage.jhtml;jsessionid=HDUENB2947WSSJ23.

Format

<protocol>://<site>/<directory>/<file>$

Example

http://www.example.com/mypage.jhtml$

 

Format

<site>/<directory>/<file>$

Example

www.example.com/products/mypage.html$

 

Format

<domain>/<directory>/<file>$

Example

example.com/products/mypage.html$

 

Format

/<directory>/<file>$

Example

/products/mypage.html$

The following example matches mypage.htm, but does not match mypage.html.

Format

<file>$

Example

mypage.htm$

The following form is useful for matching all files of a certain type, such as .html, .doc, .ppt, or .gif. The example matches all URLs that end with .doc.

Format

<partial_file_name>$

Example

.doc$

Matching specific URLs

To exactly match a single URL, use both caret (^) and dollar ($). The following example matches only the URL: http://www.example.com/mypage.jhtml

Format

^<exact url>$

Example

^http://www.example.com/mypage.jhtml$

Matching specified strings

To match URLs with a specified string, use the contains: prefix. The following example matches any URL containing the string “product”.

Format

contains:<string>

Example

contains:product

Matching URLs

http://www.example.com/products/mypage.html
https://sales.example.com/production_details/inventory.xls
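The prefix (^), suffix ($), exact-match, and contains: forms described in this section can be approximated with ordinary string operations. The following Python sketch assumes that an unanchored pattern behaves like a plain substring test. It ignores the appliance's special handling of hosts, domains, and ports, and the function name simple_match is hypothetical.

```python
def simple_match(pattern, url):
    """Approximate matching for patterns that do not use a regexp prefix."""
    if pattern.startswith("contains:"):
        return pattern[len("contains:"):] in url

    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")

    # Strip the anchor characters to get the literal text to compare.
    body = pattern
    if anchored_start:
        body = body[1:]
    if anchored_end:
        body = body[:-1]

    if anchored_start and anchored_end:
        return url == body            # exact match
    if anchored_start:
        return url.startswith(body)   # prefix match
    if anchored_end:
        return url.endswith(body)     # suffix match
    return body in url                # assumed: plain substring match
```

With this sketch, mypage.htm$ matches a URL that ends in mypage.htm but not one that ends in mypage.html, and ^http://www.example matches http://www.example.de/.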


SMB URL Patterns


In GSA release 7.4, on-board file system crawling (File System Gateway) was deprecated. For more information, see Deprecation Notices.

To match SMB (Server Message Block) URLs, the pattern must have a fully-qualified domain name and begin with the smb: protocol. SMB URLs refer to objects that are available on SMB-based file systems, including files, directories, shares, and hosts. SMB URLs use only forward slashes. Some environments, such as Microsoft Windows, use backslashes (“\”) to separate file path components. However, for these URL patterns, you must use forward slashes. SMB paths to folders must end with a trailing forward slash (“/”).

The following example shows the correct structure of an SMB URL.

Format

smb://<fully-qualified-domain-name>/<share>/<directory>/<file>

Example

smb://fileserver.domain/myshare/mydir/mydoc.txt

The following SMB URL patterns are not supported:

  • Top-level SMB URLs, such as the following: smb://
  • URLs that omit the fully-qualified domain name, such as the following: smb://myshare/mydir/
  • URLs with workgroup identifiers in place of hostnames, such as the following: smb://workgroupID/myshare/
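The structural rules above lend themselves to a quick pre-check before a pattern is entered. The following Python sketch is a heuristic only: the function name is hypothetical, and treating "the host contains a dot" as a sign of a fully-qualified domain name is an assumption. A workgroup identifier cannot be reliably told apart from a hostname by syntax alone.

```python
from urllib.parse import urlsplit


def smb_pattern_problems(url):
    """Return a list of likely problems with an SMB URL (empty if none found)."""
    problems = []
    if "\\" in url:
        problems.append("uses backslashes; SMB URL patterns need forward slashes")

    parts = urlsplit(url)
    if parts.scheme != "smb":
        problems.append("does not begin with smb:")

    host = parts.hostname or ""
    if not host:
        problems.append("no host; top-level SMB URLs are not supported")
    elif "." not in host:
        # Assumed heuristic: a fully-qualified domain name contains a dot.
        problems.append("host does not look fully qualified")
    return problems
```

For the documented example smb://fileserver.domain/myshare/mydir/mydoc.txt the sketch reports no problems, while smb:// and smb://myshare/mydir/ are both flagged.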


Exception Patterns


The exception patterns below cannot be used with any version of the Google Connector for Microsoft SharePoint.

To specify exception patterns, prefix the expression with a hyphen (-). The following example includes sites in the example.com domain, but excludes secret.example.com.

Format

-<expression>

Example

example.com/
-secret.example.com/

The following example excludes any URL that contains content_type=calendar.

Example

-contains:content_type=calendar

You can override the exception interpretation of the hyphen (-) character by preceding the hyphen (-) with a plus (+).

Example

+-products.xls$

Matching URLs

http://www.example.com/products/new-products.xls
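The way a list of patterns with exceptions is read can be sketched in Python as follows. This is an assumption-laden illustration of a single Follow Patterns box, not the appliance's parser: the helper names are hypothetical, and the rough matcher handles only contains:, a trailing $, and plain substrings.

```python
def parse_patterns(text):
    """Split pattern text into (includes, exceptions), skipping comments and blank lines."""
    includes, exceptions = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line.startswith("+-"):
            includes.append(line[1:])     # "+" overrides the exception meaning of "-"
        elif line.startswith("-"):
            exceptions.append(line[1:])   # exception pattern
        else:
            includes.append(line)
    return includes, exceptions


def _hit(pattern, url):
    """Rough matcher (assumption): contains:, trailing $, or plain substring."""
    if pattern.startswith("contains:"):
        return pattern[len("contains:"):] in url
    if pattern.endswith("$"):
        return url.endswith(pattern[:-1])
    return pattern in url


def list_matches(text, url):
    """True if the URL matches some pattern and no exception pattern."""
    includes, exceptions = parse_patterns(text)
    return (any(_hit(p, url) for p in includes)
            and not any(_hit(p, url) for p in exceptions))
```

Here, +-products.xls$ is parsed as the ordinary suffix pattern -products.xls$, so it matches http://www.example.com/products/new-products.xls rather than acting as an exception.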


Google Regular Expressions


A Google regular expression describes a complex set of URLs. For more information on GNU regular expressions, search Google for “gnu regular expression tutorial” (http://www.google.com/search?hl=en&q=gnu+regular+expression+tutorial). Google regular expressions are similar to GNU regular expressions, with the following differences:

  • A case insensitive expression starts with the following prefix: regexpIgnoreCase:
  • A case sensitive expression does not require a prefix, but the regexpCase: and regexp: prefixes can be used to specify case sensitivity.
  • Special characters are escaped with a double backslash (\\).

A metacharacter is a special character or combination of special characters that is used in a regular expression to match a specific portion of a pattern. Metacharacters are not interpreted as literals in regular expressions. The following list describes available metacharacters and metacharacter combinations:

  • The . character matches any character.
  • The .* character combination matches any number of characters.
  • The ^ character specifies the start of a string.
  • The $ character specifies the end of a string.
  • The [0-9a-zA-Z]+ character combination matches a sequence of alphanumeric characters.
  • To match any of the following characters literally, precede the character with the double backslash (\\) escape sequence: ^.[$()|*+?{\

The following example matches any URL that references an images directory on www.example.com using the HTTP protocol.

Example

regexp:http://www\\.example\\.com.*/images/

Matching URLs

http://www.example.com/images/logo.gif
http://www.example.com/products/images/widget.jpg

The following example matches any URL in which the server name starts with auth and the URL contains .com.

Example

regexpCase:http://auth.*\\.com/

Matching URLs

http://auth.www.example.com/mypage.html
http://auth.sales.example.com/about/corporate.htm

This example does not match http://AUTH.engineering.example.com/mypage.html because the expression is case sensitive.

The following pattern matches JHTML pages from the site www.example.com. These pages have the jsessionid, type=content, and id parameters.

Example

regexp:^http://www\\.example\\.com/page\\.jhtml;jsessionid=[0-9a-zA-Z]+&type=content&id=[0-9a-zA-Z]+$

Matching URLs

http://www.example.com/page.jhtml;jsessionid=A93KF8M18M5XP&type=content&id=gpw9483
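To experiment with patterns like the ones above outside the appliance, the double-backslash convention and the three prefixes can be translated into Python's re module. This is a rough sketch under stated assumptions: Python's regular expression syntax is not identical to the GNU library's, unanchored searching is assumed, and compile_gsa_regexp is a hypothetical helper name.

```python
import re

# Prefixes described in this section, mapped to Python regex flags.
PREFIX_FLAGS = {
    "regexpIgnoreCase:": re.IGNORECASE,
    "regexpCase:": 0,
    "regexp:": 0,
}


def compile_gsa_regexp(pattern):
    """Compile a prefixed pattern, turning each double backslash into a single one."""
    for prefix, flags in PREFIX_FLAGS.items():
        if pattern.startswith(prefix):
            body = pattern[len(prefix):]
            return re.compile(body.replace("\\\\", "\\"), flags)
    raise ValueError("pattern has no regexp:, regexpCase:, or regexpIgnoreCase: prefix")


# Raw strings keep the double backslashes exactly as they are written in the pattern.
images = compile_gsa_regexp(r"regexp:http://www\\.example\\.com.*/images/")
auth = compile_gsa_regexp(r"regexpCase:http://auth.*\\.com/")
```

In this sketch, images.search("http://www.example.com/products/images/widget.jpg") finds a match, and auth.search("http://AUTH.engineering.example.com/mypage.html") returns None because the expression is case sensitive.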

Do not begin or end a URL pattern with period+asterisk (.*) if you are using the regexp: prefix, as this pattern is ineffective and may cause performance problems.

Invalid regular expression patterns entered on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console can cause search appliance crawling to fail.

For proxy servers, regular expressions are also case sensitive, but must use a single escape character (backslash “\”) when reserved characters are added to the regular expression.

Using Backreferences with Do Not Follow Patterns

A backreference stores the part of a URL that is matched by a part of a regular expression grouped within parentheses, so that the same text can be required again later in the pattern. The search appliance supports using backreferences with Do Not Follow Patterns. The Content Sources > Web Crawl > Start and Block URLs page in the Admin Console includes default backreferences in the Do Not Follow Patterns. The search appliance does not support using backreferences with Follow Patterns.

The following examples illustrate backreferences and are similar to the default backreferences on the Content Sources > Web Crawl > Start and Block URLs page. These backreferences prevent the search appliance from crawling repetitive URLs.

Example

regexp:example\\.com/.*/([^/]*)/\\1/\\1/

Matching URL

http://example.com/corp/corp/corp/...

 

Example

regexp:example\\.com/.*/([^/]*)/([^/]*)/\\1/\\2/

Matching URL

http://example.com/corp/hr/corp/hr/...

 

Example

regexp:example\\.com/.*&([^&]*)&\\1&\\1

Matching URL

http://example.com/corp?hr=1&hr=1&hr=1...
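The effect of these backreference patterns can be tried out with Python's re module, which also supports \1 and \2. The sketch below rewrites the three example patterns with single backslashes, as Python expects. The function name and test URLs are illustrative assumptions, and Python's regex engine may differ in detail from the one in the search appliance.

```python
import re

# The three example patterns, written with single backslashes for Python.
REPETITION_PATTERNS = [
    r"example\.com/.*/([^/]*)/\1/\1/",           # one path segment repeated three times
    r"example\.com/.*/([^/]*)/([^/]*)/\1/\2/",   # a pair of segments repeated
    r"example\.com/.*&([^&]*)&\1&\1",            # one query parameter repeated three times
]


def is_repetitive(url):
    """True if any of the backreference patterns is found in the URL."""
    return any(re.search(p, url) for p in REPETITION_PATTERNS)
```

A URL such as http://example.com/a/corp/corp/corp/index.html is flagged as repetitive, while http://example.com/corp/hr/index.html is not.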


Controlling the Depth of a Crawl with URL Patterns


Google recommends crawling to the maximum depth, allowing the Google algorithm to present the user with the best search results. You can use URL patterns to control how many levels of subdirectories are included in the index.

For example, the following URL patterns cause the search appliance to crawl only the top three directory levels of the site www.mysite.com, that is, pages at the root of the site and pages in subdirectories up to two levels deep:

regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$
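Because each additional level only adds one more /[^/]* segment, patterns of this shape can be generated rather than typed by hand. The following Python sketch shows one way to do that. The function name depth_patterns and its parameters are hypothetical, and only dots in the host name are escaped.

```python
def depth_patterns(host, levels):
    """Build one regexp: pattern per directory level, from 1 up to `levels`."""
    # Escape each dot with a double backslash, as the pattern syntax requires.
    escaped_host = host.replace(".", "\\\\.")
    patterns = []
    for depth in range(1, levels + 1):
        segments = "/".join(["[^/]*"] * depth)
        patterns.append("regexp:" + escaped_host + "/" + segments + "$")
    return patterns


for line in depth_patterns("www.mysite.com", 3):
    print(line)
```

Called with www.mysite.com and 3, the sketch prints the same three patterns listed above.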
