Block URLs with robots.txt

Create a robots.txt file

To create a robots.txt file, you need access to the root of your domain. If you're unsure how to access the root, contact your web hosting service provider. If you can't access the root of your domain, use an alternative blocking method instead, such as password-protecting the files on your server or inserting meta tags into your HTML.

You can create a new robots.txt file, or edit an existing one, using the robots.txt Tester tool, which lets you test your changes as you adjust the file.

Learn robots.txt syntax

The simplest robots.txt file uses two keywords, User-agent and Disallow. User-agents are search engine robots (or web crawler software); most user-agents are listed in the Web Robots Database. Disallow is a command that tells the user-agent not to access a particular URL. Conversely, to give Google access to a particular URL that sits in a child directory of a disallowed parent directory, you can use a third keyword, Allow.

Google uses several user-agents, such as Googlebot for Google Search and Googlebot-Image for Google Image Search. Most Google user-agents follow the rules you set up for Googlebot, but you can override this default and write rules that apply only to specific Google user-agents.

The syntax for using the keywords is as follows:

User-agent: [the name of the robot the following rule applies to]

Disallow: [the URL path you want to block]

Allow: [the URL path of a subdirectory, within a blocked parent directory, that you want to unblock]

A User-agent line and the rules beneath it are together considered a single entry in the file, and the Disallow and Allow rules apply only to the user-agent(s) specified above them. You can include as many entries as you want, and a single entry can contain multiple Disallow lines and apply to multiple user-agents. You can make an entry apply to all web crawlers by listing an asterisk (*) as the user-agent, as in the example below:

User-agent: *
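
If you want a quick, local sanity check of an entry like this (in addition to the robots.txt Tester tool), Python's standard-library urllib.robotparser can evaluate simple prefix-based Allow and Disallow rules; note that it does not implement Google's wildcard extensions. The sketch below is only illustrative and uses a placeholder directory name and a placeholder example.com URL:

from urllib.robotparser import RobotFileParser

# A hypothetical entry: one User-agent line followed by its rules.
rules = """\
User-agent: *
Allow: /sample-directory/public/
Disallow: /sample-directory/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The Disallow rule blocks the directory...
print(rp.can_fetch("Googlebot", "http://www.example.com/sample-directory/page.html"))         # False
# ...while the Allow rule, listed first so this simple parser checks it before
# the Disallow rule, keeps the child directory reachable.
print(rp.can_fetch("Googlebot", "http://www.example.com/sample-directory/public/page.html"))  # True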

URL blocking commands to use in your robots.txt file

Each example below describes what to block, followed by a sample rule:

The entire site with a forward slash (/):

Disallow: /

A directory and its contents by following the directory name with a forward slash:

Disallow: /sample-directory/

A webpage by listing the page after the slash:

Disallow: /private_file.html

A specific image from Google Images:

User-agent: Googlebot-Image

Disallow: /images/dogs.jpg

All images on your site from Google Images:

User-agent: Googlebot-Image

Disallow: /

Files of a specific file type (for example, .gif):

User-agent: Googlebot

Disallow: /*.gif$

Pages on your site while still showing AdSense ads on those pages, by disallowing all web crawlers other than Mediapartners-Google. This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site:

User-agent: *

Disallow: /

User-agent: Mediapartners-Google

Allow: /
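
As a rough illustration of how per-agent entries behave, the rule set above can be run through Python's urllib.robotparser. This is only a sketch with a placeholder URL; Google's own crawlers apply their own matching logic:

from urllib.robotparser import RobotFileParser

# The example above: block every crawler except Mediapartners-Google.
rules = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

url = "http://www.example.com/some-page.html"     # placeholder URL
print(rp.can_fetch("Googlebot", url))              # False: hidden from search crawlers
print(rp.can_fetch("Mediapartners-Google", url))   # True: the ads crawler can still fetch it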

Note that directives are case-sensitive. For instance, Disallow: /file.asp would block http://www.example.com/file.asp, but would allow http://www.example.com/File.asp. Googlebot also ignores whitespace (in particular, empty lines) and unknown directives in the robots.txt file.

Pattern-matching rules to streamline your robots.txt code

Each example below describes a pattern-matching rule, followed by a sample:

To match any sequence of characters, use an asterisk (*). For instance, the sample code blocks access to all subdirectories that begin with the word "private":

User-agent: Googlebot

Disallow: /private*/

To block access to all URLs that include a question mark (?), combine the asterisk with a question mark. For example, the sample code blocks URLs that begin with your domain name, followed by any string, followed by a question mark, followed by any string:

User-agent: Googlebot

Disallow: /*?

To block URLs that end with a specific string, use $ at the end of the pattern. For instance, the sample code blocks any URLs that end with .xls:

User-agent: Googlebot

Disallow: /*.xls$

To block patterns in combination with the Allow and Disallow directives, see the sample below. In this example, a ? indicates a session ID. URLs that contain these IDs should typically be blocked to prevent web crawlers from crawling duplicate pages. Meanwhile, if some URLs that end with ? are versions of the page that you do want included, you can combine Allow and Disallow directives as follows:

  1. The Allow: /*?$ directive allows any URL that ends in a ? (more specifically, it allows a URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
  2. The Disallow: /*? directive blocks any URL that includes a ? (more specifically, it blocks a URL that begins with your domain name, followed by a string, followed by a question mark, followed by a string).

User-agent: *

Allow: /*?$

Disallow: /*?
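
Python's standard-library robots.txt parser does not understand these wildcard extensions, so the sketch below translates a pattern into a regular expression by hand to show how the matching plays out. The helper functions, and the longest-match precedence with Allow winning ties, are a simplified reading of Google's documented behavior, not an official implementation:

import re

def pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the
    # pattern to the end of the URL path. Matching always starts at the
    # beginning of the path (re.match below).
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules, path):
    # Simplified precedence: the longest matching pattern wins, and Allow
    # wins a tie. A path that matches no rule is allowed.
    best = None
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            if (best is None or len(pattern) > len(best[1])
                    or (len(pattern) == len(best[1]) and directive == "Allow")):
                best = (directive, pattern)
    return best is None or best[0] == "Allow"

# The session-ID example above.
rules = [("Allow", "/*?$"), ("Disallow", "/*?")]
print(is_allowed(rules, "/page?"))               # True: ends with ?, so the Allow rule wins
print(is_allowed(rules, "/page?sessionid=123"))  # False: ? followed by more characters
print(is_allowed(rules, "/page"))                # True: no rule matches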

Save your robots.txt file

You must apply the following saving conventions so that Googlebot and other web crawlers can find and identify your robots.txt file:

  • You must save your robots.txt code as a text file,
  • You must place the file in the highest-level directory of your site (or the root of your domain), and
  • The robots.txt file must be named robots.txt.

For example, a robots.txt file saved at the root of example.com, at the URL http://www.example.com/robots.txt, can be discovered by web crawlers, but a robots.txt file at http://www.example.com/not_root/robots.txt cannot be found by any web crawler.
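
The root location is also where tools that read robots.txt go looking for it. As a small sketch using Python's urllib.robotparser, with example.com standing in for your own domain:

from urllib.robotparser import RobotFileParser

# Crawlers look for the file at the root of the host, e.g.
# http://www.example.com/robots.txt (example.com is a placeholder).
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()   # fetch and parse the live file

# Ask whether a given crawler may fetch a given URL under those rules.
print(rp.can_fetch("Googlebot", "http://www.example.com/sample-directory/page.html"))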
