Block URLs with robots.txt

Create a robots.txt file

Getting started

A robots.txt file consists of one or more rules. Each rule blocks (or allows) access for a given crawler to a specified file path on that website.

Here is a simple robots.txt file with two rules, explained below:

# Rule 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Rule 2
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml

Explanation:

  1. The user agent named "Googlebot" should not crawl the folder http://example.com/nogooglebot/ or any of its subdirectories.
  2. All other user agents can access the entire site. (This rule could have been omitted and the result would be the same; by default, user agents are allowed to crawl the entire site.)
  3. The site's Sitemap file is located at http://www.example.com/sitemap.xml.

We will provide a more detailed example later.
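
If you want to sanity-check how a parser reads the file above, here is a minimal sketch using Python's built-in urllib.robotparser module. It is not Google's crawler, but for a simple file like this it agrees with the behavior described above:

import urllib.robotparser

content = """\
# Rule 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Rule 2
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(content.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/nogooglebot/page.html"))    # False: blocked by Rule 1
print(rp.can_fetch("Googlebot", "http://example.com/anything-else.html"))       # True: not covered by Rule 1
print(rp.can_fetch("SomeOtherBot", "http://example.com/nogooglebot/page.html")) # True: Rule 2 allows everything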

Basic robots.txt guidelines

Here are some basic guidelines for robots.txt files. We recommend that you read the full syntax of robots.txt files because the robots.txt syntax has some subtle behavior that you should understand.

Format and location

You can use almost any text editor to create a robots.txt file. The text editor should be able to create standard ASCII or UTF-8 text files; don't use a word processor, because word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which can cause problems for crawlers.
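
If you prefer to create the file from a script rather than a text editor, a minimal Python sketch that writes the example rules from this page as a plain UTF-8 file (no BOM, no curly quotes) looks like this:

from pathlib import Path

rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /
"""

# write_text with encoding="utf-8" produces a plain UTF-8 file with no BOM.
Path("robots.txt").write_text(rules, encoding="utf-8")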

Use the robots.txt Tester tool to write or edit robots.txt files for your site. This tool enables you to test the syntax and behavior against your site.

Format and location rules:

  • The file must be named robots.txt
  • Your site can have only one robots.txt file.
  • The robots.txt file must be located at the root of the website host that it applies to. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at http://example.com/pages/robots.txt); see the sketch after this list. If you're unsure about how to access your website root, or need permission to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.
  • A robots.txt file can apply to subdomains (for example, http://website.example.com/robots.txt) or to non-standard ports (for example, http://example.com:8181/robots.txt).
  • Comments are any lines that start with a # character; crawlers ignore them.
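
As an illustration of the location rule, this small Python sketch (standard library only) maps any page URL to the robots.txt URL that would govern it:

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # Keep only the scheme and host (including any port); drop the path and query.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/pages/page.html"))  # http://www.example.com/robots.txt
print(robots_txt_url("http://example.com:8181/shop/item?id=1"))  # http://example.com:8181/robots.txt
print(robots_txt_url("http://website.example.com/folder/page"))  # http://website.example.com/robots.txt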

Syntax

  • robots.txt must be an ASCII or UTF-8 text file. No other characters are permitted.
  • A robots.txt file consists of one or more rules.
  • Each rule consists of multiple directives (instructions), one directive per line.
  • A rule gives the following information:
    • Who the rule applies to (the user agent)
    • Which directories or files that agent can access, and/or
    • Which directories or files that agent cannot access.
  • Rules are processed from top to bottom, and a user agent can match only one rule set, which is the first, most-specific rule that matches a given user agent.
  • The default assumption is that a user agent can crawl any page or directory not blocked by a Disallow: rule.
  • Rules are case-sensitive. For instance, Disallow: /file.asp applies to http://www.example.com/file.asp, but not http://www.example.com/FILE.asp.
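
To make the precedence and case-sensitivity rules concrete, here is a simplified Python sketch of the path-matching idea: case-sensitive prefix matching in which the longest matching path wins, with Allow winning a tie. It ignores wildcards and is only an illustration (the /archive/ paths are made up), not Google's actual matcher:

def is_allowed(url_path, rules):
    # rules is a list of (directive, path) pairs, e.g. ("Disallow", "/file.asp").
    # Case-sensitive prefix matching; the longest matching path wins,
    # and Allow wins a tie. No match at all means the URL is allowed.
    best_len, allowed = -1, True
    for directive, path in rules:
        if path and url_path.startswith(path):
            is_allow = directive.lower() == "allow"
            if len(path) > best_len or (len(path) == best_len and is_allow):
                best_len, allowed = len(path), is_allow
    return allowed

print(is_allowed("/file.asp", [("Disallow", "/file.asp")]))  # False: blocked
print(is_allowed("/FILE.asp", [("Disallow", "/file.asp")]))  # True: paths are case-sensitive

rules = [("Disallow", "/archive/"), ("Allow", "/archive/public/")]
print(is_allowed("/archive/public/page.html", rules))        # True: the longer Allow path wins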

The following directives are used in robots.txt files:

  • User-agent: [Required, one or more per rule] The name of a search engine robot (web crawler software) that the rule applies to. This is the first line for any rule. Most user-agent names are listed in the Web Robots Database or in the Google list of user agents. Using an asterisk (*), as in Example 3 below, matches all crawlers except the various AdsBot crawlers, which must be named explicitly. (See the list of Google crawler names.) Examples:
    # Example 1: Block only Googlebot
    User-agent: Googlebot
    Disallow: /
    
    # Example 2: Block Googlebot and AdsBot
    User-agent: Googlebot
    User-agent: AdsBot-Google
    Disallow: /
     
    # Example 3: Block all but AdsBot crawlers
    User-agent: * 
    Disallow: /
  • Disallow: [At least one or more Disallow or Allow entries per rule] A directory or page, relative to the root domain, that should not be crawled by the user agent. If a page, it should be the full page name as shown in the browser; if a directory, it should end in a / mark.  Supports the * wildcard for a path prefix, suffix, or entire string.
  • Allow: [At least one or more Disallow or Allow entries per rule] A directory or page, relative to the root domain, that should be crawled by the user agent just mentioned. This is used to override Disallow to allow crawling of a subdirectory or page in a disallowed directory. If a page, it should be the full page name as shown in the browser; if a directory, it should end in a / mark. Supports the * wildcard for a path prefix, suffix, or entire string.
  • Sitemap: [Optional, zero or more per file] The location of a sitemap for this website. Must be a fully-qualified URL; Google doesn't assume or check http/https/www/non-www alternates. Sitemaps are a good way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. Learn more about sitemaps. Example:
    Sitemap: https://example.com/sitemap.xml
    Sitemap: http://www.example.com/sitemap.xml

Unknown keywords are ignored.
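
As a quick demonstration, the sketch below feeds a small file to Python's urllib.robotparser (Python 3.8+ for site_maps()). The Made-up-directive line is an intentionally invalid keyword; the parser skips it, reads both Sitemap lines, and still applies the Disallow rule:

import urllib.robotparser

content = """\
User-agent: *
Disallow: /nogooglebot/
Made-up-directive: 42
Sitemap: https://example.com/sitemap.xml
Sitemap: http://www.example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(content.splitlines())

print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'http://www.example.com/sitemap.xml']
print(rp.can_fetch("Googlebot", "http://www.example.com/nogooglebot/"))  # False: the unknown line changed nothing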

Another example file

A robots.txt file consists of one or more blocks of rules, each beginning with a User-agent line that specifies the target of the rules. Here is a file with two rules; inline comments explain each rule:

# Block googlebot from example.com/directory1/... and example.com/directory2/...
# but allow access to directory2/subdirectory1/...
# All other directories on the site are allowed by default.
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/

# Block the entire site from anothercrawler.
User-agent: anothercrawler
Disallow: /
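
If you want to check this file programmatically, one option is the third-party protego package (pip install protego), which implements the longest-match, Allow-overrides-Disallow behavior described in this article. (Python's standard urllib.robotparser uses first-match-wins, so it would report the subdirectory as blocked.) A sketch:

from protego import Protego  # third-party: pip install protego

content = """\
User-agent: googlebot
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/subdirectory1/

User-agent: anothercrawler
Disallow: /
"""

rp = Protego.parse(content)
print(rp.can_fetch("http://example.com/directory1/page.html", "googlebot"))                # False
print(rp.can_fetch("http://example.com/directory2/subdirectory1/page.html", "googlebot"))  # True: Allow overrides
print(rp.can_fetch("http://example.com/directory3/page.html", "googlebot"))                # True: not listed
print(rp.can_fetch("http://example.com/anything.html", "anothercrawler"))                  # False: whole site blocked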

Full robots.txt syntax

You can find the full robots.txt syntax here. Please read the full documentation, as the robots.txt syntax has a few tricky parts that are important to learn.

Useful robots.txt rules

Here are some common useful robots.txt rules:

Disallow crawling of the entire website. Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled. Note: this does not match the various AdsBot crawlers, which must be named explicitly.

User-agent: *
Disallow: /

Disallow crawling of a directory and its contents by following the directory name with a forward slash. Remember that you shouldn't use robots.txt to block access to private content: use proper authentication instead. URLs disallowed by the robots.txt file might still be indexed without being crawled, and the robots.txt file can be viewed by anyone, potentially disclosing the location of your private content.

User-agent: *
Disallow: /calendar/
Disallow: /junk/

Allow access to a single crawler:

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

Allow access to all but a single crawler:

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /

Disallow crawling of a single webpage by listing the page after the slash:

User-agent: *
Disallow: /private_file.html

Block a specific image from Google Images:

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

Block all images on your site from Google Images:

User-agent: Googlebot-Image
Disallow: /

Disallow crawling of files of a specific file type (for example, .gif):

User-agent: Googlebot
Disallow: /*.gif$

Disallow crawling of the entire site, but show AdSense ads on those pages, by disallowing all web crawlers other than Mediapartners-Google. This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site:

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /

To match URLs that end with a specific string, use $. For instance, this sample blocks any URLs that end with .xls:

User-agent: Googlebot
Disallow: /*.xls$
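
One way to picture how the * and $ wildcards in the last two samples behave is to translate a robots.txt path pattern into a regular expression. This is a rough sketch for illustration, not Google's matcher (for example, it ignores the rule-precedence logic):

import re

def robots_pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the
    # pattern to the end of the URL path. Everything else is literal.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))

gif_rule = robots_pattern_to_regex("/*.gif$")
print(bool(gif_rule.match("/images/photo.gif")))         # True: the rule applies
print(bool(gif_rule.match("/images/photo.gif?size=l")))  # False: URL does not end in .gif

xls_rule = robots_pattern_to_regex("/*.xls$")
print(bool(xls_rule.match("/reports/q1.xls")))   # True
print(bool(xls_rule.match("/reports/q1.xlsx")))  # False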