A robots.txt file restricts access to your site by search engine robots that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages. (All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. However, a robots.txt is not enforceable, and some spammers and other troublemakers may ignore it. For this reason, we recommend password protecting confidential information.)
To see which URLs Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools.
You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file (not even an empty one).
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
In order to use a robots.txt file, you'll need to have access to the root of your domain (if you're not sure, check with your web hoster). If you don't have access to the root of a domain, you can restrict access using the robots meta tag.
Create a robots.txt file
The simplest robots.txt file uses two rules:
- User-agent: the robot the following rule applies to
- Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
Each entry in the robots.txt file is separate and does not build upon previous entries. For example:
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
In this example only the URLs matching /folder2/ would be disallowed for Googlebot.
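To check this behavior yourself, you can feed the same rules to a parser. The following is a minimal sketch that uses Python's standard urllib.robotparser module, which is not Googlebot's own parser and may differ from it in other cases. The bot name OtherBot and the page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot has its own entry, so only /folder2/ is blocked for it.
print(parser.can_fetch("Googlebot", "http://www.example.com/folder1/page.html"))  # True
print(parser.can_fetch("Googlebot", "http://www.example.com/folder2/page.html"))  # False

# A bot without its own entry (hypothetical "OtherBot") falls back to the * entry.
print(parser.can_fetch("OtherBot", "http://www.example.com/folder1/page.html"))   # False
print(parser.can_fetch("OtherBot", "http://www.example.com/folder2/page.html"))   # True
```

The Googlebot entry replaces the * entry for Googlebot rather than adding to it, which is why /folder1/ stays crawlable for Googlebot here.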
User-agents and bots
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
User-agent: *
Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots like Googlebot-Mobile and Googlebot-Image follow rules you set up for Googlebot, but you can set up specific rules for these specific bots as well.
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
- To block the entire site, use a forward slash:
Disallow: /
- To block a directory and everything in it, follow the directory name with a forward slash, for example:
Disallow: /junk-directory/
- To block a page, list the page, for example:
Disallow: /private_file.html
- To remove a specific image from Google Images, add the following:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
- To remove all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
- To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
- To prevent pages on your site from being crawled, while still displaying AdSense ads on those pages, disallow all bots other than Mediapartners-Google. This keeps the pages from appearing in search results, but allows the Mediapartners-Google robot to analyze the pages to determine the ads to show. The Mediapartners-Google robot doesn't share pages with the other Google user-agents. For example:
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
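The AdSense entry above can be sanity-checked before you publish it. This is a rough sketch using Python's standard urllib.robotparser module as a stand-in parser; it is not Google's implementation, and the page URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ADSENSE_ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
"""

adsense_parser = RobotFileParser()
adsense_parser.parse(ADSENSE_ROBOTS_TXT.splitlines())

page = "http://www.example.com/articles/page.html"  # placeholder URL

# The AdSense crawler has its own entry that allows everything.
print(adsense_parser.can_fetch("Mediapartners-Google", page))  # True

# Every other bot, including Googlebot, falls under the * entry and is blocked.
print(adsense_parser.can_fetch("Googlebot", page))  # False
```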
Note that the URL paths in directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp. Googlebot will ignore white-space (in particular, empty lines) and unknown directives in the robots.txt file.
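The case-sensitivity of paths is easy to confirm with a quick check. The sketch below again relies on Python's standard urllib.robotparser module as an approximation of how a crawler compares paths; it is offered as an illustration, not as Googlebot's actual code.

```python
from urllib.robotparser import RobotFileParser

case_parser = RobotFileParser()
case_parser.parse([
    "User-agent: *",
    "Disallow: /junk_file.asp",
])

# The path is compared character for character, so only the exact case is blocked.
print(case_parser.can_fetch("Googlebot", "http://www.example.com/junk_file.asp"))  # False
print(case_parser.can_fetch("Googlebot", "http://www.example.com/Junk_file.asp"))  # True
```

If your server treats both spellings as the same page, you would need a Disallow line for each variant you want blocked.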
Googlebot supports submission of Sitemap files through the robots.txt file.
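A Sitemap is declared with a Sitemap: line that gives the full URL of the Sitemap file. As a small sketch, the snippet below parses such a line with Python's standard urllib.robotparser module (site_maps() requires Python 3.8 or later). The Sitemap URL and the /private/ path are placeholders, not values from this article.

```python
from urllib.robotparser import RobotFileParser

SITEMAP_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(SITEMAP_ROBOTS_TXT.splitlines())

# site_maps() returns the listed Sitemap URLs, or None if there are none.
print(sitemap_parser.site_maps())  # ['http://www.example.com/sitemap.xml']
```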
Googlebot (but not all search engines) respects some pattern matching.
- To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
- To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):
User-agent: Googlebot
Disallow: /*?
- To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$
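One way to reason about the three patterns above is to translate each into a regular expression. The following is a minimal sketch under that assumption; the helper names pattern_to_regex and matches are invented for this example, the sample paths are placeholders, and Googlebot's real matcher may differ in edge cases.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A trailing '$' pins the match to the end of the URL path.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the path (including any query string) matches the pattern."""
    # re.match anchors at the start, so an unanchored pattern acts as a prefix.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private*/", "/private-files/report.html"))  # True
print(matches("/private*/", "/public/report.html"))         # False
print(matches("/*?", "/page.html?id=1"))                    # True
print(matches("/*?", "/page.html"))                         # False
print(matches("/*.xls$", "/reports/q1.xls"))                # True
print(matches("/*.xls$", "/reports/q1.xlsx"))               # False
```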
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
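A URL that ends in a ? matches both directives, so a crawler needs a way to pick one. This article does not spell out that rule; the sketch below assumes the longest matching pattern wins and that Allow wins a tie. In this particular file, honoring the first matching line would give the same result. The helper names and sample paths are invented for illustration.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Match a robots.txt pattern ('*' wildcard, optional trailing '$') against a path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(body + (r"\Z" if anchored else ""), path) is not None


def is_allowed(rules, path: str) -> bool:
    """Decide whether a path may be crawled.

    rules is a list of (directive, pattern) pairs, where directive is
    "allow" or "disallow". Assumption: the longest matching pattern wins,
    and "allow" wins when two matching patterns have the same length.
    """
    best_key, best_directive = None, "allow"  # no matching rule means allowed
    for directive, pattern in rules:
        if matches(pattern, path):
            key = (len(pattern), directive == "allow")
            if best_key is None or key > best_key:
                best_key, best_directive = key, directive
    return best_directive == "allow"


RULES = [("allow", "/*?$"), ("disallow", "/*?")]

print(is_allowed(RULES, "/page.html?"))               # True: ends in ?, Allow is longer
print(is_allowed(RULES, "/page.html?sessionid=123"))  # False: only Disallow matches
print(is_allowed(RULES, "/page.html"))                # True: no rule matches
```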
Save your robots.txt file by downloading it or by copying the contents into a text file and saving it as robots.txt. Save the file to the highest-level directory of your site: it must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, because bots check for this file only in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.
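To see why a subdirectory location is never consulted, consider how a crawler might derive the robots.txt address from any page URL. The short sketch below keeps only the scheme and host and discards the rest of the path. The function name robots_txt_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Drop the original path, query, and fragment; robots.txt lives at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/mysite/page.html"))
# http://www.example.com/robots.txt
```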
Test a robots.txt file
The Test robots.txt tool will show you if your robots.txt file is accidentally blocking Googlebot from a file or directory on your site, or if it's permitting Googlebot to crawl files that should not appear on the web. When you enter the text of a proposed robots.txt file, the tool reads it in the same way Googlebot does, and lists the effects of the file and any problems found.
Test a site's robots.txt file:
- On the Webmaster Tools Home page, click the site you want.
- Under Health, click Blocked URLs.
- If it's not already selected, click the Test robots.txt tab.
- Copy the content of your robots.txt file, and paste it into the first box.
- In the URLs box, list the site to test against.
- In the User-agents list, select the user-agents you want.
Any changes you make in this tool will not be saved. To save any changes, you'll need to copy the contents and paste them into your robots.txt file.
This tool provides results only for Google user-agents (such as Googlebot). Other bots may not interpret the robots.txt file in the same way. For instance, Googlebot supports an extended definition of the standard robots.txt protocol. It understands Allow: directives, as well as some pattern matching. So while the tool shows lines that include these extensions as understood, remember that this applies only to Googlebot and not necessarily to other bots that may crawl your site.