11/25/10
Original Poster
JohnMu

Comprehensive documentation of the robots.txt & robots meta tags

Hi everyone!

We recently released comprehensive documentation of the robots.txt crawling directives and the robots meta tag indexing and serving directives. You can find it at http://code.google.com/web/controlcrawlindex/

It covers everything regarding how Google treats these directives, including tidbits like how the various HTTP response codes are handled (did you know that serving a 503 HTTP result code for the robots.txt file will block crawling of your site completely?).
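The status-code behavior mentioned above can be sketched roughly as follows. This is a hypothetical helper, not Google's actual logic; the function name and return labels are made up for illustration:

```python
def robots_fetch_policy(status_code):
    """Rough sketch of how the documentation describes crawlers reacting
    to the HTTP status returned when fetching /robots.txt.
    Illustrative only -- not Google's actual implementation."""
    if 200 <= status_code < 300:
        return "parse rules"       # success: use the file's directives
    if 300 <= status_code < 400:
        return "follow redirect"   # redirects are generally followed
    if 400 <= status_code < 500:
        return "allow all"         # 4xx: treated as if no robots.txt exists
    if 500 <= status_code < 600:
        return "block all"         # 5xx (e.g. 503): crawling is blocked entirely
    return "undefined"

print(robots_fetch_policy(503))  # block all
```

Note the asymmetry: a missing robots.txt (404) opens the whole site to crawling, while a temporarily failing one (503) closes it completely.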

Feel free to post here should you have any questions!

Cheers
John
All Replies (11)
Google user
11/25/10
Nice - comprehensive and informative ... and all in one place :D


SteveJohnston
11/25/10
In your first Tweet today, you mentioned "disallow: /unknowns". I can't find such a reference in the documents you linked to.
jrsanfeliu
11/25/10
Thanks @JohnMu for the documentation, it's very helpful!
chrisevans77
11/25/10
Thanks John.
 
In the Robots.txt Specifications, you give an example of a valid robots.txt URL of http://www.müller.eu/robots.txt .. which redirects to http://www.xn--mller-kva.eu/robots.txt.
 
You then say that it is valid for both www.müller.eu and www.xn--mller-kva.eu.
 
Does this mean that a domain which is redirected to another inherits the robots.txt of the target domain?
chrisevans77
11/25/10
Also, this bit confuses me. Can you clarify?
 
"Handling of robots.txt redirects to disallowed URLs is undefined and discouraged. Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is undefined and discouraged."
 
You say "The path value must start with "/" to designate the root. If a path without a beginning slash is found, it may be assumed to be there". You then give an example below and say that fish/ is equivalent to /fish/. However the robots.txt testing tool in Webmaster tools doesn't assume a / at the start of a directive. Is this an issue with the testing tool? Does the testing tool use identical code to the real Googlebot or is it a simulator?
 
Finally, you give an example of the URL http://example.com/page.htm and say that Allow: /page and Disallow: /*.htm would result in an undefined outcome. The testing tool says that this example would be disallowed which follows the rule about "the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule". Does this mean that the testing tool is not consistent with the real Googlebot?
 
Thanks. Overall the documentation is very thorough and I have learned a lot.
11/25/10
Original Poster
JohnMu
Hi chrisevans77,

IDN host names like www.müller.eu are equivalent to the punycode transcription www.xn--mller-kva.eu -- so while in some cases the browser may show one version or the other (I believe the logic for that is to prevent phishing with look-alike characters), they are actually the same. Any robots.txt file there will similarly be applied to both versions (though technically it's all the same host, so it's not really two versions).
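The equivalence can be checked with Python's built-in "idna" codec, which converts an internationalized host name to its ASCII (punycode) form label by label. A quick illustration, not part of the original thread:

```python
# The IDN spelling and the punycode spelling name the same host.
idn_host = "www.müller.eu"
ascii_host = idn_host.encode("idna").decode("ascii")
print(ascii_host)  # www.xn--mller-kva.eu
```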

"Handling of robots.txt redirects to disallowed URLs is undefined and discouraged."

Assume that http://example.com/robots.txt redirects to http://johnmu.com/example/robots.txt. Additionally, http://johnmu.com/robots.txt includes a disallow for /example/. Without the disallow, we would follow that redirect and use the robots.txt file at http://johnmu.com/example/robots.txt; with the disallow, it's an undefined situation -- can a crawler access that URL or not? Therefore, we suggest not relying on a robots.txt file that redirects to a disallowed URL.

"Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is undefined and discouraged."

In many cases, we can follow logical redirects when we crawl. For instance, if we see a meta-refresh-type redirect, we'll generally treat that as a normal redirect. However, given the nature of a robots.txt file and the room for interpretation of logical redirects, we would not recommend relying on all crawlers following such a redirect, so it's generally better to either use a 301 server-side redirect, or even better, to serve the robots.txt file directly.

The Webmaster Tools robots.txt testing tool is a bit stricter with regard to "fish/" vs "/fish/", I suppose. I'll double-check to see if there are other differences.
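The leading-slash leniency described in the spec could be implemented as simply as this (a hypothetical sketch; the function name is made up):

```python
def normalize_rule_path(path):
    """Sketch of the leniency the spec describes: a rule path missing
    its leading '/' may be assumed to start at the root.
    Illustrative only -- stricter tools may reject such paths instead."""
    return path if path.startswith("/") else "/" + path

print(normalize_rule_path("fish/"))  # /fish/
```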

As the outcome of "/*.htm" vs "/page" is undefined, it's possible that some tools will go one way, while others go the other way. By explicitly calling it undefined, we want to make sure that webmasters will not rely on either outcome to be generally valid across all crawlers and tools. In that sense, it would probably make sense to change the response in the testing tool as well - good catch.
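For readers curious why the outcome diverges: a naive longest-match implementation (a sketch with made-up helper names, not Googlebot's code) compares raw rule lengths, and by that measure "/*.htm" (6 characters) beats "/page" (5 characters) even though the wildcard makes it match less specifically -- which is exactly why the spec leaves the case undefined rather than blessing either answer:

```python
import re

def rule_matches(rule_path, url_path):
    # Translate robots.txt wildcard syntax ('*' = any run of characters,
    # '$' = end-of-URL anchor) into a regular expression.
    pattern = re.escape(rule_path).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(pattern, url_path) is not None

def decide(url_path, allow_rules, disallow_rules):
    """Naive most-specific-rule resolution: the matching rule with the
    longest path wins; no matching rule means the URL is allowed.
    Illustrative sketch only."""
    verdict, best_len = "allow", -1
    for kind, rules in (("allow", allow_rules), ("disallow", disallow_rules)):
        for rule in rules:
            if rule_matches(rule, url_path) and len(rule) > best_len:
                verdict, best_len = kind, len(rule)
    return verdict

# By raw length, the wildcard rule wins -- one of several defensible answers.
print(decide("/page.htm", ["/page"], ["/*.htm"]))  # disallow
```

A resolver that, say, counted only literal characters or preferred Allow on ties would reach the opposite verdict with equal justification, so portable robots.txt files should avoid such conflicts entirely.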

Cheers
John
Thomas P. - Google Search Expert
11/25/10
Please do remember to add the new http://code.google.com/web/controlcrawlindex/ to the Google Code Site Directory of Resources @ http://code.google.com/more/#google-resources

BTW: Yes, it looks very nice! (By saying "it looks", I only imply that I haven't had a chance to read it yet.)
semetrical
11/25/10
Can you clarify the following issue please?
 
In the robots.txt specification: http://code.google.com/web/controlcrawlindex/docs/robots_txt.html you mentioned that the following URL:
http://www.domain.com/page.htm
 
is *undefined* if robots.txt contains the following rules:
User-agent: *
Disallow: /page
Allow: /*.htm
 
When you test this example in the Webmaster Tools "Test robots.txt" tool, though, the results are slightly different and clearly give the upper hand to Allow in this case. See the result below:
* Allowed by line 4: Allow: /*.htm
 
Can you confirm which one is correct: the specification or the Webmaster Tools "Test robots.txt" tool?
Richard Hearne
11/27/10
I'd trust the spec over the gwt robots tool if I were you.
Thomas P. - Google Search Expert
11/30/10
On http://code.google.com/web/controlcrawlindex/
Would it be possible to add a starring option? I.e., like for APIs: on http://code.google.com/apis/accounts/ you have the option to click the star left of the header ("Authentication and Authorization for Google APIs"), and it will show up in the "My favorites" dropdown (located to the right of your email/account login, in the top right part of the page).
Thomas P. - Google Search Expert
2/24/11
How the heck do you expect anyone to discover this comprehensive documentation (http://code.google.com/web/controlcrawlindex/) through Google Code, when you neglect to:
1. include it in the Google Code Site Directory of Resources, and
2. support the starring option offered by Google Code?

I don't need this documentation very often, so I haven't bookmarked it, but when I do need it, I haven't figured out a natural way of locating it; each time I've had to do a web search to find this thread. (Maybe I just need to bookmark it, or bump this thread at regular intervals to make sure it never gets dropped from Google's index.)

Actually, I can't even find a reference to it from within any Webmaster Help Center article; all those articles only ever link to pages on www.robotstxt.org. Hmmm, I guess http://code.google.com/web/controlcrawlindex/ might just be a dumping ground, to be ignored for future use.