Mar 17, 2021

Limit pagination crawling via Robots.txt on Forums

Hi all, we run a successful forum where threads often grow to hundreds of pages. The implementation is quite standard: www.site.com/forum/thread-name?page=1

The number of pages per thread is unbounded, so we could hypothetically have an unlimited number of pages: www.site.com/forum/thread-name?page=1500

We'd like to limit/prevent Google from crawling all this pagination, as the content tends to digress anyway. I've been toying around with regex to understand if this can be done via robots.txt, and tried the below to limit Google's crawl to pages 1 to 9; however, the rule doesn't work in the robots.txt tester:

Disallow: /forum/*?page=[1-9]

Would love to hear your thoughts on this/potential implementations and/or a better solution - thanks in advance.
Recommended Answer
Mar 17, 2021
robots.txt does not actually support regex. Well, 'consumers' of robots.txt don't!

It may look somewhat like a regular expression, but it's a special syntax just for robots.txt.

Many bots don't even support wildcards, but Googlebot does support an 'extension' to the original robots.txt standard and treats * as a wildcard.

https://developers.google.com/search/docs/advanced/robots/robots_txt#url-matching-based-on-path-values

So ranges like [1-9] are out! There isn't even an "at least one character" wildcard!

If you really want to do it, you're probably looking at using something like:
 
Disallow: /forum/*?page=
Allow: /forum/*?page=1$
Allow: /forum/*?page=2$
Allow: /forum/*?page=3$
Allow: /forum/*?page=4$
Allow: /forum/*?page=5$
Allow: /forum/*?page=6$
Allow: /forum/*?page=7$
Allow: /forum/*?page=8$
Allow: /forum/*?page=9$
 
You specifically allow pages 1-9 while disallowing all other page URLs.
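
To make the matching concrete, here's a sketch of how that could sit in a full robots.txt group (the User-agent line and the example thread URLs are assumptions; adjust to your setup and verify in the robots.txt tester before deploying). When rules conflict, Google uses the most specific rule, i.e. the one with the longest path, so the longer Allow lines win over the shorter Disallow for pages 1 to 9:

# Hypothetical sketch; test before deploying
User-agent: Googlebot
Disallow: /forum/*?page=
Allow: /forum/*?page=1$
Allow: /forum/*?page=2$
Allow: /forum/*?page=3$
Allow: /forum/*?page=4$
Allow: /forum/*?page=5$
Allow: /forum/*?page=6$
Allow: /forum/*?page=7$
Allow: /forum/*?page=8$
Allow: /forum/*?page=9$

# Example outcomes (lines starting with # are comments and ignored by crawlers):
# /forum/thread-name?page=3          -> crawlable (Allow: /forum/*?page=3$ is the longer, more specific match)
# /forum/thread-name?page=42         -> blocked   (the trailing $ stops Allow: /forum/*?page=4$ from matching)
# /forum/thread-name?page=3&sort=new -> blocked   (the $ means the Allow only matches when the URL ends right after the digit)

Note this sketch assumes the first few pages are always linked as plain ?page=N URLs with nothing after the number; if they can carry extra parameters, the rules would need adjusting.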
 
 
 
 
Original Poster Pedro Teixeira marked this as an answer
All Replies (6)
Mar 17, 2021
Hi,
 
There's not really full regex support, but you can achieve what you want with:
 
Disallow: /forum/*?page=*
Allow: /forum/*?page=1$
Allow: /forum/*?page=2$
Allow: /forum/*?page=3$
Allow: /forum/*?page=4$
Allow: /forum/*?page=5$
Allow: /forum/*?page=6$
Allow: /forum/*?page=7$
Allow: /forum/*?page=8$
Allow: /forum/*?page=9$
 
That would allow page=1 through page=9, but not page=10 onwards. You'd need an Allow line for each page number you do want crawled.
 
However, it's always worth remembering that robots.txt doesn't prevent indexing, just crawling; these pages could be partially indexed anyway.

So, if the concern is crawl budget, the robots.txt approach can work (make sure to test, and test again!), but if it's just that page=10 onwards doesn't provide much value and you don't want it indexed, use a noindex on those pages instead.
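
For reference, a noindex on those deeper pages could look something like the below (a sketch only; the "beyond the threshold" condition and where it's applied in your forum templates are assumptions). Keep in mind Googlebot has to be able to crawl a page to see the noindex, so this shouldn't be combined with a robots.txt Disallow on the same URLs:

<!-- in the <head> of pagination pages beyond the threshold -->
<meta name="robots" content="noindex, follow">

# or equivalently, sent as an HTTP response header for those URLs
X-Robots-Tag: noindex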
 
 
Last edited Mar 17, 2021
Mar 17, 2021
Ha, @barry to the rescue first!
Mar 17, 2021
Hello Pedro, 
 
Best wishes for your forums. As you are using a parameter for pagination, why don't you use Google's URL Parameters tool? It lets you block crawling of parameterized duplicate content, and it can also be used for pagination because there is a pagination option as well.
 
 
Hope this will help you.
Mar 17, 2021
@barryhunter @dwsmart - this is brilliant, thank you! Loads of great information in just a couple of answers. I had no idea robots.txt syntax wasn't regex, and I was so focused on blocking that I never thought of the reverse approach with Allow rules - brilliant!

@dwsmart: and thanks for the additional nudge on the indexing/crawling distinction. This case really is about mitigating crawl budget. If a couple of pages beyond page 9 end up indexed because Google found them via backlinks or elsewhere, it's not the end of the world. In fact, if that happens it's because some soul found the page valuable enough to link to, so another soul might find it useful in the SERPs.

@Saket: Thanks as well - in this particular case we don't want to block all ?page= URLs, just those beyond a certain threshold. Another reason is that this parameter is used sitewide in other URLs, and that differentiation wouldn't be possible in the URL Parameters tool.

Thanks all!
Mar 18, 2021
Hello Pedro, 

There is an option, "Only URLs with value=x": Googlebot will crawl only those URLs where the value of the parameter matches the specified value; URLs with a different parameter value won't be crawled. So you can set your own rules.

Parameter behavior applies to the entire property; you cannot limit crawling behavior for a given parameter to a specific URL or branch of your site. Yes, in that case, you have to see what is beneficial for your website.