Mar 17, 2021

Limit pagination crawling via Robots.txt on Forums

Hi all, we run a successful forum where threads often grow to hundreds of pages. The implementation is quite standard: www.site.com/forum/thread-name?page=1

The number of pages per thread is unbounded, so we could hypothetically have an unlimited number of pages: www.site.com/forum/thread-name?page=1500

We'd like to limit/prevent Google from crawling all this pagination, as the content tends to digress anyway. I've been toying around with regex to understand if this can be done via robots.txt, and tried the below to limit Google's crawl to pages 1 to 9; however, the rule doesn't work in the robots.txt tester:

Disallow: /forum/*?page=[1-9]

Would love to hear your thoughts on this/potential implementations and/or a better solution - thanks in advance.
Recommended Answer
Mar 17, 2021
robots.txt does not actually support regex. Well, 'consumers' of robots.txt don't!

It may look somewhat like a regular expression, but it's a special syntax just for robots.txt.

Many bots don't even support wildcards, but Googlebot does support an 'extension' to the original robots.txt standard and treats * as a wildcard.

https://developers.google.com/search/docs/advanced/robots/robots_txt#url-matching-based-on-path-values

So ranges like [1-9] are out! There isn't even an "at least one character" wildcard!

If you really want to do it, you're probably looking at using something like:
 
Disallow: /forum/*?page=
Allow: /forum/*?page=1$
Allow: /forum/*?page=2$
Allow: /forum/*?page=3$
Allow: /forum/*?page=4$
Allow: /forum/*?page=5$
Allow: /forum/*?page=6$
Allow: /forum/*?page=7$
Allow: /forum/*?page=8$
Allow: /forum/*?page=9$
 
You specifically allow pages 1-9 while disallowing all other page URLs.
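
To make the matching concrete, here's a sketch of how that could sit in a full robots.txt group (the User-agent line and the example thread URLs are assumptions; adjust to your setup and verify in the robots.txt tester before deploying). When rules conflict, Google uses the most specific rule, i.e. the one with the longest path, so the longer Allow lines win over the shorter Disallow for pages 1 to 9:

# Hypothetical sketch; test before deploying
User-agent: Googlebot
Disallow: /forum/*?page=
Allow: /forum/*?page=1$
Allow: /forum/*?page=2$
Allow: /forum/*?page=3$
Allow: /forum/*?page=4$
Allow: /forum/*?page=5$
Allow: /forum/*?page=6$
Allow: /forum/*?page=7$
Allow: /forum/*?page=8$
Allow: /forum/*?page=9$

# Example outcomes (lines starting with # are comments and ignored by crawlers):
# /forum/thread-name?page=3          -> crawlable (Allow: /forum/*?page=3$ is the longer, more specific match)
# /forum/thread-name?page=42         -> blocked   (the trailing $ stops Allow: /forum/*?page=4$ from matching)
# /forum/thread-name?page=3&sort=new -> blocked   (the $ means the Allow only matches when the URL ends right after the digit)

Note this sketch assumes the first few pages are always linked as plain ?page=N URLs with nothing after the number; if they can carry extra parameters, the rules would need adjusting.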
 
 
 
 
Original Poster Pedro Teixeira marked this as an answer
All Replies (6)
Mar 17, 2021
Hi,
 
There's not really full regex support, but you can achieve what you want with:
 
Disallow: /forum/*?page=*
Allow: /forum/*?page=1$
Allow: /forum/*?page=2$
Allow: /forum/*?page=3$
Allow: /forum/*?page=4$
Allow: /forum/*?page=5$
Allow: /forum/*?page=6$
Allow: /forum/*?page=7$
Allow: /forum/*?page=8$
Allow: /forum/*?page=9$
 
That would allow page=1 through page=9, but not page=10 onwards. You'd need an Allow line for each page number you do want crawled.
 
However, it's always worth remembering that robots.txt doesn't prevent indexing, just crawling; these pages could be partially indexed anyway.

So, if the concern is crawl budget, the robots.txt approach can work (make sure to test, and test again!), but if it's just that page=10 onwards doesn't provide much value and you don't want it indexed, use a noindex on those pages instead.
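
For reference, a noindex on those deeper pages could look something like the below (a sketch only; the "beyond the threshold" condition and where it's applied in your forum templates are assumptions). Keep in mind Googlebot has to be able to crawl a page to see the noindex, so this shouldn't be combined with a robots.txt Disallow on the same URLs:

<!-- in the <head> of pagination pages beyond the threshold -->
<meta name="robots" content="noindex, follow">

# or equivalently, sent as an HTTP response header for those URLs
X-Robots-Tag: noindex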
 
 
Last edited Mar 17, 2021
Mar 17, 2021
Ha, @barry to the rescue first!
Mar 17, 2021
Hello Pedro, 
 
Best wishes for your forums. As you are using a parameter for pagination, why don't you use Google's URL Parameters tool? It lets you block crawling of parameterized duplicate content, and it can also be used for pagination because there is a pagination option as well.
 
 
Hope this will help you.
Mar 17, 2021
@barryhunter @dwsmart - this is brilliant, thank you! Loads of great information in just a couple of answers. I had no idea robots.txt syntax wasn't regex, and I was so focused on blocking that I never thought of the reverse approach with Allow rules - brilliant!

@dwsmart: and thanks for the additional nudge on the indexing/crawling distinction. This case really is about mitigating crawl budget. If a couple of pages beyond page 9 end up indexed because Google found them via backlinks or elsewhere, it's not the end of the world. In fact, if that happens it's because some soul found the page valuable enough to link to, so another soul might find it useful in the SERPs.

@Saket: Thanks as well - in this particular case we don't want to block all ?page= URLs, just those beyond a certain threshold. Another reason is that this parameter is used sitewide in other URLs, and that differentiation wouldn't be possible in the URL Parameters tool.

Thanks all!
Mar 18, 2021
Hello Pedro, 

There is an option, "Only URLs with value=x": Googlebot will crawl only those URLs where the value of the parameter matches the specified value; URLs with a different parameter value won't be crawled. So you can set your own rules.

Parameter behavior applies to the entire property; you cannot limit crawling behavior for a given parameter to a specific URL or branch of your site. Yes, in that case, you have to see what is beneficial for your website.