May 16, 2020

Crawling 8.8 million pages a day, stuck in one section searching for keywords; rankings in other sections are hurting.

After the May update, Google's crawlers indexed 23 million new pages on our user-generated website. Prior to the update, we had a steady 5.5 million total pages indexed and a crawl rate of around 400k - 500k pages per day. As of 5/13 they are crawling at a rate of 8.9 million pages a day and increasing. To give you an idea of the rate of increase: 5/11: 4.7 million, 5/12: 7 million, 5/13: 8.8 million. It's impressive that they can crawl at such a rate, but unfortunately it's a major problem, and we believe it's the main cause of our rankings dropping in all other sections, as we'll explain further.

They seem to be stuck in an infinite "loop" in one section, mainly because the crawlers are searching an endless list of keywords ranging from "dog" to really random strings like "rk6ddhw03o56de6" while also changing the filters that narrow the results. As you can imagine, most of these internal keyword searches return "0 results". This particular section only has 8k individual pages. How do we get them to move on to, or at least divide their attention across, our other sections, which are actually far more important to us than the one they have been stuck in since May 3rd? The section they are stuck in dropped 0.2 in rank, while all the others have dropped between 3 and 20 positions. We were on the first page for many of our target keywords, and that's no longer the case. We thought they'd naturally move on, but it's been too long to ignore and the impact on our traffic is severe.

To make matters worse, or perhaps this is the main issue, we did not have noindex set on these pages until tonight; it hadn't been set for 10+ years. We now have the meta tags properly set up to noindex keyword searches, but we face the problem of how to deindex the millions of 0-result pages. Is there any way to deindex with a pattern?

For example: /section/$1?keywords=$2
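(For reference, the noindex we added to those search pages tonight is just the standard robots meta tag, something along these lines:)

<meta name="robots" content="noindex">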

Do we need to wait for the crawlers to search through the massive list of keywords again to deindex what they searched for? From what we've read, adding a rule to robots.txt would stop them from searching the section, but then they wouldn't be able to see the noindex tag and remove the pages. Who knows how long it will be before they search through the keyword list again.

Ultimately, it's created a very skewed view of what our website is about. Having 23 million pages about one subject, the vast majority of which have 0 results, versus the 5 million higher-quality pages they knew about before the update, is causing all our rankings to plummet.

We need help / advice.  Mainly, how to deindex the problematic pages via the pattern described above.
All Replies (5)
MaxL, if you have marked the pages as NOINDEX, that is the right thing to do.

If you want Google to deindex those pages faster, create new sitemap files listing all the pages to be deindexed and submit those sitemaps in Google Search Console, then wait for Google to crawl those pages. Keep in mind that a sitemap can only contain 50,000 pages, so you may need to split your list into multiple sitemaps and submit each of them individually.
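As a rough sketch only (the file names here are placeholders, not anything from your site), splitting a plain list of URLs into 50,000-entry sitemap files can be as simple as something like this in Python:

# Sketch: split a plain-text list of URLs (one per line) into sitemap XML
# files of at most 50,000 URLs each. File names are placeholders.
from xml.sax.saxutils import escape

MAX_URLS = 50000  # Google's per-sitemap URL limit

with open("urls_to_deindex.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i in range(0, len(urls), MAX_URLS):
    chunk = urls[i:i + MAX_URLS]
    name = "sitemap_deindex_%d.xml" % (i // MAX_URLS + 1)
    with open(name, "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in chunk:
            out.write("  <url><loc>%s</loc></url>\n" % escape(url))
        out.write("</urlset>\n")

Each generated file can then be submitted in Search Console, or listed in a sitemap index file and submitted once.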

I did that on a site of mine a few years ago. Would you like to get personalized help from me on how to implement it in practice?
May 16, 2020
Thanks for the offer. I can certainly use any help or advice from anyone on this situation.

Deindexing keyword searches that were indexed:
The trouble with deindexing what was indexed is that I don't have a list of the keywords they searched for, so I can't create the URL sitemaps to submit to them. It's a seemingly random variable. I thought about logging the keywords they are searching to see if there's a limited number of unique terms they are cycling through with the various filters. However, that wouldn't help, because I don't know which filters they used with which keywords. I'd end up asking them to visit pages they didn't index in the first place, only to not index them, in the hope that they deindex the ones they did visit previously.

I'm going to check the Apache access logs to see how far back they go, but this activity has been going on for over a week, and processing the logs and generating 50k sitemaps based on keywords= matches will take considerable time: investigating, writing the parsing script, processing, and ultimately submitting hundreds of sitemaps of 50k URLs each. Is that something you've done and something I should do next? I'll do it if it's what's necessary.
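If I do go this route, the parsing itself shouldn't be the hard part. Something along these lines (the log path, hostname and output file below are placeholders, not our real setup) would pull the unique keyword-search URLs that Googlebot requested out of the access logs, and the resulting list could then be split into 50k sitemaps as described above:

# Sketch: collect unique Googlebot requests for internal keyword searches
# from an Apache access log (combined log format assumed; path, hostname
# and output file are placeholders).
import re

LOG_FILE = "/var/log/apache2/access.log"  # placeholder
SITE = "https://www.example.com"          # placeholder

# matches request lines like: "GET /section/foo?keywords=dog&filter=x HTTP/1.1"
request_re = re.compile(r'"GET (/[^" ]*\?[^" ]*keywords=[^" ]*) HTTP')

unique_urls = set()
with open(LOG_FILE) as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = request_re.search(line)
        if match:
            unique_urls.add(SITE + match.group(1))

with open("keyword_urls.txt", "w") as out:
    for url in sorted(unique_urls):
        out.write(url + "\n")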

I see that you can temporarily remove URLs with a prefix...
That would be a drastic measure to remove an entire section, but perhaps it's my best option here. It should get rid of the bulk of the indexing problem and buy roughly 6 months for them to recrawl and permanently remove the pages once they see the noindex robots meta tag. Maybe this is the way to go?

The crawler's searching in one section isn't stopping.
They are searching for keywords in this particular section again this morning. At any given moment there are 280+ requests from them with this pattern. I'm not sure how to get them to stop, other than using robots.txt to target that particular section. At the same time, if I do that, they won't be able to naturally deindex the list that only they know about. Quite the predicament.

I feel like I need the activity to stop, or at least show signs of slowing, before I focus on helping them deindex what they already have.

Disallow: /section*?keywords=*

If I did that temporarily, maybe it would stop the keyword searches and they would look at other sections? Later I could remove the disallow, and they could return and hopefully deindex what they searched for. The thought of them searching through everything again seems... like something that will take even longer for them to get around to, if ever.

Should I disallow keyword searches in general in robots.txt?
I don't know what would prevent them from running the keyword list against other sections and falling into the same problem, except that this time my internal search pages wouldn't get indexed.
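If I were to go that broad, I assume the rule would just be a site-wide version of the same pattern, something like:

Disallow: /*?keywords=

That's a guess on my part at the right syntax, though, and it has the same side effect of keeping them from ever seeing the noindex tags on what they already indexed.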

I know no one can answer this, but I can't help asking: why is this an issue today and not at any point in the past 10 years?
The 23 million newly indexed pages figure is from May 11th data in webmaster tools; it hasn't updated since then, so the real number is likely much higher now. I'm quite surprised they consider all of these 0-result pages valid / worth indexing in the first place. The only difference between them is the page title and h1 tag. Other than that, it's the site wrapper and the message "no results."

I'm glad to have noindex in place now, but I need to clean up the mess.
Last edited May 16, 2020
May 16, 2020
I don't think the Removals tool will work here. It sounds like it would just hide /section* from search results, but would have no real impact on what is actually indexed and potentially dragging down the other sections of my site. All sections have dropped 3-20 spots in rank, while the /section they are stuck in rose 0.3 as of today.

I now have 23+ million pages of very poor quality in one /section vs ~5 million in all the others.

I don't know how to fix this yet.  I may have to wait until they naturally drop them all out over time.  This is painful.

I think I should probably block the search activity in that section via robots.txt, to at least stop them from wasting resources / crawl budget there.
Last edited May 16, 2020
May 16, 2020
Hi
 
I agree, the Removals tool isn't the right thing here; that just hides the pages from showing up in results, which isn't really the issue.
 
So the decision is really: is this crawling level currently an issue? It seems to me from what you've said that it is, in which case the robots.txt block is the right tool for the job. As you say, you can selectively add and remove it to keep some control over the crawling, and therefore the deindexing, of these pages as time goes on.
 
But that's a fairly blunt tool, so you might want to look into the URL Parameters tool; scroll down to the "Block crawling of URLs containing specific parameters" section for some guidance.
 
 
MaxL, you need help with your deindexing problem, which is difficult but possible to solve. I solved that problem on my own sites years ago; it was time consuming, but it worked. I have other projects of my own to take care of that generate money, so I can help you solve this as long as you are willing to hire my services to do it. Can you please let me know if you are interested in hiring me so I can help you solve the problem very quickly?