12/10/15
Original Poster
Google user

Google Crawler Truncates Comma in URL and Reports 404

Hi, I've found numerous discussions about this but could not find an authoritative answer that confirms it one way or another.

Here's the problem in a nutshell:
We have a very large number of pages with a comma in the URL.
These pages are also served through an XML sitemap, which is properly encoded.
Up until recently we never had a problem with that.
In the past few months, we've been seeing 404 reports in Search Console that appear to come from URLs served through the sitemap. However, the URLs reported as 404 are not actually in the sitemap; they seem to be truncated versions of URLs that appear (properly) in the sitemap.
Example:
Search Console reports 404 on http://example.com/x and claims it was found in the sitemap.
The sitemap does not have http://example.com/x; it does have URLs of the form http://example.com/x,y.

The main reason I suspect it's a bug on Google's side is that we're only getting 404s for a very small number of our URLs with commas. If it had been something we are doing consistently wrong (e.g., encoding), I'd expect to see a much larger number of 404s and truncated URLs.

My question is: can someone confirm this is a known issue with Google Crawler?
Should we be doing something differently, or just ignore the 404 warnings and wait for the issue to be resolved on its own? (Based on reports on the web, it seems to have existed for years.)

Here are some of the reports I found:
Thank you very much for your help.

Roee.
Community content may not be verified or up-to-date. Learn more.
All Replies (7)
h3nce
12/10/15
can someone confirm this is a known issue with Google Crawler?

I don't think it's an issue with Google's crawler; I have had sitemaps with commas in the URLs without any 404 issues. Potentially, it could be how your server encodes the comma. If the encoding differs between the XML and the server, Google would naturally discover 404s, because it would be crawling different URLs.

If you haven't already done so, you can likely remove the issue, or its potential to happen, by wrapping URLs within your XML sitemap with CDATA tags. For example:

<loc><![CDATA[http://example.com/x,y]]></loc>

Reference: http://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean
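
For anyone comparing the two approaches, here is a minimal Python sketch that builds a `<loc>` entry both with entity escaping and with CDATA, then parses each one back. The helper names (`loc_escaped`, `loc_cdata`, `parsed_text`) and the example URL are made up for illustration. As far as XML itself is concerned, a comma is an ordinary character, so both forms should hand the parser the same URL. This only checks the XML layer, not how a server encodes or decodes the comma.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def loc_escaped(url: str) -> str:
    """Build a <loc> element using entity escaping (&, <, > only)."""
    return f"<loc>{escape(url)}</loc>"


def loc_cdata(url: str) -> str:
    """Build a <loc> element that wraps the raw URL in a CDATA section."""
    if "]]>" in url:
        # A CDATA section cannot contain its own terminator.
        raise ValueError("URL contains the CDATA terminator ']]>'")
    return f"<loc><![CDATA[{url}]]></loc>"


def parsed_text(fragment: str) -> str:
    """Return the text an XML parser sees inside the element."""
    return ET.fromstring(fragment).text or ""


# Hypothetical URL with a comma and an ampersand, not taken from a real sitemap.
url = "http://example.com/x,y?a=1&b=2"
print(loc_escaped(url))
print(loc_cdata(url))
print(parsed_text(loc_escaped(url)) == parsed_text(loc_cdata(url)))
```

If the last line prints `True` for your own URLs, the two sitemap styles are equivalent at the XML level, and any mismatch would have to come from somewhere else.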


12/10/15
Original Poster
Google user
Thanks for help. I'll try that.
The reason I don't think it's something about the way we encode/decode the XML or the URL is that the XML sitemap was generated automatically. It contains tens of millions of URLs with commas, 99% of which are crawled and indexed properly.
We're only getting 404s on a few hundred or a few thousand URLs a day (a small fraction of Google's daily crawl), so if it were something technical on our side I'd expect to see this on millions of URLs.

Roee.
h3nce
12/10/15
Where do you receive the 404 error messages? The 'Crawl Errors' report or the 'Sitemap' report?

I ask because it would be good to know 100% that the sitemap is the source of the error. For instance, I've seen backlinks that YouTube created from truncated URLs, which were the source of the problem.

Additionally, if encoding is not the problem, 'Crawl Errors' might provide extra context e.g. 'Soft 404s', which might point to an on-page issue or redirect issue.

One other useful tip: Internet Explorer / Edge does not re-encode URLs. A quick test you can do is copy & paste the URL from the XML sitemap into the IE / Edge address bar. Then navigate to the same URL on your website, and again copy & paste it into the IE / Edge address bar. Look out for any differences where the commas appear in the URL. CDATA would fix this, but this way you would find the root cause.
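
The same comparison can be scripted. This is only a sketch, assuming you already have the URL as written in the sitemap and the URL as your server presents it; the function name `encoding_mismatch` is hypothetical. It flags pairs that differ as raw strings but become identical once percent-decoded, for example `,` versus `%2C`.

```python
from urllib.parse import unquote


def encoding_mismatch(sitemap_url: str, served_url: str) -> bool:
    """True when two URLs differ as raw strings but match once percent-decoded."""
    return (
        sitemap_url != served_url
        and unquote(sitemap_url) == unquote(served_url)
    )


# Example pairs (made-up URLs).
print(encoding_mismatch("http://example.com/x,y", "http://example.com/x%2Cy"))
print(encoding_mismatch("http://example.com/x,y", "http://example.com/x,y"))
```

A pair that is flagged points to an encoding difference; a pair that is not flagged either matches exactly or is a genuinely different URL.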
12/11/15
Original Poster
Google user
Thanks h3nce. I'll do the IE check as well.

This is not a soft 404.
I am seeing it in Crawl Errors -> Desktop tab -> Not found tab. It appears in the Smartphone tab as well but in much smaller numbers.


Looking at the list of the 'Not found' URLs I see they are of the form http://example.com/x. Clicking on it to see the 'Linked from' tab, I see it's being linked from sitemap files.
When checking each of those sitemap files, said URL does not appear in them, but there are plenty of URLs of the form http://example.com/x,y in those sitemap files.

This is what leads me to believe that:
1. The problem is comma truncation
2. It's coming from URLs in sitemap files
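
For anyone who wants to run the same check in bulk rather than by hand, here is a rough Python sketch. It assumes you have exported the 'Not found' URLs and the sitemap URLs into two lists; the function name `truncation_candidates` and the sample data are made up. It maps each 404 URL to the sitemap URLs whose text before the first comma is exactly that URL.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def truncation_candidates(
    not_found: Iterable[str], sitemap_urls: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each 404 URL to sitemap URLs that equal it when cut at the first comma."""
    by_prefix: Dict[str, List[str]] = defaultdict(list)
    for url in sitemap_urls:
        if "," in url:
            # Everything before the first comma is what a truncation would leave.
            by_prefix[url.split(",", 1)[0]].append(url)
    return {nf: by_prefix[nf] for nf in not_found if nf in by_prefix}


# Made-up sample data for illustration.
not_found = ["http://example.com/x", "http://example.com/z"]
sitemap_urls = [
    "http://example.com/x,y",
    "http://example.com/x,w",
    "http://example.com/q",
]
print(truncation_candidates(not_found, sitemap_urls))
```

404 URLs that do not show up in the result are not explained by cutting a sitemap URL at its first comma, so they would need a different explanation.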

Combine that with the fact that these types of URLs are the majority of the site, that 99% of them have been crawled and indexed properly for years now, and that other people have posted similar reports (referenced above), and it makes me suspect the problem is on Google's side. That's why I'm looking for stronger confirmation.

Roee.
 


JohnMu
12/14/15
Hi Roee

I generally recommend avoiding special characters like commas, semicolons, colons, spaces, quotes etc. in URLs, to help keep things simple. URLs like that are often harder to automatically link (when someone posts in a forum or elsewhere), and hard for us to recognize correctly when we parse text content to try to find new URLs. When they're linked normally or submitted through a sitemap directly, they work as expected. However, when we try to recognize the URL in something that we crawl as an HTML or a text page, then we'll probably "guess" them wrong -- which is fine, since we've probably already seen them through the normal links & sitemap usage.
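
To illustrate how text-based URL recognition can "guess" wrong, here is a toy Python sketch. It is not Google's parser; the pattern `NAIVE_URL` is a made-up example of the kind of rule a forum auto-linker or text scraper might use, where a comma or semicolon is treated as punctuation that ends the URL.

```python
import re

# Hypothetical auto-linking rule: a URL runs until whitespace, a comma,
# a semicolon, or an angle bracket. Purely illustrative.
NAIVE_URL = re.compile(r"https?://[^\s,;<>]+")


def guess_urls(text: str) -> list:
    """Return the URLs this naive rule would recognize in plain text."""
    return NAIVE_URL.findall(text)


print(guess_urls("See http://example.com/x,y for details"))
print(guess_urls("See http://example.com/plain for details"))
```

With a rule like this, the comma URL is cut at the comma while the plain URL survives intact, which is one way a truncated URL such as http://example.com/x could end up being requested.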

In practice this doesn't matter; finding links which don't work is perfectly normal for us. It won't break the crawling, indexing, or ranking of your site, assuming we can crawl it otherwise. We'll show these as 404s in Search Console because they return 404, but they're not something critical that you need to suppress.

If you want to move to a cleaner URL structure that's less likely to be misinterpreted like that, you can use normal 301 redirects & rel=canonical elements on the page. It'll generally take some time to crawl & reindex the URLs like that though, so you'll continue to see these old URLs in Search Console in the meantime.
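
As a rough sketch of what such a migration might look like in Python: the helpers below (`clean_path`, `redirect_for`, `canonical_link`) are hypothetical, and replacing commas with hyphens is just one assumed naming scheme, not a recommendation from this thread. The idea is that a request for an old comma path gets a 301 to the new path, and the new page carries a rel=canonical element pointing at itself.

```python
from html import escape
from typing import List, Optional, Tuple


def clean_path(path: str) -> str:
    """Assumed scheme: replace each comma with a hyphen."""
    return path.replace(",", "-")


def redirect_for(path: str) -> Optional[Tuple[str, List[Tuple[str, str]]]]:
    """Return a 301 status line and headers for old comma paths, else None."""
    if "," not in path:
        return None  # already clean, serve the page normally
    return "301 Moved Permanently", [("Location", clean_path(path))]


def canonical_link(url: str) -> str:
    """Build the rel=canonical element for the page's <head>."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


print(redirect_for("/x,y"))
print(redirect_for("/x-y"))
print(canonical_link("http://example.com/x-y"))
```

A real setup would also need to handle percent-encoded commas (`%2C`) and check that the new paths do not collide with existing ones; both are left out here to keep the sketch short.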

Cheers
John

12/14/15
Original Poster
Google user
Thank you John for the answer and advice.
We'll consider moving to a different URL structure, although what baffles me is that these errors seem to have originated from sitemap links (I went back and forth over this and the sitemaps seem fine). At least I'm less concerned about this now, given your explanation.

Roee.
amit kumar roy
12/15/15
Is using hyphens a better practice in this case, John? What about meta titles, where people use pipes or other special characters as well? Any suggestions for that?
 
