Troubleshooting article extraction issues
In Search Console, you can already view news-specific crawl errors. For instance, if one of your articles failed extraction with the "Article too long" error, the article body that we extracted from the HTML page appears to be too long to be a news article. This error can occur when we mistakenly extract comments at the bottom of the page, or sidebar text, as part of the article body.
The troubleshooting tool gives you more visibility into the extraction process. It lets you test on demand whether a specific article, or a specific section page with several articles, can be extracted correctly. The errors it returns can point you towards a fix, so that you can resolve extraction problems before they turn into crawl errors.
To access the troubleshooting tool, open your news source in the Google News Publisher Center and click Troubleshooting from the side navigation menu.
To check whether we can discover articles on a specific section URL (e.g. example.com/technology) or sitemap, select Troubleshooting > Sections. Next, enter the relevant sitemap or section URL in the field in the middle of the page and click Test.
You will then see up to 100 article URLs that we found on the provided section URL. To test article extraction for any of these article URLs, click the Test button to the right of that URL.
To test article extraction for a specific news article, select Troubleshooting > Articles.
Next, enter the Section URL for the article that you want to troubleshoot, or select None or Sitemap. The section URL or sitemap on which Google News discovers your content affects the labels applied to your article, and our testing tool mimics this behavior. Selecting a particular section URL therefore tests extraction as if the article were discovered on that section, and applies any relevant labels accordingly. Conversely, choosing None or Sitemap runs extraction as if the article were discovered on a section URL without any labels, or on a sitemap.
After completing the Section URL field, enter the URL of the news article you’d like to test, and click Test.
When troubleshooting sections, sitemaps or articles, you may encounter a failure message. Most often, these messages will pertain to news-specific crawl errors also found in Google Search Console. However, some failures are specific to the troubleshooting tool and are explained below:
We are unable to crawl the article requested. This could have been due to an HTTP 404 or server error, or the page may be restricted by a robots.txt file.
Check your article URL to make sure it is correct and it is not restricted from crawling by Googlebot-News. You may also want to examine your robots.txt (if applicable) and make sure that the article page is not restricted.
We are unable to crawl the section page requested. This could have been due to an HTTP 404 or server error, or the page may be restricted by a robots.txt file.
Check your section URL to make sure it is correct and it is not restricted from crawling by Googlebot-News. You may also want to examine your robots.txt (if applicable) and make sure that the section page is not restricted.
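Both crawl errors above ask you to confirm that robots.txt does not restrict Googlebot-News. The following is a minimal sketch of such a check using Python's standard-library robots.txt parser. The function name `is_crawl_allowed` and the sample rules are hypothetical, and the standard-library parser may not match Google's own robots.txt handling in every case (for example, wildcard patterns), so treat the result as a first indication only.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, url: str,
                     user_agent: str = "Googlebot-News") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper: it parses robots.txt content you supply as a string,
    so it makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (an assumption for illustration, not taken from a real site).
SAMPLE_ROBOTS = """\
User-agent: Googlebot-News
Disallow: /premium/

User-agent: *
Allow: /
"""

print(is_crawl_allowed(SAMPLE_ROBOTS, "https://example.com/technology/story.html"))  # True
print(is_crawl_allowed(SAMPLE_ROBOTS, "https://example.com/premium/story.html"))     # False
```

To check a live site, you could paste the contents of your own robots.txt file into the first argument and pass the article or section URL that failed in the troubleshooting tool.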
This error message is displayed for one or more of the following reasons:
- You do not have permission to view this news source.
- You have provided an article URL that is not on the same domain (example.com) as your source URL.
- You have provided a section or sitemap URL that is not on the same domain (example.com) as your source URL.
- Article troubleshooting only: The section or sitemap URL provided is not active or listed on your news source’s main Sections page.
To resolve this error:
- Make sure that you have permission to view this news source. Check that you’re logged into the Publisher Center with the same email account that you used to verify ownership of your news source in Google Search Console.
- Ensure that your section, sitemap, and article URLs are on the same domain (example.com) as the source URL listed on the Source page.
- When troubleshooting articles, make sure that you are selecting an active section or sitemap URL that is also listed on the main Sections page. If you are unsure, you can click the text field to reveal a list of allowed section URLs, or you can select None or Sitemap.
Note: The Sections page mentioned in this explanation is the main Sections page, not Troubleshooting > Sections.
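The same-domain requirement in the list above can be checked before you submit a URL to the tool. The sketch below is one way to do that in Python. The helper names are hypothetical, and two behaviors are assumptions rather than documented Publisher Center rules: a leading "www." is ignored, and subdomains of the source domain are treated as matching.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Extract a lowercase hostname, tolerating URLs written without a scheme."""
    if "://" not in url:
        url = "https://" + url  # lets "example.com/technology" parse correctly
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: "www.example.com" and "example.com" count as the same domain.
    return host[4:] if host.startswith("www.") else host


def same_domain(source_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the source domain or a subdomain of it.

    Assumption: subdomains are accepted; tighten this if your source is
    configured differently.
    """
    source, candidate = _host(source_url), _host(candidate_url)
    if not source or not candidate:
        return False
    return candidate == source or candidate.endswith("." + source)


print(same_domain("https://example.com", "https://www.example.com/technology"))  # True
print(same_domain("https://example.com", "https://example.org/technology"))      # False
```

Comparing hostnames rather than raw strings avoids false matches such as notexample.com, which ends with "example.com" but is a different domain.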