Here are some of the most common extraction issues related to the article text, and how you can solve them.
- If the article content appears to be too long to be a news article, our crawler may not recognize it as an article. This may happen with news articles that contain user-contributed comments below the article, or HTML layouts that contain other material besides the news article itself.
- If the article content doesn't have punctuated sequences of contiguous words, we won't be able to include it in Google News. Make sure that the text of your articles is made up of sentences, and that you don't use frequent line-break tags (such as <br>) within your paragraphs.
- If the article content appears to consist only of isolated sentences not grouped into paragraphs, we won't be able to crawl it. Try formatting your articles into text paragraphs of a few sentences each.
- If the article content constitutes a small fraction of the text on the page, we won't be able to include it in our News index. Consider removing some of the non-article text on the page.
- If the article content appears to contain too few words to be a news article, we won't be able to include it. This applies to most links that would lead to news briefs or multimedia content, rather than full news articles.
- If the article content appears to be empty, we won't be able to crawl it. Make sure that the full text of each of your articles is available in the source code of your article pages (and not embedded in a JavaScript file, for example).
- If the article content is prevented from being crawled by a robots.txt file or a robots meta tag, Googlebot won't be able to access your article. Try removing "noindex" and/or "nofollow" from the robots meta tag, or checking that your robots.txt file allows "User-agent: Googlebot" to access the location where your news articles are stored.
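
As a quick self-check for the last item, the following is a minimal sketch in Python, using only the standard library, of how you might test a page for blocking robots directives before submitting it. It is not how Googlebot itself evaluates pages. The function names `blocking_meta_directives` and `robots_txt_allows` are hypothetical and chosen for illustration, and the sample rules and URLs are made up.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (for example, <meta name>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def blocking_meta_directives(html):
    """Hypothetical helper: return any noindex/nofollow directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return sorted(parser.directives & {"noindex", "nofollow"})


def robots_txt_allows(robots_txt, url, user_agent="Googlebot"):
    """Hypothetical helper: check whether the given robots.txt text lets user_agent fetch url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


# Illustrative inputs only; replace them with your own page source and robots.txt.
sample_page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
sample_robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

print(blocking_meta_directives(sample_page))
print(robots_txt_allows(sample_robots_txt, "https://example.com/news/story.html"))
```

If the first call returns a non-empty list, or the second returns False for one of your article URLs, the page is likely to hit the robots issue described above.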
