Avoid indexing duplicate content


The GSA indexes duplicate content, which consumes additional license count.


GSA treats each URL as a unique entity (unless two URLs are exactly identical) and compares content checksums to detect duplicates.
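GSA's internal checksum algorithm is not documented here, but the idea can be sketched as follows: hash each document body and flag URL groups whose hashes collide. The function names, hash choice, and sample URLs below are illustrative assumptions, not GSA's actual implementation.

```python
import hashlib

def content_checksum(body: str) -> str:
    """Checksum of a document body (SHA-256 here; GSA's algorithm may differ)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def find_duplicates(pages: dict) -> dict:
    """Group URLs by content checksum; groups with more than one URL are duplicates."""
    groups = {}
    for url, body in pages.items():
        groups.setdefault(content_checksum(body), []).append(url)
    return {h: urls for h, urls in groups.items() if len(urls) > 1}

pages = {
    "http://intranet/a": "same content",
    "http://intranet/b": "same content",   # duplicate of /a: distinct URL, same checksum
    "http://intranet/c": "unique content",
}
duplicates = find_duplicates(pages)
# Every URL in a duplicate group still counts against the license limit.
```

Note that `/a` and `/b` are kept as separate documents because their URLs differ, even though their content checksums match.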


Remove URLs that serve duplicate content.

GSA has infinite space detection (Crawl and Index > Duplicate Hosts) to configure the number of identical documents required before duplicate content is detected. It can also detect repetitive path or query strings. Infinite space detection happens during crawling, so if a URL has already been crawled, a recrawl is necessary for its content to be deleted from the index.
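A repetitive path is a common symptom of a crawler trap: the same segment recurs over and over in an ever-growing URL. A minimal sketch of this kind of check (the threshold and sample URLs are illustrative assumptions, not GSA's configuration values) could look like:

```python
from urllib.parse import urlparse

def has_repetitive_path(url: str, max_repeats: int = 2) -> bool:
    """Flag URLs whose path repeats the same segment more than max_repeats times,
    a typical sign of an infinite-space crawler trap (threshold is illustrative)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in segments)

has_repetitive_path("http://intranet/cal/2024/cal/2024/cal/2024/cal/2024")  # True
has_repetitive_path("http://intranet/docs/guide")                           # False
```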

If two content servers serve the same content under different URLs, they can be defined as duplicate hosts (Crawl and Index > Duplicate Hosts).
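The effect of a duplicate-hosts mapping can be sketched as rewriting URLs on a duplicate host to their canonical host before indexing, so both hosts resolve to one document. The hostnames and mapping below are hypothetical examples, not GSA's internal behavior.

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical mapping: duplicate host -> canonical host, analogous to entries
# configured under Crawl and Index > Duplicate Hosts (hostnames are examples).
DUPLICATE_HOSTS = {
    "mirror.example.com": "docs.example.com",
    "files.example.com": "docs.example.com",
}

def canonicalize(url: str) -> str:
    """Rewrite a URL on a duplicate host to the canonical host, so both hosts
    contribute one indexed document instead of two license-counted copies."""
    parts = urlparse(url)
    host = DUPLICATE_HOSTS.get(parts.hostname, parts.hostname)
    return urlunparse(parts._replace(netloc=host))

canonicalize("http://mirror.example.com/policy.html")
# -> "http://docs.example.com/policy.html"
```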




