Why an old and new appliance have different index sizes
Summary: After exporting a configuration from an old search appliance to a new search appliance, the number of documents in the indexes differs significantly, even after waiting for the new search appliance to crawl all the content.
Cause: The usual cause, and the steps for diagnosing and addressing it, are described below:
- Corpus changes over time: The most common reason for the divergence in index size is that the older search appliance holds a historical set of URLs that are no longer linked from the current main content sites, whereas the new search appliance has no way to discover those URLs using the current configuration and site navigation links. To understand this situation, it helps to have some background on how the crawler works. As it crawls, the search appliance maintains a list of both discovered and indexed URLs. Over time, some websites retire or deprecate content by removing the links to it without removing the documents themselves. In these cases, the old search appliance retains its knowledge of the unlinked documents and keeps them in the index, even though they are no longer reachable from the rest of the website through regular site navigation.
- Determine what URLs are different: The key to understanding the differences in the index is to analyze which URLs are present in each of the two search appliances. For a quick comparison, use the Index > Diagnostics > Index Diagnostics interface to see where document counts diverge, then drill down to find the URL patterns that account for the differences.
The more comprehensive approach is to use the Index > Diagnostics > Export URLs interface to generate the list of URLs that exist in each search appliance, so that you can compare the two lists programmatically.
- Update the crawl patterns if necessary: Once the differences between the old and new index URLs have been identified, the search appliance administrator can decide which course of action will provide the best search results. If the content is clearly deprecated and irrelevant, no changes need to be made. Alternatively, the administrator can change the start URLs and crawl patterns to ensure that the missing content is traversed by the crawl and index process.
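The programmatic comparison of the two exported URL lists mentioned above could look something like the following Python sketch. It is only an illustration under stated assumptions: it assumes each export is a text file with one record per line and the URL in the first tab-separated field, which may not match the actual export format on your software version. The function names (load_urls, compare_exports, count_by_prefix) are hypothetical helpers, not part of the search appliance.

```python
from collections import Counter
from urllib.parse import urlsplit


def load_urls(lines):
    """Collect URLs from an iterable of text lines.

    Assumption: the URL is the first tab-separated field on each line.
    Blank lines are skipped.
    """
    urls = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        urls.add(line.split("\t", 1)[0].strip())
    return urls


def compare_exports(old_lines, new_lines):
    """Return (URLs only in the old export, URLs only in the new export)."""
    old_urls = load_urls(old_lines)
    new_urls = load_urls(new_lines)
    return old_urls - new_urls, new_urls - old_urls


def count_by_prefix(urls, depth=1):
    """Count URLs per host plus the first `depth` path segments.

    The resulting prefixes give a rough idea of which areas of a site
    account for the difference between the two indexes.
    """
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s][:depth]
        counts[parts.netloc + "/" + "/".join(segments)] += 1
    return counts
```

To use it, open the two export files (for example, hypothetical files named old_urls.txt and new_urls.txt), pass the file objects to compare_exports, and then pass the "only in old" set to count_by_prefix. Prefixes with large counts are candidates either for new start URLs and crawl patterns or for confirmation that the content is deprecated.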
Workaround: Temporarily establish a GSA^n mirroring environment so that the new search appliance receives a replica of the original index. After the two search appliances are synchronized, sever the mirror so that the search appliances operate independently. This requires both search appliances to run the same version of the search appliance software. For more information, see: Configuring GSA Mirroring