The search appliance crawled too many URLs or the wrong URLs
Summary: The search appliance either crawled too many URLs and reached the license limit, or it crawled URLs that are not supposed to be in the index.
Cause: The search appliance crawls links that it finds in documents it has already crawled. However, it only follows a link if the link matches the crawl patterns listed on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. The following are examples of cases where the search appliance can crawl unintended URLs:
- The crawl patterns are not restrictive enough. For example, a pattern like 'yoursite.com/' causes the appliance to crawl any URL whose hostname ends in 'yoursite.com', such as 'https://dontcrawlme.yoursite.com/dontcrawl.html'.
- The search appliance is recursively crawling URLs. In this case, URLs with repeating patterns, such as 'http://www.yoursite.com/app?arg&arg&arg', appear on the Index > Diagnostics > Index Diagnostics page in the Admin Console.
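The two cases above can be illustrated with a small Python sketch. It is not the appliance's own matcher: `host_matches_suffix` is a hypothetical helper that models only the "hostname ends in" behavior described above, and `has_repeated_args` is a hypothetical heuristic (with an assumed threshold of 3) for spotting URLs with repeating query arguments in a list you have collected yourself.

```python
from urllib.parse import urlsplit, parse_qsl


def host_matches_suffix(url: str, pattern_host: str) -> bool:
    """Simplified model: True if the URL's hostname ends with pattern_host.

    Assumption: this only mimics the suffix behavior described in the text;
    the appliance's real pattern matching has more rules than this.
    """
    host = urlsplit(url).hostname or ""
    return host.endswith(pattern_host)


def has_repeated_args(url: str, threshold: int = 3) -> bool:
    """True if any query argument name appears at least `threshold` times."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    return any(names.count(name) >= threshold for name in set(names))


unwanted = "https://dontcrawlme.yoursite.com/dontcrawl.html"

# The loose host pattern also covers the unwanted subdomain.
print(host_matches_suffix(unwanted, "yoursite.com"))
# A fully qualified host does not.
print(host_matches_suffix(unwanted, "www.yoursite.com"))
# Repeating arguments suggest a recursive URL.
print(has_repeated_args("http://www.yoursite.com/app?arg&arg&arg"))
```

A check like this can help decide whether a pattern should name the full host rather than only the domain before you change the configuration.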
Fix: There are several ways to fix this issue, depending on how urgent it is and which content was crawled:
- Use the coverage tuning feature, located in the Admin Console on the Content Sources > Web Crawl > Coverage Tuning page. This feature lets the administrator limit the number of URLs crawled for specific hosts.
- Use the infinite space detection feature, located in the Admin Console on the Content Sources > Web Crawl > Duplicate Hosts page. This feature prevents the search appliance from crawling duplicate content caused by recursive URLs.
- Add more restrictive patterns to the "Do not crawl URLs with the Following Patterns" box on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console. See the crawl pattern document for information on configuring crawl patterns. Removing documents from the index can take about 30 minutes.
- If the URLs need to be removed from search results in less than 30 minutes, use the remove URLs feature. Note that this does not remove the URLs from the index. It only prevents them from appearing in search results.
- Reset the index. If the entire URL set needs to be cleared, use the reset index feature.
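Before choosing among the fixes above, it can help to see which hosts account for most of the crawled URLs. The following is a minimal sketch, assuming you already have a plain list of crawled URLs (how you obtain that list is outside this example); `urls_per_host`, `hosts_over_limit`, and the limit value are hypothetical and not part of the appliance.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_per_host(urls):
    """Count how many URLs in the list belong to each hostname."""
    return Counter(urlsplit(url).hostname or "" for url in urls)


def hosts_over_limit(urls, limit):
    """Return the hostnames, sorted, that have more than `limit` URLs."""
    counts = urls_per_host(urls)
    return sorted(host for host, count in counts.items() if count > limit)


# Example input: a hand-written list standing in for real crawl data.
crawled = [
    "http://www.yoursite.com/app?arg",
    "http://www.yoursite.com/app?arg&arg",
    "http://www.yoursite.com/app?arg&arg&arg",
    "https://dontcrawlme.yoursite.com/dontcrawl.html",
]

for host in hosts_over_limit(crawled, limit=2):
    print(host)
```

Hosts that show up here are candidates for a coverage tuning limit or a more restrictive block pattern.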