PDF conversion produces unreadable documents

Summary: After upgrading to version 7.6.x, you may find that some PDF files are converted and indexed incorrectly, even though the same files were indexed properly in version 7.4. These files cannot be found by searching for text from the original PDF, and their search result snippets are corrupted.

Cause: Issue #62118098 "PDF conversion on 7.6.x produces unreadable documents."

The new PDF converter does not correctly convert PDFs with embedded CID fonts, which are frequently used for CJK languages.

Fix: Ensure that your GSA is running version 7.6.50.G.64 or later, then open a new support case with Google Search Appliance support.

In the support case, provide examples of PDF files that were converted correctly in version 7.4 but are unreadable in 7.6.x. The support team will verify whether you have encountered issue #62118098 and, if so, will provide instructions for switching your search appliance to the old PDF converter.
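
If you manage several appliances, the version requirement above can be checked with a short script. The following is a minimal sketch in Python, not an official tool. It assumes that version strings follow the major.minor.patch.G.build pattern seen in this article (for example, 7.6.50.G.64). The function names parse_gsa_version and meets_minimum are hypothetical, and you need to supply the version string yourself (for example, by copying it from the Admin Console).

```python
import re

# Minimum version named in the Fix section: 7.6.50.G.64.
MIN_FIXED = (7, 6, 50, 64)


def parse_gsa_version(version):
    """Parse 'major.minor.patch.G.build' into a tuple of four ints.

    The string format is an assumption based on the versions quoted
    in this article; other formats raise ValueError.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)\.G\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized GSA version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=MIN_FIXED):
    """Return True if the version is at or above the minimum."""
    # Tuples compare element by element, so 7.6.50 ranks above 7.6.0.
    return parse_gsa_version(version) >= minimum
```

For example, meets_minimum("7.6.50.G.30") is False, so that appliance would need an update before you open the support case.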

Note: The new converter introduced in 7.6.x has fixed the following known issues:

  • 13336749 - GSA does not pick the right title for PDF file.
  • 9693599 - Some PDFs in format PDF/A, or generated with ghostscript, are not correctly indexed.
  • 650575 - PDF Document Title uses document content instead of Title property.

If your GSA was affected by any of these issues, switching back to the old converter will reintroduce them.

Versions affected: 7.6.0.G.X, 7.6.50.G.X
