Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 6.2.5
- Fix Version/s: 6.3
- Component/s: Core/Parsing, Viewer RI
- Labels: None
- Environment: any
Description
A client provided a test case: a document exported from OpenOffice and saved in both PDF and PDF/A formats. Searching the PDF/A document returned no results, whereas the PDF version worked more or less as expected.
After quite a bit of debugging it was found that the PDF/A document used a non-breaking space (char 160) instead of the usual space (char 32). I took a good look at how search is currently set up and made some fairly far-reaching changes. Search terms no longer take white space into account when matching. The end result is some high-quality search results. However, given the huge number of variations in PDF notation, we'll need to kick the tires a bit to make sure we didn't break anything.
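As a rough illustration of the idea (not the actual change committed for this issue), whitespace-insensitive matching can be sketched by dropping every space-like character from both the page text and the search term before comparing. The class and method names below are hypothetical. Note that `Character.isWhitespace` does not treat char 160 as whitespace, so the sketch also checks `Character.isSpaceChar`.

```java
// Hypothetical helper, for illustration only; not the library's search code.
final class WhitespaceInsensitiveSearch {

    // True for ordinary whitespace (char 32, tabs, newlines) and for
    // Unicode space separators such as the non-breaking space (char 160).
    static boolean isAnySpace(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Returns the input with every space-like character removed.
    static String stripSpaces(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isAnySpace(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Matches the term against the text while ignoring all white space.
    // A term made up only of spaces never matches.
    static boolean matches(String pageText, String term) {
        String needle = stripSpaces(term);
        return !needle.isEmpty() && stripSpaces(pageText).contains(needle);
    }
}
```

With this approach a search for "Hello World" typed with char 32 finds page text that stores the gap as char 160. The trade-off is that word boundaries are ignored entirely, which is one reason the change needs broad testing.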
The PDF/A document also had another peculiarity: all text in the document was written as marked content, with two names, "Marked" and "Artifact". While debugging the problems in general I noticed that the optionalContentGroup class was not generating the correct hash value based on the optional content name, and equals had the same problem. This single issue was causing a lot of duplicate text to be stored in the PageText Model.
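A minimal sketch of the equals/hashCode contract described above, assuming the group's identity is determined by its name alone. `NamedContentGroup` is a hypothetical stand-in, not the real optionalContentGroup class, which may carry more state.

```java
import java.util.Objects;

// Hypothetical stand-in for a content group identified by its name.
final class NamedContentGroup {
    private final String name;

    NamedContentGroup(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Two groups are equal when their names are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof NamedContentGroup)) {
            return false;
        }
        return Objects.equals(name, ((NamedContentGroup) o).name);
    }

    // Must agree with equals: equal names give equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hashCode(name);
    }
}
```

When both methods are derived from the same field, hash-based collections treat repeated "Marked" or "Artifact" groups as one key instead of creating a new entry each time, which is the kind of duplication the description attributes to the bug.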