[PDF-1211] Page text not correctly storing marking optional content text. - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 6.2.5
Fix Version/s: 6.3
Component/s: Core/Parsing, Viewer RI
Labels:
None
Environment:
any

Description

A client provided a test case: a document exported from OpenOffice and saved in both PDF and PDF/A formats. Searching the PDF/A document returned no results, whereas the PDF version worked more or less as expected.

Activity

Patrick Corless added a comment - 31/Oct/17 11:31 AM

After quite a bit of debugging, it was found that the PDF/A document used a non-breaking space (char 160) instead of the usual space (char 32). I took a good look at how search is currently set up and have made some fairly far-reaching changes. Search terms will no longer take white space into account when applying matches. The end result is some high-quality search results. However, given the huge number of variations in PDF notation, we'll need to kick the tires a bit to make sure we didn't break anything.
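As an illustration only, whitespace-insensitive matching could look roughly like the sketch below. The class and method names (`WhitespaceInsensitiveSearch`, `stripWhitespace`, `contains`) are hypothetical and are not taken from the ICEpdf source; the actual change in 6.3 may be structured quite differently.

```java
// Hypothetical sketch, not the ICEpdf implementation.
final class WhitespaceInsensitiveSearch {

    // Drops every whitespace-like character. Character.isWhitespace()
    // deliberately excludes the non-breaking space (char 160, U+00A0),
    // so Character.isSpaceChar() is checked as well to catch it.
    static String stripWhitespace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c) && !Character.isSpaceChar(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns true when the search term occurs in the page text once
    // white space has been removed from both sides of the comparison.
    static boolean contains(String pageText, String searchTerm) {
        String needle = stripWhitespace(searchTerm);
        if (needle.isEmpty()) {
            return false;
        }
        return stripWhitespace(pageText).contains(needle);
    }
}
```

With this approach a term typed with a regular space (char 32) matches page text that uses char 160. One trade-off to keep in mind while testing: a term such as "open office" would also match "openoffice".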

The PDF/A document also had another peculiarity: all text in the document was written as marked content, with two names, "Marked" and "Artifact". While debugging the problems in general, I noticed that the optionalContentGroup class was not generating the correct hash value based on the optional content name, and similarly for equals. This single issue was causing a lot of duplicate text to be stored in the PageText Model.
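To show why a name-based equals/hashCode pair matters here, the following is a minimal sketch using a hypothetical `NamedContentGroup` class. It is a stand-in for illustration, not the real optionalContentGroup class, whose fields and fix are not shown in this issue.

```java
import java.util.Objects;

// Hypothetical stand-in for a content group that is identified by name.
final class NamedContentGroup {
    private final String name;

    NamedContentGroup(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Two groups are equal when their names are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof NamedContentGroup)) {
            return false;
        }
        return Objects.equals(name, ((NamedContentGroup) other).name);
    }

    // Must agree with equals(): equal names give equal hash values,
    // so hash-based collections treat repeated names as one key.
    @Override
    public int hashCode() {
        return Objects.hashCode(name);
    }
}
```

Without these overrides, every newly parsed instance named "Artifact" would count as a distinct key in a `HashMap` or `HashSet`, which is one way the same text could end up stored more than once.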

Patrick Corless added a comment - 08/Jan/18 1:36 PM

Marking as fixed.

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Votes:

0

Watchers:

1

Dates

Created:

31/Oct/17 11:21 AM

Updated:

25/Jan/18 12:58 PM

Resolved:

08/Jan/18 1:36 PM