[PDF-418] Update text extraction to convert Ligatures in Unicode to plain text - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.3.2
Fix Version/s: 4.3.3
Component/s: Core/Parsing
Labels: None
Environment: any

Description

This one came in via a support request. The file in question has embedded CID fonts, and the toUnicode conversion is handled correctly. However, there is a slight hiccup related to typographic ligatures and our search system.

When the word "flight" is rendered, the "fl" pair is extracted as a single ligature code point, giving "U+FB02ight"; pasted into a UTF-8 encoded document it looks as it should. However, if someone searches the document for "flight", the match will not be found.
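A minimal sketch of why the search misses, assuming the "fl" pair was extracted as the single fl-ligature code point U+FB02 (see the table below). The class and method names are hypothetical and are not part of ICEpdf; plain `String.contains` stands in for the search system.

```java
// Hypothetical demo class; not part of the ICEpdf code base.
final class LigatureSearchDemo {
    // Extracted text as described above: U+FB02 followed by "ight".
    static final String EXTRACTED = "\uFB02ight";

    // Stand-in for the search system: a plain substring match.
    static boolean found(String text, String term) {
        return text.contains(term);
    }

    public static void main(String[] args) {
        // The ligature is one char, so the string is 5 chars long, not 6.
        System.out.println(EXTRACTED.length());                                   // 5
        // Looks like "flight" on screen, but the substring match fails.
        System.out.println(found(EXTRACTED, "flight"));                           // false
        // Once the ligature is expanded to "fl", the match succeeds.
        System.out.println(found(EXTRACTED.replace("\uFB02", "fl"), "flight"));   // true
    }
}
```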

The trick now is how to efficiently update our text parser to convert ligatures to their respective "full" Unicode values. Decimal values 64257 -> 64261 (U+FB01 -> U+FB05) are common Latin codes which would need to be converted to two or three individual character codes.

fi ﬁ U+FB01 ﬁ
fl ﬂ U+FB02 ﬂ
ffi ﬃ U+FB03 ﬃ
ffl ﬄ U+FB04 ﬄ
ſt ﬅ U+FB05 ﬅ
st ﬆ U+FB06 ﬆ

I think this can be implemented fairly easily; we just need to research the full scope of how many different ligatures are available.
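One possible way to survey the Latin ligatures is to ask the JDK for their compatibility decompositions. The sketch below walks U+FB00 through U+FB06 and normalizes each code point with `java.text.Normalizer` in NFKC form. Two assumptions to note: U+FB00 (ff) is not in the list above but sits in the same Unicode block, and NFKC folds the long s in U+FB05 to a plain "s", which differs from the "ſt" shown in the table. The class name `LigatureSurvey` is hypothetical.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the ICEpdf code base.
final class LigatureSurvey {

    // Maps each Latin ligature in U+FB00..U+FB06 to its NFKC expansion.
    static Map<Character, String> latinLigatures() {
        Map<Character, String> map = new LinkedHashMap<>();
        for (char c = '\uFB00'; c <= '\uFB06'; c++) {
            String expanded = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC);
            map.put(c, expanded);
        }
        return map;
    }

    public static void main(String[] args) {
        // Print one line per ligature: code point and its expansion.
        for (Map.Entry<Character, String> e : latinLigatures().entrySet()) {
            System.out.printf("U+%04X -> %s%n", (int) e.getKey(), e.getValue());
        }
    }
}
```

Whether a blanket normalization is acceptable for extracted text is a separate question, since NFKC also rewrites many non-ligature characters; a fixed table limited to the ligatures keeps the change narrower.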

Activity

Patrick Corless created issue - 04/Apr/12 1:32 PM

Evgheni Sadovoi made changes - 04/Apr/12 2:05 PM

Field	Original Value	New Value
Salesforce Case		[5007000000LGR7W]

Repository	Revision	Date	User	Message
ICEsoft Public SVN Repository	#28778	Fri Apr 20 08:44:49 MDT 2012	patrick.corless	Creating tag for patch release for PDF-418
				Files Changed
				ADD /icepdf/tags/icepdf-4.3.2_lsa

Patrick Corless added a comment - 20/Apr/12 9:43 AM

I've created a new ligature index in the CMAP class. This map is accessed when the toUnicode() method is called; it checks whether the converted CID matches any of the Latin ligatures and, if so, substitutes the correct number of respective characters. This addresses the search and text extraction issues and doesn't appear to slow down the content parser in a significant way.
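The lookup described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual ICEpdf CMAP code: the class `LigatureIndex`, the method `expand`, and the `toUnicode(int[])` signature are all hypothetical, and the table holds only the six ligatures listed in the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ligature index; not the real ICEpdf class.
final class LigatureIndex {
    private static final Map<Integer, String> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put(0xFB01, "fi");
        LIGATURES.put(0xFB02, "fl");
        LIGATURES.put(0xFB03, "ffi");
        LIGATURES.put(0xFB04, "ffl");
        LIGATURES.put(0xFB05, "\u017Ft"); // long s + t, as listed in the description
        LIGATURES.put(0xFB06, "st");
    }

    // Returns the expansion for a ligature, or the code point itself as a string.
    static String expand(int codePoint) {
        String replacement = LIGATURES.get(codePoint);
        return replacement != null
                ? replacement
                : new String(Character.toChars(codePoint));
    }

    // Hypothetical conversion step: takes already-converted code points
    // and substitutes plain characters for any ligature it encounters.
    static String toUnicode(int[] convertedCodes) {
        StringBuilder out = new StringBuilder();
        for (int code : convertedCodes) {
            out.append(expand(code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0xFB02, 'i', 'g', 'h', 't'};
        System.out.println(toUnicode(codes)); // flight
    }
}
```

A hash lookup per converted code is constant time, which is consistent with the observation that the parser does not slow down noticeably.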

Patrick Corless made changes - 20/Apr/12 9:43 AM

Fix Version/s		4.3.3 [ 10333 ]
Fix Version/s	5.0 [ 10314 ]

Patrick Corless added a comment - 20/Apr/12 11:07 AM

Marking as fixed.

Patrick Corless made changes - 20/Apr/12 11:07 AM

Status	Open [ 1 ]	Resolved [ 5 ]
Resolution		Fixed [ 1 ]

Patrick Corless made changes - 01/Apr/15 3:01 PM

Status	Resolved [ 5 ]	Closed [ 6 ]

People

Assignee: Patrick Corless
Reporter: Patrick Corless
Votes: 0
Watchers: 0

Dates

Created: 04/Apr/12 1:32 PM
Updated: 01/Apr/15 3:01 PM
Resolved: 20/Apr/12 11:07 AM