[PDF-356] Error extracting CID font Unicode data - ICEsoft JIRA Issue Tracker

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 4.2.2
Fix Version/s: 4.3
Component/s: Core/Parsing
Labels:
None
Environment:
pro

Description

The file in question was sent in by a client. Page 9 of the PDF contains some Asian characters backed by a CID font with no Unicode data. When the client uses the OS X Preview tool, the Asian text is extracted correctly. However, when ICEpdf is used, the extracted text is not correctly encoded as Unicode.

I've taken one pass at this issue. The CID font in question has no Unicode data associated with it, so I can only assume the original CIDs are valid Unicode values. For the first run of Asian characters the CIDs are:

5797, 3388, 2694, 3879, 618, 1186, 5242, 1625 etc.

The CID-to-GID mapping is:
595 -> 1
601 -> 2
618 -> 3
660 -> 4
752 -> 5
1186 -> 6
1393 -> 7
1625 -> 8
1954 -> 9
2694 -> 10
3388 -> 11
3543 -> 12
3879 -> 13
4469 -> 14
5242 -> 15
5797 -> 16
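
For reference, the mapping above can be written as a small lookup table. This is only a sketch: the class name CidToGidSketch, the method gidFor, and the choice of returning -1 for an unmapped CID are illustrative assumptions, not ICEpdf API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of ICEpdf: holds the CID-to-GID pairs listed in this issue.
final class CidToGidSketch {
    private static final int[] CIDS = {
        595, 601, 618, 660, 752, 1186, 1393, 1625,
        1954, 2694, 3388, 3543, 3879, 4469, 5242, 5797
    };
    private static final Map<Integer, Integer> CID_TO_GID = new HashMap<>();

    static {
        // In the listed table the GIDs are simply 1..16 in ascending CID order.
        for (int i = 0; i < CIDS.length; i++) {
            CID_TO_GID.put(CIDS[i], i + 1);
        }
    }

    // Returns the GID for a CID, or -1 if the CID is not in the table
    // (the -1 convention is an assumption of this sketch).
    static int gidFor(int cid) {
        Integer gid = CID_TO_GID.get(cid);
        return gid == null ? -1 : gid;
    }
}
```

With this table the first CIDs on the page (5797, 3388, 2694, 3879, 618, 1186, 5242, 1625) resolve to GIDs 16, 11, 10, 13, 3, 6, 15, 8.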

When we output one of these codes via text extraction, we don't do any special byte handling for UTF-8, which I think is causing the output issue. More debugging is needed to get to the bottom of this one.
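
A minimal sketch of what explicit handling could look like, under the unverified assumption stated above that each CID is itself a Unicode code point. The class CidUnicodeSketch and its methods are hypothetical names for illustration and do not reflect how ICEpdf extracts text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ICEpdf.
final class CidUnicodeSketch {

    // Treats each CID as a Unicode code point (an assumption, not a confirmed fact
    // about this font) and appends it to a Java string. Values outside the Unicode
    // range are replaced with U+FFFD.
    static String cidsToString(int[] cids) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            if (Character.isValidCodePoint(cid)) {
                sb.appendCodePoint(cid);
            } else {
                sb.append('\uFFFD');
            }
        }
        return sb.toString();
    }

    // Encodes with an explicit charset rather than relying on the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Under that assumption, CID 5797 is code point U+16A5 and encodes to the three UTF-8 bytes E1 9A A5. Comparing such bytes against what the extraction currently writes is one way to check whether the encoding step or the CID-as-Unicode assumption is at fault.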

People

Assignee:

Patrick Corless

Reporter:

Patrick Corless

Dates

Created:

30/Nov/11 1:55 PM

Updated:

29/Mar/12 11:42 AM

Resolved:

01/Dec/11 9:17 AM