It turns out that there was a bug in our content parser logic and cmap logic. When text extraction takes place the toUnicode cmaps are used to map a CID to a String. In our old code we would make a CID to a char but there cases where CID can be represented by more then character.
In the Pro version the issue only showed up with text extraction as it uses the toUnocide cmaps. A little work had to be done to fix the cmap parsing issue. The problem didn't show up at render time because Pro can property read and render CID fonts.
In the OS version the issue showed up at render time and during text extraction. Some work had to be done to first correct the cmap parsing error so we could get the correct String for a given CID and the OFont class was updated to layout and draw the extra characters found in the returned Cmap string.
[ Show » ]
Patrick Corless added a comment - 24/Feb/11 12:52 PM It turns out that there was a bug in our content parser logic and cmap logic. When text extraction takes place the toUnicode cmaps are used to map a CID to a String. In our old code we would make a CID to a char but there cases where CID can be represented by more then character. In the Pro version the issue only showed up with text extraction as it uses the toUnocide cmaps. A little work had to be done to fix the cmap parsing issue. The problem didn't show up at render time because Pro can property read and render CID fonts. In the OS version the issue showed up at render time and during text extraction. Some work had to be done to first correct the cmap parsing error so we could get the correct String for a given CID and the OFont class was updated to layout and draw the extra characters found in the returned Cmap string.
the code for this bug was checked in accedentally under PDF-228, -r24000
I'll have to take a closer look next week. Generally speaking there should be a entry in the toUnicode to handle the conversion but I guess that is difficult given the one to two mapping. Interestingly enough acrobat doesn't do a stellar job either. I suspect the we'll have ot put in some stiffer code to look for these types of glyphs.