ICEpdf / PDF-418

Update text extraction to convert Ligatures in Unicode to plain text

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3.2
    • Fix Version/s: 4.3.3
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      any

      Description

      This one came in via a support request. The file in question has embedded CID fonts, and the toUnicode conversion is handled correctly. However, there is a slight hiccup related to typographic ligatures and our search system.

      When the word "flight" is extracted, the leading "fl" comes back as a single ligature character, so the text is actually "U+FB02ight". If pasted into a UTF-8 encoded document, it looks as it should. However, if someone searches the document for "flight", the match will not be found.

      The trick now is how to efficiently update our text parser to convert ligatures to their respective "full" Unicode values. Code points 64257 through 64261 (U+FB01 through U+FB05) are common Latin ligatures, each of which would need to be converted to two or three individual character codes.

      Ligature  Code point  Plain text
      ﬁ         U+FB01      fi
      ﬂ         U+FB02      fl
      ﬃ         U+FB03      ffi
      ﬄ         U+FB04      ffl
      ﬅ         U+FB05      ſt
      ﬆ         U+FB06      st

      I think this can be implemented fairly easily; we just need to research the full scope of how many different ligatures are available.
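
      The mapping listed above could be applied with a small lookup over the contiguous code point range. The following is only a minimal sketch, not the change that shipped in 4.3.3: the class name LigatureExpander and the method expand are hypothetical and are not part of the ICEpdf API. It also makes two assumptions beyond the ticket: U+FB00 (ff) is included because it sits directly before the listed range, and U+FB05 is mapped to a plain "st" rather than long s + t so that ordinary searches match.

```java
/** Hypothetical helper for illustration; not part of the ICEpdf API. */
final class LigatureExpander {

    // First and last code points of the Latin ligature range handled here.
    private static final char FIRST = 0xFB00;
    private static final char LAST = 0xFB06;

    // Expansions indexed by (code point - FIRST).
    private static final String[] EXPANSIONS = {
        "ff",   // U+FB00, assumption: not listed in the ticket
        "fi",   // U+FB01
        "fl",   // U+FB02
        "ffi",  // U+FB03
        "ffl",  // U+FB04
        "st",   // U+FB05, assumption: plain s instead of long s
        "st"    // U+FB06
    };

    private LigatureExpander() {
    }

    /** Returns text with Latin ligature characters replaced by plain letters. */
    static String expand(String text) {
        StringBuilder out = null; // allocated only if a ligature is found
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= FIRST && c <= LAST) {
                if (out == null) {
                    // Copy everything seen so far, then switch to the builder.
                    out = new StringBuilder(text.length() + 4).append(text, 0, i);
                }
                out.append(EXPANSIONS[c - FIRST]);
            } else if (out != null) {
                out.append(c);
            }
        }
        return out == null ? text : out.toString();
    }
}
```

      With this sketch, LigatureExpander.expand("\uFB02ight") yields "flight", so a plain-text search would match. A broader alternative is java.text.Normalizer with Normalizer.Form.NFKC, which also expands these ligatures, but it rewrites many other compatibility characters as well, so it may change more of the extracted text than intended.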

        Activity

        Repository: ICEsoft Public SVN Repository
        Revision: #28778
        Date: Fri Apr 20 08:44:49 MDT 2012
        User: patrick.corless
        Message: Creating tag for patch release for PDF-418
        Files Changed: ADD /icepdf/tags/icepdf-4.3.2_lsa

          People

          • Assignee:
            Patrick Corless
            Reporter:
            Patrick Corless
          • Votes:
            0
            Watchers:
            0
