ICEpdf / PDF-13

Add support for multi-threaded document stream loading/parsing

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 5.0.0 alpha1, 5.0
    • Component/s: Core/Parsing
    • Labels:
      None
    • Environment:
      ICEpdf

      Description

      Note from customer:

      Please note that the test program uses two threads to convert two identical PDF files. If I use a single thread to convert the two PDF files sequentially, the comparison succeeds (the image files are the same). If I save the converted images and open them in Microsoft Paint, I can see that one of the images is missing a chart.
      1. Test.java
        2 kB
        Tyler Johnson

        Activity

        Patrick Corless added a comment -

        I've been testing this issue with a similar application that tries to initialize or extract text using multiple threads. After several days of testing and debugging I think I have found a couple of hot spots.

        The first area of concern is the thread access mechanism in the implementations of SeekableInput. There are two implementations, RandomAccessFileInputStream and SeekableByteArrayInputStream. The SeekableInput implementations use a wait/notify mechanism for starting and ending thread access. On a micro scale this locking works pretty well, but it becomes problematic when dealing with more than one input stream per thread. For example:

        Font.init (Thread 1)

        • start thread 1 access
          font.getFontDescriptor (new input stream)
        • start thread 1 access
        • end thread 1 access
              • This is where thread 2 can gain access, leaving the initial thread 1 access hanging.
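
A minimal sketch of the sequence above, assuming the start/end pair behaves like a single owner flag guarded by wait/notify. The class and method names (`NaiveAccess`, `CountedAccess`, `begin`, `end`) are hypothetical and are not ICEpdf's actual SeekableInput API. It shows how a nested end call drops the outer hold, and how a hold count would keep it.

```java
/** Hypothetical single-owner access guard: the inner end() releases everything. */
class NaiveAccess {
    private Thread owner;

    synchronized void begin() throws InterruptedException {
        Thread me = Thread.currentThread();
        while (owner != null && owner != me) {
            wait(); // another thread currently has access
        }
        owner = me;
    }

    synchronized void end() {
        owner = null; // also drops the hold taken by an outer begin()
        notifyAll();
    }

    synchronized boolean heldByCurrentThread() {
        return owner == Thread.currentThread();
    }
}

/** Same idea with a hold count, so access is released only by the outermost end(). */
class CountedAccess {
    private Thread owner;
    private int holds;

    synchronized void begin() throws InterruptedException {
        Thread me = Thread.currentThread();
        while (owner != null && owner != me) {
            wait();
        }
        owner = me;
        holds++;
    }

    synchronized void end() {
        if (owner != Thread.currentThread()) {
            throw new IllegalStateException("current thread does not have access");
        }
        if (--holds == 0) {
            owner = null;
            notifyAll();
        }
    }

    synchronized boolean heldByCurrentThread() {
        return owner == Thread.currentThread();
    }
}

class AccessDemo {
    /** Outer begin, nested begin/end, then checks whether the outer hold survived. */
    static boolean naiveKeepsOuterHold() {
        NaiveAccess access = new NaiveAccess();
        try {
            access.begin(); // Font.init
            access.begin(); // getFontDescriptor opens a second stream
            access.end();   // getFontDescriptor finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return access.heldByCurrentThread();
    }

    static boolean countedKeepsOuterHold() {
        CountedAccess access = new CountedAccess();
        try {
            access.begin();
            access.begin();
            access.end();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean held = access.heldByCurrentThread();
        access.end(); // outermost end releases access
        return held;
    }
}
```

With the naive guard the outer hold is gone after the nested end, which is the window in which a second thread can step in. `java.util.concurrent.locks.ReentrantLock` provides the counted behaviour out of the box.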

        The problem seems to happen more often than not in the SequenceInputStream when parsing through a page whose content is made up of more than one content stream. When SequenceInputStream closes one stream and goes on to the next, it makes it possible for another thread to jump in, which usually causes problems.

        Something that is not clear to me is where we use SeekableInput, because it inherits from InputStream and so it is hard to find all the access points. I'm pretty sure there are still some SeekableInput usages that don't use startThreadAccess and endThreadAccess and thus mess things up.

        There is one last problem that seems to show up due to shared PDF resources. For example, it is possible for a Font object to initialize some other object from the document file, which causes the current thread to lose access to the stream. At that point a new thread gets access and tries to initialize the same font, resulting in a deadlock. The initial font object will be stuck waiting for thread access and the second thread will be stuck waiting for access to the font init method.
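
The stuck state described above can be sketched with two locks taken in opposite order. This assumes the font init method and the stream access are each guarded by their own lock; the names `fontInit` and `streamAccess` are hypothetical. The sketch uses `tryLock()` so that it reports the blocked state instead of hanging.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderSketch {
    /** Returns how many of the two threads could not obtain their second lock. */
    static int blockedThreads() {
        ReentrantLock fontInit = new ReentrantLock();     // assumed guard around font init
        ReentrantLock streamAccess = new ReentrantLock(); // assumed guard around the stream
        CyclicBarrier bothHoldFirst = new CyclicBarrier(2);
        CyclicBarrier bothTried = new CyclicBarrier(2);
        AtomicInteger blocked = new AtomicInteger();

        // Thread 1 is inside font init and wants the stream back.
        Thread t1 = new Thread(() ->
                attempt(fontInit, streamAccess, bothHoldFirst, bothTried, blocked));
        // Thread 2 has the stream and wants to initialize the same font.
        Thread t2 = new Thread(() ->
                attempt(streamAccess, fontInit, bothHoldFirst, bothTried, blocked));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked.get();
    }

    private static void attempt(ReentrantLock first, ReentrantLock second,
                                CyclicBarrier held, CyclicBarrier tried,
                                AtomicInteger blocked) {
        first.lock();
        try {
            held.await();           // both threads now hold their first lock
            if (second.tryLock()) { // a plain lock() here would wait forever
                second.unlock();
            } else {
                blocked.incrementAndGet();
            }
            tried.await();          // keep holding until the other thread has tried
        } catch (InterruptedException | BrokenBarrierException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}
```

Both threads end up blocked. Acquiring the two locks in the same order everywhere, or releasing stream access before entering font init, would be the usual ways to avoid this.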

        Patrick Corless added a comment -

        As I mentioned, I didn't get too far on solving the problem. Interestingly enough, if a PDF file doesn't use shared fonts and has only one content stream per page, we can use as many threads as we want to parse the file. I'm curious whether another, higher-level thread access system could be used to make sure that we don't lose access to an input stream in the middle of an object parse.

        I'm going to check in my modifications as they did fix a couple of smaller problems. Hopefully we can look at this again for the next major release.

        Patrick Corless added a comment -

        A client has sent in a test case; I will take a quick look at this issue for 4.3.4.

        Patrick Corless added a comment -

        As part of the major refactoring for PDF-319, the handling of stream content has been reworked. When a stream is parsed, the uncompressed stream bytes are stored and the file lock is released. The stream is then available for processing on a new thread. This work was also coupled with the removal of the memory manager in favor of a Soft and Weak Reference model. From early testing, three threads seem to be optimal for processing a document. Any more and the file locks tend to slow things down on a 4-core machine; newer systems may be able to do more.
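
A rough sketch of the pattern described above: hold the file lock only long enough to copy out the uncompressed bytes, then hand the in-memory copy to a small worker pool. All names here (`StreamBufferSketch`, `FILE_LOCK`, `countTokens`) are hypothetical, the token count is only a placeholder for real content parsing, and `readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class StreamBufferSketch {
    // Stands in for the single lock that protects the document file.
    private static final Object FILE_LOCK = new Object();

    /** Builds a Flate-compressed stream for the demo. */
    static byte[] deflate(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(text.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Holds the lock only while copying out the uncompressed bytes. */
    static byte[] loadUncompressed(byte[] raw) {
        synchronized (FILE_LOCK) {
            try (InflaterInputStream in =
                         new InflaterInputStream(new ByteArrayInputStream(raw))) {
                return in.readAllBytes();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Counts whitespace-separated tokens; a placeholder for content parsing. */
    static int countTokens(byte[] data) {
        String s = new String(data, StandardCharsets.US_ASCII).trim();
        return s.isEmpty() ? 0 : s.split("\\s+").length;
    }

    /** Loads each stream under the lock, then parses the copies on three workers. */
    static List<Integer> parseAll(List<byte[]> rawStreams) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (byte[] raw : rawStreams) {
                byte[] data = loadUncompressed(raw);                // lock held only here
                pending.add(pool.submit(() -> countTokens(data)));  // lock-free work
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : pending) {
                counts.add(f.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size of three mirrors the early test result mentioned above and would need re-measuring on other hardware.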

        Another interesting refactor was to move image loading onto a new worker thread to avoid pausing the content parse while the image is decoded. The new ImageProxy will, however, block if paint is called before the image is fully decoded. So it is not a true image proxy, but overall parse speed has greatly improved for some image-heavy documents.
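
The blocking behaviour described above could look roughly like the following. This is a sketch built on `CompletableFuture`, not the actual ImageProxy class; `BlockingImageProxy`, `pixels`, and the `int[]` stand-in for decoded image data are all assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical stand-in for an image proxy that decodes off the parse thread. */
class BlockingImageProxy {
    private final CompletableFuture<int[]> decoded;

    BlockingImageProxy(Supplier<int[]> decoder, ExecutorService worker) {
        // Decoding starts on the worker right away; the content parse continues.
        this.decoded = CompletableFuture.supplyAsync(decoder, worker);
    }

    /** True once the worker has finished decoding. */
    boolean isReady() {
        return decoded.isDone();
    }

    /** Called at paint time; blocks until the decode has finished. */
    int[] pixels() {
        return decoded.join();
    }
}

class ImageProxyDemo {
    static int[] decodeAndPaint() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            BlockingImageProxy proxy =
                    new BlockingImageProxy(() -> new int[] {0xFF0000, 0x00FF00}, worker);
            // ... content parsing would continue here ...
            return proxy.pixels(); // paint waits if decoding is still running
        } finally {
            worker.shutdown();
        }
    }
}
```

A proxy that never blocks would instead paint a placeholder while `isReady()` is false and repaint once decoding completes.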

        Hopefully in the future we can take the rework a step further using NIO and avoid having to manage the file marker one thread at a time.
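
One way NIO could help, sketched under the assumption that the document is read through a `FileChannel`: the positional `read(ByteBuffer, long)` overload takes an explicit offset and leaves the channel's own position untouched, so callers do not share a file marker. `PositionalReadSketch` and `readAt` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class PositionalReadSketch {
    /** Reads up to length bytes at offset without moving the channel position. */
    static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos); // positional read
            if (n < 0) {
                break; // end of file
            }
            pos += n;
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }

    /** Reads two regions out of order and appends the channel position afterwards. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("positional", ".bin");
            try {
                Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    String a = new String(readAt(ch, 6, 3), StandardCharsets.US_ASCII);
                    String b = new String(readAt(ch, 0, 3), StandardCharsets.US_ASCII);
                    return a + b + ch.position();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

`FileChannel` is documented as safe for use by multiple concurrent threads, so several parser threads could call `readAt` on the same channel without a start/end access handshake.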


          People

          • Assignee:
            Patrick Corless
            Reporter:
            Tyler Johnson
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved: