The JBIG2 library first decodes the JBIG2-compressed input from a byte array, and then allows the caller to get the BufferedImage. PDF-9 optimised the memory usage of getting the BufferedImage.
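For context, a minimal sketch of that two-step flow, assuming the JPedal-derived decoder API used in these sources (decodeJBIG2 and getPageAsBufferedImage; the input file name is illustrative):

```java
import java.awt.image.BufferedImage;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.jpedal.jbig2.JBIG2Decoder;

public class JBIG2Usage {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: a raw JBIG2-compressed byte stream.
        byte[] data = Files.readAllBytes(Paths.get("scan.jbig2"));

        JBIG2Decoder decoder = new JBIG2Decoder();
        decoder.decodeJBIG2(data);                  // step 1: decode the compressed input
        BufferedImage image =
                decoder.getPageAsBufferedImage(0);  // step 2: fetch the decoded page
    }
}
```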
This optimisation reduces memory usage after decoding, but before getting the BufferedImage. Analysis of the JBIG2 decoder's internal data structures shows that intermediary sections of the JBIG2 image are assembled together to form the final image. While decoding, these subsections must remain accessible, but after decoding, these subsections (called "segments") become redundant, since the page segment then holds the entire image. So I've added cleanup code to clear away every segment except the page segments, as sketched below. Depending on the format of any given JBIG2 image in my test suite, these other segments accounted for anywhere between several kilobytes and more than 4 megabytes for a full-page fax scan.
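A minimal sketch of the cleanup idea; the Segment type, isPageSegment check, and cleanUpAfterDecode name are all illustrative, not the actual ICEpdf/JPedal code:

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only: assumes a decoder that keeps its decoded
// segments in a List<Segment>, where page segments carry the final bitmap.
final class SegmentCleanup {

    // Drop every segment except the page segments once decoding is done;
    // the page segments already hold the assembled image, so the symbol,
    // text, and generic-region segments are redundant at this point.
    static void cleanUpAfterDecode(List<Segment> segments) {
        for (Iterator<Segment> it = segments.iterator(); it.hasNext();) {
            if (!it.next().isPageSegment()) {
                it.remove(); // release the intermediate segment for GC
            }
        }
    }

    interface Segment {
        boolean isPageSegment();
    }
}
```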
Initial analysis does not indicate that those segments can be cleared away progressively as they are decoded. The whole file consists of different segment types, one after the other, and the PDF spec dictates that they be processed in turn; since any later segment can refer to any earlier segment, there is no way of knowing that a segment is no longer needed until the last segment has been processed. JBIG2 also has a random-access mode of decoding that might allow for more intelligent dependency calculations, but it does not appear to apply to JBIG2 images embedded in PDFs.
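To illustrate why progressive cleanup is unsafe, here is a hypothetical model of the sequential processing loop (the SegmentRecord type and process method are invented for this sketch; JBIG2 segment headers really do carry referred-to segment numbers):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of sequential (embedded-organisation) JBIG2 decoding.
final class SequentialDecode {

    static void process(List<SegmentRecord> stream) {
        Map<Integer, SegmentRecord> decoded = new HashMap<>();
        for (SegmentRecord segment : stream) {
            // A later segment's header may refer back to any earlier segment,
            // so every decoded segment must stay resident until the stream ends.
            for (int ref : segment.referredToSegments) {
                SegmentRecord dependency = decoded.get(ref);
                // ... use dependency (e.g. a symbol dictionary) while decoding
            }
            decoded.put(segment.number, segment);
        }
        // Only here, after the final segment, is it safe to discard
        // everything except the page segments.
    }

    static final class SegmentRecord {
        int number;
        int[] referredToSegments;
    }
}
```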
Here is the post-decode cleanup commit.
Subversion 19987
icepdf\core\src\org\icepdf\core\pobjects\Stream.java
icepdf\core\src\org\jpedal\jbig2\JBIG2Decoder.java
icepdf\core\src\org\jpedal\jbig2\decoders\JBIG2StreamDecoder.java