33 Commits

Author SHA1 Message Date
Tissevert d9f69014a0 Make a couple improvements in performance + add an example script to extract pages from a PDF 2020-05-28 18:54:15 +02:00
Tissevert 729e312f90 Actually, the spec calls 'catalog' what we call 'origin' — use 'catalog' for more clarity in regard to the spec 2020-03-19 10:27:29 +01:00
Tissevert 11640c8465 Replace 'cacheFonts' by more versatile 'withFonts' inspired by 'withResources' that avoids having to declare an inline function to capture the 'layer' argument and pass it twice 2020-03-19 10:27:29 +01:00
Tissevert ba7dd6a690 Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneath 2020-03-19 10:27:29 +01:00
Tissevert 5027b079eb Include page numbers in chunks label, needed for long documents with many pages 2020-03-19 10:27:28 +01:00
Tissevert 5722dd1a04 Use IntMap for all Maps on Ids 2020-03-19 10:27:28 +01:00
Tissevert f31e9eb38b Generalize Ids out of Content to handle Object Ids too 2020-03-19 10:27:21 +01:00
Tissevert 0f857c457d Use a defined monadic stack in Pages to lift the MonadReader ambiguity and allow finishing to reimplement getText demo 2020-03-14 16:57:16 +01:00
Tissevert 40475a3093 Clean unneeded stuff separating the monadic type constraint from the actual monad stack used, one more step towards MonadFail -> MonadError 2020-03-14 16:55:34 +01:00
Tissevert 5b8d951516 WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messages 2020-03-11 18:55:18 +01:00
Tissevert 3b1a5152e4 Try connecting all the Box instance in the getText demo, try to encode pages contents with a simple assoc list 2020-03-10 22:57:11 +01:00
Tissevert 2b9abc24b6 Add a separate instance for Raw streams that doesn't try to decode them 2020-03-04 18:31:30 +01:00
Tissevert 309f6ed461 Actually re-implement getText with the simpler Box instance 2020-03-04 18:19:10 +01:00
Tissevert cb257fc07e Rename function for clarity : actually it's doing just what w StreamContent does, but without checking the headers to re-zlib-encode the stream content 2020-02-27 17:30:42 +01:00
Tissevert 99014ff30d Recognize openStream was just an implementation of r for the Box m () Object ByteString, and extend it implementing the w operation while we're at it 2020-02-26 22:13:29 +01:00
Tissevert bcf2e05bfb Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already) 2020-02-17 15:29:59 +01:00
Tissevert 6096a1a237 Simplify navigations by centering everything on Objects to avoid needing too many conversion tools between DirectObject / Object / Dictionary 2020-02-15 13:51:24 +01:00
Tissevert 23186100a8 Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access 2020-02-15 10:25:09 +01:00
Tissevert ae938acc02 Merge branch 'main' into extract-text 2020-02-12 17:34:56 +01:00
Tissevert 325250383a Add support for fonts and implement MacRomanEncoding 2020-02-08 08:15:32 +01:00
Tissevert 8373bd1ea0 Removing +x permission on getText source that shouldn't ever have been set 2019-11-29 19:07:54 +01:00
Tissevert 7eca875900 Improve getObj example to catch non-existing ObjectId and default to listing existing ObjectIds when none is provided 2019-11-29 11:53:08 +01:00
Tissevert f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) 2019-11-29 11:51:35 +01:00
Tissevert c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module 2019-10-14 10:17:15 +02:00
Tissevert 3a3e1533b4 Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue 2019-10-14 10:17:15 +02:00
Tissevert 36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
Tissevert 3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
Tissevert 0374b72920 Finish implementing reading, still bugs to investigate 2019-10-14 10:17:15 +02:00
Tissevert e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
Tissevert f9e5683bf4 WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile) 2019-10-14 10:17:15 +02:00
Tissevert 6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00
Tissevert 68f90d20e2 Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions 2019-09-22 01:40:39 +02:00
Tissevert 9ab010de61 Add two example programs to show how the lib can be used 2019-09-20 22:42:17 +02:00