Commit Graph

34 Commits

Author SHA1 Message Date
Tissevert d9f69014a0 Make a couple improvements in performance + add an example script to extract pages from a PDF 2020-05-28 18:54:15 +02:00
Tissevert 09bd706748 Export Content operators, needed to write filters like reveal 2020-03-19 10:27:29 +01:00
Tissevert ba7dd6a690 Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneath 2020-03-19 10:27:29 +01:00
Tissevert d21e14f9a4 Hey, zlib isn't needed anymore for getText since all decoding is done directly in the Box instance for Streams 2020-03-19 10:27:28 +01:00
Tissevert f31e9eb38b Generalize Ids out of Content to handle Object Ids too 2020-03-19 10:27:21 +01:00
Tissevert f2a99e1fd2 Reorder module PDF.Body in alphabetical order 2020-03-14 16:25:26 +01:00
Tissevert 5b8d951516 WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messages 2020-03-11 18:55:18 +01:00
Tissevert 3b1a5152e4 Try connecting all the Box instances in the getText demo, try to encode pages contents with a simple assoc list 2020-03-10 22:57:11 +01:00
Tissevert dce10ae63a Keep Page as only a reference object keeping the ObjectId explicit so we can modify the actual objects one day, write an OrderedMap data structure to help 2020-03-08 22:18:47 +01:00
Tissevert a9252b129a Start a Box module to describe inclusion relations between different types and get a MonadState action on the top type for any modification down there 2020-02-23 22:24:59 +01:00
Tissevert bcf2e05bfb Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already) 2020-02-17 15:29:59 +01:00
Tissevert 23186100a8 Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access 2020-02-15 10:25:09 +01:00
Tissevert a72d76e229 Add unit tests to make sure I'm not breaking things too much 2020-02-14 17:58:03 +01:00
Tissevert aed7af376a WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell 2020-02-11 08:29:08 +01:00
Tissevert 9f1b1afafe Implement Text rendering from parsed Content 2020-02-10 10:54:44 +01:00
Tissevert 20466c4f13 WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigation 2020-02-09 22:42:57 +01:00
Tissevert 325250383a Add support for fonts and implement MacRomanEncoding 2020-02-08 08:15:32 +01:00
Tissevert f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) 2019-11-29 11:51:35 +01:00
Tissevert 42a02808c1 Merge branch 'main' into extract-text 2019-11-27 18:05:47 +01:00
Tissevert 380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the `obj` keyword 2019-11-27 18:01:19 +01:00
Tissevert c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module 2019-10-14 10:17:15 +02:00
Tissevert 36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
Tissevert 3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
Tissevert 1dd22c3889 Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings 2019-10-14 10:17:15 +02:00
Tissevert c349d9b4c2 Don't trust serializers, they have nothing to do with a reasonable binary encoding 2019-10-14 10:17:15 +02:00
Tissevert e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
Tissevert 6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00
Tissevert d6994f0813 Release 0.2.0.0 2019-10-14 10:16:14 +02:00
Tissevert 68f90d20e2 Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions 2019-09-22 01:40:39 +02:00
Tissevert 9ab010de61 Add to example programs to show how the lib can be used 2019-09-20 22:42:17 +02:00
Tissevert dd79cb3fc7 Release bugfix v0.1.1.1 2019-05-31 15:16:23 +02:00
Tissevert 11cb6504d7 Go strict ByteStrings with attoparsec 2019-05-24 10:48:09 +02:00
Tissevert b60f337cc4 First useable version 2019-05-18 11:09:03 +02:00
Tissevert 2c165daaa7 Finally opt for uppercase Hufflepdf and rename cabal package 2019-05-18 09:49:31 +02:00