Commit Graph

34 Commits

SHA1 Message Date
d9f69014a0 Make a couple improvements in performance + add an example script to extract pages from a PDF 2020-05-28 18:54:15 +02:00
09bd706748 Export Content operators, needed to write filters like reveal 2020-03-19 10:27:29 +01:00
ba7dd6a690 Make cacheFonts slightly more useful by passing layer directly to it and run the ReaderT underneath 2020-03-19 10:27:29 +01:00
d21e14f9a4 Hey, zlib isn't needed anymore for getText since all decoding is done directly in the Box instance for Streams 2020-03-19 10:27:28 +01:00
f31e9eb38b Generalize Ids out of Content to handle Object Ids too 2020-03-19 10:27:21 +01:00
f2a99e1fd2 Reorder module PDF.Body in alphabetical order 2020-03-14 16:25:26 +01:00
5b8d951516 WIP: Try about everything that's possible to try, OrderedMap or [(,)], try to decouple the Box instance for Content and the one for Indexed Text, breaks getText… will probably require some advanced effect library, there seems to be a weird MonadReader conflict in the error messages 2020-03-11 18:55:18 +01:00
3b1a5152e4 Try connecting all the Box instances in the getText demo, try to encode pages contents with a simple assoc list 2020-03-10 22:57:11 +01:00
dce10ae63a Keep Page as only a reference object keeping the ObjectId explicit so we can modify the actual objects one day, write an OrderedMap data structure to help 2020-03-08 22:18:47 +01:00
a9252b129a Start a Box module to describe inclusion relations between different types and get a MonadState action on the top type for any modification down there 2020-02-23 22:24:59 +01:00
bcf2e05bfb Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already) 2020-02-17 15:29:59 +01:00
23186100a8 Reimplement getObj with the newest tools in PDF.Object.Navigation, in particular implement browsing by paths or random objectId access 2020-02-15 10:25:09 +01:00
a72d76e229 Add unit tests to make sure I'm not breaking things too much 2020-02-14 17:58:03 +01:00
aed7af376a WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell 2020-02-11 08:29:08 +01:00
9f1b1afafe Implement Text rendering from parsed Content 2020-02-10 10:54:44 +01:00
20466c4f13 WIP: Clean code parsing «pages» (now Content), separated from text rendering (will be reimplemented as an upper layer, also providing modification as stream filters) — Page is also forgotten for now, will need a big improvement in Object navigation 2020-02-09 22:42:57 +01:00
325250383a Add support for fonts and implement MacRomanEncoding 2020-02-08 08:15:32 +01:00
f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) 2019-11-29 11:51:35 +01:00
42a02808c1 Merge branch 'main' into extract-text 2019-11-27 18:05:47 +01:00
380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the obj keyword 2019-11-27 18:01:19 +01:00
c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module 2019-10-14 10:17:15 +02:00
36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
1dd22c3889 Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings 2019-10-14 10:17:15 +02:00
c349d9b4c2 Don't trust serializers, they have nothing to do with a reasonable binary encoding 2019-10-14 10:17:15 +02:00
e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00
d6994f0813 Release 0.2.0.0 2019-10-14 10:16:14 +02:00
68f90d20e2 Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions 2019-09-22 01:40:39 +02:00
9ab010de61 Add two example programs to show how the lib can be used 2019-09-20 22:42:17 +02:00
dd79cb3fc7 Release bugfix v0.1.1.1 2019-05-31 15:16:23 +02:00
11cb6504d7 Go strict ByteStrings with attoparsec 2019-05-24 10:48:09 +02:00
b60f337cc4 First useable version 2019-05-18 11:09:03 +02:00
2c165daaa7 Finally opt for uppercase Hufflepdf and rename cabal package 2019-05-18 09:49:31 +02:00