Commit Graph

41 Commits

SHA1 Message Date
d9f69014a0 Make a couple improvements in performance + add an example script to extract pages from a PDF 2020-05-28 18:54:15 +02:00
a1c2fbf110 Add an alias to Id to lift type ambiguities like 'chunk' in PDF.Content.Text 2020-03-19 10:27:28 +01:00
5722dd1a04 Use IntMap for all Maps on Ids 2020-03-19 10:27:28 +01:00
f31e9eb38b Generalize Ids out of Content to handle Object Ids too 2020-03-19 10:27:21 +01:00
bcf2e05bfb Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already) 2020-02-17 15:29:59 +01:00
923d1800b0 Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to read 2020-02-14 18:02:40 +01:00
1c457d71d8 Fix the reading of Hexadecimal string objects detected by running the tests implemented from the spec 2020-02-14 18:00:12 +01:00
a72d76e229 Add unit tests to make sure I'm not breaking things too much 2020-02-14 17:58:03 +01:00
919f640443 Merge branch 'extract-text' into navigation 2020-02-12 17:35:56 +01:00
32f9866106 Use peek to improve directObject parser avoiding a large <|> disjunction 2020-02-12 17:34:27 +01:00
704d7a7fcf It turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…) 2020-02-11 17:36:29 +01:00
aed7af376a WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell 2020-02-11 08:29:08 +01:00
e77bbbcda9 WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing 2020-02-10 17:43:04 +01:00
42a02808c1 Merge branch 'main' into extract-text 2019-11-27 18:05:47 +01:00
380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the obj keyword 2019-11-27 18:01:19 +01:00
3a3e1533b4 Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue 2019-10-14 10:17:15 +02:00
d07c286f8e Clean exported ByteString custom functions 2019-10-14 10:17:15 +02:00
36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
c349d9b4c2 Don't trust serializers, they have nothing to do with a reasonable binary encoding 2019-10-14 10:17:15 +02:00
e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
b8eb9e6856 Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec 2019-10-14 10:17:15 +02:00
51db57ec67 Ugly commit, breaks everything, still trying to figure a grammar for text 2019-10-14 10:17:15 +02:00
6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00
3a39c75e6a Stop requiring an empty line between subsections in a xref section 2019-09-22 01:37:28 +02:00
29c5823f34 Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file 2019-09-22 01:34:17 +02:00
699f830a45 Simplify XRef structure, clarify integer types and remove nextLine 2019-09-20 22:39:14 +02:00
264b0dc92b Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found 2019-05-31 15:08:54 +02:00
9dac275f68 Keep comment-opening '%' along with the comment and support empty lines 2019-05-31 15:07:41 +02:00
85e4eb9273 Fix bypassed error message for lines + add one for occurrences 2019-05-31 15:06:20 +02:00
11cb6504d7 Go strict ByteStrings with attoparsec 2019-05-24 10:48:09 +02:00
5614a25048 Generate valid PDF 2019-05-18 09:01:13 +02:00
0336baa687 Fix output implementation with dynamic XRefs 2019-05-17 16:14:06 +02:00
e23618da68 Implement output 2019-05-16 22:41:14 +02:00
645466024a Starting to implement output with String builder 2019-05-16 17:04:45 +02:00
9b2f890227 Boyer-Moore is canceled, implement the rest of parsing with naive search 2019-05-16 11:01:50 +02:00
fc41f815a3 Broken state: trying to implement Boyer-Moore for fast-forwarding to the end of a section 2019-05-15 19:13:35 +02:00
379a821550 Fix bugs preventing the objects from loading 2019-05-15 15:03:55 +02:00
44508a204c Reuse Parser type in PDF.Body (and generalize the type of the comment parser) 2019-05-15 09:04:17 +02:00
91292d6401 Implement retrieving objects in the body of the document and use it to populate the structure previously parsed 2019-05-14 18:42:11 +02:00
8043f84da8 Cut PDF module in two, implement basic parsing up to reading XRef table 2019-05-13 18:22:05 +02:00