Hufflepdf

Author	SHA1	Message	Date
Tissevert	d9f69014a0	Make a couple improvements in performance + add an example script to extract pages from a PDF	2020-05-28 18:54:15 +02:00
Tissevert	a1c2fbf110	Add an alias to Id to lift type ambiguities like 'chunk' in PDF.Content.Text	2020-03-19 10:27:28 +01:00
Tissevert	5722dd1a04	Use IntMap for all Maps on Ids	2020-03-19 10:27:28 +01:00
Tissevert	f31e9eb38b	Generalize Ids out of Content to handle Object Ids too	2020-03-19 10:27:21 +01:00
Tissevert	bcf2e05bfb	Move Content out of Object module into a separate one incorporating PDF.Update (which is actually an operation that is defined only on that structure), and rename it Layer to avoid confusion with Content streams as defined in the specs (which have their own PDF.Content module already)	2020-02-17 15:29:59 +01:00
Tissevert	923d1800b0	Gain a bit of speed by using native Attoparsec for number types instead of reimplementing them with ByteString conversion and call to read	2020-02-14 18:02:40 +01:00
Tissevert	1c457d71d8	Fix the reading of Hexadecimal string objects detected by running the tests implemented from the spec	2020-02-14 18:00:12 +01:00
Tissevert	a72d76e229	Add unit tests to make sure I'm not breaking things too much	2020-02-14 17:58:03 +01:00
Tissevert	919f640443	Merge branch 'extract-text' into navigation	2020-02-12 17:35:56 +01:00
Tissevert	32f9866106	Use peek to improve directObject parser avoiding a large <|> disjunction	2020-02-12 17:34:27 +01:00
Tissevert	704d7a7fcf	It turns out Output.concat wasn't necessary, OBuilder already seems to be a Monoid so mconcat works (that fact was used in the very implementation of concat…)	2020-02-11 17:36:29 +01:00
Tissevert	aed7af376a	WIP: still trying to figure things out, moved to a separate submodule for Navigation, proper naming is hell	2020-02-11 08:29:08 +01:00
Tissevert	e77bbbcda9	WIP: start moving some navigation-related routines from Pages into Object directly and generalize them to multi-component to allow easier browsing	2020-02-10 17:43:04 +01:00
Tissevert	42a02808c1	Merge branch 'main' into extract-text	2019-11-27 18:05:47 +01:00
Tissevert	380c1e439b	Fix a bug preventing Hufflepdf from reading objects with a ' ' after the `obj` keyword	2019-11-27 18:01:19 +01:00
Tissevert	3a3e1533b4	Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue	2019-10-14 10:17:15 +02:00
Tissevert	d07c286f8e	Clean exported ByteString custom functions	2019-10-14 10:17:15 +02:00
Tissevert	36d7f9b819	Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary	2019-10-14 10:17:15 +02:00
Tissevert	3b59fd0c61	Separate CMap and Text in two distinct modules	2019-10-14 10:17:15 +02:00
Tissevert	c349d9b4c2	Don't trust serializers, they have nothing to do with a reasonable binary encoding	2019-10-14 10:17:15 +02:00
Tissevert	e7484ef536	Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps	2019-10-14 10:17:15 +02:00
Tissevert	b8eb9e6856	Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec	2019-10-14 10:17:15 +02:00
Tissevert	51db57ec67	Ugly commit, breaks everything, still trying to figure out a grammar for text	2019-10-14 10:17:15 +02:00
Tissevert	6f3c159ea7	Adding a module to implement text reading and a demo program to go with it	2019-10-14 10:17:15 +02:00
Tissevert	3a39c75e6a	Stop requiring an empty line between subsections in a xref section	2019-09-22 01:37:28 +02:00
Tissevert	29c5823f34	Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file	2019-09-22 01:34:17 +02:00
Tissevert	699f830a45	Simplify XRef structure, clarify integer types and remove nextLine	2019-09-20 22:39:14 +02:00
Tissevert	264b0dc92b	Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found	2019-05-31 15:08:54 +02:00
Tissevert	9dac275f68	Keep comment-opening '%' along with the comment and support empty lines	2019-05-31 15:07:41 +02:00
Tissevert	85e4eb9273	Fix bypassed error message for lines + add one for occurrences	2019-05-31 15:06:20 +02:00
Tissevert	11cb6504d7	Go strict ByteStrings with attoparsec	2019-05-24 10:48:09 +02:00
Tissevert	5614a25048	Generate valid PDF	2019-05-18 09:01:13 +02:00
Tissevert	0336baa687	Fix output implementation with dynamic XRefs	2019-05-17 16:14:06 +02:00
Tissevert	e23618da68	Implement output	2019-05-16 22:41:14 +02:00
Tissevert	645466024a	Starting to implement output with String builder	2019-05-16 17:04:45 +02:00
Tissevert	9b2f890227	Boyer-Moore is canceled, implement the rest of parsing with naive search	2019-05-16 11:01:50 +02:00
Tissevert	fc41f815a3	Broken state : trying to implement Boyer-Moore for fast-forwarding to the end of a section	2019-05-15 19:13:35 +02:00
Tissevert	379a821550	Fix bugs preventing the objects from loading	2019-05-15 15:03:55 +02:00
Tissevert	44508a204c	Reuse Parser type in PDF.Body (and generalize the type of the comment parser)	2019-05-15 09:04:17 +02:00
Tissevert	91292d6401	Implement retrieving objects in the body of the document and use it to populate the structure previously parsed	2019-05-14 18:42:11 +02:00
Tissevert	8043f84da8	Cut PDF module in two, implement basic parsing up to reading XRef table	2019-05-13 18:22:05 +02:00

41 Commits