Hufflepdf

Tissevert/Hufflepdf

Commit graph

5fa32e35db Implement Font retrieving for simple fonts with an /Encoding and no ToUnicode Tissevert 2020-02-05 22:15:18 +0100
b5a15a692b Forgot to remove commented-out dead code Tissevert 2020-02-05 19:49:03 +0100
b859338a57 Start implementing the MacRomanEncoding Tissevert 2020-02-05 18:03:44 +0100
764e2c6a4f Removing deprecated hiding for «fail» Tissevert 2020-02-05 18:02:52 +0100
6ed57d66e8 Reimplement cMap as a type of Font and make the code ready for other Fonts Tissevert 2020-02-05 17:42:17 +0100
22cde37025 Add a Font class type to allow text rendition schemes other than CMaps Tissevert 2020-02-05 14:42:51 +0100
c48ab22808 Forgot some useless parentheses when playing with operator precedences Tissevert 2020-02-04 17:05:15 +0100
a2b66ac6d6 Generalize the getFont function because some /Resources have a direct dictionary as value for their /Font property Tissevert 2020-02-04 17:04:42 +0100
cefb08ee50 Going a step further in «optimization» (slowing it even more…) by replacing choice by a search in a Map Tissevert 2019-11-30 21:46:22 +0100
afbbcbffc5 Finish implementing the new stack-based call parser Tissevert 2019-11-30 12:39:40 +0100
8373bd1ea0 Removing +x permission on getText source that shouldn't ever have been set Tissevert 2019-11-29 19:07:54 +0100
bac08446dd WIP: starting to fix this criminally inefficient parser for PDF's postfix-operator instructions Tissevert 2019-11-29 17:42:57 +0100
7eca875900 Improve getObj example to catch non-existent ObjectId and default to listing existing ObjectIds when none is provided (branch: main) Tissevert 2019-11-29 11:53:08 +0100
f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) Tissevert 2019-11-29 11:51:35 +0100
08a9717b3a Get rid of wrapper PageContents structure returned by PageContent in the PDF.Text module (and return directly [ByteString] instead) Tissevert 2019-11-29 11:48:28 +0100
42a02808c1 Merge branch 'main' into extract-text Tissevert 2019-11-27 18:05:47 +0100
380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the obj keyword Tissevert 2019-11-27 18:01:03 +0100
c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module Tissevert 2019-10-07 12:30:07 +0200
3a3e1533b4 Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue Tissevert 2019-10-04 18:46:07 +0200
a96e36ec5a Fix error silently discarding code ranges, make sure ByteString intervals are created with the correct byte length and decode utf16BE encoded values in single-value ranges Tissevert 2019-10-03 14:59:06 +0200
d07c286f8e Clean exported ByteString custom functions Tissevert 2019-10-03 14:43:56 +0200
7a15113285 Try and re-implement string decoding — compiles but now fails to decode any string Tissevert 2019-10-03 07:59:09 +0200
36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary Tissevert 2019-09-30 14:13:12 +0200
b8ca7281aa Fix parsing errors forgetting to make sure there's a space after special operator arguments like names and stringObjects Tissevert 2019-09-28 09:25:59 +0200
32efdcdd6b Try and fix stuff by generalizing a signature to ease debugging and add parentheses which I think should have been here all along Tissevert 2019-09-27 18:38:03 +0200
3b59fd0c61 Separate CMap and Text in two distinct modules Tissevert 2019-09-27 18:16:12 +0200
0374b72920 Finish implementing reading, still bugs to investigate Tissevert 2019-09-27 12:21:06 +0200
1dd22c3889 Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings Tissevert 2019-09-26 16:56:13 +0200
98d029c4d4 In complete debug, more or less implemented CMap parsing but apparently it uses UTF-16?! Tissevert 2019-09-26 15:51:41 +0200
c349d9b4c2 Don't trust serializers, they have nothing to do with a reasonable binary encoding Tissevert 2019-09-25 23:46:24 +0200
e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps Tissevert 2019-09-25 18:42:34 +0200
f9e5683bf4 WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile) Tissevert 2019-09-24 18:38:12 +0200
b8eb9e6856 Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec Tissevert 2019-09-24 18:36:17 +0200
66d315b7fe Reflect the distinction between eval and run from State monad into the Parser module Tissevert 2019-09-24 18:32:23 +0200
51db57ec67 Ugly commit, breaks everything, still trying to figure a grammar for text Tissevert 2019-09-23 23:19:27 +0200
6f3c159ea7 Adding a module to implement text reading and a demo program to go with it Tissevert 2019-09-23 18:00:47 +0200
d6994f0813 Release 0.2.0.0 (tag: v0.2.0.0) Tissevert 2019-10-14 10:16:14 +0200
68f90d20e2 Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions Tissevert 2019-09-22 01:40:39 +0200
3a39c75e6a Stop requiring an empty line between subsections in a xref section Tissevert 2019-09-22 01:37:28 +0200
29c5823f34 Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file Tissevert 2019-09-22 01:34:17 +0200
9ab010de61 Add two example programs to show how the lib can be used Tissevert 2019-09-20 22:42:17 +0200
699f830a45 Simplify XRef structure, clarify integer types and remove nextLine Tissevert 2019-09-20 22:39:14 +0200
dd79cb3fc7 Release bugfix v0.1.1.1 (tag: v0.1.1.1) Tissevert 2019-05-31 15:16:23 +0200
264b0dc92b Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found Tissevert 2019-05-31 15:08:54 +0200
9dac275f68 Keep comment-opening '%' along with the comment and support empty lines Tissevert 2019-05-31 15:07:41 +0200
85e4eb9273 Fix bypassed error message for lines + add one for occurrences Tissevert 2019-05-31 15:06:20 +0200
11cb6504d7 Go strict ByteStrings with attoparsec (tag: v0.1.1.0) Tissevert 2019-05-24 10:48:09 +0200
0daa03d958 Remove commented out dead code Tissevert 2019-05-21 09:07:37 +0200
b60f337cc4 First useable version (tag: v0.1.0.0) Tissevert 2019-05-18 11:09:03 +0200
2c165daaa7 Finally opt for uppercase Hufflepdf and rename cabal package Tissevert 2019-05-18 09:49:31 +0200
5614a25048 Generate valid PDF Tissevert 2019-05-18 09:01:13 +0200
0336baa687 Fix output implementation with dynamic XRefs Tissevert 2019-05-17 16:14:06 +0200
e23618da68 Implement output Tissevert 2019-05-16 22:41:14 +0200
088637b2c0 Compat stuff for Monoid / Semigroup Tissevert 2019-05-16 21:40:19 +0200
96190a8ca4 Forgot to add changes to cabal file Tissevert 2019-05-16 17:06:14 +0200
645466024a Starting to implement output with String builder Tissevert 2019-05-16 17:04:45 +0200
9b2f890227 Boyer-Moore is canceled, implement the rest of parsing with naive search Tissevert 2019-05-16 11:01:50 +0200
fc41f815a3 Broken state : trying to implement Boyer-Moore for fast-forwarding to the end of a section Tissevert 2019-05-15 19:12:38 +0200
379a821550 Fix bugs preventing the objects from loading Tissevert 2019-05-15 15:03:55 +0200
44508a204c Reuse Parser type in PDF.Body (and generalize the type of the comment parser) Tissevert 2019-05-15 09:04:17 +0200
91292d6401 Implement retrieving objects in the body of the document and use it to populate the structure previously parsed Tissevert 2019-05-14 18:42:11 +0200
8043f84da8 Cut PDF module in two, implement basic parsing up to reading XRef table Tissevert 2019-05-13 18:22:05 +0200
6eacb55fc4 Fix bug preventing startXref from being found for files with a single byte EOL encoding Tissevert 2019-05-13 11:34:15 +0200
c036334b6f Prototype successfully parsing (only last) startxref Tissevert 2019-05-13 08:05:28 +0200