Commit Graph

70 Commits

Author SHA1 Message Date
Tissevert dd6bfd90bd Using toEnum to convert from Int to Int ? Surely a left-over from some time when it was a different type 2020-02-08 08:10:32 +01:00
Tissevert 03fbbc3a96 Why did I implement this overly complicated lift by hand again ? 2020-02-07 13:08:10 +01:00
Tissevert 95f9ab35b1 Implement MacRomanEncoding for real following their own vendor file https://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT 2020-02-07 10:59:28 +01:00
Tissevert fe055150a3 Migrate to Text to represent page contents and get rid of encoding concerns too early 2020-02-07 10:49:16 +01:00
Tissevert 57996749c6 Fix loose parser not making sure endOfInput is reached; add two families of operators and simplify the «Show» instance with a dedicated function to allow deleting lines of uninteresting code 2020-02-06 16:54:27 +01:00
Tissevert 3f6b0651f3 Expose the endOfLine parser through MonadParser to allow enforcing reaching the end of input in page parser 2020-02-06 16:53:06 +01:00
Tissevert ecfd682b34 Simplify functions exposed (all part of the MonadParser class) 2020-02-06 16:52:22 +01:00
Tissevert 5fa32e35db Implement Font retrieving for simple fonts with an /Encoding and no ToUnicode 2020-02-05 22:15:18 +01:00
Tissevert b5a15a692b Forgot to remove commented-out dead code 2020-02-05 19:49:03 +01:00
Tissevert b859338a57 Start implementing the MacRomanEncoding 2020-02-05 18:03:44 +01:00
Tissevert 764e2c6a4f Removing deprecated hiding for «fail» 2020-02-05 18:02:52 +01:00
Tissevert 6ed57d66e8 Reimplement cMap as a type of Font and make the code ready for other Fonts 2020-02-05 17:42:17 +01:00
Tissevert 22cde37025 Add a Font class type to allow text rendition schemes other than CMaps 2020-02-05 14:42:51 +01:00
Tissevert c48ab22808 Forgot some useless parentheses when playing with operator precedences 2020-02-04 17:05:15 +01:00
Tissevert a2b66ac6d6 Generalize the getFont function because some /Resources have a direct dictionary as value for their /Font property 2020-02-04 17:04:42 +01:00
Tissevert cefb08ee50 Going a step further in «optimization» (slowing it even more…) by replacing choice by a search in a Map 2019-11-30 21:46:22 +01:00
Tissevert afbbcbffc5 Finish implementing the new stack-based call parser 2019-11-30 12:39:40 +01:00
Tissevert 8373bd1ea0 Removing +x permission on getText source that shouldn't ever have been set 2019-11-29 19:07:54 +01:00
Tissevert bac08446dd WIP: starting to fix this criminally inefficient parser for PDF's postfix-operator instructions 2019-11-29 17:42:57 +01:00
Tissevert f9f799c59b Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1) 2019-11-29 11:51:35 +01:00
Tissevert 08a9717b3a Get rid of wrapper PageContents structure returned by PageContent in the PDF.Text module (and return directly [ByteString] instead) 2019-11-29 11:48:28 +01:00
Tissevert 42a02808c1 Merge branch 'main' into extract-text 2019-11-27 18:05:47 +01:00
Tissevert 380c1e439b Fix a bug preventing Hufflepdf from reading objects with a ' ' after the `obj` keyword 2019-11-27 18:01:19 +01:00
Tissevert c9f050e64b Remove deprecated debug script and forgotten comments to bypass the selective export of Text module 2019-10-14 10:17:15 +02:00
Tissevert 3a3e1533b4 Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue 2019-10-14 10:17:15 +02:00
Tissevert a96e36ec5a Fix error silently discarding code ranges, make sure ByteString intervals are created with the correct byte length and decode utf16BE encoded values in single-value ranges 2019-10-14 10:17:15 +02:00
Tissevert d07c286f8e Clean exported ByteString custom functions 2019-10-14 10:17:15 +02:00
Tissevert 7a15113285 Try and re-implement string decoding — compiles but now fails to decode any string 2019-10-14 10:17:15 +02:00
Tissevert 36d7f9b819 Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary 2019-10-14 10:17:15 +02:00
Tissevert b8ca7281aa Fix parsing errors forgetting to make sure there's a space after special operator arguments like names and stringObjects 2019-10-14 10:17:15 +02:00
Tissevert 32efdcdd6b Try and fix stuff by generalizing a signature to ease debugging and add parentheses which I think should have been here all along 2019-10-14 10:17:15 +02:00
Tissevert 3b59fd0c61 Separate CMap and Text in two distinct modules 2019-10-14 10:17:15 +02:00
Tissevert 0374b72920 Finish implementing reading, still bugs to investigate 2019-10-14 10:17:15 +02:00
Tissevert 1dd22c3889 Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings 2019-10-14 10:17:15 +02:00
Tissevert 98d029c4d4 In complete debug, more or less implemented CMap parsing but apparently it uses UTF16 ?! 2019-10-14 10:17:15 +02:00
Tissevert c349d9b4c2 Don't trust serializer, they have nothing to do with a reasonable binary encoding 2019-10-14 10:17:15 +02:00
Tissevert e7484ef536 Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps 2019-10-14 10:17:15 +02:00
Tissevert f9e5683bf4 WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile) 2019-10-14 10:17:15 +02:00
Tissevert b8eb9e6856 Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec 2019-10-14 10:17:15 +02:00
Tissevert 66d315b7fe Reflect the distinction between eval and run from State monad into the Parser module 2019-10-14 10:17:15 +02:00
Tissevert 51db57ec67 Ugly commit, breaks everything, still trying to figure a grammar for text 2019-10-14 10:17:15 +02:00
Tissevert 6f3c159ea7 Adding a module to implement text reading and a demo program to go with it 2019-10-14 10:17:15 +02:00
Tissevert d6994f0813 Release 0.2.0.0 2019-10-14 10:16:14 +02:00
Tissevert 68f90d20e2 Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions 2019-09-22 01:40:39 +02:00
Tissevert 3a39c75e6a Stop requiring an empty line between subsections in a xref section 2019-09-22 01:37:28 +02:00
Tissevert 29c5823f34 Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file 2019-09-22 01:34:17 +02:00
Tissevert 9ab010de61 Add to example programs to show how the lib can be used 2019-09-20 22:42:17 +02:00
Tissevert 699f830a45 Simplify XRef structure, clarify integer types and remove nextLine 2019-09-20 22:39:14 +02:00
Tissevert dd79cb3fc7 Release bugfix v0.1.1.1 2019-05-31 15:16:23 +02:00
Tissevert 264b0dc92b Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found 2019-05-31 15:08:54 +02:00