|
57996749c6
|
Fix loose parser not making sure endOfInput is reached; add two families of operators and simplify the «Show» instance with a dedicated function to allow deleting lines of uninteresting code
|
2020-02-06 16:54:27 +01:00 |
|
|
3f6b0651f3
|
Expose the endOfLine parser through MonadParser to allow enforcing reaching the end of input in page parser
|
2020-02-06 16:53:06 +01:00 |
|
|
ecfd682b34
|
Simplify functions exposed (all part of the MonadParser class)
|
2020-02-06 16:52:22 +01:00 |
|
|
5fa32e35db
|
Implement Font retrieving for simple fonts with an /Encoding and no ToUnicode
|
2020-02-05 22:15:18 +01:00 |
|
|
b5a15a692b
|
Forgot to remove commented-out dead code
|
2020-02-05 19:49:03 +01:00 |
|
|
b859338a57
|
Start implementing the MacRomanEncoding
|
2020-02-05 18:03:44 +01:00 |
|
|
764e2c6a4f
|
Removing deprecated hiding for «fail»
|
2020-02-05 18:02:52 +01:00 |
|
|
6ed57d66e8
|
Reimplement cMap as a type of Font and make the code ready for other Fonts
|
2020-02-05 17:42:17 +01:00 |
|
|
22cde37025
|
Add a Font class type to allow text rendition schemes other than CMaps
|
2020-02-05 14:42:51 +01:00 |
|
|
c48ab22808
|
Forgot some useless parentheses when playing with operator precedences
|
2020-02-04 17:05:15 +01:00 |
|
|
a2b66ac6d6
|
Generalize the getFont function because some /Resources have a direct dictionary as value for their /Font property
|
2020-02-04 17:04:42 +01:00 |
|
|
cefb08ee50
|
Going a step further in «optimization» (slowing it even more…) by replacing choice by a search in a Map
|
2019-11-30 21:46:22 +01:00 |
|
|
afbbcbffc5
|
Finish implementing the new stack-based call parser
|
2019-11-30 12:39:40 +01:00 |
|
|
8373bd1ea0
|
Removing +x permission on getText source that shouldn't ever have been set
|
2019-11-29 19:07:54 +01:00 |
|
|
bac08446dd
|
WIP: starting to fix this criminally inefficient parser for PDF's postfix-operator instructions
|
2019-11-29 17:42:57 +01:00 |
|
|
f9f799c59b
|
Take the dirty code of «getText» and turn it into a relatively clean module exposing pages, that can be retrieved all at once or by page number (numbered human-style, starting from 1)
|
2019-11-29 11:51:35 +01:00 |
|
|
08a9717b3a
|
Get rid of wrapper PageContents structure returned by PageContent in the PDF.Text module (and return directly [ByteString] instead)
|
2019-11-29 11:48:28 +01:00 |
|
|
42a02808c1
|
Merge branch 'main' into extract-text
|
2019-11-27 18:05:47 +01:00 |
|
|
380c1e439b
|
Fix a bug preventing Hufflepdf from reading objects with a ' ' after the obj keyword
|
2019-11-27 18:01:19 +01:00 |
|
|
c9f050e64b
|
Remove deprecated debug script and forgotten comments to bypass the selective export of Text module
|
2019-10-14 10:17:15 +02:00 |
|
|
3a3e1533b4
|
Clean ByteString types to identify when a ByteString contains the representation of an integer in a given base and fix the last remaining PDF string (un)escaping issue
|
2019-10-14 10:17:15 +02:00 |
|
|
a96e36ec5a
|
Fix error silently discarding code ranges, make sure ByteString intervals are created with the correct byte length and decode utf16BE encoded values in single-value ranges
|
2019-10-14 10:17:15 +02:00 |
|
|
d07c286f8e
|
Clean exported ByteString custom functions
|
2019-10-14 10:17:15 +02:00 |
|
|
7a15113285
|
Try and re-implement string decoding — compiles but now fails to decode any string
|
2019-10-14 10:17:15 +02:00 |
|
|
36d7f9b819
|
Still debugging, broke pretty much everything and finally implementing a proper coderange parsing for CMap because apparently that's necessary
|
2019-10-14 10:17:15 +02:00 |
|
|
b8ca7281aa
|
Fix parsing errors forgetting to make sure there's a space after special operator arguments like names and stringObjects
|
2019-10-14 10:17:15 +02:00 |
|
|
32efdcdd6b
|
Try and fix stuff by generalizing a signature to ease debugging and add parentheses which I think should have been here all along
|
2019-10-14 10:17:15 +02:00 |
|
|
3b59fd0c61
|
Separate CMap and Text in two distinct modules
|
2019-10-14 10:17:15 +02:00 |
|
|
0374b72920
|
Finish implementing reading, still bugs to investigate
|
2019-10-14 10:17:15 +02:00 |
|
|
1dd22c3889
|
Going to try with Text, naturally handling UTF-16 but will still have to parse «int codes» manually from strings
|
2019-10-14 10:17:15 +02:00 |
|
|
98d029c4d4
|
In complete debug, more or less implemented CMap parsing but apparently it uses UTF16 ?!
|
2019-10-14 10:17:15 +02:00 |
|
|
c349d9b4c2
|
Don't trust serializers, they have nothing to do with a reasonable binary encoding
|
2019-10-14 10:17:15 +02:00 |
|
|
e7484ef536
|
Completely lost, the same old Char8 / Word8 again, implemented all the text reading, still needing a couple details to parse CMaps
|
2019-10-14 10:17:15 +02:00 |
|
|
f9e5683bf4
|
WIP: Use previous changes to start implementing font caching and text parsing (still very broken, doesn't compile)
|
2019-10-14 10:17:15 +02:00 |
|
|
b8eb9e6856
|
Generalize the Parser type into a MonadParser class to use with MonadTrans and remove redundant code already defined in Applicative or Attoparsec
|
2019-10-14 10:17:15 +02:00 |
|
|
66d315b7fe
|
Reflect the distinction between eval and run from State monad into the Parser module
|
2019-10-14 10:17:15 +02:00 |
|
|
51db57ec67
|
Ugly commit, breaks everything, still trying to figure out a grammar for text
|
2019-10-14 10:17:15 +02:00 |
|
|
6f3c159ea7
|
Adding a module to implement text reading and a demo program to go with it
|
2019-10-14 10:17:15 +02:00 |
|
|
d6994f0813
|
Release 0.2.0.0
|
2019-10-14 10:16:14 +02:00 |
|
|
68f90d20e2
|
Implement PDF's multilayer updates and use it in getObj to display only the current version of the object taken into account instead of the concatenation of all its versions
|
2019-09-22 01:40:39 +02:00 |
|
|
3a39c75e6a
|
Stop requiring an empty line between subsections in a xref section
|
2019-09-22 01:37:28 +02:00 |
|
|
29c5823f34
|
Fix precision bug caused by using Floats to represent PDF Number values sometimes used to represent a byte offset within a file
|
2019-09-22 01:34:17 +02:00 |
|
|
9ab010de61
|
Add two example programs to show how the lib can be used
|
2019-09-20 22:42:17 +02:00 |
|
|
699f830a45
|
Simplify XRef structure, clarify integer types and remove nextLine
|
2019-09-20 22:39:14 +02:00 |
|
|
dd79cb3fc7
|
Release bugfix v0.1.1.1
|
2019-05-31 15:16:23 +02:00 |
|
|
264b0dc92b
|
Stop requiring «trailer» keywords to live on a separate line as counter-examples have been found
|
2019-05-31 15:08:54 +02:00 |
|
|
9dac275f68
|
Keep comment-opening '%' along with the comment and support empty lines
|
2019-05-31 15:07:41 +02:00 |
|
|
85e4eb9273
|
Fix bypassed error message for lines + add one for occurrences
|
2019-05-31 15:06:20 +02:00 |
|
|
11cb6504d7
|
Go strict ByteStrings with attoparsec
|
2019-05-24 10:48:09 +02:00 |
|
|
0daa03d958
|
Remove commented out dead code
|
2019-05-21 09:07:37 +02:00 |
|