Commit graph

7249 commits

Author SHA1 Message Date
Salim B
fae6a204f1
Fix/update URLs and use HTTP**S** where possible (#7122) 2021-02-26 17:56:04 -08:00
John MacFarlane
f0a991a22b T.P.CSV: fix parsing of unquoted values.
Previously we didn't allow unescaped quotes in unquoted values,
but they are allowed. Closes #7112.
2021-02-22 21:18:04 -08:00
John MacFarlane
d30791a381 Fall back to latin1 if UTF-8 decoding fails...
...when handling URL argument served with no charset in the mime type.
The assumption is that most pages that don't specify a charset
in the mime type are either UTF-8 or latin1.  I think that's a good
assumption, though I'm not sure.
2021-02-22 14:17:22 -08:00
John MacFarlane
5a73c5d3f8 When downloading content from URL arguments, be sensitive to...
the character encoding.  We can properly handle UTF-8 and
latin1 (ISO-8859-1); for others we raise an error.
See #5600.
2021-02-22 14:01:10 -08:00
John MacFarlane
bafccd5aa2 T.P.Error: Add PandocUnsupportedCharsetError constructor...
...for PandocError.  [API change]
2021-02-22 14:01:04 -08:00
John MacFarlane
4617f229ea Text.Pandoc.MIME: add exported function getCharset.
[API change]
2021-02-22 13:28:47 -08:00
John MacFarlane
80fde18fb1 Text.Pandoc.UTF8: change IO functions to return Text, not String.
[API change] This affects `readFile`, `getContents`, `writeFileWith`,
`writeFile`, `putStrWith`, `putStr`, `putStrLnWith`, `putStrLn`.
`hPutStrWith`, `hPutStr`, `hPutStrLnWith`, `hPutStrLn`, `hGetContents`.

This avoids the need to uselessly create a linked list of characters
when emiting output.
2021-02-22 11:30:07 -08:00
John MacFarlane
2b37ed9f21 LaTeX reader: further optimizations in satisfyTok.
Benchmarks show 2/3 of the run time and 2/3 of the allocation
of the Feb. 10 benchmarks.
2021-02-21 11:30:17 -08:00
John MacFarlane
db4f882315 LaTeX reader: removed sExpanded in state.
This isn't actually needed and checking it doesn't change
anything.

Also remove an unnecessary `doMacros` before `satisfyTok`,
which does it anyway.
2021-02-21 11:24:04 -08:00
John MacFarlane
f43cb5ddcf LaTeX reader: further performance optimization.
Avoid unnecessary 'doMacros'.
2021-02-21 10:58:42 -08:00
John MacFarlane
c0c8865eaa HTML reader: small performance tweak. 2021-02-20 23:40:02 -08:00
John MacFarlane
d8ef383692 T.P.Shared: remove some obsolete functions [API change].
Removed:

- `splitByIndices`
- `splitStringByIndicies`
- `substitute`
- `underlineSpan`

None of these are used elsewhere in the code base.
2021-02-20 23:02:10 -08:00
John MacFarlane
321343b2cf HTML reader: small efficiency improvements.
Also, remove exported class NamedTag(..) [API change].
This was just intended to smooth over the transition from String to Text
and is no longer needed.

The functions isInlineTag and isBlockTag are no longer
polymorphic.
2021-02-20 22:49:20 -08:00
John MacFarlane
cec541e54c LaTeX reader: Another small improvement to macro handling. 2021-02-20 22:14:31 -08:00
John MacFarlane
31b8f60ea8 LaTeX reader: avoid macro resolution code if no macros defined. 2021-02-20 22:03:29 -08:00
John MacFarlane
0f955b10b4 T.P.Readers.LaTeX.Parsing: improve braced'.
Remove the parameter, have it parse the opening brace,
and make it more efficient.
2021-02-20 18:57:46 -08:00
John MacFarlane
13847267e9 HTML reader: efficiency improvements.
Do a lookahead to find the right parser to use.

Benchmarks from 34ms to 23ms, with less allocation.
Also speeds up the epub reader.
2021-02-20 00:07:38 -08:00
John MacFarlane
98d26c2345 DocBook, JATS, OPML readers: performance optimization.
With the new XML parser, we can avoid the expensive tree
normalization step we used to do.

This gives a significant speed boost in docbook and JATS
parsing (e.g. 9.7 to 6 ms).
2021-02-18 21:24:31 -08:00
John MacFarlane
ef642e2bbc T.P.XML Improve fromEntities. 2021-02-18 18:11:27 -08:00
John MacFarlane
0f5c56dfb1 T.P.PDF: disable smart when building PDF via LaTeX.
This is to prevent accidental creation of ligatures like
`` ?` `` and `` !` `` (especially in languages with quotations
like German), and similar ligature issues.

See jgm/citeproc#54.
2021-02-18 17:11:53 -08:00
John MacFarlane
53cf8295a4 LaTeX writer: adjust hypertargets to beginnings of paragraphs.
Use `\vadjust pre` so that the hypertarget takes you to the
beginning of the paragraph rather than one line down.

Closes #7078.

This makes a particular difference for links to citations
using `--citeproc` and `link-citations: true`.
2021-02-18 14:34:38 -08:00
John MacFarlane
9e728b40f3 T.P.Shared: cleanup.
Cleanup up some functions and added deprecation pragmas
to funtions no longer used in the code base.
2021-02-18 13:12:15 -08:00
Albert Krewinkel
743f7216de
Org reader: fix bug in org-ref citation parsing.
The org-ref syntax allows to list multiple citations separated by comma.
This fixes a bug that accepted commas as part of the citation id, so all
citation lists were parsed as one single citation.

Fixes: #7101
2021-02-18 21:59:18 +01:00
John MacFarlane
73add05789 Docx reader: use Map instead of list for Namespaces.
This gives a speedup of about 5-10%.

The reader is now approximately twice as fast as in the last
release.
2021-02-17 09:54:39 -08:00
John MacFarlane
80a1d5c9b6 Revert "Add T.P.XML.Light.Cursor."
This reverts commit d8fc497186.
2021-02-16 19:18:01 -08:00
John MacFarlane
d8fc497186 Add T.P.XML.Light.Cursor. 2021-02-16 18:51:41 -08:00
John MacFarlane
4af378702a Add orig copyright/license info for code derived from xml-light. 2021-02-16 18:44:38 -08:00
John MacFarlane
d7a4996b1e Split up T.P.XML.Light into submodules. 2021-02-16 18:40:06 -08:00
John MacFarlane
967e7f5fb9 Rename Text.Pandoc.XMLParser -> Text.Pandoc.XML.Light...
..and add new definitions isomorphic to xml-light's, but with
Text instead of String.  This allows us to keep most of the code in
existing readers that use xml-light, but avoid lots of unnecessary
allocation.

We also add versions of the functions from xml-light's
Text.XML.Light.Output and Text.XML.Light.Proc that operate
on our modified XML types, and functions that convert
xml-light types to our types (since some of our dependencies,
like texmath, use xml-light).

Update golden tests for docx and pptx.

OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`.

Docx: Do a manual traversal to unwrap sdt and smartTag.
This is faster, and needed to pass the tests.

Benchmarks:

A = prior to 8ca191604d (Feb 8)
B = as of 8ca191604d (Feb 8)
C = this commit

| Reader  |  A    | B      | C     |
| ------- | ----- | ------ | ----- |
| docbook | 18 ms | 12 ms  | 10 ms |
| opml    | 65 ms | 62 ms  | 35 ms |
| jats    | 15 ms | 11 ms  |  9 ms |
| docx    | 72 ms | 69 ms  | 44 ms |
| odt     | 78 ms | 41 ms  | 28 ms |
| epub    | 64 ms | 61 ms  | 56 ms |
| fb2     | 14 ms | 5  ms  | 4 ms  |
2021-02-16 16:55:20 -08:00
Albert Krewinkel
8621ed600a
T.P.Error: remove unused variables 2021-02-14 15:49:12 +01:00
John MacFarlane
d84a6041e1 HTML reader: fix bad handling of empty src attribute in iframe.
- If src is empty, we simply skip the iframe.
- If src is invalid or cannot be fetched, we issue a warning
  and skip instead of failing with an error.
- Closes #7099.
2021-02-13 13:08:34 -08:00
John MacFarlane
6e73273916 T.P.Error: export renderError.
Refactor `handleError` to use `renderError`. This allows us
render error messages without exiting.
2021-02-13 13:08:34 -08:00
Albert Krewinkel
a3beed9db8 Org: support task_lists extension
The tasks lists extension is now supported by the org reader and writer;
the extension is turned on by default.

Closes: #6336
2021-02-13 13:00:37 -08:00
Albert Krewinkel
2d60a5127c T.P.Shared: export handleTaskListItem. [API change] 2021-02-13 13:00:37 -08:00
John MacFarlane
6323250bad LaTeX reader: remove unnecessary line 2021-02-13 00:22:22 -08:00
John MacFarlane
25b7df7c2a Remove Ext_fenced_code_attributes from allowed commonmark attributes.
This attribute was listed as allowed, but it didn't actually
do anything. Use `attributes` for code attributes and more.

Closes #7097.
2021-02-13 00:18:40 -08:00
John MacFarlane
eb0c63b002 Avoid an unnecessary withRaw. 2021-02-12 19:29:48 -08:00
John MacFarlane
d9322629a3 LaTeX reader improvements.
* Rewrote `withRaw` so it doesn't rely on fragile assumptions
  about token positions (which break when macros are expanded).
  This requires the addition of `sEnableWithRaw` and `sRawTokens`
  in `LaTeXState`, and a new combinator `disablingWithRaw` to
  disable collecting of raw tokens in certain contexts.
* Add `parseFromToks` to T.P.Readers.LaTeX.Parsing.
* Fix parsing of single character tokens so it doesn't mess
  up the new raw token collecting.
* These changes slightly increase allocations and have a small
  performance impact, but it's minor.

Closes #7092.
2021-02-12 19:04:14 -08:00
John MacFarlane
390d5e65b2 Use getTimestamp instead of getCurrentTime in writers.
Setting SOURCE_DATE_EPOCH will allow reproducible builds.

Partially addresses #7093.  This does not suffice to fully enable
reproducible in EPUB, since a unique id is being generated for each
build.
2021-02-11 14:55:03 -08:00
John MacFarlane
3c4a58bad0 T.P.Class: Add getTimestamp [API change].
This attempts to read the SOURCE_DATE_EPOCH environment variable
and parse a UTC time from it (treating it as a unix date stamp,
see https://reproducible-builds.org/specs/source-date-epoch/).
If the variable is not set or can't be parsed as a unix date
stamp, then the function returns the current date.
2021-02-11 14:54:28 -08:00
John MacFarlane
acc9afaf6f Correctly parse "raw" date value in markdown references metadata.
See jgm/citeproc#53.
2021-02-11 09:16:25 -08:00
John MacFarlane
8ca191604d Add new unexported module T.P.XMLParser.
This exports functions that uses xml-conduit's parser to
produce an xml-light Element or [Content].  This allows
existing pandoc code to use a better parser without
much modification.

The new parser is used in all places where xml-light's
parser was previously used.  Benchmarks show a significant
performance improvement in parsing XML-based formats
(especially ODT and FB2).

Note that the xml-light types use String, so the
conversion from xml-conduit types involves a lot
of extra allocation.  It would be desirable to
avoid that in the future by gradually switching
to using xml-conduit directly. This can be done
module by module.

The new parser also reports errors, which we report
when possible.

A new constructor PandocXMLError has been added to
PandocError in T.P.Error [API change].

Closes #7091, which was the main stimulus.

These changes revealed the need for some changes
in the tests.  The docbook-reader.docbook test
lacked definitions for the entities it used; these
have been added. And the docx golden tests have been
updated, because the new parser does not preserve
the order of attributes.

Add entity defs to docbook-reader.docbook.

Update golden tests for docx.
2021-02-10 22:04:11 -08:00
John MacFarlane
f70795dc5e ODT reader: finer-grained errors on parse failure.
See #7091.
2021-02-08 09:39:59 -08:00
John MacFarlane
5cd1c1001f ODT reader: give more information if zip can't be unpacked. 2021-02-08 09:39:59 -08:00
Nils Carlson
69b7401e31
DocBook reader: Support informalfigure (#7079)
Add support for informalfigure.
2021-02-08 09:36:58 -08:00
Albert Krewinkel
d202f7eb77
Avoid unnecessary use of NoImplicitPrelude pragma (#7089) 2021-02-07 10:02:35 -08:00
John MacFarlane
8e9131db4e Markdown reader: improved handling of mmd link attributes in references.
Previously they only worked for links that had titles.  Closes #7080.
2021-02-06 21:52:12 -08:00
Albert Krewinkel
a5169f68b2
Lua filters: use same function names in Haskell and Lua 2021-02-04 19:07:59 +01:00
Nick Berendsen
b79aba6ea1
ePub writer: belongs-to-collection metadata (#7063) 2021-02-03 09:00:18 -08:00
Albert Krewinkel
61b108d527 Lua: add module "pandoc.path"
The module allows to work with file paths in a convenient and
platform-independent manner.

Closes: #6001
Closes: #6565
2021-02-02 21:04:30 -08:00