Commit graph

5 commits

Author SHA1 Message Date
John MacFarlane
8ca191604d Add new unexported module T.P.XMLParser.
This exports functions that uses xml-conduit's parser to
produce an xml-light Element or [Content].  This allows
existing pandoc code to use a better parser without
much modification.

The new parser is used in all places where xml-light's
parser was previously used.  Benchmarks show a significant
performance improvement in parsing XML-based formats
(especially ODT and FB2).

Note that the xml-light types use String, so the
conversion from xml-conduit types involves a lot
of extra allocation.  It would be desirable to
avoid that in the future by gradually switching
to using xml-conduit directly. This can be done
module by module.

The new parser also reports errors, which we report
when possible.

A new constructor PandocXMLError has been added to
PandocError in T.P.Error [API change].

Closes #7091, which was the main stimulus.

These changes revealed the need for some changes
in the tests.  The docbook-reader.docbook test
lacked definitions for the entities it used; these
have been added. And the docx golden tests have been
updated, because the new parser does not preserve
the order of attributes.

Add entity defs to docbook-reader.docbook.

Update golden tests for docx.
2021-02-10 22:04:11 -08:00
Arfon Smith
5847624124 Update JATS dtd (#6020)
The current DTD for the JATS writer template is for Journal Publishing (JATS-journalpublishing1.dtd), which does not permit ext-link as a valid child (https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/publisher-name.html).

This update modifies the default output template to be the less restrictive JATS archiving and interchange DTD which systems like PubMed use internally to represent their articles.
2019-12-30 22:18:55 -08:00
Nokome Bentley
7d193b2aad Remove extraneous, significant whitespace in JATS writer output (#4335)
This patch fixes some cases where the JATS writer was introducing
semantically significant whitespace by indenting and wrapping tags.
Note that the JATS spec has a content model for `<p>` tags of `(#PCDATA | ...`.
Any tag where `#PCDATA` children are possible should not have any
indentation. The same is true for `<th>`, `<td>`, `<term>`, `<label>`.
2018-03-05 09:44:34 -08:00
John MacFarlane
6b63b63f30 JATS reader: process author metadata. 2017-12-23 10:03:13 -08:00
Hamish Mackenzie
5d3c9e5646 Add Basic JATS reader based on DocBook reader 2017-12-20 13:54:02 +13:00