readNative can now read full pandoc documents, block lists, blocks,
inline lists, or inlines. It will interpret
Str "hi"
as if it were
Pandoc (Meta [] [] []) [Plain [Str "hi"]]
This should make testing easier.
'(_hi_)' was being parsed with literal underscores (no emphasis).
The fix: the 'str' parser now only parses alphanumerics and
embedded underscores. All other symbols are handled by the
'symbol' parser. This has a slight effect on the AST, since
you'll get [Str "hi",Str ":"] insntead of [Str "hi:"]. But there
should not be a visible effect in any of the writers.
Thanks to gwern for pointing out the regression.
* The new reader is faster and more accurate.
* API changes for Text.Pandoc.Readers.HTML:
- removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag,
anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType,
htmlBlockElement, htmlComment
- added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag
* tagsoup is a new dependency.
* Text.Pandoc.Parsing: Generalized type on readWith.
* Benchmark.hs: Added length calculation to force full evaluation.
* Updated HTML reader tests.
* Updated markdown and textile readers to use the functions from
the HTML reader.
* Note: The markdown reader now correctly handles some cases it did not
before. For example:
<hr/>
is reproduced without adding a space.
<script>
a = '<b>';
</script>
is parsed correctly.
I had previously assumed that we needed to ignore
</script> occuring in a string literal or javascript
comment. It turns out, though, that browsers aren't
that smart.
This was achieved by rearranging the parsers in inline.
Benchmarks went from 500ms to 307ms -- not quite back to the
279ms we had in 1.6, before supporting smart punctuation and
footnotes, but close.
There was a bug in parsing '_emph_, ...': when followed by
a comma, underscore emphasis did not register. (Thanks to
gwern for pointing this out.)
This bug was introduced by the change in
c66921f2ac
* The recent change allowing spaces and newlines in the URL
caused problems when reference keys are stacked up without
blank lines between. This is now fixed.
* Added test.
Moved inlineNote parser after superscript parser,
so ^[link](/foo)^ gets recognized as a superscripted
link, not an inline note followed by garbage.
Thanks to Conal Elliott for pointing out the problem.
This is better done on the resulting HTML; use the xss-sanitize library
for this. xss-sanitize is based on pandoc's sanitization, but improves
it.
- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.