'(_hi_)' was being parsed with literal underscores (no emphasis).
The fix: the 'str' parser now only parses alphanumerics and
embedded underscores. All other symbols are handled by the
'symbol' parser. This has a slight effect on the AST, since
you'll get [Str "hi",Str ":"] insntead of [Str "hi:"]. But there
should not be a visible effect in any of the writers.
Thanks to gwern for pointing out the regression.
* The new reader is faster and more accurate.
* API changes for Text.Pandoc.Readers.HTML:
- removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag,
anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType,
htmlBlockElement, htmlComment
- added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag
* tagsoup is a new dependency.
* Text.Pandoc.Parsing: Generalized type on readWith.
* Benchmark.hs: Added length calculation to force full evaluation.
* Updated HTML reader tests.
* Updated markdown and textile readers to use the functions from
the HTML reader.
* Note: The markdown reader now correctly handles some cases it did not
before. For example:
<hr/>
is reproduced without adding a space.
<script>
a = '<b>';
</script>
is parsed correctly.
I had previously assumed that we needed to ignore
</script> occuring in a string literal or javascript
comment. It turns out, though, that browsers aren't
that smart.
This was achieved by rearranging the parsers in inline.
Benchmarks went from 500ms to 307ms -- not quite back to the
279ms we had in 1.6, before supporting smart punctuation and
footnotes, but close.
* Added Text.Pandoc.Pretty.
This is better suited for pandoc than the 'pretty' package.
One advantage is that we now get proper wrapping; Emph [Inline]
is no longer treated as a big unwrappable unit. Previously
we only got breaks for spaces at the "outer level." We can also
more easily avoid doubled blank lines. Performance is
significantly better as well.
* Removed Text.Pandoc.Blocks.
Text.Pandoc.Pretty allows you to define blocks and concatenate
them.
* Modified markdown, RST, org readers to use Text.Pandoc.Pretty
instead of Text.PrettyPrint.HughesPJ.
* Text.Pandoc.Shared: Added writerColumns to WriterOptions.
* Markdown, RST, Org writers now break text at writerColumns.
* Added --columns command-line option, which sets stColumns
and writerColumns.
* Table parsing: If the size of the header > stColumns,
use the header size as 100% for purposes of calculating
relative widths of columns.
This is necessary as the latex citation commands include there own
punctuation, which resulted in doubled commas for markdown documents
where citeproc output works correctly.
There was a bug in parsing '_emph_, ...': when followed by
a comma, underscore emphasis did not register. (Thanks to
gwern for pointing this out.)
This bug was introduced by the change in
c66921f2ac
* The recent change allowing spaces and newlines in the URL
caused problems when reference keys are stacked up without
blank lines between. This is now fixed.
* Added test.
Moved inlineNote parser after superscript parser,
so ^[link](/foo)^ gets recognized as a superscripted
link, not an inline note followed by garbage.
Thanks to Conal Elliott for pointing out the problem.
This is better done on the resulting HTML; use the xss-sanitize library
for this. xss-sanitize is based on pandoc's sanitization, but improves
it.
- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.
Resolves issue #258.
Note that there are some differences in how docutils and
pandoc treat footnotes. Currently pandoc ignores the numeral
or symbol used in the note; footnotes are put in an auto-numbered
ordered list.
Previously we allowed '. . .', ' . . . ', etc. This caused
too many complications, and removed author's flexibility in
combining ellipses with spaces and periods.
The 'str' parser now reads internal _'s as part of the string.
This prevents pandoc from getting started looking for an emphasized
block, which can cause exponential slowdowns in some cases.
Resolves Issue #182.
Previously, curly quotes were just parsed literally, leading
to problems in some output formats. Now they are parsed as
Quoted inlines, if --smart is specified.
Resolves Issue #270.
This broke when we added the Key type. We had assumed that
the custom case-insensitive Ord instance would ensure case-insensitive
matching, but that is not how Data.Map works.
* Added a test case for case-insensitivity in markdown-reader-more
* Removed old refsMatch from Text.Pandoc.Parsing module;
* hid the 'Key' constructor;
* dropped the custom Ord and Eq instances, deriving instead;
* added fromKey and toKey to convert between Keys and Inline lists;
* toKey ensures that keys are case-insensitive, since this is the
only way the API provides to construct a Key.
Resolves Issue #272.
The smartPuncutation parser from the markdown parser
was being used, but this creates two problems:
* smart punctuation rules are slightly different in textile,
for example, a single dash wish space around becomes an
En dash.
* the following gets parsed as a double quoted string followed
by a colon, rather than as a link:
"emphasized text":http://my.url.com
This needs rethinking.
Do a quick lookahead to make sure what follows looks like a setext
header before parsing any Inlines. This gives a 15% performance
boost in one benchmark. Many thanks to knieriem for finding
the problem (in peg-markdown):
https://github.com/jgm/peg-markdown/issues/issue/3
If suffix doesn't begin with punctuation, include opening
comma and space in result.
Previously,
@item [only a suffix]
would result in something like
Doe (2002only a suffix)
because there was no opening delimiter.
* Don't look for bibliography in ~/.pandoc. Reason: doing
this requires a read + parse of the bibliography even when
the document doesn't use citations. This is a big performance
drag on regular pandoc invocations.
* Only look for default.csl if the document contains references.
Reason: avoids the need to read and parse csl file when the
document contains no references anyway.
* Removed findFirstFile from Shared.
Now we handle a suffix after a bare locator, e.g.
@item1 [p. 30, suffix]
The suffix now includes any punctuation that introduces it.
A few tests fail because of problems with citeproc (extra space
before the suffix, missing space after comma separating multiple
page ranges in the locator).
Suffixes and prefixes are now [Inline]. The locator is separated
from the citation key by a blank space. The locator consists of
one introductory word and any number of words containing at
least one digit. The suffix, if any, is separated from the locator
by a comma, and continues til the end of the citation.
citationPrefix now [Inline] rather than String;
citationSuffix added.
This change presupposes no changes in citeproc-hs.
It passes a string for these values to citeproc-hs.
Eventually, citeproc-hs should use an [Inline] for
these as well.