354 commits

Author SHA1 Message Date
John MacFarlane
e647f761ed Use spaceChar instead of oneOf " \t" in rst reader. 2011-01-19 15:17:51 -08:00
John MacFarlane
1b8a9711b8 Replaced more noneOf/oneOf parsers. 2011-01-19 15:14:23 -08:00
John MacFarlane
a400cfe10f Replaced uses of oneOf with more efficient parsers.
This speeds up the markdown reader.
2011-01-19 15:06:56 -08:00
John MacFarlane
a5cbcdfe3a HTML reader: parse simple tables.
Resolves Issue .  Thanks to Rodja Trappe for the idea
and some sample code.
2011-01-14 20:48:10 -08:00
John MacFarlane
c31d3cc306 HTML reader: parse location tags in pSatisfy.
This avoids the need for manual parsing all over the place.
2011-01-14 20:47:32 -08:00
John MacFarlane
d891b2c29d LaTeX reader: Support simple tables. 2011-01-07 10:15:48 -08:00
John MacFarlane
303ce8a9e5 LaTeX reader: allow spaces btw \\begin or \\end and {. 2011-01-06 09:34:24 -08:00
John MacFarlane
81ea1a59b4 LaTeX reader: Removed unnecessary 'spaces'. 2011-01-06 09:24:56 -08:00
John MacFarlane
1be2ca6c78 HTML reader: Fixed bug in htmlTag for comments. 2011-01-06 00:21:19 -08:00
John MacFarlane
b63a7f7c48 LaTeX reader: Apply macros to non-math; handle ensuremath. 2011-01-05 16:55:26 -08:00
John MacFarlane
18e7a7a495 LaTeX reader: Don't handle \label and \ref specially.
Put labels in {} instead of ().
2011-01-05 15:24:20 -08:00
John MacFarlane
1415b6831e LaTeX reader: Support \L \l accents. 2011-01-05 14:57:06 -08:00
John MacFarlane
23aae79b01 Updated for texmath 0.5. 2011-01-05 14:44:26 -08:00
John MacFarlane
e126ab9efc LaTeX reader: Parse inside arguments when ignoring commands. 2011-01-05 12:25:47 -08:00
John MacFarlane
c3071ff6e9 LaTeX reader: Don't handle \index separately.
Instead, just put it in list of commands to ignore.
2011-01-05 12:05:04 -08:00
John MacFarlane
b26247a4a8 LaTeX reader: Added "index" to ignorable commands. 2011-01-05 11:56:37 -08:00
John MacFarlane
cf6cd15c27 LaTeX reader: skip space before option or argument. 2011-01-05 11:54:40 -08:00
John MacFarlane
d033fc9d3e LaTeX reader: Skip \index commands. 2011-01-05 10:11:24 -08:00
John MacFarlane
c949530815 LaTeX reader: Removed \group (we want to parse inside {}). 2011-01-05 10:06:51 -08:00
John MacFarlane
3dab6c574c LaTeX reader: Better handling of preamble, inc. parsing macros. 2011-01-05 09:04:03 -08:00
John MacFarlane
85bfd26b78 LaTeX reader: Parse bracketed {parts} as raw TeX. 2011-01-04 22:20:35 -08:00
John MacFarlane
22b2c02aeb Markdown reader: Removed unneeded definitions.
specialChars, strChar, specialCharsMinusLt.
2011-01-04 22:11:56 -08:00
John MacFarlane
dac2e9156f LaTeX reader: parse macros and apply to math. 2011-01-04 19:18:20 -08:00
John MacFarlane
fcbe1e95eb Moved 'macro' and 'applyMacros'' from markdown reader to Parsing. 2011-01-04 19:12:33 -08:00
John MacFarlane
3e61333af0 Fixed regression in markdown reader.
'(_hi_)' was being parsed with literal underscores (no emphasis).
The fix:  the 'str' parser now only parses alphanumerics and
embedded underscores.  All other symbols are handled by the
'symbol' parser.  This has a slight effect on the AST, since
you'll get [Str "hi",Str ":"] instead of [Str "hi:"].  But there
should not be a visible effect in any of the writers.

Thanks to gwern for pointing out the regression.
2011-01-01 22:46:30 -08:00
John MacFarlane
b05e739c6d LaTeX reader: Allow ignored comments after \end{document}. 2010-12-30 22:05:19 -08:00
John MacFarlane
d6f28af9cb HTML reader: Fixed some parsing bugs. 2010-12-30 19:33:37 -08:00
Puneeth Chaganti
e4dedad1c0 Added support for listings package code blocks and inline code. 2010-12-30 14:37:51 -08:00
John MacFarlane
f49e60a8b8 Textile reader: Slight speed improvement. 2010-12-30 14:33:11 -08:00
John MacFarlane
904050fa36 New HTML reader using tagsoup as a lexer.
* The new reader is faster and more accurate.

* API changes for Text.Pandoc.Readers.HTML:
   - removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag,
     anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType,
     htmlBlockElement, htmlComment
   - added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag

* tagsoup is a new dependency.

* Text.Pandoc.Parsing: Generalized type on readWith.

* Benchmark.hs: Added length calculation to force full evaluation.

* Updated HTML reader tests.

* Updated markdown and textile readers to use the functions from
  the HTML reader.

* Note: The markdown reader now correctly handles some cases it did not
  before. For example:

    <hr/>

  is reproduced without adding a space.

    <script>
      a = '<b>';
    </script>

  is parsed correctly.
2010-12-30 13:55:40 -08:00
John MacFarlane
10d85f8b0b Use functions from Text.Pandoc.Generic instead of processWith(M). 2010-12-24 13:39:27 -08:00
John MacFarlane
c08ca6fa6d HTML reader: Simplified parsing of <script> sections.
I had previously assumed that we needed to ignore
</script> occurring in a string literal or javascript
comment.  It turns out, though, that browsers aren't
that smart.
2010-12-22 19:20:27 -08:00
John MacFarlane
4bfe140ed1 Made --smart work with HTML reader.
It did not work before, because - and quotes were gobbled
up by the str parser.
2010-12-22 17:05:17 -08:00
John MacFarlane
63bf227e04 RST reader: Added unicode quote characters to specialChars.
(So they can trigger Quoted environments.)
2010-12-22 17:04:56 -08:00
John MacFarlane
bbad129066 RST reader: recouped speed loss due to addition of --smart.
This was achieved by rearranging the parsers in inline.

Benchmarks went from 500ms to 307ms -- not quite back to the
279ms we had in 1.6, before supporting smart punctuation and
footnotes, but close.
2010-12-22 15:10:21 -08:00
John MacFarlane
fe1152985c Shared: Made splitBy take a test instead of an element. 2010-12-21 08:41:24 -08:00
John MacFarlane
63cf37a9ca HTML reader: allow : in tags.
Resolves Issue .
2010-12-15 14:15:53 -08:00
John MacFarlane
3ac6f72f98 Fixed preamble parsing in LaTeX reader. 2010-12-14 19:34:28 -08:00
John MacFarlane
128cf46089 Fixed regression in parsing _emph_
There was a bug in parsing '_emph_, ...':  when followed by
a comma, underscore emphasis did not register.  (Thanks to
gwern for pointing this out.)

This bug was introduced by the change in
c66921f2ac
2010-12-14 18:23:26 -08:00
Nathan Gass
2e728df756 Moved special handling of punctuation in suffix out of markdown reader.
This allows different writers to handle punctuation in the suffix
differently.
2010-12-13 20:50:29 -08:00
Nathan Gass
c2d3796439 Added support for latex cite commands in latex reader. 2010-12-13 20:48:19 -08:00
John MacFarlane
1a4a0d0283 Markdown reader: Further fix to abbrevs. 2010-12-13 20:05:50 -08:00
John MacFarlane
7b4d3c77ec Markdown reader: Fixed abbrev handler to allow abbrev at end of line.
E.g., Mr.
Frank.
2010-12-13 20:04:11 -08:00
John MacFarlane
3822d6c440 Markdown reader: Fixed referenceKey parser to allow space after newline. 2010-12-13 20:03:59 -08:00
John MacFarlane
71e0557e61 Markdown reader: Fixed regression in reference key parser.
* The recent change allowing spaces and newlines in the URL
  caused problems when reference keys are stacked up without
  blank lines between. This is now fixed.
* Added test.
2010-12-13 20:03:12 -08:00
John MacFarlane
3748dfeb91 Markdown reader: fix superscripts with links.
Moved inlineNote parser after superscript parser,
so ^[link](/foo)^ gets recognized as a superscripted
link, not an inline note followed by garbage.

Thanks to Conal Elliott for pointing out the problem.
2010-12-12 20:30:55 -08:00
John MacFarlane
2dfb45950e LaTeX reader: Improved parsing of preamble.
Previously you'd get unexpected behavior on a document that
contained '\begin{document}' in, say, a verbatim block.
2010-12-10 23:21:24 -08:00
John MacFarlane
de6452c0d1 Markdown reader: small cosmetic code improvements. 2010-12-10 16:26:35 -08:00
John MacFarlane
5770ceca36 Removed HTML sanitization.
This is better done on the resulting HTML; use the xss-sanitize library
for this.  xss-sanitize is based on pandoc's sanitization, but improves
it.

- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.
2010-12-10 12:26:03 -08:00
John MacFarlane
17d48cf4af Markdown reader: Allow linebreaks in URLs (treat as spaces).
Also, a string of consecutive spaces or tabs is now parsed
as a single space. If you have multiple spaces in your URL,
use %20%20.
2010-12-10 12:14:51 -08:00
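
The URL whitespace rule described in 17d48cf4af can be illustrated with a small standalone sketch. This is not pandoc's actual parser code: the name collapseUrlWhitespace is hypothetical, and the sketch assumes that a mixed run of spaces, tabs, and line breaks collapses to one space and that leading or trailing whitespace is not trimmed, neither of which the commit message spells out.

```haskell
import Data.Char (isSpace)

-- Hypothetical helper (not part of pandoc): replace every run of
-- whitespace inside a URL (spaces, tabs, line breaks) with one space.
-- Assumption: line breaks adjacent to spaces merge into the same run.
collapseUrlWhitespace :: String -> String
collapseUrlWhitespace [] = []
collapseUrlWhitespace (c:cs)
  | isSpace c = ' ' : collapseUrlWhitespace (dropWhile isSpace cs)
  | otherwise = c : collapseUrlWhitespace cs
```

Under these assumptions, collapseUrlWhitespace "/my  long\n\turl" evaluates to "/my long url", while "/my%20%20url" passes through unchanged, which is why the commit message recommends %20%20 for URLs that really need two spaces.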