Commit graph

355 commits

Author SHA1 Message Date
John MacFarlane
e1f3c6058e Added Text.Pandoc.Readers.Native (readNative).
readNative can now read full pandoc documents, block lists, blocks,
inline lists, or inlines.  It will interpret

Str "hi"

as if it were

Pandoc (Meta [] [] []) [Plain [Str "hi"]]

This should make testing easier.
2011-01-19 18:36:27 -08:00
John MacFarlane
e647f761ed Use spaceChar instead of oneOf " \t" in rst reader. 2011-01-19 15:17:51 -08:00
John MacFarlane
1b8a9711b8 Replaced more noneOf/oneOf parsers. 2011-01-19 15:14:23 -08:00
John MacFarlane
a400cfe10f Replaced uses of oneOf with more efficient parsers.
This speeds up the markdown reader.
2011-01-19 15:06:56 -08:00
John MacFarlane
a5cbcdfe3a HTML reader: parse simple tables.
Resolves Issue #106.  Thanks to Rodja Trappe for the idea
and some sample code.
2011-01-14 20:48:10 -08:00
John MacFarlane
c31d3cc306 HTML reader: parse location tags in pSatisfy.
This avoids the need for manual parsing all over the place.
2011-01-14 20:47:32 -08:00
John MacFarlane
d891b2c29d LaTeX reader: Support simple tables. 2011-01-07 10:15:48 -08:00
John MacFarlane
303ce8a9e5 LaTeX reader: allow spaces btw \\begin or \\end and {. 2011-01-06 09:34:24 -08:00
John MacFarlane
81ea1a59b4 LaTeX reader: Removed unnecessary 'spaces'. 2011-01-06 09:24:56 -08:00
John MacFarlane
1be2ca6c78 HTML reader: Fixed bug in htmlTag for comments. 2011-01-06 00:21:19 -08:00
John MacFarlane
b63a7f7c48 LaTeX reader: Apply macros to non-math; handle ensuremath. 2011-01-05 16:55:26 -08:00
John MacFarlane
18e7a7a495 LaTeX reader: Don't handle \label and \ref specially.
Put labels in {} instead of ().
2011-01-05 15:24:20 -08:00
John MacFarlane
1415b6831e LaTeX reader: Support \L \l accents. 2011-01-05 14:57:06 -08:00
John MacFarlane
23aae79b01 Updated for texmath 0.5. 2011-01-05 14:44:26 -08:00
John MacFarlane
e126ab9efc LaTeX reader: Parse inside arguments when ignoring commands. 2011-01-05 12:25:47 -08:00
John MacFarlane
c3071ff6e9 LaTeX reader: Don't handle \index separately.
Instead, just put it in list of commands to ignore.
2011-01-05 12:05:04 -08:00
John MacFarlane
b26247a4a8 LaTeX reader: Added "index" to ignorable commands. 2011-01-05 11:56:37 -08:00
John MacFarlane
cf6cd15c27 LaTeX reader: skip space before option or argument. 2011-01-05 11:54:40 -08:00
John MacFarlane
d033fc9d3e LaTeX reader: Skip \index commands. 2011-01-05 10:11:24 -08:00
John MacFarlane
c949530815 LaTeX reader: Removed \group (we want to parse inside {}). 2011-01-05 10:06:51 -08:00
John MacFarlane
3dab6c574c LaTeX reader: Better handling of preamble, inc. parsing macros. 2011-01-05 09:04:03 -08:00
John MacFarlane
85bfd26b78 LaTeX reader: Parse bracketed {parts} as raw TeX. 2011-01-04 22:20:35 -08:00
John MacFarlane
22b2c02aeb Markdown reader: Removed unneeded definitions.
specialChars, strChar, specialCharsMinusLt.
2011-01-04 22:11:56 -08:00
John MacFarlane
dac2e9156f LaTeX reader: parse macros and apply to math. 2011-01-04 19:18:20 -08:00
John MacFarlane
fcbe1e95eb Moved 'macro' and 'applyMacros'' from markdown reader to Parsing. 2011-01-04 19:12:33 -08:00
John MacFarlane
3e61333af0 Fixed regression in markdown reader.
'(_hi_)' was being parsed with literal underscores (no emphasis).
The fix:  the 'str' parser now only parses alphanumerics and
embedded underscores.  All other symbols are handled by the
'symbol' parser.  This has a slight effect on the AST, since
you'll get [Str "hi",Str ":"] insntead of [Str "hi:"].  But there
should not be a visible effect in any of the writers.

Thanks to gwern for pointing out the regression.
2011-01-01 22:46:30 -08:00
John MacFarlane
b05e739c6d LaTeX reader: Allow ignored comments after \end{document}. 2010-12-30 22:05:19 -08:00
John MacFarlane
d6f28af9cb HTML reader: Fixed some parsing bugs. 2010-12-30 19:33:37 -08:00
Puneeth Chaganti
e4dedad1c0 Added support for listings package code blocks and inline code. 2010-12-30 14:37:51 -08:00
John MacFarlane
f49e60a8b8 Textile reader: Slight speed improvement. 2010-12-30 14:33:11 -08:00
John MacFarlane
904050fa36 New HTML reader using tagsoup as a lexer.
* The new reader is faster and more accurate.

* API changes for Text.Pandoc.Readers.HTML:
   - removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag,
     anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType,
     htmlBlockElement, htmlComment
   - added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag

* tagsoup is a new dependency.

* Text.Pandoc.Parsing: Generalized type on readWith.

* Benchmark.hs: Added length calculation to force full evaluation.

* Updated HTML reader tests.

* Updated markdown and textile readers to use the functions from
  the HTML reader.

* Note: The markdown reader now correctly handles some cases it did not
  before. For example:

    <hr/>

  is reproduced without adding a space.

    <script>
      a = '<b>';
    </script>

  is parsed correctly.
2010-12-30 13:55:40 -08:00
John MacFarlane
10d85f8b0b Use functions from Text.Pandoc.Generic instead of processWith(M). 2010-12-24 13:39:27 -08:00
John MacFarlane
c08ca6fa6d HTML reader: Simplified parsing of <script> sections.
I had previously assumed that we needed to ignore
</script> occuring in a string literal or javascript
comment.  It turns out, though, that browsers aren't
that smart.
2010-12-22 19:20:27 -08:00
John MacFarlane
4bfe140ed1 Made --smart work with HTML reader.
It did not work before, because - and quotes were gobbled
up by the str parser.
2010-12-22 17:05:17 -08:00
John MacFarlane
63bf227e04 RST reader: Added unicode quote characters to specialChars.
(So they can trigger Quoted environments.)
2010-12-22 17:04:56 -08:00
John MacFarlane
bbad129066 RST reader: recouped speed loss due to addition of --smart.
This was achieved by rearranging the parsers in inline.

Benchmarks went from 500ms to 307ms -- not quite back to the
279ms we had in 1.6, before supporting smart punctuation and
footnotes, but close.
2010-12-22 15:10:21 -08:00
John MacFarlane
fe1152985c Shared: Made splitBy take a test instead of an element. 2010-12-21 08:41:24 -08:00
John MacFarlane
63cf37a9ca HTML reader: allow : in tags.
Resolves Issue #274.
2010-12-15 14:15:53 -08:00
John MacFarlane
3ac6f72f98 Fixed preamble parsing in LaTeX reader. 2010-12-14 19:34:28 -08:00
John MacFarlane
128cf46089 Fixed regression in parsing _emph_
There was a bug in parsing '_emph_, ...':  when followed by
a comma, underscore emphasis did not register.  (Thanks to
gwern for pointing this out.)

This bug was introduced by the change in
c66921f2ac
2010-12-14 18:23:26 -08:00
Nathan Gass
2e728df756 Moved special handling of punctuation in suffix out of markdown reader.
This allows different writers to handle punctuation in the suffix
differently.
2010-12-13 20:50:29 -08:00
Nathan Gass
c2d3796439 Added support for latex cite commands in latex reader. 2010-12-13 20:48:19 -08:00
John MacFarlane
1a4a0d0283 Markdown reader: Further fix to abbrevs. 2010-12-13 20:05:50 -08:00
John MacFarlane
7b4d3c77ec Markdown reader: Fixed abbrev handler to allow abbrev at end of line.
E.g., Mr.
Frank.
2010-12-13 20:04:11 -08:00
John MacFarlane
3822d6c440 Markdown reader: Fixed referenceKey parser to allow space after newline. 2010-12-13 20:03:59 -08:00
John MacFarlane
71e0557e61 Markdown reader: Fixed regression in reference key parser.
* The recent change allowing spaces and newlines in the URL
  caused problems when reference keys are stacked up without
  blank lines between. This is now fixed.
* Added test.
2010-12-13 20:03:12 -08:00
John MacFarlane
3748dfeb91 Markdown reader: fix superscripts with links.
Moved inlineNote parser after superscript parser,
so ^[link](/foo)^ gets recognized as a superscripted
link, not an inline note followed by garbage.

Thanks to Conal Elliott for pointing out the problem.
2010-12-12 20:30:55 -08:00
John MacFarlane
2dfb45950e LaTeX reader: Improved parsing of preamble.
Previously you'd get unexpected behavior on a document that
contained '\begin{document}' in, say, a verbatim block.
2010-12-10 23:21:24 -08:00
John MacFarlane
de6452c0d1 Markdown reader: small cosmetic code improvements. 2010-12-10 16:26:35 -08:00
John MacFarlane
5770ceca36 Removed HTML sanitization.
This is better done on the resulting HTML; use the xss-sanitize library
for this.  xss-sanitize is based on pandoc's sanitization, but improves
it.

- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.
2010-12-10 12:26:03 -08:00