Commit graph

209 commits

Author SHA1 Message Date
John MacFarlane
1b8a9711b8 Replaced more noneOf/oneOf parsers. 2011-01-19 15:14:23 -08:00
John MacFarlane
a400cfe10f Replaced uses of oneOf with more efficient parsers.
This speeds up the markdown reader.
2011-01-19 15:06:56 -08:00
John MacFarlane
22b2c02aeb Markdown reader: Removed unneeded definitions.
specialChars, strChar, specialCharsMinusLt.
2011-01-04 22:11:56 -08:00
John MacFarlane
fcbe1e95eb Moved 'macro' and 'applyMacros'' from markdown reader to Parsing. 2011-01-04 19:12:33 -08:00
John MacFarlane
3e61333af0 Fixed regression in markdown reader.
'(_hi_)' was being parsed with literal underscores (no emphasis).
The fix:  the 'str' parser now only parses alphanumerics and
embedded underscores.  All other symbols are handled by the
'symbol' parser.  This has a slight effect on the AST, since
you'll get [Str "hi",Str ":"] instead of [Str "hi:"].  But there
should not be a visible effect in any of the writers.

Thanks to gwern for pointing out the regression.
2011-01-01 22:46:30 -08:00
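The split between the 'str' and 'symbol' parsers described in 3e61333af0 can be illustrated with a small Python sketch. This is not pandoc's Haskell code: the `TOKEN` pattern and the `tokenize` function are hypothetical names, and the sketch only assumes the rule stated in the message, namely that a string is a run of alphanumerics with embedded underscores and every other non-space character is a symbol of its own.

```python
import re

# 'str': alphanumerics, with underscores allowed only between alphanumerics.
# 'symbol': any other single non-whitespace character.
TOKEN = re.compile(r"(?P<str>[^\W_]+(?:_+[^\W_]+)*)|(?P<symbol>\S)")


def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs, where kind is 'str' or 'symbol'."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
```

Under these assumptions `tokenize("hi:")` gives a string token followed by a separate symbol token, and the underscores in `(_hi_)` come out as symbols, which leaves them available to an emphasis parser.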
John MacFarlane
904050fa36 New HTML reader using tagsoup as a lexer.
* The new reader is faster and more accurate.

* API changes for Text.Pandoc.Readers.HTML:
   - removed rawHtmlBlock, anyHtmlBlockTag, anyHtmlInlineTag,
     anyHtmlTag, anyHtmlEndTag, htmlEndTag, extractTagType,
     htmlBlockElement, htmlComment
   - added htmlTag, htmlInBalanced, isInlineTag, isBlockTag, isTextTag

* tagsoup is a new dependency.

* Text.Pandoc.Parsing: Generalized type on readWith.

* Benchmark.hs: Added length calculation to force full evaluation.

* Updated HTML reader tests.

* Updated markdown and textile readers to use the functions from
  the HTML reader.

* Note: The markdown reader now correctly handles some cases it did not
  before. For example:

    <hr/>

  is reproduced without adding a space.

    <script>
      a = '<b>';
    </script>

  is parsed correctly.
2010-12-30 13:55:40 -08:00
John MacFarlane
10d85f8b0b Use functions from Text.Pandoc.Generic instead of processWith(M). 2010-12-24 13:39:27 -08:00
John MacFarlane
128cf46089 Fixed regression in parsing _emph_
There was a bug in parsing '_emph_, ...':  when followed by
a comma, underscore emphasis did not register.  (Thanks to
gwern for pointing this out.)

This bug was introduced by the change in
c66921f2ac
2010-12-14 18:23:26 -08:00
Nathan Gass
2e728df756 Moved special handling of punctuation in suffix out of markdown reader.
This allows different writers to handle punctuation in the suffix
differently.
2010-12-13 20:50:29 -08:00
John MacFarlane
1a4a0d0283 Markdown reader: Further fix to abbrevs. 2010-12-13 20:05:50 -08:00
John MacFarlane
7b4d3c77ec Markdown reader: Fixed abbrev handler to allow abbrev at end of line.
E.g., Mr.
Frank.
2010-12-13 20:04:11 -08:00
John MacFarlane
3822d6c440 Markdown reader: Fixed referenceKey parser to allow space after newline. 2010-12-13 20:03:59 -08:00
John MacFarlane
71e0557e61 Markdown reader: Fixed regression in reference key parser.
* The recent change allowing spaces and newlines in the URL
  caused problems when reference keys are stacked up without
  blank lines between. This is now fixed.
* Added test.
2010-12-13 20:03:12 -08:00
John MacFarlane
3748dfeb91 Markdown reader: fix superscripts with links.
Moved inlineNote parser after superscript parser,
so ^[link](/foo)^ gets recognized as a superscripted
link, not an inline note followed by garbage.

Thanks to Conal Elliott for pointing out the problem.
2010-12-12 20:30:55 -08:00
John MacFarlane
de6452c0d1 Markdown reader: small cosmetic code improvements. 2010-12-10 16:26:35 -08:00
John MacFarlane
5770ceca36 Removed HTML sanitization.
This is better done on the resulting HTML; use the xss-sanitize library
for this.  xss-sanitize is based on pandoc's sanitization, but improves
it.

- Removed stateSanitize from ParserState.
- Removed --sanitize-html option.
2010-12-10 12:26:03 -08:00
John MacFarlane
17d48cf4af Markdown reader: Allow linebreaks in URLs (treat as spaces).
Also, a string of consecutive spaces or tabs is now parsed
as a single space. If you have multiple spaces in your URL,
use %20%20.
2010-12-10 12:14:51 -08:00
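The whitespace rule in 17d48cf4af can be sketched in Python as follows. The function name `normalize_url_whitespace` is hypothetical and this is not pandoc's implementation; the message does not say whether leading or trailing whitespace is trimmed, so the sketch leaves it alone.

```python
import re


def normalize_url_whitespace(raw: str) -> str:
    # Treat line breaks like spaces, and collapse each run of
    # spaces, tabs, or line breaks into one space.
    return re.sub(r"[ \t\r\n]+", " ", raw)
```

Percent-encoded spaces such as `%20%20` are ordinary characters to this function and pass through unchanged, which is why the message recommends them for URLs that need more than one space.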
John MacFarlane
ee0a0953de Markdown reader: Rewrote para parser for better efficiency.
This change avoids repeated parsing of inline lists for 'plain'
blocks.
2010-12-10 10:47:46 -08:00
John MacFarlane
91978d2201 Markdown reader: minor footnote changes.
Don't skipNonindentSpaces in noteMarker, since it's also
used in the inline note parser.
2010-12-08 08:17:16 -08:00
John MacFarlane
33ba35da9f Smart punctuation: recognize entities.
Now &ldquo;Hi&rdquo; gets parsed as a Quoted DoubleQuote inline.
2010-12-07 20:44:43 -08:00
John MacFarlane
e20052a1ba Markdown reader: Moved smartPunctuation parser, for slight speed bump. 2010-12-07 20:09:40 -08:00
John MacFarlane
50ca61ef49 Moved smartPunctuation from Markdown to Parsing.
+ Parameterized smartPunctuation on an inline parser.
+ Handle smartPunctuation in Textile reader.
2010-12-07 19:03:08 -08:00
John MacFarlane
c66921f2ac Markdown reader: better handling of intraword _.
The 'str' parser now reads internal _'s as part of the string.
This prevents pandoc from getting started looking for an emphasized
block, which can cause exponential slowdowns in some cases.

Resolves Issue #182.
2010-12-06 22:12:18 -08:00
John MacFarlane
7864f30717 Markdown reader: handle curly quotes better.
Previously, curly quotes were just parsed literally, leading
to problems in some output formats.  Now they are parsed as
Quoted inlines, if --smart is specified.

Resolves Issue #270.
2010-12-06 20:36:58 -08:00
John MacFarlane
5a4609584c Fix regression: markdown references should be case-insensitive.
This broke when we added the Key type.  We had assumed that
the custom case-insensitive Ord instance would ensure case-insensitive
matching, but that is not how Data.Map works.

* Added a test case for case-insensitivity in markdown-reader-more
* Removed old refsMatch from Text.Pandoc.Parsing module;
* hid the 'Key' constructor;
* dropped the custom Ord and Eq instances, deriving instead;
* added fromKey and toKey to convert between Keys and Inline lists;
* toKey ensures that keys are case-insensitive, since this is the
  only way the API provides to construct a Key.

Resolves Issue #272.
2010-12-05 19:27:00 -08:00
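The design in 5a4609584c, normalizing case when a key is built rather than relying on a custom comparison, can be illustrated with a minimal Python sketch. `Key`, `to_key`, and `from_key` here are hypothetical stand-ins for the Haskell `Key`, `toKey`, and `fromKey`; the real functions convert to and from Inline lists, which the sketch simplifies to plain strings, and Python cannot hide the constructor the way the Haskell module does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Stand-in for a reference key; equality and hashing are derived."""
    text: str


def to_key(label: str) -> Key:
    # Normalize at the one construction point callers are meant to use,
    # so labels that differ only in case yield equal keys.
    return Key(label.lower())


def from_key(key: Key) -> str:
    # Recover the normalized label.
    return key.text


# A reference table keyed by normalized keys.
refs = {to_key("Foo Bar"): "/url"}
```

With this arrangement a lookup such as `refs[to_key("foo BAR")]` finds the entry stored under "Foo Bar", with no custom ordering or equality needed.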
John MacFarlane
357b965b44 Merge branch 'citeproc' into master.
Conflicts:
	src/Text/Pandoc/Definition.hs
2010-12-03 23:43:47 -08:00
paul.rivier
c3866f3c66 punctuation handling, and more html-specific handling 2010-12-03 23:10:52 -08:00
John MacFarlane
4c21c5566d Merge branch 'master' into citeproc 2010-11-28 20:21:07 -08:00
John MacFarlane
3ffd724617 Markdown parser performance improvement.
Do a quick lookahead to make sure what follows looks like a setext
header before parsing any Inlines.  This gives a 15% performance
boost in one benchmark.  Many thanks to knieriem for finding
the problem (in peg-markdown):

https://github.com/jgm/peg-markdown/issues/issue/3
2010-11-28 20:19:32 -08:00
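The lookahead idea in 3ffd724617 can be sketched in Python. The names `UNDERLINE` and `looks_like_setext_header` are hypothetical, and what pandoc's lookahead actually checks is not stated in the message; the sketch assumes a setext header is a non-blank line followed by a line made only of '=' or only of '-'.

```python
import re

# A setext underline: a run of '=' or a run of '-', then optional trailing blanks.
UNDERLINE = re.compile(r"(=+|-+)[ \t]*$")


def looks_like_setext_header(lines: list[str], i: int) -> bool:
    # Cheap test on raw text, done before any inline parsing of lines[i].
    return (
        i + 1 < len(lines)
        and lines[i].strip() != ""
        and UNDERLINE.match(lines[i + 1]) is not None
    )
```

A parser would run the expensive inline parser on `lines[i]` only when this check succeeds, so ordinary paragraph lines are not parsed as header text and then discarded.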
John MacFarlane
0ca84f0d38 Markdown suffix parser fix.
If suffix doesn't begin with punctuation, include opening
comma and space in result.

Previously,

@item [only a suffix]

would result in something like

Doe (2002only a suffix)

because there was no opening delimiter.
2010-11-26 22:34:53 -08:00
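The rule in 0ca84f0d38 can be sketched in Python. `normalize_suffix` is a hypothetical name, and the use of `string.punctuation` (ASCII punctuation only) is an assumption about what counts as punctuation; this is not pandoc's code.

```python
import string


def normalize_suffix(suffix: str) -> str:
    # If the suffix does not open with punctuation, supply ", " as the
    # opening delimiter so it is not glued to whatever precedes it.
    if suffix and suffix[0] not in string.punctuation:
        return ", " + suffix
    return suffix
```

Applied to the example above, "only a suffix" becomes ", only a suffix", so it no longer runs directly into the year.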
John MacFarlane
0871a512d7 Split locator and suffix in Biblio rather than Markdown parser.
Patch from Nathan Gass.
2010-11-26 12:06:56 -08:00
John MacFarlane
b48fa0ea59 Check biblio for all citations, not just textual. 2010-11-22 23:09:30 -08:00
John MacFarlane
6390103509 Markdown citation parser: small refactoring for clarity. 2010-11-18 14:16:18 -08:00
John MacFarlane
f3bb3c1ff1 Markdown citation parser improvements and test updates.
Now we handle a suffix after a bare locator, e.g.
@item1 [p. 30, suffix]
The suffix now includes any punctuation that introduces it.
A few tests fail because of problems with citeproc (extra space
before the suffix, missing space after comma separating multiple
page ranges in the locator).
2010-11-18 13:22:20 -08:00
John MacFarlane
aaf7de0dda Markdown reader: Revised parser for new citation syntax.
Suffixes and prefixes are now [Inline].  The locator is separated
from the citation key by a blank space.  The locator consists of
one introductory word and any number of words containing at
least one digit.  The suffix, if any, is separated from the locator
by a comma, and continues until the end of the citation.
2010-11-18 12:38:45 -08:00
John MacFarlane
47c64d4fc4 Don't pass a [Str ""] as citationPrefix. 2010-11-17 15:35:53 -08:00
John MacFarlane
ce9fc2a37d Updated for changes in Citation type.
citationPrefix now [Inline] rather than String;
citationSuffix added.

This change presupposes no changes in citeproc-hs.
It passes a string for these values to citeproc-hs.
Eventually, citeproc-hs should use an [Inline] for
these as well.
2010-11-16 20:31:22 -08:00
John MacFarlane
1fa2973da6 Repairs to citation parser + citation test suite. 2010-11-12 19:30:59 -08:00
John MacFarlane
79bab2d210 Revised citation parsers for markdown reader.
Added a form for in-text citations:

@doe99 [30; see also @smith99].
2010-11-12 00:37:44 -08:00
John MacFarlane
36d4e649a6 Added support for textual citations (but not yet markdown syntax).
Patch from Andrea Rossato.
2010-11-11 21:30:34 -08:00
John MacFarlane
83e6c01e4d Merge branch 'master' into citeproc 2010-11-09 22:52:36 -08:00
John MacFarlane
21556e37f4 Allow HTML comments as inline elements in markdown.
So,
aaa <!-- comment --> bbb
can be a single paragraph.
2010-11-09 22:51:02 -08:00
John MacFarlane
23c6f56bc5 Removed CITEPROC CPP conditionals from library code.
By Cabal policy, the API should not change depending on flags.
2010-11-06 14:58:54 -07:00
John MacFarlane
f7f6b2427d Changes to use citeproc-hs 0.3. 2010-11-06 14:43:23 -07:00
John MacFarlane
ac06ca2b00 Changes to use citeproc 0.3.
Patch from Andrea Rossato.
Note: the markdown syntax is preliminary and will probably change.
2010-10-27 18:25:59 -07:00
John MacFarlane
f870777c36 Parse blanklines after macro definitions. 2010-10-26 19:52:12 -07:00
John MacFarlane
6b722d1b45 Process LaTeX macros in markdown, and apply to TeX math.
Example:
\newcommand{\plus}[2]{#1 + #2}

$\plus{3}{4}$

yields:

3+4
2010-10-26 09:03:03 -07:00
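The macro substitution in 6b722d1b45 can be illustrated with a simplified Python sketch. `parse_macro` and `apply_macro` are hypothetical names, not the Haskell `macro` and `applyMacros'` functions, and the sketch assumes a single-digit argument count and arguments and bodies without nested braces.

```python
import re

# Matches \newcommand{\name}[n]{body}, where body has no nested braces.
NEWCOMMAND = re.compile(r"\\newcommand\{\\(\w+)\}\[(\d)\]\{([^{}]*)\}")


def parse_macro(definition: str) -> tuple[str, int, str]:
    m = NEWCOMMAND.fullmatch(definition.strip())
    if m is None:
        raise ValueError("unsupported macro definition")
    return m.group(1), int(m.group(2)), m.group(3)


def apply_macro(macro: tuple[str, int, str], tex: str) -> str:
    name, arity, body = macro
    # A use looks like \name{arg1}...{argN}; arguments may not contain braces.
    use = re.compile(r"\\" + re.escape(name) + r"\{([^{}]*)\}" * arity)

    def expand(m: re.Match) -> str:
        result = body
        for i in range(1, arity + 1):
            # Substitute the i-th argument for the placeholder #i.
            result = result.replace(f"#{i}", m.group(i))
        return result

    return use.sub(expand, tex)
```

Parsing the `\plus` definition from the example and applying it to `$\plus{3}{4}$` rewrites the math to `$3 + 4$` before it reaches a math renderer.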
John MacFarlane
afe18e53f1 Modified example refs so they can occur before or after target.
The refs are now replaced by numbers at the final stage, using
processWith.
2010-07-12 23:05:46 -07:00
John MacFarlane
0181e66250 Merge branch 'atlists'. Added auto-numbered example lists. 2010-07-11 22:47:52 -07:00
John MacFarlane
73b4cc0897 Minor comment change. 2010-07-06 21:23:25 -07:00