Commit graph

1765 commits

Author SHA1 Message Date
Albert Krewinkel
68d388f833 Org reader: add :PROPERTIES: drawer support
Headers can have optional `:PROPERTIES:` drawers associated with them.
These drawers contain key/value pairs like the header's `id`.  The
reader adds all listed pairs to the header's attributes; `id` and
`class` attributes are handled specially to match the way `Attr` are
defined.

This also changes behavior of how drawers of unknown type are handled.
Instead of including all unknown drawers, those are not read/exported,
thereby matching current Emacs behavior.

This closes #1877.
2016-05-20 17:01:26 +02:00
John MacFarlane
0958f2f5d0 Merge pull request #2927 from tarleb/org-attr-html
Org reader support for ATTR_HTML statements
2016-05-19 10:44:11 -07:00
Albert Krewinkel
16e233475a Org reader: add support for ATTR_HTML attributes
Arbitrary key-value pairs can be added to some block types using a
`#+ATTR_HTML` line before the block.  Emacs Org-mode only includes these
when exporting to HTML, but since we cannot make this distinction here,
the attributes are always added.

The functionality is now supported for figures.

This closes #1906.
2016-05-19 09:55:12 +02:00
Albert Krewinkel
26e8d98be2 Org reader: use custom anyLine
Additional state changes need to be made after a newline is parsed,
otherwise markup may not be recognized correctly.

This fixes a bug where markup after certain block-types would not be
recognized. E.g. `/emph/` in the following snippet was not parsed as
emphasized.

    foo
    # comment
    /emph/
2016-05-19 09:35:47 +02:00
Albert Krewinkel
1dda535378 Org reader: refactor block attribute handling
A parser state attribute was used to keep track of block attributes
defined in meta-lines.  Global state is undesirable, so block attributes
are no longer saved as part of the parser state.  Old functions and the
respective part of the parser state are removed.
2016-05-19 09:33:51 +02:00
John MacFarlane
847167804a EPUB reader: unescape URIs in spine.
This should fix #2924.

Testing on the epub that caused the problem originally
would be welcome.
2016-05-17 09:38:52 -07:00
John MacFarlane
344412cba8 Merge pull request #2894 from sid-kap/rst-code-class
Add class option for code block in RST reader
2016-05-12 00:03:14 -07:00
Albert Krewinkel
be5cccf248 Org reader: parse but ignore export options
All known export options are parsed but ignored.
2016-05-11 19:13:43 +02:00
Albert Krewinkel
76143de97e Org reader: add support for sub/superscript export options
Org-mode allows to specify export settings via `#+OPTIONS` lines.
Disabling simple sub- and superscripts is one of these export options,
this options is now supported.
2016-05-11 19:13:43 +02:00
Albert Krewinkel
7a0729ea09 Org reader: move parser state into separate module
The org reader code has become large and confusing.  Extracting smaller
parts into submodules should help to clean things up.
2016-05-11 19:13:42 +02:00
John MacFarlane
fd9ec835ec Merge pull request #2907 from tarleb/org-fixes
Org fixes (reader and writer)
2016-05-09 10:17:56 -07:00
Albert Krewinkel
10a809f126 Org reader: fix inline-LaTeX regression
The last fix for whitespace handling of inline LaTeX commands was
incorrect, preventing correct recognition of inline LaTeX commands which
contain spaces.  This fix ensures that only trailing whitespace is cut
off.
2016-05-09 19:06:04 +02:00
roblabla
acd492c7f4 Allow spaces before '!' in MediaWiki table header 2016-05-09 17:54:40 +02:00
John MacFarlane
21d1a3b57c Merge pull request #2898 from tarleb/org-table-refactoring
Org reader: table parsing code refactoring and fixes
2016-05-05 16:22:56 -07:00
Albert Krewinkel
405c3e9c36 Org reader: fix spacing after LaTeX-style symbols
The org-reader was droping space after unescaped LaTeX-style symbol
commands: `\ForAll \Auml` resulted in `∀Ä` but should give `∀ Ä`
instead.  This seems to be because the LaTeX-reader treats the
command-terminating space as part of the command.  Dropping the trailing
space from the symbol-command fixes this issue.
2016-05-04 23:16:23 +02:00
Albert Krewinkel
2d825603c6 Org reader: fix handling of empty table cells, rows
This fixes Org mode parsing of some corner cases regarding empty cells
and rows.  Empty cells weren't parsed correctly, e.g. `|||` should be
two empty cells, but would be parsed as a single cell containing a pipe
character.  Empty rows where parsed as alignment rows and dropped from
the output.

This fixes #2616.
2016-05-04 16:02:03 +02:00
Albert Krewinkel
a51e4e8215 Org reader: refactor rows-to-table conversion
This refactores the codes conversing a list table lines to an org table
ADT.  The old code was simplified and is now slightly less ugly.
2016-05-04 16:01:22 +02:00
Albert Krewinkel
d5e4bc179c Org reader: stop padding short table rows
Emacs Org-mode doesn't add any padding to table rows.  The first
row (header or first body row) is used to determine the column count, no
other magic is performed.

The org reader was padding rows to the length of the longest table row.
This was done due to a misunderstanding of how Org handles tables.  This
feature reflected how Org-mode handles tables when pressing <TAB>.  The
Org exporter however, which is what the reader should implement, doesn't
do any of this.  So this was a mis-feature that made the reader more
complex and reduced comparability.  It was hence removed.
2016-05-04 15:48:07 +02:00
Sidharth Kapur
490c2b543d Add class option for code block in RST reader
According to http://docutils.sourceforge.net/docs/ref/rst/directives.html#code,
the code directive supports the ":class:" option.
2016-05-01 21:42:58 -05:00
Jesse Rosenthal
99eac312fe Binary fmts throw PandocError on zip-archive fail
Commit 91dc3342 made `readDocx` throw PandocError if there was an
unarchiving error. This extends that fix to `readOdt` and `readEPUB`.
2016-05-01 18:27:20 -04:00
Jesse Rosenthal
91dc334249 Docx Reader: Throw PandocError on unzip failure
Previously, readDocx would error out if zip-archive failed. We change
the archive extraction step from `toArchive` to `toArchiveOrFail`, which
returns an Either value.
2016-05-01 12:17:12 -04:00
Emanuel Evans
1bfe39e24c
Ignore leading space in org code blocks
Fixes #2862

Also fix up tab handling for leading whitespace in code blocks.
2016-04-26 10:29:59 -07:00
Jesse Rosenthal
a385ee1d4f Docx Reader: parse moveTo and moveFrom
`moveTo` and `moveFrom` are track-changes tags that are used when a
block of text is moved in the document. We now recognize these tags and
treat them the same as `insert` and `delete`, respectively. So,
`--track-changes=accept` will show the moved version, while
`--track-changes=reject` will show the original version.
2016-04-15 14:09:18 -04:00
John MacFarlane
4b49f923cb Markdown reader: Fix pandoc title blocks with lines ending in 2 spaces.
Closes #2799.

Also added -s to markdown-reader-more test.
2016-04-10 09:13:53 -07:00
John MacFarlane
773bbb8fc7 Markdown + HTML readers: be more forgiving about unescaped &.
We are now more forgiving about parsing invalid HTML with
unescaped `&` as raw HTML.  (Previously any unescaped `&`
would cause pandoc not to recognize the string as raw HTML.)

Closes #2410.
2016-04-10 07:39:36 -07:00
John MacFarlane
b1ffdf3b01 Fixed bug in Markdown raw HTML parsing.
This was a regression, with the rewrite of `htmlInBalanced`
(from `Text.Pandoc.Readers.HTML`) in 1.17.

It caused newlines to be omitted in raw HTML blocks.

Closes #2804.
2016-03-22 16:56:10 -07:00
Jesse Rosenthal
28c7617f19 Docx reader: Handle alternate content
Some word functions -- especially graphics -- give various choices for
content so there can be backwards compatibility. This follows the
largely undocumented feature by working through the choices until we
find one that works.

Note that we had to split out the processing of child elems of runs into
a separate function so we can recurse properly. Any processing of an
element *within* a run (other than a plain run) should go into
`childElemToRun`.
2016-03-18 09:38:26 -04:00
Jesse Rosenthal
855c8b43f0 Docx reader: Don't make numbered heads into lists.
Word uses list numbering styles to number its headings. We only call
something a numbered list if it does not also heave a heading style.
2016-03-16 12:50:32 -04:00
Jesse Rosenthal
ee03e954d0 Add readDocxWithWarnings
The regular readDocx just becomes a special case.
2016-03-12 17:08:20 -05:00
Jesse Rosenthal
102ba9ecb8 Docx Reader: Add state to the parser, for warnings
In order to be able to collect warnings during parsing, we add a state
monad transformer to the D monad. At the moment, this only includes a
list of warning strings (nothing currently triggers them, however). We
use StateT instead of WriterT to correspond more closely with the
warnings behavior in T.P.Parsing.
2016-03-12 17:08:20 -05:00
John MacFarlane
a485c42d78 Fixed behavior of base tag.
+ If the base path does not end with slash, the last component
  will be replaced.  E.g. base = `http://example.com/foo`
  combines with `bar.html` to give `http://example.com/bar.html`.
+ If the href begins with a slash, the whole path of the base
  is replaced.  E.g. base = `http://example.com/foo/` combines
  with `/bar.html` to give `http://example.com/bar.html`.

Closes #2777.
2016-03-10 19:59:55 -08:00
John MacFarlane
2b55b76ebe Markdown reader: Improved pipe table parsing.
Fixes #2765.
Added test case.
2016-03-09 11:46:00 -08:00
John MacFarlane
54a68616d7 Markdown reader: Clean up pipe table parsing. 2016-03-09 10:11:32 -08:00
John MacFarlane
6e950a8eb5 Markdown reader: allow + separators in pipe table cells.
We already allowed them in the header, but not in the body
rows, for some reason.  This gives compatibility with org-mode
tables.
2016-03-09 08:44:31 -08:00
John MacFarlane
4ed64835cb Markdown reader: don't cross line boundary parsing pipe table row.
Previously an emph element could be parsed across the newline
at the end of the pipe table row.

I thought this would help with #2765, but it doesn't.
2016-03-09 08:33:13 -08:00
Jesse Rosenthal
0b9c54d9f3 Docx reader: update feature checklist.
The feature checklist in the source code was out of date. Update.
2016-03-08 00:36:13 -05:00
John MacFarlane
7c6a3c0f69 LaTeX reader: handle interior $ characters in math.
e.g. `$$\hbox{$i$}$$`.

Partially addresses #2743.
2016-02-28 11:14:03 -08:00
Jesse Rosenthal
a7a0b452a5 Docx Reader: Get rid of Modifiable typeclass.
The docx reader used to use a Modifiable typeclass to combine both
Blocks and Inlines. But all the work was in the inlines. So most of the
generality was wasted, at the expense of making the code harder to
understand. This gets rid of the generality, and adds functions for
Blocks and Inlines. It should be a bit easier to work with going forward.
2016-02-26 08:57:53 -05:00
John MacFarlane
04d1e40f37 Markdown reader: use htmlInBalanced for rawVerbatimBlock.
This should give better performance.

See #2730.
2016-02-21 07:56:41 -08:00
John MacFarlane
9693de7f59 Fixed some linter warnings. 2016-02-20 22:16:39 -08:00
John MacFarlane
29706ee02d Merge pull request #2646 from tarleb/org-figure-with-no-name
Prefix even empty figure names with "fig:"
2016-02-20 21:44:39 -08:00
John MacFarlane
e369e60fb4 Merge pull request #2691 from tarleb/org-image-file-links
Org reader: Refactor link-target processing
2016-02-20 21:42:12 -08:00
John MacFarlane
1534052dd9 HTML reader: rewrote htmlInBalanced.
This version avoids an exponential performance problem with `<script>` tags,
and it should be faster in general.

Closes #2730.
2016-02-20 15:00:31 -08:00
John MacFarlane
b8dadc608a HTML reader: properly handle an empty cell in a simple table.
Closes #2718.
2016-02-16 11:05:51 -08:00
John MacFarlane
6cb4991f6b Markdown reader: Fixed bug with smart quotes around tex math.
Previously smart quotes were incorrect in the following:

    '$\neg(x \in x)$'.

(because of the following period).  This commit fixes the problem,
which was introduced by commit 4229cf2d92.
2016-02-04 12:09:26 -08:00
Jesse Rosenthal
2ee7752d14 Docx reader: Add a "Link" modifier to Reducible
We want to make sure that links have their spaces removed, and are
appropriately smushed together.

This closes #2689
2016-02-02 14:40:09 -05:00
Albert Krewinkel
92e6ae47f6 Org reader: Refactor link-target processing
Cleanup of the code for link target handling.  Most notably, the
canonicalization of a link is handled by a separate function.

This fixes #2684.
2016-01-31 23:23:09 +01:00
John MacFarlane
18745585c1 LaTeX reader: inlineCommand now gobbles an empty {} after any command.
This gives better results when people write e.g. `\TeX{}` in Markdown.

    \TeX{} and \LaTeX{}

now works as expected with `pandoc -f markdown -t latex`.

Closes #2687.
2016-01-31 10:52:46 -08:00
John MacFarlane
a02c26d9f4 HTML reader: handle multiple meta tags with same name.
Put them in a list in the metadata so they are all
preserved, rather than (as before) throwing out all
but one..
2016-01-29 11:51:01 -08:00
John MacFarlane
76983c31f2 Properly handle LaTeX "math" environment as inline math.
See #2171.
2016-01-29 10:11:45 -08:00