Previously we just passed all raw TeX through when MathJax
was used for HTML math. This passed through too much.
With this patch, only raw LaTeX environments that MathJax
can handle get passed through.
This patch also causes raw LaTeX environments to be treated
as math, when possible, with MathML and WebTeX output.
Closes#2758.
- `writerEmailObfuscation` in `defaultWriterOptions` is now
`NoObfuscation`
- the default for the command-line `--email-obfuscation` option is
now `none`.
Closes#2988.
Org mode allows arbitrary raw inlines ("export snippets" in Emacs
parlance) to be included as `@@format:raw foreign format text@@`.
Support for this features is added to the Org reader.
Org mode allows arbitrary raw inlines ("export snippets" in Emacs
parlance) to be included as `@@format:raw foreign format text@@`.
Support for this features is added to the Org writer.
A specification for an official Org-mode citation syntax was drafted by
Richard Lawrence and enhanced with the help of others on the orgmode
mailing list. Basic support for this citation style is added to the
reader.
This closes#1978.
Semicolons are used as special characters in citations syntax. This
ensures the correct parsing of Pandoc-style citations:
[prefix; @key; suffix]
Previously, parsing would have failed unless there was a space or other
special character as the last <prefix> character.
Parsing of emphasized text can be toggled using the `*` option. This
influences parsing of text marked as emphasized, strong, strikeout, and
underline. Parsing of inline math, code, and verbatim text is not
affected by this option.
The `OrgParserState` contained both an `orgStateMeta` and
`orgStateMeta'` field, the former for plain meta information and the
latter for F-monad wrapped meta info. The plain meta info is only used
to make `OrgParserState` an instance of the `HasMeta` class, which in
turn is never used in the reader. The (F Meta) version is hence renamed
to the "un-primed" version while the other one is dropped.
Some code was duplicated (copy-pasted) or placed in an inappropriate
module during the modularization refactoring. Those functions are moved
into a `Shared` module, as was originally intended but forgotten.
Better documentation of the respective functions is a positive
side-effect.
Org-mode version 9 usees a new syntax for export blocks. Instead of
`#+BEGIN_<FORMAT>`, where `<FORMAT>` is the format of the block's
content, the new format uses `#+BEGIN_export <FORMAT>` instead. Both
types are supported.
- Reorder functions, grouping related functions together.
- Demote simple functions to local functions if they are used just once.
- Rename and document functions to increase code readability.
- Fix handling of whitespace in blocks, allowing content to be indented
less then the block header.
Having a function starting with `parse` in a parsing library is overly
redundant. Let's use a nicer, shorter name more in line with the rest
of the library.
The *org-ref* package is an org-mode extension commonly used to manage
citations in org documents. Basic support for the `cite:citeKey` and
`[[cite:citeKey][prefix text::suffix text]]` syntax is added.
Inline parsing code is moved to a separate module. Parsers for block
starts are extracted as well, as those are used in the `endline` parser.
This is part of the Org-mode reader cleanup effort.
The Org-mode reader uses many functions defined in the
`Text.Pandoc.Parsing` utility module. Some of the functions are
overwritten with versions adapted to Org-mode idiosyncrasies. These
special functions, as well as the normal Pandoc versions, are combined
in a single module to increase the ease of use.
This leads to decoupling of Org-mode and Pandoc and hence to slightly
cleaner code. The downside is code-bloat due to repeated import/export
statements.
For the implementation of the Drawer element in the Org Writer, we make
use of a generic Block container with attributes. The presence of a
`drawer` class defines that the `Div` constructor is a drawer. The first
class defines the drawer name to use. The key-value list in the
attributes defines the keys to add inside the Drawer. Lastly, the list
of Block elements contains miscellaneous blocks elements to add inside
of the Drawer.
Signed-off-by: Albert Krewinkel <albert@zeitkraut.de>
The `ID` property is reserved for internal use by Org-mode and should
not be used. The `CUSTOM_ID` property is to be used instead, it is
converted to the `ID` property for certain export format.
The reader and writer erroneously used `ID`. This is corrected by using
`CUSTOM_ID` where appropriate.
This allows header attributes to be added to org documents in the form
of `:PROPERTIES:` drawers. All available attributes are stored as
key/value pairs. This reflects the way the org reader handles
`:PROPERTIES:` blocks.
This closes#1962.
Headers can have optional `:PROPERTIES:` drawers associated with them.
These drawers contain key/value pairs like the header's `id`. The
reader adds all listed pairs to the header's attributes; `id` and
`class` attributes are handled specially to match the way `Attr` are
defined.
This also changes behavior of how drawers of unknown type are handled.
Instead of including all unknown drawers, those are not read/exported,
thereby matching current Emacs behavior.
This closes#1877.
Arbitrary key-value pairs can be added to some block types using a
`#+ATTR_HTML` line before the block. Emacs Org-mode only includes these
when exporting to HTML, but since we cannot make this distinction here,
the attributes are always added.
The functionality is now supported for figures.
This closes#1906.
Additional state changes need to be made after a newline is parsed,
otherwise markup may not be recognized correctly.
This fixes a bug where markup after certain block-types would not be
recognized. E.g. `/emph/` in the following snippet was not parsed as
emphasized.
foo
# comment
/emph/
A parser state attribute was used to keep track of block attributes
defined in meta-lines. Global state is undesirable, so block attributes
are no longer saved as part of the parser state. Old functions and the
respective part of the parser state are removed.
Org-mode allows to specify export settings via `#+OPTIONS` lines.
Disabling simple sub- and superscripts is one of these export options,
this options is now supported.
The last fix for whitespace handling of inline LaTeX commands was
incorrect, preventing correct recognition of inline LaTeX commands which
contain spaces. This fix ensures that only trailing whitespace is cut
off.
The org-reader was droping space after unescaped LaTeX-style symbol
commands: `\ForAll \Auml` resulted in `∀Ä` but should give `∀ Ä`
instead. This seems to be because the LaTeX-reader treats the
command-terminating space as part of the command. Dropping the trailing
space from the symbol-command fixes this issue.
This fixes Org mode parsing of some corner cases regarding empty cells
and rows. Empty cells weren't parsed correctly, e.g. `|||` should be
two empty cells, but would be parsed as a single cell containing a pipe
character. Empty rows where parsed as alignment rows and dropped from
the output.
This fixes#2616.
Emacs Org-mode doesn't add any padding to table rows. The first
row (header or first body row) is used to determine the column count, no
other magic is performed.
The org reader was padding rows to the length of the longest table row.
This was done due to a misunderstanding of how Org handles tables. This
feature reflected how Org-mode handles tables when pressing <TAB>. The
Org exporter however, which is what the reader should implement, doesn't
do any of this. So this was a mis-feature that made the reader more
complex and reduced comparability. It was hence removed.
Previously, readDocx would error out if zip-archive failed. We change
the archive extraction step from `toArchive` to `toArchiveOrFail`, which
returns an Either value.
`moveTo` and `moveFrom` are track-changes tags that are used when a
block of text is moved in the document. We now recognize these tags and
treat them the same as `insert` and `delete`, respectively. So,
`--track-changes=accept` will show the moved version, while
`--track-changes=reject` will show the original version.
We are now more forgiving about parsing invalid HTML with
unescaped `&` as raw HTML. (Previously any unescaped `&`
would cause pandoc not to recognize the string as raw HTML.)
Closes#2410.
Partially addresses #2813.
This isn't perfect, because now the hypertarget is in the
wrong place -- when you link to the figure, the screen
is positioned with the caption at the top, and most of
the figure off screen.
So this needs a bit more tweaking.
This was a regression, with the rewrite of `htmlInBalanced`
(from `Text.Pandoc.Readers.HTML`) in 1.17.
It caused newlines to be omitted in raw HTML blocks.
Closes#2804.
Add a `\strut` after `\crlf` before space.
Closes#2744, #2745. Thanks to @c-foster.
This uses the fix suggested by @c-foster.
Mid-line spaces are still not supported, because of limitations
of the Markdown parser.
Some word functions -- especially graphics -- give various choices for
content so there can be backwards compatibility. This follows the
largely undocumented feature by working through the choices until we
find one that works.
Note that we had to split out the processing of child elems of runs into
a separate function so we can recurse properly. Any processing of an
element *within* a run (other than a plain run) should go into
`childElemToRun`.
Traditionally pandoc operates on multiple files by first concetenating
them (around extra line breaks) and then processing the joined file. So
it only parses a multi-file document at the document scope. This has the
benefit that footnotes and links can be in different files, but it also
introduces a couple of difficulties:
- it is difficult to join files with footnotes without some sort of
preprocessing, which makes it difficult to write academic documents
in small pieces.
- it makes it impossible to process multiple binary input files, which
can't be catted.
- it makes it impossible to process files from different input
formats.
This commit introduces alternative method. Instead of catting the files
first, it parses the files first, and then combines the parsed
output. This makes it impossible to have links across multiple files,
and auto-identified headers won't work correctly if headers in multiple
files have the same name. On the other hand, footnotes across multiple
files will work correctly and will allow more freedom for input formats.
Since ByteStringReaders can currently only read one binary file, and
will ignore subsequent files, we also changes the behavior to
automatically parse before combining if using the ByteStringReader. If
we use one file, it will work as normal. If there is more than one file
it will combine them after parsing (assuming that the format is the
same).
Note that this is intended to be an optional method, defaulting to
off. Turn it on with `--file-scope`.
In order to be able to collect warnings during parsing, we add a state
monad transformer to the D monad. At the moment, this only includes a
list of warning strings (nothing currently triggers them, however). We
use StateT instead of WriterT to correspond more closely with the
warnings behavior in T.P.Parsing.
The docx reader used to use a Modifiable typeclass to combine both
Blocks and Inlines. But all the work was in the inlines. So most of the
generality was wasted, at the expense of making the code harder to
understand. This gets rid of the generality, and adds functions for
Blocks and Inlines. It should be a bit easier to work with going forward.
We currently treat all memoir templates as books. This means that pandoc
will infer the `--chapters` argument, even if the `article` iption is
set for memoir.
This commit makes pandoc treats the document as an article if there is
an article option (i.e., `\documentclass[12pt,article]{memoir}`).
Note that this refactors out the parsec parsers for document class and
options, to make it a little clearer what's going on.
Previously smart quotes were incorrect in the following:
'$\neg(x \in x)$'.
(because of the following period). This commit fixes the problem,
which was introduced by commit 4229cf2d92.
This gives better results when people write e.g. `\TeX{}` in Markdown.
\TeX{} and \LaTeX{}
now works as expected with `pandoc -f markdown -t latex`.
Closes#2687.
Change types of divs.
From Docbook "sect#" and "simplesect" to "level#" and
"section."
Add tests.
Add mention of TEI to README.
Small changes to TEI writer.
The convention used by pandoc for figures is to mark them by prefixing
the name with "fig:". The org reader failed to do this if a figure had
no name. The test for this was broken as well.
This fixes#2643.
- Text.Pandoc.XML.fromEntities: handle entities without a
semicolon. Always lookup character references with the
trailing ';', even if it wasn't present. And never add
it when looking up numerical entities. (This is what
tagsoup seems to require.)
- Text.Pandoc.Parsing.characterReference: Always lookup
character references with the trailing ';', and leave off
the ';' when looking up numerical entities.
This fixes a regression for e.g. `⟨`.
Continue scanning for comment subtrees beyond only the first block.
Note to self: when writing an recursive function, don't forget to, you
know, actually recurse.
Shout to @mrvdb for noticing this.
This fixes#2628.
The reader previously did allow this, following redcloth,
which happily parses
Html blocks can be <div>inlined</div> as well.
as
<p>Html blocks can be <div>inlined</div> as well.</p>
This is invalid HTML, and this kind of thing can lead
to parsing problems (stack overflows) as well. So this
commit undoes this behavior. The above sample now produces;
<p>Html blocks can be</p>
<div>
<p>inlined</p>
</div>
<p>as well.</p>
+ Start cell on new line unless it's a single Para or Plain.
+ For single Para or Plain, insert a space after the `|` to
avoid problems when the text begins with a character like
`-`.
Closes#2604, closes#2606.
If `geometry` has no value, but `margin-left`, `margin-right`,
`margin-top`, and/or `-margin-bottom` are given, a default value
for `geometry` is created from these.
Note that these variables already affect PDF production via HTML5
with wkhtmltopdf.
Variables margin-top, margin-bottom, margin-left, margin-right.
Setting them with css inside @page doesn't seem to work, at least
with the released wkhtmltopdf.