This fixes Org mode parsing of some corner cases regarding empty cells
and rows. Empty cells weren't parsed correctly, e.g. `|||` should be
two empty cells, but would be parsed as a single cell containing a pipe
character. Empty rows where parsed as alignment rows and dropped from
the output.
This fixes#2616.
Emacs Org-mode doesn't add any padding to table rows. The first
row (header or first body row) is used to determine the column count, no
other magic is performed.
The org reader was padding rows to the length of the longest table row.
This was done due to a misunderstanding of how Org handles tables. This
feature reflected how Org-mode handles tables when pressing <TAB>. The
Org exporter however, which is what the reader should implement, doesn't
do any of this. So this was a mis-feature that made the reader more
complex and reduced comparability. It was hence removed.
Previously, readDocx would error out if zip-archive failed. We change
the archive extraction step from `toArchive` to `toArchiveOrFail`, which
returns an Either value.
`moveTo` and `moveFrom` are track-changes tags that are used when a
block of text is moved in the document. We now recognize these tags and
treat them the same as `insert` and `delete`, respectively. So,
`--track-changes=accept` will show the moved version, while
`--track-changes=reject` will show the original version.
We are now more forgiving about parsing invalid HTML with
unescaped `&` as raw HTML. (Previously any unescaped `&`
would cause pandoc not to recognize the string as raw HTML.)
Closes#2410.
Partially addresses #2813.
This isn't perfect, because now the hypertarget is in the
wrong place -- when you link to the figure, the screen
is positioned with the caption at the top, and most of
the figure off screen.
So this needs a bit more tweaking.
This was a regression, with the rewrite of `htmlInBalanced`
(from `Text.Pandoc.Readers.HTML`) in 1.17.
It caused newlines to be omitted in raw HTML blocks.
Closes#2804.
Add a `\strut` after `\crlf` before space.
Closes#2744, #2745. Thanks to @c-foster.
This uses the fix suggested by @c-foster.
Mid-line spaces are still not supported, because of limitations
of the Markdown parser.
Some word functions -- especially graphics -- give various choices for
content so there can be backwards compatibility. This follows the
largely undocumented feature by working through the choices until we
find one that works.
Note that we had to split out the processing of child elems of runs into
a separate function so we can recurse properly. Any processing of an
element *within* a run (other than a plain run) should go into
`childElemToRun`.
Traditionally pandoc operates on multiple files by first concetenating
them (around extra line breaks) and then processing the joined file. So
it only parses a multi-file document at the document scope. This has the
benefit that footnotes and links can be in different files, but it also
introduces a couple of difficulties:
- it is difficult to join files with footnotes without some sort of
preprocessing, which makes it difficult to write academic documents
in small pieces.
- it makes it impossible to process multiple binary input files, which
can't be catted.
- it makes it impossible to process files from different input
formats.
This commit introduces alternative method. Instead of catting the files
first, it parses the files first, and then combines the parsed
output. This makes it impossible to have links across multiple files,
and auto-identified headers won't work correctly if headers in multiple
files have the same name. On the other hand, footnotes across multiple
files will work correctly and will allow more freedom for input formats.
Since ByteStringReaders can currently only read one binary file, and
will ignore subsequent files, we also changes the behavior to
automatically parse before combining if using the ByteStringReader. If
we use one file, it will work as normal. If there is more than one file
it will combine them after parsing (assuming that the format is the
same).
Note that this is intended to be an optional method, defaulting to
off. Turn it on with `--file-scope`.
In order to be able to collect warnings during parsing, we add a state
monad transformer to the D monad. At the moment, this only includes a
list of warning strings (nothing currently triggers them, however). We
use StateT instead of WriterT to correspond more closely with the
warnings behavior in T.P.Parsing.
The docx reader used to use a Modifiable typeclass to combine both
Blocks and Inlines. But all the work was in the inlines. So most of the
generality was wasted, at the expense of making the code harder to
understand. This gets rid of the generality, and adds functions for
Blocks and Inlines. It should be a bit easier to work with going forward.
We currently treat all memoir templates as books. This means that pandoc
will infer the `--chapters` argument, even if the `article` iption is
set for memoir.
This commit makes pandoc treats the document as an article if there is
an article option (i.e., `\documentclass[12pt,article]{memoir}`).
Note that this refactors out the parsec parsers for document class and
options, to make it a little clearer what's going on.
Previously smart quotes were incorrect in the following:
'$\neg(x \in x)$'.
(because of the following period). This commit fixes the problem,
which was introduced by commit 4229cf2d92.
This gives better results when people write e.g. `\TeX{}` in Markdown.
\TeX{} and \LaTeX{}
now works as expected with `pandoc -f markdown -t latex`.
Closes#2687.
Change types of divs.
From Docbook "sect#" and "simplesect" to "level#" and
"section."
Add tests.
Add mention of TEI to README.
Small changes to TEI writer.
The convention used by pandoc for figures is to mark them by prefixing
the name with "fig:". The org reader failed to do this if a figure had
no name. The test for this was broken as well.
This fixes#2643.
- Text.Pandoc.XML.fromEntities: handle entities without a
semicolon. Always lookup character references with the
trailing ';', even if it wasn't present. And never add
it when looking up numerical entities. (This is what
tagsoup seems to require.)
- Text.Pandoc.Parsing.characterReference: Always lookup
character references with the trailing ';', and leave off
the ';' when looking up numerical entities.
This fixes a regression for e.g. `⟨`.
Continue scanning for comment subtrees beyond only the first block.
Note to self: when writing an recursive function, don't forget to, you
know, actually recurse.
Shout to @mrvdb for noticing this.
This fixes#2628.
The reader previously did allow this, following redcloth,
which happily parses
Html blocks can be <div>inlined</div> as well.
as
<p>Html blocks can be <div>inlined</div> as well.</p>
This is invalid HTML, and this kind of thing can lead
to parsing problems (stack overflows) as well. So this
commit undoes this behavior. The above sample now produces;
<p>Html blocks can be</p>
<div>
<p>inlined</p>
</div>
<p>as well.</p>
+ Start cell on new line unless it's a single Para or Plain.
+ For single Para or Plain, insert a space after the `|` to
avoid problems when the text begins with a character like
`-`.
Closes#2604, closes#2606.
If `geometry` has no value, but `margin-left`, `margin-right`,
`margin-top`, and/or `-margin-bottom` are given, a default value
for `geometry` is created from these.
Note that these variables already affect PDF production via HTML5
with wkhtmltopdf.
Variables margin-top, margin-bottom, margin-left, margin-right.
Setting them with css inside @page doesn't seem to work, at least
with the released wkhtmltopdf.
* Added `thanks` variable
* Use `parskip.sty` when `indent` isn't set (fall
back to using `setlength` as before if `parskip.sty`
isn't available).
* Use `biblio-style` with biblatex.
* Added `biblatexoptions` variable.
* Added `section-titles` variable (defaults to true)
to enable/suppress section title pages in beamer
slide shows.
* Moved beamer themes after fonts, so that themes can
change fonts. (Previously the fonts set were being
clobbered by lmodern.sty.)
`file:filename` rather than `file://./filename`.
I think this is right; it matches what we had before
with people actually using the ICML writer, and seems
to match examples in the spec. I don't
have a copy of InDesign I can test on, though.
@DigitalPublishingToolkit and @mb21, can you have
a look?
This partially addresses jgm/pandoc-citeproc#143.
It does not use the native asciidoc syntax for citations,
but it does get the links to individual citations working.
Text.Pandoc.Options: Added `Ext_east_asian_line_breaks` constructor to
`Extension` (API change).
This extension is like `ignore_line_breaks`, but smarter -- it
only ignores line breaks between two East Asian wide characters.
This makes it better suited for writing with a mix of East Asian
and non-East Asian scripts.
Closes#2586.
Previously pipe table columns got relative widths (based
on the header underscore lines) when the source of one of the rows was
greater in width than the column width. This gave bad results in some
cases where much of the width of the row was due to nonprinting
material (e.g. link URLs). Now pandoc only looks at printable
width (the width of a plain string version of the source), which
should give better results.
Thanks to John Muccigrosso for bringing up the issue.
Previously we tried to get the image size from the image even
if an explicit size was specified. Since we still can't get
image size for PDFs, this made it impossible to use PDF images
in docx.
Now we don't try to get the image size when a size is already
explicitly specified.
This contains a JSON version of all the metadata, in the
format selected for the writer.
So, for example, to get just the YAML metadata, you can
run pandoc with the following custom template:
$meta-json$
Closes#2019. The intent is to make it easier for static
site generators and other tools to get at the metadata.
Change 5527465c introduced a `DummyListItem` type in Docx/Parse.hs. In
retrospect, this seems like it mixes parsing and iterpretation
excessively. What's *really* going on is that we have a list item
without and associate level or numeric info. We can decide what to do
what that in Docx.hs (treat it like a list paragraph), but the parser
shouldn't make that decision.
This commit makes what is going on a bit more explicit. `LevelInfo` is
now a Maybe value in the `ListItem` type. If it's a Nothing, we treat
it as a ListParagraph. If it's a Just, it's a normal list item.
* Old `link_attributes` -> `mmd_link_attributes`
* Recently added `common_link_attributes` -> `link_attributes`
Note: this change could break some existing workflows.
* Bumped version to 1.16.
* Added Attr field to Link and Image.
* Added `common_link_attributes` extension.
* Updated readers for link attributes.
* Updated writers for link attributes.
* Updated tests
* Updated stack.yaml to build against unreleased versions of
pandoc-types and texmath.
* Fixed various compiler warnings.
Closes#261.
TODO:
* Relative (percentage) image widths in docx writer.
* ODT/OpenDocument writer (untested, same issue about percentage widths).
* Update pandoc-citeproc.
This change makes `--no-tex-ligatures` affect the LaTeX reader
as well as the LaTeX and ConTeXt writers. If it is used,
the LaTeX reader will parse characters `` ` ``, `'`, and `-`
literally, rather than parsing ligatures for quotation marks
and dashes. And the LaTeX writer will print unicode quotation
mark and dash characters literally, rather than converting
them to the standard ASCII ligatures.
Note that `--smart` has no affect on the LaTeX reader.
`--smart` is still the default for all input formats when
LaTeX or ConTeXt is the output format, *unless* `--no-tex-ligatures`
is used.
Some examples to illustrate the logic:
```
% echo "'hi'" | pandoc -t latex
`hi'
% echo "'hi'" | pandoc -t latex --no-tex-ligatures
'hi'
% echo "'hi'" | pandoc -t latex --no-tex-ligatures --smart
‘hi’
% echo "'hi'" | pandoc -f latex --no-tex-ligatures
<p>'hi'</p>
% echo "'hi'" | pandoc -f latex
<p>’hi’</p>
```
Closes#2541.
These come up when people create a list item and then delete the
bullet. It doesn't refer to any real list item, and we used to ignore
it.
We handle it with a DummyListItem type, which, in Docx.hs, is turned
into a normal paragraph with a "ListParagraph" class. If it follow
another list item, it is folded as another paragraph into that item. If
it doesn't, it's just its own (usually indented, and therefore
block-quoted) paragraph.
We don't have a place yet for styles or sizes on images, but
we can skip the attributes rather than incorrectly taking them
to be part of the filename.
Closes#2515.
Automatic styles can now be inserted in the template,
since the template, not the writer, now provides the
enclosing `<office:automatic-styles>` tags.
Closes#2520.
Previously, when using headers below the slide level, pauses are left
uninterpreted into pauses. In my opinion, unexpected behavior but
intentional looking at the code.
Fixes#2530
There are separate relationship (link) files for foot and
endnotes. These had previously been grouped together which led to
links not working correctly in notes. This should finally fix that.
Definition list markers (i.e. double colons `::`) must be surrounded by
whitespace to start a definition item. This rule was not checked
before, resulting in bugs with footnotes and some link types.
Thanks to @conklech for noticing and reporting this issue.
This fixes#2518.
Smart quotes, ellipses, and dashes should behave like normal quotes,
single dashes, and dots with respect to text markup parsing. The parser
state was not updated properly in all cases, which has been fixed.
Thanks to @conklech for reporting this issue.
This fixes#2513.
By default pandoc downloads all linked media and includes it in the
EPUB container. This can be disabled by setting `data-external`
on the tags linking to media that should not be downloaded.
Example:
<audio controls="1">
<source src="http://www.sixbarsjail.it/tmp/bach_toccata.mp3"
type="audio/mpeg"></source>
</audio>
Closes#2473.
Don't use custom prelude for latest ghc.
This is a better approach to making 'stack ghci' and 'cabal repl'
work. Instead of using NoImplicitPrelude, we only use the custom
prelude for older ghc versions. The custom prelude presents a
uniform API that matches the current base version's prelude.
So, when developing (presumably with latest ghc), we don't
use a custom prelude at all and hence have no trouble with ghci.
The custom prelude no longer exports (<>): we now want to
match the base 4.8 prelude behavior.
Markup as the very first item in a header wasn't recognized. This was
caused by an incorrect parser state: positions at which inline markup
can start need to be marked explicitly by changing the parser state.
This wasn't done for headers. The proper function to update the state
is now called at the beginning of the header parser, fixing this issue.
This fixes#2504.
Footnotes aren't allowed in the list of figures. This
patch causes footnotes to be stripped from captions when
entered into the list of figures.
Footnotes still don't actually WORK in captions in latex/pdf,
but at least an error is no longer raised.
See #1506.
If a pipe table contains a line longer than the column
width (as set by `--columns` or 80 by default), relative
widths are computed based on the widths of the separator lines
relative to the column width.
This should solve persistent problems with long pipe tables in
LaTeX/PDF output, and give more flexibility for determining
relative column widths in other formats, too.
For narrower pipe tables, column widths of 0 are used,
telling pandoc not to specify widths explicitly in output
formats that permit this.
Closes#2471.
Closes#2480.
Note that although smart punctuation is part of the textile
spec, it's not always wanted when converting from textile
to, say, Markdown. So it seems better to make this an option.
We now allow blank metadata fields. These were explicitly
disallowed before.
For background see #2026. The issue in #2026 has since
been fixed in another way, so there is no need to forbid
blank metadata fields.
Org-mode allows to skip the argument of a code block header argument if
it's toggling a value. Argument-less headers are now recognized,
avoiding weird parsing errors.
The fixes are not exactly pretty, but neither is the code that was
fixed. So I guess it's about par for the course. However, a rewrite of
the header parsing code wouldn't hurt in the long run.
Thanks to @jo-tham for filing the bug report.
This fixes#2269.
Paragraphs can be followed by lists, even if there is no blank line
between the two blocks. However, this should only be true if the
paragraph is not within a list, were the preceding block should be
parsed as a plain instead of paragraph (to allow for compact lists).
Thanks to @rgaiacs for bringing this up.
This fixes#2464.
Tightened up the inline HTML parser so it disallows
TagWarnings.
This only affects the markdown reader when the `markdown_in_html_blocks`
option is disabled.
Closes#2469.