Word uses, by default, footnotes with id -1 and 0 for separators. If a
user modifies reference.docx, they will end up with a settings.xml file
that references these footnotes, but no such footnotes in the
document. This will produce a corruption error. Here we add these to the
document and settings.xml file, so future modifications won't break the file.
We apply a "BodyText" style to all unstyled paragraphs. This is,
essentially, the same as "Normal" up until now -- except that since not
everything inherits from "BodyText" (the metadata won't, for example, or
the headers or footnote numbers) we can change the text in the body
without having to make exceptions for everything.
This will still inherit from Normal, so if we want to
change *everything*, we can do it through "Normal".
Previously we used the `FirstParagraph` style after Headings, BlockQuotes,
and other blocks a user might not want an indentation after. We hadn't
actually used it for the first paragraph -- i.e. the opening of the
body. This makes sure the first body paragraph gets that style.
Following the odt writer, we make the first text paragraph following an
image, blockquote, table, or heading into a "FirstParagraph" style. This
allows it to be styled differently, if the user wishes. The default is
for it to be the same as "Normal".
The preferred syntax for images and other media in MediaWiki has been
[[File:Foo.jpg]] since v1.14 (2008). [[Image:Foo.jpg]] is deprecated but
still works as an alias to the File namespace. I don't think this will
break any existing wikis: talk of switching the syntax/namespace for
images started back in 2002 (https://phabricator.wikimedia.org/T2044),
and NS_FILE became the new namespace for files in v1.14 in late 2008
(https://www.mediawiki.org/wiki/Release_notes/1.14). Because of the
namespace alias, '[[Image:]]' still works today. MediaWiki supports
other media as well, so the name and syntax used in the documentation
(see https://www.mediawiki.org/wiki/Help:Images) has long been
'[[File:foo.jpg]]'.
This change improves output formatting of content with a large number of forced line breaks, such as line blocks. The following writers are affected:
* Dokuwiki
* HTML
* EPUB (via HTML)
* LaTeX
* MediaWiki
* OpenDocument
* Texinfo
This commit resolves #1924.
Previously `\input` and `\include` would only work if the
included files had the extension `.tex`. This change relaxes
that restriction, though if the extension is not `.tex`, it
must be given explicitly in the `\input` or `\include`.
Closes #1882.
Some older versions of Word use VML (Vector Markup Language) and put
their images in a "v:imagedata" tag inside a "w:pict". We read those as
we read the more modern "blip" inside a "w:drawing".
Note that this does not mean the reader knows anything about VML. It
just looks for a `v:imagedata`. It's possible that, with more complicated
uses of images in VML, it won't do the right thing.
This change allows pandoc not to choke on the table-width parameter
of `tabular*`. Note that the table width is not actually parsed
or taken into account, but this should give tolerable results in
many cases.
Closes #1850.
Org links like `[[file:target][title]]` were not handled correctly,
parsing the link target verbatim. The org reader is changed such that
the leading `file:` is dropped from the link target.
This is related to issues #756 and #1812.
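The prefix handling described above can be sketched as follows. This is a
minimal illustration, not the org reader's actual code; the name
`cleanLinkTarget` is hypothetical.

```hs
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- | Drop a leading "file:" from an Org link target and leave every
-- other target untouched. 'cleanLinkTarget' is a hypothetical name
-- used only for this sketch.
cleanLinkTarget :: String -> String
cleanLinkTarget target = fromMaybe target (stripPrefix "file:" target)
```

With this, `[[file:target][title]]` would yield the link target `target`,
while targets without the prefix pass through unchanged.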
Move recursive role lookup from renderRole to addNewRole. The Attr value
will be the same for every occurrence of this role, so there's no reason
to compute it every time. This allows simplifying the
stateRstCustomRoles map considerably.
We could go even further, and remove the fmt and attr arguments to
renderRole, which are null except for custom roles.
- Add "sourceCode" to classes for :code: role, and anything inheriting
from it.
- Add the name of the custom role to classes if the Inline constructor
supports Attr.
- If the custom role directive does not specify a parent role, inherit
from the :span: role.
This differs somewhat from the rst2xml.py behavior. If a custom role
inherits from another custom role, Pandoc will attach both roles' names
as classes. rst2xml.py will only use the class of the directly invoked
role (though in the case of inheriting from a :code: role with a
:language: defined, it will also provide the inherited language as a
class).
The code role should have the "code" class.
http://docutils.sourceforge.net/docs/ref/rst/roles.html says that
:literal:`text` is the same as ``text``. docutils outputs a <literal>
element in both cases, whereas for the code role, it outputs a <literal>
element with the "code" class.
This commit moves some code which was only used for the Markdown Reader
into a generic form which can be used for any Reader. Otherwise, it
takes naming and interface cues from the preexisting Markdown code.
Word doesn't really treat table captions as something special. A caption is just a paragraph with a special style, nothing more, so simply reversing the output order in the writer works fine.
The class directive accepts one or more class names, and creates a Div
value with those classes. If the directive has an indented body, the
body is parsed as the children of the Div. If not, the first block
following the directive is made a child of the Div.
This differs from the behavior of rst2xml, which does not create a Div
element. Instead, the specified classes are applied to each child of
the directive. However, most Pandoc Block constructors do not take an
Attr argument, so we can't duplicate this behavior.
Closes #65.
RST quoted literal blocks are the same as indented literal blocks (which
pandoc already supports) except that the quote character is preserved in
each line.
This includes test cases for the quoted literal block, as well as
additional tests for line blocks and indented literal blocks, to verify
that these are unaffected by the changes.
Now we do as before, including blank lines after list items in
loose lists (even though RST doesn't care -- this is just a matter
of visual appeal). But we chomp any excess whitespace after the
last list item, which solves #1777.
While empty links are not allowed in Emacs org-mode, Pandoc org-mode
should support them: gitit relies on empty links as they are used to
create wiki links.
Fixes jgm/gitit#471.
The org reader was too restrictive when parsing links: some relative
links and links to files given as absolute paths were not recognized
correctly. The org reader's link parsing function was amended to handle
such cases properly.
This fixes #1741.
This patch builds a tree of paragraph styles, then checks whether a
paragraph's style.styleId or style/name.val matches predetermined patterns.
It works with "Heading#" (name.val="heading #") for headings and
"Quote"|"BlockQuote"|"BlockQuotation" (name.val="Quote"|"Block Text")
for block quotes.
Document trees under a header starting with the word `COMMENT` are
comment trees and should not be exported. Those trees are dropped
silently.
This closes #1678.
Things like `/hello,/` or `/hi'/` were falsely recognized as emphasized
strings. This is wrong, as `,` and `'` are forbidden border chars and
may not occur on the inner border of emphasized text. This patch
makes the reader match the reference implementation in that it
reads the above strings as plain text.
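A rough sketch of the border check, for illustration only. The names
`forbiddenBorder` and `validEmphasisBody` are hypothetical, and treating
whitespace as a forbidden border char is an assumption of this sketch;
the message above only names `,` and `'`.

```hs
import Data.Char (isSpace)

-- | Characters that may not sit directly inside the emphasis markers.
-- ',' and '\'' come from the description above; whitespace is an
-- assumption added for this sketch.
forbiddenBorder :: Char -> Bool
forbiddenBorder c = c `elem` ",'" || isSpace c

-- | Check that the text between two markers starts and ends with an
-- allowed character. The empty string is rejected.
validEmphasisBody :: String -> Bool
validEmphasisBody [] = False
validEmphasisBody body@(first : _) =
  not (forbiddenBorder first || forbiddenBorder (last body))
```

Under these assumptions the bodies of `/hello,/` and `/hi'/` are rejected,
so both would be read as plain text.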
Fixes issue with top-level bullet list parsing.
Previously we would use `many1 spaceChars` rather than respecting
the list's indent level. We also permitted `*` bullets on unindented
lists, which should unambiguously parse as `header 1`.
Combined, this meant headers at a different indent level were
being unwittingly slurped into preceding bullet lists, as per
Issue #1650.
Now we outsource most of the work to `fetchItem'`.
Also, do not include queries in file extensions.
Improves fix to #1671.
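To illustrate the point about queries: a minimal sketch, assuming the
extension is taken with `takeExtension` from the filepath package. The
helper name `extensionWithoutQuery` is hypothetical and is not the code
used in pandoc.

```hs
import System.FilePath (takeExtension)

-- | Extension of a URL or path, ignoring anything from the first '?'
-- onwards. 'extensionWithoutQuery' is a hypothetical helper.
extensionWithoutQuery :: String -> String
extensionWithoutQuery = takeExtension . takeWhile (/= '?')
```

Without the `takeWhile`, a URL such as `http://example.com/a/foo.png?size=1.5`
would give `.5` as its extension; with it, the result is `.png`.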
It is possible that this will have some unexpected effects, so
further testing would be good.
Closes #1669.
If there are further issues, please open a new, targeted issue on the
tracker. Some notes on the further issues you gestured at:
Data URIs are indeed dereferenced, but why is this a problem?
(The function being used to fetch from URLs is used for many different
formats. Preserving data URIs would make sense in EPUBs, but not
for e.g. PDF output. And by dereferencing we can get a smaller,
more efficient EPUB, with the data stored as bytes in a file rather
than encoded in textual representation.)
"absolute uris are not recognized" -- I assume that is the problem
just fixed. If not, please open a new issue.
"relative uris are resolved (wrongly) like file paths" -- can you
give an example?
`<base>` tag is ignored. Yes. I didn't know about the base tag. Could
you open a new issue just for this?
This function can be used to sanitize reference labels so that
they do not contain any of the illegal characters \#[]",{}%()|= .
Currently only Links have their labels sanitized, because they
are the only Elements that use passed labels.
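One way such a function could look, as a sketch only. It assumes illegal
characters are simply dropped (the real function may escape or replace
them instead), and it covers only the punctuation listed above. The names
are hypothetical.

```hs
-- | The characters named in the description above. Whether a space
-- also counts as illegal is left open in this sketch.
illegalLabelChars :: String
illegalLabelChars = "\\#[]\",{}%()|="

-- | Remove every illegal character from a reference label.
-- Dropping (rather than escaping) is an assumption of this sketch.
sanitizeLabel :: String -> String
sanitizeLabel = filter (`notElem` illegalLabelChars)
```

For example, `sanitizeLabel "sec:foo{bar}#1"` gives `"sec:foobar1"`.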
We previously took the old relationship names of the headers and footers in
sectPr. That led to collisions. We now make a map of available names in the
relationships file, and then rename in sectPr.
Graphics in `\section`/`\subsection` etc titles need to be `\protect`ed.
This adds a state value and manually turns it on before every invocation
of `sectionHeader` and manually turns it off after. Using a writer value
and applying `local` would probably be cleaner, but this fits with the
current style.
When we encounter one of the polyglot header styles, we want to remove
that from the par styles after we convert to a header. To do that, we
have to keep track of the style name, and remove it appropriately.
We're just keeping a list of header formats that different languages use
as their default styles. At the moment, we have English, German, Danish,
and French. We can continue to add to this.
This is simpler than parsing the styles file, and perhaps less
error-prone, since there seems to be some variations, even within a
language, of how a style file will define headers.
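The list-based approach can be sketched like this. The prefixes shown are
illustrative examples only, not the list used in the reader, and
`headerPrefixes` and `headerLevel` are hypothetical names.

```hs
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- | Example style-name prefixes (illustrative; the real list covers
-- more languages).
headerPrefixes :: [String]
headerPrefixes = ["Heading", "Titre"]

-- | If a paragraph style name is a known prefix followed by digits,
-- return the header level.
headerLevel :: String -> Maybe Int
headerLevel styleName = listToMaybe (mapMaybe level headerPrefixes)
  where
    level prefix = case stripPrefix prefix styleName of
      Just digits
        | not (null digits), all isDigit digits -> Just (read digits)
      _ -> Nothing
```

Supporting another language then only means adding one more prefix to the
list.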
When users number their headers, Word understands that as a single item
enumerated list. We make the assumption that such a list is, in fact, a header.
Currently, pandoc has hard-coded the following in order to make tight lists in
LaTeX:
```hs
text "\\itemsep1pt\\parskip0pt\\parsep0pt"
```
This works, but it does not allow customization. For example, the `memoir`
class already has a `\tightlist` declaration for this purpose:
```tex
\newcommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
```
I'm proposing to use a similar solution:
```diff
@@ In Writers/LaTeX.hs:
-then text "\\itemsep1pt\\parskip0pt\\parsep0pt"
+then text "\\tightlist"
@@ In templates/default.latex:
+\newcommand{\tightlist}{%
+ \setlength{\itemsep}{1pt}\setlength{\parskip}{0pt}\setlength{\parsep}{0pt}}
```
This allows us to customize the tightness to our needs.
Backward Compatibility
If a person is using a custom LaTeX template (not based upon the `memoir`
class), the `\tightlist` declaration must be added.
Because of the built-in line skip, LaTeX can't handle a section header
as the first element in a list item. (To be precise, it can't handle it
if the list immediately follows a section header, but the instance is
rare enough that we can afford to be a bit more general). This puts a
non-breaking space before the header to solve this problem. We won't see
this space, since the header skips a line before printing anyway.
The output is ugly in LaTeX and this structure seems like it should
probably be avoided. But it is valid HTML and native pandoc, so we
should have some sort of typesettable representation in LaTeX.
Previously text that ended a div would be parsed as Plain
unless there was a blank line before the closing div tag.
Test case:
<div class="first">
This is a paragraph.
This is another paragraph.
</div>
Closes #1591.
We can now handle all different alignment types, for simple
tables only (no captions, no relative widths, cell contents just
plain inlines). Other tables are still handled using raw HTML.
Addresses #1585 as far as it can be addressed, I believe.
This makes the docx reader's native output fit with the way the markdown
reader understands its markdown output. I.e., as far as table cells go:
docx -> native == docx -> native -> markdown -> native
(This identity isn't true for other things outside of table cells, of
course).
Currently, pandoc has hard-coded the following in order to make horizontal
rules in LaTeX:
```hs
"\\begin{center}\\rule{3in}{0.4pt}\\end{center}"
```
This works, but it does not allow customization. It also does not take into
consideration the current line width.
I'm proposing this change:
```diff
@@ In Writers/LaTeX.hs:
-"\\begin{center}\\rule{3in}{0.4pt}\\end{center}"
+"\\begin{center}\\rule{0.5\\linewidth}{\\linethickness}\\end{center}"
```
Also, if page-progression-direction is not specified in metadata,
don't include the attribute even in EPUB3; not including it is
the same as including it with the value "default", as we did before.
Closes #1550.
Previously a section like this would be enclosed in a paragraph,
with RawInline for the video tags (since video is a tag that can
be either block or inline):
<video controls="controls">
<source src="../videos/test.mp4" type="video/mp4" />
<source src="../videos/test.webm" type="video/webm" />
<p>
The videos can not be played back on your system.<br/>
Try viewing on Youtube (requires Internet connection):
<a href="http://youtu.be/etE5urBps_w">Relative Velocity on
Youtube</a>.
</p>
</video>
This change will cause the video and source tags to be parsed
as RawBlock instead, giving better output.
The general change is this: when we're parsing a "plain" sequence
of inlines, we don't parse anything that COULD be a block-level tag.
We always favor an explicit positive or negative in a style in a
descendant, and only turn to the ancestor if nothing is set.
We also introduce an (empty) list of styles that are black-listed. We
won't check them. (Think underlines in hyperlinks).
Two points here: (1) We're going bottom-up, from styles not based on
anything, to avoid circular dependencies or any other sort of
maliciousness/incompetence. And (2) each style points to its
parent. That way, we don't need the whole tree to pass a style over to
Docx.hs.
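The lookup rule (explicit setting on the descendant first, ancestor only
if nothing is set) can be sketched as below. The `Style` record and
`resolve` are hypothetical and much simpler than the real types; the
sketch assumes each style carries a pointer to its parent.

```hs
import Control.Applicative ((<|>))

-- | A toy style: an optional explicit bold setting and an optional
-- parent. Hypothetical type for this sketch only.
data Style = Style
  { styleName   :: String
  , styleBold   :: Maybe Bool
  , styleParent :: Maybe Style
  }

-- | Take the style's own explicit value (on or off) if present,
-- otherwise ask the parent, and so on up the chain.
resolve :: (Style -> Maybe Bool) -> Style -> Maybe Bool
resolve field style = field style <|> (styleParent style >>= resolve field)
```

An explicit `Just False` on a child wins over `Just True` on its parent,
which is why the field is a `Maybe Bool` rather than a plain `Bool`.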
* Create a type synonym for MIME type (instead of `String`).
* Add `getMimeTypeDef` function.
* Avoid recreating MIME type `Map`s every time.
* Move “Formula-...” case handling into `getMimeType`.
We want to be able to read user-defined styles. Eventually we'll be able
to figure out styles in terms of inheritance as well. The actual
cascading will happen in the docx reader.
In docx, super- and subscript are attributes of Vertalign. It makes more
sense to follow this, and have different possible values of Vertalign in
runStyle. This is mainly a preparatory step for real style parsing,
since it can distinguish between vertical align being explicitly turned
off and it not being set.
In addition, it makes parsing a bit clearer, and makes sure we don't do
docx-impossible things like being simultaneously super and sub.
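A small sketch of the idea, with hypothetical names. The argument stands
for the `w:val` attribute of a `w:vertAlign` element, if the element is
present; mapping any unrecognized value to the baseline is an assumption
of this sketch.

```hs
-- | The possible vertical alignments of a run. Super- and subscript
-- are mutually exclusive by construction.
data VertAlign = BaseLn | SupScrpt | SubScrpt
  deriving (Eq, Show)

-- | 'Nothing' means the property is not set at all; 'Just BaseLn'
-- means it was explicitly turned off.
parseVertAlign :: Maybe String -> Maybe VertAlign
parseVertAlign (Just "superscript") = Just SupScrpt
parseVertAlign (Just "subscript")   = Just SubScrpt
parseVertAlign (Just _)             = Just BaseLn
parseVertAlign Nothing              = Nothing
```

The `Maybe` wrapper is what lets later style resolution tell "explicitly
baseline" apart from "not specified".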
Functions like runElemsToInlines and parPartsToInlines are just defined
in terms of concatenating and mapping their singular
version (e.g. `runElemToInlines`). Having two functions with almost
identical names makes it easier to introduce errors. It's easy enough to
just concat and map inline, and it makes it clearer what is going on in
the code.
The big news here is a rewrite of Docx to use the builder
functions. As opposed to previous attempts, we now see a significant
speedup -- times are cut in half (or more) in a few informal tests.
Reducible has also been rewritten. It can doubtless be simplified and
clarified further. We can consider this, at the moment, a reference for
correct behavior.
Note that "Italic" can be on, and, from the last commit, `<w:i>` can be
present, but be turned off. In that case, the turned-off tag takes
precedence. So, we have to distinguish between something being off and
something not being there. Hence, isItalic, isBold, isStrike, and
isSmallCaps have become Maybes.
Indented code at the beginning of a list item must be indented eight
spaces from the margin (or from the edge of the container), or four
spaces past the list marker, whichever is farther.
Some examples in `tests/markdown-reader-more.txt`.
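The rule amounts to taking a maximum. A tiny sketch, where
`requiredCodeIndent` is a hypothetical name and `markerEnd` is assumed to
be the number of columns from the margin (or container edge) to the end
of the list marker:

```hs
-- | Minimum indentation, in spaces from the margin or container edge,
-- for indented code that begins a list item.
requiredCodeIndent :: Int -> Int
requiredCodeIndent markerEnd = max 8 (markerEnd + 4)
```

So with a marker ending at column 2 the code needs eight spaces, while a
marker ending at column 6 pushes the requirement to ten.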
Introduces a new function in Reducibles, concatR. The idea is that if we
have two lists of Reducibles (blocks or inlines), we can combine them and
just perform the reduction on the joining parts (the last element of the
first list, the first element of the second list). This is useful in cases
where the two lists are already reduced, and we're only worried about the
joining elements.
This actually improves the efficiency a bit further, because concatR can be
smart about empty lists.
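The idea can be sketched with a toy inline type. This is not the
Reducible code itself: `Inline`, `joinPair`, and `concatAtJoin` are
hypothetical stand-ins, and the only reduction shown is merging two
adjacent `Emph` elements.

```hs
-- | A toy inline type standing in for the real Reducible instances.
data Inline = Str String | Emph [Inline]
  deriving (Eq, Show)

-- | Reduce two neighbouring elements: adjacent Emphs are merged,
-- anything else is kept as is.
joinPair :: Inline -> Inline -> [Inline]
joinPair (Emph xs) (Emph ys) = [Emph (xs ++ ys)]
joinPair x y = [x, y]

-- | Concatenate two already-reduced lists, reducing only where they
-- meet. Empty lists are returned without any further work.
concatAtJoin :: [Inline] -> [Inline] -> [Inline]
concatAtJoin [] ys = ys
concatAtJoin xs [] = xs
concatAtJoin xs (y : ys) = init xs ++ joinPair (last xs) y ++ ys
```

Only the last element of the first list and the first element of the
second are inspected; everything else is passed through untouched.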
Before, we had to run reduceList on the whole combined paragraph, which
was redundant, and could take some time for long paragraphs. We only
need to combine the drop cap with the first inline of the next
paragraph.
Closes #1513.
Lists can now start without an intervening blank line.
Also, HTML block-level tags that don't start a line are parsed
as RawInline and don't interrupt paragraphs, as in RedCloth.
This allows users to turn off the default pandoc behavior of
parsing contents of div and span tags in markdown and HTML
as native pandoc Div blocks and Span inlines.
Setting of default epub extensions has been moved from the EPUB
reader to Text.Pandoc.
We now maintain the invariant that when fetchImages is called,
all images have absolute paths.
This patch fixes several bugs relating to this, as there are three places
where images can be introduced:
(1) During the HTML parse
(2) As spine elements
(3) As a cover image
For (1), the paths are corrected by the transformation renameImages.
For (2) and (3), we need to append the "root" to the path we parse from the
spine.
Previously the images were relative to the position of the package file. The
collapse function changed this so that they were absolute in the
archive, but the fetchImages function wasn't updated to recognise this.
pandoc -t markdown-raw_html should not emit any raw HTML, even
span and div tags that go with pandoc Span and Div elements.
Cleaned up a bit of the logic with extensions and plain.
This changes the signature of the exported `readOMML` to `String ->
Either String [Exp]`, so it can now, in theory, be slotted into
TeXMath. It doesn't have any real error reporting yet, but that might
make more sense once I put it in a branch, and understand how it works
in the other readers.
It also now reads strings that parse to either oMath or oMathPara
elements. Note that the distinction is lost in the output. It's up to
the caller to remember the display type.
We still need to test against prefixes, but this is only going to look
at oMath fragments, so we're not going to be worried about looking up
the real namespace.