Also, if page-progression-direction is not specified in metadata,
don't include the attribute even in EPUB3; not including it is
the same as including it with the value "default", as we did before.
Closes #1550.
Previously a section like this would be enclosed in a paragraph,
with RawInline for the video tags (since video is a tag that can
be either block or inline):
<video controls="controls">
<source src="../videos/test.mp4" type="video/mp4" />
<source src="../videos/test.webm" type="video/webm" />
<p>
The videos can not be played back on your system.<br/>
Try viewing on Youtube (requires Internet connection):
<a href="http://youtu.be/etE5urBps_w">Relative Velocity on
Youtube</a>.
</p>
</video>
This change will cause the video and source tags to be parsed
as RawBlock instead, giving better output.
The general change is this: when we're parsing a "plain" sequence
of inlines, we don't parse anything that COULD be a block-level tag.
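To illustrate, here is a hand-written sketch (not actual reader output)
of the difference in the parsed AST:
```
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Definition

-- Previously: the tags ended up as RawInline inside a Para.
before :: Block
before = Para [ RawInline (Format "html") "<video controls=\"controls\">"
              , RawInline (Format "html") "<source src=\"test.mp4\" type=\"video/mp4\" />"
              ]

-- Now: the tags are parsed as block-level raw HTML.
after :: [Block]
after = [ RawBlock (Format "html") "<video controls=\"controls\">"
        , RawBlock (Format "html") "<source src=\"test.mp4\" type=\"video/mp4\" />"
        ]
```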
We always favor an explicit positive or negative in a style in a
descendant, and only turn to the ancestor if nothing is set.
We also introduce an (empty) list of blacklisted styles that we won't
check (think underlines in hyperlinks).
Two points here: (1) we go bottom-up, starting from styles that are not
based on anything, to avoid circular dependencies or any other sort of
maliciousness/incompetence. And (2) each style points to its parent, so
we don't need the whole tree to pass a style over to Docx.hs.
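As a rough sketch of the lookup (names and record shape here are
hypothetical, not the actual parser code):
```
-- Sketch only: the field and type names are made up.
data Style = Style
  { styleName   :: String
  , boldSetting :: Maybe Bool  -- Just True/False = explicit; Nothing = unset here
  , parentStyle :: Maybe Style -- each style points to its parent
  }

-- Favor an explicit positive or negative in the descendant; only walk
-- up the ancestor chain when nothing is set at this level.
resolveBold :: Style -> Maybe Bool
resolveBold sty = case boldSetting sty of
  Just b  -> Just b
  Nothing -> parentStyle sty >>= resolveBold
```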
* Create a type synonym for MIME type (instead of `String`).
* Add `getMimeTypeDef` function.
* Avoid recreating MIME type `Map`s every time.
* Move “Formula-...” case handling into `getMimeType`.
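In outline, the API now looks something like this (the table contents
and the fallback value below are illustrative assumptions, not the real
module):
```
import Data.Maybe (fromMaybe)
import System.FilePath (takeExtension)
import qualified Data.Map as M

type MimeType = String

-- Built once at the top level, not recreated on every lookup.
mimeTypes :: M.Map String MimeType
mimeTypes = M.fromList [("png", "image/png"), ("svg", "image/svg+xml")]

getMimeType :: FilePath -> Maybe MimeType
getMimeType fp = M.lookup (drop 1 (takeExtension fp)) mimeTypes

getMimeTypeDef :: FilePath -> MimeType
getMimeTypeDef = fromMaybe "application/octet-stream" . getMimeType
```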
We want to be able to read user-defined styles. Eventually we'll be able
to figure out styles in terms of inheritance as well. The actual
cascading will happen in the docx reader.
In docx, super- and subscript are attributes of Vertalign. It makes more
sense to follow this, and have different possible values of Vertalign in
runStyle. This is mainly a preparatory step for real style parsing,
since it can distinguish between vertical align being explicitly turned
off and it not being set.
In addition, it makes parsing a bit clearer, and makes sure we don't do
docx-impossible things like being simultaneously super and sub.
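A sketch of the shape this gives the run style (constructor and field
names here are assumptions):
```
-- Super and sub are mutually exclusive values of one field, so being
-- "simultaneously super and sub" is unrepresentable.
data VertAlign = BaselineAlign | SupAlign | SubAlign
  deriving (Show, Eq)

data RunStyle = RunStyle
  { verticalAlign :: Maybe VertAlign
    -- Just BaselineAlign       = explicitly turned off
    -- Just SupAlign / SubAlign = superscript / subscript
    -- Nothing                  = not set at all
  }
```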
Functions like runElemsToInlines and parPartsToInlines are just defined
in terms of concatting and mapping their singular versions
(e.g. `runElemToInlines`). Having two functions with almost identical
names makes it easier to introduce errors. It's easy enough to concat
and map the singular version inline, and that makes it clearer what is
going on in the code.
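In miniature, the pattern being removed looks like this (the types
below are hypothetical stand-ins):
```
-- Hypothetical stand-in types, just to show the shape of the change.
data RunElem = TextRun String | Tab

runElemToInlines :: RunElem -> [String]
runElemToInlines (TextRun s) = [s]
runElemToInlines Tab         = [" "]

-- Wrappers like this add a near-identical name without adding meaning;
-- call sites can simply write `concatMap runElemToInlines` instead.
runElemsToInlines :: [RunElem] -> [String]
runElemsToInlines = concatMap runElemToInlines
```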
The big news here is a rewrite of Docx to use the builder
functions. As opposed to previous attempts, we now see a significant
speedup -- times are cut in half (or more) in a few informal tests.
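For reference, "using the builder functions" means constructing output
with Text.Pandoc.Builder's monoidal combinators instead of
hand-concatenated lists, e.g.:
```
{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.Builder

-- Inlines and Blocks here are Monoid newtypes over Data.Sequence,
-- so appends are cheap.
sample :: Blocks
sample = para ("Hello, " <> emph "world" <> ".")
      <> bulletList [plain "one", plain "two"]
```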
Reducible has also been rewritten. It can doubtless be simplified and
clarified further. We can consider this, at the moment, a reference for
correct behavior.
Note that "Italic" can be on, and, from the last commit, `<w:i>` can be
present, but be turned off. In that case, the turned-off tag takes
precedence. So, we have to distinguish between something being off and
something not being there. Hence, isItalic, isBold, isStrike, and
isSmallCaps have become Maybes.
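In other words, each of these properties now carries three states
(a sketch; the real record lives in the parser):
```
-- Just True:  the tag is present and on           (e.g. <w:i/>)
-- Just False: the tag is present but turned off   (e.g. <w:i w:val="0"/>)
-- Nothing:    the tag is absent; defer to the style chain
data RunFlags = RunFlags
  { isItalic    :: Maybe Bool
  , isBold      :: Maybe Bool
  , isStrike    :: Maybe Bool
  , isSmallCaps :: Maybe Bool
  }
```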
Indented code at the beginning of a list item must be indented eight
spaces from the margin (or from the edge of the container), or four
spaces past the list marker, whichever is farther.
Some examples in `tests/markdown-reader-more.txt`.
Introduces a new function in Reducible, concatR. The idea is that if we
have two lists of Reducibles (blocks or inlines), we can combine them and
just perform the reduction on the joining parts (the last element of the
first list, the first element of the second list). This is useful in cases
where the two lists are already reduced, and we're only worried about the
joining elements.
This actually improves the efficiency a bit further, because concatR can be
smart about empty lists.
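The idea in sketch form (the signature and the reduce parameter below
are assumptions, not the module's actual API):
```
-- Given two already-reduced lists, only the seam needs reducing.
concatR :: ([a] -> [a]) -> [a] -> [a] -> [a]
concatR _      [] ys = ys           -- smart about empty lists
concatR _      xs [] = xs
concatR reduce xs ys =
  init xs ++ reduce [last xs, head ys] ++ tail ys
```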
Before, we had to run reduceList on the whole combined paragraph, which
was redundant, and could take some time for long paragraphs. We only
need to combine the drop cap with the first inline of the next
paragraph.
Closes #1513.
Lists can now start without an intervening blank line.
Also, HTML block-level tags that don't start a line are parsed
as RawInline and don't interrupt paragraphs, as in RedCloth.
This allows users to turn off the default pandoc behavior of
parsing contents of div and span tags in markdown and HTML
as native pandoc Div blocks and Span inlines.
Setting of default epub extensions has been moved from the EPUB
reader to Text.Pandoc.
We now maintain the invariant that when fetchImages is called,
all images have absolute paths.
This patch fixes several bugs relating to this, as there are three
places where images can be introduced:
(1) During the HTML parse
(2) As spine elements
(3) As a cover image
For (1), the paths are corrected by the renameImages transformation.
For (2) and (3), we need to append the "root" to the path we parse from
the spine.
Before, the images were relative to the position of the package file.
The collapse function changed this so that they were absolute within the
archive, but the fetchImages function wasn't updated to recognise this.
`pandoc -t markdown-raw_html` should not emit any raw HTML, even
span and div tags that go with pandoc Span and Div elements.
Cleaned up a bit of the logic with extensions and plain.
This changes the signature of the exported `readOMML` to `String ->
Either String [Exp]`, so it can now, in theory, be slotted into
TeXMath. It doesn't have any real error reporting yet, but that might
make more sense once I put it in a branch, and understand how it works
in the other readers.
It also now reads strings that parse to either oMath or oMathPara
elements. Note that the distinction is lost in the output. It's up to
the caller to remember the display type.
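A hypothetical caller might look like this (only the `readOMML`
signature is from this change; the rest is a stub for illustration):
```
import Text.TeXMath.Types (Exp)

-- Stub standing in for the actual reader, with the new signature.
readOMML :: String -> Either String [Exp]
readOMML = error "provided by the OMML reader"

-- The oMath/oMathPara distinction is lost in [Exp], so a caller has to
-- record the display type alongside the parsed expressions itself.
parseMath :: Bool -> String -> Either String (Bool, [Exp])
parseMath isDisplay s = fmap ((,) isDisplay) (readOMML s)
```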
We still need to test against prefixes, but this is only going to look
at oMath fragments, so we're not going to be worried about looking up
the real namespace.
The new version of TeXMath can translate from its type system into
LaTeX. So instead of writing the LaTeX ourselves, we write to the TeXMath
`Exp` type, and let TeXMath do the rest.
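A minimal sketch of handing the work to TeXMath (the symbol
classification below is a guess; the point is just `Exp` in, LaTeX out):
```
{-# LANGUAGE OverloadedStrings #-}
import Text.TeXMath (writeTeX)
import Text.TeXMath.Types (Exp(..), TeXSymbolType(..))

-- Build Exps by hand and let TeXMath render the LaTeX.
main :: IO ()
main = print (writeTeX [EIdentifier "x", ESymbol Bin "+", ENumber "1"])
```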
Using `map toUpper` to capitalise text is wrong, as e.g.
“Straße” should be converted to “STRASSE”, which is 1 character
longer. This commit adds a `capitalize` function and replaces
2 identical implementations in different modules (`toCaps` and
`capitalize`) with it.
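The string-level issue can be seen with Data.Text, whose toUpper applies
full Unicode case mappings (pandoc's actual `capitalize` works on inline
elements; this sketch only shows the character-level problem):
```
import Data.Char (toUpper)
import qualified Data.Text as T

-- map toUpper maps each character independently, so 'ß' stays 'ß';
-- a string-level uppercasing handles one-to-many mappings like 'ß' -> "SS".
main :: IO ()
main = do
  putStrLn (map toUpper "Straße")                   -- "STRAßE"
  putStrLn (T.unpack (T.toUpper (T.pack "Straße"))) -- "STRASSE"
```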
* More consistent logic: absolute URIs are fetched from the net;
other things are treated as relative URIs if sourceURL is a Just,
otherwise as file paths.
* We escape characters that are not allowed in URIs before trying
to parse them (e.g. '|', which often occurs in the wild).
* When treating relative paths as local file paths, we drop
any fragment or query. This is useful e.g. when you've downloaded
web fonts locally, but your source still contains the original
relative URLs.
Together with the previous commit, this should close #1477.
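A rough sketch of that decision logic (not the actual fetchItem code;
the helper below is made up):
```
import Network.URI (escapeURIString, isAllowedInURI, isURI)

data Source = FromNet String | FromFile FilePath
  deriving Show

classify :: Maybe String -> String -> Source
classify sourceURL raw
  | isURI s                = FromNet s                  -- absolute URI: fetch from the net
  | Just base <- sourceURL = FromNet (base ++ "/" ++ s) -- treat as relative URI
  | otherwise              = FromFile (dropFragmentAndQuery raw)
  where
    -- escape characters not allowed in URIs (e.g. '|') before parsing
    s = escapeURIString isAllowedInURI raw
    dropFragmentAndQuery = takeWhile (`notElem` ['#', '?'])
```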
* mkSelfContained now takes just two arguments, WriterOptions and
the string.
* It no longer looks in data files. This only made sense when we
had copies of slidy and S5 code there.
* Shared.fetchItem' is used instead of the nearly duplicate getItem.
The parser had been changing footnotes and endnotes into footnotes. This
isn't a problem, because pandoc collapses them, but the parser should
maintain as much of the docx structure as possible, and let the
top-level reader worry about how to translate it into Pandoc. (This would
be an issue when, as is planned, the docx parser spins off into its
own module.)
The output is the same, so no test change is required.
Moved `MediaBag` definition and functions from Shared:
`lookupMedia`, `mediaDirectory`, `insertMedia`, `extractMediaBag`.
Removed `emptyMediaBag`; use `mempty` instead, since `MediaBag`
is a Monoid.
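For example (a sketch; signatures follow Text.Pandoc.MediaBag as of
this change and may differ in later versions):
```
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as BL
import Text.Pandoc.MediaBag (MediaBag, insertMedia)

-- An empty bag is just mempty, since MediaBag is a Monoid.
emptyBag :: MediaBag
emptyBag = mempty

withLogo :: BL.ByteString -> MediaBag
withLogo bytes = insertMedia "images/logo.png" (Just "image/png") bytes mempty
```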
This will make paragraphs styled with `Author`, `Title`, `Subtitle`,
`Date`, and `Abstract` into pandoc metavalues, rather than text. The
implementation only takes those elements from the beginning of the
document (ignoring empty paragraphs).
Multiple paragraphs in the `Author` style will be made into a MetaList,
one paragraph per item. Hard linebreaks (shift-return) in the paragraph
will be maintained, and can be used for institution, email, etc.
Math now appears in Unicode if possible, without the distracting
italics around identifiers.
Blank lines around headers are more consistent.
Footnotes appear in regular [n] style.
We now largely follow the style of Project Gutenberg.
Emphasis is rendered with `_underscores_`, strong with ALL CAPS.
The appearance of horizontal rules has changed (even in regular
markdown) to a line across the whole page.
Headings are rendered differently, using space to set them off.
http://txt2tags.org/
There are two points which currently do not match the official
implementation.
1. In the official implementation, lists cannot be nested like the
following, but this reader interprets it as a bullet list whose first
item is a numbered list:
```
- + This is not a list
```
2. The specification describes how URIs automatically become links.
Unfortunately, as is often the case, their definition of a URI is not
clear. I tried three solutions but was unsure about which to adopt.
* Using isURI from Network.URI; this matches far too many strings and is
therefore unsuitable.
* Using uri from Text.Pandoc.Shared; this doesn't match all strings that
the reference implementation matches.
* Trying to simulate the regex that is used in the native code.
I went with the third approach, but it is not perfect; for example,
trailing punctuation is captured in URLs.