Closes#1669.
If there are further issues, please open a new, targeted issue on the
tracker. Some notes on the further issues you gestured at:
Data URIs are indeed dereferenced, but why is this a problem?
(The function being used to fetch from URLs is used for many different
formats. Preserving data URIs would make sense in EPUBs, but not
for e.g. PDF output. And by dereferencing we can get a smaller,
more efficient EPUB, with the data stored as bytes in a file rather
than encoded in textual representation.)
"absolute uris are not recognized" -- I assume that is the problem
just fixed. If not, please open a new issue.
"relative uris are resolved (wrongly) like file paths" -- can you
give an example?
`<base>` tag is ignored. Yes. I didn't know about the base tag. Could
you open a new issue just for this?
This function can be used to sanitize reference labels so that
they do not contain any of the illegal characters \#[]",{}%()|= .
Currently only Links have their labels sanitized, because they
are the only Elements that use passed labels.
We previously took the old relationship names of the headers and footer in
secptr. That led to collisions. We now make a map of availabl names in the
relationships file, and then rename in secptr.
Graphics in `\section`/`\subsection` etc titles need to be `\protect`ed.
This adds a state value and manually turns it on before every invocation
of `sectionHeader` and manually turns it off after. Using a writer value
and applying `local` would probably be cleaner, but this fits with the
current style.
When we encounter one of the polyglot header styles, we want to remove
that from the par styles after we convert to a header. To do that, we
have to keep track of the style name, and remove it appropriately.
We're just keeping a list of header formats that different languages use
as their default styles. At the moment, we have English, German, Danish,
and French. We can continue to add to this.
This is simpler than parsing the styles file, and perhaps less
error-prone, since there seems to be some variations, even within a
language, of how a style file will define headers.
When users number their headers, Word understands that as a single item
enumerated list. We make the assumption that such a list is, in fact, a header.
Because of the built-in line skip, LaTeX can't handle a section header
as the first element in a list item. (To be precise, it can't handle it
if the list immediately follows a section header, but the instance is
rare enough that we can afford to be a bit more general). This puts a
non-breaking space before the header to solve this problem. We won't see
this space, since the header skips a line before printing anyway.
The output is ugly in LaTeX and this structure seems like it should
probably be avoided. But it is valid HTML and native pandoc, so we
should have some sort of typesettable representation in LaTeX.
Previously text that ended a div would be parsed as Plain
unless there was a blank line before the closing div tag.
Test case:
<div class="first">
This is a paragraph.
This is another paragraph.
</div>
Closes#1591.
We can now handle all different alignment types, for simple
tables only (no captions, no relative widths, cell contents just
plain inlines). Other tables are still handled using raw HTML.
Addresses #1585 as far as it can be addresssed, I believe.
This makes to docx reader's native output fit with the way the markdown
reader understands its markdown output. Ie, as far as table cells go:
docx -> native == docx -> native -> markdown -> native
(This identity isn't true for other things outside of table cells, of
course).
Currently, pandoc has hard-coded the following in order to make horizontal
rules in LaTeX:
```hs
"\\begin{center}\\rule{3in}{0.4pt}\\end{center}"
```
Which is fine, but does not allow customizations. It also does not take into
consideration the current line width.
I'm proposing this change:
```diff
@@ In Writers/LaTeX.hs:
-"\\begin{center}\\rule{3in}{0.4pt}\\end{center}"
+"\\begin{center}\\rule{0.5\\linewidth}{\\linethickness}\\end{center}"
```
Also, if page-progression-direction not specified in metadata,
don't include the attribute even in EPUB3; not including it is
the same as including it with the value "default", as we did before.
Closes#1550.
Previously a section like this would be enclosed in a paragraph,
with RawInline for the video tags (since video is a tag that can
be either block or inline):
<video controls="controls">
<source src="../videos/test.mp4" type="video/mp4" />
<source src="../videos/test.webm" type="video/webm" />
<p>
The videos can not be played back on your system.<br/>
Try viewing on Youtube (requires Internet connection):
<a href="http://youtu.be/etE5urBps_w">Relative Velocity on
Youtube</a>.
</p>
</video>
This change will cause the video and source tags to be parsed
as RawBlock instead, giving better output.
The general change is this: when we're parsing a "plain" sequence
of inlines, we don't parse anything that COULD be a block-level tag.
We always favor an explicit positive or negative in a style in a
descendent, and only turn to the ancestor if nothing is set.
We also introduce an (empty) list of styles that are black-listed. We
won't check them. (Think underlines in hyperlinks).
Two points here: (1) We're going bottom-up, from styles not based on
anything, to avoid circular dependencies or any other sort of
maliciousness/incompetence. And (2) each style points to its
parent. That way, we don't need the whole tree to pass a style over to
Docx.hs
* Create a type synonym for MIME type (instead of `String`).
* Add `getMimeTypeDef` function.
* Avoid recreating MIME type `Map`s every time.
* Move “Formula-...” case handling into `getMimeType`.
We want to be able to read user-defined styles. Eventually we'll be able
to figure out styles in terms of inheritance as well. The actual
cascading will happen in the docx reader.
In docx, super- and subscript are attributes of Vertalign. It makes more
sense to follow this, and have different possible values of Vertalign in
runStyle. This is mainly a preparatory step for real style parsing,
since it can distinguish between vertical align being explicitly turned
off and it not being set.
In addition, it makes parsing a bit clearer, and makes sure we don't do
docx-impossible things like being simultaneously super and sub.
functions like runElemsToInlines and parPartsToInlines are just defined
in terms of concatting and mapping their singular
version (e.g. `runElemToInlines`). Having two functions with almost
identical names makes it easier to introduce errors. It's easy enough to
just concat and map inline, and it makes it clearer what is going on in
the code.
The big news here is a rewrite of Docx to use the builder
functions. As opposed to previous attempts, we now see a significant
speedup -- times are cut in half (or more) in a few informal tests.
Reducible has also been rewritten. It can doubtless be simplified and
clarified further. We can consider this, at the moment, a reference for
correct behavior.
Note that "Italic" can be on, and, from the last commit, `<w:i>` can be
present, but be turned off. In that case, the turned-off tag takes
precedence. So, we have to distinguish between something being off and
something not being there. Hence, isItalic, isBold, isStrike, and
isSmallCaps have become Maybes.
Indented code at the beginning of a list item must be indented eight
spaces from the margin (or from the edge of the container), or four
spaces past the list marker, whichever is farther.
Some examples in `tests/markdown-reader-more.txt`.
Introduces a new function in Reducibles, concatR. The idea is that if we
have two list of Reducibles (blocks or inlines), we can combine them and
just perform the reduction on the joining parts (the last element of the
first list, the first element of the second list). This is useful in cases
where the two lists are already reduced, and we're only worried about the
joining elements.
This actually improves the efficiency a bit further, because concatR can be
smart about empty lists.
Before, we had to run reduceList on the whole combined paragraph, which
was redundant, and could take some time for long paragraphs. We only
need to combine the drop cap with the first inline of the next
paragraph.
Closes#1513.
Lists can now start without an intervening blank line.
Also, html block-level tags that don't start a line are parsed
as RawInline and don't interrupt paragraphs, as in RedCloth.
This allows users to turn off the default pandoc behavior of
parsing contents of div and span tags in markdown and HTML
as native pandoc Div blocks and Span inlines.
Setting of default epub extensions has been moved from the EPUB
reader to Text.Pandoc.
We now maintain the invariant that when fetchImages is called,
all images have absolute paths.
This patch fixes several bugs relating to this as there are three places
where images can be introduced.
(1) During the HTML parse
(2) As spine elements
(3) As a cover image
For (1), the paths are corrected by the transformation renameImages
For (2) and (3), we need to append the "root" to the path we parse from the
spine
Before the images were relative to the position of the package file. The
collapse function changed this so that they were then absolute in the
archive but the fetchImages function wasn't updated to recognise this.
pandoc -t markdown-raw_html should not emit any raw HTML, even
span and div tags that go with pandoc Span and Div elements.
Cleaned up a bit of the logic with extensions and plain.
This changes the signature of the exported `readOMML` to `String ->
Either String [Exp]`, so it can now, in theory, be slotted into
TeXMath. It doesn't have any real error reporting yet, but that might
make more sense once I put it in a branch, and understand how it works
in the other readers.
It also now reads strings that parse to either oMath or oMathPara
elements. Note that the distinction is lost in the output. It's up to
the caller to remember the display type.
We still need to test against prefixes, but this is only going to look
at oMath fragments, so we're not going to be worried about looking up
the real namespace.
The new version of TeXMath can translate from its type system into
LaTeX. So instead of writing the LaTeX ourself, we write to the TeXMath
`Exp` type, and let TeXMath do the rest.
Using `map toUpper` to capitalise text is wrong, as e.g.
“Straße” should be converted to “STRASSE”, which is 1 character
longer. This commit adds a `capitalize` function and replaces
2 identical implementations in different modules (`toCaps` and
`capitalize`) with it.
* More consistent logic: absolute URIs are fetched from the net;
other things are treated as relative URIs if sourceURL is a Just,
otherwise as file paths.
* We escape characters that are not allowed in URIs before trying
to parse them (e.g. '|', which often occurs in the wild).
* When treating relative paths as local file paths, we drop
any fragment or query. This is useful e.g. when you've downloaded
web fonts locally, but your source still contains the original
relative URLs.
Together with the previous commit, this should close#1477.