Commit graph

17 commits

Author SHA1 Message Date
Milan Bracke
6acc82c5d2 Docx parser: implement PAGEREF fields
These fields, often used in tables of contents, can be a hyperlink.
2021-10-18 19:15:40 -07:00
John MacFarlane
0a4c6925b6 Docx writer: copy over more settings from referenc.odcx.
From settings.xml in the reference-doc, we now include:
`zoom`, `embedSystemFonts`, `doNotTrackMoves`, `defaultTabStop`,
`drawingGridHorizontalSpacing`, `drawingGridVerticalSpacing`,
`displayHorizontalDrawingGridEvery`, `displayVerticalDrawingGridEvery`,
`characterSpacingControl`, `savePreviewPicture`, `mathPr`, `themeFontLang`,
`decimalSymbol`, `listSeparator`, `autoHyphenation`, `compat`.

Closes #7240.
2021-05-15 15:40:49 -07:00
John MacFarlane
2cf971cf56 docx writer: Remove rsids from settings.docx.
Word will add these when revisions are made.  But it's
pointless to start out with a set of them.
2021-05-15 10:54:05 -07:00
John MacFarlane
5eb7ad7d1e Improve integration of settings from reference.docx.
The settings we can carry over from a reference.docx are
autoHyphenation, consecutiveHyphenLimit, hyphenationZone,
doNotHyphenateCap, evenAndOddHeaders, and proofState.

Previously this was implemented in a buggy way, so that the
reference doc's values AND the new values were included.

This change allows users to create a reference.docx that
sets w:proofState for spelling or grammar to "dirty,"
so that spell/grammar checking will be triggered on the
generated docx.

Closes #1209.
2021-05-11 22:31:38 -06:00
John MacFarlane
c3f9e8c122 Docx writer: make nsid in abstractNum deterministic.
Previously we assigned a random number (though in a deterministic
way).  But changes in the random package mean we get different
results now on different architectures, even with the same random
seed. We don't need random values; so now we just assign a value
based on the list number id, which is guaranteed to be unique
to the list marker.
2021-03-17 22:31:20 -07:00
John MacFarlane
967e7f5fb9 Rename Text.Pandoc.XMLParser -> Text.Pandoc.XML.Light...
..and add new definitions isomorphic to xml-light's, but with
Text instead of String.  This allows us to keep most of the code in
existing readers that use xml-light, but avoid lots of unnecessary
allocation.

We also add versions of the functions from xml-light's
Text.XML.Light.Output and Text.XML.Light.Proc that operate
on our modified XML types, and functions that convert
xml-light types to our types (since some of our dependencies,
like texmath, use xml-light).

Update golden tests for docx and pptx.

OOXML test: Use `showContent` instead of `ppContent` in `displayDiff`.

Docx: Do a manual traversal to unwrap sdt and smartTag.
This is faster, and needed to pass the tests.

Benchmarks:

A = prior to 8ca191604d (Feb 8)
B = as of 8ca191604d (Feb 8)
C = this commit

| Reader  |  A    | B      | C     |
| ------- | ----- | ------ | ----- |
| docbook | 18 ms | 12 ms  | 10 ms |
| opml    | 65 ms | 62 ms  | 35 ms |
| jats    | 15 ms | 11 ms  |  9 ms |
| docx    | 72 ms | 69 ms  | 44 ms |
| odt     | 78 ms | 41 ms  | 28 ms |
| epub    | 64 ms | 61 ms  | 56 ms |
| fb2     | 14 ms | 5  ms  | 4 ms  |
2021-02-16 16:55:20 -08:00
John MacFarlane
8ca191604d Add new unexported module T.P.XMLParser.
This exports functions that uses xml-conduit's parser to
produce an xml-light Element or [Content].  This allows
existing pandoc code to use a better parser without
much modification.

The new parser is used in all places where xml-light's
parser was previously used.  Benchmarks show a significant
performance improvement in parsing XML-based formats
(especially ODT and FB2).

Note that the xml-light types use String, so the
conversion from xml-conduit types involves a lot
of extra allocation.  It would be desirable to
avoid that in the future by gradually switching
to using xml-conduit directly. This can be done
module by module.

The new parser also reports errors, which we report
when possible.

A new constructor PandocXMLError has been added to
PandocError in T.P.Error [API change].

Closes #7091, which was the main stimulus.

These changes revealed the need for some changes
in the tests.  The docbook-reader.docbook test
lacked definitions for the entities it used; these
have been added. And the docx golden tests have been
updated, because the new parser does not preserve
the order of attributes.

Add entity defs to docbook-reader.docbook.

Update golden tests for docx.
2021-02-10 22:04:11 -08:00
John MacFarlane
c451207b08 Docx writer: handle table header using styles.
Instead of hard-coding the border and header cell vertical alignment,
we now let this be determined by the Table style, making use of
Word's "conditional formatting" for the table's first row.
For headerless tables, we use the tblLook element to tell Word
not to apply conditional first-row formatting.

Closes #7008.
2021-01-12 09:49:10 -08:00
cholonam
5f4deb5455 Docx writer: Fix bullets/lists indentation
Fix appearance of bullets/numbered lists (the first level is slightly
indented to the right instead of right on the margin).

New golden files have been tested using Word 2010 on Windows 10.
2020-11-26 12:11:26 -08:00
John MacFarlane
1e84178431 Docx writer: support --number-sections.
Closes #1413.
2020-07-22 11:53:31 -07:00
John MacFarlane
5a20cc07dd Docx writer: enable column and row bands for tables.
This change will not have any effect with the default style.
However, it enables users to use a style (via a reference.docx)
that turns on row and/or column bands.

Closes #6371.
2020-05-16 15:50:59 -07:00
John MacFarlane
41d1ae0fdd Change styles in reference.docx.
All headings now have a uniform color.

Level-1 headings no longer set `w:themeShade="B5"`.

Level-2 headings are now 14 point rather than 16 point.

Level-3 headings are now 12 point rather than 14 point.

Level-4 headings are italic rather than bold.

Closes #5820.
2019-11-16 09:48:05 -08:00
John MacFarlane
e8de53ce4a Change reference.docx to use more normal block quotes.
Indented left and right, same font and size.
Previously it was unindented, smaller font and different
typeface.

See #5820.
2019-11-14 22:20:58 -08:00
John MacFarlane
b7cbd7b8c9 docx writer: avoid extra copy of abstractNum and num elements...
...in numbering.xml.  This caused pandoc-produced docx files to
be uneditable using Word Online.

The problem was that recent versions of reference.docx include
samples of various kinds of text, including lists.  The
numering elements for these were getting copied over to
the new docx, where they clashed with the autogenerated
elements produced by pandoc.  This didn't confuse Desktop
Word, but it did confuse Word Online.

Closes #5358.
2019-03-11 22:09:21 -07:00
John MacFarlane
d333c283cc Docx writer: Fix bookmarks to headers with long titles.
Word has a 40 character limit for bookmark names.  In
addition, bookmarks must begin with a letter.  Since
pandoc's auto-generated identifiers may not respect
these constraints, some internal links did not work.

With this change, pandoc uses a bookmark name based
on the SHA1 hash of the identifier when the identifier
isn't a legal bookmark name.

Closes #5091.
2018-11-20 23:43:21 -05:00
John MacFarlane
30033f417f Docx writer: added framework for custom properties.
So far, we don't actually write any custom properties,
but we have the infrastructure to add this.

See #3034.
2018-10-09 10:38:50 -07:00
Jesse Rosenthal
b3449a84aa Docx writer tests: Use new golden framework
These are based off the reader tests, with some removed (where the
reader output was identical, based on different docx inputs). There
are still more to be added. In particular, tests for custom-styles
need to be added.

All golden docx files have been checked in MS Word
2013 (windows). There is no corruption.

There is questionable output in the `tables` test: the three tables
seemed to be joined. This will be addressed in a future commit, and
the golden docx file will be changed.
2018-01-27 08:08:25 -05:00