Tissevert edited this page 5 months ago

Chaoui

Here are a few resources to document how to use the application chaoui.

A typical view of Chaoui editing a document

This is the main view of chaoui while editing an ALTO file. The text is rendered in the browser according to the geometry documented in the ALTO file. The menu bar is very small, with only two menus: ‘file’ and ‘help’. The tool bar displays the position and name of the current file, the value of the «quality threshold», the current mode and the zoom section.

File Menu

The file menu

The file menu lets you load and save the two types of files that chaoui handles: ALTO files, which represent documents, and «scoriae» files, which represent blacklists of word IDs to delete. Scoriae files are technically CSV files, so they can be opened in any spreadsheet application or text editor.
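Because scoriae files are plain CSV, they can also be read from a script. The following is a minimal sketch, not chaoui's own code: the exact column layout of a scoriae file is not documented here, so the function takes a hypothetical `column` parameter and assumes there is no header row. The word IDs in the sample are invented for illustration.

```python
import csv
import io


def read_word_ids(csv_text, column=0):
    """Collect the non-empty values found in one column of a CSV text.

    Assumption: the word IDs sit in a single column (index `column`)
    and the file has no header row. Adjust both to the real layout.
    """
    ids = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and rows too short to hold the column.
        if len(row) > column and row[column].strip():
            ids.append(row[column].strip())
    return ids


# Hypothetical content of a scoriae file, one word ID per line.
sample = "word_12\nword_47\n\nword_90\n"
print(read_word_ids(sample))
```

To read a file from disk instead, pass the result of `open(path, newline="").read()` as `csv_text`.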

Tool bar

Position

The position section of the toolbar

All the loading functions of the file menu let you select as many files as you want, so in general you are working on a set of files. This indicator in the tool bar shows the rank of the current file within the selection as well as the name of the corresponding file. The file name is read-only and depends only on the files you selected in the file browser dialog. The rank, however, is an input that lets you pick the file you want to open, either by clicking the arrows or by directly typing the rank of the file you want to reach.

Note that the keyboard «Left» and «Right» arrow keys also let you browse the set of selected files.

Quality threshold

The quality threshold selector of the toolbar

While rendering the text, chaoui will highlight some words with a blue background. This means that the corresponding <String> elements in the original ALTO file have a low quality indicator (WC attribute) in the output of the OCR. This input lets you pick the lowest value you still consider «ok». All words with a WC attribute value below this level will be considered of «low» quality.

Some low quality words in a paragraph

Note that the OCR can assign low quality values to perfectly recognized words and, conversely, have good confidence in words it nonetheless failed to read. The quality threshold is a number between 0 and 1, reflecting the values found in the WC attributes. A threshold of 0 disables this feature in practice, since all words have a WC attribute ≥ 0.
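The threshold rule can be reproduced outside the application to list the words that would be highlighted. This is a rough sketch and not chaoui's actual implementation: it matches `<String>` elements by local name so that any ALTO namespace version works, and it skips words that carry no WC attribute (an assumption, since the behaviour for that case is not described here). The sample document is invented.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def low_quality_words(alto_xml, threshold):
    """Return the CONTENT of every <String> whose WC is below threshold."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        wc = element.get("WC")
        # Assumption: words without a WC attribute are never flagged.
        if wc is not None and float(wc) < threshold:
            flagged.append(element.get("CONTENT"))
    return flagged


# Invented two-word ALTO sample.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock ID="b1"><TextLine>
    <String ID="w1" CONTENT="Hello" WC="0.95"/>
    <String ID="w2" CONTENT="wor1d" WC="0.42"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(low_quality_words(sample, 0.5))  # ['wor1d']
print(low_quality_words(sample, 0))    # [] since no WC value is below 0
```

The comparison is strict (`<`), matching the description above: a word whose WC equals the threshold still counts as «ok».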

Modes

The mode selector

By default, chaoui will only display ALTO files, allowing you to browse and inspect the geometry of their content. This selector lets you switch between this mode and two others. The edition mode lets you mark words for deletion (read more about the edition process), and the block-order mode puts the <TextBlock> elements of ALTO into focus and displays the (linear) order in which they appear in the ALTO file.
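The linear order that the block-order mode displays is simply the order in which the `<TextBlock>` elements occur in the file. As an illustrative sketch (again not chaoui's own code), that order can be listed with Python's standard library. The block IDs in the sample are made up.

```python
import xml.etree.ElementTree as ET


def text_block_order(alto_xml):
    """Return (rank, ID) pairs for <TextBlock> elements in file order."""
    root = ET.fromstring(alto_xml)
    # iter() walks the tree in document order, so the list below follows
    # the order of the blocks in the file, not their position on the page.
    blocks = [
        element
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "TextBlock"
    ]
    return [(rank, block.get("ID")) for rank, block in enumerate(blocks, start=1)]


# Invented sample: the body block is stored before the title block.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="block_body" VPOS="400"/>
    <TextBlock ID="block_title" VPOS="100"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(text_block_order(sample))  # [(1, 'block_body'), (2, 'block_title')]
```

In this sample the file order differs from the top-to-bottom order on the page, which is the kind of discrepancy the block-order mode helps to spot.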

Zoom

The zoom section of the toolbar

Finally, the zoom section of the toolbar lets you choose between three zooming strategies:

  • fit width: all the pages will be zoomed to occupy exactly the width available in your browser, regardless of their own width in the original ALTO documents
  • custom zoom: the number input in the middle represents an arbitrary proportionality ratio used to render the pages. All pages will be rendered with the same ratio in this mode
  • fit height: all the pages will be zoomed to occupy exactly the height available in your browser, regardless of their own height in the original ALTO documents

Note that the two «fit» strategies produce a zooming ratio that may vary from page to page if the selected files don’t all have the same dimensions. In that case, an approximation of this ratio, rounded to the closest multiple of 10, is automatically displayed in the custom zoom input.
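The arithmetic behind the «fit» strategies can be sketched as follows. This is only an illustration under stated assumptions: the ratio is taken to be the available size divided by the page size (both in the same unit), and the value shown in the custom zoom input is assumed to be that ratio expressed as a percentage. The function names and page sizes are hypothetical.

```python
import math


def fit_ratio(available, page_size):
    """Ratio that makes a page of `page_size` fill `available` exactly."""
    return available / page_size


def displayed_zoom(ratio, step=10):
    """Round a ratio, expressed as a percentage, to the closest multiple of `step`.

    Assumption: the input shows a percentage. floor(x + 0.5) is used
    instead of round() to avoid Python's round-half-to-even behaviour.
    """
    percent = ratio * 100
    return math.floor(percent / step + 0.5) * step


# Two hypothetical pages of different widths in a 1200-unit-wide window.
for page_width in (2480, 1650):
    ratio = fit_ratio(1200, page_width)
    print(page_width, round(ratio, 3), displayed_zoom(ratio))
```

With «fit width», the two pages above get different ratios (about 0.484 and 0.727), which is why the value in the custom zoom input changes as you move from one file to the next.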