PDiff
Support entry
- Version: 1.5 or later
- Platform: Mac/Win
- Language: all
- Category: Tips & Tricks
- Last updated: 27.08.2012
Task / Problem description:
What do the text extraction parameters in PDiff mean?
Solution:
disableCharReordering
When true, the word finder does not reconstruct the character order, and the word finding algorithm is applied to the characters in drawing order. By default, the word finder reorders the characters on a single line by their relative horizontal positions. Most of the time, character reordering improves text extraction quality. However, on a PDF page with heavily overlapping character bounding boxes, the outcome becomes somewhat unpredictable. In such cases, disabling character reordering (disableCharReordering = true) may produce a more stable result.
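The difference between drawing order and horizontal order can be illustrated with a minimal Python sketch. It is not the word finder's actual algorithm: the Glyph type, the line_text() helper and the coordinates are hypothetical and only model "sort the characters of one line by horizontal position".

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    char: str
    x: float  # left edge of the bounding box (hypothetical units)

def line_text(glyphs, disable_char_reordering=False):
    """Join the glyphs of a single line into text."""
    if disable_char_reordering:
        ordered = list(glyphs)                       # keep drawing order
    else:
        ordered = sorted(glyphs, key=lambda g: g.x)  # reorder left to right
    return "".join(g.char for g in ordered)

# The page content draws "lo" before "Hel".
drawn = [Glyph("l", 30), Glyph("o", 40), Glyph("H", 0), Glyph("e", 10), Glyph("l", 20)]

print(line_text(drawn))                                # Hello
print(line_text(drawn, disable_char_reordering=True))  # loHel
```

With heavily overlapping bounding boxes the x positions of neighbouring glyphs are nearly equal, so small coordinate differences decide the sort order. That is the situation in which keeping the drawing order may be the more predictable choice.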
disableTaggedPDF
When true, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version (WF_LATEST_VERSION).
noXYSort
When true, it disables generating an XY-ordered word list. This option replaces the sort order flags in the older version of the word finder creation command (PDDocCreateWordFinder()). Setting this option is equivalent to omitting the WXE_XY_SORT flag.
preserveSpaces
When true, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from output text. When false (the default), you can add spaces later by considering the word attribute flag WXE_ADJACENT_TO_SPACE, but there is no way to restore the exact number of consecutive space characters.
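A rough Python sketch of re-adding spaces from the word attribute flag follows. The bit value of ADJACENT_TO_SPACE and the (text, flags) word representation are placeholders, and the sketch assumes the flag marks a word that is followed by a space. As stated above, only a single space can be restored per word boundary.

```python
ADJACENT_TO_SPACE = 0x1  # placeholder bit, not the real WXE_ADJACENT_TO_SPACE value

def join_words(words):
    """words: list of (text, flags) pairs in reading order.

    Appends one space after every word whose flags contain
    ADJACENT_TO_SPACE; the original number of spaces is unknown.
    """
    parts = []
    for text, flags in words:
        parts.append(text)
        if flags & ADJACENT_TO_SPACE:
            parts.append(" ")
    return "".join(parts).rstrip()

words = [
    ("Hello", ADJACENT_TO_SPACE),
    ("wor", 0),                 # word fragment, no space follows
    ("ld", ADJACENT_TO_SPACE),
    ("again", 0),
]
print(join_words(words))  # Hello world again
```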
noLigatureExp
When true, and the font has a ToUnicode table, it disables the expansion of ligatures using the default ligatures. The default ligatures are:
fi
ff
fl
ffi
ffl
st
oe
OE
When noLigatureExp is true and the font does not have a ToUnicode table, the ligature is expanded based on whether there is a representation of the ligature in the defined codePage. If there is no representation, the ligature is expanded; otherwise, the ligature is not expanded.
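The decision described above can be sketched in Python as follows. The mapping of the listed ligatures to Unicode code points, the helper names, and the use of Python codecs as a stand-in for the "defined codePage" are assumptions made for illustration. The sketch also assumes that ligatures are expanded when noLigatureExp is false, which the text above implies but does not state.

```python
# Default ligatures from the list above, keyed by an assumed Unicode code point.
LIGATURES = {
    "\ufb01": "fi", "\ufb00": "ff", "\ufb02": "fl", "\ufb03": "ffi",
    "\ufb04": "ffl", "\ufb06": "st", "\u0153": "oe", "\u0152": "OE",
}

def representable(ch, code_page):
    """True if the code page has its own code for the ligature."""
    try:
        ch.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

def should_expand(ch, no_ligature_exp, has_to_unicode, code_page):
    if ch not in LIGATURES:
        return False
    if not no_ligature_exp:
        return True                 # assumed default: expand
    if has_to_unicode:
        return False                # option set, ToUnicode table present
    return not representable(ch, code_page)

def extract(text, no_ligature_exp, has_to_unicode, code_page="cp1252"):
    return "".join(
        LIGATURES[c] if should_expand(c, no_ligature_exp, has_to_unicode, code_page) else c
        for c in text
    )
```

For example, with noLigatureExp set and no ToUnicode table, extract("\ufb01sh", True, False, "cp1252") yields "fish" because cp1252 has no code for the fi ligature, whereas the same call with "mac_roman" keeps the ligature because that code page can represent it.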
ignoreCharGaps
When true, it disables converting large character gaps to space characters, so that the word finder reports a character space only when a space character appears in the original PDF content. This option has no effect on tagged PDF.
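As an illustration of gap-to-space conversion, here is a minimal Python sketch. The glyph tuples, the gap_threshold value and the function name are hypothetical; the actual threshold used by the word finder is not documented here.

```python
def text_from_glyphs(glyphs, ignore_char_gaps=False, gap_threshold=3.0):
    """glyphs: list of (char, x_start, x_end) on one line, left to right."""
    out = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        wide_gap = prev_end is not None and x_start - prev_end > gap_threshold
        if wide_gap and not ignore_char_gaps:
            out.append(" ")  # synthesize a space for the large gap
        out.append(char)
        prev_end = x_end
    return "".join(out)

# "ab", a wide gap, "c", a real space character, "d"
glyphs = [("a", 0, 5), ("b", 5.5, 10), ("c", 20, 25), (" ", 25, 28), ("d", 28, 33)]

print(text_from_glyphs(glyphs))                         # ab c d
print(text_from_glyphs(glyphs, ignore_char_gaps=True))  # abc d
```

With ignore_char_gaps set, only the space character that is actually present in the content survives.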
ignoreLineGaps
When true, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF.
noAnnots
When true, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box.
noHyphenDetection
When true, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts that hard hyphen characters really are hard hyphens and keeps them. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused.
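The effect can be sketched in Python. The heuristic used here (a hyphen at the end of a line is treated as a soft hyphen) is an assumption chosen for illustration; how the word finder actually detects soft hyphens is not described above.

```python
def join_lines(lines, no_hyphen_detection=False):
    """Join extracted lines into one string."""
    text = ""
    for line in lines:
        line = line.strip()
        if not no_hyphen_detection and text.endswith("-"):
            text = text[:-1] + line   # assumed soft hyphen: drop it and join
        elif text:
            text += " " + line
        else:
            text = line
    return text

lines = ["text extrac-", "tion quality"]
print(join_lines(lines))                            # text extraction quality
print(join_lines(lines, no_hyphen_detection=True))  # text extrac- tion quality
```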
trustNBSpace
When true, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused.
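A small Python sketch of the difference, assuming words are broken at regular space characters only. The function name and the sample string are hypothetical.

```python
NBSP = "\u00a0"  # non-breaking space

def break_words(text, trust_nb_space=False):
    """Split text into words at regular spaces."""
    if not trust_nb_space:
        text = text.replace(NBSP, " ")  # default: treat NBSP like a normal space
    # split(" ") is used on purpose: str.split() without arguments
    # would also split at the non-breaking space.
    return [w for w in text.split(" ") if w]

sample = "10" + NBSP + "km away"
print(break_words(sample))                       # ['10', 'km', 'away']
print(break_words(sample, trust_nb_space=True))  # ['10\xa0km', 'away']
```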
noExtCharOffset
When true, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve the text extraction efficiency. There is a minor difference in the text extraction performance, and less memory is needed for the extracted word list.
noStyleInfo
When true, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot use PDWordGetNthCharStyle() and PDWordGetStyleTransition() with the output of the word finder.
preserveRedundantChars
When true, it disables detecting and removing redundant characters. Some PDF pages have the same text drawn multiple times on the same spot to get a special visual effect. Normally, those redundant characters are removed from the word finder output.
Since this option may leave extra characters with overlapping bounding boxes, using it together with the disableCharReordering option is recommended for more consistent text extraction results.
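The removal of redundant characters can be pictured with the following Python sketch. The glyph tuples, the tolerance value and the matching rule (same character at nearly the same position) are assumptions for illustration, not the word finder's documented criteria.

```python
def drop_redundant(glyphs, preserve_redundant_chars=False, tolerance=0.5):
    """glyphs: list of (char, x, y) in drawing order."""
    if preserve_redundant_chars:
        return list(glyphs)
    kept = []
    for char, x, y in glyphs:
        duplicate = any(
            c == char and abs(x - kx) <= tolerance and abs(y - ky) <= tolerance
            for c, kx, ky in kept
        )
        if not duplicate:
            kept.append((char, x, y))
    return kept

def as_text(glyphs):
    return "".join(char for char, _, _ in glyphs)

# "Hi" drawn twice with a slight offset, e.g. to simulate bold text
fake_bold = [("H", 0, 0), ("H", 0.2, 0), ("i", 8, 0), ("i", 8.2, 0)]

print(as_text(drop_redundant(fake_bold)))                                 # Hi
print(as_text(drop_redundant(fake_bold, preserve_redundant_chars=True)))  # HHii
```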