PDF Importer Dialog

Global: PDF Importer

Keystroke: Select from menu

Settings in this dialog are used to control how DBT interprets Adobe Portable Document Format (PDF) files when importing them.

This is a complex dialog, because importing PDF files is complex. The dialog is divided into five panels. The General panel contains settings that control the overall process. The other panels contain closely related controls that help fine-tune some particular aspect of the import. Click on a panel title to show the controls for that panel. You can also use Control-Tab to switch from one panel to the next, or Control-Shift-Tab to traverse the panels in reverse order.

General

Show This Dialog for Each Import can be checked to cause DBT to show the PDF Importer dialog each time you import a PDF file. This may be helpful for advanced users who want to fine-tune the importer to produce the best possible results with each file. Note that settings changed in the dialog when it is shown for a specific import are not saved as defaults for the next import.

Use PDF/A Markup when Possible may be checked to take advantage of structural information embedded within a PDF document. Please note, however, the PDF structural information is often rather poor, and the results of importing may be better for some files when this control is left unchecked. For files without structural information embedded, this setting makes no difference.

Detect Ledger Format when Scraping may be checked to enable a feature that attempts to automatically locate tables containing financial data, as found, for example, on monthly utility and bank statements.

Fix Leader Dots should generally be checked. When it is checked, DBT will change long series of leader dots into a leader dot code, which will improve the layout of both the print and braille documents in DBT.
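
As a rough illustration of what Fix Leader Dots does, the sketch below collapses a run of dots into a single marker. The threshold of five dots, the tolerance for one space between dots, and the `<LEADER>` marker are all assumptions made for this example; they are not the actual DBT detection rule or leader dot code.

```python
import re

# Hypothetical stand-in for the DBT leader dot code (not the real code).
LEADER_MARK = "<LEADER>"

# Assumed rule: five or more dots, each optionally separated by one space.
_LEADER_RUN = re.compile(r"\.(?: ?\.){4,}")


def fix_leader_dots(text: str) -> str:
    """Replace each long run of dots with a single leader marker."""
    return _LEADER_RUN.sub(LEADER_MARK, text)


if __name__ == "__main__":
    print(fix_leader_dots("Introduction ........ 12"))
    print(fix_leader_dots("Wait... what"))  # short runs are left alone
```

Short runs such as an ellipsis are left untouched because they fall below the assumed threshold.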

Custom Python Script: is generally left blank. This control allows advanced users to deploy their own Python code for PDF imports. The code will run in the context set up in the Global: Python Options dialog. It must read a PDF file directly and must write a UTF-8 output stream with DBT codes embedded. Contact support@duxsys.com for more information.

Headers

This panel controls how DBT determines what is a page header. Page headers are not included in the imported file, so it may be necessary to re-introduce them in the DBT document after the import is complete. However, this can be simpler than having the text repeatedly included.

Unless you take the time to set up these controls, DBT will not attempt to find running page headers. However, if Use PDF/A Markup when Possible is checked, and a PDF file in fact has such markup, the PDF structure should identify running page headers, and the controls on this panel are not relevant.

Header Height can be left at 0 to avoid detection of running page headers. Or it can be set to the height (or maximum height) of the header to be stripped from each page.

Header May Be Shorter may be checked to avoid having all text less than the Header Height from the top of the page stripped from the imported document. When checked, the remainder of the controls on the panel are enabled. When not checked, all text less than the Header Height from the top of the page will be dropped from the imported document. So Header Height should be set with measured precision unless you check Header May Be Shorter and give other context clues.

End Header After Any Text Matching: contains a list of text fragments, any one of which signals to DBT that this is the bottommost line of the running page header. Only matches that are less than Header Height from the top of the page are considered.

Page Has No Header if No Text Matches may be confidently checked if you have carefully set up the other controls on this panel. When checked, this tells DBT that, when no text is matched to any fragment in the prior control, the page is presumed not to have a running head, so no text will be omitted from the top of the page. However, when not checked, failure to match any fragment in the prior control will cause DBT to omit all text within Header Height of the top of the page.
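
The way the four header controls interact can be summarized in a short sketch. This is a model of the behavior described above, not DBT's code. The names `HeaderSettings` and `strip_header` are hypothetical, each line is reduced to a single distance from the top of the page, and the sketch assumes that the first matching fragment (reading downward) ends the header.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Line = Tuple[float, str]  # (distance from top of page, text)


@dataclass
class HeaderSettings:
    header_height: float = 0.0            # 0 disables header detection
    may_be_shorter: bool = False          # Header May Be Shorter
    end_fragments: List[str] = field(default_factory=list)
    no_header_if_no_match: bool = False   # Page Has No Header if No Text Matches


def strip_header(lines: List[Line], s: HeaderSettings) -> List[Line]:
    """Return the lines of one page that survive header removal."""
    if s.header_height <= 0:
        return list(lines)
    ordered = sorted(lines, key=lambda ln: ln[0])
    zone = [ln for ln in ordered if ln[0] < s.header_height]
    body = [ln for ln in ordered if ln[0] >= s.header_height]
    if not s.may_be_shorter:
        return body                       # everything in the zone is dropped
    for i, (_, text) in enumerate(zone):
        if any(frag in text for frag in s.end_fragments):
            return zone[i + 1:] + body    # header ends at the matching line
    # No fragment matched inside the zone.
    return ordered if s.no_header_if_no_match else body


if __name__ == "__main__":
    page = [(10.0, "Annual Report"), (22.0, "Page 4"), (40.0, "Intro"), (90.0, "Body")]
    s = HeaderSettings(header_height=50, may_be_shorter=True, end_fragments=["Page "])
    print(strip_header(page, s))
```

In the usage example, the line "Intro" lies within Header Height of the top of the page but is kept, because the header is taken to end at the "Page 4" line.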

Footers

This panel controls how DBT determines what is a page footer. Page footers are not included in the imported file, so it may be necessary to re-introduce them in the DBT document after the import is complete. However, this can be simpler than having the text repeatedly included.

Unless you take the time to set up these controls, DBT will not attempt to find running page footers. However, if Use PDF/A Markup when Possible is checked, and a PDF file in fact has such markup, the PDF structure should identify running page footers, and the controls on this panel are not relevant.

Footer Height can be left at 0 to avoid detection of running page footers. Or it can be set to the height (or maximum height) of the footer to be stripped from each page.

Footer May Be Shorter may be checked to avoid having all text less than the Footer Height from the bottom of the page stripped from the imported document. When checked, the remainder of the controls on the panel are enabled. When not checked, all text less than the Footer Height from the bottom of the page will be dropped from the imported document. So Footer Height should be set with measured precision unless you check Footer May Be Shorter and give other context clues.

Begin Footer With Any Text Matching: contains a list of text fragments, any one of which signals to DBT that this is the topmost line of the running page footer. Only matches that are less than Footer Height from the bottom of the page are considered.

Page Has No Footer if No Text Matches may be confidently checked if you have carefully set up the other controls in this panel. When checked, this tells DBT that, when no text is matched to any fragment in the prior control, the page is presumed not to have a running footer, so no text will be omitted from the bottom of the page. However, when not checked, failure to match any fragment in the prior control will cause DBT to omit all text within Footer Height of the bottom of the page.

Fonts

As is generally the case, these controls are only used when embedded structural information is not used for the import — when a document does not contain such information or when Use PDF/A Markup when Possible is not checked.

Use Font Hints for Headers, when checked, causes DBT to look for less common and larger fonts in order to mark paragraphs as headers.

Retain Bold, when checked, causes DBT to attempt to mark text as bold where it is bold in the PDF file.

Retain Italic, when checked, causes DBT to attempt to mark text as italic where it is italic in the PDF file.

Retain Underline, when checked, causes DBT to attempt to mark text as underlined where it is underlined in the PDF file.

Scraping Setup

These controls determine how text is grouped into words, lines, and boxes, when embedded structural information is not used for the import — when a document does not contain such information or when Use PDF/A Markup when Possible is not checked.

The parameters below are fairly technical, and are passed directly to pdfminer.six, which DBT uses to parse PDF files and group text into words, lines, and boxes. The parameters are more fully described in the pdfminer.six documentation, specifically in the portion describing LAParams, with a summary below.

Before reading details, it helps to understand that there are essentially three steps to page analysis:

1. Characters are grouped into lines. Text will be placed in separate lines when the characters are not sufficiently aligned vertically or when there is excessive space between characters horizontally with no other characters intervening. Thus, it is possible for two separate lines to be laid out alongside each other. At the same time, spaces are likely to be introduced so that each line is separated into proper words.
2. Lines are grouped into groups that generally correspond to paragraphs.
3. A reading order is inferred. This is likely to be the most error-prone portion of the task.

Advanced Layout Analysis is ordinarily checked. This allows the greatest flexibility in determining reading order (step 3). When unchecked, reading order is based only on the position of the bottom left corner of each group of lines (paragraph). Unchecking this setting is recommended only if doing so improves the reading order of a document.

Line Overlap: This and Character Margin determine when two pieces of text are regarded as being on the same line. The default is 0.5. A smaller value will tend to separate text; a larger value will tend to combine text where the baselines are not really quite aligned.

Character Margin: This and Line Overlap determine when two pieces of text are regarded as being on the same line. The default is 2.0. A smaller value will tend to separate text; a larger value will tend to combine text where the text is horizontally separated.

Word Margin is used to ensure that spaces are introduced where needed. PDF files without structural information don't generally contain spaces. The presence of spaces between words must be inferred based upon the distance between characters. The default value for Word Margin is 0.1. A smaller value will tend to introduce more spaces into the imported file. A larger value will tend to introduce fewer. Unfortunately, for many files, no value will produce perfect results, so it is a good idea to proofread the imported document, looking for extra or missing spaces, before proceeding.
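
The effect of Word Margin can be seen in a simplified model of space insertion. The sketch below is not pdfminer.six code: it assumes each character is reduced to its left and right edges, and that a space is inferred when the gap before a character exceeds Word Margin times that character's width. The name `join_chars` is hypothetical.

```python
from typing import List, Tuple

Char = Tuple[str, float, float]  # (character, left edge x0, right edge x1)


def join_chars(chars: List[Char], word_margin: float = 0.1) -> str:
    """Join characters on one line, inferring spaces from horizontal gaps."""
    out = []
    prev_x1 = None
    for ch, x0, x1 in chars:
        width = x1 - x0
        # Assumed rule: a gap wider than word_margin * width implies a space.
        if prev_x1 is not None and x0 - prev_x1 > word_margin * width:
            out.append(" ")
        out.append(ch)
        prev_x1 = x1
    return "".join(out)


if __name__ == "__main__":
    line = [("a", 0.0, 10.0), ("b", 10.5, 20.5), ("c", 23.0, 33.0)]
    for margin in (0.01, 0.1, 0.5):
        print(margin, repr(join_chars(line, margin)))
```

With the same character positions, a smaller value yields more spaces and a larger value yields fewer, which is why no single value suits every file.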

Line Margin controls the grouping of lines into paragraphs based on the vertical spacing. The default value is 0.5. A smaller value will cause more paragraph breaks. A larger value will group lines into larger paragraphs.
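
A similar simplified model shows how Line Margin affects paragraph breaks. The sketch assumes lines are given from top to bottom in PDF coordinates (y increases upward) and that a line joins the previous paragraph when the vertical gap is at most Line Margin times the line's height. This is an approximation for illustration, and `group_lines` is a hypothetical name.

```python
from typing import List, Tuple

TextLine = Tuple[str, float, float]  # (text, bottom edge y0, top edge y1)


def group_lines(lines: List[TextLine], line_margin: float = 0.5) -> List[List[str]]:
    """Group top-to-bottom lines into paragraphs by vertical spacing."""
    paragraphs: List[List[str]] = []
    prev_y0 = None
    for text, y0, y1 in lines:
        height = y1 - y0
        # Gap between the bottom of the previous line and the top of this one.
        if prev_y0 is not None and prev_y0 - y1 <= line_margin * height:
            paragraphs[-1].append(text)
        else:
            paragraphs.append([text])
        prev_y0 = y0
    return paragraphs


if __name__ == "__main__":
    lines = [("L1", 700.0, 710.0), ("L2", 688.0, 698.0), ("L3", 660.0, 670.0)]
    for margin in (0.1, 0.5, 2.0):
        print(margin, group_lines(lines, margin))
```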

Boxes Flow is used to control the inferred reading order. The default value is 0.5, which means that vertical position matters more than horizontal position, though horizontal position is a factor. Valid values range from -1.0, which means that only horizontal position matters, to 1.0, which means that only vertical position matters. Note that this control is disabled and unused when Advanced Layout Analysis is unchecked. (Having Advanced Layout Analysis unchecked is equivalent to passing None for boxes_flow.)
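
To make the weighting concrete, the sketch below orders boxes with a single sort key that blends horizontal and vertical position according to Boxes Flow. The key is modeled on the one pdfminer.six applies within a group of boxes. The real analysis also groups boxes hierarchically first, so treat this as a simplified picture. The name `reading_order` and the sample coordinates are made up for the example.

```python
from typing import List, Tuple

Box = Tuple[str, float, float, float]  # (label, x0, y0, y1); y increases upward


def reading_order(boxes: List[Box], boxes_flow: float = 0.5) -> List[str]:
    """Return box labels sorted by a blend of horizontal and vertical position."""
    if not -1.0 <= boxes_flow <= 1.0:
        raise ValueError("boxes_flow must be between -1.0 and 1.0")

    def key(box: Box) -> float:
        _, x0, y0, y1 = box
        # At 1.0 only the vertical term remains; at -1.0 only the horizontal.
        return (1 - boxes_flow) * x0 - (1 + boxes_flow) * (y0 + y1)

    return [b[0] for b in sorted(boxes, key=key)]


if __name__ == "__main__":
    # Two columns; the right column starts slightly higher on the page.
    boxes = [
        ("left-top", 50.0, 700.0, 750.0),
        ("right-top", 300.0, 705.0, 755.0),
        ("left-bottom", 50.0, 100.0, 400.0),
    ]
    for flow in (-1.0, 0.5, 1.0):
        print(flow, reading_order(boxes, flow))
```

In this example, a value of -1.0 reads the whole left column before the right one, 1.0 starts with the slightly higher right-hand box, and the default 0.5 falls between the two.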

Detect Vertical determines whether text that is vertically aligned may be grouped into a single word or line.

All Texts determines whether text embedded in a figure will be included in the imported document.
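
For readers who want to experiment with these values outside DBT, the sketch below shows how the controls on this panel might map onto the keyword arguments of `LAParams` in pdfminer.six. The function name `laparams_kwargs` is hypothetical. The `False` defaults for Detect Vertical and All Texts are pdfminer.six's own defaults and are only assumed to match DBT's. The import is guarded so the mapping can be inspected even where pdfminer.six is not installed.

```python
def laparams_kwargs(
    advanced_layout: bool = True,
    line_overlap: float = 0.5,
    char_margin: float = 2.0,
    word_margin: float = 0.1,
    line_margin: float = 0.5,
    boxes_flow: float = 0.5,
    detect_vertical: bool = False,
    all_texts: bool = False,
) -> dict:
    """Translate dialog-style settings into LAParams keyword arguments."""
    return {
        "line_overlap": line_overlap,
        "char_margin": char_margin,
        "word_margin": word_margin,
        "line_margin": line_margin,
        # Unchecking Advanced Layout Analysis corresponds to boxes_flow=None.
        "boxes_flow": boxes_flow if advanced_layout else None,
        "detect_vertical": detect_vertical,
        "all_texts": all_texts,
    }


kwargs = laparams_kwargs()

try:
    from pdfminer.layout import LAParams  # requires pdfminer.six
    laparams = LAParams(**kwargs)
except ImportError:
    laparams = None  # pdfminer.six not available; kwargs can still be inspected

if __name__ == "__main__":
    print(kwargs)
```

When pdfminer.six is installed, the resulting object can be passed to `extract_text(path, laparams=laparams)` from `pdfminer.high_level` to preview how a change in one value affects the extracted text.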