
core.pdf

PDF parsing utils, including OCR.

© 2024 - Aurélien Pierre

Functions

core.pdf.ocr_pdf

ocr_pdf(
    document: bytes,
    output_images: bool = False,
    path: str | None = None,
    repair: int = 1,
    upscale: int = 3,
    contrast: float = 1.5,
    sharpening: float = 1.2,
    threshold: float = 0.4,
    tesseract_lang: str = "eng+fra+equ",
    tesseract_bin: str | None = None,
) -> str

Extract text from a PDF using OCR through Tesseract. Both the PyTesseract Python binding package and the Tesseract binaries need to be installed.

To run on a server where you don’t have sudo access to install packages, you will need to download the Tesseract AppImage package and pass its path to the tesseract_bin argument.
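As a minimal sketch of that fallback, the helper below decides which value to pass as tesseract_bin. The function name resolve_tesseract_bin and its injectable which argument are hypothetical and not part of core.pdf; only shutil.which and os.path come from the standard library.

```python
import os
import shutil


def resolve_tesseract_bin(appimage_path=None, which=shutil.which):
    """Return a value suitable for the `tesseract_bin` argument.

    Hypothetical helper, not part of core.pdf.
    """
    # If Tesseract is already on the PATH, None lets PyTesseract use it.
    if which("tesseract"):
        return None
    # Otherwise fall back to a user-provided AppImage, if the file exists.
    if appimage_path and os.path.isfile(appimage_path):
        return os.path.abspath(appimage_path)
    raise FileNotFoundError(
        "Tesseract is not on the PATH and no usable AppImage path was given."
    )
```

The result can then be passed as ocr_pdf(document, tesseract_bin=resolve_tesseract_bin("/path/to/your/tesseract.AppImage")), where the AppImage path is a placeholder for wherever you saved the download.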

Tesseract uses machine learning to identify words and needs the relevant language models to be installed on the system as well. Linux packaged versions of Tesseract generally seem to ship the French, English and equations (math) models by default. Other languages need to be installed manually; see the Tesseract docs for available packages. Use pytesseract.get_languages(config='') to list the language packages installed locally.
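To check a tesseract_lang string against the installed models before starting a long OCR run, something like the following sketch can help. The helper name missing_languages is hypothetical; it only compares strings and does not call Tesseract itself.

```python
def missing_languages(requested: str, installed: list[str]) -> list[str]:
    """List the codes from a `+`-joined language string that are not installed.

    Hypothetical helper, not part of core.pdf.
    """
    # "eng+fra+equ" -> ["eng", "fra", "equ"], ignoring empty fragments.
    wanted = [code for code in requested.split("+") if code]
    available = set(installed)
    # Keep the order in which the languages were requested.
    return [code for code in wanted if code not in available]
```

With PyTesseract installed, the second argument would typically be pytesseract.get_languages(config=''), for example missing_languages("eng+fra+equ", pytesseract.get_languages(config='')). An empty list means every requested model is available.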

The OCR is preceded by an image processing step aiming at text reconstruction, by sharpening, increasing contrast and iteratively reconstructing holes in letters using an inpainting method in wavelets space. This is computationally expensive and may not be suitable for running on a server.
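To build intuition for the contrast and threshold parameters documented below, here is a minimal sketch of a contrast stretch around a fulcrum. It assumes a simple linear stretch on values normalized to [0, 1]; the curve actually used by ocr_pdf is not specified here and may differ. The function name apply_contrast is hypothetical.

```python
def apply_contrast(value: float, contrast: float = 1.5, threshold: float = 0.4) -> float:
    """Push a normalized pixel value away from `threshold` (the fulcrum).

    Assumption: linear stretch, clipped to [0, 1]. Illustrative only.
    """
    # The distance to the fulcrum is scaled by the contrast factor,
    # so contrast=1.0 leaves the value unchanged.
    stretched = threshold + (value - threshold) * contrast
    # Clip to the valid range.
    return min(1.0, max(0.0, stretched))
```

Under this assumption, values above the threshold get brighter and values below it get darker, which is why the threshold should sit between the ink and paper intensities.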

PARAMETER DESCRIPTION
document

the PDF document to open.

TYPE: bytes

output_images

when set to True, each page of the document is saved as a PNG in the path directory, before and after contrast enhancement. This is useful to tune the image contrast and sharpness enhancements prior to OCR.

TYPE: bool DEFAULT: False

repair

number of iterations of enhancements (sharpening, contrast and inpainting) to perform. More iterations take longer, and too many iterations might simplify the geometry of letters (as if they were fluid and would drip, removing corners and pointy ends) in a way that actually degrades OCR.

TYPE: int DEFAULT: 1

upscale

upscaling factor to apply before enhancement. This can help recover ink leaks but takes more memory and time to compute.

TYPE: int DEFAULT: 3

contrast

1.0 is the neutral value. Moves RGB values farther away from the threshold.

TYPE: float DEFAULT: 1.5

sharpening

1.0 is the neutral value. Increases sharpness. Values too high can produce ringing (replicated ghost edges).

TYPE: float DEFAULT: 1.2

threshold

the reference value (fulcrum) for contrast enhancement. Good values are typically in the range 0.20-0.50.

TYPE: float DEFAULT: 0.4

tesseract_lang

the Tesseract command argument -l defining the language models to use for OCR. Languages are referenced by 3-letter codes (largely ISO 639-2). See the Tesseract docs for syntax and meaning. You can mix several languages by joining them with +.

TYPE: str DEFAULT: 'eng+fra+equ'

tesseract_bin

the path to the Tesseract executable if it is not on the system PATH. This is passed as-is to pytesseract.pytesseract.tesseract_cmd of the PyTesseract binding library.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
str

All the retrieved text from all the PDF pages as a single string. No pagination is done.

RAISES DESCRIPTION
RuntimeError

when a requested language package is not installed for Tesseract.
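Note that document is the raw PDF content as bytes, not a file path. A minimal sketch of a wrapper that reads a local file and hands its bytes to an ocr_pdf-like callable follows; the name ocr_file is hypothetical, and the callable is passed in so the sketch does not depend on how core.pdf is imported in your project.

```python
from pathlib import Path


def ocr_file(pdf_path, ocr_func, **options):
    """Read a PDF from disk and pass its bytes to an `ocr_pdf`-like callable.

    Hypothetical wrapper, not part of core.pdf.
    """
    # ocr_pdf expects the document content as bytes.
    document = Path(pdf_path).read_bytes()
    # Keyword options (repair, upscale, tesseract_lang...) are forwarded as-is.
    return ocr_func(document, **options)
```

Assuming ocr_pdf is importable from core.pdf, a call could look like ocr_file("scan.pdf", ocr_pdf, upscale=2, tesseract_lang="eng"), where scan.pdf is a placeholder file name.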

core.pdf.get_pdf_content

get_pdf_content(
    url: str,
    lang: str,
    delay: DelayedClass,
    file_path: str = None,
    process_outline: bool = True,
    category: str = None,
    ocr: int = 1,
    max_size: int = 20,
    max_pages: int = 20,
    custom_header: dict = {},
    **kwargs: dict
) -> list[web_page]

Retrieve a PDF document through the network with HTTP GET or from the local filesystem, and parse its text content, using OCR if needed. This needs a functional network connection if file_path is not provided.

PARAMETER DESCRIPTION
url

the online address of the document, or of the download page if the document is not directly accessible from a GET request (for some old-school websites where downloads are initiated from a POST request to some PHP form handler, or for publications behind a paywall).

TYPE: str

lang

the ISO code of the language.

TYPE: str

file_path

local path to the PDF file if the URL can’t be directly fetched by GET request. The content will be extracted from the local file but the original/remote URL will still be referenced as the source.

TYPE: str DEFAULT: None

process_outline

set to True to split the document according to its outline (table of contents), so that each section is treated as a document in itself. PDF pages are processed in full, so sections are at least one page long and there will be some overlap between them.

TYPE: bool DEFAULT: True

category

arbitrary category or label set by user

TYPE: str DEFAULT: None

ocr
  • 0 disables any attempt at using OCR,
  • 1 enables OCR through Tesseract if no text was found in the PDF document,
  • 2 forces OCR through Tesseract even when text was found in the PDF document.

See core.pdf.ocr_pdf for info regarding the Tesseract environment. You will need to manually disable

TYPE: int DEFAULT: 1

max_size

when attempting OCR on PDF files, files larger than this value (in MiB) will be ignored.

TYPE: int DEFAULT: 20

max_pages

when attempting OCR on PDF files, files having more pages than this value will be ignored.

TYPE: int DEFAULT: 20

custom_header

optional HTTP headers used to form the request that will download the PDF.

TYPE: dict DEFAULT: {}

**kwargs

passed through directly to core.pdf.ocr_pdf. See that function's documentation for more info.

TYPE: dict

RETURNS DESCRIPTION
list[web_page]

a list of core.types.web_page objects holding the text content and the PDF metadata.
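The interplay between ocr, max_size and max_pages described above can be summarized with the following sketch. It is one reading of the parameter descriptions, not the actual implementation: the function name should_run_ocr and the assumption that the size and page limits also apply when ocr is 2 are illustrative.

```python
MIB = 1024 * 1024  # max_size is expressed in MiB


def should_run_ocr(
    ocr: int,
    text_found: bool,
    size_bytes: int,
    n_pages: int,
    max_size: int = 20,
    max_pages: int = 20,
) -> bool:
    """Decide whether OCR would be attempted on a PDF (illustrative only)."""
    # ocr=0 disables any attempt at using OCR.
    if ocr == 0:
        return False
    # Files that are too large or have too many pages are ignored for OCR.
    # Assumption: these limits apply to both ocr=1 and ocr=2.
    if size_bytes > max_size * MIB or n_pages > max_pages:
        return False
    # ocr=2 forces OCR even when embedded text was found.
    if ocr == 2:
        return True
    # ocr=1 runs OCR only when the PDF has no extractable text.
    return not text_found
```

For instance, under these assumptions a 5 MiB scanned PDF of 12 pages with no embedded text would go through OCR with the default settings, while a 30 MiB file would not.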