core.pdf

PDF parsing utils, including OCR.

Functions:

core.pdf.ocr_pdf

ocr_pdf(
    document: bytes,
    output_images: bool = False,
    path: str | None = None,
    repair: int = 1,
    upscale: int = 3,
    contrast: float = 1.5,
    sharpening: float = 1.2,
    threshold: float = 0.4,
    tesseract_lang: str = "eng+fra+equ",
    tesseract_bin: str | None = None,
) -> str

Extract text from a PDF using OCR through Tesseract. Both the Python binding package PyTesseract and the Tesseract binaries need to be installed.

To run on a server where you don’t have sudo access to install packages, you will need to download the Tesseract AppImage package and pass its path to the tesseract_bin argument.
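
One way to choose the value passed to tesseract_bin is to look for a system-wide install first and fall back to the AppImage only when none is found. The sketch below is an illustration only: `pick_tesseract_bin` and the AppImage path are hypothetical names, not part of core.pdf.

```python
import shutil
from typing import Callable, Optional


def pick_tesseract_bin(
    appimage_path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return None when `tesseract` is on the PATH, else the AppImage path."""
    if which("tesseract") is not None:
        # A system-wide install was found: the default (None) is enough.
        return None
    # No system install: point the OCR at the user-downloaded AppImage.
    return appimage_path
```

The returned value could then be forwarded as `tesseract_bin=` when calling ocr_pdf.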

Tesseract uses machine learning to identify words and needs the relevant language models to be installed on the system as well. Linux packaged versions of Tesseract seem to generally ship the French, English and equations (math) models by default. Other languages need to be installed manually, see the Tesseract docs for available packages. Use pytesseract.get_languages(config='') to list the language packages installed locally.
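
Checking the requested languages before launching a long OCR job can catch a missing model early. This is a minimal sketch: `missing_languages` and `installed_languages` are hypothetical helpers, and only `pytesseract.get_languages` comes from PyTesseract itself.

```python
def missing_languages(requested: str, installed: list[str]) -> list[str]:
    """Return the codes from a `+`-joined language string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def installed_languages() -> list[str]:
    """Ask PyTesseract which language models it can see (needs Tesseract installed)."""
    import pytesseract  # imported lazily so the helper above works without it

    return pytesseract.get_languages(config="")
```

For instance, `missing_languages("eng+fra+equ", installed_languages())` would return an empty list on a system where the three default models are present.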

The OCR is preceded by an image processing step aiming at text reconstruction, by sharpening, increasing contrast and iteratively reconstructing holes in letters using an inpainting method in wavelet space. This is computationally expensive, so it may not be suitable to run on a server.

PARAMETER	DESCRIPTION
`document`	the PDF document to open. TYPE: `bytes`
`output_images`	when set to `True`, each page of the document is saved as a PNG in the `path` directory before and after contrast enhancement. This is useful to tune the image contrast and sharpness enhancements prior to OCR. TYPE: `bool` DEFAULT: `False`
`repair`	number of iterations of enhancement (sharpening, contrast and inpainting) to perform. More iterations take longer, and too many iterations might simplify the geometry of the letters (as if they were fluid and dripping, losing corners and pointy ends) in a way that actually degrades OCR. TYPE: `int` DEFAULT: `1`
`upscale`	upscaling factor to apply before enhancement. This can help recover ink leaks but takes more memory and time to compute. TYPE: `int` DEFAULT: `3`
`contrast`	`1.0` is the neutral value. Moves RGB values farther away from the threshold. TYPE: `float` DEFAULT: `1.5`
`sharpening`	`1.0` is the neutral value. Increases sharpness. Values too high can produce ringing (replicated ghost edges). TYPE: `float` DEFAULT: `1.2`
`threshold`	the reference value (fulcrum) for contrast enhancement. Good values are typically in the range 0.20-0.50. TYPE: `float` DEFAULT: `0.4`
`tesseract_lang`	the Tesseract command argument `-l` defining the language models to use for OCR. Languages are referenced by their 3-letter ISO 639 code (for example `eng` or `fra`). See the Tesseract docs for syntax and meaning. You can mix several languages by joining them with `+`. TYPE: `str` DEFAULT: `'eng+fra+equ'`
`tesseract_bin`	the path to the Tesseract executable if it is not in the global CLI path. This is passed as-is to `pytesseract.pytesseract.tesseract_cmd` of the PyTesseract binding library. TYPE: `str \| None` DEFAULT: `None`
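
To give an intuition of how `contrast` and `threshold` interact, the sketch below scales the distance between a normalized pixel value and the fulcrum. The linear formula and the function name are assumptions made for illustration, based only on the parameter descriptions above, and may differ from what ocr_pdf actually computes.

```python
def contrast_around_threshold(
    value: float, contrast: float = 1.5, threshold: float = 0.4
) -> float:
    """Scale the distance between a normalized value (0..1) and the fulcrum, then clip."""
    # Assumed formula: values above the threshold get brighter, values below get darker.
    stretched = threshold + (value - threshold) * contrast
    return min(1.0, max(0.0, stretched))
```

With this model, a value equal to `threshold` is left unchanged and `contrast=1.0` is neutral, which matches the descriptions in the table.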

RETURNS	DESCRIPTION
`str`	All the retrieved text from all the PDF pages as a single string. No pagination is done.

RAISES	DESCRIPTION
`RuntimeError`	when a language package is requested while Tesseract has no such package installed.
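
Since ocr_pdf takes the document as `bytes` and may raise `RuntimeError`, a caller will typically read the file first and guard the call. In the sketch below, `ocr_file` is a hypothetical wrapper and the OCR function is passed in as an argument, so the example stays runnable without the library installed.

```python
from pathlib import Path
from typing import Callable


def ocr_file(pdf_path: str, ocr_func: Callable[..., str], **options) -> str:
    """Read a PDF from disk and pass its bytes to an OCR callable such as ocr_pdf."""
    document = Path(pdf_path).read_bytes()  # ocr_pdf expects bytes, not a path
    try:
        return ocr_func(document, **options)
    except RuntimeError as error:
        # Documented failure mode: a requested language model is not installed.
        print(f"OCR failed for {pdf_path}: {error}")
        return ""
```

With the library available, this could be called as `ocr_file("scan.pdf", ocr_pdf, tesseract_lang="eng", repair=1)`, assuming `ocr_pdf` is importable from `core.pdf` as the heading of this section suggests.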

core.pdf.get_pdf_content

get_pdf_content(
    url: str,
    lang: str,
    delay: DelayedClass,
    file_path: str = None,
    process_outline: bool = True,
    category: str = None,
    ocr: int = 1,
    max_size: int = 20,
    max_pages: int = 20,
    custom_header: dict = {},
    **kwargs: dict
) -> list[web_page]

Retrieve a PDF document over the network with HTTP GET or from the local filesystem, and parse its text content, using OCR if needed. This needs a functional network connection if file_path is not provided.

PARAMETER	DESCRIPTION
`url`	the online address of the document, or of the download page if the document is not directly accessible through a GET request (for example old-school websites where downloads are initiated by a POST request to some PHP form handler, or publications behind a paywall). TYPE: `str`
`lang`	the ISO code of the language. TYPE: `str`
`file_path`	local path to the PDF file if the URL can’t be directly fetched by GET request. The content will be extracted from the local file but the original/remote URL will still be referenced as the source. TYPE: `str` DEFAULT: `None`
`process_outline`	set to `True` to split the document according to its outline (table of contents), so each section is treated as a document in itself. PDF pages are processed in full, so sections are at least one page long and there will be some overlap between them. TYPE: `bool` DEFAULT: `True`
`category`	arbitrary category or label set by the user. TYPE: `str` DEFAULT: `None`
`ocr`	`0` disables any attempt at using OCR, `1` enables OCR through Tesseract if no text was found in the PDF document, `2` forces OCR through Tesseract even when text was found in the PDF document. See core.pdf.ocr_pdf for info regarding the Tesseract environment. You will need to manually disable TYPE: `int` DEFAULT: `1`
`max_size`	when attempting OCR on PDF files, files larger than this value (in MiB) will be ignored. TYPE: `int` DEFAULT: `20`
`max_pages`	when attempting OCR on PDF files, files having more pages than this value will be ignored. TYPE: `int` DEFAULT: `20`
`custom_header`	optional HTTP headers used to form the request that will download the PDF. TYPE: `dict` DEFAULT: `{}`

PARAMETER	DESCRIPTION
`**kwargs`	passed directly through to core.pdf.ocr_pdf. See that function's documentation for more info. TYPE: `dict`
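
The `ocr`, `max_size` and `max_pages` descriptions above can be read as a small decision rule. The sketch below is one interpretation of those descriptions, not the library's actual code: `should_run_ocr` is a hypothetical helper, and it assumes the size and page limits apply to both OCR modes.

```python
def should_run_ocr(
    ocr: int,
    text_found: bool,
    size_mib: float,
    n_pages: int,
    max_size: int = 20,
    max_pages: int = 20,
) -> bool:
    """Decide whether OCR would be attempted, following the parameter descriptions."""
    if ocr == 0:
        return False  # OCR fully disabled
    if size_mib > max_size or n_pages > max_pages:
        return False  # file too large or too long: ignored for OCR
    if ocr == 1:
        return not text_found  # fallback mode: only when no text was found
    return True  # ocr == 2: forced even if text was found
```

Under this reading, a 25 MiB scan is skipped even with `ocr=2` unless `max_size` is raised.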

RETURNS	DESCRIPTION
`list[web_page]`	a list of core.types.web_page objects holding the text content and the PDF metadata
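
When `process_outline` is enabled, the returned list holds one entry per outline section, and whole pages are shared between neighbouring sections. The sketch below shows one way such overlapping page ranges could be derived. It is an assumption made for illustration: `section_page_ranges` is hypothetical, pages are taken to be 0-based, and the outline is assumed to be sorted by starting page.

```python
def section_page_ranges(
    outline: list[tuple[str, int]], total_pages: int
) -> list[tuple[str, int, int]]:
    """Map (title, first_page) outline entries to inclusive (title, first, last) ranges."""
    ranges = []
    for index, (title, first_page) in enumerate(outline):
        if index + 1 < len(outline):
            # The next section may start midway through a page, so that page
            # is kept in both sections: this is where the overlap comes from.
            last_page = max(first_page, outline[index + 1][1])
        else:
            last_page = total_pages - 1  # the last section runs to the end
        ranges.append((title, first_page, last_page))
    return ranges
```

Two sections starting on the same page both receive that page in full, which is why a section is never shorter than one page.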