core.pdf
PDF parsing utils, including OCR.
© 2024 - Aurélien Pierre
Functions
core.pdf.ocr_pdf
ocr_pdf(
document: bytes,
output_images: bool = False,
path: str | None = None,
repair: int = 1,
upscale: int = 3,
contrast: float = 1.5,
sharpening: float = 1.2,
threshold: float = 0.4,
tesseract_lang: str = "eng+fra+equ",
tesseract_bin: str | None = None,
) -> str
Extract text from a PDF using OCR through Tesseract. Both the Python bindings package (PyTesseract) and the Tesseract binaries need to be installed.
To run on a server where you don’t have sudo access to install packages, download the Tesseract AppImage package and pass its path to the tesseract_bin argument.
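Before handing a custom binary path to tesseract_bin, it can help to check that the file exists and is executable (an AppImage usually needs its executable bit set). The following is a minimal sketch using only the standard library; `resolve_tesseract` is a hypothetical helper, not part of core.pdf.

```python
import os
import shutil
from typing import Optional


def resolve_tesseract(tesseract_bin: Optional[str] = None) -> Optional[str]:
    """Return a usable Tesseract executable path, or None if none is found.

    Hypothetical helper: with no explicit path, look `tesseract` up on the
    global CLI path; otherwise accept the given path only if it points to
    an existing, executable file.
    """
    if tesseract_bin is None:
        return shutil.which("tesseract")
    if os.path.isfile(tesseract_bin) and os.access(tesseract_bin, os.X_OK):
        return tesseract_bin
    return None
```

A `None` result means there is nothing usable to pass on, so the caller can fail early with a clear message instead of letting OCR crash later.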
Tesseract uses machine learning to identify words and needs the relevant language models to be installed on the system as well. Linux-packaged versions of Tesseract generally seem to ship the French, English and equations (math) models by default. Other languages need to be installed manually; see the Tesseract docs for available packages. Use pytesseract.get_languages(config='') to list the language packages installed locally.
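As a rough illustration, the language string passed as tesseract_lang can be compared against the locally installed packages before running OCR. In this sketch, `missing_languages` and `installed_languages` are hypothetical helpers; only `pytesseract.get_languages` comes from PyTesseract, and it is imported lazily so the comparison itself runs without it.

```python
def missing_languages(requested: str, available: list[str]) -> list[str]:
    """Return the codes of a '+'-joined language string that are not in `available`."""
    wanted = [code for code in requested.split("+") if code]
    return [code for code in wanted if code not in available]


def installed_languages() -> list[str]:
    """List locally installed language packages (needs PyTesseract and Tesseract)."""
    import pytesseract  # lazy import: missing_languages() works without it

    return pytesseract.get_languages(config="")


# A hard-coded list stands in for installed_languages() here.
print(missing_languages("eng+fra+equ", ["eng", "osd"]))  # prints ['fra', 'equ']
```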
The OCR is preceded by an image processing step aiming at text reconstruction: sharpening, increasing contrast and iteratively reconstructing holes in letters using an inpainting method in wavelet space. This is computationally expensive and may not be suitable to run on a server.
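To give an intuition of how the contrast and threshold arguments could interact, here is a minimal sketch of a contrast curve that pivots around a fulcrum. The power-law formula is an assumption chosen for illustration, not necessarily what ocr_pdf implements, and `fulcrum_contrast` is a hypothetical name.

```python
def fulcrum_contrast(value: float, contrast: float = 1.5, threshold: float = 0.4) -> float:
    """Apply a power-law contrast curve pivoting around `threshold`.

    Assumed formula (illustration only): out = threshold * (value / threshold) ** contrast.
    Pixel values are expected in [0, 1]; the result is clipped to that range.
    """
    if threshold <= 0.0:
        raise ValueError("threshold must be strictly positive")
    value = min(max(value, 0.0), 1.0)
    boosted = threshold * (value / threshold) ** contrast
    return min(boosted, 1.0)


# With contrast > 1, values below the fulcrum get darker, values above it get
# brighter, and a value equal to the fulcrum is left unchanged.
for pixel in (0.2, 0.4, 0.8):
    print(pixel, "->", round(fulcrum_contrast(pixel), 3))
```

With such a curve, the threshold sets which grey level separates ink, which is pushed darker, from paper, which is pushed brighter.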
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| document | bytes | the PDF document to open. |
| output_images | bool | set to |
| repair | int | number of iterations of enhancements (sharpening, contrast and inpainting) to perform. More iterations take longer, and too many iterations might simplify the geometry of letters (as if they were fluid and would drip, removing corners and pointy ends) in a way that actually degrades OCR. |
| upscale | int | upscaling factor to apply before enhancement. This can help recover ink leaks but takes more memory and time to compute. |
| contrast | float | |
| sharpening | float | |
| threshold | float | the reference value (fulcrum) for contrast enhancement. Good values are typically in the range 0.20-0.50. |
| tesseract_lang | str | the Tesseract command argument |
| tesseract_bin | str \| None | the path to the Tesseract executable if it is not in the global CLI path. This is passed as-is to |
| RETURNS | DESCRIPTION |
|---|---|
| str | All the retrieved text from all the PDF pages as a single string. No pagination is done. |
| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | when a language package is requested but Tesseract has no such package installed. |
core.pdf.get_pdf_content
get_pdf_content(
url: str,
lang: str,
delay: DelayedClass,
file_path: str = None,
process_outline: bool = True,
category: str = None,
ocr: int = 1,
max_size: int = 20,
max_pages: int = 20,
custom_header: dict = {},
**kwargs: dict
) -> list[web_page]
Retrieve a PDF document through the network with HTTP GET or from the local filesystem, and parse its text content, using OCR if needed. This needs a functional network connection if file_path is not provided.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| url | str | the online address of the document, or of the download page if the document is not directly accessible through a GET request (for some old-school websites where downloads are initiated from a POST request to some PHP form handler, or for publications behind a paywall). |
| lang | str | the ISO code of the language. |
| file_path | str | local path to the PDF file if the URL can’t be directly fetched by GET request. The content will be extracted from the local file but the original/remote URL will still be referenced as the source. |
| process_outline | bool | set to |
| category | str | arbitrary category or label set by the user. |
| ocr | int | |
| max_size | int | when attempting OCR on PDF files, files larger than this value (in MiB) will be ignored. |
| max_pages | int | when attempting OCR on PDF files, files having more pages than this value will be ignored. |
| custom_header | dict | optional HTTP headers used to form the request that will download the PDF. |
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| **kwargs | dict | directly passed through to core.pdf.ocr_pdf. See that function's documentation for more info. |
| RETURNS | DESCRIPTION |
|---|---|
| list[web_page] | a list of core.types.web_page objects holding the text content and the PDF metadata. |
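The max_size and max_pages limits described above amount to a simple guard evaluated before OCR is attempted. The following sketch shows one way such a check could look; `within_ocr_limits` is a hypothetical helper, and how get_pdf_content actually measures file size and page count is not documented here.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def within_ocr_limits(
    size_bytes: int,
    page_count: int,
    max_size: int = 20,
    max_pages: int = 20,
) -> bool:
    """Return True if a PDF is small enough to be worth sending to OCR.

    Files larger than `max_size` MiB, or with more than `max_pages` pages,
    are rejected; values exactly at the limit are accepted.
    """
    return size_bytes <= max_size * MIB and page_count <= max_pages


print(within_ocr_limits(5 * MIB, 12))   # prints True
print(within_ocr_limits(35 * MIB, 12))  # prints False
```

Since the enhancement step is computationally expensive, checking these cheap criteria first avoids spending time on documents that would be skipped anyway.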