
API Reference¤

core¤

batching¤

High-performance, parallelized high-level methods to process large corpora of documents.

connectors¤

Provides the abstract classes for server and content, which must be implemented for each protocol.
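As an illustration of the pattern only, here is a minimal sketch of abstract server and content classes with one toy implementation. All names here (`Server`, `Content`, `EchoServer`, `EchoContent`, `connect`, `fetch`, `text`) are hypothetical and are not taken from this module; the actual class and method names may differ.

```python
from abc import ABC, abstractmethod


class Content(ABC):
    """Hypothetical base for one item (email, contact, post) held by a server."""

    @abstractmethod
    def text(self) -> str:
        """Return the plain-text body of the item."""


class Server(ABC):
    """Hypothetical base for a protocol connection."""

    @abstractmethod
    def connect(self) -> None:
        """Open the connection; each protocol implements its own handshake."""

    @abstractmethod
    def fetch(self) -> list[Content]:
        """Return the items currently available on the server."""


class EchoContent(Content):
    """Toy content type that simply wraps a string."""

    def __init__(self, body: str) -> None:
        self.body = body

    def text(self) -> str:
        return self.body


class EchoServer(Server):
    """Toy in-memory protocol, used only to show the subclassing pattern."""

    def __init__(self, messages: list[str]) -> None:
        self.messages = messages
        self.connected = False

    def connect(self) -> None:
        self.connected = True

    def fetch(self) -> list[Content]:
        if not self.connected:
            raise RuntimeError("connect() must be called before fetch()")
        return [EchoContent(m) for m in self.messages]
```

Because the base classes declare abstract methods, instantiating `Server` or `Content` directly raises `TypeError`, so a protocol that forgets to implement a method fails early rather than at call time.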

crawler¤

Utilities to crawl websites for the text content of HTML, XML and PDF pages. PDFs are read from their embedded text if they have any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml file or by following internal links recursively from an index page. Pages are aggregated into a list of core.types.web_page objects, meant to be used as input to train natural language AI models and to index and rank pages for search engines.
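The two discovery strategies described above can be sketched with the Python standard library alone. This is not the crawler's own code: the function names `urls_from_sitemap` and `internal_links` are hypothetical, and fetching, recursion, PDF handling and OCR are left out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html: str, page_url: str) -> list[str]:
    """Return absolute, de-duplicated links that stay on the host of page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links
```

A recursive crawl would call `internal_links` on each fetched page and keep a set of visited URLs to avoid loops.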

database¤

Create an SQLite database of web_pages to be used by a search engine.
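A minimal sketch of the idea using the standard `sqlite3` module follows. The table layout and the `url`, `title` and `content` fields are assumptions made for illustration; the actual web_page fields and the schema used by this module may differ, and `create_index` is a hypothetical name.

```python
import sqlite3


def create_index(pages: list[dict], path: str = ":memory:") -> sqlite3.Connection:
    """Store pages in an SQLite table keyed by URL and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, content TEXT)"
    )
    # INSERT OR REPLACE keeps only the latest version of a page per URL.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, content) "
        "VALUES (:url, :title, :content)",
        pages,
    )
    conn.commit()
    return conn
```

Using the URL as primary key means re-crawling a page overwrites its previous row instead of creating a duplicate.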

deduplicator¤

Find and remove duplicates and near-duplicates in a list of core.types.web_page objects.
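One common way to do this, shown here as a sketch and not as this module's actual algorithm, is to drop exact duplicates by hashing normalized text and near-duplicates by Jaccard similarity over word shingles. The function names, the shingle size of 3 and the 0.8 threshold are assumptions, and the sketch works on plain strings instead of web_page objects.

```python
import hashlib
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def _shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams of the normalized text."""
    words = _normalize(text).split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    kept: list[str] = []
    seen_hashes: set[str] = set()
    kept_shingles: list[set] = []
    for text in texts:
        digest = hashlib.sha256(_normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        shingles = _shingles(text)
        if any(jaccard(shingles, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a text already kept
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(shingles)
    return kept
```

The pairwise comparison is quadratic in the number of kept texts, which is fine for a sketch; large corpora usually need an approximate scheme such as MinHash.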

facebook-login¤

language¤

network¤

nlp¤

High-level natural language processing module for message-like (emails, comments, posts) input.

parser¤

patterns¤

Contains global regular expression patterns reused across the app.

pdf¤

PDF parsing utils, including OCR.

secretary¤

High-level manager of all server connectors and the bank of filters.

types¤

utils¤

Logging and filter finding utilities.

protocols¤

carddav_server¤

imap_object¤

imap_server¤

instagram_server¤

smtp_server¤