API Reference

core

batching

High-performance, parallelized, high-level methods for processing large corpora of documents.
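
As an illustration of what parallel processing of a corpus can look like, here is a minimal sketch built on the standard library only. It is not this module's actual API: the function name process_corpus and its parameters are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def process_corpus(documents: Iterable[str],
                   func: Callable[[str], str],
                   max_workers: int = 4) -> List[str]:
    """Apply func to every document concurrently (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(func, documents))


corpus = ["  Hello World ", "FOO bar", "Baz  "]
cleaned = process_corpus(corpus, lambda text: text.strip().lower())
print(cleaned)
```

A thread pool keeps the sketch simple and suits I/O-bound work. For CPU-bound processing, concurrent.futures.ProcessPoolExecutor offers the same map interface, provided the function passed to it can be pickled.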

connectors

Provides the abstract classes for server and content, which must be implemented for each protocol.
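
The pattern described above, one abstract pair implemented once per protocol, might be sketched as follows. All class and method names here (Server, Content, connect, list_contents, read, and the in-memory implementations) are hypothetical and chosen for illustration; the module's real interfaces may differ.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator


class Content(ABC):
    """Hypothetical base class for one item exposed by a server."""

    @abstractmethod
    def read(self) -> bytes:
        """Return the raw bytes of the item."""


class Server(ABC):
    """Hypothetical base class to implement once per protocol."""

    @abstractmethod
    def connect(self) -> None:
        """Open the connection."""

    @abstractmethod
    def list_contents(self) -> Iterator[Content]:
        """Yield every item available on the server."""


class InMemoryContent(Content):
    """Toy implementation: the item is a bytes object held in memory."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def read(self) -> bytes:
        return self._data


class InMemoryServer(Server):
    """Toy 'protocol' backed by a dictionary instead of a network."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files
        self.connected = False

    def connect(self) -> None:
        self.connected = True

    def list_contents(self) -> Iterator[Content]:
        for data in self._files.values():
            yield InMemoryContent(data)


server = InMemoryServer({"a.txt": b"alpha", "b.txt": b"beta"})
server.connect()
payloads = [item.read() for item in server.list_contents()]
```

Because the base classes declare abstract methods, instantiating Server or Content directly raises TypeError, so a protocol that forgets to implement a method fails early.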

Module containing utilities to crawl websites and extract the text content of HTML, XML and PDF pages. PDFs are read from their embedded text, if any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml file or by recursively following internal links from an index page. Each page is appended to a list of core.types.web_page objects, meant to be used as input to train natural-language AI models and to index and rank pages for search engines.
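
The two discovery strategies described above, reading a sitemap.xml file and following internal links from an index page, can be sketched with the standard library alone. This is an assumption-laden illustration, not the module's implementation: the names urls_from_sitemap, internal_links, crawl and the injected fetch callable are hypothetical, pages are kept as raw HTML strings instead of core.types.web_page objects, and the traversal uses a queue in place of literal recursion.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin, urlparse

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def internal_links(html: str, page_url: str) -> List[str]:
    """Return absolute, de-duplicated links that stay on the page's host."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(page_url).netloc
    found: List[str] = []
    for href in parser.links:
        # Resolve relative links and drop the #fragment part
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def crawl(index_url: str,
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 50) -> Dict[str, str]:
    """Follow internal links breadth-first, starting from index_url."""
    seen = {index_url}
    queue = [index_url]
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)  # hypothetical downloader supplied by the caller
        if html is None:
            continue
        pages[url] = html
        for link in internal_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Offline demonstration on a fake three-page site
SITE = {
    "https://example.org/": '<a href="/about">About</a> '
                            '<a href="https://other.example/x">External</a>',
    "https://example.org/about": '<a href="/">Home</a> '
                                 '<a href="team#staff">Team</a>',
    "https://example.org/team": "<p>No links here</p>",
}
SITEMAP = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
           "<url><loc>https://example.org/</loc></url>"
           "<url><loc>https://example.org/about</loc></url>"
           "</urlset>")

sitemap_urls = urls_from_sitemap(SITEMAP)
crawled = crawl("https://example.org/", SITE.get)
```

Passing the downloader as a parameter keeps the sketch testable without network access. The seen set is what prevents pages that link to each other from being visited in a loop.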