core.batching¤

High-performance, parallelized high-level methods to process large corpora of documents.

Interfaces NLP processing with database entries, for efficient RAM management.

Database structure is hard-coded and expects conformation to data structures defined in core.database and core.types

Attributes¤

core.batching.LANG_MAP `module-attribute` ¤

LANG_MAP = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
    "it": "italian",
    "pt": "portuguese",
    "nl": "dutch",
    "sv": "swedish",
    "no": "norwegian",
    "da": "danish",
    "fi": "finnish",
    "ru": "russian",
    "ro": "romanian",
    "hu": "hungarian",
    "tr": "turkish",
}

Map ISO 639-1 language codes of supported languages to their full names, as used by pre-trained corpora.

core.batching.LANG_MAP_REVERSE `module-attribute` ¤

LANG_MAP_REVERSE = {v: k for k, v in (LANG_MAP.items())}

Map the full names of supported languages, as used by pre-trained corpora, to ISO 639-1 language codes.
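As a quick illustration of how the two maps relate, the sketch below rebuilds the reverse map from a three-entry excerpt of `LANG_MAP`. The excerpt and the result variables are local to the example.

```python
# Three-entry excerpt of LANG_MAP, copied from the listing above.
LANG_MAP = {"en": "english", "fr": "french", "de": "german"}

# Same dictionary comprehension as the documented LANG_MAP_REVERSE.
LANG_MAP_REVERSE = {v: k for k, v in LANG_MAP.items()}

full_name = LANG_MAP["fr"]             # ISO code -> full name
iso_code = LANG_MAP_REVERSE["german"]  # full name -> ISO code
```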

core.batching.STOPWORDS_DICT `module-attribute` ¤

STOPWORDS_DICT = {
    language: (set(STOPWORDS_DICT[language])) for language in STOPWORDS_DICT
}

Dictionary of stopwords (as sets) mapped to full language names (as keys).

core.batching.regex_starter `module-attribute` ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;|\\>)'

Start of line, or start of document, or start of markup

core.batching.regex_stopper `module-attribute` ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|\\<)'

End of line, or end of document, or end of markup

core.batching.end_of_word `module-attribute` ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

End of word, or end of line, or end of document, or end of markup

core.batching.regex_algebra `module-attribute` ¤

regex_algebra = '[\\+\\-\\=\\≠\\±]'

Algebraic signs

core.batching.IP_PATTERN `module-attribute` ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

core.batching.EMAIL_PATTERN `module-attribute` ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
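The sketch below recompiles the expression shown above with the standard `re` module (an assumption: the module itself may import a different regex engine) and checks that both the bracketed and the bare form yield the address in group 1. The sample strings are made up for the example.

```python
import re

# Same expression as the documented EMAIL_PATTERN, written as a raw string.
EMAIL_PATTERN = re.compile(
    r"<?([0-9a-z\-\_\+\.]+?@[0-9a-z\-\_\+]+(\.[0-9a-z\_\-]{2,})+)>?",
    re.IGNORECASE,
)

bracketed = EMAIL_PATTERN.search("From: Jane <me@mail.com>")
bare = EMAIL_PATTERN.search("write to me@mail.com today")

# Group 1 holds the address without the optional angle brackets.
addresses = (bracketed.group(1), bare.group(1))
```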

core.batching.URL_PATTERN `module-attribute` ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URL must follow RFC3986, meaning query parameters should be before anchors, if any. Relying on this assumption allows a faster regex parsing.

the protocol (ftp, ftps, http, https) is captured as the first group,
domain.ext is captured as the second group,
/page/etc is the third group, including leading and trailing /,
page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.

URLs are captured if they are:

alone on their own line,
enclosed in {}, [], ()
enclosed in whitespaces.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name since () and [] are valid in URL paths and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON will need to be parsed beforehand.

core.batching.MEMBERS_PATTERN `module-attribute` ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.
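A minimal sketch of the pattern in use, recompiled with the standard `re` module. Replacing the matched dot with a space is an assumption made for illustration; the documentation does not state what the pipeline substitutes.

```python
import re

# Same expression as the documented MEMBERS_PATTERN.
MEMBERS_PATTERN = re.compile(r"(?<=[a-z])(\.)(?=[a-z])", re.IGNORECASE)

# Dots squeezed between two letters are matched...
split_members = MEMBERS_PATTERN.sub(" ", "params.cookies.client")

# ...while decimal points and sentence-ending dots are left alone.
untouched = MEMBERS_PATTERN.sub(" ", "version 1.2 is out.")
```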

core.batching.DATE_PATTERN `module-attribute` ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22 where the whole date is captured in the first group, then each group of digits is captured in the order of appearance, in the next 3 groups

core.batching.TIME_PATTERN `module-attribute` ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, like:

12h15
12:15
12:15:00
12am
12 am
12 h
12:15:00Z
12:15:00+01
12:15:00 UTC+1
11:27:45+0000

RETURNS	DESCRIPTION
`0`	1- or 2-digits hour, TYPE: `str`
`1`	hour/minutes separator or half-day marker among `["h", ":", "am", "pm"]` (case-insensitive) TYPE: `str`
`2`	2-digits minutes, if any, or `None` TYPE: `str`
`3`	2-digits seconds, if any. TYPE: `str`
`4`	hour marker (`h` or `H`), half-day marker (case-insensitive `["am", "pm"]`), or time zone marker (case-sensitive `["Z", "UTC"]`) TYPE: `str`
`5`	1-or 2-digits signed integer timezone shift (referred to UTC). TYPE: `str`

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

core.batching.DOMAIN_PATTERN `module-attribute` ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from (domain.ext) from RFC-822 Received header in emails.
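As an illustration, the sketch below applies the expression above to a made-up `Received` header, using the standard `re` module. The header content (host names, IP address) is invented for the example.

```python
import re

# Same expression as the documented DOMAIN_PATTERN.
DOMAIN_PATTERN = re.compile(
    r"from ((?:[a-z0-9\-_]{0,61}\.)+[a-z]{2,})", re.IGNORECASE
)

header = "Received: from mail.example.com (unknown [192.0.2.1]) by mx.example.org"
match = DOMAIN_PATTERN.search(header)

# Group 1 is the sending host announced after "from".
sender_domain = match.group(1)
```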

core.batching.UID_PATTERN `module-attribute` ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

core.batching.FLAGS_PATTERN `module-attribute` ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
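The two IMAP patterns can be tried together on a fetch-style response line. The sketch below uses the standard `re` module and an invented response string; splitting the flags on whitespace is an assumption about how a caller might use the capture.

```python
import re

# Same expressions as the documented UID_PATTERN and FLAGS_PATTERN.
UID_PATTERN = re.compile(r"UID ([0-9]+)")
FLAGS_PATTERN = re.compile(r"FLAGS \((.*?)\)")

# Invented IMAP response fragment.
line = r"1 (UID 4827 FLAGS (\Seen \Answered))"

uid = int(UID_PATTERN.search(line).group(1))
flags = FLAGS_PATTERN.search(line).group(1).split()
```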

core.batching.PATH_PATTERN `module-attribute` ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

core.batching.PARTIAL_PATH_REGEX `module-attribute` ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

this will collide with date detection, so run it after in the pipeline.

core.batching.RESOLUTION_PATTERN `module-attribute` ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.
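A short sketch of the pattern on sample text, using the standard `re` module. Because the expression has no capturing group, `findall` returns the full matches.

```python
import re

# Same expression as the documented RESOLUTION_PATTERN.
RESOLUTION_PATTERN = re.compile(r"\d+(?:×|x|X)\d+")

text = "Export at 1920x1080, thumbnails at 10×20 px"

# The trailing unit ("px") is not part of the match.
resolutions = RESOLUTION_PATTERN.findall(text)
```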

core.batching.NUMBER_PATTERN `module-attribute` ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with starting or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

core.batching.HASH_PATTERN `module-attribute` ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.

core.batching.MULTIPLE_LINES `module-attribute` ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect 2 or more newlines and tabs, possibly mixed with spaces.

core.batching.MULTIPLE_NEWLINES `module-attribute` ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.

core.batching.INTERNAL_NEWLINE `module-attribute` ¤

INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')

Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal (`\n` used instead of space).
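The sketch below shows the effect on PDF-like text where lines are hard-wrapped. It recompiles the expression with the standard `re` module; substituting a space is an assumption for illustration.

```python
import re

# Same expression as the documented INTERNAL_NEWLINE.
INTERNAL_NEWLINE = re.compile(r"(?<=\w)[\n\t\r]{1}(?=\w)")

wrapped = "line wrapping is\nquite literal in\nparsed PDF.\n\nNew paragraph"

# Single newlines between two word characters become spaces;
# the blank line after the full stop is preserved.
unwrapped = INTERNAL_NEWLINE.sub(" ", wrapped)
```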

core.batching.EXPOSURE `module-attribute` ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

core.batching.PHOTOSPEED `module-attribute` ¤

PHOTOSPEED = re.compile(
    "%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
    flags=re.IGNORECASE,
)

Exposure values in EV or IL

core.batching.SENSIBILITY `module-attribute` ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensibility in ISO or ASA

core.batching.LUMINANCE `module-attribute` ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or Cd/m²

core.batching.DIAPHRAGM `module-attribute` ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

core.batching.GAIN `module-attribute` ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

core.batching.FILE_SIZE `module-attribute` ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

core.batching.DISTANCE `module-attribute` ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

core.batching.PERCENT `module-attribute` ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

core.batching.WEIGHT `module-attribute` ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

core.batching.ANGLE `module-attribute` ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

core.batching.TEMPERATURE `module-attribute` ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

core.batching.FREQUENCY `module-attribute` ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

core.batching.TEXT_DATES `module-attribute` ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual dates formats:

English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.

RETURNS	DESCRIPTION
`0`	2 digits (day number or year number, depending on language) TYPE: `str`
`1`	month (full-form or abbreviated) TYPE: `str`
`2`	2 digits (day number or year number, depending on language) TYPE: `str`
`3`	4 digits (full year) TYPE: `str`
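To make the group layout concrete, the sketch below recompiles the expression above with the standard `re` module and inspects two invented samples. Note that the optional and year groups keep their leading space, and that the table indices 0 to 3 correspond to positions in `match.groups()`.

```python
import re

# Same expression and flags as the documented TEXT_DATES.
TEXT_DATES = re.compile(
    r"([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

# Day, abbreviated month with a dot, 4-digit year.
full = TEXT_DATES.search("sent on 01 Jan. 2020 at noon")
full_groups = full.groups()

# The negative lookahead keeps the hour of "12:08" out of the date.
with_time = TEXT_DATES.search("01 Jan 20 12:08")
date_only = with_time.group(0)
```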

core.batching.BASE_64 `module-attribute` ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64 encoding
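The expression requires at least 64 groups of 4 characters, so only long blobs (256 characters or more) are matched. The sketch below checks this with the standard `re` and `base64` modules; the payload is arbitrary and replacing the blob with a space mirrors the value used in `characters_cleanup`.

```python
import base64
import re

# Same expression as the documented BASE_64.
BASE_64 = re.compile(
    r"((?:[A-Za-z0-9+\/]{4}){64,}(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?)"
)

# 256 bytes encode to 344 base64 characters, enough to be detected.
blob = base64.b64encode(bytes(range(256))).decode("ascii")
cleaned = BASE_64.sub(" ", "before " + blob + " after")

# A short encoded string stays below the 64-group threshold.
short_match = BASE_64.search("aGVsbG8=")
```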

core.batching.BB_CODE `module-attribute` ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]

core.batching.MARKUP `module-attribute` ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]

core.batching.USER `module-attribute` ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails

core.batching.REPEATED_CHARACTERS `module-attribute` ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character repeated more than 9 times

core.batching.UNFINISHED_SENTENCES `module-attribute` ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences finishing with 2 newlines characters without having ending punctuations

core.batching.MULTIPLE_DOTS `module-attribute` ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies dots repeated twice or more

core.batching.MULTIPLE_DASHES `module-attribute` ¤

MULTIPLE_DASHES = re.compile('[-~]{1,}')

Identifies runs of one or more dashes or tildes

core.batching.MULTIPLE_QUESTIONS `module-attribute` ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of one or more question marks

core.batching.ORDINAL_FR `module-attribute` ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)

core.batching.FRANCAIS `module-attribute` ¤

FRANCAIS = re.compile(
    "%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

core.batching.DASHES `module-attribute` ¤

DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.

core.batching.ALTERNATIVES `module-attribute` ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or mr/mrs
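Both `DASHES` and `ALTERNATIVES` use fixed-width lookarounds, so they can be tried with the standard `re` module. In the sketch below, substituting a space is an assumption for illustration; the documentation does not say what the pipeline replaces the separators with.

```python
import re

# Same expressions as the documented DASHES and ALTERNATIVES.
DASHES = re.compile(r"(?<=\w)(-|_|=)+(?=\w)", re.IGNORECASE)
ALTERNATIVES = re.compile(r"(?<=[a-z])(\/)(?=[a-z])", re.IGNORECASE)

compound = DASHES.sub(" ", "state-of-the-art")
choices = ALTERNATIVES.sub(" ", "and/or mr/mrs")

# A slash between digits is not a word alternative and is kept.
fraction = ALTERNATIVES.sub(" ", "1/2")
```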

core.batching.PLURAL_S `module-attribute` ¤

PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

core.batching.FEMININE_E `module-attribute` ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

core.batching.DOUBLE_CONSONANTS `module-attribute` ¤

DOUBLE_CONSONANTS = re.compile(
    "(?<=\\w{2,})([bcfghjklmnpqrstvwxz])\\1", re.IGNORECASE
)

Identify double consonants in the middle of words.

core.batching.FEMININE_TRICE `module-attribute` ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice, along with the -teur and -tor endings that the pattern also matches.

core.batching.ADVERB_MENT `module-attribute` ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

core.batching.SUBSTANTIVE_TION `module-attribute` ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

core.batching.SUBSTANTIVE_AT `module-attribute` ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

core.batching.PARTICIPLE_ING `module-attribute` ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing

core.batching.ADJECTIVE_ED `module-attribute` ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed

core.batching.ADJECTIVE_TIF `module-attribute` ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

core.batching.SUBSTANTIVE_Y `module-attribute` ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

core.batching.VERB_IZ `module-attribute` ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American verbs ending in -iz that French and British English spell with -is

core.batching.STUFF_ER `module-attribute` ¤

STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

core.batching.BRITISH_OUR `module-attribute` ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

core.batching.SUBSTANTIVE_ITY `module-attribute` ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

core.batching.SUBSTANTIVE_IST `module-attribute` ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

core.batching.SUBSTANTIVE_IQU `module-attribute` ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu

core.batching.SUBSTANTIVE_EUR `module-attribute` ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur

core.batching.HYPHENIZED `module-attribute` ¤

HYPHENIZED = re.compile('(?<=\\w{3,})[-–—]+ *[\\n\\r]{1,2}(?=\\w)')

Detect hyphenated words split at the end of a PDF text line.

core.batching.WAYBACK_RE `module-attribute` ¤

WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')

Find the canonical URL from web.archive.org (Wayback Machine) URLs
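A minimal sketch of the extraction, using the standard `re` module and an invented archived URL (the timestamp and target page are placeholders).

```python
import re

# Same expression as the documented WAYBACK_RE.
WAYBACK_RE = re.compile(r"https?://web\.archive\.org/web/[^/]+/(https?://.+)")

archived = "https://web.archive.org/web/20200101000000/https://example.com/page"
match = WAYBACK_RE.match(archived)

# Fall back to the input when the URL is not a Wayback Machine URL.
canonical = match.group(1) if match else archived
```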

Classes¤

core.batching.Lexicon `dataclass` ¤

Lexicon(counts: Counter[str] = Counter())

Mutable token frequency index with canonicalization helpers for:

malformed n-grams,
merged/split variants,
plural compound normalization.

Examples:

- liber_tarian  -> libertarian
- etres_humains -> etre_humain

Methods:¤

core.batching.Lexicon.update ¤

update(corpus: Iterable[Iterable[str]]) -> None

Update token frequencies from a corpus of tokenized sentences.

PARAMETER	DESCRIPTION
`corpus`	Iterable of tokenized sentences: `[["this", "is", "a", "sentence"], ["another", "sentence"]]` TYPE: `Iterable[Iterable[str]]`

core.batching.Lexicon.frequency ¤

frequency(token: str) -> int

Return token frequency.

core.batching.Lexicon.exists ¤

exists(token: str) -> bool

Check whether a token exists in the lexicon.

core.batching.Lexicon.prune ¤

prune(min_count: int = 10) -> None

Remove all entries whose frequency is lower than min_count.

PARAMETER	DESCRIPTION
`min_count`	Minimum frequency to keep. TYPE: `int` DEFAULT: `10`
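The behaviour of `update()` and `prune()` can be approximated with a plain `collections.Counter`, which is the type of the `counts` field. This is a sketch under that assumption, not the class's actual code; the variable names are local to the example.

```python
from collections import Counter

corpus = [["this", "is", "a", "sentence"], ["another", "sentence"]]

# update(): accumulate token frequencies sentence by sentence.
counts = Counter()
for sentence in corpus:
    counts.update(sentence)

# prune(): keep only tokens seen at least min_count times.
min_count = 2
pruned = Counter({tok: n for tok, n in counts.items() if n >= min_count})
```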

core.batching.Lexicon.resolve_token ¤

resolve_token(token: str, separator: str = '_', min_ratio: float = 1.0) -> str

Attempt to canonicalize malformed n-grams.

Operations

malformed n-grams: liber_tarian -> libertarian
plural compound reduction: etres_humains -> etre_humain

Strategy

if token exists already -> keep it
otherwise:
- remove separators,
- check if merged variant exists,
- compare frequencies,
- prefer merged form if sufficiently frequent.

PARAMETER	DESCRIPTION
`token`	Token to canonicalize. TYPE: `str`
`separator`	N-gram separator. TYPE: `str` DEFAULT: `'_'`
`min_ratio`	Require merged token frequency to be at least `min_ratio` times the split variant frequency. Helps avoid false positives. TYPE: `float` DEFAULT: `1.0`

RETURNS	DESCRIPTION
`str`	Canonicalized token.
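The merge strategy listed above could be sketched as follows. This is a hypothetical re-implementation, not the library's code: `resolve_token_sketch` is an invented name, the frequency of the split variant is assumed to be that of its rarest part, and plural compound reduction is left out.

```python
from collections import Counter


def resolve_token_sketch(counts, token, separator="_", min_ratio=1.0):
    """Prefer the merged spelling of an unknown n-gram when it is frequent enough."""
    # A token already present in the lexicon is kept as-is.
    if counts[token] > 0:
        return token
    merged = token.replace(separator, "")
    # No merged variant in the lexicon: nothing to canonicalize to.
    if counts[merged] == 0:
        return token
    # Assumption: the split variant is as frequent as its rarest member.
    parts = [part for part in token.split(separator) if part]
    split_freq = min((counts[part] for part in parts), default=0)
    if counts[merged] >= min_ratio * split_freq:
        return merged
    return token


counts = Counter({"libertarian": 12, "liber": 1, "new": 50, "york": 20, "newyork": 2})
fixed = resolve_token_sketch(counts, "liber_tarian")
kept = resolve_token_sketch(counts, "new_york")
```

With these invented counts, `liber_tarian` collapses to `libertarian`, while `new_york` is kept because its members are far more frequent than the merged spelling.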

core.batching.Lexicon.canonicalize_sentence ¤

canonicalize_sentence(
    sentence: list[str], separator: str = "_", min_ratio: float = 1.0
) -> list[str]

Canonicalize all tokens in a sentence.

core.batching.Tokenizer ¤

Tokenizer(
    meta_tokens: dict[re.Pattern, str] | None = None,
    abbreviations: dict[str, str] | None = None,
    replacements: dict[str, str] | None = None,
    stopwords: set[str] | None = None,
    lang_stopwords: dict[str, set[str]] | None = None,
    backend: str = "blingfire",
)

Pre-processing pipeline and tokenizer.

Splits a string into normalized word tokens after applying a series of configurable text transformations.

PARAMETER	DESCRIPTION
`meta_tokens`	Pipeline of regular-expression substitutions used to replace document fragments with meta-tokens. Keys must be compiled `re.Pattern` objects and values must be meta-token strings, typically enclosed in underscores. Transformations are applied in declaration order. This relies on Python’s ordered dictionaries (Python 3.7+). If not provided, a default pipeline suitable for bilingual English/French technical documents is used. TYPE: `dict[re.Pattern, str] \| None` DEFAULT: `None`
`abbreviations`	Pipeline of abbreviation replacements as a `{to_replace: replacement}` dictionary. Replacements are applied in declaration order. TYPE: `dict[str, str] \| None` DEFAULT: `None`
`replacements`	Dictionary of token-level substitutions applied as `{key: value}` string replacements. TYPE: `dict[str, str] \| None` DEFAULT: `None`
`stopwords`	Language-agnostic stopwords to remove from the token stream. TYPE: `set[str] \| None` DEFAULT: `None`
`lang_stopwords`	Language-specific stopwords. Keys must be ISO 639-1 language codes and values must be sets of stopwords associated with each language. TYPE: `dict[str, set[str]] \| None` DEFAULT: `None`
`backend`	Tokenization backend to use. Supported values are: `"blingfire"`: Microsoft BlingFire tokenizer (pattern-based). `"nltk"`: NLTK Punkt tokenizer. TYPE: `str` DEFAULT: `'blingfire'`

Attributes¤

core.batching.Tokenizer.characters_cleanup `class-attribute` `instance-attribute` ¤

characters_cleanup: dict[(re.Pattern) : str] = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}

Dictionary of regular expressions (keys) to find and replace by the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBcode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).
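The sketch below applies such a cleanup dictionary in declaration order with `Pattern.sub()`. It assumes that this is how the tokenizer consumes the mapping, recompiles the patterns locally with the standard `re` module, and leaves `BASE_64` out for brevity; the sample text is invented.

```python
import re

# Local copies of the documented patterns.
MULTIPLE_DOTS = re.compile(r"\.{2,}")
MULTIPLE_DASHES = re.compile(r"[-~]{1,}")
MULTIPLE_QUESTIONS = re.compile(r"\?{1,}")
REPEATED_CHARACTERS = re.compile(r"(.)\1{9,}")
BB_CODE = re.compile(r'\[(img|quote)[a-zA-Z0-9 =\"]*?\].*?\[\/\1\]')
MARKUP = re.compile(r"(?:\[|\{|\<)([^\n\r]+?)(?:\]|\}|\>)")

characters_cleanup = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: r" \1 ",
}

text = "Wait.... really??? [quote]old text[/quote] see <b>this</b> ----------"

# Dictionaries preserve insertion order, so patterns run as declared.
for pattern, replacement in characters_cleanup.items():
    text = pattern.sub(replacement, text)

# Collapse the extra spaces left behind by the substitutions.
cleaned = " ".join(text.split())
```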

core.batching.Tokenizer.internal_meta_tokens `class-attribute` `instance-attribute` ¤

internal_meta_tokens: dict[(re.Pattern) : str] = {
    HASH_PATTERN_FAST: "_HASH_",
    NUMBER_PATTERN_FAST: "_NUMBER_",
}

Dictionary of regular expressions (keys) to find in full tokens and replace by meta-tokens. Uses simplified regex patterns for performance.

core.batching.Tokenizer.abbreviations `instance-attribute` ¤

abbreviations = abbreviations

Abbreviations and contractions to replace in full documents

core.batching.Tokenizer.replacements `instance-attribute` ¤

replacements = replacements

Arbitrary string replacements in single tokens

core.batching.Tokenizer.stopwords `instance-attribute` ¤

stopwords = set(stopwords) if stopwords else None

Language-agnostic stopwords

core.batching.Tokenizer.lang_stopwords `instance-attribute` ¤

lang_stopwords = lang_stopwords

Language-specific stopwords

core.batching.Tokenizer.supports_ngrams `instance-attribute` ¤

supports_ngrams: bool = False

Whether or not the tokenizer has an embedded n-grams model

core.batching.Tokenizer.ngrams_trie `instance-attribute` ¤

ngrams_trie = {}

Prefix tree of known n-grams for efficient lookups

core.batching.Tokenizer.vocabulary `instance-attribute` ¤

vocabulary: Lexicon = Lexicon()

Known tokens, if trained for n-grams.

Methods:¤

core.batching.Tokenizer.prefilter ¤

prefilter(string: str, meta_tokens: bool = True) -> str

Tokenizers split words based on unsupervised machine-learned models, and sometimes behave unexpectedly. For example, in emails and user handles like @user, they would split @ and user into 2 different tokens, making it impossible to detect usernames in single tokens later.

To avoid that, we replace data of interest by meta-tokens before the tokenization, with regular expressions.
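A minimal sketch of the idea, assuming the meta-token pipeline is an ordered `{pattern: meta_token}` dictionary applied with `Pattern.sub()` before tokenization. The meta-token names `_EMAIL_` and `_RESOLUTION_` and the helper `prefilter_sketch` are hypothetical; the two patterns are local copies of expressions documented above.

```python
import re

EMAIL_PATTERN = re.compile(
    r"<?([0-9a-z\-\_\+\.]+?@[0-9a-z\-\_\+]+(\.[0-9a-z\_\-]{2,})+)>?",
    re.IGNORECASE,
)
RESOLUTION_PATTERN = re.compile(r"\d+(?:×|x|X)\d+")

# Hypothetical pipeline: order matters, earlier patterns win.
meta_tokens = {
    EMAIL_PATTERN: "_EMAIL_",
    RESOLUTION_PATTERN: "_RESOLUTION_",
}


def prefilter_sketch(string: str) -> str:
    """Replace fragments of interest by meta-tokens before tokenization."""
    for pattern, meta_token in meta_tokens.items():
        string = pattern.sub(meta_token, string)
    return string


filtered = prefilter_sketch("mail me@mail.com about the 1920x1080 export")
```

Once replaced, each fragment survives tokenization as a single token instead of being split around `@` or `x`.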

core.batching.Tokenizer.lemmatize ¤

lemmatize(word: str) -> str

Find the root (lemma) of words to help topical generalization.

core.batching.Tokenizer.normalize_text ¤

normalize_text(document: str) -> str

Prepare text for tokenization by converting it to lowercase ASCII characters.

This will lose accents, diacritics and capitals, which means some nuance will be lost for the benefit of generality. If this does not suit your use case, you may subclass Tokenizer and re-implement this method.
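One common way to obtain lowercase ASCII text is Unicode decomposition followed by an ASCII encode that drops the combining marks. The sketch below uses that technique with the standard `unicodedata` module; it is not necessarily what `normalize_text()` does internally, and `normalize_text_sketch` is an invented name.

```python
import unicodedata


def normalize_text_sketch(document: str) -> str:
    """Lowercase a document and strip accents and diacritics."""
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", document)
    # Encoding to ASCII with "ignore" drops the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


normalized = normalize_text_sketch("Café Décembre, ÇA VA")
```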

core.batching.Tokenizer.normalize_token ¤

normalize_token(
    word: str,
    language: str | None,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> str | None

Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.), punctuation, etc. are replaced by None, which should be filtered out at the next step.

PARAMETER	DESCRIPTION
`word`	tokenized word in lower case only. TYPE: `str`
`language`	the ISO 639-1 language code used to remove typical stopwords. TYPE: `str`
`normalize`	remove punctuation and leading/trailing symbols. TYPE: `bool` DEFAULT: `True`
`meta_tokens`	replace string patterns by meta_tokens TYPE: `bool` DEFAULT: `True`
`stem`	remove word suffixes, double consonants, etc. TYPE: `bool` DEFAULT: `True`
`remove_stopwords`	remove stopwords TYPE: `bool` DEFAULT: `True`

NOTE

Tokenization is non-destructive (full sentences can be reconstructed entirely from token lists) if normalize=False, meta_tokens=False, stem=False and remove_stopwords=False. In this setting, only 1:1 token replacements defined in self.replacements will be applied, which makes it possible to replace abbreviations or acronyms. Other modes start generalizing semantics by removing meaning.

Examples:

Meta-tokens: 10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token. feb, February, feb., monday will all be replaced by a _DATE_ meta-token.

core.batching.Tokenizer.tokenize_text ¤

tokenize_text(
    sentence: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

Split text into normalized word tokens and meta-tokens.

No sentence or paragraph boundary detection is performed.

PARAMETER	DESCRIPTION
`sentence`	Input text to tokenize. TYPE: `str`
`n_grams`	Whether to detect and collapse n-grams. Requires a trained n-gram model generated with `train_ngrams()`. TYPE: `bool` DEFAULT: `True`

Note

The parameters language, normalize, meta_tokens, stem, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`list[str]`	List of normalized tokens represented as a bag of words.

core.batching.Tokenizer.post_filter_tokens ¤

post_filter_tokens(
    tokens: list[str],
    language: str | None = None,
    meta_tokens: bool = True,
    stem: bool = False,
    normalize: bool = False,
    remove_stopwords: bool = False,
) -> list[str]

Apply post-processing operations to an existing token stream.

This method applies token normalization, stemming, stopword removal, and meta-token handling without performing tokenization.

PARAMETER	DESCRIPTION
`tokens`	List of input tokens to process. TYPE: `list[str]`

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`list[str]`	List of processed tokens.

core.batching.Tokenizer.tokenize_document_flat ¤

tokenize_document_flat(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

Clean up and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:

lowercased (optional but recommended) with str.lower(),
translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
cleaned up for sequences of whitespaces with core.utils.clean_whitespaces

Note

the language is detected internally if not provided as an optional argument. When processing a single sentence extracted from a document, instead of the whole document, it is more accurate to run the language detection on the whole document, ahead of calling this method, and pass on the result here.

PARAMETER	DESCRIPTION
`document`	the text of the document to tokenize TYPE: `str`
`n_grams`	see core.nlp.Tokenizer.tokenize_text TYPE: `bool` DEFAULT: `True`

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`tokens`	a 1D list of normalized tokens and meta-tokens. TYPE: `list[str]`

core.batching.Tokenizer.tokenize_document_per_sentence ¤

tokenize_document_per_sentence(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

Clean up and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

lowercased (optional but recommended) with str.lower(),
translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
cleaned up for sequences of whitespaces with core.utils.clean_whitespaces

PARAMETER	DESCRIPTION
`document`	the text of the document to tokenize TYPE: `str`
`n_grams`	see core.nlp.Tokenizer.tokenize_text TYPE: `bool` DEFAULT: `True`

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`tokens`	a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis). TYPE: `list[list[str]]`

core.batching.Tokenizer.tokenize_document_per_paragraph ¤

tokenize_document_per_paragraph(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

Clean up and tokenize a whole document as a list of paragraphs, meaning we split it on `\n\n` or `\r\n` before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

lowercased (optional but recommended) with str.lower(),
translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
cleaned up for sequences of whitespaces with core.utils.clean_whitespaces

PARAMETER	DESCRIPTION
`document`	the text of the document to tokenize TYPE: `str`
`n_grams`	see core.nlp.Tokenizer.tokenize_text TYPE: `bool` DEFAULT: `True`

Note

The language is detected internally if not provided. The text is prefiltered with self.prefilter. For the other parameters, see the core.nlp.Tokenizer.normalize_token arguments.

RETURNS	DESCRIPTION
`tokens`	a 2D list of paragraphs (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis). TYPE: `list[list[str]]`

core.batching.Tokenizer.load `classmethod` ¤

load(name: str)

Load an existing trained model by its name from the ../models folder.

core.batching.Tokenizer.members_from_ngram ¤

members_from_ngram(token: str | None) -> list[str] | None

Recover n-gram members from a single tokenized phrase, separated with _. This expects lower-case tokens, except for meta-tokens, which are expected to be capitalized.

RETURNS	DESCRIPTION
`list[str] \| None`	the list of n-gram members, or None if the token was not an n-gram but a singleton.

core.batching.Tokenizer.train_ngrams ¤

train_ngrams(
    sentences: list[str],
    connector_words: str = "",
    min_count: int = 10,
    threshold: float = 0.7,
    scoring: str = "npmi",
)

Train an n-gram model (bigrams and trigrams).

Detects common phrases such as “New York City” and merges them into single tokens using a statistical phrase model.

PARAMETER	DESCRIPTION
`sentences`	Training corpus. Must be a list of tokenized sentences. TYPE: `list[str]`
`connector_words`	Space-separated list of connector words allowed inside phrases (e.g. “by” in “piece by piece”). These words are treated as valid bridges when forming n-grams. TYPE: `str` DEFAULT: `''`
`min_count`	Minimum number of occurrences required for a phrase to be considered. See gensim.models.phrases.Phrases for details. TYPE: `int` DEFAULT: `10`
`threshold`	Phrase detection sensitivity threshold. See gensim.models.phrases.Phrases. TYPE: `float` DEFAULT: `0.7`
`scoring`	Scoring function used for phrase detection. See gensim.models.phrases.Phrases. TYPE: `str` DEFAULT: `'npmi'`

Warning

N-gram training must be performed on lightly processed tokenized sentences. Do not apply stemming, stopword removal, or punctuation stripping before training.

See Tokenizer.normalize_token() for required preprocessing options.

Note

Writes an ngrams log file in the models directory containing discovered phrases.
Can be executed multiple times (e.g. per language); results are appended to the existing model.
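
As an illustration of what the `npmi` scoring and `threshold` parameters control, here is a minimal pure-Python sketch of NPMI bigram scoring. It is not the library's implementation (which delegates to gensim.models.phrases.Phrases); the function name `npmi_bigrams` and the toy corpus are hypothetical, and connector words and trigrams are left out.

```python
import math
from collections import Counter


def npmi_bigrams(sentences, min_count=2, threshold=0.7):
    """Score adjacent word pairs with normalized pointwise mutual information.

    `sentences` is a list of token lists. Returns {"a_b": score} for every
    pair seen at least `min_count` times whose score exceeds `threshold`.
    """
    unigrams = Counter(word for sentence in sentences for word in sentence)
    bigrams = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    total = sum(unigrams.values())
    kept = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        p_a, p_b, p_ab = unigrams[a] / total, unigrams[b] / total, n_ab / total
        # NPMI ranges from -1 (never together) to 1 (always together).
        score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        if score > threshold:
            kept[f"{a}_{b}"] = score
    return kept


corpus = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["big", "city", "life"],
    ["love", "is", "big"],
]
phrases = npmi_bigrams(corpus)
```

Because "new" and "york" only ever occur together in this toy corpus, their pair reaches the maximum NPMI score; raising `threshold` keeps only such tightly bound pairs.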

core.batching.Tokenizer.compile_ngrams ¤

compile_ngrams(ngrams: list[str])

Build a nested n-grams dictionary for efficient querying, like:

{
    "new": {
        "york": {
            "__value__": "new_york",
            "city": {
                "__value__": "new_york_city"
            }
        }
    }
}
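
A nested dictionary of this shape can be built with a few lines of Python. The sketch below is an assumption about how such a trie may be constructed, not the method's actual code; `build_ngram_trie` is a hypothetical name, and n-grams are assumed to be joined with `_`.

```python
def build_ngram_trie(ngrams):
    """Build a nested dict where each path of members leads to the full n-gram."""
    trie = {}
    for ngram in ngrams:
        node = trie
        for member in ngram.split("_"):
            # Descend into the child node, creating it on first visit.
            node = node.setdefault(member, {})
        # Mark the end of a known n-gram with its collapsed form.
        node["__value__"] = ngram
    return trie


trie = build_ngram_trie(["new_york", "new_york_city"])
```

Sharing the "new" and "york" prefixes between both entries is what makes the later lookup proportional to the n-gram length rather than to the number of known n-grams.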

core.batching.Tokenizer.replace_ngrams ¤

replace_ngrams(tokens: list[str]) -> list[str]

Identify n-grams among tokens and collapse them into single tokens. N-grams should have been trained before, with core.nlp.Tokenizer.train_ngrams.

RETURNS	DESCRIPTION
`list[str]`	the collapsed list of strings, or the original list if no n-gram was found or the n-grams model has not been trained.
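
One plausible way to collapse n-grams is a greedy longest-match scan over a nested dictionary such as the one built by compile_ngrams. The sketch below assumes that strategy; the actual matching policy of replace_ngrams is not documented here, and `collapse_ngrams` and `TRIE` are hypothetical names.

```python
VALUE_KEY = "__value__"

# Hand-written trie holding "new_york" and "new_york_city".
TRIE = {"new": {"york": {VALUE_KEY: "new_york", "city": {VALUE_KEY: "new_york_city"}}}}


def collapse_ngrams(tokens, trie=TRIE):
    """Replace the longest known n-gram starting at each position."""
    out, i = [], 0
    while i < len(tokens):
        node, best, best_end, j = trie, None, i, i
        # Walk the trie as long as the next token continues a known prefix.
        while j < len(tokens) and tokens[j] != VALUE_KEY and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if VALUE_KEY in node:
                best, best_end = node[VALUE_KEY], j
        if best is None:
            out.append(tokens[i])  # no n-gram starts here, keep the token
            i += 1
        else:
            out.append(best)  # emit the collapsed n-gram and skip its members
            i = best_end
    return out
```

With this policy, "new york city" collapses into the trigram rather than into the bigram followed by a loose "city".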

core.batching.Tokenizer.lookup_ngram ¤

lookup_ngram(members: list[str] | tuple[str, ...]) -> str | None

Lookup an n-gram in the trie from its token members.

PARAMETER	DESCRIPTION
`members`	the tokens iterable TYPE: `list[str] \| tuple[str, ...]`

RETURNS	DESCRIPTION
`str \| None`	the collapsed n-gram if found in the trie, or `None` if the input members match no known n-gram.

Example

lookup_ngram(("new", "york")) -> "new_york"

lookup_ngram(("new", "york", "city")) -> "new_york_city"

lookup_ngram(("foo", "bar")) -> None

core.batching.Data ¤

Data(text: str, label: str)

Represent an item of tagged training data.

PARAMETER	DESCRIPTION
`text`	the content to label, which will be vectorized TYPE: `str`
`label`	the category of the content, which will be predicted by the model TYPE: `str`

core.batching.LossLogger ¤

LossLogger()

Bases: CallbackAny2Vec

Output loss at each epoch

core.batching.WordEmbedding ¤

Shared interface and post-processing for word-embedding models.

Gensim-agnostic mixin implementing corpus statistics (IDF/SIF), All-but-the-Top post-processing and the word/document vector-retrieval API used by the search engine. It is combined with a concrete gensim training class (core.nlp.Word2Vec or core.nlp.FastText), which must provide self.wv, an output matrix (syn1neg or syn1) and save().

Both core.nlp.Word2Vec and core.nlp.FastText inherit it, so they expose the same interface and are interchangeable wherever the search core.search.Indexer expects an embedding model. Type-hint against core.nlp.WordEmbedding to depend on the interface rather than a concrete model.

Attributes¤

core.batching.WordEmbedding.tokenizer `instance-attribute` ¤

tokenizer: 'Tokenizer'

Tokenizer object, instantiated with word replacements and trained for n-grams if needed.

core.batching.WordEmbedding.vector_size `instance-attribute` ¤

vector_size: int

Number of vector dimensions used to embed words.

core.batching.WordEmbedding.N_docs `instance-attribute` ¤

N_docs: int

Number of documents used at training time

core.batching.WordEmbedding.N_sentences `instance-attribute` ¤

N_sentences: int

Number of sentences found in the training corpus

core.batching.WordEmbedding.N_words `instance-attribute` ¤

N_words: int

Number of words (tokens) found in the training corpus

core.batching.WordEmbedding.N_terms `instance-attribute` ¤

N_terms: int

Number of unique terms found in the training corpus

core.batching.WordEmbedding.idf `instance-attribute` ¤

idf: 'dict[str, float] | None'

Inverse document frequency of words. Computed only if core.nlp.WordEmbedding is instantiated with compute_idf=True.

core.batching.WordEmbedding.avg_doc_len `instance-attribute` ¤

avg_doc_len: 'float | None'

Average document length. Computed only if core.nlp.WordEmbedding is instantiated with compute_idf=True.

core.batching.WordEmbedding.wv `instance-attribute` ¤

wv: gensim.models.KeyedVectors

Gensim keyed vectors

Methods:¤

core.batching.WordEmbedding.prune_idf ¤

prune_idf()

Prune IDF entries to the actual model vocabulary (remove tokens that were filtered out by gensim during super().__init__).

core.batching.WordEmbedding.apply_abtt ¤

apply_abtt(n_components_in: int = 3, n_components_out: int = 0) -> None

Post-process IN and OUT word vectors using All-but-the-Top.

Applies mean subtraction (and optionally PC removal) to W_IN and W_OUT independently. W_OUT receives lighter treatment because its common component is weaker under negative sampling and because it feeds into document centroids that are already corrected by normalize_pc() in the search index.

The intuition is that the principal components of the embedding vector space encode frequency rather than semantics.

For the FastText drop-in this adjusts only the in-vocabulary full-word vectors (wv.vectors) and the output matrix; OOV vectors reconstructed on the fly from sub-word n-grams are left untouched.

Reference

Mu & Viswanath (2018) “All-but-the-Top: Simple and Effective Postprocessing for Word Representations” https://arxiv.org/abs/1702.01417

PARAMETER	DESCRIPTION
`n_components_in`	PCs to remove from W_IN (query space) beyond the mean. For a specialty corpus, 3-10 is typical. Removing too many components risks losing semantic meaning. `0` performs only the mean subtraction (no principal components). `-1` disables both principal-component removal and mean subtraction (bypass). TYPE: `int` DEFAULT: `3`
`n_components_out`	PCs to remove from W_OUT (document space) beyond the mean. Default 0 (mean only) is recommended unless you observe residual domain bias in document clusters after indexing. `0` performs only the mean subtraction (no principal components). `-1` disables both principal-component removal and mean subtraction (bypass). TYPE: `int` DEFAULT: `0`
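
For reference, the All-but-the-Top procedure can be summarized as below. The notation is introduced here for illustration and is not taken from the source code: V is the vocabulary, v(w) is the row of W_IN or W_OUT for word w, and D stands for n_components_in or n_components_out.

```latex
\begin{aligned}
\mu &= \frac{1}{|V|} \sum_{w \in V} v(w) \\
\tilde{v}(w) &= v(w) - \mu \\
u_1, \dots, u_D &= \text{top } D \text{ principal components of } \{\tilde{v}(w) : w \in V\} \\
v'(w) &= \tilde{v}(w) - \sum_{i=1}^{D} \left( u_i^{\top} \tilde{v}(w) \right) u_i
\end{aligned}
```

With D = 0 the sum is empty and only the mean is removed; with the `-1` bypass, v(w) is returned unchanged.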

core.batching.WordEmbedding.load_model `classmethod` ¤

load_model(name: str)

Load a trained model saved in the models folder.

core.batching.WordEmbedding.get_word ¤

get_word(word: str) -> str | None

Find out if word is in dictionary, optionally attempting spell-checking if not found.

For core.nlp.Word2Vec this means the word is in vocabulary. For core.nlp.FastText it is also true for out-of-vocabulary words that can be reconstructed from sub-word n-grams.

PARAMETER	DESCRIPTION
`word`	word to find TYPE: `str`

RETURNS	DESCRIPTION
`str \| None`	the original word if found in the dictionary, `None` if neither of the previous conditions was met.

core.batching.WordEmbedding.get_wordvec ¤

get_wordvec(
    word: str, embed: str = "IN", normalize: bool = True
) -> np.ndarray[np.float32] | None

Return the embedding vector associated to a word.

PARAMETER	DESCRIPTION
`word`	the word to convert to a vector. TYPE: `str`
`embed`	`IN` uses the input embedding matrix (query/document encoding). `OUT` uses the output embedding matrix (dual-embedding space document ranking). [^1] TYPE: `str` DEFAULT: `'IN'`

[^1]: A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana. https://arxiv.org/pdf/1602.01137.pdf

RETURNS	DESCRIPTION
`np.ndarray[np.float32] \| None`	the nD vector if the word can be vectorized, or `None`. For FastText, OOV words have no `OUT` vector (output embeddings exist only for in-vocabulary words) and return `None` for `embed="OUT"`.

core.batching.WordEmbedding.get_features ¤

get_features(
    tokens: list[str],
    embed: str = "IN",
    use_sif: bool = False,
    sif_smoothing: float = 0.001,
    top_k: int = 0,
) -> np.ndarray[np.float32]

Calls core.nlp.WordEmbedding.get_wordvec over a list of tokens and returns a single centroid vector representing the whole list.

Tokens are aggregated per unique word, so a word’s contribution scales with its in-list frequency (a word occurring n times contributes n × weight). This is mathematically identical to summing over every occurrence, but it also exposes a per-word salience used by top_k.

PARAMETER	DESCRIPTION
`tokens`	list of text tokens. TYPE: `list[str]`
`embed`	see core.nlp.WordEmbedding.get_wordvec TYPE: `str` DEFAULT: `'IN'`
`use_sif`	Use SIF weighting on each term when embedding a full sentence or document. See core.nlp.WordEmbedding.SIF. TYPE: `bool` DEFAULT: `False`
`sif_smoothing`	The SIF smoothing coefficient. TYPE: `float` DEFAULT: `0.001`
`top_k`	length-aware pooling. When `> 0`, keep only the `top_k` most salient unique tokens (highest accumulated `frequency × SIF` weight) before averaging; `0` (default) uses every token. Long documents otherwise drown their topical signal under a long tail of low-salience words, which pulls the centroid toward the corpus mean (centroid dilution) and makes comprehensive pages rank below short, keyword-peaky ones. Capping to the most discriminative tokens de-dilutes long documents while leaving short ones (fewer than `top_k` tokens) untouched. Used at document-vectorization time (see core.batching.batch_vectorize); the default `0` keeps the query path unchanged. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`np.ndarray[np.float32]`	the normalized centroid of word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
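
The per-unique-word aggregation and the `top_k` cap described above can be sketched in plain Python. This is an illustration under stated assumptions, not the method's code: `weighted_centroid` is a hypothetical name, vectors are plain lists stored in a dict, and `weight` stands in for the SIF weight (constant 1.0 when SIF is off).

```python
import math
from collections import Counter


def weighted_centroid(tokens, vectors, weight=lambda token: 1.0, top_k=0):
    """Return the L2-normalized weighted centroid of the known tokens."""
    # Aggregate per unique word: salience = in-list frequency x weight.
    counts = Counter(t for t in tokens if t in vectors)
    salience = {t: n * weight(t) for t, n in counts.items()}
    if top_k > 0:
        kept = sorted(salience, key=salience.get, reverse=True)[:top_k]
    else:
        kept = list(salience)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for t in kept:
        for i, x in enumerate(vectors[t]):
            total[i] += salience[t] * x
    norm = math.sqrt(sum(x * x for x in total))
    # Null vector when no token was found in the dictionary.
    return [x / norm for x in total] if norm > 0 else total
```

Documents with fewer than `top_k` unique known tokens go through the same path as `top_k=0`, since the slice then keeps every token.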

core.batching.WordEmbedding.SIF ¤

SIF(token: str, a: float = 0.005) -> float

Smooth inverse frequency weighting.

This helps refining semantics by under-weighting stopwords, when aggregating word vectors into a document centroid.

Reference

A simple but tough-to-beat baseline for sentence embeddings (2017). Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx

PARAMETER	DESCRIPTION
`token`	the token to weight. It should be in the model vocabulary. TYPE: `str`

RETURNS	DESCRIPTION
`float`	The SIF weight associated with the token, or `0.0` if the token was not found in the vocabulary.

Warning

The core.nlp.WordEmbedding model needs to have been trained with compute_idf=True to prepare the statistics needed by SIF weighting. The method will raise an error if the stats are not available.
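
In the referenced paper, the weight of a word is a / (a + p(w)), where p(w) is its estimated probability in the corpus. The sketch below applies that formula directly; how the library derives p(w) from its stored statistics is not shown in this reference, so `word_prob` and `sif_weight` are hypothetical stand-ins.

```python
def sif_weight(token, word_prob, a=0.005):
    """Smooth inverse frequency: frequent words get weights close to 0."""
    p = word_prob.get(token)
    if p is None:
        return 0.0  # unknown tokens do not contribute
    return a / (a + p)


# Hypothetical unigram probabilities for a frequent and a rare word.
probs = {"the": 0.05, "aperture": 0.0001}
weights = {word: sif_weight(word, probs) for word in probs}
```

A smaller `a` makes the down-weighting of frequent words more aggressive.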

core.batching.WordEmbedding.tokens_to_indices ¤

tokens_to_indices(tokens: list[str]) -> np.ndarray[np.int32]

Convert a list of tokens to a list of their index number in the vocabulary. This yields a more compact, albeit purely symbolic, representation of a tokenized document as a series of integers.

Only in-vocabulary tokens are kept (out-of-vocabulary FastText tokens have no stable vocabulary index). The conversion is reversible and the original token can be found with self.wv.index_to_key[i], where i is the index number output (for each token) from here.

RETURNS	DESCRIPTION
`np.ndarray[np.int32]`	the array of indices as signed 32-bit integers, meaning the vocabulary can contain at most 2^31 (about 2.15 billion) words.
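
The round trip between tokens and indices can be illustrated with plain Python containers. Here `index_to_key` and `key_to_index` are small hand-written stand-ins named after the gensim KeyedVectors attributes, and `to_indices` is a hypothetical helper returning a list rather than a NumPy array.

```python
# Toy vocabulary: position in the list is the index of the word.
index_to_key = ["the", "lens", "aperture"]
key_to_index = {word: i for i, word in enumerate(index_to_key)}


def to_indices(tokens):
    """Map tokens to vocabulary indices, dropping out-of-vocabulary tokens."""
    return [key_to_index[t] for t in tokens if t in key_to_index]


indices = to_indices(["lens", "unknown", "the"])
recovered = [index_to_key[i] for i in indices]
```

Note that the round trip only recovers the tokens that were kept: dropped out-of-vocabulary tokens leave no trace in the index sequence.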

core.batching.Word2Vec ¤

Word2Vec(
    documents: Iterable[Iterable[list[str]]],
    name: str = "word2vec",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count: int = 5,
    sample: float = 0.0005,
    tokenizer: Tokenizer | None = None,
    compute_idf: bool = True,
    n_pc_in: int = 3,
    n_pc_out: int = 0,
    **kwargs: dict[str, Any]
)

Bases: WordEmbedding, gensim.models.Word2Vec

Train a Word2Vec embedding model and compute TF-IDF word statistics on the corpus.

The pre-computed object is automatically saved to VirtualSecretary/models, as per core.utils.get_models_folder

PARAMETER	DESCRIPTION
`documents`	Pre-tokenized training corpus. Structure: the outer iterable holds documents; each document is an iterable of tokenized sentences (lists of tokens). TYPE: `Iterable[Iterable[list[str]]]`
`name`	Name of the model file used for saving/loading. TYPE: `str` DEFAULT: `'word2vec'`
`vector_size`	Dimensionality of word embeddings. TYPE: `int` DEFAULT: `300`
`epochs`	Number of training iterations. Higher values improve quality on small corpora but increase training time. TYPE: `int` DEFAULT: `200`
`window`	Context window size for word co-occurrence. TYPE: `int` DEFAULT: `5`
`min_count`	Minimum frequency threshold for vocabulary filtering. TYPE: `int` DEFAULT: `5`
`sample`	Subsampling rate for frequent words. TYPE: `float` DEFAULT: `0.0005`
`tokenizer`	Tokenizer instance used for preprocessing (if applicable). TYPE: `Tokenizer \| None` DEFAULT: `None`
`compute_idf`	Whether to compute and store IDF statistics for SIF weighting. See core.nlp.Word2Vec.SIF. Disable to reduce model size when SIF is not used. TYPE: `bool` DEFAULT: `True`
`n_pc_in`	Number of principal components to remove on the word embedding vectors for the input space. See core.nlp.Word2Vec.apply_abtt TYPE: `int` DEFAULT: `3`
`n_pc_out`	Number of principal components to remove on the word embedding vectors for the output space. See core.nlp.Word2Vec.apply_abtt TYPE: `int` DEFAULT: `0`
`**kwargs`	Additional parameters forwarded directly to gensim.models.word2vec.Word2Vec. TYPE: `dict[str, Any]` DEFAULT: `{}`

Attributes¤

core.batching.Word2Vec.tokenizer `instance-attribute` ¤

tokenizer: Tokenizer = tokenizer if tokenizer is not None else Tokenizer()

Tokenizer object, instantiated with word replacements and trained for n-grams if needed.

core.batching.Word2Vec.pathname `instance-attribute` ¤

pathname: str = get_models_folder(name)

Path and filename of the saved model

core.batching.Word2Vec.vector_size `instance-attribute` ¤

vector_size: int = vector_size

Number of vector dimensions used to embed words.

core.batching.FastText ¤

FastText(
    documents: Iterable[Iterable[list[str]]],
    name: str = "fasttext",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count: int = 5,
    sample: float = 0.0005,
    tokenizer: Tokenizer | None = None,
    compute_idf: bool = True,
    n_pc_in: int = 3,
    n_pc_out: int = 0,
    min_n: int = 3,
    max_n: int = 6,
    bucket: int = 2000000,
    **kwargs: dict[str, Any]
)

Bases: WordEmbedding, gensim.models.FastText

Drop-in alternative to core.nlp.Word2Vec backed by gensim.models.fasttext.FastText.

Same API, corpus statistics (IDF/SIF), All-but-the-Top post-processing and vector-retrieval interface as Word2Vec (both inherit core.nlp.WordEmbedding), so it is interchangeable everywhere the search core.search.Indexer expects an embedding model. The difference is that word vectors are composed from character n-grams, which gives:

- sensible vectors for out-of-vocabulary tokens (rare domain jargon,
  misspellings, morphological variants) on the query (IN) side,
- robustness to FR/EN morphology, allowing lighter stemming upstream.

Notes

Output (OUT) embeddings exist only for in-vocabulary words, so document vectorization (embed="OUT") still skips OOV tokens.
All-but-the-Top adjusts only the in-vocabulary full-word vectors; OOV vectors reconstructed from sub-word n-grams are left raw.

PARAMETER	DESCRIPTION
`min_n`	smallest character n-gram length for sub-word vectors. TYPE: `int` DEFAULT: `3`
`max_n`	largest character n-gram length for sub-word vectors. TYPE: `int` DEFAULT: `6`
`bucket`	number of hash buckets for sub-word n-grams. TYPE: `int` DEFAULT: `2000000`

Other arguments: see core.nlp.Word2Vec.

Attributes¤

core.batching.FastText.tokenizer `instance-attribute` ¤

tokenizer: Tokenizer = tokenizer if tokenizer is not None else Tokenizer()

Tokenizer object, instantiated with word replacements and trained for n-grams if needed.

core.batching.FastText.pathname `instance-attribute` ¤

pathname: str = get_models_folder(name)

Path and filename of the saved model

core.batching.FastText.vector_size `instance-attribute` ¤

vector_size: int = vector_size

Number of vector dimensions used to embed words.

core.batching.Classifier ¤

Classifier(
    training_set: list[Data],
    name: str,
    word2vec: Word2Vec,
    validate: bool = True,
    variant: str = "svm",
)

Bases: nltk.classify.SklearnClassifier

Initialize a Word2Vec + SVM classification pipeline.

This class wraps a Word2Vec embedding model with a downstream machine-learning classifier (SVM or alternatives).

PARAMETER	DESCRIPTION
`training_set`	List of `Data` samples used for training. If empty, the system will attempt to load a pre-trained model using `name`. TYPE: `list[Data]`
`name`	Identifier used to save and reload the trained model. TYPE: `str`
`word2vec`	Word embedding model used to generate feature vectors. TYPE: `Word2Vec`
`validate`	If True, splits the dataset into training (95%) and testing (5%) subsets and prints evaluation metrics. Useful for classifier selection and sanity checking. TYPE: `bool` DEFAULT: `True`
`variant`	Type of classifier to use: `svm`: RBF-kernel Support Vector Machine (default). Robust and stable across general datasets. `linear svm`: Linear Support Vector Machine. Faster and often better for high-dimensional features. `forest`: Random Forest classifier. Faster than linear SVM in some cases, but produces larger models. TYPE: `str` DEFAULT: `'svm'`

Note

The previous documentation mentioned path and features, but these are not part of the current signature and were removed.

Methods:¤

core.batching.Classifier.get_features_parallel ¤

get_features_parallel(post: Data) -> tuple[str, str]

Thread-safe call to .get_features() to be called in multiprocessing.Pool map

core.batching.Classifier.load `classmethod` ¤

load(name: str)

Load an existing trained model by its name from the ../models folder.

core.batching.Classifier.classify ¤

classify(post: str) -> str

Apply a label on a post based on the trained model.

core.batching.Classifier.prob_classify ¤

prob_classify(post: str) -> tuple[str, float]

Apply a label on a post based on the trained model and output the probability too.

core.batching.StemTokenIndex ¤

StemTokenIndex(db: sqlite3.Connection, tokenizer: Tokenizer)

Build a reverse-lookup table in db mapping stems to tokens.

The rationale is that core.nlp.Tokenizer.tokenize_text (and the higher-level methods calling it internally), when used with stem=True, produces tokens that are illegible to humans. This class helps build a translation dictionary mapping stemmed tokens back to the most probable non-stemmed token, for UI purposes.

RETURNS	DESCRIPTION
`None`	A new indexed `stem_tokens` table in `db` containing 3 columns: `stem`, `token`, `occurences`. Each row records the frequency of the `(stem, token)` couple. TYPE: `None`

Methods:¤

core.batching.StemTokenIndex.most_probable_token ¤

most_probable_token(db: sqlite3.Connection, stem: str) -> str

Return the most probable original token associated to the stem. If the stem doesn’t exist in the database, it is returned as-is.

core.batching.StemTokenIndex.most_probable_tokens ¤

most_probable_tokens(db: sqlite3.Connection, stems: list[str]) -> list[str]

Return the most probable original token for each stem.

Stems not found in DB are returned unchanged.
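
The lookup can be pictured as a single SQL query against the `stem_tokens` table described above. The sketch below is an assumption about the query shape, not the class's code; `lookup_token` is a hypothetical helper and the rows are made up.

```python
import sqlite3


def lookup_token(db, stem):
    """Return the most frequent token recorded for `stem`, or `stem` itself."""
    row = db.execute(
        "SELECT token FROM stem_tokens WHERE stem = ? "
        "ORDER BY occurences DESC LIMIT 1",
        (stem,),
    ).fetchone()
    return row[0] if row else stem


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stem_tokens (stem TEXT, token TEXT, occurences INTEGER)")
db.executemany(
    "INSERT INTO stem_tokens VALUES (?, ?, ?)",
    [("comput", "computer", 12), ("comput", "computing", 30), ("lens", "lenses", 4)],
)
readable = [lookup_token(db, stem) for stem in ["comput", "lens", "xyz"]]
```

An index on `stem` keeps this lookup cheap when it is repeated for every token shown in the UI.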

core.batching.SQLitePageCorpus ¤

SQLitePageCorpus(
    db,
    query,
    params=(),
    atomic_types=(str, bytes),
    max_depth=None,
    yield_rows=False,
)

Lazily stream rows from an SQLite request, avoiding full copy.

Example

    corpus = SQLitePageCorpus(
        db,
        """
        SELECT tokenized
        FROM pages
        WHERE lang IN ('fr', 'en')
        """,
        max_depth=0
    )

- max_depth=0 will not flatten the content, so it will return the original list[list[str]] (list of sentences, aka list of lists of words),
- max_depth=1 flattens documents, so it will return list[str] (list of words).
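
The lazy-streaming idea can be sketched with a generator over an sqlite3 cursor, which fetches rows on demand instead of materializing the whole result set. This is not the class's implementation: `stream_tokenized` is a hypothetical name, the `flatten` flag only mimics the max_depth=1 behaviour, and storing the `tokenized` column as JSON text is an assumption made for the example.

```python
import json
import sqlite3


def stream_tokenized(db, query, params=(), flatten=False):
    """Yield one decoded document per row without loading all rows at once."""
    for (payload,) in db.execute(query, params):
        document = json.loads(payload)  # list of sentences (lists of words)
        if flatten:
            # Merge sentences into a single list of words.
            yield [word for sentence in document for word in sentence]
        else:
            yield document


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT, lang TEXT, tokenized TEXT)")
db.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("https://example.com", "en", json.dumps([["hello", "world"], ["second", "sentence"]])),
)
query = "SELECT tokenized FROM pages WHERE lang = ?"
nested = list(stream_tokenized(db, query, ("en",)))
flat = list(stream_tokenized(db, query, ("en",), flatten=True))
```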

core.batching.Deduplicator ¤

Deduplicator(
    threshold: float = 0.9,
    distance: int = 50,
    discard_params: bool = True,
    n_min: int = 0,
    fix_urls: bool = True,
)

Instantiate a deduplicator object.

Duplicate factorization takes a list of core.types.web_page items.

Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.

You can edit (append to or replace) the list of URLs to ignore, core.deduplicator.Deduplicator.urls_to_ignore, before running the actual process.

Optionally, near-duplicates are detected too, by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.

PARAMETER	DESCRIPTION
`threshold`	the minimum Levenshtein distance ratio between 2 pages contents for those pages to be considered near-duplicates and be factorized. If set to 1.0, the near-duplicates detection is bypassed which results in a huge speed up. TYPE: `float` DEFAULT: `0.9`
`distance`	the near-duplicates search is performed on the nearest elements after the core.types.web_page list has been ordered alphabetically by URL, for performance, assuming near-duplicates will most likely be found on the same domain and at a resembling path. The distance parameter defines how many elements ahead we will look into. TYPE: `int` DEFAULT: `50`
`discard_params`	on modern CMS that enable “pretty URLs” (URL rewriting), pages will be indexed by a `domain/section/subsection/page` path, and URL query parameters will most likely be used by meaningless pages like social sharing links or search results pages, so this parameter can be set to `True` to discard those. On REST-API-driven websites, streaming websites and old CMS using “ugly URLs”, pages will be indexed by `domain?content=id` and the query parameters need to be kept by setting this parameter to `False`. TYPE: `bool` DEFAULT: `True`
`n_min`	domains that have a number of indexed pages below this threshold will be discarded entirely. This avoids indexing random dude’s website, under the assumption that relevant and reliable domains will have several pages indexed. TYPE: `int` DEFAULT: `0`
`fix_urls`	attempt to convert `http` to `https` URLs and remove leading `www.`. This sends DNS requests to assess if the `https` and `www.`-less variants can be reached, which takes at most 2 s per URL. Set to `False` to speed things up. TYPE: `bool` DEFAULT: `True`

Attributes¤

core.batching.Deduplicator.urls_to_ignore `class-attribute` `instance-attribute` ¤

urls_to_ignore: list[str] = [
    "/tag/",
    "/tags/",
    "/category/",
    "/categories/",
    "/author/",
    "/authors/",
    "/profil/",
    "/profiles/",
    "/user/",
    "/users/",
    "/login/",
    "/signup/",
    "/member/",
    "/members/",
    "/cart/",
    "/shop/",
    "/register",
]

URL substrings to find in URLs and remove matching web pages: mostly WordPress archive pages, user profiles and login pages.

Methods:¤

core.batching.Deduplicator.prepare_posts_parallel `classmethod` ¤

prepare_posts_parallel(elem, discard_params, urls_to_ignore, fix_urls)

Canonicalize a core.types.web_page dict for the list path.

Delegates URL normalization to _canonicalize_url and adds list-path-specific fallbacks for length and datetime (which are guaranteed to be pre-computed on the DB path by batch_parse_web_page but may be absent on hand-assembled lists).

Returns the mutated elem dict, or None if the URL must be discarded.

core.batching.Deduplicator.get_unique_urls ¤

get_unique_urls(posts: list[web_page]) -> list[web_page]

Pick the most recent, or otherwise the longer, candidate for each canonical URL.
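
The election rule stated above (most recent first, then longest) maps naturally onto a tuple comparison. The sketch below assumes each page is a dict carrying `url`, `datetime` and `length` keys with comparable, non-null values; `pick_unique_urls` is a hypothetical name and the full priority rules of the real method are not reproduced.

```python
from datetime import datetime


def pick_unique_urls(pages):
    """Keep one page per URL: the most recent, or the longest on a date tie."""
    best = {}
    for page in pages:
        current = best.get(page["url"])
        candidate_key = (page["datetime"], page["length"])
        if current is None or candidate_key > (current["datetime"], current["length"]):
            best[page["url"]] = page
    return list(best.values())


pages = [
    {"url": "u1", "datetime": datetime(2020, 1, 1), "length": 100},
    {"url": "u1", "datetime": datetime(2021, 1, 1), "length": 50},
    {"url": "u2", "datetime": datetime(2020, 1, 1), "length": 10},
    {"url": "u2", "datetime": datetime(2020, 1, 1), "length": 20},
]
winners = pick_unique_urls(pages)
```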

core.batching.Deduplicator.run_on_db ¤

run_on_db(db: sqlite3.Connection, chunksize: int = 4096) -> None

Deduplicate the pages table in-place, matching the full __call__ pipeline.

Runs six sequential phases:

1. URL canonicalization – stream every row through _canonicalize_url (threaded, I/O-bound), normalise URLs, populate _prepared with canonical URL, domain, and metadata copied verbatim from pages (pre-computed by batch_parse_web_page).
2. URL deduplication – for each canonical URL keep the single best row via SQL window functions ordered by _ELECTION_ORDER_URL.
3. Exact-content deduplication – among URL winners, collapse rows that share the same content_hash using _ELECTION_ORDER_CONTENT. Rows without a hash (archival stubs) pass through unchanged.
4. Near-duplicate removal (skipped when threshold == 1.0) – load survivors with non-NULL parsed text into memory, run the parallel Levenshtein window scan, write the final winner set back. Archival stubs bypass this phase entirely.
5. Domain frequency filter (skipped when n_min == 0) – drop every row whose canonical domain appears fewer than n_min times in the survivor set. Rows with NULL domain are kept unconditionally.
6. Table rebuild – atomically replace pages with the winner rows, writing back canonicalised url, domain, and wayback.

All intermediate temp tables are cleaned up on success.

PARAMETER	DESCRIPTION
`db`	Open `sqlite3.Connection` to the database. TYPE: `sqlite3.Connection`
`chunksize`	Rows fetched per batch during Phase 1. TYPE: `int` DEFAULT: `4096`

core.batching.Deduplicator.add_content_hash_column `staticmethod` ¤

add_content_hash_column(db: sqlite3.Connection) -> None

Add (or refresh) a content_hash column on the pages table.

Computes a SHA-1 digest of each row’s parsed field and stores it in content_hash. The column is created if it does not yet exist. Rows with a NULL parsed value are skipped and left with a NULL hash.

A covering index idx_pages_content_hash is created (or left in place) after the update so that subsequent deduplication queries are cheap.

This method is a standalone maintenance utility. The deduplication pipeline (run_on_db) computes hashes inline during Phase 1 and does not require this method to be called first.

Assumption: parsed values fit in memory individually (they are fetched one batch at a time, not all at once).

PARAMETER	DESCRIPTION
`db`	Open `sqlite3.Connection` to the target database. TYPE: `sqlite3.Connection`
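
The behaviour described above (create the column if missing, hash non-NULL `parsed` values batch by batch, then index) can be sketched with the standard library. This is an illustration rather than the method's code: `add_content_hash` and `batch_size` are hypothetical names, `parsed` is assumed to be stored as text, and the table is assumed to have ordinary positive rowids.

```python
import hashlib
import sqlite3


def add_content_hash(db, batch_size=1000):
    """Store the SHA-1 hex digest of `parsed` in a `content_hash` column."""
    columns = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    if "content_hash" not in columns:
        db.execute("ALTER TABLE pages ADD COLUMN content_hash TEXT")
    last_rowid = 0
    while True:
        # Keyset pagination: only one batch of texts is in memory at a time.
        rows = db.execute(
            "SELECT rowid, parsed FROM pages "
            "WHERE parsed IS NOT NULL AND rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, batch_size),
        ).fetchall()
        if not rows:
            break
        db.executemany(
            "UPDATE pages SET content_hash = ? WHERE rowid = ?",
            [(hashlib.sha1(text.encode("utf-8")).hexdigest(), rowid) for rowid, text in rows],
        )
        last_rowid = rows[-1][0]
    db.execute("CREATE INDEX IF NOT EXISTS idx_pages_content_hash ON pages (content_hash)")
    db.commit()
```

Rows with identical `parsed` text end up with identical hashes, which is what lets exact-content deduplication group them with a plain equality test.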

core.batching.Deduplicator.get_unique_content ¤

get_unique_content(posts: list[web_page]) -> list[web_page]

Pick the most recent candidate for each canonical content.

RETURNS	DESCRIPTION
`list[web_page]`	`canonical content: web_page` dictionary

core.batching.Deduplicator.get_close_content ¤

get_close_content(
    posts: list[web_page], threshold: float = 0.9, distance: int = 50
) -> list[web_page]

Find and remove near-duplicates using the Levenshtein ratio.

Delegates the actual scan to _close_content_scan, which parallelises comparisons within each window via a concurrent.futures.ThreadPoolExecutor. This method is the list-path counterpart to _elect_near_duplicates; both call the same shared scan implementation.

The election among near-duplicate candidates honours the same priority rules as URL and content deduplication (non-external > newer > longer > shorter URL) via _elect_group.

PARAMETER	DESCRIPTION
`posts`	List of `core.types.web_page` dicts after URL and exact-content deduplication. TYPE: `list[web_page]`
`threshold`	Minimum Levenshtein ratio for two pages to be considered near-duplicates. Defaults to `self.threshold`. TYPE: `float` DEFAULT: `0.9`
`distance`	Positions ahead to scan from each row after sorting by URL. Defaults to `self.distance`. TYPE: `int` DEFAULT: `50`

RETURNS	DESCRIPTION
`list[web_page]`	Filtered list with near-duplicates removed; one survivor per group.
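
The windowed scan can be illustrated with the standard library. The sketch below swaps the Levenshtein ratio for `difflib.SequenceMatcher.ratio()`, which is a different similarity measure used here only to stay dependency-free. It also simply keeps the first page of each group in URL order instead of applying the election rules, and runs sequentially. `drop_near_duplicates` is a hypothetical name.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(pages, threshold=0.9, distance=50):
    """Remove pages too similar to an earlier page within a sliding window."""
    ordered = sorted(pages, key=lambda page: page["url"])
    if threshold >= 1.0:
        return ordered  # near-duplicate detection is bypassed
    dropped = set()
    for i, page in enumerate(ordered):
        if i in dropped:
            continue
        # Only compare against the next `distance` neighbours by URL.
        for j in range(i + 1, min(i + 1 + distance, len(ordered))):
            if j in dropped:
                continue
            ratio = SequenceMatcher(None, page["content"], ordered[j]["content"]).ratio()
            if ratio >= threshold:
                dropped.add(j)
    return [page for k, page in enumerate(ordered) if k not in dropped]
```

Limiting comparisons to a window turns a quadratic scan into a roughly linear one, at the cost of missing near-duplicates whose URLs sort far apart.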

core.batching.Deduplicator.run_on_list ¤

run_on_list(posts: list[web_page]) -> list[web_page]

Deduplicate an in-memory list of web pages, matching the full pipeline.

This is the list-based counterpart to run_on_db. The two methods are kept symmetrical: both run the same four phases (URL canonicalization, exact-URL deduplication, exact-content deduplication, optional near-duplicate removal) and honour the same election rules.

Note

posts is consumed and partially destroyed during processing to avoid keeping two copies in memory simultaneously.

PARAMETER	DESCRIPTION
`posts`	Flat list of `core.types.web_page` dicts. The list is modified in-place; callers should not rely on its contents after this call returns. TYPE: `list[web_page]`

RETURNS	DESCRIPTION
`list[web_page]`	Deduplicated list of sanitised `core.types.web_page` dicts, ready for downstream use. Also writes a `domains` frequency file via core.utils.get_models_folder.

Functions:¤

core.batching.parse_lang_to_iso639_1 ¤

parse_lang_to_iso639_1(value: str | None) -> str | None

Normalize language identifier to ISO 639-1.

core.batching.guess_language ¤

guess_language(
    string: str,
    stopwords_threshold: float = 0.05,
    letters_threshold: float = 0.8,
) -> str | None

Basic language guesser based on stopwords detection.

Stopwords are the most common words of a language: for each language, we count how many stopwords we found and return the language having the most matches. It is accurate for paragraphs and long documents, not so much for short sentences.

PARAMETER	DESCRIPTION
`string`	the string to analyze. Needs to be lowercased but to retain accents and diacritics. TYPE: `str`
`stopwords_threshold`	the minimum ratio of stopwords to total words in the string required to settle on a language. For example, Japanese companies often publish technical reports written in Japanese that still contain some English. If less than 5% of the words are known English stopwords, we can conclude the text is not English. TYPE: `float` DEFAULT: `0.05`
`letters_threshold`	the minimum ratio of Roman (Latin) characters to all characters (including numbers, symbols and non-Latin alphabets) required to settle on a language. TYPE: `float` DEFAULT: `0.8`

RETURNS	DESCRIPTION
`str \| None`	ISO 639-1 language code. Defaults to “en” if nothing found.
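
The stopword-counting idea can be sketched as follows. This is a simplified illustration, not the function's code: the stopword sets are tiny hand-picked samples keyed by ISO code, the `letters_threshold` check is omitted, the sketch returns `None` instead of any default when nothing matches, and `guess_by_stopwords` is a hypothetical name.

```python
import re

# Tiny illustrative stopword samples; real lists are much longer.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "it", "that"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une"},
}


def guess_by_stopwords(text, stopwords_threshold=0.05):
    """Return the language whose stopwords appear most often, or None."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return None
    counts = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    # Reject the guess when too few words are stopwords of that language.
    return best if counts[best] / len(words) >= stopwords_threshold else None
```

Short strings contain few stopwords, which is why this kind of guesser is more reliable on paragraphs than on single sentences.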

core.batching.detect_language ¤

detect_language(text: str) -> str | None

Detect language from arbitrary text safely.

RETURNS	DESCRIPTION
`str \| None`	ISO 639-1 language code.

core.batching.tokenize_document_to_words ¤

tokenize_document_to_words(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into single words

PARAMETER	DESCRIPTION
`language`	ISO 639-1 language code. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[str]`	Bag of words for the whole document. Sentence delimiters are removed.

core.batching.split_document_to_sentences ¤

split_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into a list of sentences.

PARAMETER	DESCRIPTION
`language`	ISO 639-1 language code. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[str]`	List of sentences as full text.

core.batching.tokenize_document_to_sentences ¤

tokenize_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[list[str]]

Split a text into single words as a list of lists

PARAMETER	DESCRIPTION
`language`	ISO 639-1 language code. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[list[str]]`	List of sentences, each sentence is itself a list of words.

core.batching.split_url ¤

split_url(url: str) -> tuple[str, str, str, str, str] | None

Split a well-formed URL following RFC3986 into base elements.

RETURNS	DESCRIPTION
`tuple[str, str, str, str, str] \| None`	a tuple of `(protocol, domain, page, parameters, anchor)`. Empty or missing fields are initialized with empty strings, so there is no need for individual `None` checks. If the `url` input doesn’t match a URL format, return `None`.
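
A comparable five-part split can be obtained from the standard library's `urllib.parse.urlsplit`. The sketch below is only an approximation of the documented behaviour: `split_url_sketch` is a hypothetical name, and requiring both a scheme and a host to accept the input is an assumption.

```python
from urllib.parse import urlsplit


def split_url_sketch(url):
    """Return (protocol, domain, page, parameters, anchor), or None."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return None  # e.g. malformed IPv6 host
    if not parts.scheme or not parts.netloc:
        return None
    # urlsplit already uses empty strings for missing components.
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
```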

core.batching.ensure_decompressed ¤

ensure_decompressed(path: str) -> str

Inflate a gzip sibling path + ".gz" into path if it is newer, then return path.

Deploying over FTP gives us no way to run a remote command, so the heavy search_engine.joblib / chantal-slim.db deploy artifacts are gzipped locally (the .db shrinks ~60%) and inflated here instead: whichever worker handles the first request after a deploy pays the one-time gunzip cost via an atomic replace, and every later worker sees an up-to-date plain file and just returns immediately (a single stat).

If path + ".gz" is missing, older than path, or fails to decompress (e.g. caught mid-upload), this is a no-op and the existing path – if any – is left untouched, so a request never breaks because of a deploy in flight.
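The behaviour described above (compare modification times, inflate to a scratch file, swap it in atomically, and ignore a broken archive) can be sketched as follows. `ensure_decompressed_sketch` and the scratch-file naming are assumptions made for illustration.

```python
import gzip
import os
import shutil
import zlib


def ensure_decompressed_sketch(path: str) -> str:
    gz_path = path + ".gz"
    try:
        gz_mtime = os.stat(gz_path).st_mtime
    except OSError:
        return path  # no archive: nothing to do
    try:
        if os.stat(path).st_mtime >= gz_mtime:
            return path  # plain file is already up to date
    except OSError:
        pass  # plain file is missing: inflate it

    tmp_path = f"{path}.tmp.{os.getpid()}"  # hypothetical scratch name
    try:
        with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.replace(tmp_path, path)  # atomic when both are on the same filesystem
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupt archive (e.g. mid-upload): keep the existing file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return path
```

Writing to a scratch file first means readers never observe a half-inflated `path`.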

core.batching.adapt_array ¤

adapt_array(arr: np.ndarray)

Convert a NumPy array into a binary form that SQLite can store as a BLOB. Adapted from http://stackoverflow.com/a/31312102/190597 (SoulNibbler).

core.batching.create_db ¤

create_db(name: str) -> sqlite3.Connection

Create the pages table if needed and add any missing columns. This doesn’t destroy existing tables, rows or columns, so it’s safe to run on any database.

Warning

Columns are inferred directly from web_page.__annotations__. Existing columns are preserved unchanged.

The url column is used as the PRIMARY KEY.
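As an illustration of schema inference from annotations, the sketch below creates the table when missing and adds only the absent columns. `WebPageSketch` is a hypothetical stand-in for `web_page`, and the function takes an open connection rather than a database name, which differs from the documented signature.

```python
import sqlite3
from typing import TypedDict


class WebPageSketch(TypedDict):  # hypothetical stand-in for core.types.web_page
    url: str
    title: str
    content: str


def create_db_sketch(conn: sqlite3.Connection, record_type=WebPageSketch) -> None:
    columns = list(record_type.__annotations__)
    defs = ['"url" TEXT PRIMARY KEY'] + [f'"{c}"' for c in columns if c != "url"]
    # No-op when the table already exists, so existing rows are untouched.
    conn.execute(f"CREATE TABLE IF NOT EXISTS pages ({', '.join(defs)})")
    existing = {row[1] for row in conn.execute("PRAGMA table_info(pages)")}
    for col in columns:
        if col not in existing:
            # Adding a column keeps existing rows; new cells are NULL.
            conn.execute(f'ALTER TABLE pages ADD COLUMN "{col}"')
    conn.commit()
```

Running it twice against the same database changes nothing the second time.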

core.batching.create_temp_db ¤

create_temp_db(
    min_free: float = 2.0, filename: str | None = None
) -> sqlite3.Connection

Create a temporary SQLite database file (in /dev/shm when available) and initialize the pages table according to web_page annotations.

PARAMETER	DESCRIPTION
`min_free`	minimum available disk space in GiB required to create the temporary database. This is checked at runtime and the function will raise an error if the condition is not met. TYPE: `float` DEFAULT: `2.0`
`filename`	the full path and filename to save the temporary database, if it needs to be reused at some point. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`sqlite3.Connection`	the sqlite3.Connection opened in bulk mode.

Warning

The temporary SQLite database doesn’t use the web_page URL as primary key, to allow later deduplication.

core.batching.delete_temp_db ¤

delete_temp_db(db: sqlite3.Connection)

Close and delete a temporary database in one shot.
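The lifecycle of a temporary database (free-space check, creation without a primary key, then close and delete) might look like the sketch below. All names are hypothetical, the three-column schema is a placeholder for the real `web_page` annotations, and raising `OSError` on low disk space is an assumption about the error type.

```python
import os
import shutil
import sqlite3
import tempfile
from typing import Optional

GIB = 1024 ** 3


def create_temp_db_sketch(min_free: float = 2.0,
                          filename: Optional[str] = None) -> sqlite3.Connection:
    if filename is None:
        shm = "/dev/shm"
        usable = os.path.isdir(shm) and os.access(shm, os.W_OK)
        folder = shm if usable else tempfile.gettempdir()
    else:
        folder = os.path.dirname(os.path.abspath(filename))
    free_gib = shutil.disk_usage(folder).free / GIB
    if free_gib < min_free:
        raise OSError(f"only {free_gib:.2f} GiB free in {folder}")
    if filename is None:
        fd, filename = tempfile.mkstemp(suffix=".db", dir=folder)
        os.close(fd)
    conn = sqlite3.connect(filename)
    # No PRIMARY KEY on url: duplicates are allowed until deduplication.
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, content TEXT)")
    return conn


def delete_temp_db_sketch(conn: sqlite3.Connection) -> None:
    # PRAGMA database_list yields (seq, name, file); the first row is "main".
    path = conn.execute("PRAGMA database_list").fetchone()[2]
    conn.close()
    if path and os.path.exists(path):
        os.remove(path)
```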

core.batching.open_db ¤

open_db(name: str, mode: str = 'rw') -> sqlite3.Connection

Open an SQLite database with workload-specific optimizations.

PARAMETER	DESCRIPTION
`name`	Database identifier/path passed to `get_models_folder()`. TYPE: `str`
`mode`	“rw”: Generic read/write mode. “ro”: Read-only immutable mode optimized for serving/search workloads. “bulk”: Bulk-ingestion mode optimized for large batch writes. TYPE: `str` DEFAULT: `'rw'`

RETURNS	DESCRIPTION
`sqlite3.Connection`	sqlite3.Connection
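The three modes could map to SQLite settings roughly as in the sketch below. The PRAGMA choices are assumptions picked for illustration, not the module's actual tuning, and `open_db_sketch` takes a plain file path instead of going through `get_models_folder()`.

```python
import sqlite3
from pathlib import Path


def open_db_sketch(path: str, mode: str = "rw") -> sqlite3.Connection:
    if mode == "ro":
        # URI filename: read-only and immutable, so SQLite skips locking.
        uri = Path(path).resolve().as_uri() + "?mode=ro&immutable=1"
        conn = sqlite3.connect(uri, uri=True)
        conn.execute("PRAGMA query_only = ON")
        return conn
    if mode not in ("rw", "bulk"):
        raise ValueError(f"unknown mode: {mode!r}")
    conn = sqlite3.connect(path)
    # Only takes effect on a new database, before any table is created.
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    if mode == "bulk":
        # Assumed trade-off: faster batch writes, weaker crash safety.
        conn.execute("PRAGMA synchronous = OFF")
        conn.execute("PRAGMA journal_mode = MEMORY")
    return conn
```

An immutable read-only handle is only safe while no other process writes to the file, which fits a serving workload where the database is replaced rather than edited.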

core.batching.compress_db ¤

compress_db(
    db: sqlite3.Connection,
    delete_query: str | None = None,
    delete_params: tuple | None = None,
    delete_columns: list[str] | None = None,
    repack: bool = False,
)

Optionally delete rows, then reclaim SQLite disk space.

Two reclaim strategies, picked automatically:

- Incremental (cheap, default): when the database was created with auto_vacuum = INCREMENTAL (see core.batching.open_db), free pages are returned to the OS in place via PRAGMA incremental_vacuum. No full copy is made, so this needs no scratch space and cannot hit the “database or disk is full” trap. It does not defragment.
- Full repack (repack=True, or as a fallback when the DB predates the auto_vacuum setting): rewrites the whole DB tightly via VACUUM INTO + online backup. Defragments and, as a side effect, applies any pending auto_vacuum mode change so legacy DBs convert to incremental on their first full repack.

PARAMETER	DESCRIPTION
`db`	SQLite connection TYPE: `sqlite3.Connection`
`delete_query`	full DELETE SQL query TYPE: `str \| None` DEFAULT: `None`
`delete_params`	optional SQL parameters TYPE: `tuple \| None` DEFAULT: `None`
`delete_columns`	columns to NULL out before reclaiming space TYPE: `list[str] \| None` DEFAULT: `None`
`repack`	force a full defragmenting rewrite (use for slim deliverables) TYPE: `bool` DEFAULT: `False`
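The choice between the two strategies can be sketched as below. For brevity the full repack uses a plain `VACUUM`, which is a simpler stand-in for the `VACUUM INTO` + online backup route described above and does need scratch space. `reclaim_space_sketch` is a hypothetical name.

```python
import sqlite3


def reclaim_space_sketch(conn: sqlite3.Connection, repack: bool = False) -> str:
    conn.commit()  # VACUUM cannot run inside an open transaction
    # PRAGMA auto_vacuum reports 0 (none), 1 (full) or 2 (incremental).
    auto_vacuum = conn.execute("PRAGMA auto_vacuum").fetchone()[0]
    if auto_vacuum == 2 and not repack:
        # Fetch all rows so the statement runs to completion.
        conn.execute("PRAGMA incremental_vacuum").fetchall()
        return "incremental"
    # A pending mode change is applied by the next full rewrite.
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("VACUUM")
    return "repack"
```

A legacy database therefore takes the repack branch once, and later calls fall through to the cheap incremental branch.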

core.batching.is_primary_key ¤

is_primary_key(db: sqlite3.Connection, table: str, column: str) -> bool

Check whether column is part of the PRIMARY KEY of table.
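One way to perform this check is through `PRAGMA table_info`, whose last field is non-zero for primary-key columns. This is a sketch, and `is_primary_key_sketch` is a hypothetical name.

```python
import sqlite3


def is_primary_key_sketch(conn: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA arguments cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    # Row layout: (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    return any(name == column and pk > 0 for _, name, _, _, _, pk in rows)
```

With a composite key, every column of the key reports `pk > 0`, which matches the "part of the PRIMARY KEY" wording.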

core.batching.populate_db ¤

populate_db(
    db: sqlite3.Connection, pages: list[web_page], batch_size: int = 4096
)

Insert or update web_page records into the SQLite database.

Existing rows are matched using the PRIMARY KEY url.

Warning

Array-like Python values are converted to bytearray then to bytes in order to be handled as BLOB by SQLite.
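A batched upsert keyed on `url`, with array-like values converted to `bytes`, might be sketched as follows. The column subset, the use of plain dicts instead of `web_page`, and the stdlib `array` type standing in for NumPy arrays are all assumptions.

```python
import sqlite3
from array import array
from itertools import islice

COLUMNS = ["url", "title", "vectorized"]  # hypothetical subset of web_page fields


def to_sql_value(value):
    # Buffer-like values go through bytearray, then bytes, to be stored as BLOB.
    if isinstance(value, (array, memoryview, bytearray)):
        return bytes(bytearray(value))
    return value


def populate_sketch(conn: sqlite3.Connection, pages: list[dict],
                    batch_size: int = 4096) -> None:
    col_list = ", ".join(COLUMNS)
    marks = ", ".join("?" for _ in COLUMNS)
    updates = ", ".join(f"{c} = excluded.{c}" for c in COLUMNS if c != "url")
    sql = (f"INSERT INTO pages ({col_list}) VALUES ({marks}) "
           f"ON CONFLICT(url) DO UPDATE SET {updates}")
    rows = (tuple(to_sql_value(page.get(c)) for c in COLUMNS) for page in pages)
    while batch := list(islice(rows, batch_size)):
        conn.executemany(sql, batch)
        conn.commit()  # one transaction per batch keeps memory bounded
```

The `ON CONFLICT ... DO UPDATE` form requires the destination table to declare `url` as PRIMARY KEY or UNIQUE.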

core.batching.db_to_list ¤

db_to_list(db: sqlite3.Connection) -> list[web_page]

Extract all web_page rows from the pages table in db as a list of web_page

core.batching.migrate_url_to_primary_key ¤

migrate_url_to_primary_key(db: sqlite3.Connection)

Rebuild the pages table using url as PRIMARY KEY for older databases that didn’t use a primary key.

core.batching.merge_databases ¤

merge_databases(old_db: sqlite3.Connection, new_db: sqlite3.Connection)

Merge two pages databases.

Rows from old_db are inserted into new_db only if their URL does not already exist.

Existing rows in new_db are preserved unchanged.

Only columns existing in BOTH databases are copied.
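The merge rules above (insert only unknown URLs, copy only shared columns) can be sketched with `INSERT OR IGNORE`. `merge_sketch` is a hypothetical name, and returning the number of inserted rows is an addition made for the example.

```python
import sqlite3


def _columns(conn: sqlite3.Connection) -> list[str]:
    return [row[1] for row in conn.execute("PRAGMA table_info(pages)")]


def merge_sketch(old_db: sqlite3.Connection, new_db: sqlite3.Connection) -> int:
    new_cols = set(_columns(new_db))
    shared = [c for c in _columns(old_db) if c in new_cols]
    col_list = ", ".join(f'"{c}"' for c in shared)
    marks = ", ".join("?" for _ in shared)
    before = new_db.total_changes
    rows = old_db.execute(f"SELECT {col_list} FROM pages")
    # Rows whose url already exists in new_db violate the key and are skipped.
    new_db.executemany(
        f"INSERT OR IGNORE INTO pages ({col_list}) VALUES ({marks})", rows
    )
    new_db.commit()
    return new_db.total_changes - before
```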

core.batching.update_pages_from_database ¤

update_pages_from_database(
    target_db: sqlite3.Connection, source_db: sqlite3.Connection
) -> list[str]

Update rows in target_db.pages from source_db.pages using url as PRIMARY KEY.

Only shared columns are updated.

Returns missing_urls: URLs present in target_db but absent from source_db.
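A minimal sketch of this update, assuming both connections expose a `pages` table with a `url` column; `update_sketch` is a hypothetical name and the row-by-row loop favours clarity over speed.

```python
import sqlite3


def _columns(conn: sqlite3.Connection) -> list[str]:
    return [row[1] for row in conn.execute("PRAGMA table_info(pages)")]


def update_sketch(target_db: sqlite3.Connection,
                  source_db: sqlite3.Connection) -> list[str]:
    target_cols = set(_columns(target_db))
    shared = [c for c in _columns(source_db) if c in target_cols and c != "url"]
    source_urls = {url for (url,) in source_db.execute("SELECT url FROM pages")}
    if shared:
        select_cols = ", ".join(f'"{c}"' for c in shared)
        assignments = ", ".join(f'"{c}" = ?' for c in shared)
        # url is selected last so each row lines up with the placeholders.
        rows = source_db.execute(f"SELECT {select_cols}, url FROM pages")
        target_db.executemany(
            f"UPDATE pages SET {assignments} WHERE url = ?", rows
        )
        target_db.commit()
    target_urls = target_db.execute("SELECT url FROM pages ORDER BY url")
    return [url for (url,) in target_urls if url not in source_urls]
```

Rows that exist only in the source are ignored here, since an `UPDATE` never inserts.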

core.batching.import_pages ¤

import_pages(
    source_db: str | sqlite3.Connection,
    destination_db: str | sqlite3.Connection,
    where_clause: str = "1=1",
    params: tuple = (),
    preserve_derived: list[str] | None = None,
) -> int

Import rows from one SQLite database into another.

Both source_db and destination_db may be either a filesystem path (str) or an active sqlite3.Connection handle. Passing a Connection is the only way to target a :memory: database, since those cannot be addressed by path.

Connection lifecycle:

- Path supplied: the function opens, commits, and closes the connection itself (original behaviour).
- Connection supplied: the caller retains full control; the connection is neither committed nor closed here, so the import can participate in a larger transaction.

Rows are copied from source.pages into destination.pages. Existing rows are updated on conflict of the url primary key. Columns present in the destination but absent from the source receive NULL. Both schemas are discovered at runtime, so the function adapts automatically if either evolves.

PARAMETER	DESCRIPTION
`source_db`	Path to, or an open connection for, the source SQLite database. TYPE: `str \| sqlite3.Connection`
`destination_db`	Path to, or an open connection for, the destination SQLite database. TYPE: `str \| sqlite3.Connection`
`where_clause`	SQL WHERE clause applied to `source.pages`. Example: `"domain = ? AND date >= ?"` TYPE: `str` DEFAULT: `'1=1'`
`params`	Positional parameters bound to where_clause. TYPE: `tuple` DEFAULT: `()`
`preserve_derived`	columns whose existing value in the destination must be preserved when a conflicting (same-`url`) row’s content is unchanged, and only overwritten when the content changed (detected via `content_hash`). Use this when merging a freshly-crawled source that has not computed these derived columns yet, so re-crawling an unchanged page does not wipe its expensive artifacts (e.g. `["tokenized", "stemmed", "vectorized"]`). `None` keeps the plain “overwrite everything” upsert behaviour. TYPE: `list[str] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`int`	Number of affected rows.

Examples:

# File → file (unchanged from before)
import_pages("old.db", "new.db", "domain = ?", ("example.com",))

# In-memory source → file destination
import_pages(mem_conn, "new.db")

# File source → in-memory destination (e.g. for tests)
import_pages("prod.db", mem_conn, "date >= ?", ("2024-01-01",))

# Both in-memory
import_pages(src_conn, dst_conn)
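The `preserve_derived` behaviour can be expressed as a conditional upsert. The sketch below only builds the SQL statement. `build_upsert_sql` is a hypothetical helper, and comparing `content_hash` with `IS` (so that two NULL hashes count as unchanged) is an assumption.

```python
from typing import Optional


def build_upsert_sql(columns: list[str],
                     preserve_derived: Optional[list[str]] = None) -> str:
    preserved = set(preserve_derived or [])
    assignments = []
    for col in columns:
        if col == "url":
            continue
        if col in preserved:
            # Keep the destination value while the content hash is unchanged.
            assignments.append(
                f'"{col}" = CASE WHEN excluded.content_hash IS pages.content_hash '
                f'THEN pages."{col}" ELSE excluded."{col}" END'
            )
        else:
            assignments.append(f'"{col}" = excluded."{col}"')
    col_list = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    return (f"INSERT INTO pages ({col_list}) VALUES ({marks}) "
            f'ON CONFLICT(url) DO UPDATE SET {", ".join(assignments)}')
```

All `SET` expressions are evaluated against the row as it was before the update, so overwriting `content_hash` in the same statement does not affect the comparison.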

core.batching.inspect_db ¤

inspect_db(db: sqlite3.Connection, message: str = '') -> None

Print useful metadata and statistics about a SQLite database.

PARAMETER	DESCRIPTION
`db`	active database connection TYPE: `sqlite3.Connection`
`message`	optional label used to tell several inspections apart. TYPE: `str` DEFAULT: `''`

core.batching.sanitize_web_page ¤

sanitize_web_page(page: web_page) -> web_page

Ensure existence and validity of web_page keys/values.

core.batching.batch_guess_dates ¤

batch_guess_dates(db: sqlite3.Connection, chunksize: int = 2048)

High-throughput parallel datetime parsing.

core.batching.batch_parse_web_page ¤

batch_parse_web_page(
    documents: sqlite3.Connection,
    tokenizer: Tokenizer,
    chunksize: int = 512,
    cores: int | None = None,
    only_none: bool = False,
)

High-performance parallel parsing for core.types.web_page objects

This function cleans up text encoding issues and multiple spacings in the web_page title and content. It prepares the web_page["parsed"] field from title and content for the next stages of tokenization, and updates language (using the declared ISO code or machine-learned detection).

It must be called before core.deduplicator.Deduplicator, so that deduplication compares web pages on a clean parsed version.

PARAMETER	DESCRIPTION
`documents`	any database having core.types.web_page rows stored in a `pages` table and stored on the filesystem. It cannot be a memory-hosted database: each parallel worker will open its own copy by file path. TYPE: `sqlite3.Connection`
`tokenizer`	only used for its core.nlp.Tokenizer.normalize_text method. TYPE: `Tokenizer`
`chunksize`	number of SQLite rows to process at once. Very large chunks are not helpful, since some batches may take longer than others depending on text length. TYPE: `int` DEFAULT: `512`
`cores`	CPU cores to use for parallel processing. TYPE: `int \| None` DEFAULT: `None`
`only_none`	parse only the rows that have not been parsed yet (`parsed IS NULL`). Each worker recomputes `parsed`/`content_hash`/`length`/`lang` from the raw `title`/`content`, never from the existing `parsed`, so already-parsed rows are byte-for-byte identical on re-run and safe to skip. Use this on an incrementally-updated index (freshly-crawled pages arrive already parsed via the temporary DB) to avoid re-normalizing the whole corpus every day. If `False` (default), the whole database is re-parsed, which is what you want when the normalization logic itself changed. TYPE: `bool` DEFAULT: `False`
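How `chunksize` and `only_none` could shape the work units is sketched below: the generator yields lists of URLs that a worker would then process after opening the database by file path. `iter_url_chunks` is a hypothetical helper, and the parallel dispatch itself is left out.

```python
import sqlite3
from typing import Iterator


def iter_url_chunks(conn: sqlite3.Connection, chunksize: int = 512,
                    only_none: bool = False) -> Iterator[list[str]]:
    # only_none restricts the work to rows that were never parsed.
    where = "WHERE parsed IS NULL" if only_none else ""
    cursor = conn.execute(f"SELECT url FROM pages {where} ORDER BY url")
    while True:
        chunk = [url for (url,) in cursor.fetchmany(chunksize)]
        if not chunk:
            return
        yield chunk
```

Handing out URLs rather than full rows keeps the payload sent to each worker small.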

core.batching.batch_tokenize ¤

batch_tokenize(
    db: sqlite3.Connection,
    tokenizer: Tokenizer,
    chunksize: int = 512,
    urls: list[str] | None = None,
    only_none: bool = True,
)

Tokenize web_page rows non-destructively, in parallel and in a RAM-friendly way, directly in the database.

Populate the tokenized database column from the parsed column. This needs to run after core.batching.batch_parse_web_page and prepares the data for n-gram training, if any, or for stemming.

Note

The tokenization is always non-destructive: it doesn’t apply stemming, stopwords removal, normalization, or n-grams. Original sentences can be reconstructed by joining the list of tokens back together.

PARAMETER	DESCRIPTION
`urls`	list of URLs to tokenize. If None, the whole database is processed. TYPE: `list[str] \| None` DEFAULT: `None`
`only_none`	tokenize only the new entries that have not been tokenized already. If `False`, force-update the whole database. It has no effect when `urls` are explicitly specified. TYPE: `bool` DEFAULT: `True`
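The non-destructive property means the token list carries enough information to rebuild the input. The sketch below shows one way to get that guarantee, by keeping whitespace runs as tokens. This is an assumption made for illustration, since the real tokenizer may separate tokens differently, and `tokenize_lossless` is a hypothetical name.

```python
import re

# Every character is whitespace, a word character, or neither,
# so the three alternatives together cover the whole input.
_TOKEN = re.compile(r"\s+|\w+|[^\w\s]")


def tokenize_lossless(text: str) -> list[str]:
    return _TOKEN.findall(text)


original = "Don't stop, it's 5 p.m.!"
tokens = tokenize_lossless(original)
rebuilt = "".join(tokens)
```

Here `rebuilt` equals `original`, which is the round-trip that a destructive step such as stemming would break.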

core.batching.batch_stem ¤

batch_stem(
    db: sqlite3.Connection,
    tokenizer: Tokenizer,
    chunksize: int = 512,
    urls: list[str] | None = None,
    only_none: bool = True,
)

Tokenize and stem a list of web_pages in parallel, in a RAM-friendly way, directly in database.

Populate the stemmed database column from the tokenized column. This needs to run after core.batching.batch_tokenize. This step is destructive: it applies stemming, stopwords removal, normalization and n-grams if available.

PARAMETER	DESCRIPTION
`urls`	list of URLs to stem. If None, the whole database is processed. TYPE: `list[str] \| None` DEFAULT: `None`
`only_none`	stem only the new entries that have not been stemmed already. If `False`, force-update the whole database. It has no effect when `urls` are explicitly specified. TYPE: `bool` DEFAULT: `True`

core.batching.batch_vectorize ¤

batch_vectorize(
    db: sqlite3.Connection,
    word2vec: Word2Vec,
    chunksize: int = 256,
    title_weight: float = 0.5,
    use_sif: bool = True,
    sif_smoothing: float = 0.001,
    body_top_k: int = 48,
    only_none: bool = True,
)

Vectorize the documents of the db database using the provided embedding model, using all available cores.

Reads the stemmed and title columns and writes the vectorized column. Each document vector is the SIF-weighted OUT-embedding centroid of the body, blended with a separate centroid of the (re-stemmed) title weighted by title_weight, then L2-normalized. Title-boosting counteracts the centroid dilution that buries long, focused pages under their own body text.

PARAMETER	DESCRIPTION
`title_weight`	relative weight of the title centroid in the blend. `0` reproduces the plain body-only centroid. TYPE: `float` DEFAULT: `0.5`
`use_sif`	SIF-weight terms when building the centroids. TYPE: `bool` DEFAULT: `True`
`sif_smoothing`	SIF smoothing constant `a` (see core.nlp.WordEmbedding.SIF). TYPE: `float` DEFAULT: `0.001`
`body_top_k`	length-aware pooling for the body centroid: keep only the `body_top_k` most salient tokens per document so long pages are de-diluted (see core.nlp.WordEmbedding.get_features). `0` disables it (plain full-document centroid). The title centroid is always built from all title tokens. TYPE: `int` DEFAULT: `48`
`only_none`	vectorize only the rows that have not been vectorized yet (`vectorized IS NULL`). On a daily index update this skips every unchanged page. Retraining the embedding model (chantal-02) or the tokenizer (chantal-01) wipes the `vectorized` column, which forces a full re-vectorization on the next run. Set to `False` to force re-vectorizing the whole database in place (e.g. when only vectorization hyper-parameters changed, without a model retrain). TYPE: `bool` DEFAULT: `True`
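The blend of body and title centroids can be sketched in pure Python. Several points are assumptions: the SIF weight is taken as `a / (a + p(w))` with `p(w)` the relative term frequency, centroids are normalized by the sum of weights, and the blend is `body + title_weight * title`. OUT-embeddings, `body_top_k` pooling and title re-stemming are left out, and all names are hypothetical.

```python
import math


def sif_centroid(tokens: list[str], vectors: dict[str, list[float]],
                 frequencies: dict[str, float], a: float = 0.001) -> list[float]:
    dim = len(next(iter(vectors.values())))  # assumes a non-empty vocabulary
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary tokens are skipped
        weight = a / (a + frequencies.get(token, 0.0))  # rarer terms weigh more
        total = [t + weight * v for t, v in zip(total, vec)]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


def document_vector_sketch(body: list[str], title: list[str],
                           vectors: dict[str, list[float]],
                           frequencies: dict[str, float],
                           title_weight: float = 0.5,
                           a: float = 0.001) -> list[float]:
    body_c = sif_centroid(body, vectors, frequencies, a)
    title_c = sif_centroid(title, vectors, frequencies, a)
    blended = [b + title_weight * t for b, t in zip(body_c, title_c)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended] if norm else blended  # L2-normalize
```

With `title_weight=0` the title term vanishes and the result is the normalized body centroid, which matches the parameter description above.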
