core.batching¤

High-performance, parallelized, high-level methods to process large corpora of documents.

Interfaces NLP processing with database entries, for efficient RAM management.

Database structure is hard-coded and expects data to conform to the structures defined in core.database and core.types.

© 2026 - Aurélien Pierre

Attributes¤

core.batching.LANG_MAP module-attribute ¤

LANG_MAP = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
    "it": "italian",
    "pt": "portuguese",
    "nl": "dutch",
    "sv": "swedish",
    "no": "norwegian",
    "da": "danish",
    "fi": "finnish",
    "ru": "russian",
    "ro": "romanian",
    "hu": "hungarian",
    "tr": "turkish",
}

Map ISO 639-1 language codes of supported languages to their full names, as used by pre-trained corpora.

core.batching.LANG_MAP_REVERSE module-attribute ¤

LANG_MAP_REVERSE = {v: k for k, v in (LANG_MAP.items())}

Map the full names of supported languages, as used by pre-trained corpora, to ISO 639-1 language codes.
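
As a quick illustration of how the two maps mirror each other, the sketch below rebuilds a shortened copy of LANG_MAP (three entries only, for brevity) and derives the reverse map with the same comprehension as above.

```python
# Shortened copy of LANG_MAP, for illustration only.
LANG_MAP = {"en": "english", "fr": "french", "de": "german"}

# Same construction as in the module: swap keys and values.
LANG_MAP_REVERSE = {v: k for k, v in LANG_MAP.items()}

print(LANG_MAP["fr"])              # full name from an ISO 639-1 code
print(LANG_MAP_REVERSE["german"])  # ISO 639-1 code from a full name
```

The round trip only works because every full name is unique; a duplicated value would silently overwrite an entry in the reverse map.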

core.batching.STOPWORDS_DICT module-attribute ¤

STOPWORDS_DICT = {
    language: (set(STOPWORDS_DICT[language])) for language in STOPWORDS_DICT
}

Dictionary of stopwords (as sets) mapped to full language names (as keys).

core.batching.regex_starter module-attribute ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;|\\>)'

Start of line, or start of document, or start of markup

core.batching.regex_stopper module-attribute ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|\\<)'

End of line, or end of document, or end of markup

core.batching.end_of_word module-attribute ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

End of word, or end of line, or end of document, or end of markup

core.batching.regex_algebra module-attribute ¤

regex_algebra = '[\\+\\-\\=\\\\±]'

Algebraic signs

core.batching.IP_PATTERN module-attribute ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

core.batching.EMAIL_PATTERN module-attribute ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
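
The capture behaviour can be checked with the standard re module, since this pattern uses no variable-width lookbehind. The sample sentence is made up for the example.

```python
import re

# Pattern copied from the definition above.
EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

text = "Contact <me@mail.com> or Support@Example.co.uk for help"

# Group 1 holds the bare address, without the optional angle brackets.
addresses = [match.group(1) for match in EMAIL_PATTERN.finditer(text)]
print(addresses)
```

Note that the full match (group 0) still includes the angle brackets when present, which is convenient when substituting the whole span.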

core.batching.URL_PATTERN module-attribute ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URL must follow RFC3986, meaning query parameters should be before anchors, if any. Relying on this assumption allows a faster regex parsing.

  • the protocol (ftp, ftps, http, https) is captured as the first group,
  • domain.ext is captured as the second group,
  • /page/etc is the third group, including leading and trailing /,
  • page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
  • anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.

URLs are captured if they are:

  • alone on their own line,
  • enclosed in {}, [], ()
  • enclosed in whitespaces.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name, since () and [] are valid in URL paths and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON need to be parsed beforehand.
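
The RFC 3986 component order that URL_PATTERN relies on (path, then query, then anchor) can be visualized with the standard library. This is only an illustration with urllib.parse.urlsplit, not URL_PATTERN itself, whose underlying regex_url is not reproduced on this page. Unlike the groups described above, urlsplit drops the leading ? and # from the query and the anchor.

```python
from urllib.parse import urlsplit

url = "https://domain.ext/page/subpage?q=x&r=0:1#anchor"
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.netloc)    # domain.ext
print(parts.path)      # /page/subpage
print(parts.query)     # query parameters, without the leading "?"
print(parts.fragment)  # anchor, without the leading "#"

# Protocol-relative URLs keep an empty scheme.
relative = urlsplit("//domain.ext/page")
print(repr(relative.scheme), relative.netloc, relative.path)
```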

core.batching.MEMBERS_PATTERN module-attribute ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.
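
A minimal sketch of what the pattern matches: only dots surrounded by letters on both sides. Replacing the dot by a space is an assumption made for the example (the helper split_members is hypothetical); the module may substitute something else.

```python
import re

# Pattern copied from the definition above.
MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)


def split_members(text: str) -> str:
    """Hypothetical helper: separate dotted members with spaces."""
    return MEMBERS_PATTERN.sub(" ", text)


print(split_members("params.cookies.client"))
print(split_members("cdn.company.com"))
print(split_members("version 3.14."))  # digits and trailing dots are left alone
```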

core.batching.DATE_PATTERN module-attribute ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22 where the whole date is captured in the first group, then each group of digits is captured in the order of appearance, in the next 3 groups

core.batching.TIME_PATTERN module-attribute ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, like:

  • 12h15
  • 12:15
  • 12:15:00
  • 12am
  • 12 am
  • 12 h
  • 12:15:00Z
  • 12:15:00+01
  • 12:15:00 UTC+1
  • 11:27:45+0000
RETURNS DESCRIPTION
0

1- or 2-digits hour,

TYPE: str

1

hour/minutes separator or half-day marker among ["h", ":", "am", "pm"] (case-insensitive)

TYPE: str

2

2-digits minutes, if any, or None

TYPE: str

3

2-digits seconds, if any.

TYPE: str

4

hour marker (h or H), half-day marker (case-insensitive ["am", "pm"]), or time zone marker (case-sensitive ["Z", "UTC"])

TYPE: str

5

1- or 2-digit signed integer time zone shift (relative to UTC).

TYPE: str

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

core.batching.DOMAIN_PATTERN module-attribute ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from (domain.ext) from RFC-822 Received header in emails.

core.batching.UID_PATTERN module-attribute ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

core.batching.FLAGS_PATTERN module-attribute ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
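
The three email header patterns above can be exercised together. The header strings below are invented samples shaped like a Received header and an IMAP FETCH response; the patterns are copied from the definitions above.

```python
import re

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)
UID_PATTERN = re.compile('UID ([0-9]+)')
FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

# Made-up sample data.
received = "Received: from mail.example.org (mail.example.org [192.0.2.1])"
fetch_response = "1 (UID 4827 FLAGS (\\Seen \\Flagged))"

domain = DOMAIN_PATTERN.search(received).group(1)
uid = int(UID_PATTERN.search(fetch_response).group(1))
flags = FLAGS_PATTERN.search(fetch_response).group(1).split()

print(domain, uid, flags)
```

In real code, each search() result should be checked for None before calling group(), since headers may be missing or malformed.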

core.batching.PATH_PATTERN module-attribute ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

core.batching.PARTIAL_PATH_REGEX module-attribute ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

this will collide with date detection, so run it after in the pipeline.

core.batching.RESOLUTION_PATTERN module-attribute ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.

core.batching.NUMBER_PATTERN module-attribute ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with starting or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

core.batching.HASH_PATTERN module-attribute ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.

core.batching.MULTIPLE_LINES module-attribute ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect 2 or more consecutive newlines and tabs, possibly mixed with spaces.

core.batching.MULTIPLE_NEWLINES module-attribute ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.

core.batching.INTERNAL_NEWLINE module-attribute ¤

INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')

Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal (\n used instead of space).
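
A possible way to combine the whitespace patterns when cleaning text extracted from a PDF is sketched below. The replacement strings and the unwrap helper are assumptions for the example; only the two patterns are copied from the definitions above.

```python
import re

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')
INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')


def unwrap(text: str) -> str:
    """Hypothetical helper: keep paragraph breaks, undo hard line wraps."""
    # Assumption: any run of 2+ newlines/tabs collapses to one blank line.
    text = MULTIPLE_LINES.sub("\n\n", text)
    # Assumption: a lone newline between two word characters becomes a space.
    return INTERNAL_NEWLINE.sub(" ", text)


raw = "A line wrapped\nby a PDF parser. \n\n\n Next paragraph."
print(unwrap(raw))
```

Running MULTIPLE_LINES first matters here: once paragraph breaks are normalized, the remaining single newlines between words are the literal line wraps.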

core.batching.EXPOSURE module-attribute ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

core.batching.PHOTOSPEED module-attribute ¤

PHOTOSPEED = re.compile(
    "%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
    flags=re.IGNORECASE,
)

Exposure values in EV or IL

core.batching.SENSIBILITY module-attribute ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensibility in ISO or ASA

core.batching.LUMINANCE module-attribute ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or Cd/m²

core.batching.DIAPHRAGM module-attribute ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

core.batching.GAIN module-attribute ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

core.batching.FILE_SIZE module-attribute ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

core.batching.DISTANCE module-attribute ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

core.batching.PERCENT module-attribute ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

core.batching.WEIGHT module-attribute ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

core.batching.ANGLE module-attribute ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

core.batching.TEMPERATURE module-attribute ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

core.batching.FREQUENCY module-attribute ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

core.batching.TEXT_DATES module-attribute ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual date formats:

  • English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
  • French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
RETURNS DESCRIPTION
0

2 digits (day number or year number, depending on language)

TYPE: str

1

month (full-form or abbreviated)

TYPE: str

2

2 digits (day number or year number, depending on language)

TYPE: str

3

4 digits (full year)

TYPE: str
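
The groups listed above can be inspected with match.groups(), which returns a 0-indexed tuple in the same order. The sample sentences are invented; the pattern is copied from the definition above. As written, the last two groups keep their leading space, and unmatched optional groups come back as None.

```python
import re

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

samples = (
    "sent on 01 Jan. 2020",
    "le 14 juillet 1789",
    "posted Feb 3 2021 12:08",  # the adjacent time must not be captured
)

for sample in samples:
    match = TEXT_DATES.search(sample)
    print(sample, "->", match.groups())
```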

core.batching.BASE_64 module-attribute ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64 encoding

core.batching.BB_CODE module-attribute ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]

core.batching.MARKUP module-attribute ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]

core.batching.USER module-attribute ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails

core.batching.REPEATED_CHARACTERS module-attribute ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character repeated more than 9 times

core.batching.UNFINISHED_SENTENCES module-attribute ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences ending with 2 newline characters without closing punctuation.

core.batching.MULTIPLE_DOTS module-attribute ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies dots repeated twice or more.

core.batching.MULTIPLE_DASHES module-attribute ¤

MULTIPLE_DASHES = re.compile('[-~]{1,}')

Identifies runs of dashes or tildes (as written, the pattern also matches a single one).

core.batching.MULTIPLE_QUESTIONS module-attribute ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of question marks (as written, the pattern also matches a single one).

core.batching.ORDINAL_FR module-attribute ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)

core.batching.FRANCAIS module-attribute ¤

FRANCAIS = re.compile(
    "%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

core.batching.DASHES module-attribute ¤

DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.

core.batching.ALTERNATIVES module-attribute ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or mr/mrs

core.batching.PLURAL_S module-attribute ¤

PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

core.batching.FEMININE_E module-attribute ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

core.batching.DOUBLE_CONSONANTS module-attribute ¤

DOUBLE_CONSONANTS = re.compile(
    "(?<=\\w{2,})([bcfghjklmnpqrstvwxz])\\1", re.IGNORECASE
)

Identify double consonants in the middle of words.

core.batching.FEMININE_TRICE module-attribute ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice, along with the matching -teur and -tor endings.

core.batching.ADVERB_MENT module-attribute ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

core.batching.SUBSTANTIVE_TION module-attribute ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

core.batching.SUBSTANTIVE_AT module-attribute ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

core.batching.PARTICIPLE_ING module-attribute ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing

core.batching.ADJECTIVE_ED module-attribute ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed

core.batching.ADJECTIVE_TIF module-attribute ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

core.batching.SUBSTANTIVE_Y module-attribute ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

core.batching.VERB_IZ module-attribute ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American verbs ending in -iz that French and Brits write in -is

core.batching.STUFF_ER module-attribute ¤

STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

core.batching.BRITISH_OUR module-attribute ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

core.batching.SUBSTANTIVE_ITY module-attribute ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

core.batching.SUBSTANTIVE_IST module-attribute ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

core.batching.SUBSTANTIVE_IQU module-attribute ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu

core.batching.SUBSTANTIVE_EUR module-attribute ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur

core.batching.HYPHENIZED module-attribute ¤

HYPHENIZED = re.compile('(?<=\\w{3,})[-–—]+ *[\\n\\r]{1,2}(?=\\w)')

Detect hyphenated words broken at the end of a PDF text line.

core.batching.WAYBACK_RE module-attribute ¤

WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')

Find the canonical URL from web.archive.org (Wayback Machine) URLs
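
A small usage sketch, assuming the caller wants to fall back to the input when the URL is not a Wayback Machine snapshot. The canonical_url helper and the sample URLs are hypothetical.

```python
import re

# Pattern copied from the definition above.
WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')


def canonical_url(url: str) -> str:
    """Hypothetical helper: strip the Wayback Machine prefix, if any."""
    match = WAYBACK_RE.match(url)
    return match.group(1) if match else url


print(canonical_url(
    "https://web.archive.org/web/20200101000000/https://example.com/page"
))
print(canonical_url("https://example.com/"))
```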

Classes¤

core.batching.Lexicon dataclass ¤

Lexicon(counts: Counter[str] = Counter())

Mutable token frequency index with canonicalization helpers for:

  • malformed n-grams,
  • merged/split variants,
  • plural compound normalization.

Examples:

  • liber_tarian -> libertarian
  • etres_humains -> etre_humain

Functions¤
core.batching.Lexicon.update ¤
update(corpus: Iterable[Iterable[str]]) -> None

Update token frequencies from a corpus of tokenized sentences.

PARAMETER DESCRIPTION
corpus

Iterable of tokenized sentences: [["this", "is", "a", "sentence"], ["another", "sentence"]]

TYPE: Iterable[Iterable[str]]

core.batching.Lexicon.frequency ¤
frequency(token: str) -> int

Return token frequency.

core.batching.Lexicon.exists ¤
exists(token: str) -> bool

Check whether a token exists in the lexicon.

core.batching.Lexicon.prune ¤
prune(min_count: int = 10) -> None

Remove all entries whose frequency is lower than min_count.

PARAMETER DESCRIPTION
min_count

Minimum frequency to keep.

TYPE: int DEFAULT: 10
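
The bookkeeping methods above can be pictured with a small stand-in built on collections.Counter. MiniLexicon is a hypothetical class written for this example, not the actual Lexicon implementation; it only mirrors the documented signatures.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class MiniLexicon:
    """Hypothetical stand-in mirroring update/frequency/exists/prune."""

    counts: Counter = field(default_factory=Counter)

    def update(self, corpus: Iterable[Iterable[str]]) -> None:
        # Each sentence is an iterable of tokens.
        for sentence in corpus:
            self.counts.update(sentence)

    def frequency(self, token: str) -> int:
        # Counter returns 0 for unknown tokens without inserting them.
        return self.counts[token]

    def exists(self, token: str) -> bool:
        return self.counts[token] > 0

    def prune(self, min_count: int = 10) -> None:
        # Keep only entries seen at least min_count times.
        self.counts = Counter(
            {t: c for t, c in self.counts.items() if c >= min_count}
        )


lexicon = MiniLexicon()
lexicon.update([["this", "is", "a", "sentence"], ["another", "sentence"]])
print(lexicon.frequency("sentence"), lexicon.exists("missing"))
lexicon.prune(min_count=2)
print(sorted(lexicon.counts))
```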

core.batching.Lexicon.resolve_token ¤
resolve_token(token: str, separator: str = '_', min_ratio: float = 1.0) -> str

Attempt to canonicalize malformed n-grams.

Operations:

  1. malformed n-grams: liber_tarian -> libertarian
  2. plural compound reduction: etres_humains -> etre_humain

Strategy:

  • if the token already exists, keep it,
  • otherwise:
      • remove separators,
      • check if the merged variant exists,
      • compare frequencies,
      • prefer the merged form if it is sufficiently frequent.

PARAMETER DESCRIPTION
token

Token to canonicalize.

TYPE: str

separator

N-gram separator.

TYPE: str DEFAULT: '_'

min_ratio

Require merged token frequency to be at least min_ratio times the split variant frequency.

Helps avoid false positives.

TYPE: float DEFAULT: 1.0

RETURNS DESCRIPTION
str

Canonicalized token.
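
The merge strategy described above can be sketched as follows. This is a simplified guess, not the actual method: it works on a bare Counter, it skips the plural compound reduction, and it assumes that the "split variant frequency" is the frequency of the rarest member of the n-gram.

```python
from collections import Counter


def resolve_token(
    counts: Counter, token: str, separator: str = "_", min_ratio: float = 1.0
) -> str:
    """Simplified sketch of the merged-variant resolution."""
    # Known tokens are kept untouched.
    if counts[token] > 0:
        return token

    merged = token.replace(separator, "")
    merged_count = counts[merged]

    # Assumption: the split variant is as frequent as its rarest member.
    members = [m for m in token.split(separator) if m]
    split_count = min((counts[m] for m in members), default=0)

    if merged_count > 0 and merged_count >= min_ratio * split_count:
        return merged
    return token


counts = Counter({"libertarian": 12, "liber": 1, "new": 50, "york": 40, "newyork": 2})
print(resolve_token(counts, "liber_tarian"))
print(resolve_token(counts, "new_york"))
```

Raising min_ratio makes the merge more conservative, which matches the documented purpose of avoiding false positives.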

core.batching.Lexicon.canonicalize_sentence ¤
canonicalize_sentence(
    sentence: list[str], separator: str = "_", min_ratio: float = 1.0
) -> list[str]

Canonicalize all tokens in a sentence.

core.batching.Tokenizer ¤

Tokenizer(
    meta_tokens: dict[re.Pattern, str] | None = None,
    abbreviations: dict[str, str] | None = None,
    replacements: dict[str, str] | None = None,
    stopwords: set[str] | None = None,
    lang_stopwords: dict[str, set[str]] | None = None,
    backend: str = "blingfire",
)

Pre-processing pipeline and tokenizer.

Splits a string into normalized word tokens after applying a series of configurable text transformations.

PARAMETER DESCRIPTION
meta_tokens

Pipeline of regular-expression substitutions used to replace document fragments with meta-tokens.

Keys must be compiled re.Pattern objects and values must be meta-token strings, typically enclosed in underscores.

Transformations are applied in declaration order. This relies on Python’s ordered dictionaries (Python 3.7+).

If not provided, a default pipeline suitable for bilingual English/French technical documents is used.

TYPE: dict[re.Pattern, str] | None DEFAULT: None

abbreviations

Pipeline of abbreviation replacements as a {to_replace: replacement} dictionary.

Replacements are applied in declaration order.

TYPE: dict[str, str] | None DEFAULT: None

replacements

Dictionary of token-level substitutions applied as {key: value} string replacements.

TYPE: dict[str, str] | None DEFAULT: None

stopwords

Language-agnostic stopwords to remove from the token stream.

TYPE: set[str] | None DEFAULT: None

lang_stopwords

Language-specific stopwords.

Keys must be ISO 639-1 language codes and values must be sets of stopwords associated with each language.

TYPE: dict[str, set[str]] | None DEFAULT: None

backend

Tokenization backend to use.

Supported values are:

  • "blingfire": Microsoft BlingFire tokenizer (pattern-based).
  • "nltk": NLTK Punkt tokenizer.

TYPE: str DEFAULT: 'blingfire'

Attributes¤
core.batching.Tokenizer.characters_cleanup class-attribute instance-attribute ¤
characters_cleanup: dict[(re.Pattern) : str] = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}

Dictionary of regular expressions (keys) to find and replace by the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBCode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).
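
Since the dictionary is ordered, the cleanup amounts to applying each substitution in turn. The sketch below uses a subset of the entries above (the patterns are copied from their definitions on this page); the cleanup function is a hypothetical helper, not a method of Tokenizer.

```python
import re

MULTIPLE_DOTS = re.compile('\\.{2,}')
MULTIPLE_DASHES = re.compile('[-~]{1,}')
MULTIPLE_QUESTIONS = re.compile('\\?{1,}')
REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')
BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

# Subset of characters_cleanup, in the same declaration order.
characters_cleanup = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
}


def cleanup(text: str) -> str:
    """Hypothetical helper: apply every substitution in declaration order."""
    for pattern, replacement in characters_cleanup.items():
        text = pattern.sub(replacement, text)
    return text


print(cleanup("Really???? Wait....."))
print(cleanup("a [img]x.png[/img] b"))
```

Order matters: MULTIPLE_DOTS runs before REPEATED_CHARACTERS, so a long run of dots becomes an ellipsis instead of being blanked out.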

core.batching.Tokenizer.internal_meta_tokens class-attribute instance-attribute ¤
internal_meta_tokens: dict[(re.Pattern) : str] = {
    HASH_PATTERN_FAST: "_HASH_",
    NUMBER_PATTERN_FAST: "_NUMBER_",
}

Dictionary of regular expressions (keys) to find in full tokens and replace by meta-tokens. Uses simplified regex patterns for performance.

core.batching.Tokenizer.abbreviations instance-attribute ¤
abbreviations = abbreviations

Abbreviations and contractions to replace in full documents

core.batching.Tokenizer.replacements instance-attribute ¤
replacements = replacements

Arbitrary string replacements in single tokens

core.batching.Tokenizer.stopwords instance-attribute ¤
stopwords = set(stopwords) if stopwords else None

Language-agnostic stopwords

core.batching.Tokenizer.lang_stopwords instance-attribute ¤
lang_stopwords = lang_stopwords

Language-specific stopwords

core.batching.Tokenizer.supports_ngrams instance-attribute ¤
supports_ngrams: bool = False

Whether or not the tokenizer has an embedded n-grams model

core.batching.Tokenizer.ngrams_trie instance-attribute ¤
ngrams_trie = {}

Prefix tree of known n-grams for efficient lookups

core.batching.Tokenizer.vocabulary instance-attribute ¤
vocabulary: Lexicon = Lexicon()

Known tokens, if trained for n-grams.

Functions¤
core.batching.Tokenizer.prefilter ¤
prefilter(string: str, meta_tokens: bool = True) -> str

Tokenizers split words based on unsupervised machine-learned models. Sometimes, they behave unexpectedly. For example, in emails and user handles like @user, they would split @ and user as 2 different tokens, making it impossible to detect usernames in single tokens later.

To avoid that, we replace data of interest by meta-tokens before the tokenization, with regular expressions.
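
The idea can be sketched with two of the patterns documented on this page. The meta-token names _EMAIL_ and _USER_, the META_TOKENS dictionary and this standalone prefilter function are assumptions for the example; the default pipeline of the class may differ.

```python
import re

# Patterns copied from the module attributes above.
EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)
USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

# Hypothetical pipeline: emails first, so USER does not swallow them.
META_TOKENS = {EMAIL_PATTERN: "_EMAIL_", USER: "_USER_"}


def prefilter(string: str) -> str:
    """Replace fragments of interest by meta-tokens before tokenizing."""
    for pattern, meta_token in META_TOKENS.items():
        string = pattern.sub(meta_token, string)
    return string


print(prefilter("ping @alice and <bob@mail.com>"))
```

After this step, a tokenizer only sees plain word-like tokens, so the handle and the address each survive as a single token.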

core.batching.Tokenizer.lemmatize ¤
lemmatize(word: str) -> str

Find the root (lemma) of words to help topical generalization.

core.batching.Tokenizer.normalize_text ¤
normalize_text(document: str) -> str

Prepare text for tokenization by converting it to lowercase ASCII characters.

This will lose accents, diacritics and capitals, which means some nuance will be lost for the benefit of generality. If this does not suit your use case, you may inherit from the Tokenizer class and re-implement this method in a child class.
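
One common way to obtain lowercase ASCII text with the standard library is Unicode decomposition followed by an ASCII encode. This is a sketch of the general technique under that assumption, not necessarily what the method does internally.

```python
import unicodedata


def normalize_text(document: str) -> str:
    """Sketch: strip diacritics through NFKD decomposition, then lowercase."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", document)
    # Encoding to ASCII with errors="ignore" drops the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


print(normalize_text("Éléphant À Côté"))
```

Characters without an ASCII decomposition are dropped entirely by this approach, which is one reason a child class may want a different strategy.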

core.batching.Tokenizer.normalize_token ¤
normalize_token(
    word: str,
    language: str | None,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> str | None

Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.), punctuation and the like are replaced by None, which should be filtered out at the next step.

PARAMETER DESCRIPTION
word

tokenized word in lower case only.

TYPE: str

language

the ISO 639-1 language code used to remove typical stopwords.

TYPE: str | None

normalize

remove punctuation and leading/trailing symbols.

TYPE: bool DEFAULT: True

meta_tokens

replace string patterns by meta_tokens

TYPE: bool DEFAULT: True

stem

remove word suffixes, double consonants, etc.

TYPE: bool DEFAULT: True

remove_stopwords

remove stopwords

TYPE: bool DEFAULT: True

NOTE

Tokenization is non-destructive (full sentences can be reconstructed entirely from token lists) if normalize=False, meta_tokens=False, stem=False and remove_stopwords=False. In this setting, only 1:1 token replacements defined in self.replacements will be applied, which makes it possible to replace abbreviations or acronyms. Other modes start generalizing semantics by removing meaning.

Examples:

Meta-tokens: 10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token. feb, February, feb., monday will all be replaced by a _DATE_ meta-token.

core.batching.Tokenizer.tokenize_text ¤
tokenize_text(
    sentence: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

Split text into normalized word tokens and meta-tokens.

No sentence or paragraph boundary detection is performed.

PARAMETER DESCRIPTION
sentence

Input text to tokenize.

TYPE: str

n_grams

Whether to detect and collapse n-grams.

Requires a trained n-gram model generated with train_ngrams().

TYPE: bool DEFAULT: True

Note

The parameters language, normalize, meta_tokens, stem, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
list[str]

List of normalized tokens represented as a bag of words.

core.batching.Tokenizer.post_filter_tokens ¤
post_filter_tokens(
    tokens: list[str],
    language: str | None = None,
    meta_tokens: bool = True,
    stem: bool = False,
    normalize: bool = False,
    remove_stopwords: bool = False,
) -> list[str]

Apply post-processing operations to an existing token stream.

This method applies token normalization, stemming, stopword removal, and meta-token handling without performing tokenization.

PARAMETER DESCRIPTION
tokens

List of input tokens to process.

TYPE: list[str]

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
list[str]

List of processed tokens.

core.batching.Tokenizer.tokenize_document_flat ¤
tokenize_document_flat(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

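Cleanup and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
  • cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.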

Note

the language is detected internally if not provided as an optional argument. When processing a single sentence extracted from a document, instead of the whole document, it is more accurate to run the language detection on the whole document, ahead of calling this method, and pass on the result here.

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

n_grams

Whether to detect and collapse n-grams. See tokenize_text().

TYPE: bool DEFAULT: True

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
tokens

a 1D list of normalized tokens and meta-tokens.

TYPE: list[str]

core.batching.Tokenizer.tokenize_document_per_sentence ¤
tokenize_document_per_sentence(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

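Cleanup and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
  • cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.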

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

n_grams

Whether to detect and collapse n-grams. See tokenize_text().

TYPE: bool DEFAULT: True

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
tokens

a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

TYPE: list[list[str]]
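
The difference in output shape between the flat and the per-sentence variants can be pictured with naive regex splitting. The two functions below are hypothetical stand-ins: they use the standard re module instead of the BlingFire or NLTK backends and skip all normalization except lowercasing.

```python
import re


def tokenize_flat(document: str) -> list[str]:
    """Naive stand-in: one flat list of tokens for the whole text."""
    return re.findall(r"\w+", document.lower())


def tokenize_per_sentence(document: str) -> list[list[str]]:
    """Naive stand-in: one list of tokens per sentence."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [tokenize_flat(sentence) for sentence in sentences if sentence]


text = "The cat sat. The dog ran!"
print(tokenize_flat(text))          # 1D: list[str]
print(tokenize_per_sentence(text))  # 2D: list[list[str]]
```

The 2D shape is what embedding trainers usually expect, since the context window should not span sentence boundaries.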

core.batching.Tokenizer.tokenize_document_per_paragraph ¤
tokenize_document_per_paragraph(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

Cleanup and tokenize a whole document as a list of paragraphs, meaning we split it on paragraph breaks (runs of newline characters) before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
  • cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

n_grams

see core.nlp.Tokenizer.tokenize_text

TYPE: bool DEFAULT: True

Note

The other parameters are documented in core.nlp.Tokenizer.normalize_token. The language is detected internally if not provided. The text is prefiltered with self.prefilter.

RETURNS DESCRIPTION
tokens

a 2D list of paragraphs (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

TYPE: list[list[str]]

core.batching.Tokenizer.load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

core.batching.Tokenizer.members_from_ngram ¤
members_from_ngram(token: str | None) -> list[str] | None

Recover n-gram members from a single tokenized phrase, separated with _. This expects lower-case tokens, except for meta-tokens, which are expected to be capitalized.

RETURNS DESCRIPTION
list[str] | None

the list of n-gram members, or None if the token was not an n-gram but a singleton.

core.batching.Tokenizer.train_ngrams ¤
train_ngrams(
    sentences: list[str],
    connector_words: str = "",
    min_count: int = 10,
    threshold: float = 0.7,
    scoring: str = "npmi",
)

Train an n-gram model (bigrams and trigrams).

Detects common phrases such as “New York City” and merges them into single tokens using a statistical phrase model.

PARAMETER DESCRIPTION
sentences

Training corpus. Must be a list of tokenized sentences.

TYPE: list[str]

connector_words

Space-separated list of connector words allowed inside phrases (e.g. “by” in “piece by piece”).

These words are treated as valid bridges when forming n-grams.

TYPE: str DEFAULT: ''

min_count

Minimum number of occurrences required for a phrase to be considered.

See gensim.models.phrases.Phrases for details.

TYPE: int DEFAULT: 10

threshold

Phrase detection sensitivity threshold.

See gensim.models.phrases.Phrases.

TYPE: float DEFAULT: 0.7

scoring

Scoring function used for phrase detection.

See gensim.models.phrases.Phrases.

TYPE: str DEFAULT: 'npmi'

Warning

N-gram training must be performed on lightly processed tokenized sentences. Do not apply stemming, stopword removal, or punctuation stripping before training.

See Tokenizer.normalize_token() for required preprocessing options.

Note
  • Writes an ngrams log file in the models directory containing discovered phrases.
  • Can be executed multiple times (e.g. per language); results are appended to the existing model.
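
To give an intuition of what the npmi scoring with a 0.7 threshold selects, here is a pure-Python sketch of normalized pointwise mutual information over bigrams. It is a simplified illustration, not the gensim implementation: it ignores connector words and trigrams, and the detect_bigrams helper and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def npmi(count_a: int, count_b: int, count_ab: int, total: int) -> float:
    """Normalized PMI in [-1, 1]: 1 means the words always co-occur."""
    pa, pb, pab = count_a / total, count_b / total, count_ab / total
    return math.log(pab / (pa * pb)) / -math.log(pab)


def detect_bigrams(sentences, min_count=10, threshold=0.7):
    """Hypothetical helper: score adjacent word pairs, keep the best ones."""
    words = Counter(w for s in sentences for w in s)
    pairs = Counter(p for s in sentences for p in zip(s, s[1:]))
    total = sum(words.values())
    phrases = {}
    for (a, b), n in pairs.items():
        if n < min_count:
            continue  # too rare to be trusted
        score = npmi(words[a], words[b], n, total)
        if score >= threshold:
            phrases[f"{a}_{b}"] = score
    return phrases


corpus = [["new", "york"]] * 5 + [
    ["the", "cat"], ["the", "dog"], ["a", "cat"], ["a", "dog"]
]
print(detect_bigrams(corpus, min_count=2))
```

In this toy corpus "new" and "york" only ever appear together, so their score reaches the maximum, while the other pairs are filtered out by min_count.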
core.batching.Tokenizer.compile_ngrams ¤
compile_ngrams(ngrams: list[str])

Build a nested n-gram dictionary (a trie) for efficient querying, like:

{
    "new": {
        "york": {
            "__value__": "new_york",
            "city": {
                "__value__": "new_york_city"
            }
        }
    }
}
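
As an illustration of the structure above, here is a minimal sketch of how such a nested dictionary could be built and queried. It is not the library's implementation: the standalone functions `compile_ngrams` and `lookup_ngram` below only borrow the documented names, and the `trie` argument is an assumption made to keep the example self-contained.

```python
from typing import Optional


def compile_ngrams(ngrams: list) -> dict:
    """Build a nested dictionary (trie) keyed by n-gram members."""
    trie: dict = {}
    for ngram in ngrams:
        node = trie
        for member in ngram.split("_"):
            node = node.setdefault(member, {})
        # Mark the end of a complete n-gram with its collapsed form.
        node["__value__"] = ngram
    return trie


def lookup_ngram(trie: dict, members) -> Optional[str]:
    """Walk the trie along `members`; return the collapsed n-gram or None."""
    node = trie
    for member in members:
        node = node.get(member)
        if not isinstance(node, dict):
            return None
    return node.get("__value__")


trie = compile_ngrams(["new_york", "new_york_city"])
print(lookup_ngram(trie, ("new", "york")))
print(lookup_ngram(trie, ("foo", "bar")))
```

Each lookup costs one dictionary access per member, regardless of how many n-grams are stored.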

core.batching.Tokenizer.replace_ngrams ¤
replace_ngrams(tokens: list[str]) -> list[str]

Identify n-grams among tokens and collapse them into single tokens. N-grams should have been trained before, with core.nlp.Tokenizer.train_ngrams.

RETURNS DESCRIPTION
list[str]

the collapsed list of strings, or the original list if no n-gram was found or the n-gram model has not been trained.
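
A sketch of how collapsing could work on top of a nested n-gram dictionary. The greedy longest-match policy and the explicit `trie` argument are assumptions for illustration; the source does not state which matching policy the method applies.

```python
# Nested dictionary in the shape documented for compile_ngrams.
TRIE = {
    "new": {
        "york": {
            "__value__": "new_york",
            "city": {"__value__": "new_york_city"},
        }
    }
}


def replace_ngrams(tokens, trie):
    """Collapse the longest known n-gram starting at each position."""
    out = []
    i = 0
    while i < len(tokens):
        node = trie
        best, best_end = None, i
        j = i
        # Follow the trie as far as the tokens allow.
        while j < len(tokens) and isinstance(node.get(tokens[j]), dict):
            node = node[tokens[j]]
            j += 1
            if "__value__" in node:
                # Remember the longest complete n-gram seen so far.
                best, best_end = node["__value__"], j
        if best is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append(best)
            i = best_end
    return out


collapsed = replace_ngrams(["i", "love", "new", "york", "city", "pizza"], TRIE)
print(collapsed)
```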

core.batching.Tokenizer.lookup_ngram ¤
lookup_ngram(members: list[str] | tuple[str, ...]) -> str | None

Lookup an n-gram in the trie from its token members.

PARAMETER DESCRIPTION
members

the tokens iterable

TYPE: list[str] | tuple[str, ...]

RETURNS DESCRIPTION
str | None

the collapsed n-gram if found in the trie, or None if the input members match no known n-gram.

Example

lookup_ngram(("new", "york")) -> "new_york"

lookup_ngram(("new", "york", "city")) -> "new_york_city"

lookup_ngram(("foo", "bar")) -> None

core.batching.Data ¤

Data(text: str, label: str)

Represent an item of tagged training data.

PARAMETER DESCRIPTION
text

the content to label, which will be vectorized

TYPE: str

label

the category of the content, which will be predicted by the model

TYPE: str

core.batching.LossLogger ¤

LossLogger()

Bases: CallbackAny2Vec

Output loss at each epoch

core.batching.Word2Vec ¤

Word2Vec(
    documents: list[list[str]],
    name: str = "word2vec",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count: int = 5,
    sample: float = 0.0005,
    tokenizer: Tokenizer = None,
    compute_idf: bool = False,
    **kwargs: dict[str, Any]
)

Bases: gensim.models.Word2Vec

Train, re-train, or load a Word2Vec embedding model.

If a model with the given name already exists, it is automatically loaded instead of re-trained. Note that in this case, vector_size will be overridden by the saved model configuration.

PARAMETER DESCRIPTION
documents

Pre-tokenized training corpus.

Structure: - outer list: documents - inner list: tokenized sentences

TYPE: list[list[str]]

name

Name of the model file used for saving/loading.

TYPE: str DEFAULT: 'word2vec'

vector_size

Dimensionality of word embeddings.

TYPE: int DEFAULT: 300

epochs

Number of training iterations.

Higher values improve quality on small corpora but increase training time.

TYPE: int DEFAULT: 200

window

Context window size for word co-occurrence.

TYPE: int DEFAULT: 5

min_count

Minimum frequency threshold for vocabulary filtering.

TYPE: int DEFAULT: 5

sample

Subsampling rate for frequent words.

TYPE: float DEFAULT: 0.0005

tokenizer

Tokenizer instance used for preprocessing (if applicable).

TYPE: Tokenizer DEFAULT: None

compute_idf

Whether to compute and store IDF statistics for SIF weighting.

Disable to reduce model size when SIF is not used.

TYPE: bool DEFAULT: False

**kwargs

Additional parameters forwarded directly to gensim.models.Word2Vec.

TYPE: dict[str, Any] DEFAULT: {}

Attributes¤
core.batching.Word2Vec.tokenizer instance-attribute ¤
tokenizer = tokenizer if tokenizer is not None else Tokenizer()

Tokenizer used to train the model. It is stored so that the same tokenizer is applied when the model is used later.

core.batching.Word2Vec.N_docs instance-attribute ¤
N_docs = len(documents)

Number of documents in the training corpus

core.batching.Word2Vec.N_sentences instance-attribute ¤
N_sentences = len(sentences)

Number of sentences in the training corpus

core.batching.Word2Vec.N_words instance-attribute ¤
N_words = len(words)

Number of words (tokens) in the training corpus

core.batching.Word2Vec.N_terms instance-attribute ¤
N_terms = len(counts)

Number of terms (unique words) in the training corpus

core.batching.Word2Vec.idf instance-attribute ¤
idf: dict[str, float] | None = None

Inverse Document Frequency, used only for SIF weighting when enabled.

core.batching.Word2Vec.avg_doc_len instance-attribute ¤
avg_doc_len: float | None = None

Average number of words in documents of the training corpus, available with IDF stats.

Functions¤
core.batching.Word2Vec.compute_idf ¤
compute_idf(documents: list[list[str]]) -> None

Compute and store IDF statistics from a tokenized document corpus.

core.batching.Word2Vec.update_idf ¤
update_idf(documents: list[list[str]]) -> None

Update IDF statistics and corpus-dependent metadata with new documents.

PARAMETER DESCRIPTION
documents

New pre-tokenized documents.

TYPE: list[list[str]]

core.batching.Word2Vec.prune_idf ¤
prune_idf()

Prune IDF entries to the actual model vocabulary (remove tokens that were filtered out by gensim during super().__init__).

core.batching.Word2Vec.load_model classmethod ¤
load_model(name: str)

Load a trained model saved in the models folder.

core.batching.Word2Vec.get_word ¤
get_word(word: str) -> str | None

Find out whether word is in the dictionary, optionally attempting spell-checking if not found.

PARAMETER DESCRIPTION
word

word to find

TYPE: str

RETURNS DESCRIPTION
str | None
  • the original word if found in the dictionary,
  • None if both previous conditions were not matched.
core.batching.Word2Vec.get_wordvec ¤
get_wordvec(
    word: str, embed: str = "IN", normalize: bool = True
) -> np.ndarray[np.float32] | None

Return the vector associated with a word, through a dictionary of words.

PARAMETER DESCRIPTION
word

the word to convert to a vector.

TYPE: str

embed

TYPE: str DEFAULT: 'IN'


  1. A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana https://arxiv.org/pdf/1602.01137.pdf 

RETURNS DESCRIPTION
np.ndarray[np.float32] | None

the nD vector if the word was found in the dictionary, or None.

core.batching.Word2Vec.get_features ¤
get_features(
    tokens: list[str],
    embed: str = "IN",
    use_sif: bool = False,
    sif_smoothing: float = 0.001,
) -> np.ndarray[np.float32]

Calls core.nlp.Word2Vec.get_wordvec over a list of tokens and returns a single vector representing the whole list.

PARAMETER DESCRIPTION
tokens

list of text tokens.

TYPE: list[str]

embed

TYPE: str DEFAULT: 'IN'

use_sif

Use SIF weighting on each term when embedding a full sentence or document. See core.nlp.Word2Vec.SIF.

TYPE: bool DEFAULT: False

sif_smoothing

The SIF smoothing coefficient.

TYPE: float DEFAULT: 0.001

RETURNS DESCRIPTION
np.ndarray[np.float32]

the normalized centroid of word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
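
The returned value can be pictured with the following sketch, which averages toy vectors and L2-normalizes the result. The `EMBEDDINGS` table, its 3 dimensions and the standalone `get_features` function are hypothetical stand-ins; the real method works on the trained model and may differ in details such as per-word normalization or SIF weighting.

```python
import math

# Hypothetical toy embeddings; a trained model would hold `vector_size` dimensions.
EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
}
DIM = 3


def get_features(tokens):
    """Normalized centroid of known tokens, or the null vector."""
    found = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not found:
        return [0.0] * DIM
    # Average each dimension over the vectors that were found.
    mean = [sum(axis) / len(found) for axis in zip(*found)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


vector = get_features(["cat", "dog", "unicorn"])
print(vector)
```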

core.batching.Word2Vec.SIF ¤
SIF(token: str, a: float = 0.001) -> float

Smooth inverse frequency weighting

Taken from A simple but tough-to-beat baseline for sentence embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx

This helps refine semantics by under-weighting stopwords. However, it is unsuited for File Information Retrieval (search engines) because it over-smooths the embedding space geometry and hinders relevance discrimination with regard to a query.

PARAMETER DESCRIPTION
token

the token to weight. It should be in the model vocabulary.

TYPE: str

Return

The SIF weight associated with the token or 0. if the token was not found in the vocabulary.
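
For reference, the weighting proposed in the cited paper is a / (a + p(w)), where p(w) is the relative frequency of the word in the corpus. The sketch below applies that formula to raw counts. This is an assumption: the library stores IDF statistics, and how it derives p(w) from them is not described here.

```python
from collections import Counter


def sif_weight(token, counts, total, a=0.001):
    """Paper formula a / (a + p(w)); 0.0 for tokens absent from the counts."""
    count = counts.get(token, 0)
    if count == 0:
        return 0.0
    return a / (a + count / total)


# Hypothetical toy corpus: one very frequent word and one rare word.
corpus = ["the"] * 90 + ["tokenizer"] * 10
counts = Counter(corpus)
total = sum(counts.values())

frequent = sif_weight("the", counts, total)
rare = sif_weight("tokenizer", counts, total)
print(frequent, rare)
```

Frequent words get a weight close to 0 while rare words keep a larger weight, which is what under-weights stopwords in the averaged vector.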

core.batching.Word2Vec.tokens_to_indices ¤
tokens_to_indices(tokens: list[str]) -> np.ndarray[np.int32]

Convert a list of tokens to a list of their index number in the Word2Vec vocabulary. This yields a more compact, albeit purely symbolic, representation of a tokenized document as a series of integers.

The conversion is reversible and the original token can be found with self.wv.index_to_key[i], where i is the index number output (for each token) from here.

Return

the list of indices as signed 32-bit integers (np.int32), meaning the Word2Vec vocabulary needs to contain fewer than 2.15 billion words (2^31).
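
The round trip can be sketched with plain Python containers. The vocabulary below is hypothetical, and skipping out-of-vocabulary tokens is an assumption, since the handling of unknown tokens is not documented here.

```python
# Hypothetical vocabulary, ordered by index as in a trained model.
index_to_key = ["the", "cat", "sat", "mat"]
key_to_index = {token: i for i, token in enumerate(index_to_key)}


def tokens_to_indices(tokens):
    # Assumption: tokens missing from the vocabulary are skipped.
    return [key_to_index[t] for t in tokens if t in key_to_index]


def indices_to_tokens(indices):
    # Reverse mapping, equivalent to reading index_to_key[i] for each index.
    return [index_to_key[i] for i in indices]


indices = tokens_to_indices(["the", "cat", "sat"])
print(indices, indices_to_tokens(indices))
```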

core.batching.Classifier ¤

Classifier(
    training_set: list[Data],
    name: str,
    word2vec: Word2Vec,
    validate: bool = True,
    variant: str = "svm",
)

Bases: nltk.classify.SklearnClassifier

Initialize a Word2Vec + SVM classification pipeline.

This class wraps a Word2Vec embedding model with a downstream machine-learning classifier (SVM or alternatives).

PARAMETER DESCRIPTION
training_set

List of Data samples used for training.

If empty, the system will attempt to load a pre-trained model using name.

TYPE: list[Data]

name

Identifier used to save and reload the trained model.

TYPE: str

word2vec

Word embedding model used to generate feature vectors.

TYPE: Word2Vec

validate

If True, splits the dataset into training (95%) and testing (5%) subsets and prints evaluation metrics.

Useful for classifier selection and sanity checking.

TYPE: bool DEFAULT: True

variant

Type of classifier to use:

  • svm: RBF-kernel Support Vector Machine (default). Robust and stable across general datasets.

  • linear svm: Linear Support Vector Machine. Faster and often better for high-dimensional features.

  • forest: Random Forest classifier. Faster than linear SVM in some cases, but produces larger models.

TYPE: str DEFAULT: 'svm'

Note

The previous documentation mentioned path and features, but these are not part of the current signature and were removed.

Functions¤
core.batching.Classifier.get_features_parallel ¤
get_features_parallel(post: Data) -> tuple[str, str]

Thread-safe call to .get_features() to be called in multiprocessing.Pool map

core.batching.Classifier.load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

core.batching.Classifier.classify ¤
classify(post: str) -> str

Apply a label on a post based on the trained model.

core.batching.Classifier.prob_classify ¤
prob_classify(post: str) -> tuple[str, float]

Apply a label on a post based on the trained model and output the probability too.

core.batching.StemTokenIndex ¤

StemTokenIndex(db: sqlite3.Connection, tokenizer: Tokenizer)

Reverse normalized stem -> token lookup index.

Functions¤
core.batching.StemTokenIndex.most_probable_token ¤
most_probable_token(db: sqlite3.Connection, stem: str) -> str

Return the most probable original token associated to the stem. If the stem doesn’t exist in the database, it is returned as-is.

core.batching.StemTokenIndex.most_probable_tokens ¤
most_probable_tokens(db: sqlite3.Connection, stems: list[str]) -> list[str]

Return the most probable original token for each stem.

Stems not found in DB are returned unchanged.

core.batching.SQLitePageCorpus ¤

SQLitePageCorpus(
    db,
    query,
    params=(),
    atomic_types=(str, bytes),
    max_depth=None,
    yield_rows=False,
)

Lazily stream rows from an SQLite request, avoiding full copy.

Example

    corpus = SQLitePageCorpus(
        db,
        """
        SELECT tokenized
        FROM pages
        WHERE lang IN ('fr', 'en')
        """,
        max_depth=0
    )
  • max_depth=0 will not flatten the content, so it will return the original list[list[str]] (list of sentences, aka list of lists of words),
  • max_depth=1 flattens documents, so it will return list[str] (list of words).
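
The lazy streaming idea can be sketched with the standard sqlite3 module alone. `LazyPageCorpus` is a hypothetical stand-in, not the class documented above, and storing the tokenized column as JSON text is an assumption made for the demonstration.

```python
import json
import sqlite3


class LazyPageCorpus:
    """Re-iterable corpus: every `for` loop re-runs the query and streams rows."""

    def __init__(self, db, query, params=(), batch_size=1000):
        self.db = db
        self.query = query
        self.params = params
        self.batch_size = batch_size

    def __iter__(self):
        cursor = self.db.execute(self.query, self.params)
        while True:
            # Only `batch_size` rows are held in memory at a time.
            rows = cursor.fetchmany(self.batch_size)
            if not rows:
                return
            for (tokenized,) in rows:
                # Assumption: the column holds JSON-encoded token lists.
                yield json.loads(tokenized)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT, lang TEXT, tokenized TEXT)")
db.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("a", "en", json.dumps([["hello", "world"]])),
        ("b", "de", json.dumps([["hallo", "welt"]])),
        ("c", "fr", json.dumps([["bonjour", "monde"]])),
    ],
)
corpus = LazyPageCorpus(
    db, "SELECT tokenized FROM pages WHERE lang IN ('fr', 'en') ORDER BY url"
)
documents = list(corpus)
print(documents)
```

Because `__iter__` restarts the query, the object can be iterated several times, which multi-epoch training loops typically require.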

core.batching.Deduplicator ¤

Deduplicator(
    threshold: float = 0.9,
    distance: int = 50,
    discard_params: bool = True,
    n_min: int = 0,
    fix_urls: bool = True,
)

Instantiate a deduplicator object.

The duplicates factorizing takes a list of core.types.web_page.

Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.

You can edit (append or replace) the list of URLs to ignore, core.deduplicator.Deduplicator.urls_to_ignore, before doing the actual process.

Optionally, near-duplicates are detected too by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.

PARAMETER DESCRIPTION
threshold

the minimum Levenshtein distance ratio between 2 pages contents for those pages to be considered near-duplicates and be factorized. If set to 1.0, the near-duplicates detection is bypassed which results in a huge speed up.

TYPE: float DEFAULT: 0.9

distance

the near-duplicates search is performed on the nearest elements after the core.types.web_page list has been ordered alphabetically by URL, for performance, assuming near-duplicates will most likely be found on the same domain and at a similar path. The distance parameter defines how many elements ahead we will look into.

TYPE: int DEFAULT: 50

discard_params

on modern CMS that enable “pretty URLs” (URL rewriting), pages will be indexed by a domain/section/subsection/page and URL query parameters will most likely be used by meaningless pages like social sharing links or search results pages, so this parameter can be set to True to discard those. On REST-API-driven websites, streaming websites and old CMS using “ugly URLs”, pages will be indexed by domain?content=id and the query parameters need to be kept by setting this parameter to False.

TYPE: bool DEFAULT: True

n_min

domains that have a number of indexed pages below this threshold will be discarded entirely. This avoids indexing random dude’s website, under the assumption that relevant and reliable domains will have several pages indexed.

TYPE: int DEFAULT: 0

fix_urls

attempt to convert http URLs to https and remove the leading www. prefix. This sends DNS requests to assess whether the https and www.-less variants can be reached, which takes at most 2 s per URL. Set to False to speed things up.

TYPE: bool DEFAULT: True
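
The interplay of threshold and distance can be sketched as a sliding-window scan over pages sorted by URL. Note the swapped-in technique: `difflib.SequenceMatcher.ratio()` from the standard library stands in for the Levenshtein ratio, so the scores are not identical to the library's. The function name and the `content` key are hypothetical.

```python
from difflib import SequenceMatcher


def near_duplicate_pairs(pages, threshold=0.9, distance=50):
    """Return URL pairs whose contents are similar, scanning `distance` items ahead."""
    if threshold >= 1.0:
        # Near-duplicate detection is bypassed entirely.
        return []
    ordered = sorted(pages, key=lambda page: page["url"])
    pairs = []
    for i, page in enumerate(ordered):
        # Only compare with the next `distance` neighbours in URL order.
        for other in ordered[i + 1 : i + 1 + distance]:
            # Stand-in similarity measure, NOT the Levenshtein ratio.
            ratio = SequenceMatcher(None, page["content"], other["content"]).ratio()
            if ratio >= threshold:
                pairs.append((page["url"], other["url"]))
    return pairs


pages = [
    {"url": "https://example.com/post", "content": "the quick brown fox jumps over the lazy dog"},
    {"url": "https://example.com/post-copy", "content": "the quick brown fox jumps over the lazy dog!"},
    {"url": "https://example.com/other", "content": "completely different text"},
]
pairs = near_duplicate_pairs(pages, threshold=0.9, distance=2)
print(pairs)
```

With N pages, the window limits the work to about N × distance comparisons instead of N² / 2.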

Attributes¤
core.batching.Deduplicator.urls_to_ignore class-attribute instance-attribute ¤
urls_to_ignore: list[str] = [
    "/tag/",
    "/tags/",
    "/category/",
    "/categories/",
    "/author/",
    "/authors/",
    "/profil/",
    "/profiles/",
    "/user/",
    "/users/",
    "/login/",
    "/signup/",
    "/member/",
    "/members/",
    "/cart/",
    "/shop/",
    "/register",
]

URL substrings to find in URLs and remove matching web pages: mostly WordPress archive pages, user profiles and login pages.

Functions¤
core.batching.Deduplicator.prepare_posts_parallel classmethod ¤
prepare_posts_parallel(elem, discard_params, urls_to_ignore, fix_urls)

Canonicalize a :class:~core.types.web_page dict for the list path.

Delegates URL normalization to :meth:_canonicalize_url and adds list-path-specific fallbacks for length and datetime (which are guaranteed to be pre-computed on the DB path by batch_parse_web_page but may be absent on hand-assembled lists).

Returns the mutated elem dict, or None if the URL must be discarded.

core.batching.Deduplicator.get_unique_urls ¤
get_unique_urls(posts: list[web_page]) -> list[web_page]

Pick the most recent, or otherwise the longer, candidate for each canonical URL.

core.batching.Deduplicator.run_on_db ¤
run_on_db(db: sqlite3.Connection, chunksize: int = 4096) -> None

Deduplicate the pages table in-place, matching the full __call__ pipeline.

The method runs four sequential phases that mirror __call__:

  1. URL canonicalization – stream every row through :meth:prepare_posts_parallel (threaded, I/O-bound), normalise URLs, compute a SHA-1 content hash, and write results to the temporary _prepared table.
  2. URL deduplication – for each canonical URL keep the single best row using SQL window functions with :attr:_ELECTION_ORDER.
  3. Exact-content deduplication – among URL winners, collapse rows that share the same SHA-1 hash using the same election order.
  4. Near-duplicate removal (skipped when threshold == 1.0) – load the survivors into memory, run the Levenshtein window scan with parallelised comparisons (threaded; python-Levenshtein releases the GIL), write the final winner set back to a temp table.

The pages table is atomically replaced by the winner set at the end. All intermediate _prepared / _url_winners / _content_winners / _near_winners temp tables are cleaned up on success.

Assumptions:

  • pages has at least the columns: url, title, content, date, datetime, parsed, category.
  • datetime values, when present, are ISO-8601 strings (SQLite TEXT). NULL is treated as “oldest possible” in the election.
  • The external category label means the page was crawled by following external links and contains the full <body>; any other category means it was crawled from a sitemap / REST-API and contains cleaner markup. Non-external therefore wins over external in the election.

PARAMETER DESCRIPTION
db

Open sqlite3.Connection to the database.

TYPE: sqlite3.Connection

chunksize

Number of rows fetched per batch during Phase 1.

TYPE: int DEFAULT: 4096
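
The election performed in phase 2 can be illustrated with a SQL window function. This is a sketch under assumptions: the `prepared` table, its columns and the `ELECTION_ORDER` string are hypothetical, modelled on the priority rule stated for near-duplicates (non-external > newer > longer > shorter URL), and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE prepared ("
    "canonical_url TEXT, url TEXT, category TEXT, datetime TEXT, content_length INTEGER)"
)
db.executemany(
    "INSERT INTO prepared VALUES (?, ?, ?, ?, ?)",
    [
        ("https://example.com/a", "http://www.example.com/a?ref=x", "external", "2024-05-01T00:00:00", 900),
        ("https://example.com/a", "https://example.com/a", "sitemap", "2024-01-01T00:00:00", 500),
        ("https://example.com/b", "https://example.com/b", "sitemap", None, 100),
        ("https://example.com/b", "https://example.com/b#top", "sitemap", "2023-03-01T00:00:00", 100),
    ],
)

# Hypothetical election order. In SQLite, NULL sorts last under DESC,
# so a missing datetime behaves as "oldest possible".
ELECTION_ORDER = (
    "(category = 'external') ASC, datetime DESC, content_length DESC, LENGTH(url) ASC"
)

winners = db.execute(
    f"""
    SELECT canonical_url, url FROM (
        SELECT canonical_url, url,
               ROW_NUMBER() OVER (
                   PARTITION BY canonical_url ORDER BY {ELECTION_ORDER}
               ) AS rn
        FROM prepared
    )
    WHERE rn = 1
    ORDER BY canonical_url
    """
).fetchall()
print(winners)
```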

core.batching.Deduplicator.add_content_hash_column staticmethod ¤
add_content_hash_column(db: sqlite3.Connection) -> None

Add (or refresh) a content_hash column on the pages table.

Computes a SHA-1 digest of each row’s parsed field and stores it in content_hash. The column is created if it does not yet exist. Rows with a NULL parsed value are skipped and left with a NULL hash.

A covering index idx_pages_content_hash is created (or left in place) after the update so that subsequent deduplication queries are cheap.

This method is a standalone maintenance utility. The deduplication pipeline (:meth:run_on_db) computes hashes inline during Phase 1 and does not require this method to be called first.

Assumption: parsed values fit in memory individually (they are fetched one batch at a time, not all at once).

PARAMETER DESCRIPTION
db

Open sqlite3.Connection to the target database.

TYPE: sqlite3.Connection
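
A minimal sketch of the same idea, assuming a pages table with a parsed TEXT column. The function name `add_content_hash`, the `batch_size` parameter and the keyset pagination on rowid are illustrative choices, not the library's code.

```python
import hashlib
import sqlite3


def add_content_hash(db, batch_size=1000):
    """Store the SHA-1 of `parsed` in `content_hash`, one batch at a time."""
    columns = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    if "content_hash" not in columns:
        db.execute("ALTER TABLE pages ADD COLUMN content_hash TEXT")

    last_rowid = 0
    while True:
        # Keyset pagination: fetch the next batch of rows with non-NULL content.
        batch = db.execute(
            "SELECT rowid, parsed FROM pages "
            "WHERE rowid > ? AND parsed IS NOT NULL ORDER BY rowid LIMIT ?",
            (last_rowid, batch_size),
        ).fetchall()
        if not batch:
            break
        db.executemany(
            "UPDATE pages SET content_hash = ? WHERE rowid = ?",
            [
                (hashlib.sha1(parsed.encode("utf-8")).hexdigest(), rowid)
                for rowid, parsed in batch
            ],
        )
        last_rowid = batch[-1][0]

    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_pages_content_hash ON pages (content_hash)"
    )
    db.commit()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, parsed TEXT)")
db.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("u1", "hello"), ("u2", None), ("u3", "hello")],
)
add_content_hash(db, batch_size=2)
hashes = dict(db.execute("SELECT url, content_hash FROM pages"))
print(hashes)
```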

core.batching.Deduplicator.get_unique_content ¤
get_unique_content(posts: list[web_page]) -> list[web_page]

Pick the most recent candidate for each canonical content.

Return

canonical content: web_page dictionary

core.batching.Deduplicator.get_close_content ¤
get_close_content(
    posts: list[web_page], threshold: float = 0.9, distance: int = 50
) -> list[web_page]

Find and remove near-duplicates using the Levenshtein ratio.

Delegates the actual scan to :meth:_close_content_scan, which parallelises comparisons within each window via a :class:~concurrent.futures.ThreadPoolExecutor. This method is the list-path counterpart to :meth:_elect_near_duplicates; both call the same shared scan implementation.

The election among near-duplicate candidates honours the same priority rules as URL and content deduplication (non-external > newer > longer > shorter URL) via :meth:_elect_group.

PARAMETER DESCRIPTION
posts

List of :class:core.types.web_page dicts after URL and exact-content deduplication.

TYPE: list[web_page]

threshold

Minimum Levenshtein ratio for two pages to be considered near-duplicates. Defaults to :attr:self.threshold.

TYPE: float DEFAULT: 0.9

distance

Positions ahead to scan from each row after sorting by URL. Defaults to :attr:self.distance.

TYPE: int DEFAULT: 50

RETURNS DESCRIPTION
list[web_page]

Filtered list with near-duplicates removed; one survivor per group.

core.batching.Deduplicator.run_on_list ¤
run_on_list(posts: list[web_page]) -> list[web_page]

Deduplicate an in-memory list of web pages, matching the full pipeline.

This is the list-based counterpart to :meth:run_on_db. The two methods are kept symmetrical: both run the same four phases (URL canonicalization, exact-URL deduplication, exact-content deduplication, optional near-duplicate removal) and honour the same election rules.

Note

posts is consumed and partially destroyed during processing to avoid keeping two copies in memory simultaneously.

PARAMETER DESCRIPTION
posts

Flat list of :class:~core.types.web_page dicts. The list is modified in-place; callers should not rely on its contents after this call returns.

TYPE: list[web_page]

RETURNS DESCRIPTION
list[web_page]

Deduplicated list of sanitised :class:~core.types.web_page dicts, ready for downstream use. Also writes a domains frequency file

Functions¤

core.batching.parse_lang_to_iso639_1 ¤

parse_lang_to_iso639_1(value: str | None) -> str | None

Normalize language identifier to ISO 639-1.

core.batching.guess_language ¤

guess_language(
    string: str,
    stopwords_threshold: float = 0.05,
    letters_threshold: float = 0.8,
) -> str | None

Basic language guesser based on stopwords detection.

Stopwords are the most common words of a language: for each language, we count how many stopwords we found and return the language having the most matches. It is accurate for paragraphs and long documents, not so much for short sentences.

PARAMETER DESCRIPTION
string

the string to analyze. Needs to be lowercased but to retain accents and diacritics.

TYPE: str

stopwords_threshold

the minimum ratio of stopwords divided by total words in strings to be found to conclude on a language. For example, Japanese companies often have technical reports written in Japanese but still containing some English. If less than 5% of the words are known English stopwords, we could conclude it’s not English.

TYPE: float DEFAULT: 0.05

letters_threshold

the minimum ratio of roman (latin) characters among all characters (including numbers, symbols and non-latin alphabets) to be found to conclude on a language.

TYPE: float DEFAULT: 0.8

RETURNS DESCRIPTION
str | None

ISO 639-1 language code. Defaults to “en” if nothing found.
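
The stopword-counting principle can be sketched as follows. The tiny `STOPWORDS` sets are hypothetical, the letters_threshold check is left out, and this sketch returns None when the ratio is too low, which may differ from the fallback behaviour documented above.

```python
import re

# Hypothetical, heavily truncated stopword sets keyed by ISO 639-1 code.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "les"},
}


def guess_language(string, stopwords_threshold=0.05):
    """Return the language whose stopwords match most often, or None."""
    words = re.findall(r"\w+", string.lower())
    if not words:
        return None
    # Count stopword hits per language.
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Reject the guess when too few words are known stopwords.
    if scores[best] / len(words) < stopwords_threshold:
        return None
    return best


print(guess_language("the cat is in the garden and the dog is out"))
print(guess_language("le chat est dans la maison et les chiens"))
```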

core.batching.detect_language ¤

detect_language(text: str) -> str | None

Detect language from arbitrary text safely.

RETURNS DESCRIPTION
str | None

ISO 639-1 language code.

core.batching.tokenize_document_to_words ¤

tokenize_document_to_words(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into single words

PARAMETER DESCRIPTION
language

ISO 639-1 language code.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[str]

Bag of words for the whole document. Sentence delimiters are removed.

core.batching.split_document_to_sentences ¤

split_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into a list of sentences.

PARAMETER DESCRIPTION
language

ISO 639-1 language code.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[str]

List of sentences as full text.

core.batching.tokenize_document_to_sentences ¤

tokenize_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[list[str]]

Split a text into single words as a list of lists

PARAMETER DESCRIPTION
language

ISO 639-1 language code.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[list[str]]

List of sentences, each sentence is itself a list of words.

core.batching.split_url ¤

split_url(url: str) -> tuple[str, str, str, str, str] | None

Split a well-formed URL following RFC3986 into base elements.

RETURNS DESCRIPTION
tuple[str, str, str, str, str] | None

a tuple of (protocol, domain, page, parameters, anchor). Empty/missing fields are initialized with empty strings, so there is no need for individual None checks. If the url input doesn’t match a URL format, return None.
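
The return contract can be illustrated with the standard library. This sketch relies on `urllib.parse.urlsplit`, which is an assumption: the actual function may use its own parsing and validate URLs differently.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (protocol, domain, page, parameters, anchor), or None."""
    parts = urlsplit(url)
    # Treat inputs without a scheme or a host as "not a URL".
    if not parts.scheme or not parts.netloc:
        return None
    # Missing components are already empty strings, never None.
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)


print(split_url("https://example.com/blog/post?id=3#top"))
print(split_url("https://example.com"))
print(split_url("not a url"))
```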

core.batching.adapt_array ¤

adapt_array(arr: np.ndarray)

core.batching.create_db ¤

create_db(name: str) -> sqlite3.Connection

Create the pages table if needed and add any missing columns. This doesn’t destroy existing tables, rows or columns, so it’s safe to run on any database.

Warning

Columns are inferred directly from web_page.__annotations__. Existing columns are preserved unchanged.

The url column is used as the PRIMARY KEY.
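
Inferring columns from annotations can be sketched as below. The `WebPage` TypedDict, the `SQL_TYPES` mapping and the `ensure_pages_table` function are hypothetical stand-ins for core.types.web_page and the library's own logic.

```python
import sqlite3
from typing import TypedDict


class WebPage(TypedDict):
    """Hypothetical, reduced stand-in for core.types.web_page."""
    url: str
    title: str
    length: int


# Assumed mapping from Python annotations to SQLite column types.
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL", bytes: "BLOB"}


def ensure_pages_table(db, schema):
    """Create `pages` if needed, then add any column missing from it."""
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY)")
    existing = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    for name, hint in schema.__annotations__.items():
        if name not in existing:
            sql_type = SQL_TYPES.get(hint, "BLOB")
            db.execute(f'ALTER TABLE pages ADD COLUMN "{name}" {sql_type}')
    db.commit()


db = sqlite3.connect(":memory:")
ensure_pages_table(db, WebPage)
columns = [row[1] for row in db.execute("PRAGMA table_info(pages)")]
print(columns)
```

Running the function again is a no-op, and columns that exist in the table but not in the annotations are left alone.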

core.batching.create_temp_db ¤

create_temp_db(
    min_free: float = 2.0, filename: str | None = None
) -> sqlite3.Connection

Create a temporary SQLite database file (in /dev/shm when available) and initialize the pages table according to web_page annotations.

PARAMETER DESCRIPTION
min_free

minimum available disk space in GiB required to create the temporary database. This is checked at runtime and the function will raise an error if the condition is not met.

TYPE: float DEFAULT: 2.0

filename

the full path and filename to save the temporary database, if it needs to be reused at some point.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
sqlite3.Connection

the sqlite3.Connection opened in bulk mode.

Warning

The temporary SQLite database doesn’t use web_page URL as primary key, to allow later deduplication.

core.batching.delete_temp_db ¤

delete_temp_db(db: sqlite3.Connection)

Close and delete a temporary database in one shot.

core.batching.open_db ¤

open_db(name: str, mode: str = 'rw') -> sqlite3.Connection

Open an SQLite database with workload-specific optimizations.

PARAMETER DESCRIPTION
name

Database identifier/path passed to get_models_folder().

TYPE: str

mode
  • “rw”: Generic read/write mode.
  • “ro”: Read-only immutable mode optimized for serving/search workloads.
  • “bulk”: Bulk-ingestion mode optimized for large batch writes.

TYPE: str DEFAULT: 'rw'

RETURNS DESCRIPTION
sqlite3.Connection

sqlite3.Connection

core.batching.compress_db ¤

compress_db(
    db: sqlite3.Connection,
    delete_query: str | None = None,
    delete_params: tuple | None = None,
    delete_columns: list[str] | None = None,
)

Optionally delete rows, then reclaim SQLite disk space.

PARAMETER DESCRIPTION
db

SQLite connection

TYPE: sqlite3.Connection

delete_query

full DELETE SQL query

TYPE: str | None DEFAULT: None

delete_params

optional SQL parameters

TYPE: tuple | None DEFAULT: None

core.batching.is_primary_key ¤

is_primary_key(db: sqlite3.Connection, table: str, column: str) -> bool

Check whether column is part of the PRIMARY KEY of table.

core.batching.populate_db ¤

populate_db(
    db: sqlite3.Connection, pages: list[web_page], batch_size: int = 4096
)

Insert or update web_page records into the SQLite database.

Existing rows are matched using the PRIMARY KEY url.

Warning

Array-like Python values are converted to bytearray then to bytes in order to be handled as BLOB by SQLite.

core.batching.db_to_list ¤

db_to_list(db: sqlite3.Connection) -> list[web_page]

Extract all web_page rows from the pages table in db as a list of web_page

core.batching.migrate_url_to_primary_key ¤

migrate_url_to_primary_key(db: sqlite3.Connection)

Rebuild the pages table using url as PRIMARY KEY for older databases that didn’t use a primary key.

core.batching.merge_databases ¤

merge_databases(old_db: sqlite3.Connection, new_db: sqlite3.Connection)

Merge two pages databases.

Rows from old_db are inserted into new_db only if their URL does not already exist.

Existing rows in new_db are preserved unchanged.

Only columns existing in BOTH databases are copied.
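
These rules can be sketched with `INSERT OR IGNORE` restricted to the shared columns. `merge_pages` is a hypothetical function written for illustration, assuming url is the PRIMARY KEY of the destination table; the actual implementation may differ.

```python
import sqlite3


def merge_pages(old_db, new_db):
    """Copy rows from old_db.pages into new_db.pages unless the URL already exists."""
    old_cols = [row[1] for row in old_db.execute("PRAGMA table_info(pages)")]
    new_cols = {row[1] for row in new_db.execute("PRAGMA table_info(pages)")}
    # Only columns present in both schemas are copied.
    shared = [col for col in old_cols if col in new_cols]
    col_list = ", ".join(f'"{col}"' for col in shared)
    placeholders = ", ".join("?" for _ in shared)
    rows = old_db.execute(f"SELECT {col_list} FROM pages")
    # OR IGNORE skips rows whose primary key (url) is already in new_db.
    new_db.executemany(
        f"INSERT OR IGNORE INTO pages ({col_list}) VALUES ({placeholders})", rows
    )
    new_db.commit()


old_db = sqlite3.connect(":memory:")
old_db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, legacy TEXT)")
old_db.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [("u1", "old title", "x"), ("u2", "only in old", "y")],
)
new_db = sqlite3.connect(":memory:")
new_db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, lang TEXT)")
new_db.execute("INSERT INTO pages VALUES (?, ?, ?)", ("u1", "new title", "en"))

merge_pages(old_db, new_db)
merged = new_db.execute("SELECT url, title, lang FROM pages ORDER BY url").fetchall()
print(merged)
```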

core.batching.update_pages_from_database ¤

update_pages_from_database(
    target_db: sqlite3.Connection, source_db: sqlite3.Connection
) -> list[str]

Update rows in target_db.pages from source_db.pages using url as PRIMARY KEY.

Only shared columns are updated.

Returns missing_urls: URLs present in target_db but absent from source_db.

core.batching.import_pages ¤

import_pages(
    source_db: str | sqlite3.Connection,
    destination_db: str | sqlite3.Connection,
    where_clause: str = "1=1",
    params: tuple = (),
) -> int

Import rows from one SQLite database into another.

Both source_db and destination_db may be either a filesystem path (str) or an active sqlite3.Connection handle. Passing a Connection is the only way to target a :memory: database, since those cannot be addressed by path.

Connection lifecycle:

  • Path supplied – the function opens, commits, and closes the connection itself (original behaviour).
  • Connection supplied – the caller retains full control; the connection is neither committed nor closed here, so the import can participate in a larger transaction.

Rows are copied from source.pages into destination.pages. Existing rows are updated on conflict of the url primary key. Columns present in the destination but absent from the source receive NULL. Both schemas are discovered at runtime, so the function adapts automatically if either evolves.

PARAMETER DESCRIPTION
source_db

Path to, or an open connection for, the source SQLite database.

TYPE: str | sqlite3.Connection

destination_db

Path to, or an open connection for, the destination SQLite database.

TYPE: str | sqlite3.Connection

where_clause

SQL WHERE clause applied to source.pages. Example: "domain = ? AND date >= ?"

TYPE: str DEFAULT: '1=1'

params

Positional parameters bound to where_clause.

TYPE: tuple DEFAULT: ()

RETURNS DESCRIPTION
int

Number of affected rows.

Examples::

# File → file (unchanged from before)
import_pages("old.db", "new.db", "domain = ?", ("example.com",))

# In-memory source → file destination
import_pages(mem_conn, "new.db")

# File source → in-memory destination (e.g. for tests)
import_pages("prod.db", mem_conn, "date >= ?", ("2024-01-01",))

# Both in-memory
import_pages(src_conn, dst_conn)

core.batching.inspect_db ¤

inspect_db(db: sqlite3.Connection, message: str = '') -> None

Print useful metadata and statistics about a SQLite database.

PARAMETER DESCRIPTION
db

active database connection

TYPE: sqlite3.Connection

message

optional additional message to identify several inspections if any.

TYPE: str DEFAULT: ''

core.batching.sanitize_web_page ¤

sanitize_web_page(page: web_page) -> web_page

Ensure existence and validity of web_page keys/values.

core.batching.batch_guess_dates ¤

batch_guess_dates(db: sqlite3.Connection, chunksize: int = 2048)

High-throughput parallel datetime parsing.

core.batching.batch_parse_web_page ¤

batch_parse_web_page(
    documents: sqlite3.Connection,
    tokenizer: Tokenizer,
    chunksize: int = 512,
    cores: int | None = None,
)

High-performance parallel parsing for core.types.web_page objects

This function is meant to cleanup text encoding issues and multi-spacings in web_page title and content. It prepares the web_page["parsed"] field from title and content for the next stages of tokenization, and updates language (using declared ISO code or machine-learned detection).

It must be called before core.deduplicator.Deduplicator, so that content deduplication has a clean parsed version with which to compare web pages.

PARAMETER DESCRIPTION
documents

any database having core.types.web_page rows stored in a pages table and stored on the filesystem. It cannot be a memory-hosted database: each parallel worker will open its own copy by file path.

TYPE: sqlite3.Connection

tokenizer

we only use it for the core.nlp.Tokenizer.normalize_text method

TYPE: Tokenizer

chunksize

number of SQLite rows to process at once. Too many is not helpful, since some batches may take longer than others depending on text length.

TYPE: int DEFAULT: 512

cores

CPU cores to use for parallel processing.

TYPE: int | None DEFAULT: None

core.batching.batch_tokenize ¤

batch_tokenize(
    db: sqlite3.Connection,
    tokenizer: Tokenizer,
    chunksize: int = 512,
    urls: list[str] | None = None,
    only_none: bool = True,
)

Tokenize a list of web_pages in a non-destructive way, in parallel, in a RAM-friendly way, directly in database.

Populate the tokenized database column from the parsed column. This needs to run after core.batching.batch_parse_web_page and prepares n-gram training if any, or stemming.

Note

The tokenization is forced non-destructive and doesn’t apply stemming, stopwords removal, normalization, or n-grams. Original sentences can be reconstructed from joining back the list of tokens.

PARAMETER DESCRIPTION
urls

list of URLs to tokenize. If None, the whole database is processed.

TYPE: list[str] | None DEFAULT: None

only_none

tokenize only the new entries that have not been tokenized already. If False, force-update the whole database. It has no effect when urls are explicitly specified.

TYPE: bool DEFAULT: True

core.batching.batch_stem ¤

batch_stem(
    db: sqlite3.Connection,
    tokenizer: Tokenizer,
    chunksize: int = 512,
    urls: list[str] | None = None,
    only_none: bool = True,
)

Tokenize and stem a list of web_pages in parallel, in a RAM-friendly way, directly in database.

Populate the stemmed database column from the tokenized column. This needs to run after core.batching.batch_tokenize. The tokenization is destructive and applies stemming, stopwords removal, normalization and n-grams if available.

PARAMETER DESCRIPTION
urls

list of URLs to stem. If None, the whole database is processed.

TYPE: list[str] | None DEFAULT: None

only_none

stem only the new entries that have not been stemmed already. If False, force-update the whole database. It has no effect when urls are explicitly specified.

TYPE: bool DEFAULT: True

core.batching.batch_vectorize ¤

batch_vectorize(
    db: sqlite3.Connection, word2vec: Word2Vec, chunksize: int = 256
)

Vectorize a column of the db database with the provided word2vec model, using all available cores.

Works on the tokenized column of the database and writes the vectorized column. Vectors are normalized as per nlp.Word2Vec.get_features() output.