core.batching¤
High-performance, parallelized, high-level methods to process large corpora of documents.
Interfaces NLP processing with database entries, for efficient RAM management.
The database structure is hard-coded and must conform to the data structures defined in core.database and core.types.
© 2026 - Aurélien Pierre
Attributes¤
core.batching.LANG_MAP
module-attribute
¤
LANG_MAP = {
"en": "english",
"fr": "french",
"de": "german",
"es": "spanish",
"it": "italian",
"pt": "portuguese",
"nl": "dutch",
"sv": "swedish",
"no": "norwegian",
"da": "danish",
"fi": "finnish",
"ru": "russian",
"ro": "romanian",
"hu": "hungarian",
"tr": "turkish",
}
Map ISO 639-1 language codes of supported languages to their full-name, as used by pre-trained corpora
core.batching.LANG_MAP_REVERSE
module-attribute
¤
LANG_MAP_REVERSE = {v: k for k, v in (LANG_MAP.items())}
Map the full-name of supported languages, as used by pre-trained corpora, to ISO 639-1 language codes
core.batching.STOPWORDS_DICT
module-attribute
¤
STOPWORDS_DICT = {
language: (set(STOPWORDS_DICT[language])) for language in STOPWORDS_DICT
}
Dictionary of stopwords (stored as sets) keyed by full language name.
core.batching.regex_starter
module-attribute
¤
Start of line, or start of document, or start of markup
core.batching.regex_stopper
module-attribute
¤
End of line, or end of document, or end of markup
core.batching.end_of_word
module-attribute
¤
End of word, or end of line, or end of document, or end of markup
core.batching.IP_PATTERN
module-attribute
¤
IP_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)
IPv4 and IPv6 patterns where the whole IP is captured in the first group.
core.batching.EMAIL_PATTERN
module-attribute
¤
EMAIL_PATTERN = re.compile(
"<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
re.IGNORECASE,
)
Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
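To illustrate the first-group capture, the snippet below recompiles the pattern exactly as documented above and applies it to a made-up string. The helper name `extract_emails` and the sample addresses are assumptions chosen for the demonstration, not part of the module.

```python
import re

# Pattern copied from the EMAIL_PATTERN definition documented above.
EMAIL_PATTERN = re.compile(
    r"<?([0-9a-z\-\_\+\.]+?@[0-9a-z\-\_\+]+(\.[0-9a-z\_\-]{2,})+)>?",
    re.IGNORECASE,
)


def extract_emails(text: str) -> list[str]:
    """Hypothetical helper: return each address without enclosing < >."""
    return [match.group(1) for match in EMAIL_PATTERN.finditer(text)]


emails = extract_emails("From <me@mail.com>, cc john.doe+tag@sub.example.org")
```

Group 1 holds the bare address in both the bracketed and the unbracketed form, so callers do not need to strip the angle brackets themselves.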
core.batching.URL_PATTERN
module-attribute
¤
URL_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)
URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page.
URLs must follow RFC 3986, meaning query parameters should come before anchors, if any. Relying on this assumption allows faster regex parsing.
- the protocol (ftp, ftps, http, https) is captured as the first group,
- domain.ext is captured as the second group,
- /page/etc is the third group, including leading and trailing /,
- page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
- anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.
URLs are captured if they are:
- alone on their own line,
- enclosed in {}, [], (),
- enclosed in whitespaces.
Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name, since () and [] are valid in URL paths and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON will need to be parsed ahead.
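The five-group layout described above can be sketched with a much simplified stand-in pattern. The real `regex_url`, `regex_starter` and `end_of_word` fragments are not shown on this page, so `URL_SKETCH` and `split_url` below are hypothetical and only mimic the documented group order.

```python
import re

# ASSUMPTION: simplified stand-in, not the module's actual regex_url.
URL_SKETCH = re.compile(
    r"(?:(ftps?|https?):)?//"          # group 1: optional protocol
    r"([a-z0-9\-.]+\.[a-z]{2,})"       # group 2: domain.ext
    r"(/[^\s?#]*)?"                    # group 3: page path
    r"(\?[^\s#]*)?"                    # group 4: query parameters, with "?"
    r"(#\S*)?",                        # group 5: anchor, with "#"
    re.IGNORECASE,
)


def split_url(text: str):
    """Hypothetical helper: return the 5 groups of the first URL, or None."""
    match = URL_SKETCH.search(text)
    return match.groups() if match else None


full = split_url("see https://domain.ext/page/subpage?q=x&r=0:1#anchor")
bare = split_url("//domain.ext/page")
```

Because the query group stops at `#` and the anchor group comes last, the sketch depends on the same "parameters before anchor" ordering that the documentation assumes.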
core.batching.MEMBERS_PATTERN
module-attribute
¤
Domain patterns without leading protocol like cdn.company.com
or class members in object-oriented programming languages like params.cookies.client.
core.batching.DATE_PATTERN
module-attribute
¤
Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22 where the whole date is captured in the first group, then each group of digits is captured in the order of appearance, in the next 3 groups
core.batching.TIME_PATTERN
module-attribute
¤
Identify more or less standard time patterns, like:
- 12h15
- 12:15
- 12:15:00
- 12am
- 12 am
- 12 h
- 12:15:00Z
- 12:15:00+01
- 12:15:00 UTC+1
- 11:27:45+0000
| RETURNS | DESCRIPTION |
|---|---|
| 0 | 1- or 2-digits hour, |
| 1 | hour/minutes separator or half-day marker among |
| 2 | 2-digits minutes, if any, or |
| 3 | 2-digits seconds, if any. |
| 4 | hour marker ( |
| 5 | 1- or 2-digits signed integer timezone shift (referred to UTC). |
Examples:
see https://regex101.com/r/QNtZAK/2
see src/tests/test-patterns.py
core.batching.DOMAIN_PATTERN
module-attribute
¤
Matches patterns like from (domain.ext) from RFC-822 Received header in emails.
core.batching.UID_PATTERN
module-attribute
¤
Matches email integer UID from IMAP headers.
core.batching.FLAGS_PATTERN
module-attribute
¤
Matches email flags from IMAP headers.
core.batching.PATH_PATTERN
module-attribute
¤
PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))
File path pattern like ~/file, /home/file, ./file or C:\windows
core.batching.PARTIAL_PATH_REGEX
module-attribute
¤
PARTIAL_PATH_REGEX = re.compile(
"%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)
Partial, invalid path patterns missing the leading root, like home/user/stuff.
We start capturing after at least two folder separators (slash or backslash).
Warning
This will collide with date detection, so run it after date detection in the pipeline.
core.batching.RESOLUTION_PATTERN
module-attribute
¤
Pixel resolution like 10x20 or 10×20. Units are discarded.
core.batching.NUMBER_PATTERN
module-attribute
¤
NUMBER_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_number, regex_stopper)
)
Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with leading or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.
core.batching.HASH_PATTERN
module-attribute
¤
HASH_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)
Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.
core.batching.MULTIPLE_LINES
module-attribute
¤
Detect more than 2 newlines and tabs, possibly mixed with spaces.
core.batching.MULTIPLE_NEWLINES
module-attribute
¤
Detect broken sequences of newlines and spaces.
core.batching.INTERNAL_NEWLINE
module-attribute
¤
Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal (a newline is used instead of a space).
core.batching.EXPOSURE
module-attribute
¤
EXPOSURE = re.compile(
"%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)
Exposure values in EV or IL
core.batching.PHOTOSPEED
module-attribute
¤
PHOTOSPEED = re.compile(
"%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
flags=re.IGNORECASE,
)
Exposure values in EV or IL
core.batching.SENSIBILITY
module-attribute
¤
SENSIBILITY = re.compile(
"%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
flags=re.IGNORECASE,
)
Photographic sensibility in ISO or ASA
core.batching.LUMINANCE
module-attribute
¤
LUMINANCE = re.compile(
"%s%s%s" % (regex_starter, luminance_regex, end_of_word),
flags=re.IGNORECASE,
)
Luminance/radiance in nits or Cd/m²
core.batching.DIAPHRAGM
module-attribute
¤
DIAPHRAGM = re.compile(
"%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)
Photographic diaphragm aperture values like f/2.8 or f/11
core.batching.GAIN
module-attribute
¤
GAIN = re.compile(
"%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)
Gain, attenuation and PSNR in dB
core.batching.FILE_SIZE
module-attribute
¤
FILE_SIZE = re.compile(
"%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)
File and memory size in bit, byte, or octet and their multiples
core.batching.DISTANCE
module-attribute
¤
DISTANCE = re.compile(
"%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)
Distance in meter, inch, foot and their multiples
core.batching.PERCENT
module-attribute
¤
PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))
Number followed by %
core.batching.WEIGHT
module-attribute
¤
WEIGHT = re.compile(
"%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)
Weight (mass) in British and SI units and their multiples
core.batching.ANGLE
module-attribute
¤
ANGLE = re.compile(
"%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)
Angles in radians, degrees and steradians
core.batching.TEMPERATURE
module-attribute
¤
TEMPERATURE = re.compile(
"%s%s%s" % (regex_starter, temperature_regex, end_of_word),
flags=re.IGNORECASE,
)
Temperatures in °C, °F and K
core.batching.FREQUENCY
module-attribute
¤
FREQUENCY = re.compile(
"%s%s%s" % (regex_starter, frequency_regex, end_of_word),
flags=re.IGNORECASE,
)
Frequencies in hertz and multiples
core.batching.TEXT_DATES
module-attribute
¤
TEXT_DATES = re.compile(
"([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
flags=re.IGNORECASE | re.MULTILINE,
)
Find textual date formats:
- English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
- French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
| RETURNS | DESCRIPTION |
|---|---|
| 0 | 2 digits (day number or year number, depending on language) |
| 1 | month (full-form or abbreviated) |
| 2 | 2 digits (day number or year number, depending on language) |
| 3 | 4 digits (full year) |
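As a quick check of the behaviour described above, the snippet below recompiles the documented TEXT_DATES pattern and runs it on made-up strings. The helper `find_text_date` is hypothetical. Python numbers capture groups from 1, so the table entries 0 to 3 presumably correspond to `group(1)` to `group(4)`.

```python
import re

# Pattern copied from the TEXT_DATES definition documented above,
# split over several string literals for readability only.
TEXT_DATES = re.compile(
    r"([0-9]{1,2})? "
    r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
    r"|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc"
    r"|janvier|février|mars|avril|mai|juin|juillet|août"
    r"|septembre|octobre|novembre|décembre"
    r"|january|february|march|april|may|june|july|august"
    r"|september|october|november|december)"
    r"\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)


def find_text_date(text: str):
    """Hypothetical helper: return the 4 groups, stripped of leading spaces."""
    match = TEXT_DATES.search(text)
    if match is None:
        return None
    return tuple(part.strip() if part else None for part in match.groups())


with_year = find_text_date("sent 01 Jan. 2020 12:08")
with_time = find_text_date("01 Jan 20 12:08")
```

In the second call, the negative lookahead `(?!\:)` makes the engine give up on ` 12` as a year candidate, so the hour of the adjacent `12:08` is left out of the match.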
core.batching.BASE_64
module-attribute
¤
BASE_64 = re.compile(
"((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)
Identifies base64 encoding
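The pattern requires at least 64 groups of 4 characters, so only long blobs (256 characters or more) are matched and short base64-looking words are left alone. A small sketch using the pattern as documented above; the helper `strip_base64` and the sample payloads are assumptions made for illustration.

```python
import base64
import re

# Pattern copied from the BASE_64 definition documented above.
BASE_64 = re.compile(
    r"((?:[A-Za-z0-9+\/]{4}){64,}(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?)"
)

# 256 bytes encode to 344 base64 characters, padding included.
long_blob = base64.b64encode(bytes(range(256))).decode("ascii")
# 5 bytes encode to 8 characters: too short to be matched.
short_blob = base64.b64encode(b"hello").decode("ascii")


def strip_base64(text: str) -> str:
    """Hypothetical helper: replace long base64 runs with a single space."""
    return BASE_64.sub(" ", text)


cleaned = strip_base64("before " + long_blob + " after")
```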
core.batching.BB_CODE
module-attribute
¤
Identifies left-over BB code markup [img] and [quote]
core.batching.MARKUP
module-attribute
¤
Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]
core.batching.USER
module-attribute
¤
Identifies user handles or emails
core.batching.REPEATED_CHARACTERS
module-attribute
¤
Identifies any character repeated more than 9 times
core.batching.UNFINISHED_SENTENCES
module-attribute
¤
Identifies sentences finishing with 2 newlines characters without having ending punctuations
core.batching.MULTIPLE_DOTS
module-attribute
¤
Identifies dots repeated more than twice
core.batching.MULTIPLE_DASHES
module-attribute
¤
Identifies dashes repeated more than once
core.batching.MULTIPLE_QUESTIONS
module-attribute
¤
Identifies question marks repeated more than once
core.batching.ORDINAL_FR
module-attribute
¤
French ordinal numbers (numéros n°)
core.batching.FRANCAIS
module-attribute
¤
FRANCAIS = re.compile(
"%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
% regex_starter,
flags=re.IGNORECASE,
)
French contractions of pronouns and determiners
core.batching.DASHES
module-attribute
¤
Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.
core.batching.ALTERNATIVES
module-attribute
¤
Slash-separated word alternatives like and/or mr/mrs
core.batching.PLURAL_S
module-attribute
¤
PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)
Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.
core.batching.FEMININE_E
module-attribute
¤
FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)
Identify feminine form of adjectives (French) in -e.
core.batching.DOUBLE_CONSONANTS
module-attribute
¤
Identify double consonants in the middle of words.
core.batching.FEMININE_TRICE
module-attribute
¤
FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)
Identify French feminine nouns in -trice.
core.batching.ADVERB_MENT
module-attribute
¤
ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)
Identify French adverbs and English nouns ending in -ment
core.batching.SUBSTANTIVE_TION
module-attribute
¤
SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)
Identify French and English substantives formed from verbs by adding -tion and -sion
core.batching.SUBSTANTIVE_AT
module-attribute
¤
SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)
Identify French and English substantives formed from other nouns by adding -at
core.batching.PARTICIPLE_ING
module-attribute
¤
PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)
Identify English substantives and present participles formed from verbs by adding -ing
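The suffix patterns in this group use variable-width lookbehinds such as `(?<=\w{4,})`, which Python's built-in `re` module rejects (it only accepts fixed-width lookbehinds), so the module presumably compiles them with another regex engine. The sketch below reproduces the idea with the standard library only, by capturing the stem instead of looking behind. `ING_SKETCH` and `strip_ing` are hypothetical names, and `\b` is an assumed stand-in for `end_of_word`, whose definition is not shown on this page.

```python
import re

# ASSUMPTION: stdlib-compatible rewrite of the documented pattern;
# the stem (4+ word characters) is captured so it can be kept.
ING_SKETCH = re.compile(r"(\w{4,})ing\b")


def strip_ing(token: str) -> str:
    """Hypothetical helper: drop a trailing -ing when the stem is long enough."""
    return ING_SKETCH.sub(r"\1", token)


stemmed = strip_ing("processing")
untouched = strip_ing("ring")
```

The minimum stem length is what protects short words such as "ring" or "sing" from being truncated.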
core.batching.ADJECTIVE_ED
module-attribute
¤
ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)
Identify English adjectives formed from verbs by adding -ed
core.batching.ADJECTIVE_TIF
module-attribute
¤
ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)
Identify English and French adjectives formed from verbs by adding -tif or -tive
core.batching.SUBSTANTIVE_Y
module-attribute
¤
SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)
Identify English substantives ending in -y
core.batching.VERB_IZ
module-attribute
¤
VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)
Identify American verbs ending in -iz that French and Brits write in -is
core.batching.STUFF_ER
module-attribute
¤
STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)
Identify French 1st group verb (infinitive) and English substantives ending in -er
core.batching.BRITISH_OUR
module-attribute
¤
BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)
Identify British spelling ending in -our (colour, behaviour).
core.batching.SUBSTANTIVE_ITY
module-attribute
¤
SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)
Identify substantives in -ity (English) and -ite (French).
core.batching.SUBSTANTIVE_IST
module-attribute
¤
SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)
Identify substantives in -ist and -ism.
core.batching.SUBSTANTIVE_IQU
module-attribute
¤
SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)
Identify French substantives in -iqu
core.batching.SUBSTANTIVE_EUR
module-attribute
¤
SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)
Identify French substantives in -eur
core.batching.HYPHENIZED
module-attribute
¤
Detect hyphenated words at the end of a PDF text line.
core.batching.WAYBACK_RE
module-attribute
¤
Find the canonical URL from web.archive.org (Wayback Machine) URLs
Classes¤
core.batching.Lexicon
dataclass
¤
Mutable token frequency index with canonicalization helpers for:
- malformed n-grams,
- merged/split variants,
- plural compound normalization.
Examples:
liber_tarian -> libertarian
etres_humains -> etre_humain
Functions¤
core.batching.Lexicon.update
¤
core.batching.Lexicon.exists
¤
Check whether a token exists in the lexicon.
core.batching.Lexicon.prune
¤
prune(min_count: int = 10) -> None
Remove all entries whose frequency is lower than min_count.
| PARAMETER | DESCRIPTION |
|---|---|
| min_count | Minimum frequency to keep. |
core.batching.Lexicon.resolve_token
¤
Attempt to canonicalize malformed n-grams.
Operations:
1. malformed n-grams: liber_tarian -> libertarian
2. plural compound reduction: etres_humains -> etre_humain
Strategy:
- if token exists already -> keep it
- otherwise:
  - remove separators,
  - check if merged variant exists,
  - compare frequencies,
  - prefer merged form if sufficiently frequent.
| PARAMETER | DESCRIPTION |
|---|---|
| token | Token to canonicalize. |
| separator | N-gram separator. |
| min_ratio | Require merged token frequency to be at least. Helps avoid false positives. |
| RETURNS | DESCRIPTION |
|---|---|
| str | Canonicalized token. |
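The merge strategy above can be sketched as follows. This is not the module's implementation: the frequency that `min_ratio` is compared against is cut off in the description, so the sketch assumes it is the count of the most frequent member of the n-gram. The function name, the default `min_ratio` value and the sample counts are all hypothetical, and plural compound reduction is left out.

```python
from collections import Counter


def resolve_token_sketch(
    token: str, counts: Counter, separator: str = "_", min_ratio: float = 2.0
) -> str:
    """Hypothetical sketch of the merge step of the documented strategy."""
    if counts[token] > 0:
        return token  # the token exists already: keep it

    members = token.split(separator)
    merged = "".join(members)  # remove separators
    merged_count = counts[merged]
    if merged_count == 0:
        return token  # the merged variant is unknown: nothing to prefer

    # ASSUMPTION: compare against the most frequent individual member.
    reference = max(counts[member] for member in members)
    return merged if merged_count >= min_ratio * reference else token


counts = Counter(
    {"libertarian": 50, "liber": 1, "new": 100, "york": 80, "newyork": 2}
)
fixed = resolve_token_sketch("liber_tarian", counts)
kept = resolve_token_sketch("new_york", counts)
```

With these counts, "liber_tarian" is merged because "libertarian" is far more frequent than its fragments, while "new_york" is kept because the merged spelling is rare compared to "new" and "york".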
core.batching.Tokenizer
¤
Tokenizer(
meta_tokens: dict[re.Pattern, str] | None = None,
abbreviations: dict[str, str] | None = None,
replacements: dict[str, str] | None = None,
stopwords: set[str] | None = None,
lang_stopwords: dict[str, set[str]] | None = None,
backend: str = "blingfire",
)
Pre-processing pipeline and tokenizer.
Splits a string into normalized word tokens after applying a series of configurable text transformations.
| PARAMETER | DESCRIPTION |
|---|---|
| meta_tokens | Pipeline of regular-expression substitutions used to replace document fragments with meta-tokens. Keys must be compiled regular expressions. Transformations are applied in declaration order. This relies on Python’s ordered dictionaries (Python 3.7+). If not provided, a default pipeline suitable for bilingual English/French technical documents is used. |
| abbreviations | Pipeline of abbreviation replacements as a dictionary. Replacements are applied in declaration order. |
| replacements | Dictionary of token-level substitutions applied as 1:1 token replacements. |
| stopwords | Language-agnostic stopwords to remove from the token stream. |
| lang_stopwords | Language-specific stopwords. Keys must be ISO 639-1 language codes and values must be sets of stopwords associated with each language. |
| backend | Tokenization backend to use. Supported values are: |
Attributes¤
core.batching.Tokenizer.characters_cleanup
class-attribute
instance-attribute
¤
characters_cleanup: dict[(re.Pattern) : str] = {
MULTIPLE_DOTS: "...",
MULTIPLE_DASHES: "-",
MULTIPLE_QUESTIONS: "?",
REPEATED_CHARACTERS: " ",
BB_CODE: " ",
MARKUP: " \\1 ",
BASE_64: " ",
}
Dictionary of regular expressions (keys) to find and replace by the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBcode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).
core.batching.Tokenizer.internal_meta_tokens
class-attribute
instance-attribute
¤
internal_meta_tokens: dict[(re.Pattern) : str] = {
HASH_PATTERN_FAST: "_HASH_",
NUMBER_PATTERN_FAST: "_NUMBER_",
}
Dictionary of regular expressions (keys) to find in full tokens and replace by meta-tokens. Uses simplified regex patterns for performance.
core.batching.Tokenizer.abbreviations
instance-attribute
¤
Abbreviations and contractions to replace in full documents
core.batching.Tokenizer.replacements
instance-attribute
¤
Arbitrary string replacements in single tokens
core.batching.Tokenizer.stopwords
instance-attribute
¤
stopwords = set(stopwords) if stopwords else None
Language-agnostic stopwords
core.batching.Tokenizer.lang_stopwords
instance-attribute
¤
Language-specific stopwords
core.batching.Tokenizer.supports_ngrams
instance-attribute
¤
supports_ngrams: bool = False
Whether or not the tokenizer has an embedded n-grams model
core.batching.Tokenizer.ngrams_trie
instance-attribute
¤
Prefix tree of known n-grams for efficient lookups
core.batching.Tokenizer.vocabulary
instance-attribute
¤
Known tokens, if trained for n-grams.
Functions¤
core.batching.Tokenizer.prefilter
¤
Tokenizers split words based on unsupervised machine-learned models. Sometimes, they behave unexpectedly.
For example, in emails and user handles like @user, they would split @ and user as 2 different tokens,
making it impossible to detect usernames in single tokens later.
To avoid that, we replace data of interest by meta-tokens before the tokenization, with regular expressions.
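A minimal sketch of this idea, assuming an ordered dictionary of compiled patterns applied in declaration order. The two patterns and the `_EMAIL_` and `_USER_` meta-token names are hypothetical simplifications, not the module's defaults.

```python
import re

# ASSUMPTION: simplified patterns and meta-token names for illustration.
# Order matters: emails are replaced first so that the "@mail" part of an
# address is never mistaken for a user handle.
META_TOKENS_SKETCH = {
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]{2,})+"): "_EMAIL_",
    re.compile(r"(?<!\w)@\w+"): "_USER_",
}


def prefilter_sketch(text: str, meta_tokens=META_TOKENS_SKETCH) -> str:
    """Apply each substitution in declaration order, before tokenization."""
    for pattern, replacement in meta_tokens.items():
        text = pattern.sub(replacement, text)
    return text


filtered = prefilter_sketch("ping @alice or bob@mail.com")
```

After this pass, a tokenizer sees `_USER_` and `_EMAIL_` as plain words and can no longer split them on `@`.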
core.batching.Tokenizer.lemmatize
¤
Find the root (lemma) of words to help topical generalization.
core.batching.Tokenizer.normalize_text
¤
Prepare text for tokenization by converting it to lowercase ASCII characters.
This will lose accents, diacritics and capitals, which means some nuance will be lost
to the benefit of generality. If this does not suit your use case, you may
inherit the Tokenizer class, build a child class and re-implement this method.
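One common way to obtain lowercase ASCII text with the standard library is Unicode decomposition followed by an ASCII encode. The function below is a sketch of that general technique, not necessarily what the method does internally; `normalize_text_sketch` is a hypothetical name.

```python
import unicodedata


def normalize_text_sketch(text: str) -> str:
    """Hypothetical sketch: strip diacritics, drop non-ASCII, lowercase."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with "ignore" drops the combining marks
    # (and any character that has no ASCII decomposition).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


normalized = normalize_text_sketch("Été à Noël")
```

Characters without an ASCII decomposition are silently dropped by this approach, which is one reason a child class may want to override the method.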
core.batching.Tokenizer.normalize_token
¤
normalize_token(
word: str,
language: str | None,
normalize: bool = True,
meta_tokens: bool = True,
stem: bool = True,
remove_stopwords: bool = True,
) -> str | None
Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units
and URLs have their actual value replaced by meta-tokens designating their type.
Stopwords (“the”, “a”, etc.), punctuation and similar tokens are replaced by None, which should be filtered out at the next step.
| PARAMETER | DESCRIPTION |
|---|---|
| word | tokenized word in lower case only. |
| language | the ISO 639-1 language code used to remove typical stopwords. |
| normalize | remove punctuation and leading/trailing symbols. |
| meta_tokens | replace string patterns by meta-tokens. |
| stem | remove word suffixes, double consonants, etc. |
| remove_stopwords | remove stopwords. |
NOTE
Tokenization is non-destructive (full sentences can be reconstructed entirely from token lists)
if normalize=False, meta_tokens=False, stem=False and remove_stopwords=False. In this setting,
only 1:1 token replacements defined in self.replacements will be applied, which can allow
to replace abbreviations or acronyms.
Other modes start generalizing semantics by removing meaning.
Examples:
Meta-tokens:
10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token.
feb, February, feb., monday will all be replaced by a _DATE_ meta-token.
core.batching.Tokenizer.tokenize_text
¤
tokenize_text(
sentence: str,
language: str | None = None,
n_grams: bool = True,
normalize: bool = True,
meta_tokens: bool = True,
stem: bool = True,
remove_stopwords: bool = True,
) -> list[str]
Split text into normalized word tokens and meta-tokens.
No sentence or paragraph boundary detection is performed.
| PARAMETER | DESCRIPTION |
|---|---|
| sentence | Input text to tokenize. |
| n_grams | Whether to detect and collapse n-grams. Requires a trained n-gram model generated with train_ngrams(). |
Note
The parameters language, normalize, meta_tokens, stem, and remove_stopwords are forwarded to normalize_token() and have the same meaning.
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | List of normalized tokens represented as a bag of words. |
core.batching.Tokenizer.post_filter_tokens
¤
post_filter_tokens(
tokens: list[str],
language: str | None = None,
meta_tokens: bool = True,
stem: bool = False,
normalize: bool = False,
remove_stopwords: bool = False,
) -> list[str]
Apply post-processing operations to an existing token stream.
This method applies token normalization, stemming, stopword removal, and meta-token handling without performing tokenization.
| PARAMETER | DESCRIPTION |
|---|---|
| tokens | List of input tokens to process. |
Note
The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | List of processed tokens. |
core.batching.Tokenizer.tokenize_document_flat
¤
tokenize_document_flat(
document: str,
language: str | None = None,
n_grams: bool = True,
normalize: bool = True,
meta_tokens: bool = True,
stem: bool = True,
remove_stopwords: bool = True,
) -> list[str]
Clean up and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:
- lowercased (optional but recommended) with str.lower(),
- translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
- cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.
Note
The language is detected internally if not provided as an optional argument. When processing a single sentence extracted from a document, instead of the whole document, it is more accurate to run the language detection on the whole document, ahead of calling this method, and pass on the result here.
| PARAMETER | DESCRIPTION |
|---|---|
| document | the text of the document to tokenize |
| n_grams | |
Note
The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.
| RETURNS | DESCRIPTION |
|---|---|
| tokens | a 1D list of normalized tokens and meta-tokens. |
core.batching.Tokenizer.tokenize_document_per_sentence
¤
tokenize_document_per_sentence(
document: str,
language: str | None = None,
n_grams: bool = True,
normalize: bool = True,
meta_tokens: bool = True,
stem: bool = True,
remove_stopwords: bool = True,
) -> list[list[str]]
Clean up and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:
- lowercased (optional but recommended) with str.lower(),
- translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
- cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.
| PARAMETER | DESCRIPTION |
|---|---|
| document | the text of the document to tokenize |
| n_grams | |
Note
The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.
| RETURNS | DESCRIPTION |
|---|---|
| tokens | a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis). |
core.batching.Tokenizer.tokenize_document_per_paragraph
¤
tokenize_document_per_paragraph(
document: str,
language: str | None = None,
n_grams: bool = True,
normalize: bool = True,
meta_tokens: bool = True,
stem: bool = True,
remove_stopwords: bool = True,
) -> list[list[str]]
Clean up and tokenize a whole document as a list of paragraphs, meaning we split it on line breaks before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:
- lowercased (optional but recommended) with str.lower(),
- translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
- cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.
| PARAMETER | DESCRIPTION |
|---|---|
| document | the text of the document to tokenize |
| n_grams | see core.nlp.Tokenizer.tokenize_text |
| others | see core.nlp.Tokenizer.normalize_token arguments |
Note
The language is detected internally if not provided. The text is prefiltered with self.prefilter.
| RETURNS | DESCRIPTION |
|---|---|
| tokens | a 2D list of paragraphs (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis). |
core.batching.Tokenizer.load
classmethod
¤
load(name: str)
Load an existing trained model by its name from the ../models folder.
core.batching.Tokenizer.members_from_ngram
¤
core.batching.Tokenizer.train_ngrams
¤
train_ngrams(
sentences: list[str],
connector_words: str = "",
min_count: int = 10,
threshold: float = 0.7,
scoring: str = "npmi",
)
Train an n-gram model (bigrams and trigrams).
Detects common phrases such as “New York City” and merges them into single tokens using a statistical phrase model.
| PARAMETER | DESCRIPTION |
|---|---|
| sentences | Training corpus. Must be a list of tokenized sentences. |
| connector_words | Space-separated list of connector words allowed inside phrases (e.g. “by” in “piece by piece”). These words are treated as valid bridges when forming n-grams. |
| min_count | Minimum number of occurrences required for a phrase to be considered. See |
| threshold | Phrase detection sensitivity threshold. See |
| scoring | Scoring function used for phrase detection. See |
Warning
N-gram training must be performed on lightly processed tokenized sentences. Do not apply stemming, stopword removal, or punctuation stripping before training.
See Tokenizer.normalize_token() for required preprocessing options.
Note
- Writes an ngrams log file in the models directory containing discovered phrases.
- Can be executed multiple times (e.g. per language); results are appended to the existing model.
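The default `scoring="npmi"` presumably refers to normalized pointwise mutual information, which ranges from -1 (never together) to 1 (always together) and explains why the default `threshold` is a value below 1. The sketch below computes the textbook NPMI of a bigram with the standard library; it is not the training code itself, and the function name and toy corpus are assumptions.

```python
import math
from collections import Counter


def npmi(sentences: list[list[str]], first: str, second: str) -> float:
    """Textbook NPMI of the bigram (first, second) over tokenized sentences."""
    unigrams = Counter(token for sentence in sentences for token in sentence)
    bigrams = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    total = sum(unigrams.values())
    p_first = unigrams[first] / total
    p_second = unigrams[second] / total
    p_pair = bigrams[(first, second)] / total
    if p_pair == 0.0:
        return -1.0  # the pair never occurs: lowest possible score
    return math.log(p_pair / (p_first * p_second)) / -math.log(p_pair)


corpus = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["a", "new", "day"],
]
score = npmi(corpus, "new", "york")
```

On this toy corpus, "new york" scores about 0.73, just above the default threshold of 0.7, because "new" also appears once without "york".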
core.batching.Tokenizer.compile_ngrams
¤
core.batching.Tokenizer.replace_ngrams
¤
Identify n-grams among tokens and collapse them into single tokens. N-grams should have been trained before, with core.nlp.Tokenizer.train_ngrams.
| RETURNS | DESCRIPTION |
|---|---|
| list[str] | the collapsed list of strings, or the original list if no n-grams were found or the n-grams model has not been trained. |
core.batching.Tokenizer.lookup_ngram
¤
Lookup an n-gram in the trie from its token members.
| PARAMETER | DESCRIPTION |
|---|---|
| members | the tokens iterable |
| RETURNS | DESCRIPTION |
|---|---|
| str \| None | the collapsed n-gram if found in the trie, or None if there is no known n-gram. |
Example
lookup_ngram(("new", "york")) -> "new_york"
lookup_ngram(("new", "york", "city")) -> "new_york_city"
lookup_ngram(("foo", "bar")) -> None
core.batching.Data
¤
core.batching.Word2Vec
¤
Word2Vec(
documents: list[list[str]],
name: str = "word2vec",
vector_size: int = 300,
epochs: int = 200,
window: int = 5,
min_count: int = 5,
sample: float = 0.0005,
tokenizer: Tokenizer = None,
compute_idf: bool = False,
**kwargs: dict[str, Any]
)
Bases: gensim.models.Word2Vec
Train, re-train, or load a Word2Vec embedding model.
If a model with the given name already exists, it is automatically
loaded instead of re-trained. Note that in this case, vector_size
will be overridden by the saved model configuration.
| PARAMETER | DESCRIPTION |
|---|---|
| documents | Pre-tokenized training corpus. Structure: the outer list holds documents, the inner list holds tokenized sentences. |
| name | Name of the model file used for saving/loading. |
| vector_size | Dimensionality of word embeddings. |
| epochs | Number of training iterations. Higher values improve quality on small corpora but increase training time. |
| window | Context window size for word co-occurrence. |
| min_count | Minimum frequency threshold for vocabulary filtering. |
| sample | Subsampling rate for frequent words. |
| tokenizer | Tokenizer instance used for preprocessing (if applicable). |
| compute_idf | Whether to compute and store IDF statistics for SIF weighting. Disable to reduce model size when SIF is not used. |
| **kwargs | Additional parameters forwarded directly to |
Attributes¤
core.batching.Word2Vec.tokenizer
instance-attribute
¤
tokenizer = tokenizer if tokenizer is not None else Tokenizer()
Tokenizer used to train the model. We store it so that the same tokenizer is applied when the model is used later.
core.batching.Word2Vec.N_docs
instance-attribute
¤
N_docs = len(documents)
Number of documents in the training corpus
core.batching.Word2Vec.N_sentences
instance-attribute
¤
N_sentences = len(sentences)
Number of sentences in the training corpus
core.batching.Word2Vec.N_words
instance-attribute
¤
N_words = len(words)
Number of words (tokens) in the training corpus
core.batching.Word2Vec.N_terms
instance-attribute
¤
N_terms = len(counts)
Number of terms (unique words) in the training corpus
core.batching.Word2Vec.idf
instance-attribute
¤
Inverse Document Frequency, used only for SIF weighting when enabled.
core.batching.Word2Vec.avg_doc_len
instance-attribute
¤
avg_doc_len: float | None = None
Average number of words in documents of the training corpus, available with IDF stats.
Functions¤
core.batching.Word2Vec.compute_idf
¤
Compute and store IDF statistics from a tokenized document corpus.
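The exact IDF variant used by this method is not stated. As a point of reference, the sketch below computes the plain textbook form, idf(t) = ln(N / df(t)), where N is the number of documents and df(t) the number of documents containing the term; the function name and toy corpus are assumptions.

```python
import math
from collections import Counter


def compute_idf_sketch(documents: list[list[str]]) -> dict[str, float]:
    """Textbook IDF over tokenized documents (ASSUMPTION: unsmoothed form)."""
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for document in documents:
        # A term counts once per document, however often it repeats.
        doc_freq.update(set(document))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


idf = compute_idf_sketch(
    [["lens", "aperture"], ["lens", "sensor"], ["lens", "lens", "iso"]]
)
```

A term present in every document, like "lens" here, gets a weight of 0, while a term found in a single document gets the highest weight.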
core.batching.Word2Vec.update_idf
¤
core.batching.Word2Vec.prune_idf
¤
Prune IDF entries to the actual model vocabulary (remove tokens
that were filtered out by gensim during super().__init__).
core.batching.Word2Vec.load_model
classmethod
¤
load_model(name: str)
Load a trained model saved in the models folder.
core.batching.Word2Vec.get_word
¤
core.batching.Word2Vec.get_wordvec
¤
get_wordvec(
word: str, embed: str = "IN", normalize: bool = True
) -> np.ndarray[np.float32] | None
Return the vector associated with a word, through a dictionary of words.
| PARAMETER | DESCRIPTION |
|---|---|
| word | the word to convert to a vector. |
| embed | |
A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana https://arxiv.org/pdf/1602.01137.pdf
| RETURNS | DESCRIPTION |
|---|---|
| np.ndarray[np.float32] \| None | the nD vector if the word was found in the dictionary, or None. |
core.batching.Word2Vec.get_features
¤
get_features(
tokens: list[str],
embed: str = "IN",
use_sif: bool = False,
sif_smoothing: float = 0.001,
) -> np.ndarray[np.float32]
Calls core.nlp.Word2Vec.get_wordvec over a list of tokens and returns a single vector representing the whole list.
| PARAMETER | DESCRIPTION |
|---|---|
tokens
|
list of text tokens. |
embed
|
TYPE:
|
use_sif
|
Use SIF weighting on each term when embedding a full sentence or document. See core.nlp.Word2Vec.SIF.
TYPE:
|
sif_smoothing
|
The SIF smoothing coefficient.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
np.ndarray[np.float32]
|
the normalized centroid of word embedding vectors associated with the input tokens |
np.ndarray[np.float32]
|
(aka the average vector), or the null vector if no word from the list was found in the dictionary. |
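The normalized centroid described above can be sketched in plain Python. The `vectors` dictionary and the function name `centroid_features` are hypothetical stand-ins for the model's word-vector lookup; the sketch ignores SIF weighting and the IN/OUT embedding choice.

```python
import math


def centroid_features(tokens: list[str], vectors: dict[str, list[float]], dim: int) -> list[float]:
    """Average the vectors of known tokens, then scale the result to unit length."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No token is in the vocabulary: return the null vector.
        return [0.0] * dim
    mean = [sum(column) / len(found) for column in zip(*found)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean
```

Unknown tokens are skipped, so they do not pull the centroid toward the origin.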
core.batching.Word2Vec.SIF
¤
Smooth inverse frequency weighting
Taken from A simple but tough-to-beat baseline for sentence embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx
This helps refine semantics by under-weighting stopwords; however, it is unsuited for File Information Retrieval (search engines) because it over-smooths the embedding space geometry and hinders relevance discrimination with regard to a query.
| PARAMETER | DESCRIPTION |
|---|---|
token
|
the token to weight. It should be in the model vocabulary.
TYPE:
|
Return
The SIF weight associated with the token or 0. if the token was not found in the vocabulary.
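For reference, the weighting in the cited paper is a / (a + p(w)), where p(w) is the estimated unigram probability of the word and a is the smoothing coefficient. A minimal sketch follows, assuming a hypothetical `counts` dictionary of raw term counts; how this class actually derives p(w) from its stored statistics is not shown here.

```python
def sif_weight(token: str, counts: dict[str, int], total_words: int,
               smoothing: float = 0.001) -> float:
    """Smooth inverse frequency weight, or 0.0 for out-of-vocabulary tokens."""
    count = counts.get(token)
    if not count:
        return 0.0
    probability = count / total_words  # estimated unigram probability p(w)
    return smoothing / (smoothing + probability)
```

Frequent words (typically stopwords) get weights close to 0, while rare words get weights closer to 1.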
core.batching.Word2Vec.tokens_to_indices
¤
Convert a list of tokens to a list of their index number in the Word2Vec vocabulary. This yields a more compact, albeit purely symbolic, representation of a tokenized document as a series of integers.
The conversion is reversible and the original token can be found with self.wv.index_to_key[i],
where i is the index number output (for each token) from here.
Return
the list of indices as 32-bit integers, meaning the Word2Vec vocabulary needs to contain fewer than 4.29 billion words.
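The round trip between tokens and indices can be illustrated with a small sketch. `key_to_index` and `index_to_key` mirror the gensim attribute names but are plain Python containers here, and skipping out-of-vocabulary tokens is an assumption of the example.

```python
from array import array


def tokens_to_indices_sketch(tokens: list[str], key_to_index: dict[str, int]) -> array:
    """Map tokens to vocabulary indices, skipping unknown tokens (assumed behaviour)."""
    # Type code "I" is an unsigned int: 32 bits on common platforms.
    return array("I", (key_to_index[t] for t in tokens if t in key_to_index))


index_to_key = ["the", "cat", "sat"]
key_to_index = {key: i for i, key in enumerate(index_to_key)}
indices = tokens_to_indices_sketch(["cat", "sat", "the", "dog"], key_to_index)
restored = [index_to_key[i] for i in indices]
```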
core.batching.Classifier
¤
Classifier(
training_set: list[Data],
name: str,
word2vec: Word2Vec,
validate: bool = True,
variant: str = "svm",
)
Bases: nltk.classify.SklearnClassifier
Initialize a Word2Vec + SVM classification pipeline.
This class wraps a Word2Vec embedding model with a downstream machine-learning classifier (SVM or alternatives).
| PARAMETER | DESCRIPTION |
|---|---|
training_set
|
List of If empty, the system will attempt to load a pre-trained model
using |
name
|
Identifier used to save and reload the trained model.
TYPE:
|
word2vec
|
Word embedding model used to generate feature vectors.
TYPE:
|
validate
|
If True, splits the dataset into training (95%) and testing (5%) subsets and prints evaluation metrics. Useful for classifier selection and sanity checking.
TYPE:
|
variant
|
Type of classifier to use:
TYPE:
|
Note
The previous documentation mentioned path and features,
but these are not part of the current signature and were removed.
Functions¤
core.batching.Classifier.get_features_parallel
¤
Thread-safe call to .get_features() to be called in multiprocessing.Pool map
core.batching.Classifier.load
classmethod
¤
load(name: str)
Load an existing trained model by its name from the ../models folder.
core.batching.Classifier.classify
¤
Apply a label on a post based on the trained model.
core.batching.StemTokenIndex
¤
StemTokenIndex(db: sqlite3.Connection, tokenizer: Tokenizer)
core.batching.SQLitePageCorpus
¤
SQLitePageCorpus(
db,
query,
params=(),
atomic_types=(str, bytes),
max_depth=None,
yield_rows=False,
)
Lazily stream rows from an SQLite request, avoiding full copy.
Example
corpus = SQLitePageCorpus(
db,
"""
SELECT tokenized
FROM pages
WHERE lang IN ('fr', 'en')
""",
max_depth=0
)
- max_depth=0 will not flatten the content, so it will return
the original list[list[str]] (list of sentences, aka list of lists of words),
- max_depth=1 flattens documents, so it will return
list[str] (list of words)
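The idea of a lazy, restartable corpus can be sketched as follows. `LazyCorpus` and `flatten_levels` are hypothetical names, and the sketch assumes the tokenized column holds JSON text, which may differ from the serialization used by the real class.

```python
import json
import sqlite3


def flatten_levels(value: list, levels: int, atomic_types=(str, bytes)) -> list:
    """Merge `levels` levels of nested lists into the top-level list."""
    for _ in range(levels):
        merged = []
        for item in value:
            if isinstance(item, atomic_types) or not isinstance(item, list):
                merged.append(item)
            else:
                merged.extend(item)
        value = merged
    return value


class LazyCorpus:
    """Re-runs the query on each iteration and yields one decoded row at a time."""

    def __init__(self, db: sqlite3.Connection, query: str, params=(), max_depth: int = 0):
        self.db, self.query, self.params, self.max_depth = db, query, params, max_depth

    def __iter__(self):
        # Iterating over the cursor fetches rows incrementally.
        for (payload,) in self.db.execute(self.query, self.params):
            yield flatten_levels(json.loads(payload), self.max_depth)
```

Because `__iter__` re-runs the query, the corpus can be traversed several times, which multi-pass training loops typically require.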
core.batching.Deduplicator
¤
Deduplicator(
threshold: float = 0.9,
distance: int = 50,
discard_params: bool = True,
n_min: int = 0,
fix_urls: bool = True,
)
Instantiate a deduplicator object.
The duplicates factorizing takes a list of core.types.web_page.
Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.
You can edit (append to or replace) the list of URLs to ignore, core.deduplicator.Deduplicator.urls_to_ignore, before running the actual process.
Optionally, near-duplicates are detected too by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.
| PARAMETER | DESCRIPTION |
|---|---|
threshold
|
the minimum Levenshtein distance ratio between the contents of 2 pages for those pages to be considered near-duplicates and be factorized. If set to 1.0, the near-duplicates detection is bypassed, which results in a huge speed-up.
TYPE:
|
distance
|
the near-duplicates search is performed on the nearest elements after the core.types.web_page list has been ordered alphabetically by URL, for performance, assuming near-duplicates will most likely be found on the same domain and at a resembling path. The distance parameter defines how many elements ahead we will look into.
TYPE:
|
discard_params
|
on modern CMS that enable “pretty URLs” (URL rewriting), pages will be indexed
by a
TYPE:
|
n_min
|
domains that have a number of indexed pages below this threshold will be discarded entirely. This avoids indexing random dude’s website, under the assumption that relevant and reliable domains will have several pages indexed.
TYPE:
|
fix_urls
|
attempt to convert
TYPE:
|
Attributes¤
core.batching.Deduplicator.urls_to_ignore
class-attribute
instance-attribute
¤
urls_to_ignore: list[str] = [
"/tag/",
"/tags/",
"/category/",
"/categories/",
"/author/",
"/authors/",
"/profil/",
"/profiles/",
"/user/",
"/users/",
"/login/",
"/signup/",
"/member/",
"/members/",
"/cart/",
"/shop/",
"/register",
]
URL substrings to find in URLs and remove matching web pages: mostly WordPress archive pages, user profiles and login pages.
Functions¤
core.batching.Deduplicator.prepare_posts_parallel
classmethod
¤
Canonicalize a :class:~core.types.web_page dict for the list path.
Delegates URL normalization to :meth:_canonicalize_url and adds
list-path-specific fallbacks for length and datetime (which are
guaranteed to be pre-computed on the DB path by batch_parse_web_page
but may be absent on hand-assembled lists).
Returns the mutated elem dict, or None if the URL must be
discarded.
core.batching.Deduplicator.get_unique_urls
¤
Pick the most recent, or otherwise the longer, candidate for each canonical URL.
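The selection rule can be sketched with a single pass over the candidates. The function name and the `datetime` and `length` keys are assumptions for the example, and ISO-8601 strings in a uniform format are assumed so that string comparison follows chronological order.

```python
def pick_unique_urls(pages: list[dict]) -> list[dict]:
    """Keep, for each URL, the most recent page, then the longest one on ties."""
    best: dict[str, tuple[tuple, dict]] = {}
    for page in pages:
        # A missing datetime sorts as the oldest, a missing length as the shortest.
        key = (page.get("datetime") or "", page.get("length") or 0)
        current = best.get(page["url"])
        if current is None or key > current[0]:
            best[page["url"]] = (key, page)
    return [page for _, page in best.values()]
```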
core.batching.Deduplicator.run_on_db
¤
run_on_db(db: sqlite3.Connection, chunksize: int = 4096) -> None
Deduplicate the pages table in-place, matching the full __call__ pipeline.
The method runs four sequential phases that mirror __call__:
- URL canonicalization – stream every row through :meth:prepare_posts_parallel (threaded, I/O-bound), normalise URLs, compute a SHA-1 content hash, and write results to the temporary _prepared table.
- URL deduplication – for each canonical URL keep the single best row using SQL window functions with :attr:_ELECTION_ORDER.
- Exact-content deduplication – among URL winners, collapse rows that share the same SHA-1 hash using the same election order.
- Near-duplicate removal (skipped when threshold == 1.0) – load the survivors into memory, run the Levenshtein window scan with parallelised comparisons (threaded; python-Levenshtein releases the GIL), write the final winner set back to a temp table.
The pages table is atomically replaced by the winner set at the end.
All intermediate _prepared / _url_winners / _content_winners /
_near_winners temp tables are cleaned up on success.
Assumptions:
- pages has at least the columns: url, title, content, date,
datetime, parsed, category.
- datetime values, when present, are ISO-8601 strings (SQLite TEXT).
NULL is treated as “oldest possible” in the election.
- The external category label means the page was crawled by following
external links and contains the full <body>; any other category means
it was crawled from a sitemap / REST-API and contains cleaner markup.
Non-external therefore wins over external in the election.
| PARAMETER | DESCRIPTION |
|---|---|
db
|
Open
TYPE:
|
chunksize
|
Number of rows fetched per batch during Phase 1.
TYPE:
|
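The URL deduplication phase can be illustrated with a window function. This is a sketch under assumptions: the column names of `_prepared` are simplified, and the ORDER BY clause is a guess at what the election order might look like (non-external, then newer, then longer, then shorter URL), not the actual `_ELECTION_ORDER`.

```python
import sqlite3


def elect_url_winners(db: sqlite3.Connection) -> list[tuple]:
    """Return (canonical_url, url) of the single best row per canonical URL."""
    return db.execute(
        """
        SELECT canonical_url, url FROM (
            SELECT canonical_url, url,
                   ROW_NUMBER() OVER (
                       PARTITION BY canonical_url
                       ORDER BY (COALESCE(category, '') = 'external') ASC,
                                datetime DESC,   -- NULL sorts last, i.e. oldest
                                length DESC,
                                LENGTH(url) ASC
                   ) AS rn
            FROM _prepared
        )
        WHERE rn = 1
        ORDER BY canonical_url
        """
    ).fetchall()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TEMP TABLE _prepared "
    "(canonical_url TEXT, url TEXT, category TEXT, datetime TEXT, length INTEGER)"
)
db.executemany(
    "INSERT INTO _prepared VALUES (?, ?, ?, ?, ?)",
    [
        ("https://a.com/p", "https://a.com/p?x=1", "external", "2024-05-01", 500),
        ("https://a.com/p", "https://a.com/p", "sitemap", "2023-01-01", 300),
        ("https://a.com/q", "https://a.com/q", "sitemap", None, 100),
        ("https://a.com/q", "https://a.com/q?ref=x", "sitemap", "2022-01-01", 50),
    ],
)
winners = elect_url_winners(db)
```

In this example the sitemap row wins over the newer external row, and a dated row wins over an undated one.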
core.batching.Deduplicator.add_content_hash_column
staticmethod
¤
add_content_hash_column(db: sqlite3.Connection) -> None
Add (or refresh) a content_hash column on the pages table.
Computes a SHA-1 digest of each row’s parsed field and stores it in
content_hash. The column is created if it does not yet exist. Rows
with a NULL parsed value are skipped and left with a NULL hash.
A covering index idx_pages_content_hash is created (or left in place)
after the update so that subsequent deduplication queries are cheap.
This method is a standalone maintenance utility. The deduplication
pipeline (:meth:run_on_db) computes hashes inline during Phase 1 and
does not require this method to be called first.
Assumption: parsed values fit in memory individually (they are fetched
one batch at a time, not all at once).
| PARAMETER | DESCRIPTION |
|---|---|
db
|
Open
TYPE:
|
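A possible shape for such a maintenance routine is sketched below. The function name `add_content_hash` is hypothetical, paging by `rowid` assumes an ordinary table with positive rowids, and `parsed` is assumed to be stored as text.

```python
import hashlib
import sqlite3


def add_content_hash(db: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Add or refresh a SHA-1 content_hash column on pages, one batch at a time."""
    columns = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    if "content_hash" not in columns:
        db.execute("ALTER TABLE pages ADD COLUMN content_hash TEXT")
    last_rowid = 0
    while True:
        # Keyset pagination: no read cursor stays open while rows are updated.
        batch = db.execute(
            "SELECT rowid, parsed FROM pages WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, batch_size),
        ).fetchall()
        if not batch:
            break
        db.executemany(
            "UPDATE pages SET content_hash = ? WHERE rowid = ?",
            [
                (hashlib.sha1(parsed.encode("utf-8")).hexdigest(), rowid)
                for rowid, parsed in batch
                if parsed is not None  # NULL parsed rows keep a NULL hash
            ],
        )
        last_rowid = batch[-1][0]
    db.execute("CREATE INDEX IF NOT EXISTS idx_pages_content_hash ON pages(content_hash)")
    db.commit()
```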
core.batching.Deduplicator.get_unique_content
¤
Pick the most recent candidate for each canonical content.
Return
canonical content: web_page dictionary
core.batching.Deduplicator.get_close_content
¤
get_close_content(
posts: list[web_page], threshold: float = 0.9, distance: int = 50
) -> list[web_page]
Find and remove near-duplicates using the Levenshtein ratio.
Delegates the actual scan to :meth:_close_content_scan, which
parallelises comparisons within each window via a
:class:~concurrent.futures.ThreadPoolExecutor. This method is the
list-path counterpart to :meth:_elect_near_duplicates; both call the
same shared scan implementation.
The election among near-duplicate candidates honours the same priority
rules as URL and content deduplication (non-external > newer > longer >
shorter URL) via :meth:_elect_group.
| PARAMETER | DESCRIPTION |
|---|---|
posts
|
List of :class: |
threshold
|
Minimum Levenshtein ratio for two pages to be considered
near-duplicates. Defaults to :attr:
TYPE:
|
distance
|
Positions ahead to scan from each row after sorting by URL.
Defaults to :attr:
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
Filtered list with near-duplicates removed; one survivor per group. |
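The windowed scan can be sketched as follows. Two simplifications are assumed: `difflib.SequenceMatcher.ratio()` stands in for the Levenshtein ratio (the two metrics are similar in spirit but not identical), and the first page of each group survives instead of going through the election rules. The comparisons are also sequential here.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(posts: list[dict], threshold: float = 0.9,
                         distance: int = 50) -> list[dict]:
    """Compare each page with the next `distance` pages after sorting by URL."""
    posts = sorted(posts, key=lambda post: post["url"])
    removed: set[int] = set()
    for i, post in enumerate(posts):
        if i in removed:
            continue
        for j in range(i + 1, min(i + 1 + distance, len(posts))):
            if j in removed:
                continue
            matcher = SequenceMatcher(None, post["parsed"], posts[j]["parsed"], autojunk=False)
            if matcher.ratio() >= threshold:
                removed.add(j)  # simplification: the earlier page always survives
    return [post for k, post in enumerate(posts) if k not in removed]
```

Bounding the scan by `distance` keeps the cost roughly linear in the number of pages rather than quadratic.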
core.batching.Deduplicator.run_on_list
¤
Deduplicate an in-memory list of web pages, matching the full pipeline.
This is the list-based counterpart to :meth:run_on_db. The two methods
are kept symmetrical: both run the same four phases (URL canonicalization,
exact-URL deduplication, exact-content deduplication, optional
near-duplicate removal) and honour the same election rules.
Note
posts is consumed and partially destroyed during processing to
avoid keeping two copies in memory simultaneously.
| PARAMETER | DESCRIPTION |
|---|---|
posts
|
Flat list of :class: |
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
Deduplicated list of sanitised :class: |
list[web_page]
|
ready for downstream use. Also writes a |
list[web_page]
|
Functions¤
core.batching.parse_lang_to_iso639_1
¤
Normalize language identifier to ISO 639-1.
core.batching.guess_language
¤
guess_language(
string: str,
stopwords_threshold: float = 0.05,
letters_threshold: float = 0.8,
) -> str | None
Basic language guesser based on stopwords detection.
Stopwords are the most common words of a language: for each language, we count how many stopwords we found and return the language having the most matches. It is accurate for paragraphs and long documents, not so much for short sentences.
| PARAMETER | DESCRIPTION |
|---|---|
string
|
the string to analyze. Needs to be lowercased but to retain accents and diacritics.
TYPE:
|
stopwords_threshold
|
the minimum ratio of stopwords divided by total words in strings to be found to conclude on a language. For example, Japanese companies often have technical reports written in Japanese but still containing some English. If less than 5% of the words are known English stopwords, we could conclude it’s not English.
TYPE:
|
letters_threshold
|
the minimum ratio of roman (latin) characters among all characters (including numbers, symbols and non-latin alphabets) to be found to conclude on a language.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str | None
|
ISO 639-1 language code. Defaults to “en” if nothing found. |
core.batching.detect_language
¤
Detect language from arbitrary text safely.
| RETURNS | DESCRIPTION |
|---|---|
str | None
|
ISO 639-1 language code. |
core.batching.tokenize_document_to_words
¤
core.batching.split_document_to_sentences
¤
core.batching.tokenize_document_to_sentences
¤
core.batching.split_url
¤
Split a well-formed URL following RFC3986 into base elements.
| RETURNS | DESCRIPTION |
|---|---|
tuple[str, str, str, str, str] | None
|
a tuple of |
tuple[str, str, str, str, str] | None
|
Empty/missing fields are initialized with empty strings so there is no need for individual |
tuple[str, str, str, str, str] | None
|
If the |
core.batching.adapt_array
¤
http://stackoverflow.com/a/31312102/190597 (SoulNibbler)
core.batching.create_db
¤
create_db(name: str) -> sqlite3.Connection
Create the pages table if needed and add any missing columns.
This doesn’t destroy existing tables, rows or columns, so it’s safe
to run on any database.
Warning
Columns are inferred directly from web_page.__annotations__.
Existing columns are preserved unchanged.
The url column is used as the PRIMARY KEY.
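The non-destructive schema update can be sketched like this. `Page` is a hypothetical stand-in for core.types.web_page, and the Python-to-SQLite type mapping is an assumption of the example.

```python
import sqlite3
from typing import TypedDict


class Page(TypedDict):  # hypothetical stand-in for core.types.web_page
    url: str
    title: str
    length: int


SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL", bytes: "BLOB"}


def ensure_pages_table(db: sqlite3.Connection, schema=Page) -> None:
    """Create pages if needed, then add any column missing from the table."""
    columns = {name: SQL_TYPES.get(tp, "BLOB") for name, tp in schema.__annotations__.items()}
    definitions = ", ".join(
        f'"{name}" {sql_type}' + (" PRIMARY KEY" if name == "url" else "")
        for name, sql_type in columns.items()
    )
    db.execute(f"CREATE TABLE IF NOT EXISTS pages ({definitions})")
    existing = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    for name, sql_type in columns.items():
        if name not in existing:
            # ALTER TABLE ... ADD COLUMN keeps existing rows; the new column is NULL.
            db.execute(f'ALTER TABLE pages ADD COLUMN "{name}" {sql_type}')
    db.commit()
```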
core.batching.create_temp_db
¤
create_temp_db(
min_free: float = 2.0, filename: str | None = None
) -> sqlite3.Connection
Create a temporary SQLite database file (in /dev/shm when available) and
initialize the pages table according to web_page annotations.
| PARAMETER | DESCRIPTION |
|---|---|
min_free
|
minimum available disk space in GiB required to create the temporary database. This is checked at runtime and the function will raise an error if the condition is not met.
TYPE:
|
filename
|
the full path and filename to save the temporary database, if it needs to be reused at some point.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
sqlite3.Connection
|
the sqlite3.Connection opened in bulk mode. |
WARNING
the temporary SQLite database doesn’t use web_page URL as primary key, to allow
later deduplication.
core.batching.delete_temp_db
¤
delete_temp_db(db: sqlite3.Connection)
Close and delete a temporary database in one shot.
core.batching.open_db
¤
open_db(name: str, mode: str = 'rw') -> sqlite3.Connection
Open an SQLite database with workload-specific optimizations.
| PARAMETER | DESCRIPTION |
|---|---|
name
|
Database identifier/path passed to
TYPE:
|
mode
|
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
sqlite3.Connection
|
sqlite3.Connection |
core.batching.compress_db
¤
compress_db(
db: sqlite3.Connection,
delete_query: str | None = None,
delete_params: tuple | None = None,
delete_columns: list[str] | None = None,
)
Optionally delete rows, then reclaim SQLite disk space.
| PARAMETER | DESCRIPTION |
|---|---|
db
|
SQLite connection
TYPE:
|
delete_query
|
full DELETE SQL query
TYPE:
|
delete_params
|
optional SQL parameters
TYPE:
|
core.batching.is_primary_key
¤
is_primary_key(db: sqlite3.Connection, table: str, column: str) -> bool
Check whether column is part of the PRIMARY KEY of table.
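One way to perform this check, sketched with `PRAGMA table_info`, whose sixth field is non-zero for columns belonging to the primary key. The function name is hypothetical and the table name is assumed to be trusted, since it is interpolated into the statement.

```python
import sqlite3


def is_primary_key_sketch(db: sqlite3.Connection, table: str, column: str) -> bool:
    """True when `column` is part of the PRIMARY KEY of `table`."""
    # Rows are (cid, name, type, notnull, dflt_value, pk).
    rows = db.execute(f'PRAGMA table_info("{table}")').fetchall()
    return any(row[1] == column and row[5] > 0 for row in rows)
```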
core.batching.populate_db
¤
populate_db(
db: sqlite3.Connection, pages: list[web_page], batch_size: int = 4096
)
Insert or update web_page records into the SQLite database.
Existing rows are matched using the PRIMARY KEY url.
Warning
Array-like Python values are converted to bytearray
then to bytes in order to be handled as BLOB
by SQLite.
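The insert-or-update behaviour can be sketched with SQLite's UPSERT clause. The function name is hypothetical, the sketch assumes the table has at least one column besides url, and it leaves out the array-to-BLOB conversion mentioned in the warning.

```python
import sqlite3


def upsert_pages(db: sqlite3.Connection, pages: list[dict], batch_size: int = 4096) -> None:
    """Insert pages, updating every non-key column when the url already exists."""
    columns = [row[1] for row in db.execute("PRAGMA table_info(pages)")]
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    updates = ", ".join(f'"{c}" = excluded."{c}"' for c in columns if c != "url")
    sql = (
        f"INSERT INTO pages ({column_list}) VALUES ({placeholders}) "
        f"ON CONFLICT(url) DO UPDATE SET {updates}"
    )
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        # Keys missing from a page dictionary are stored as NULL.
        db.executemany(sql, [tuple(page.get(c) for c in columns) for page in batch])
    db.commit()
```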
core.batching.db_to_list
¤
db_to_list(db: sqlite3.Connection) -> list[web_page]
Extract all web_page rows from the pages table in db as a list of web_page
core.batching.migrate_url_to_primary_key
¤
migrate_url_to_primary_key(db: sqlite3.Connection)
Rebuild the pages table using url as PRIMARY KEY
for older databases that didn’t use a primary key.
core.batching.merge_databases
¤
merge_databases(old_db: sqlite3.Connection, new_db: sqlite3.Connection)
Merge two pages databases.
Rows from old_db are inserted into new_db
only if their URL does not already exist.
Existing rows in new_db are preserved unchanged.
Only columns existing in BOTH databases are copied.
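The merge rules above can be sketched with `INSERT OR IGNORE` restricted to the shared columns. The function name is hypothetical, and the sketch assumes url is the PRIMARY KEY (or at least UNIQUE) in new_db, since that is what makes the conflicting rows be skipped.

```python
import sqlite3


def merge_pages(old_db: sqlite3.Connection, new_db: sqlite3.Connection,
                batch_size: int = 1000) -> None:
    """Copy rows from old_db.pages into new_db.pages unless the url already exists."""
    old_columns = [row[1] for row in old_db.execute("PRAGMA table_info(pages)")]
    new_columns = {row[1] for row in new_db.execute("PRAGMA table_info(pages)")}
    shared = [c for c in old_columns if c in new_columns]
    column_list = ", ".join(f'"{c}"' for c in shared)
    placeholders = ", ".join("?" for _ in shared)
    insert = f"INSERT OR IGNORE INTO pages ({column_list}) VALUES ({placeholders})"
    cursor = old_db.execute(f"SELECT {column_list} FROM pages")
    # Stream the old rows in batches instead of loading them all at once.
    while batch := cursor.fetchmany(batch_size):
        new_db.executemany(insert, batch)
    new_db.commit()
```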
core.batching.update_pages_from_database
¤
update_pages_from_database(
target_db: sqlite3.Connection, source_db: sqlite3.Connection
) -> list[str]
Update rows in target_db.pages from source_db.pages
using url as PRIMARY KEY.
Only shared columns are updated.
Returns missing_urls: URLs present in target_db but absent from source_db.
core.batching.import_pages
¤
import_pages(
source_db: str | sqlite3.Connection,
destination_db: str | sqlite3.Connection,
where_clause: str = "1=1",
params: tuple = (),
) -> int
Import rows from one SQLite database into another.
Both source_db and destination_db may be either a filesystem
path (str) or an active sqlite3.Connection handle. Passing a
Connection is the only way to target a :memory: database, since
those cannot be addressed by path.
Connection lifecycle:
- Path supplied – the function opens, commits, and closes the connection itself (original behaviour).
- Connection supplied – the caller retains full control; the connection is neither committed nor closed here, so the import can participate in a larger transaction.
Rows are copied from source.pages into destination.pages.
Existing rows are updated on conflict of the url primary key.
Columns present in the destination but absent from the source receive
NULL. Both schemas are discovered at runtime, so the function adapts
automatically if either evolves.
| PARAMETER | DESCRIPTION |
|---|---|
source_db
|
Path to, or an open connection for, the source SQLite database.
TYPE:
|
destination_db
|
Path to, or an open connection for, the destination SQLite database.
TYPE:
|
where_clause
|
SQL WHERE clause applied to
TYPE:
|
params
|
Positional parameters bound to where_clause.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
int
|
Number of affected rows. |
Examples::
# File → file (unchanged from before)
import_pages("old.db", "new.db", "domain = ?", ("example.com",))
# In-memory source → file destination
import_pages(mem_conn, "new.db")
# File source → in-memory destination (e.g. for tests)
import_pages("prod.db", mem_conn, "date >= ?", ("2024-01-01",))
# Both in-memory
import_pages(src_conn, dst_conn)
core.batching.inspect_db
¤
inspect_db(db: sqlite3.Connection, message: str = '') -> None
Print useful metadata and statistics about a SQLite database.
| PARAMETER | DESCRIPTION |
|---|---|
db
|
active database connection
TYPE:
|
message
|
optional additional message to identify several inspections if any.
TYPE:
|
core.batching.sanitize_web_page
¤
Ensure existence and validity of web_page keys/values.
core.batching.batch_guess_dates
¤
batch_guess_dates(db: sqlite3.Connection, chunksize: int = 2048)
High-throughput parallel datetime parsing.
core.batching.batch_parse_web_page
¤
batch_parse_web_page(
documents: sqlite3.Connection,
tokenizer: Tokenizer,
chunksize: int = 512,
cores: int | None = None,
)
High-performance parallel parsing for core.types.web_page objects
This function is meant to clean up text encoding issues and multi-spacings in web_page title and content.
It prepares the web_page["parsed"] field from title and content for the next stages of tokenization,
and updates language (using declared ISO code or machine-learned detection).
It must be called before core.deduplicator.Deduplicator, so that content deduplication has a clean parsed version to compare web pages.
| PARAMETER | DESCRIPTION |
|---|---|
documents
|
any database having core.types.web_page rows stored in a
TYPE:
|
tokenizer
|
we only use it for the core.nlp.Tokenizer.normalize_text method
TYPE:
|
chunksize
|
number of SQLite rows to process at once, too many is not helpful since some batches may take longer than others, depending on text length.
TYPE:
|
cores
|
CPU cores to use for parallel processing.
TYPE:
|
core.batching.batch_tokenize
¤
batch_tokenize(
db: sqlite3.Connection,
tokenizer: Tokenizer,
chunksize: int = 512,
urls: list[str] | None = None,
only_none: bool = True,
)
Tokenize a list of web_pages in a non-destructive way, in parallel, in a RAM-friendly way, directly in database.
Populate the tokenized database column from the parsed column. This needs to run after
core.batching.batch_parse_web_page and prepares n-gram training if any, or stemming.
Note
The tokenization is forced non-destructive and doesn’t apply stemming, stopwords removal, normalization, or n-grams. Original sentences can be reconstructed from joining back the list of tokens.
| PARAMETER | DESCRIPTION |
|---|---|
urls
|
list of URLs to tokenize. If None, the whole database is processed. |
only_none
|
tokenize only the new entries that have not been tokenized already. If
TYPE:
|
core.batching.batch_stem
¤
batch_stem(
db: sqlite3.Connection,
tokenizer: Tokenizer,
chunksize: int = 512,
urls: list[str] | None = None,
only_none: bool = True,
)
Tokenize and stem a list of web_pages in parallel, in a RAM-friendly way, directly in database.
Populate the stemmed database column from the tokenized column. This needs to run after
core.batching.batch_tokenize. The tokenization is destructive and applies stemming,
stopwords removal, normalization and n-grams if available.
| PARAMETER | DESCRIPTION |
|---|---|
urls
|
list of URLs to tokenize. If None, the whole database is processed. |
only_none
|
stem only the new entries that have not been stemmed already. If
TYPE:
|
core.batching.batch_vectorize
¤
batch_vectorize(
db: sqlite3.Connection, word2vec: Word2Vec, chunksize: int = 256
)
Vectorize a column of the db database using the provided word2vec model
using all available cores.
Works on the tokenized column of the database and writes the vectorized column.
Vectors are normalized as per nlp.Word2Vec.get_features() output.