core.nlp¤

High-level natural language processing module for message-like (emails, comments, posts) input.

Supports automatic language detection, word tokenization and stemming for 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'spanish', 'swedish'.

Attributes¤

core.nlp.regex_starter `module-attribute` ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;|\\>)'

Start of line, or start of document, or start of markup

core.nlp.regex_stopper `module-attribute` ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|\\<)'

End of line, or end of document, or end of markup

core.nlp.end_of_word `module-attribute` ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

End of word, or end of line, or end of document, or end of markup

core.nlp.regex_algebra `module-attribute` ¤

regex_algebra = '[\\+\\-\\=\\≠\\±]'

Algebraic signs

core.nlp.IP_PATTERN `module-attribute` ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

core.nlp.EMAIL_PATTERN `module-attribute` ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
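As a minimal sketch of how the first capture group can be used, the pattern above is copied verbatim below and applied to a made-up header line. The helper name `extract_emails` is hypothetical and not part of core.nlp.

```python
import re

# Pattern copied verbatim from the EMAIL_PATTERN definition above.
EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)


def extract_emails(text: str) -> list[str]:
    """Return every address found in `text`, without the optional angle brackets."""
    return [match.group(1) for match in EMAIL_PATTERN.finditer(text)]


emails = extract_emails("From: Jane <jane.doe@mail.com>, cc bob@example.org")
print(emails)
```

The optional `<` and `>` sit outside the first group, so both notations yield the bare address.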

core.nlp.URL_PATTERN `module-attribute` ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URLs must follow RFC 3986, meaning query parameters come before anchors, if any. Relying on this assumption allows faster regex parsing.

the protocol (ftp, ftps, http, https) is captured as the first group,
domain.ext is captured as the second group,
/page/etc is the third group, including leading and trailing /,
page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.

URLs are captured if they are:

alone on their own line,
enclosed in {}, [], ()
enclosed in whitespaces.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name since () and [] are valid in URL paths and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON must be parsed beforehand.

core.nlp.MEMBERS_PATTERN `module-attribute` ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.

core.nlp.DATE_PATTERN `module-attribute` ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22, where the whole date is captured in the first group, then each group of digits is captured, in order of appearance, in the next 3 groups.

core.nlp.TIME_PATTERN `module-attribute` ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, like:

12h15
12:15
12:15:00
12am
12 am
12 h
12:15:00Z
12:15:00+01
12:15:00 UTC+1
11:27:45+0000

RETURNS	DESCRIPTION
`0`	1- or 2-digit hour. TYPE: `str`
`1`	hour/minutes separator or half-day marker among `["h", ":", "am", "pm"]` (case-insensitive). TYPE: `str`
`2`	2-digit minutes, if any, or `None`. TYPE: `str`
`3`	2-digit seconds, if any. TYPE: `str`
`4`	hour marker (`h` or `H`), half-day marker (case-insensitive `["am", "pm"]`), or time zone marker (case-sensitive `["Z", "UTC"]`). TYPE: `str`
`5`	1- or 2-digit signed integer time zone shift (relative to UTC). TYPE: `str`

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

core.nlp.DOMAIN_PATTERN `module-attribute` ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from (domain.ext) from RFC-822 Received header in emails.

core.nlp.UID_PATTERN `module-attribute` ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

core.nlp.FLAGS_PATTERN `module-attribute` ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
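The three header patterns above can be exercised together. The sketch below copies them verbatim and runs them on made-up sample strings shaped like an IMAP FETCH response and a Received header; the sample data and variable names are assumptions for illustration only.

```python
import re

# Patterns copied verbatim from the definitions above.
DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)
UID_PATTERN = re.compile('UID ([0-9]+)')
FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

# Hypothetical sample data, not taken from a real server.
fetch_line = "1 (UID 4827 FLAGS (\\Seen \\Answered))"
received = "Received: from mail.example.org (mail.example.org [192.0.2.1])"

uid = int(UID_PATTERN.search(fetch_line).group(1))
flags = FLAGS_PATTERN.search(fetch_line).group(1).split()
domain = DOMAIN_PATTERN.search(received).group(1)

print(uid, flags, domain)
```

Each pattern exposes its payload in group 1, so the surrounding keyword (`UID`, `FLAGS`, `from`) is never part of the extracted value.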

core.nlp.PATH_PATTERN `module-attribute` ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

core.nlp.PARTIAL_PATH_REGEX `module-attribute` ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

This will collide with date detection, so run it after date detection in the pipeline.

core.nlp.RESOLUTION_PATTERN `module-attribute` ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.

core.nlp.NUMBER_PATTERN `module-attribute` ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with leading or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

core.nlp.HASH_PATTERN `module-attribute` ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.

core.nlp.MULTIPLE_LINES `module-attribute` ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect sequences of 2 or more newlines and tabs, possibly mixed with spaces.

core.nlp.MULTIPLE_NEWLINES `module-attribute` ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.

core.nlp.INTERNAL_NEWLINE `module-attribute` ¤

INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')

Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal (a newline is used instead of a space).
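A minimal sketch of how these whitespace patterns might be combined to unwrap PDF text. The helper `unwrap_pdf_text`, the replacement strings and the order of the two substitutions are assumptions, not the module's documented behavior.

```python
import re

# Patterns copied verbatim from the definitions above.
MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')
INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')


def unwrap_pdf_text(text: str) -> str:
    # A lone line break between two word characters is treated as a soft wrap.
    text = INTERNAL_NEWLINE.sub(" ", text)
    # Longer runs of line breaks and tabs are collapsed into one paragraph break.
    return MULTIPLE_LINES.sub("\n\n", text)


cleaned = unwrap_pdf_text("line wrapping\nis quite literal\n\n\n\nnext paragraph")
print(repr(cleaned))
```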

core.nlp.EXPOSURE `module-attribute` ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

core.nlp.PHOTOSPEED `module-attribute` ¤

PHOTOSPEED = re.compile(
    "%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
    flags=re.IGNORECASE,
)

Exposure values in EV or IL

core.nlp.SENSIBILITY `module-attribute` ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensibility in ISO or ASA

core.nlp.LUMINANCE `module-attribute` ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or Cd/m²

core.nlp.DIAPHRAGM `module-attribute` ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

core.nlp.GAIN `module-attribute` ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

core.nlp.FILE_SIZE `module-attribute` ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

core.nlp.DISTANCE `module-attribute` ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

core.nlp.PERCENT `module-attribute` ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

core.nlp.WEIGHT `module-attribute` ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

core.nlp.ANGLE `module-attribute` ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

core.nlp.TEMPERATURE `module-attribute` ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

core.nlp.FREQUENCY `module-attribute` ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

core.nlp.TEXT_DATES `module-attribute` ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual dates formats:

English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.

RETURNS	DESCRIPTION
`0`	2 digits (day number or year number, depending on language) TYPE: `str`
`1`	month (full-form or abbreviated) TYPE: `str`
`2`	2 digits (day number or year number, depending on language) TYPE: `str`
`3`	4 digits (full year) TYPE: `str`
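The sketch below copies the pattern verbatim (split over several string literals for readability) and shows how the groups come out on two made-up inputs. Note that the numeric groups keep their leading space, so callers may want to strip them.

```python
import re

# Same pattern as TEXT_DATES above, split into adjacent string literals.
TEXT_DATES = re.compile(
    "([0-9]{1,2})? ("
    "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|"
    "jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|"
    "janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|"
    "january|february|march|april|may|june|july|august|september|october|november|december"
    ")\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

full = TEXT_DATES.search("Sent on 01 Jan. 2020 at 12:08")
day, month, middle, year = full.groups()
print(day, month, middle, year.strip())

# The negative lookahead keeps the hour of an adjacent time out of the match.
short = TEXT_DATES.search("01 Jan 20 12:08")
print(short.group(0))
```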

core.nlp.BASE_64 `module-attribute` ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64 encoding

core.nlp.BB_CODE `module-attribute` ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]

core.nlp.MARKUP `module-attribute` ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]

core.nlp.USER `module-attribute` ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails

core.nlp.REPEATED_CHARACTERS `module-attribute` ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character repeated more than 9 times

core.nlp.UNFINISHED_SENTENCES `module-attribute` ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences ending with 2 newline characters without any closing punctuation

core.nlp.MULTIPLE_DOTS `module-attribute` ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies dots repeated twice or more

core.nlp.MULTIPLE_DASHES `module-attribute` ¤

MULTIPLE_DASHES = re.compile('[-~]{1,}')

Identifies runs of one or more dashes or tildes

core.nlp.MULTIPLE_QUESTIONS `module-attribute` ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of one or more question marks

core.nlp.ORDINAL_FR `module-attribute` ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)

core.nlp.FRANCAIS `module-attribute` ¤

FRANCAIS = re.compile(
    "%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

core.nlp.DASHES `module-attribute` ¤

DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.

core.nlp.ALTERNATIVES `module-attribute` ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or mr/mrs
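A minimal sketch of one way DASHES and ALTERNATIVES could be used to split compounds into separate words before tokenization. Replacing the separator with a space is an assumption, and `split_compounds` is a hypothetical helper.

```python
import re

# Patterns copied verbatim from the definitions above.
DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)
ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)


def split_compounds(text: str) -> str:
    text = DASHES.sub(" ", text)
    return ALTERNATIVES.sub(" ", text)


result = split_compounds("a state-of-the-art and/or low_cost lens - really")
print(result)
```

The lookarounds require a word character on both sides, so the free-standing dash in `lens - really` is left untouched.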

core.nlp.PLURAL_S `module-attribute` ¤

PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

core.nlp.FEMININE_E `module-attribute` ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

core.nlp.DOUBLE_CONSONANTS `module-attribute` ¤

DOUBLE_CONSONANTS = re.compile(
    "(?<=\\w{2,})([bcfghjklmnpqrstvwxz])\\1", re.IGNORECASE
)

Identify double consonants in the middle of words.

core.nlp.FEMININE_TRICE `module-attribute` ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice, along with the matching -teur and -tor endings.

core.nlp.ADVERB_MENT `module-attribute` ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

core.nlp.SUBSTANTIVE_TION `module-attribute` ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

core.nlp.SUBSTANTIVE_AT `module-attribute` ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

core.nlp.PARTICIPLE_ING `module-attribute` ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing

core.nlp.ADJECTIVE_ED `module-attribute` ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed
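The suffix patterns in this group use variable-width lookbehinds such as `(?<=\w{4,})`, which the standard-library `re` module rejects. Presumably the module compiles them with an engine that supports them (for example the third-party `regex` package); that is an assumption. The sketch below substitutes `(?<=\w{4})`, which accepts exactly the same positions, and strips the suffix with an empty replacement, which is also an assumption about how the patterns are used.

```python
import re

# Copied verbatim from the end_of_word definition above.
end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

# Fixed-width stand-ins for the documented variable-width lookbehinds.
PARTICIPLE_ING = re.compile('(?<=\\w{4})ing%s' % end_of_word)
ADJECTIVE_ED = re.compile('(?<=\\w{4})ed%s' % end_of_word)


def strip_suffixes(text: str) -> str:
    """Hypothetical helper: remove -ing and -ed from sufficiently long words."""
    for pattern in (PARTICIPLE_ING, ADJECTIVE_ED):
        text = pattern.sub("", text)
    return text


stems = strip_suffixes("tokenizing normalized text, the king stemmed.")
print(stems)
```

The minimum stem length is what protects short words: `king` has only one character before `ing`, so it is kept whole.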

core.nlp.ADJECTIVE_TIF `module-attribute` ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

core.nlp.SUBSTANTIVE_Y `module-attribute` ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

core.nlp.VERB_IZ `module-attribute` ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American English verbs ending in -iz (or -yz) that French and British English spell with -is

core.nlp.STUFF_ER `module-attribute` ¤

STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

core.nlp.BRITISH_OUR `module-attribute` ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

core.nlp.SUBSTANTIVE_ITY `module-attribute` ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

core.nlp.SUBSTANTIVE_IST `module-attribute` ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

core.nlp.SUBSTANTIVE_IQU `module-attribute` ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu

core.nlp.SUBSTANTIVE_EUR `module-attribute` ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur

core.nlp.HYPHENIZED `module-attribute` ¤

HYPHENIZED = re.compile('(?<=\\w{3,})[-–—]+ *[\\n\\r]{1,2}(?=\\w)')

Detect words hyphenated across the end of a PDF text line.

core.nlp.WAYBACK_RE `module-attribute` ¤

WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')

Find the canonical URL from web.archive.org (Wayback Machine) URLs
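A minimal sketch of how the captured group can be used to unwrap an archived link. The helper `canonical_url` and the example URLs are hypothetical.

```python
import re

# Pattern copied verbatim from the WAYBACK_RE definition above.
WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')


def canonical_url(url: str) -> str:
    """Return the archived page's original URL, or `url` unchanged."""
    match = WAYBACK_RE.match(url)
    return match.group(1) if match else url


archived = "https://web.archive.org/web/20200101000000/https://example.com/page"
print(canonical_url(archived))
```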

core.nlp.LANG_MAP `module-attribute` ¤

LANG_MAP = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
    "it": "italian",
    "pt": "portuguese",
    "nl": "dutch",
    "sv": "swedish",
    "no": "norwegian",
    "da": "danish",
    "fi": "finnish",
    "ru": "russian",
    "ro": "romanian",
    "hu": "hungarian",
    "tr": "turkish",
}

Map ISO 639-1 language codes of supported languages to their full names, as used by pre-trained corpora

core.nlp.LANG_MAP_REVERSE `module-attribute` ¤

LANG_MAP_REVERSE = {v: k for k, v in (LANG_MAP.items())}

Map the full names of supported languages, as used by pre-trained corpora, to ISO 639-1 language codes

core.nlp.STOPWORDS_DICT `module-attribute` ¤

STOPWORDS_DICT = {
    language: (set(STOPWORDS_DICT[language])) for language in STOPWORDS_DICT
}

Dictionary of stopwords (as set values) mapped to full language names (as keys)

Classes¤

core.nlp.Lexicon `dataclass` ¤

Lexicon(counts: Counter[str] = Counter())

Mutable token frequency index with canonicalization helpers for:

malformed n-grams,
merged/split variants,
plural compound normalization.

Examples:

- liber_tarian  -> libertarian
- etres_humains -> etre_humain

Methods:¤

core.nlp.Lexicon.update ¤

update(corpus: Iterable[Iterable[str]]) -> None

Update token frequencies from a corpus of tokenized sentences.

PARAMETER	DESCRIPTION
`corpus`	Iterable of tokenized sentences: `[["this", "is", "a", "sentence"], ["another", "sentence"]]` TYPE: `Iterable[Iterable[str]]`

core.nlp.Lexicon.frequency ¤

frequency(token: str) -> int

Return token frequency.

core.nlp.Lexicon.exists ¤

exists(token: str) -> bool

Check whether a token exists in the lexicon.

core.nlp.Lexicon.prune ¤

prune(min_count: int = 10) -> None

Remove all entries whose frequency is lower than min_count.

PARAMETER	DESCRIPTION
`min_count`	Minimum frequency to keep. TYPE: `int` DEFAULT: `10`
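To illustrate the counting methods documented so far, here is a minimal sketch built on `collections.Counter`. `MiniLexicon` is a hypothetical stand-in written for this example; it mirrors the documented signatures but is not the library's implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class MiniLexicon:
    """Hypothetical stand-in for Lexicon, for illustration only."""

    counts: Counter = field(default_factory=Counter)

    def update(self, corpus: Iterable[Iterable[str]]) -> None:
        for sentence in corpus:
            self.counts.update(sentence)

    def frequency(self, token: str) -> int:
        # Counter returns 0 for unknown tokens without inserting them.
        return self.counts[token]

    def exists(self, token: str) -> bool:
        return self.counts[token] > 0

    def prune(self, min_count: int = 10) -> None:
        kept = {t: n for t, n in self.counts.items() if n >= min_count}
        self.counts = Counter(kept)


lexicon = MiniLexicon()
lexicon.update([["this", "is", "a", "sentence"], ["another", "sentence"]])
before = lexicon.frequency("sentence")
lexicon.prune(min_count=2)
print(before, sorted(lexicon.counts))
```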

core.nlp.Lexicon.resolve_token ¤

resolve_token(token: str, separator: str = '_', min_ratio: float = 1.0) -> str

Attempt to canonicalize malformed n-grams.

Operations

malformed n-grams: liber_tarian -> libertarian
plural compound reduction: etres_humains -> etre_humain

Strategy

if token exists already -> keep it
otherwise:
- remove separators,
- check if merged variant exists,
- compare frequencies,
- prefer merged form if sufficiently frequent.

PARAMETER	DESCRIPTION
`token`	Token to canonicalize. TYPE: `str`
`separator`	N-gram separator. TYPE: `str` DEFAULT: `'_'`
`min_ratio`	Require merged token frequency to be at least `min_ratio` times the split variant frequency. Helps avoid false positives. TYPE: `float` DEFAULT: `1.0`

RETURNS	DESCRIPTION
`str`	Canonicalized token.

core.nlp.Lexicon.canonicalize_sentence ¤

canonicalize_sentence(
    sentence: list[str], separator: str = "_", min_ratio: float = 1.0
) -> list[str]

Canonicalize all tokens in a sentence.

core.nlp.Tokenizer ¤

Tokenizer(
    meta_tokens: dict[re.Pattern, str] | None = None,
    abbreviations: dict[str, str] | None = None,
    replacements: dict[str, str] | None = None,
    stopwords: set[str] | None = None,
    lang_stopwords: dict[str, set[str]] | None = None,
    backend: str = "blingfire",
)

Pre-processing pipeline and tokenizer.

Splits a string into normalized word tokens after applying a series of configurable text transformations.

PARAMETER	DESCRIPTION
`meta_tokens`	Pipeline of regular-expression substitutions used to replace document fragments with meta-tokens. Keys must be compiled `re.Pattern` objects and values must be meta-token strings, typically enclosed in underscores. Transformations are applied in declaration order. This relies on Python’s ordered dictionaries (Python 3.7+). If not provided, a default pipeline suitable for bilingual English/French technical documents is used. TYPE: `dict[re.Pattern, str] \| None` DEFAULT: `None`
`abbreviations`	Pipeline of abbreviation replacements as a `{to_replace: replacement}` dictionary. Replacements are applied in declaration order. TYPE: `dict[str, str] \| None` DEFAULT: `None`
`replacements`	Dictionary of token-level substitutions applied as `{key: value}` string replacements. TYPE: `dict[str, str] \| None` DEFAULT: `None`
`stopwords`	Language-agnostic stopwords to remove from the token stream. TYPE: `set[str] \| None` DEFAULT: `None`
`lang_stopwords`	Language-specific stopwords. Keys must be ISO 639-1 language codes and values must be sets of stopwords associated with each language. TYPE: `dict[str, set[str]] \| None` DEFAULT: `None`
`backend`	Tokenization backend to use. Supported values are: `"blingfire"`: Microsoft BlingFire tokenizer (pattern-based). `"nltk"`: NLTK Punkt tokenizer. TYPE: `str` DEFAULT: `'blingfire'`

Attributes¤

core.nlp.Tokenizer.characters_cleanup `class-attribute` `instance-attribute` ¤

characters_cleanup: dict[(re.Pattern) : str] = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}

Dictionary of regular expressions (keys) to find and replace by the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBCode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).
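A minimal sketch of how such a cleanup dictionary could be applied, assuming each substitution is run once, in declaration order. The patterns and replacements are copied from the definitions above; the `cleanup` helper and the sample string are hypothetical.

```python
import re

MULTIPLE_DOTS = re.compile('\\.{2,}')
MULTIPLE_DASHES = re.compile('[-~]{1,}')
MULTIPLE_QUESTIONS = re.compile('\\?{1,}')
REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')
BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')
MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')
BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

characters_cleanup = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}


def cleanup(text: str) -> str:
    # Dictionaries preserve insertion order, so this is declaration order.
    for pattern, replacement in characters_cleanup.items():
        text = pattern.sub(replacement, text)
    return text


raw = "Hello..... what???? [quote]old text[/quote] [see attachment] =========="
cleaned = cleanup(raw)
print(" ".join(cleaned.split()))
```

The replacements leave extra spaces behind, which is why the usage line collapses whitespace before printing.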

core.nlp.Tokenizer.internal_meta_tokens `class-attribute` `instance-attribute` ¤

internal_meta_tokens: dict[(re.Pattern) : str] = {
    HASH_PATTERN_FAST: "_HASH_",
    NUMBER_PATTERN_FAST: "_NUMBER_",
}

Dictionary of regular expressions (keys) to find in full tokens and replace by meta-tokens. Uses simplified regex patterns for performance.

core.nlp.Tokenizer.abbreviations `instance-attribute` ¤

abbreviations = abbreviations

Abbreviations and contractions to replace in full documents

core.nlp.Tokenizer.replacements `instance-attribute` ¤

replacements = replacements

Arbitrary string replacements in single tokens

core.nlp.Tokenizer.stopwords `instance-attribute` ¤

stopwords = set(stopwords) if stopwords else None

Language-agnostic stopwords

core.nlp.Tokenizer.lang_stopwords `instance-attribute` ¤

lang_stopwords = lang_stopwords

Language-specific stopwords

core.nlp.Tokenizer.supports_ngrams `instance-attribute` ¤

supports_ngrams: bool = False

Whether or not the tokenizer has an embedded n-grams model

core.nlp.Tokenizer.ngrams_trie `instance-attribute` ¤

ngrams_trie = {}

Prefix tree of known n-grams for efficient lookups

core.nlp.Tokenizer.vocabulary `instance-attribute` ¤

vocabulary: Lexicon = Lexicon()

Known tokens, if trained for n-grams.

Methods:¤

core.nlp.Tokenizer.prefilter ¤

prefilter(string: str, meta_tokens: bool = True) -> str

Tokenizers split words based on unsupervised machine-learned models, which sometimes behave unexpectedly. For example, in emails and user handles like @user, they would split @ and user into 2 different tokens, making it impossible to detect usernames in single tokens later.

To avoid that, we replace data of interest by meta-tokens before the tokenization, with regular expressions.
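As a rough sketch of this idea, the USER pattern documented above can be swapped for a meta-token before any tokenizer sees the text. The `_USER_` meta-token name, the one-entry pipeline and the `prefilter_sketch` helper are assumptions made for this example.

```python
import re

# Pattern copied verbatim from the USER definition above.
USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

# Hypothetical pipeline: `_USER_` is an illustrative meta-token name.
meta_tokens = {USER: "_USER_"}


def prefilter_sketch(string: str) -> str:
    # Substitutions run in declaration order of the dictionary.
    for pattern, meta_token in meta_tokens.items():
        string = pattern.sub(meta_token, string)
    return string


filtered = prefilter_sketch("ping @alice and bob@example.com about user-42")
print(filtered)
```

Once replaced, each handle is a single underscore-delimited word that a tokenizer has no reason to split on `@`.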

core.nlp.Tokenizer.lemmatize ¤

lemmatize(word: str) -> str

Find the root (lemma) of words to help topical generalization.

core.nlp.Tokenizer.normalize_text ¤

normalize_text(document: str) -> str

Prepare text for tokenization by converting it to lowercase ASCII characters.

This will lose accents, diacritics and capitals, which means some nuance will be lost for the benefit of generality. If this does not suit your use case, you may subclass Tokenizer and re-implement this method.
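One common way to obtain lowercase ASCII text is Unicode decomposition followed by dropping non-ASCII code points. The sketch below shows that technique; it is an assumption and may differ from what normalize_text actually does.

```python
import unicodedata


def normalize_text_sketch(document: str) -> str:
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", document)
    # Encoding to ASCII with "ignore" drops the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


normalized = normalize_text_sketch("Déjà Vu à Genève")
print(normalized)
```

Characters that have no ASCII decomposition (such as `ø`) are dropped entirely by this approach, which is one reason a subclass might want a different implementation.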

core.nlp.Tokenizer.normalize_token ¤

normalize_token(
    word: str,
    language: str | None,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> str | None

Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.), punctuation, etc. are replaced by None, which should be filtered out at the next step.

PARAMETER	DESCRIPTION
`word`	tokenized word in lower case only. TYPE: `str`
`language`	the ISO 639-1 language code used to remove typical stopwords. TYPE: `str \| None`
`normalize`	remove punctuation and leading/trailing symbols. TYPE: `bool` DEFAULT: `True`
`meta_tokens`	replace string patterns by meta-tokens. TYPE: `bool` DEFAULT: `True`
`stem`	remove word suffixes, double consonants, etc. TYPE: `bool` DEFAULT: `True`
`remove_stopwords`	remove stopwords TYPE: `bool` DEFAULT: `True`

NOTE

Tokenization is non-destructive (full sentences can be reconstructed entirely from token lists) if normalize=False, meta_tokens=False, stem=False and remove_stopwords=False. In this setting, only 1:1 token replacements defined in self.replacements will be applied, which makes it possible to replace abbreviations or acronyms. Other modes start generalizing semantics by removing meaning.

Examples:

Meta-tokens: 10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token. feb, February, feb., monday will all be replaced by a _DATE_ meta-token.

core.nlp.Tokenizer.tokenize_text ¤

tokenize_text(
    sentence: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

Split text into normalized word tokens and meta-tokens.

No sentence or paragraph boundary detection is performed.

PARAMETER	DESCRIPTION
`sentence`	Input text to tokenize. TYPE: `str`
`n_grams`	Whether to detect and collapse n-grams. Requires a trained n-gram model generated with `train_ngrams()`. TYPE: `bool` DEFAULT: `True`

Note

The parameters language, normalize, meta_tokens, stem, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`list[str]`	List of normalized tokens represented as a bag of words.

core.nlp.Tokenizer.post_filter_tokens ¤

post_filter_tokens(
    tokens: list[str],
    language: str | None = None,
    meta_tokens: bool = True,
    stem: bool = False,
    normalize: bool = False,
    remove_stopwords: bool = False,
) -> list[str]

Apply post-processing operations to an existing token stream.

This method applies token normalization, stemming, stopword removal, and meta-token handling without performing tokenization.

PARAMETER	DESCRIPTION
`tokens`	List of input tokens to process. TYPE: `list[str]`

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`list[str]`	List of processed tokens.

core.nlp.Tokenizer.tokenize_document_flat ¤

tokenize_document_flat(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

Clean up and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:

lowercased (optional but recommended) with str.lower(),
translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
cleaned up for sequences of whitespaces with core.utils.clean_whitespaces

Note

The language is detected internally if not provided as an optional argument. When processing a single sentence extracted from a document, instead of the whole document, it is more accurate to run the language detection on the whole document, ahead of calling this method, and pass the result here.

PARAMETER	DESCRIPTION
`document`	the text of the document to tokenize TYPE: `str`
`n_grams`	see core.nlp.Tokenizer.tokenize_text TYPE: `bool` DEFAULT: `True`

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`tokens`	a 1D list of normalized tokens and meta-tokens. TYPE: `list[str]`

core.nlp.Tokenizer.tokenize_document_per_sentence ¤

tokenize_document_per_sentence(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

Clean up and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

lowercased (optional but recommended) with str.lower(),
translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
cleaned up for sequences of whitespaces with core.utils.clean_whitespaces

PARAMETER	DESCRIPTION
`document`	the text of the document to tokenize TYPE: `str`
`n_grams`	see core.nlp.Tokenizer.tokenize_text TYPE: `bool` DEFAULT: `True`

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS	DESCRIPTION
`tokens`	a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis). TYPE: `list[list[str]]`

core.nlp.Tokenizer.tokenize_document_per_paragraph ¤

tokenize_document_per_paragraph(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

Clean up and tokenize a whole document as a list of paragraphs, meaning we split it on paragraph breaks such as `\n\n` before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

lowercased (optional but recommended) with str.lower(),
translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
cleaned up for sequences of whitespaces with core.utils.clean_whitespaces

PARAMETER	DESCRIPTION
`document`	the text of the document to tokenize TYPE: `str`
`n_grams`	see core.nlp.Tokenizer.tokenize_text TYPE: `bool` DEFAULT: `True`

Note

The other parameters are described in core.nlp.Tokenizer.normalize_token. The language is detected internally if not provided. The text is prefiltered with self.prefilter.

RETURNS	DESCRIPTION
`tokens`	a 2D list of paragraphs (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

core.nlp.Tokenizer.load `classmethod` ¤

load(name: str)

Load an existing trained model by its name from the ../models folder.

core.nlp.Tokenizer.members_from_ngram ¤

members_from_ngram(token: str | None) -> list[str] | None

Recover n-gram members from a single tokenized phrase, separated with _. This expects lower-case tokens, except for meta-tokens, which are expected to be capitalized.

RETURNS	DESCRIPTION
`list[str] \| None`	the list of n-gram members, or None if the token was not an n-gram but a singleton.

core.nlp.Tokenizer.train_ngrams ¤

train_ngrams(
    sentences: list[str],
    connector_words: str = "",
    min_count: int = 10,
    threshold: float = 0.7,
    scoring: str = "npmi",
)

Train an n-gram model (bigrams and trigrams).

Detects common phrases such as “New York City” and merges them into single tokens using a statistical phrase model.

PARAMETER	DESCRIPTION
`sentences`	Training corpus. Must be a list of tokenized sentences. TYPE: `list[str]`
`connector_words`	Space-separated list of connector words allowed inside phrases (e.g. “by” in “piece by piece”). These words are treated as valid bridges when forming n-grams. TYPE: `str` DEFAULT: `''`
`min_count`	Minimum number of occurrences required for a phrase to be considered. See gensim.models.phrases.Phrases for details. TYPE: `int` DEFAULT: `10`
`threshold`	Phrase detection sensitivity threshold. See gensim.models.phrases.Phrases. TYPE: `float` DEFAULT: `0.7`
`scoring`	Scoring function used for phrase detection. See gensim.models.phrases.Phrases. TYPE: `str` DEFAULT: `'npmi'`

Warning

N-gram training must be performed on lightly processed tokenized sentences. Do not apply stemming, stopword removal, or punctuation stripping before training.

See Tokenizer.normalize_token() for required preprocessing options.

Note

Writes an ngrams log file in the models directory containing discovered phrases.
Can be executed multiple times (e.g. per language); results are appended to the existing model.
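The default `scoring="npmi"` and `threshold=0.7` can be read through the usual definition of normalized pointwise mutual information. The following is a minimal plain-Python sketch of that formula, not the gensim implementation; the function name `npmi` and all counts are hypothetical.

```python
import math

def npmi(count_a: int, count_b: int, count_ab: int, total_words: int) -> float:
    """Normalized pointwise mutual information of a bigram, in [-1, 1]."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    # PMI divided by -log p(a, b): 1 means the two words always co-occur.
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Hypothetical counts: "new" and "york" each seen 50 times, always together.
always_together = npmi(50, 50, 50, 10_000)
# Hypothetical counts: two frequent words co-occurring about as often as chance.
by_chance = npmi(1_000, 1_000, 100, 10_000)

threshold = 0.7  # default value of the `threshold` parameter
keep_phrase = always_together > threshold
keep_chance = by_chance > threshold
```

Under these assumed counts, only the first pair scores above the threshold and would be kept as a phrase.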

core.nlp.Tokenizer.compile_ngrams ¤

compile_ngrams(ngrams: list[str])

Build a nested n-gram dictionary for efficient querying, like:

{
    "new": {
        "york": {
            "__value__": "new_york",
            "city": {
                "__value__": "new_york_city"
            }
        }
    }
}
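A nested dictionary of this shape could be built as sketched below. This assumes the input n-grams are `_`-joined strings; the helper name `compile_trie` is hypothetical and the actual method may differ.

```python
def compile_trie(ngrams: list[str]) -> dict:
    """Build a nested dictionary keyed by successive n-gram members."""
    trie: dict = {}
    for ngram in ngrams:
        node = trie
        for member in ngram.split("_"):
            # Descend one level per member, creating nodes as needed.
            node = node.setdefault(member, {})
        # Mark the end of a known n-gram with its collapsed form.
        node["__value__"] = ngram
    return trie

trie = compile_trie(["new_york", "new_york_city"])
```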

core.nlp.Tokenizer.replace_ngrams ¤

replace_ngrams(tokens: list[str]) -> list[str]

Identify n-grams among tokens and collapse them into single tokens. N-grams should have been trained before, with core.nlp.Tokenizer.train_ngrams.

RETURNS	DESCRIPTION
`list[str]`	the collapsed list of strings, or the original list if no n-gram was found or the n-gram model has not been trained.
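One plausible way to collapse n-grams is a greedy longest-match walk over the nested dictionary. The sketch below assumes that strategy and the `__value__` marker described for compile_ngrams; the function name `collapse_ngrams` is hypothetical and the real method may resolve overlaps differently.

```python
def collapse_ngrams(tokens: list[str], trie: dict) -> list[str]:
    """Replace the longest known n-gram starting at each position."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        node, best, best_end = trie, None, i
        j = i
        # Walk the trie as far as the tokens allow, remembering the last full match.
        while j < len(tokens) and tokens[j] != "__value__" and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "__value__" in node:
                best, best_end = node["__value__"], j
        if best is None:
            out.append(tokens[i])  # no n-gram starts here: keep the token
            i += 1
        else:
            out.append(best)       # emit the collapsed n-gram and skip its members
            i = best_end
    return out

trie = {"new": {"york": {"__value__": "new_york",
                         "city": {"__value__": "new_york_city"}}}}
collapsed = collapse_ngrams(["i", "love", "new", "york", "city", "pizza"], trie)
unchanged = collapse_ngrams(["new", "jersey"], trie)
```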

core.nlp.Tokenizer.lookup_ngram ¤

lookup_ngram(members: list[str] | tuple[str, ...]) -> str | None

Look up an n-gram in the trie from its token members.

PARAMETER	DESCRIPTION
`members`	the tokens iterable TYPE: `list[str] \| tuple[str, ...]`

RETURNS	DESCRIPTION
`str \| None`	the collapsed n-gram if found in the trie, or `None` if the input members match no known n-gram.

Example

lookup_ngram(("new", "york")) -> "new_york"

lookup_ngram(("new", "york", "city")) -> "new_york_city"

lookup_ngram(("foo", "bar")) -> None
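The lookups above can be reproduced with a short walk through a nested dictionary. This is a sketch assuming the `__value__` layout shown for compile_ngrams; the standalone function `lookup` is hypothetical.

```python
from typing import Optional, Sequence

def lookup(trie: dict, members: Sequence[str]) -> Optional[str]:
    """Follow the members down the trie and return the collapsed n-gram, if any."""
    node = trie
    for member in members:
        node = node.get(member)
        if not isinstance(node, dict):
            return None  # path broken: unknown member
    # A path may exist without being a complete n-gram (e.g. only "new").
    return node.get("__value__")

trie = {"new": {"york": {"__value__": "new_york",
                         "city": {"__value__": "new_york_city"}}}}
found = lookup(trie, ("new", "york"))
missing = lookup(trie, ("foo", "bar"))
```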

core.nlp.Data ¤

Data(text: str, label: str)

Represent an item of tagged training data.

PARAMETER	DESCRIPTION
`text`	the content to label, which will be vectorized TYPE: `str`
`label`	the category of the content, which will be predicted by the model TYPE: `str`

core.nlp.LossLogger ¤

LossLogger()

Bases: CallbackAny2Vec

Output loss at each epoch

core.nlp.WordEmbedding ¤

Shared interface and post-processing for word-embedding models.

Gensim-agnostic mixin implementing corpus statistics (IDF/SIF), All-but-the-Top post-processing and the word/document vector-retrieval API used by the search engine. It is combined with a concrete gensim training class (core.nlp.Word2Vec or core.nlp.FastText), which must provide self.wv, an output matrix (syn1neg or syn1) and save().

Both core.nlp.Word2Vec and core.nlp.FastText inherit it, so they expose the same interface and are interchangeable wherever the search indexer (core.search.Indexer) expects an embedding model. Type-hint against core.nlp.WordEmbedding to depend on the interface rather than a concrete model.

Attributes¤

core.nlp.WordEmbedding.tokenizer `instance-attribute` ¤

tokenizer: 'Tokenizer'

Tokenizer object, instantiated with word replacements and trained for n-grams if needed.

core.nlp.WordEmbedding.vector_size `instance-attribute` ¤

vector_size: int

Number of vector dimensions used to embed words.

core.nlp.WordEmbedding.N_docs `instance-attribute` ¤

N_docs: int

Number of documents used at training time

core.nlp.WordEmbedding.N_sentences `instance-attribute` ¤

N_sentences: int

Number of sentences found in the training corpus

core.nlp.WordEmbedding.N_words `instance-attribute` ¤

N_words: int

Number of words (tokens) found in the training corpus

core.nlp.WordEmbedding.N_terms `instance-attribute` ¤

N_terms: int

Number of unique terms found in the training corpus

core.nlp.WordEmbedding.idf `instance-attribute` ¤

idf: 'dict[str, float] | None'

Inverse document frequency of words. Computed only if core.nlp.WordEmbedding is instantiated with compute_idf=True.

core.nlp.WordEmbedding.avg_doc_len `instance-attribute` ¤

avg_doc_len: 'float | None'

Average document length. Computed only if core.nlp.WordEmbedding is instantiated with compute_idf=True.

core.nlp.WordEmbedding.wv `instance-attribute` ¤

wv: gensim.models.KeyedVectors

Gensim keyed vectors

Methods:¤

core.nlp.WordEmbedding.prune_idf ¤

prune_idf()

Prune IDF entries to the actual model vocabulary (remove tokens that were filtered out by gensim during super().__init__).

core.nlp.WordEmbedding.apply_abtt ¤

apply_abtt(n_components_in: int = 3, n_components_out: int = 0) -> None

Post-process IN and OUT word vectors using All-but-the-Top.

Applies mean subtraction (and optionally PC removal) to W_IN and W_OUT independently. W_OUT receives lighter treatment because its common component is weaker under negative sampling and because it feeds into document centroids that are already corrected by normalize_pc() in the search index.

The intuition is that the principal components of the embedding vector space encode frequency rather than semantics.

For the FastText drop-in this adjusts only the in-vocabulary full-word vectors (wv.vectors) and the output matrix; OOV vectors reconstructed on the fly from sub-word n-grams are left untouched.

Reference

Mu & Viswanath (2018) “All-but-the-Top: Simple and Effective Postprocessing for Word Representations” https://arxiv.org/abs/1702.01417

PARAMETER	DESCRIPTION
`n_components_in`	PCs to remove from W_IN (query space) beyond the mean. For a specialty corpus, 3-10 is typical. Removing too many components risks losing semantic meaning. `0` performs only the mean subtraction (no principal components). `-1` disables both principal-component removal and mean subtraction (bypass). TYPE: `int` DEFAULT: `3`
`n_components_out`	PCs to remove from W_OUT (document space) beyond the mean. Default 0 (mean only) is recommended unless you observe residual domain bias in document clusters after indexing. `0` performs only the mean subtraction (no principal components). `-1` disables both principal-component removal and mean subtraction (bypass). TYPE: `int` DEFAULT: `0`
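The All-but-the-Top procedure described above (mean subtraction, then removal of the top principal components) can be sketched with NumPy as follows. This is an illustration on a random, hypothetical matrix; the function name `all_but_the_top` is an assumption and the real method operates on the model's own W_IN and W_OUT matrices.

```python
import numpy as np

def all_but_the_top(vectors: np.ndarray, n_components: int) -> np.ndarray:
    """Mean-center vectors, then remove their top principal components."""
    if n_components < 0:
        return vectors  # bypass, as documented for -1
    centered = vectors - vectors.mean(axis=0)
    if n_components == 0:
        return centered  # mean subtraction only
    # Rows of `vt` are the principal directions of the centered vectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]
    # Subtract the projection of each vector onto the top directions.
    return centered - centered @ top.T @ top

rng = np.random.default_rng(0)
# Hypothetical embedding matrix: 100 words, 8 dimensions, with a shared offset.
w_in = rng.normal(size=(100, 8)) + 5.0
processed = all_but_the_top(w_in, n_components=3)
```

After processing, the vectors have zero mean and no remaining component along the three removed directions.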

core.nlp.WordEmbedding.load_model `classmethod` ¤

load_model(name: str)

Load a trained model saved in the models folder.

core.nlp.WordEmbedding.get_word ¤

get_word(word: str) -> str | None

Find out if a word is in the dictionary, optionally attempting spell-checking if it is not found.

For core.nlp.Word2Vec this means the word is in vocabulary. For core.nlp.FastText it is also true for out-of-vocabulary words that can be reconstructed from sub-word n-grams.

PARAMETER	DESCRIPTION
`word`	word to find TYPE: `str`

RETURNS	DESCRIPTION
`str \| None`	the original word if found in the dictionary, `None` if neither of the previous conditions was met.

core.nlp.WordEmbedding.get_wordvec ¤

get_wordvec(
    word: str, embed: str = "IN", normalize: bool = True
) -> np.ndarray[np.float32] | None

Return the embedding vector associated with a word.

PARAMETER	DESCRIPTION
`word`	the word to convert to a vector. TYPE: `str`
`embed`	`IN` uses the input embedding matrix (query/document encoding). `OUT` uses the output embedding matrix (dual-embedding space document ranking, see the reference below). TYPE: `str` DEFAULT: `'IN'`

Reference

A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana. https://arxiv.org/pdf/1602.01137.pdf

RETURNS	DESCRIPTION
`np.ndarray[np.float32] \| None`	the nD vector if the word can be vectorized, or `None`. For FastText, OOV words have no `OUT` vector (output embeddings exist only for in-vocabulary words) and return `None` for `embed="OUT"`.

core.nlp.WordEmbedding.get_features ¤

get_features(
    tokens: list[str],
    embed: str = "IN",
    use_sif: bool = False,
    sif_smoothing: float = 0.001,
    top_k: int = 0,
) -> np.ndarray[np.float32]

Calls core.nlp.WordEmbedding.get_wordvec over a list of tokens and returns a single centroid vector representing the whole list.

Tokens are aggregated per unique word, so a word’s contribution scales with its in-list frequency (a word occurring n times contributes n × weight). This is mathematically identical to summing over every occurrence, but it also exposes a per-word salience used by top_k.

PARAMETER	DESCRIPTION
`tokens`	list of text tokens. TYPE: `list[str]`
`embed`	see core.nlp.WordEmbedding.get_wordvec TYPE: `str` DEFAULT: `'IN'`
`use_sif`	Use SIF weighting on each term when embedding a full sentence or document. See core.nlp.WordEmbedding.SIF. TYPE: `bool` DEFAULT: `False`
`sif_smoothing`	The SIF smoothing coefficient. TYPE: `float` DEFAULT: `0.001`
`top_k`	length-aware pooling. When `> 0`, keep only the `top_k` most salient unique tokens (highest accumulated `frequency × SIF` weight) before averaging; `0` (default) uses every token. Long documents otherwise drown their topical signal under a long tail of low-salience words, which pulls the centroid toward the corpus mean (centroid dilution) and makes comprehensive pages rank below short, keyword-peaky ones. Capping to the most discriminative tokens de-dilutes long documents while leaving short ones (fewer than `top_k` tokens) untouched. Used at document-vectorization time (see core.batching.batch_vectorize); the default `0` keeps the query path unchanged. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`np.ndarray[np.float32]`	the normalized centroid of word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
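The per-unique-word aggregation and the `top_k` cap can be illustrated in plain Python. This is a simplified sketch under assumptions: the function name `centroid`, the toy 2-D vectors and the uniform weight are hypothetical, and the real method works on NumPy arrays with optional SIF weights.

```python
import math
from collections import Counter

def centroid(tokens, vectors, weight, top_k=0):
    """Weighted, L2-normalized centroid of the known tokens' vectors."""
    # Aggregate per unique word: salience = in-list frequency x per-word weight.
    counts = Counter(t for t in tokens if t in vectors)
    salience = {t: n * weight(t) for t, n in counts.items()}
    if top_k > 0:
        # Keep only the most salient unique tokens.
        kept = sorted(salience, key=salience.get, reverse=True)[:top_k]
        salience = {t: salience[t] for t in kept}
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for t, s in salience.items():
        for d in range(dim):
            total[d] += s * vectors[t][d]
    norm = math.sqrt(sum(x * x for x in total))
    # Null vector when no token was found in the (hypothetical) vocabulary.
    return [x / norm for x in total] if norm > 0 else total

# Hypothetical 2-D embeddings and a uniform weight, for illustration only.
vectors = {"lens": [1.0, 0.0], "the": [0.0, 1.0]}
uniform = lambda token: 1.0
tokens = ["lens", "lens", "lens", "the", "unknown"]
full = centroid(tokens, vectors, uniform)
capped = centroid(tokens, vectors, uniform, top_k=1)
```

With `top_k=1`, only the most salient word ("lens") contributes, so the low-salience word no longer pulls the centroid away from it.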

core.nlp.WordEmbedding.SIF ¤

SIF(token: str, a: float = 0.005) -> float

Smooth inverse frequency weighting.

This helps refine semantics by under-weighting stopwords when aggregating word vectors into a document centroid.

Reference

A simple but tough-to-beat baseline for sentence embeddings (2017). Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx

PARAMETER	DESCRIPTION
`token`	the token to weight. It should be in the model vocabulary. TYPE: `str`

RETURNS	DESCRIPTION
`float`	The SIF weight associated with the token, or `0.` if the token was not found in the vocabulary.

Warning

The core.nlp.WordEmbedding model needs to have been trained with compute_idf=True to prepare the statistics needed by SIF weighting. The method will raise an error if the stats are not available.
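In the referenced paper, the SIF weight of a word is a / (a + p(w)), where p(w) is the word's relative frequency in the corpus. The sketch below applies that formula to hypothetical raw counts; the function name `sif_weight` is an assumption, and the actual method may derive the frequency from its stored IDF statistics instead.

```python
def sif_weight(token: str, word_counts: dict[str, int], a: float = 0.005) -> float:
    """Smooth inverse frequency weight a / (a + p(token))."""
    if token not in word_counts:
        return 0.0  # unknown tokens get a null weight
    p = word_counts[token] / sum(word_counts.values())
    return a / (a + p)

# Hypothetical corpus counts: one stopword, one rarer topical word.
counts = {"the": 900, "aperture": 100}
w_the = sif_weight("the", counts)
w_aperture = sif_weight("aperture", counts)
```

The frequent word receives a much smaller weight than the rarer one, which is the under-weighting effect described above.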

core.nlp.WordEmbedding.tokens_to_indices ¤

tokens_to_indices(tokens: list[str]) -> np.ndarray[np.int32]

Convert a list of tokens to a list of their index number in the vocabulary. This yields a more compact, albeit purely symbolic, representation of a tokenized document as a series of integers.

Only in-vocabulary tokens are kept (out-of-vocabulary FastText tokens have no stable vocabulary index). The conversion is reversible and the original token can be found with self.wv.index_to_key[i], where i is the index number output (for each token) from here.

RETURNS	DESCRIPTION
`np.ndarray[np.int32]`	the list of indices as signed 32-bit integers, meaning the vocabulary needs to contain fewer than about 2.15 billion words.
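The round trip between tokens and indices can be sketched with plain lists and dictionaries. The toy vocabulary below stands in for gensim's `wv.index_to_key` and `wv.key_to_index`; the helper name `to_indices` is hypothetical.

```python
# Hypothetical vocabulary, standing in for `wv.index_to_key` / `wv.key_to_index`.
index_to_key = ["lens", "aperture", "sensor"]
key_to_index = {word: i for i, word in enumerate(index_to_key)}

def to_indices(tokens: list[str]) -> list[int]:
    """Keep in-vocabulary tokens only and map them to their index."""
    return [key_to_index[t] for t in tokens if t in key_to_index]

indices = to_indices(["sensor", "bokeh", "lens"])  # "bokeh" is out of vocabulary
restored = [index_to_key[i] for i in indices]
```

Note that the out-of-vocabulary token is dropped, so the restored list can be shorter than the input.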

core.nlp.Word2Vec ¤

Word2Vec(
    documents: Iterable[Iterable[list[str]]],
    name: str = "word2vec",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count: int = 5,
    sample: float = 0.0005,
    tokenizer: Tokenizer | None = None,
    compute_idf: bool = True,
    n_pc_in: int = 3,
    n_pc_out: int = 0,
    **kwargs: dict[str, Any]
)

Bases: WordEmbedding, gensim.models.Word2Vec

Train a Word2Vec embedding model and compute TF-IDF word statistics on the corpus.

The pre-computed object is automatically saved to VirtualSecretary/models, as per core.utils.get_models_folder.

PARAMETER	DESCRIPTION
`documents`	Pre-tokenized training corpus. The outer iterable holds documents and each document is an iterable of tokenized sentences (lists of tokens). TYPE: `Iterable[Iterable[list[str]]]`
`name`	Name of the model file used for saving/loading. TYPE: `str` DEFAULT: `'word2vec'`
`vector_size`	Dimensionality of word embeddings. TYPE: `int` DEFAULT: `300`
`epochs`	Number of training iterations. Higher values improve quality on small corpora but increase training time. TYPE: `int` DEFAULT: `200`
`window`	Context window size for word co-occurrence. TYPE: `int` DEFAULT: `5`
`min_count`	Minimum frequency threshold for vocabulary filtering. TYPE: `int` DEFAULT: `5`
`sample`	Subsampling rate for frequent words. TYPE: `float` DEFAULT: `0.0005`
`tokenizer`	Tokenizer instance used for preprocessing (if applicable). TYPE: `Tokenizer \| None` DEFAULT: `None`
`compute_idf`	Whether to compute and store IDF statistics for SIF weighting. See core.nlp.Word2Vec.SIF. Disable to reduce model size when SIF is not used. TYPE: `bool` DEFAULT: `True`
`n_pc_in`	Number of principal components to remove from the word embedding vectors of the input space. See core.nlp.Word2Vec.apply_abtt. TYPE: `int` DEFAULT: `3`
`n_pc_out`	Number of principal components to remove from the word embedding vectors of the output space. See core.nlp.Word2Vec.apply_abtt. TYPE: `int` DEFAULT: `0`
`**kwargs`	Additional parameters forwarded directly to gensim.models.word2vec.Word2Vec. TYPE: `dict[str, Any]` DEFAULT: `{}`

Attributes¤

core.nlp.Word2Vec.tokenizer `instance-attribute` ¤

tokenizer: Tokenizer = tokenizer if tokenizer is not None else Tokenizer()

Tokenizer object, instantiated with word replacements and trained for n-grams if needed.

core.nlp.Word2Vec.pathname `instance-attribute` ¤

pathname: str = get_models_folder(name)

Path and filename of the saved model

core.nlp.Word2Vec.vector_size `instance-attribute` ¤

vector_size: int = vector_size

Number of vector dimensions used to embed words.

core.nlp.FastText ¤

FastText(
    documents: Iterable[Iterable[list[str]]],
    name: str = "fasttext",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count: int = 5,
    sample: float = 0.0005,
    tokenizer: Tokenizer | None = None,
    compute_idf: bool = True,
    n_pc_in: int = 3,
    n_pc_out: int = 0,
    min_n: int = 3,
    max_n: int = 6,
    bucket: int = 2000000,
    **kwargs: dict[str, Any]
)

Bases: WordEmbedding, gensim.models.FastText

Drop-in alternative to core.nlp.Word2Vec backed by gensim.models.fasttext.FastText.

Same API, corpus statistics (IDF/SIF), All-but-the-Top post-processing and vector-retrieval interface as Word2Vec (both inherit core.nlp.WordEmbedding), so it is interchangeable everywhere the search indexer (core.search.Indexer) expects an embedding model. The difference is that word vectors are composed from character n-grams, which gives:

- sensible vectors for out-of-vocabulary tokens (rare domain jargon,
  misspellings, morphological variants) on the query (IN) side,
- robustness to FR/EN morphology, allowing lighter stemming upstream.

Notes

Output (OUT) embeddings exist only for in-vocabulary words, so document vectorization (embed="OUT") still skips OOV tokens.
All-but-the-Top adjusts only the in-vocabulary full-word vectors; OOV vectors reconstructed from sub-word n-grams are left raw.

PARAMETER	DESCRIPTION
`min_n`	smallest character n-gram length for sub-word vectors. TYPE: `int` DEFAULT: `3`
`max_n`	largest character n-gram length for sub-word vectors. TYPE: `int` DEFAULT: `6`
`bucket`	number of hash buckets for sub-word n-grams. TYPE: `int` DEFAULT: `2000000`

Other arguments: see core.nlp.Word2Vec.
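To make `min_n` and `max_n` concrete, the sketch below lists the character n-grams of a word the way FastText-style models usually define them, with `<` and `>` as word-boundary markers. It is a plain-Python illustration, not gensim's code; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

grams = char_ngrams("lens", min_n=3, max_n=4)
```

An out-of-vocabulary word such as a misspelling shares many of these n-grams with its correct form, which is why a vector can still be reconstructed for it on the query side.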

Attributes¤

core.nlp.FastText.tokenizer `instance-attribute` ¤

tokenizer: Tokenizer = tokenizer if tokenizer is not None else Tokenizer()

Tokenizer object, instantiated with word replacements and trained for n-grams if needed.

core.nlp.FastText.pathname `instance-attribute` ¤

pathname: str = get_models_folder(name)

Path and filename of the saved model

core.nlp.FastText.vector_size `instance-attribute` ¤

vector_size: int = vector_size

Number of vector dimensions used to embed words.

core.nlp.Classifier ¤

Classifier(
    training_set: list[Data],
    name: str,
    word2vec: Word2Vec,
    validate: bool = True,
    variant: str = "svm",
)

Bases: nltk.classify.SklearnClassifier

Initialize a Word2Vec + SVM classification pipeline.

This class wraps a Word2Vec embedding model with a downstream machine-learning classifier (SVM or alternatives).

PARAMETER	DESCRIPTION
`training_set`	List of `Data` samples used for training. If empty, the system will attempt to load a pre-trained model using `name`. TYPE: `list[Data]`
`name`	Identifier used to save and reload the trained model. TYPE: `str`
`word2vec`	Word embedding model used to generate feature vectors. TYPE: `Word2Vec`
`validate`	If True, splits the dataset into training (95%) and testing (5%) subsets and prints evaluation metrics. Useful for classifier selection and sanity checking. TYPE: `bool` DEFAULT: `True`
`variant`	Type of classifier to use. `svm` (default): RBF-kernel Support Vector Machine, robust and stable across general datasets. `linear svm`: linear Support Vector Machine, faster and often better for high-dimensional features. `forest`: Random Forest classifier, faster than linear SVM in some cases but produces larger models. TYPE: `str` DEFAULT: `'svm'`

Note

The previous documentation mentioned path and features, but these are not part of the current signature and were removed.

Methods:¤

core.nlp.Classifier.get_features_parallel ¤

get_features_parallel(post: Data) -> tuple[str, str]

Thread-safe wrapper around .get_features(), meant to be mapped with multiprocessing.Pool.

core.nlp.Classifier.load `classmethod` ¤

load(name: str)

Load an existing trained model by its name from the ../models folder.

core.nlp.Classifier.classify ¤

classify(post: str) -> str

Apply a label on a post based on the trained model.

core.nlp.Classifier.prob_classify ¤

prob_classify(post: str) -> tuple[str, float]

Apply a label on a post based on the trained model and output the probability too.

core.nlp.StemTokenIndex ¤

StemTokenIndex(db: sqlite3.Connection, tokenizer: Tokenizer)

Build a reverse-lookup table in db mapping stems to tokens.

The rationale is that core.nlp.Tokenizer.tokenize_text (and the higher-level methods calling it internally), when used with stem=True, produces tokens that are illegible to humans. This class helps build a translation dictionary mapping stemmed tokens back to their most probable non-stemmed token for UI purposes.

RETURNS	DESCRIPTION
`None`	A new indexed `stem_tokens` table in `db` containing 3 columns: `stem`, `token`, `occurences`. Each row records the frequency of a `(stem, token)` pair. TYPE: `None`

Methods:¤

core.nlp.StemTokenIndex.most_probable_token ¤

most_probable_token(db: sqlite3.Connection, stem: str) -> str

Return the most probable original token associated with the stem. If the stem doesn’t exist in the database, it is returned as-is.

core.nlp.StemTokenIndex.most_probable_tokens ¤

most_probable_tokens(db: sqlite3.Connection, stems: list[str]) -> list[str]

Return the most probable original token for each stem.

Stems not found in DB are returned unchanged.
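The reverse lookup can be pictured as a single SQL query on the `stem_tokens` table. The sketch below uses an in-memory SQLite database with made-up rows; the column names follow the table described above, while the query and the standalone function are assumptions about how such a lookup might be written.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Column names follow the documented `stem_tokens` table; the rows are made up.
db.execute("CREATE TABLE stem_tokens (stem TEXT, token TEXT, occurences INTEGER)")
db.executemany(
    "INSERT INTO stem_tokens VALUES (?, ?, ?)",
    [("expos", "exposure", 12), ("expos", "exposed", 3), ("len", "lens", 7)],
)

def most_probable_token(db: sqlite3.Connection, stem: str) -> str:
    """Return the most frequent token for a stem, or the stem itself."""
    row = db.execute(
        "SELECT token FROM stem_tokens WHERE stem = ? "
        "ORDER BY occurences DESC LIMIT 1",
        (stem,),
    ).fetchone()
    return row[0] if row else stem

readable = [most_probable_token(db, s) for s in ["expos", "len", "zzz"]]
```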

Functions:¤

core.nlp.split_url ¤

split_url(url: str) -> tuple[str, str, str, str, str] | None

Split a well-formed URL following RFC3986 into base elements.

RETURNS	DESCRIPTION
`tuple[str, str, str, str, str] \| None`	a tuple of `(protocol, domain, page, parameters, anchor)`. Empty or missing fields are initialized with empty strings, so there is no need for individual `None` checks. If the `url` input doesn’t match a URL format, `None` is returned.
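The shape of the returned tuple can be illustrated with the standard library's `urllib.parse.urlsplit`, used here as a stand-in for the module's own regex. The function name `split_url_sketch` is hypothetical, and details may differ from the real function (for instance, `urlsplit` drops the `?` and `#` delimiters).

```python
from urllib.parse import urlsplit

def split_url_sketch(url: str):
    """Split a URL into (protocol, domain, page, parameters, anchor)."""
    parts = urlsplit(url)
    if not parts.netloc:
        return None  # no domain found: not treated as a URL here
    # Missing fields come back as empty strings, never None.
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

result = split_url_sketch("https://domain.ext/page/subpage?q=x&r=0:1#anchor")
protocol_less = split_url_sketch("//domain.ext/page")
not_a_url = split_url_sketch("plain text")
```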

core.nlp.parse_lang_to_iso639_1 ¤

parse_lang_to_iso639_1(value: str | None) -> str | None

Normalize language identifier to ISO 639-1.

core.nlp.guess_language ¤

guess_language(
    string: str,
    stopwords_threshold: float = 0.05,
    letters_threshold: float = 0.8,
) -> str | None

Basic language guesser based on stopwords detection.

Stopwords are the most common words of a language: for each language, we count how many of its stopwords are found and return the language with the most matches. This is accurate for paragraphs and long documents, less so for short sentences.

PARAMETER	DESCRIPTION
`string`	the string to analyze. Needs to be lowercased but to retain accents and diacritics. TYPE: `str`
`stopwords_threshold`	the minimum ratio of stopwords to total words in `string` required to conclude on a language. For example, Japanese companies often have technical reports written in Japanese but still containing some English. If less than 5% of the words are known English stopwords, we can conclude it’s not English. TYPE: `float` DEFAULT: `0.05`
`letters_threshold`	the minimum ratio of roman (latin) characters among all characters (including numbers, symbols and non-latin alphabets) to be found to conclude on a language. TYPE: `float` DEFAULT: `0.8`

RETURNS	DESCRIPTION
`str \| None`	ISO 639-1 language code. Defaults to “en” if nothing found.
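The stopword-counting idea can be sketched in a few lines. This simplified version uses tiny hypothetical stopword lists, ignores `letters_threshold` and does not implement the fallback to a default language; the function name `guess` is an assumption.

```python
# Tiny hypothetical stopword lists; the real module uses far larger ones.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "a"},
    "fr": {"le", "la", "est", "et", "de"},
}

def guess(string: str, stopwords_threshold: float = 0.05):
    """Return the language with the most stopword hits, or None."""
    words = string.split()
    if not words:
        return None
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(hits, key=hits.get)
    # Reject the guess when stopwords are too rare among all words.
    return best if hits[best] / len(words) >= stopwords_threshold else None

english = guess("the lens is sharp and the sensor is clean")
french = guess("le capteur est propre et la lentille est nette")
undecided = guess("12345 67890")
```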

core.nlp.detect_language ¤

detect_language(text: str) -> str | None

Detect language from arbitrary text safely.

RETURNS	DESCRIPTION
`str \| None`	ISO 639-1 language code.

core.nlp.tokenize_document_to_words ¤

tokenize_document_to_words(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into single words.

PARAMETER	DESCRIPTION
`language`	ISO 639-1 language code. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[str]`	Bag of words for the whole document. Sentence delimiters are removed.

core.nlp.split_document_to_sentences ¤

split_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into a list of sentences.

PARAMETER	DESCRIPTION
`language`	ISO 639-1 language code. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[str]`	List of sentences as full text.

core.nlp.tokenize_document_to_sentences ¤

tokenize_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[list[str]]

Split a text into sentences, each tokenized into single words (a list of lists).

PARAMETER	DESCRIPTION
`language`	ISO 639-1 language code. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[list[str]]`	List of sentences, each sentence is itself a list of words.
