core.nlp ¤

High-level natural language processing module for message-like (emails, comments, posts) input.

Supports automatic language detection, word tokenization and stemming for 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'spanish', 'swedish'.

© 2023 - Aurélien Pierre

Attributes¤

core.nlp.regex_starter module-attribute ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;|\\>)'

Start of line, or start of document, or start of markup

core.nlp.regex_stopper module-attribute ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|\\<)'

End of line, or end of document, or end of markup

core.nlp.end_of_word module-attribute ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

End of word, or end of line, or end of document, or end of markup

core.nlp.regex_algebra module-attribute ¤

regex_algebra = '[\\+\\-\\=\\\\±]'

Algebraic signs

core.nlp.IP_PATTERN module-attribute ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

core.nlp.EMAIL_PATTERN module-attribute ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
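As a minimal sketch, the pattern can be exercised with the standard `re` module. The regular expression is copied from the definition above; the helper name `extract_emails` and the sample string are made up for illustration.

```python
import re

# Same expression as EMAIL_PATTERN above, written as a raw string.
EMAIL_PATTERN = re.compile(
    r"<?([0-9a-z\-\_\+\.]+?@[0-9a-z\-\_\+]+(\.[0-9a-z\_\-]{2,})+)>?",
    re.IGNORECASE,
)


def extract_emails(text: str) -> list[str]:
    """Return every address found in text, without the optional <> wrapper."""
    # Group 1 holds the bare address, whether or not it was enclosed in <>.
    return [match.group(1) for match in EMAIL_PATTERN.finditer(text)]


addresses = extract_emails("From <me@mail.com>, cc: you@mail.example.org")
print(addresses)
```

Reading group 1 rather than the whole match is what strips the angle brackets.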

core.nlp.URL_PATTERN module-attribute ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URL must follow RFC3986, meaning query parameters should be before anchors, if any. Relying on this assumption allows a faster regex parsing.

  • the protocol (ftp, ftps, http, https) is captured as the first group,
  • domain.ext is captured as the second group,
  • /page/etc is the third group, including leading and trailing /,
  • page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
  • anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.

URLs are captured if they are:

  • alone on their own line,
  • enclosed in {}, [], ()
  • enclosed in whitespaces.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name, since () and [] are valid in URL paths and parameters. This pattern works on plain text only: Markdown, XML, HTML and JSON need to be parsed beforehand.

core.nlp.MEMBERS_PATTERN module-attribute ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.

core.nlp.DATE_PATTERN module-attribute ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22, where the whole date is captured in the first group, then each group of digits is captured in order of appearance in the next 3 groups.

core.nlp.TIME_PATTERN module-attribute ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, like:

  • 12h15
  • 12:15
  • 12:15:00
  • 12am
  • 12 am
  • 12 h
  • 12:15:00Z
  • 12:15:00+01
  • 12:15:00 UTC+1
  • 11:27:45+0000
RETURNS DESCRIPTION
0

1- or 2-digits hour,

TYPE: str

1

hour/minutes separator or half-day marker among ["h", ":", "am", "pm"] (case-insensitive)

TYPE: str

2

2-digits minutes, if any, or None

TYPE: str

3

2-digits seconds, if any.

TYPE: str

4

hour marker (h or H), half-day marker (case-insensitive ["am", "pm"]), or time zone marker (case-sensitive ["Z", "UTC"])

TYPE: str

5

1- or 2-digit signed integer timezone shift (relative to UTC).

TYPE: str

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

core.nlp.DOMAIN_PATTERN module-attribute ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from (domain.ext) in the RFC 822 Received headers of emails. The domain is captured in the first group.

core.nlp.UID_PATTERN module-attribute ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

core.nlp.FLAGS_PATTERN module-attribute ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
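The three header patterns above (DOMAIN_PATTERN, UID_PATTERN, FLAGS_PATTERN) can be tried on sample strings as sketched below. The expressions are copied from the definitions; the two header strings are hypothetical, shaped like an IMAP FETCH response and a Received header, and are not taken from the library's tests.

```python
import re

DOMAIN_PATTERN = re.compile(
    r"from ((?:[a-z0-9\-_]{0,61}\.)+[a-z]{2,})", re.IGNORECASE
)
UID_PATTERN = re.compile(r"UID ([0-9]+)")
FLAGS_PATTERN = re.compile(r"FLAGS \((.*?)\)")

# Hypothetical inputs for illustration only.
fetch_response = r"1 (UID 4827 FLAGS (\Seen \Answered))"
received = "Received: from mail.example.org (mail.example.org [192.0.2.1])"

# Each pattern exposes its payload in the first capturing group.
uid = int(UID_PATTERN.search(fetch_response).group(1))
flags = FLAGS_PATTERN.search(fetch_response).group(1).split()
domain = DOMAIN_PATTERN.search(received).group(1)

print(uid, flags, domain)
```

In real code, `search()` can return `None` when the header is missing, so the result should be checked before calling `group()`.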

core.nlp.PATH_PATTERN module-attribute ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

core.nlp.PARTIAL_PATH_REGEX module-attribute ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

This will collide with date detection, so run it after date detection in the pipeline.

core.nlp.RESOLUTION_PATTERN module-attribute ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.

core.nlp.NUMBER_PATTERN module-attribute ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with leading or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

core.nlp.HASH_PATTERN module-attribute ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.

core.nlp.MULTIPLE_LINES module-attribute ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect runs of 2 or more newlines and tabs, possibly mixed with spaces.

core.nlp.MULTIPLE_NEWLINES module-attribute ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.

core.nlp.INTERNAL_NEWLINE module-attribute ¤

INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')

Detect single newline characters nested inside text. Mostly useful for parsed PDF, where line wrapping is quite literal (a newline is used instead of a space).
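A short sketch of how this pattern could be used to undo hard line wrapping. The expression is copied from the definition above; replacing the match with a single space and the helper name `unwrap_lines` are assumptions made for illustration.

```python
import re

INTERNAL_NEWLINE = re.compile(r"(?<=\w)[\n\t\r]{1}(?=\w)")


def unwrap_lines(text: str) -> str:
    """Replace a lone newline or tab sitting between two word characters by a space."""
    return INTERNAL_NEWLINE.sub(" ", text)


unwrapped = unwrap_lines("The quick brown\nfox jumps.\n\nNew paragraph")
print(repr(unwrapped))
```

The paragraph break after "jumps." is left alone because the first newline follows a period and the second follows another newline, so neither sits directly after a word character.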

core.nlp.EXPOSURE module-attribute ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

core.nlp.PHOTOSPEED module-attribute ¤

PHOTOSPEED = re.compile(
    "%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
    flags=re.IGNORECASE,
)

Exposure values in EV or IL

core.nlp.SENSIBILITY module-attribute ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensibility in ISO or ASA

core.nlp.LUMINANCE module-attribute ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or Cd/m²

core.nlp.DIAPHRAGM module-attribute ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

core.nlp.GAIN module-attribute ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

core.nlp.FILE_SIZE module-attribute ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

core.nlp.DISTANCE module-attribute ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

core.nlp.PERCENT module-attribute ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

core.nlp.WEIGHT module-attribute ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

core.nlp.ANGLE module-attribute ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

core.nlp.TEMPERATURE module-attribute ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

core.nlp.FREQUENCY module-attribute ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

core.nlp.TEXT_DATES module-attribute ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual date formats:

  • English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
  • French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
RETURNS DESCRIPTION
0

2 digits (day number or year number, depending on language)

TYPE: str

1

month (full-form or abbreviated)

TYPE: str

2

2 digits (day number or year number, depending on language)

TYPE: str

3

4 digits (full year)

TYPE: str
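The sketch below runs the expression on a date followed by a time, to show how the negative lookahead keeps the hour out of the match. The pattern is rebuilt from the definition above; the `MONTHS` helper only splits the long alternation across lines and is not part of the module. Note that the RETURNS entries 0 to 3 correspond to regex groups 1 to 4, and that the last two groups keep their leading space.

```python
import re

# Hypothetical helper: same alternation as in TEXT_DATES, split for readability.
MONTHS = (
    "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|"
    "jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|"
    "janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|"
    "january|february|march|april|may|june|july|august|september|october|november|december"
)
TEXT_DATES = re.compile(
    r"([0-9]{1,2})? (" + MONTHS + r")\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

match = TEXT_DATES.search("Sent on 01 Jan 20 12:08")
day, month, middle, year = match.groups()

# The " 12" of the time cannot be taken as a year because it is followed by ":".
print(match.group(0), (day, month, middle, year.strip()))

french = TEXT_DATES.search("Le 3 février 2021")
```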

core.nlp.BASE_64 module-attribute ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64 encoding

core.nlp.BB_CODE module-attribute ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]

core.nlp.MARKUP module-attribute ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]

core.nlp.USER module-attribute ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails

core.nlp.REPEATED_CHARACTERS module-attribute ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character repeated more than 9 times

core.nlp.UNFINISHED_SENTENCES module-attribute ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences ending with 2 newline characters without closing punctuation

core.nlp.MULTIPLE_DOTS module-attribute ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies dots repeated twice or more

core.nlp.MULTIPLE_DASHES module-attribute ¤

MULTIPLE_DASHES = re.compile('[-~]{1,}')

Identifies runs of one or more dashes or tildes

core.nlp.MULTIPLE_QUESTIONS module-attribute ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of one or more question marks

core.nlp.ORDINAL_FR module-attribute ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)

core.nlp.FRANCAIS module-attribute ¤

FRANCAIS = re.compile(
    "%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

core.nlp.DASHES module-attribute ¤

DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.

core.nlp.ALTERNATIVES module-attribute ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or mr/mrs

core.nlp.PLURAL_S module-attribute ¤

PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

core.nlp.FEMININE_E module-attribute ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

core.nlp.DOUBLE_CONSONANTS module-attribute ¤

DOUBLE_CONSONANTS = re.compile(
    "(?<=\\w{2,})([bcfghjklmnpqrstvwxz])\\1", re.IGNORECASE
)

Identify double consonants in the middle of words.

core.nlp.FEMININE_TRICE module-attribute ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice.

core.nlp.ADVERB_MENT module-attribute ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

core.nlp.SUBSTANTIVE_TION module-attribute ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

core.nlp.SUBSTANTIVE_AT module-attribute ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

core.nlp.PARTICIPLE_ING module-attribute ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing

core.nlp.ADJECTIVE_ED module-attribute ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed

core.nlp.ADJECTIVE_TIF module-attribute ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

core.nlp.SUBSTANTIVE_Y module-attribute ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

core.nlp.VERB_IZ module-attribute ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American verbs ending in -iz that French and Brits write in -is

core.nlp.STUFF_ER module-attribute ¤

STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

core.nlp.BRITISH_OUR module-attribute ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

core.nlp.SUBSTANTIVE_ITY module-attribute ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

core.nlp.SUBSTANTIVE_IST module-attribute ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

core.nlp.SUBSTANTIVE_IQU module-attribute ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu

core.nlp.SUBSTANTIVE_EUR module-attribute ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur

core.nlp.HYPHENIZED module-attribute ¤

HYPHENIZED = re.compile('(?<=\\w{3,})[-–—]+ *[\\n\\r]{1,2}(?=\\w)')

Detect hyphenated words at the end of a PDF text line.

core.nlp.WAYBACK_RE module-attribute ¤

WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')

Find the canonical URL from web.archive.org (Wayback Machine) URLs
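A minimal usage sketch, with the expression copied from the definition above. The helper name `canonical_url` and its fallback (returning the input unchanged when it is not a Wayback Machine URL) are assumptions, not documented behaviour.

```python
import re

WAYBACK_RE = re.compile(r"https?://web\.archive\.org/web/[^/]+/(https?://.+)")


def canonical_url(url: str) -> str:
    """Return the archived page's original URL, or the input if it is not a Wayback URL."""
    match = WAYBACK_RE.match(url)
    return match.group(1) if match else url


archived = "https://web.archive.org/web/20230101000000/https://example.com/page"
print(canonical_url(archived))
```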

core.nlp.LANG_MAP module-attribute ¤

LANG_MAP = {
    "en": "english",
    "fr": "french",
    "de": "german",
    "es": "spanish",
    "it": "italian",
    "pt": "portuguese",
    "nl": "dutch",
    "sv": "swedish",
    "no": "norwegian",
    "da": "danish",
    "fi": "finnish",
    "ru": "russian",
    "ro": "romanian",
    "hu": "hungarian",
    "tr": "turkish",
}

Map ISO 639-1 language codes of supported languages to their full name, as used by pre-trained corpora

core.nlp.LANG_MAP_REVERSE module-attribute ¤

LANG_MAP_REVERSE = {v: k for k, v in (LANG_MAP.items())}

Map the full name of supported languages, as used by pre-trained corpora, to ISO 639-1 language codes

core.nlp.STOPWORDS_DICT module-attribute ¤

STOPWORDS_DICT = {
    language: (set(STOPWORDS_DICT[language])) for language in STOPWORDS_DICT
}

Dictionary of stopwords (as set values) mapped to full language names (as keys)

Classes¤

core.nlp.Lexicon dataclass ¤

Lexicon(counts: Counter[str] = Counter())

Mutable token frequency index with canonicalization helpers for:

  • malformed n-grams,
  • merged/split variants,
  • plural compound normalization.

Examples:

liber_tarian -> libertarian

etres_humains -> etre_humain

Functions¤
core.nlp.Lexicon.update ¤
update(corpus: Iterable[Iterable[str]]) -> None

Update token frequencies from a corpus of tokenized sentences.

PARAMETER DESCRIPTION
corpus

Iterable of tokenized sentences: [["this", "is", "a", "sentence"], ["another", "sentence"]]

TYPE: Iterable[Iterable[str]]

core.nlp.Lexicon.frequency ¤
frequency(token: str) -> int

Return token frequency.

core.nlp.Lexicon.exists ¤
exists(token: str) -> bool

Check whether a token exists in the lexicon.

core.nlp.Lexicon.prune ¤
prune(min_count: int = 10) -> None

Remove all entries whose frequency is lower than min_count.

PARAMETER DESCRIPTION
min_count

Minimum frequency to keep.

TYPE: int DEFAULT: 10
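The update, frequency, exists and prune operations described above map naturally onto a `collections.Counter`. The sketch below is a stand-alone illustration written as free functions over a module-level counter; it is not the Lexicon implementation, only one plausible way to obtain the documented behaviour.

```python
from collections import Counter
from collections.abc import Iterable

counts = Counter()


def update(corpus: Iterable[Iterable[str]]) -> None:
    """Add the tokens of every tokenized sentence to the frequency index."""
    for sentence in corpus:
        counts.update(sentence)


def prune(min_count: int = 10) -> None:
    """Drop every token seen fewer than min_count times."""
    # Collect the keys first: a dict cannot be resized while iterating over it.
    for token in [t for t, n in counts.items() if n < min_count]:
        del counts[token]


update([["this", "is", "a", "sentence"], ["another", "sentence"]])
frequency_before = counts["sentence"]  # frequency lookup
exists_before = "another" in counts    # existence check
prune(min_count=2)
print(dict(counts))
```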

core.nlp.Lexicon.resolve_token ¤
resolve_token(token: str, separator: str = '_', min_ratio: float = 1.0) -> str

Attempt to canonicalize malformed n-grams.

Operations:

  1. malformed n-grams: liber_tarian -> libertarian
  2. plural compound reduction: etres_humains -> etre_humain

Strategy:

  • if the token already exists, keep it,
  • otherwise:
      • remove separators,
      • check if the merged variant exists,
      • compare frequencies,
      • prefer the merged form if sufficiently frequent.

PARAMETER DESCRIPTION
token

Token to canonicalize.

TYPE: str

separator

N-gram separator.

TYPE: str DEFAULT: '_'

min_ratio

Require merged token frequency to be at least min_ratio times the split variant frequency.

Helps avoid false positives.

TYPE: float DEFAULT: 1.0

RETURNS DESCRIPTION
str

Canonicalized token.
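The merge strategy described above can be sketched as follows, assuming the frequencies live in a `collections.Counter`. This is an illustration of the documented steps only: it takes the counter as an explicit first argument, does not cover plural compound reduction, and may differ from the actual method in how frequencies are compared.

```python
from collections import Counter


def resolve_token(
    counts: Counter, token: str, separator: str = "_", min_ratio: float = 1.0
) -> str:
    """Prefer the separator-free variant of an unknown token when it is frequent enough."""
    # Known tokens are kept untouched.
    if token in counts:
        return token

    merged = token.replace(separator, "")
    # A Counter returns 0 for unknown keys, so an unseen split variant
    # never blocks the merge as long as the merged form is known.
    if merged in counts and counts[merged] >= min_ratio * counts[token]:
        return merged
    return token


lexicon = Counter({"libertarian": 12, "new_york": 5})
print(resolve_token(lexicon, "liber_tarian"))
```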

core.nlp.Lexicon.canonicalize_sentence ¤
canonicalize_sentence(
    sentence: list[str], separator: str = "_", min_ratio: float = 1.0
) -> list[str]

Canonicalize all tokens in a sentence.

core.nlp.Tokenizer ¤

Tokenizer(
    meta_tokens: dict[re.Pattern, str] | None = None,
    abbreviations: dict[str, str] | None = None,
    replacements: dict[str, str] | None = None,
    stopwords: set[str] | None = None,
    lang_stopwords: dict[str, set[str]] | None = None,
    backend: str = "blingfire",
)

Pre-processing pipeline and tokenizer.

Splits a string into normalized word tokens after applying a series of configurable text transformations.

PARAMETER DESCRIPTION
meta_tokens

Pipeline of regular-expression substitutions used to replace document fragments with meta-tokens.

Keys must be compiled re.Pattern objects and values must be meta-token strings, typically enclosed in underscores.

Transformations are applied in declaration order. This relies on Python’s ordered dictionaries (Python 3.7+).

If not provided, a default pipeline suitable for bilingual English/French technical documents is used.

TYPE: dict[re.Pattern, str] | None DEFAULT: None

abbreviations

Pipeline of abbreviation replacements as a {to_replace: replacement} dictionary.

Replacements are applied in declaration order.

TYPE: dict[str, str] | None DEFAULT: None

replacements

Dictionary of token-level substitutions applied as {key: value} string replacements.

TYPE: dict[str, str] | None DEFAULT: None

stopwords

Language-agnostic stopwords to remove from the token stream.

TYPE: set[str] | None DEFAULT: None

lang_stopwords

Language-specific stopwords.

Keys must be ISO 639-1 language codes and values must be sets of stopwords associated with each language.

TYPE: dict[str, set[str]] | None DEFAULT: None

backend

Tokenization backend to use.

Supported values are:

  • "blingfire": Microsoft BlingFire tokenizer (pattern-based).
  • "nltk": NLTK Punkt tokenizer.

TYPE: str DEFAULT: 'blingfire'

Attributes¤
core.nlp.Tokenizer.characters_cleanup class-attribute instance-attribute ¤
characters_cleanup: dict[(re.Pattern) : str] = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}

Dictionary of regular expressions (keys) to find and replace by the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBcode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).
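To illustrate how such a dictionary can drive a cleanup pass, here is a sketch using three of the patterns listed above, copied from their definitions. The loop and the function name `clean_characters` are assumptions about how the mapping is consumed, not the method the class actually uses.

```python
import re

MULTIPLE_DOTS = re.compile(r"\.{2,}")
MULTIPLE_DASHES = re.compile(r"[-~]{1,}")
MULTIPLE_QUESTIONS = re.compile(r"\?{1,}")

# Subset of characters_cleanup; dict order is insertion order.
cleanup = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
}


def clean_characters(text: str) -> str:
    """Apply each substitution in declaration order."""
    for pattern, replacement in cleanup.items():
        text = pattern.sub(replacement, text)
    return text


cleaned = clean_characters("Wait.... what???? ok --- bye")
print(cleaned)
```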

core.nlp.Tokenizer.internal_meta_tokens class-attribute instance-attribute ¤
internal_meta_tokens: dict[(re.Pattern) : str] = {
    HASH_PATTERN_FAST: "_HASH_",
    NUMBER_PATTERN_FAST: "_NUMBER_",
}

Dictionary of regular expressions (keys) to find in full tokens and replace by meta-tokens. Uses simplified regex patterns for performance.

core.nlp.Tokenizer.abbreviations instance-attribute ¤
abbreviations = abbreviations

Abbreviations and contractions to replace in full documents

core.nlp.Tokenizer.replacements instance-attribute ¤
replacements = replacements

Arbitrary string replacements in single tokens

core.nlp.Tokenizer.stopwords instance-attribute ¤
stopwords = set(stopwords) if stopwords else None

Language-agnostic stopwords

core.nlp.Tokenizer.lang_stopwords instance-attribute ¤
lang_stopwords = lang_stopwords

Language-specific stopwords

core.nlp.Tokenizer.supports_ngrams instance-attribute ¤
supports_ngrams: bool = False

Whether or not the tokenizer has an embedded n-grams model

core.nlp.Tokenizer.ngrams_trie instance-attribute ¤
ngrams_trie = {}

Prefix tree of known n-grams for efficient lookups

core.nlp.Tokenizer.vocabulary instance-attribute ¤
vocabulary: Lexicon = Lexicon()

Known tokens, if trained for n-grams.

Functions¤
core.nlp.Tokenizer.prefilter ¤
prefilter(string: str, meta_tokens: bool = True) -> str

Tokenizers split words based on unsupervised machine-learned models, which sometimes behave unexpectedly. For example, in emails and user handles like @user, they would split @ and user into 2 different tokens, making it impossible to detect usernames in single tokens later.

To avoid that, we replace data of interest by meta-tokens before the tokenization, with regular expressions.
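The idea can be sketched with two of the patterns documented in this module, copied from their definitions. The meta-token names `_USER_` and `_RESOLUTION_` and the stand-alone `prefilter` function are hypothetical; the default pipeline and its meta-token names are not shown on this page.

```python
import re

USER = re.compile(r"([\w\-\+\.]+)?@([\w\-\+\.]+)|(user\-?\d+)")
RESOLUTION_PATTERN = re.compile(r"\d+(?:×|x|X)\d+")

# Hypothetical pipeline: compiled pattern -> meta-token, applied in declaration order.
META_TOKENS = {
    USER: "_USER_",
    RESOLUTION_PATTERN: "_RESOLUTION_",
}


def prefilter(string: str) -> str:
    """Replace fragments of interest by meta-tokens before tokenization."""
    for pattern, meta_token in META_TOKENS.items():
        string = pattern.sub(meta_token, string)
    return string


filtered = prefilter("ping @alice about the 1920x1080 export")
print(filtered)
```

Because the handle is already a single opaque token when the tokenizer runs, it cannot be split into @ and alice.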

core.nlp.Tokenizer.lemmatize ¤
lemmatize(word: str) -> str

Find the root (lemma) of words to help topical generalization.

core.nlp.Tokenizer.normalize_text ¤
normalize_text(document: str) -> str

Prepare text for tokenization by converting it to lowercase ASCII characters.

This will lose accents, diacritics and capitals, which means some nuance will be lost for the benefit of generality. If this does not suit your use case, you may subclass Tokenizer and re-implement this method.
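For reference, one common way to obtain lowercase ASCII text with the standard library is sketched below. It is not necessarily what normalize_text does internally; the function name `to_lowercase_ascii` is made up for this example.

```python
import unicodedata


def to_lowercase_ascii(document: str) -> str:
    """Strip diacritics, drop remaining non-ASCII characters, and lowercase."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", document)
    # Encoding with errors="ignore" drops the combining marks and anything
    # else that has no ASCII equivalent.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


normalized = to_lowercase_ascii("Décembre À Paris")
print(normalized)
```

Characters without a decomposition, such as "ß" or "œ", are dropped entirely by this approach, which is one reason a subclass might want a different strategy.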

core.nlp.Tokenizer.normalize_token ¤
normalize_token(
    word: str,
    language: str | None,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> str | None

Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.), punctuation and the like are replaced by None, which should be filtered out at the next step.

PARAMETER DESCRIPTION
word

tokenized word in lower case only.

TYPE: str

language

the ISO 639-1 language code used to remove typical stopwords.

TYPE: str | None

normalize

remove punctuation and leading/trailing symbols.

TYPE: bool DEFAULT: True

meta_tokens

replace string patterns by meta_tokens

TYPE: bool DEFAULT: True

stem

remove word suffixes, double consonants, etc.

TYPE: bool DEFAULT: True

remove_stopwords

remove stopwords

TYPE: bool DEFAULT: True

NOTE

Tokenization is non-destructive (full sentences can be reconstructed entirely from token lists) if normalize=False, meta_tokens=False, stem=False and remove_stopwords=False. In this setting, only the 1:1 token replacements defined in self.replacements are applied, which makes it possible to replace abbreviations or acronyms. Other modes start generalizing semantics by removing meaning.

Examples:

Meta-tokens: 10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token. feb, February, feb., monday will all be replaced by a _DATE_ meta-token.

core.nlp.Tokenizer.tokenize_text ¤
tokenize_text(
    sentence: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

Split text into normalized word tokens and meta-tokens.

No sentence or paragraph boundary detection is performed.

PARAMETER DESCRIPTION
sentence

Input text to tokenize.

TYPE: str

n_grams

Whether to detect and collapse n-grams.

Requires a trained n-gram model generated with train_ngrams().

TYPE: bool DEFAULT: True

Note

The parameters language, normalize, meta_tokens, stem, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
list[str]

List of normalized tokens represented as a bag of words.

core.nlp.Tokenizer.post_filter_tokens ¤
post_filter_tokens(
    tokens: list[str],
    language: str | None = None,
    meta_tokens: bool = True,
    stem: bool = False,
    normalize: bool = False,
    remove_stopwords: bool = False,
) -> list[str]

Apply post-processing operations to an existing token stream.

This method applies token normalization, stemming, stopword removal, and meta-token handling without performing tokenization.

PARAMETER DESCRIPTION
tokens

List of input tokens to process.

TYPE: list[str]

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
list[str]

List of processed tokens.

core.nlp.Tokenizer.tokenize_document_flat ¤
tokenize_document_flat(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[str]

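Cleanup and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
  • cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.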

Note

the language is detected internally if not provided as an optional argument. When processing a single sentence extracted from a document, instead of the whole document, it is more accurate to run the language detection on the whole document, ahead of calling this method, and pass on the result here.

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

n_grams

see core.nlp.Tokenizer.tokenize_text

TYPE: bool DEFAULT: True

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
tokens

a 1D list of normalized tokens and meta-tokens.

TYPE: list[str]

core.nlp.Tokenizer.tokenize_document_per_sentence ¤
tokenize_document_per_sentence(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

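Cleanup and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
  • cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.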

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

n_grams

see core.nlp.Tokenizer.tokenize_text

TYPE: bool DEFAULT: True

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
tokens

a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

TYPE: list[list[str]]

core.nlp.Tokenizer.tokenize_document_per_paragraph ¤
tokenize_document_per_paragraph(
    document: str,
    language: str | None = None,
    n_grams: bool = True,
    normalize: bool = True,
    meta_tokens: bool = True,
    stem: bool = True,
    remove_stopwords: bool = True,
) -> list[list[str]]

Cleanup and tokenize a whole document as a list of paragraphs, meaning we split it on paragraph breaks (such as double newlines) before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with core.utils.typography_undo,
  • cleaned up for sequences of whitespaces with core.utils.clean_whitespaces.

Note

the language is detected internally if not provided. The text is prefiltered with self.prefilter.

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

n_grams

see core.nlp.Tokenizer.tokenize_text

TYPE: bool DEFAULT: True

Note

The parameters language, meta_tokens, stem, normalize, and remove_stopwords are forwarded to normalize_token() and have the same meaning.

RETURNS DESCRIPTION
tokens

a 2D list of paragraphs (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

TYPE: list[list[str]]

core.nlp.Tokenizer.load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

core.nlp.Tokenizer.members_from_ngram ¤
members_from_ngram(token: str | None) -> list[str] | None

Recover n-grams members from a single tokenized phrase, separated with _. This expects lower-case tokens, except for meta-tokens which are expected capitalized.

RETURNS DESCRIPTION
list[str] | None

the list of n-gram members, or None if the token was not an n-gram but a singleton.

core.nlp.Tokenizer.train_ngrams ¤
train_ngrams(
    sentences: list[str],
    connector_words: str = "",
    min_count: int = 10,
    threshold: float = 0.7,
    scoring: str = "npmi",
)

Train an n-gram model (bigrams and trigrams).

Detects common phrases such as “New York City” and merges them into single tokens using a statistical phrase model.

PARAMETER DESCRIPTION
sentences

Training corpus. Must be a list of tokenized sentences.

TYPE: list[str]

connector_words

Space-separated list of connector words allowed inside phrases (e.g. “by” in “piece by piece”).

These words are treated as valid bridges when forming n-grams.

TYPE: str DEFAULT: ''

min_count

Minimum number of occurrences required for a phrase to be considered.

See gensim.models.phrases.Phrases for details.

TYPE: int DEFAULT: 10

threshold

Phrase detection sensitivity threshold.

See gensim.models.phrases.Phrases.

TYPE: float DEFAULT: 0.7

scoring

Scoring function used for phrase detection.

See gensim.models.phrases.Phrases.

TYPE: str DEFAULT: 'npmi'

Warning

N-gram training must be performed on lightly processed tokenized sentences. Do not apply stemming, stopword removal, or punctuation stripping before training.

See Tokenizer.normalize_token() for required preprocessing options.

Note
  • Writes an ngrams log file in the models directory containing discovered phrases.
  • Can be executed multiple times (e.g. per language); results are appended to the existing model.
core.nlp.Tokenizer.compile_ngrams ¤
compile_ngrams(ngrams: list[str])

Build a nested n-grams dictionary for efficient querying, like:

{
    "new": {
        "york": {
            "__value__": "new_york",
            "city": {
                "__value__": "new_york_city"
            }
        }
    }
}
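A nested dictionary of this shape can be built from collapsed n-grams as sketched below. The free function and its naive split on the separator are assumptions made for the example; in particular, meta-tokens that themselves contain underscores would need the dedicated handling of members_from_ngram.

```python
def compile_ngrams(ngrams: list[str], separator: str = "_") -> dict:
    """Build a prefix tree where each complete n-gram is stored under "__value__"."""
    trie: dict = {}
    for ngram in ngrams:
        node = trie
        # Walk down one level per member, creating nodes as needed.
        for member in ngram.split(separator):
            node = node.setdefault(member, {})
        # Mark the end of a known n-gram with its collapsed form.
        node["__value__"] = ngram
    return trie


trie = compile_ngrams(["new_york", "new_york_city"])
print(trie)
```

Storing the collapsed form at the terminal node lets a lookup return the final token directly instead of re-joining the members.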

core.nlp.Tokenizer.replace_ngrams ¤
replace_ngrams(tokens: list[str]) -> list[str]

Identify n-grams among tokens and collapse them into single tokens. N-grams should have been trained before, with core.nlp.Tokenizer.train_ngrams.

RETURNS DESCRIPTION
list[str]

the collapsed list of strings, or the original list if no n-gram was found or the n-grams model has not been trained.
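
As an illustration of the collapsing step, the sketch below scans the tokens left to right and keeps the longest n-gram starting at each position. The greedy longest-match strategy and the name collapse_ngrams are assumptions; the real method may resolve overlaps differently.

```python
def collapse_ngrams(tokens, trie):
    """Replace known n-grams by their collapsed form, longest match first."""
    out = []
    i = 0
    while i < len(tokens):
        node = trie
        best, best_len = None, 0  # longest match starting at position i
        j = i
        while j < len(tokens) and isinstance(node.get(tokens[j]), dict):
            node = node[tokens[j]]
            j += 1
            if "__value__" in node:
                best, best_len = node["__value__"], j - i
        if best is None:
            out.append(tokens[i])  # no n-gram starts here, keep the token
            i += 1
        else:
            out.append(best)
            i += best_len
    return out


# Hand-written trie in the nested format produced by compile_ngrams
trie = {
    "new": {
        "york": {
            "__value__": "new_york",
            "city": {"__value__": "new_york_city"},
        }
    }
}
tokens = ["i", "love", "new", "york", "city", "and", "new", "york"]
collapsed = collapse_ngrams(tokens, trie)
```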

core.nlp.Tokenizer.lookup_ngram ¤
lookup_ngram(members: list[str] | tuple[str, ...]) -> str | None

Lookup an n-gram in the trie from its token members.

PARAMETER DESCRIPTION
members

the tokens iterable

TYPE: list[str] | tuple[str, ...]

RETURNS DESCRIPTION
str | None

the collapsed n-gram if found in the trie, or None if the input members match no known n-gram.

Example

lookup_ngram(("new", "york")) -> "new_york"

lookup_ngram(("new", "york", "city")) -> "new_york_city"

lookup_ngram(("foo", "bar")) -> None

core.nlp.Data ¤

Data(text: str, label: str)

Represent an item of tagged training data.

PARAMETER DESCRIPTION
text

the content to label, which will be vectorized

TYPE: str

label

the category of the content, which will be predicted by the model

TYPE: str

core.nlp.LossLogger ¤

LossLogger()

Bases: CallbackAny2Vec

Output loss at each epoch

core.nlp.Word2Vec ¤

Word2Vec(
    documents: list[list[str]],
    name: str = "word2vec",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count: int = 5,
    sample: float = 0.0005,
    tokenizer: Tokenizer = None,
    compute_idf: bool = False,
    **kwargs: dict[str, Any]
)

Bases: gensim.models.Word2Vec

Train, re-train, or load a Word2Vec embedding model.

If a model with the given name already exists, it is automatically loaded instead of re-trained. Note that in this case, vector_size will be overridden by the saved model configuration.

PARAMETER DESCRIPTION
documents

Pre-tokenized training corpus.

Structure:
  • outer list: documents
  • inner list: tokenized sentences

TYPE: list[list[str]]

name

Name of the model file used for saving/loading.

TYPE: str DEFAULT: 'word2vec'

vector_size

Dimensionality of word embeddings.

TYPE: int DEFAULT: 300

epochs

Number of training iterations.

Higher values improve quality on small corpora but increase training time.

TYPE: int DEFAULT: 200

window

Context window size for word co-occurrence.

TYPE: int DEFAULT: 5

min_count

Minimum frequency threshold for vocabulary filtering.

TYPE: int DEFAULT: 5

sample

Subsampling rate for frequent words.

TYPE: float DEFAULT: 0.0005

tokenizer

Tokenizer instance used for preprocessing (if applicable).

TYPE: Tokenizer DEFAULT: None

compute_idf

Whether to compute and store IDF statistics for SIF weighting.

Disable to reduce model size when SIF is not used.

TYPE: bool DEFAULT: False

**kwargs

Additional parameters forwarded directly to gensim.models.Word2Vec.

TYPE: dict[str, Any] DEFAULT: {}

Attributes¤
core.nlp.Word2Vec.tokenizer instance-attribute ¤
tokenizer = tokenizer if tokenizer is not None else Tokenizer()

Tokenizer used to train the model. It is stored so that the same tokenizer is applied when the model is used later.

core.nlp.Word2Vec.N_docs instance-attribute ¤
N_docs = len(documents)

Number of documents in the training corpus

core.nlp.Word2Vec.N_sentences instance-attribute ¤
N_sentences = len(sentences)

Number of sentences in the training corpus

core.nlp.Word2Vec.N_words instance-attribute ¤
N_words = len(words)

Number of words (tokens) in the training corpus

core.nlp.Word2Vec.N_terms instance-attribute ¤
N_terms = len(counts)

Number of terms (unique words) in the training corpus

core.nlp.Word2Vec.idf instance-attribute ¤
idf: dict[str, float] | None = None

Inverse Document Frequency, used only for SIF weighting when enabled.

core.nlp.Word2Vec.avg_doc_len instance-attribute ¤
avg_doc_len: float | None = None

Average number of words in documents of the training corpus, available with IDF stats.

Functions¤
core.nlp.Word2Vec.compute_idf ¤
compute_idf(documents: list[list[str]]) -> None

Compute and store IDF statistics from a tokenized document corpus.
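
For reference, IDF statistics over a tokenized corpus can be sketched as below. The exact formula used by the class is not documented here, so the plain ln(N / df) variant, the function name idf_stats and the returned tuple are assumptions made for illustration.

```python
import math
from collections import Counter


def idf_stats(documents):
    """Return (idf, avg_doc_len) for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    avg_doc_len = sum(len(doc) for doc in documents) / n_docs if n_docs else 0.0
    return idf, avg_doc_len


idf, avg_doc_len = idf_stats([["a", "b"], ["a", "c", "c"]])
```

Terms present in every document get an IDF of 0, rare terms get the largest values.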

core.nlp.Word2Vec.update_idf ¤
update_idf(documents: list[list[str]]) -> None

Update IDF statistics and corpus-dependent metadata with new documents.

PARAMETER DESCRIPTION
documents

New pre-tokenized documents.

TYPE: list[list[str]]

core.nlp.Word2Vec.prune_idf ¤
prune_idf()

Prune IDF entries to the actual model vocabulary (remove tokens that were filtered out by gensim during super().__init__).

core.nlp.Word2Vec.load_model classmethod ¤
load_model(name: str)

Load a trained model saved in the models folder.

core.nlp.Word2Vec.get_word ¤
get_word(word: str) -> str | None

Find out whether word is in the dictionary, optionally attempting spell-checking if it is not found.

PARAMETER DESCRIPTION
word

word to find

TYPE: str

RETURNS DESCRIPTION
str | None
  • the original word if found in the dictionary,
  • None if no match was found, including after the optional spell-checking attempt.
core.nlp.Word2Vec.get_wordvec ¤
get_wordvec(
    word: str, embed: str = "IN", normalize: bool = True
) -> np.ndarray[np.float32] | None

Return the vector associated with a word, looked up through the dictionary of words.

PARAMETER DESCRIPTION
word

the word to convert to a vector.

TYPE: str

embed

TYPE: str DEFAULT: 'IN'


  1. A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana https://arxiv.org/pdf/1602.01137.pdf 

RETURNS DESCRIPTION
np.ndarray[np.float32] | None

the nD vector if the word was found in the dictionary, or None.

core.nlp.Word2Vec.get_features ¤
get_features(
    tokens: list[str],
    embed: str = "IN",
    use_sif: bool = False,
    sif_smoothing: float = 0.001,
) -> np.ndarray[np.float32]

Calls core.nlp.Word2Vec.get_wordvec over a list of tokens and returns a single vector representing the whole list.

PARAMETER DESCRIPTION
tokens

list of text tokens.

TYPE: list[str]

embed

TYPE: str DEFAULT: 'IN'

use_sif

Use SIF weighting on each term when embedding a full sentence or document. See core.nlp.Word2Vec.SIF.

TYPE: bool DEFAULT: False

sif_smoothing

The SIF smoothing coefficient.

TYPE: float DEFAULT: 0.001

RETURNS DESCRIPTION
np.ndarray[np.float32]

the normalized centroid of the word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
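
The normalized centroid can be sketched with plain Python lists as below. This is an illustration only: the function name centroid, the vectors mapping and the explicit dim argument are hypothetical, and the real method works on NumPy arrays and supports SIF weighting.

```python
import math


def centroid(tokens, vectors, dim):
    """Average the vectors of known tokens, then scale the result to unit length."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim  # null vector: no token is in the vocabulary
    mean = [sum(v[k] for v in found) / len(found) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean


# Toy 2D embeddings, hypothetical values
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
features = centroid(["cat", "dog", "unicorn"], vectors, dim=2)
```

Unknown tokens are simply skipped, so they do not pull the centroid toward the origin.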

core.nlp.Word2Vec.SIF ¤
SIF(token: str, a: float = 0.001) -> float

Smooth inverse frequency weighting

Taken from A simple but tough-to-beat baseline for sentence embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx

This helps refine semantics by under-weighting stopwords. However, it is unsuited for File Information Retrieval (search engines) because it over-smooths the embedding space geometry and hinders relevance discrimination with regard to a query.

PARAMETER DESCRIPTION
token

the token to weight. It should be in the model vocabulary.

TYPE: str

Return

The SIF weight associated with the token, or 0.0 if the token was not found in the vocabulary.
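
In the cited paper, the weight of a word w is a / (a + p(w)), where p(w) is the relative frequency of w in the corpus. The sketch below applies that formula to raw counts; the function name sif_weight and the counts/total arguments are hypothetical, and the class may derive p(w) from its own stored statistics instead.

```python
def sif_weight(token, counts, total, a=0.001):
    """Smooth inverse frequency: a / (a + p(token)), or 0.0 for unknown tokens."""
    count = counts.get(token)
    if not count:
        return 0.0
    return a / (a + count / total)


# Toy corpus statistics, hypothetical values
counts = {"the": 500, "darktable": 1}
total = 1000
w_common = sif_weight("the", counts, total)
w_rare = sif_weight("darktable", counts, total)
```

Frequent words get a weight close to 0, while rare words keep a much larger share of the sentence vector.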

core.nlp.Word2Vec.tokens_to_indices ¤
tokens_to_indices(tokens: list[str]) -> np.ndarray[np.int32]

Convert a list of tokens to a list of their index number in the Word2Vec vocabulary. This yields a more compact, albeit purely symbolic, representation of a tokenized document as a series of integers.

The conversion is reversible and the original token can be found with self.wv.index_to_key[i], where i is the index number output (for each token) from here.

Return

the list of indices as 32-bit integers. Since np.int32 is signed, the Word2Vec vocabulary needs to contain fewer than 2.15 billion words.
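
The round trip between tokens and indices can be sketched with two plain lookups that mimic the index_to_key list and key_to_index mapping of gensim keyed vectors. The toy vocabulary is hypothetical, and dropping out-of-vocabulary tokens is an assumption of this sketch, not documented behavior.

```python
# Toy vocabulary standing in for self.wv.index_to_key
index_to_key = ["the", "lens", "sensor"]
key_to_index = {key: i for i, key in enumerate(index_to_key)}


def to_indices(tokens):
    """Map tokens to vocabulary indices (assumption: unknown tokens are dropped)."""
    return [key_to_index[t] for t in tokens if t in key_to_index]


def to_tokens(indices):
    """Reverse mapping: recover each token from its index."""
    return [index_to_key[i] for i in indices]


indices = to_indices(["sensor", "the", "bokeh"])
restored = to_tokens(indices)
```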

core.nlp.Classifier ¤

Classifier(
    training_set: list[Data],
    name: str,
    word2vec: Word2Vec,
    validate: bool = True,
    variant: str = "svm",
)

Bases: nltk.classify.SklearnClassifier

Initialize a Word2Vec + SVM classification pipeline.

This class wraps a Word2Vec embedding model with a downstream machine-learning classifier (SVM or alternatives).

PARAMETER DESCRIPTION
training_set

List of Data samples used for training.

If empty, the system will attempt to load a pre-trained model using name.

TYPE: list[Data]

name

Identifier used to save and reload the trained model.

TYPE: str

word2vec

Word embedding model used to generate feature vectors.

TYPE: Word2Vec

validate

If True, splits the dataset into training (95%) and testing (5%) subsets and prints evaluation metrics.

Useful for classifier selection and sanity checking.

TYPE: bool DEFAULT: True

variant

Type of classifier to use:

  • svm: RBF-kernel Support Vector Machine (default). Robust and stable across general datasets.

  • linear svm: Linear Support Vector Machine. Faster and often better for high-dimensional features.

  • forest: Random Forest classifier. Faster than linear SVM in some cases, but produces larger models.

TYPE: str DEFAULT: 'svm'

Note

The previous documentation mentioned path and features, but these are not part of the current signature and were removed.

Functions¤
core.nlp.Classifier.get_features_parallel ¤
get_features_parallel(post: Data) -> tuple[str, str]

Thread-safe wrapper around .get_features(), meant to be mapped over posts with multiprocessing.Pool.map.

core.nlp.Classifier.load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

core.nlp.Classifier.classify ¤
classify(post: str) -> str

Apply a label on a post based on the trained model.

core.nlp.Classifier.prob_classify ¤
prob_classify(post: str) -> tuple[str, float]

Apply a label on a post based on the trained model and output the probability too.

core.nlp.StemTokenIndex ¤

StemTokenIndex(db: sqlite3.Connection, tokenizer: Tokenizer)

Reverse normalized stem -> token lookup index.

Functions¤
core.nlp.StemTokenIndex.most_probable_token ¤
most_probable_token(db: sqlite3.Connection, stem: str) -> str

Return the most probable original token associated with the stem. If the stem doesn’t exist in the database, it is returned as-is.

core.nlp.StemTokenIndex.most_probable_tokens ¤
most_probable_tokens(db: sqlite3.Connection, stems: list[str]) -> list[str]

Return the most probable original token for each stem.

Stems not found in DB are returned unchanged.
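
The lookup behavior described above can be sketched with the standard sqlite3 module. The table name stem_tokens, its columns and the use of an occurrence count as the probability criterion are all hypothetical; the actual database schema is not documented here.

```python
import sqlite3

# In-memory database with a hypothetical schema
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stem_tokens (stem TEXT, token TEXT, occurrences INTEGER)")
db.executemany(
    "INSERT INTO stem_tokens VALUES (?, ?, ?)",
    [("run", "running", 12), ("run", "runs", 30), ("lens", "lenses", 4)],
)


def most_probable_token(db, stem):
    """Return the most frequent token recorded for `stem`, or `stem` itself."""
    row = db.execute(
        "SELECT token FROM stem_tokens WHERE stem = ? "
        "ORDER BY occurrences DESC LIMIT 1",
        (stem,),
    ).fetchone()
    return row[0] if row else stem


def most_probable_tokens(db, stems):
    """Apply the single-stem lookup to each stem, preserving order."""
    return [most_probable_token(db, stem) for stem in stems]


tokens = most_probable_tokens(db, ["run", "lens", "xyz"])
```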

Functions¤

core.nlp.split_url ¤

split_url(url: str) -> tuple[str, str, str, str, str] | None

Split a well-formed URL following RFC3986 into base elements.

RETURNS DESCRIPTION
tuple[str, str, str, str, str] | None

a tuple of (protocol, domain, page, parameters, anchor). Empty or missing fields are initialized with empty strings, so there is no need for individual None checks. If the url input doesn’t match a URL format, None is returned.
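
The shape of the return value can be illustrated with the standard library. This sketch swaps in urllib.parse.urlsplit for the module's own regex, so edge cases will differ; the name split_url_sketch and the rule "no domain means no URL" are assumptions.

```python
from urllib.parse import urlsplit


def split_url_sketch(url):
    """Return (protocol, domain, page, parameters, anchor), or None if no domain."""
    parts = urlsplit(url)
    if not parts.netloc:
        return None  # assumption: an input without a domain is not a URL
    # urlsplit already fills missing components with empty strings
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)


full = split_url_sketch("https://domain.ext/page/subpage?q=x&r=0:1#anchor")
relative = split_url_sketch("//domain.ext/page")
```

Because missing fields are empty strings, the tuple can be unpacked directly without per-field None checks.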

core.nlp.parse_lang_to_iso639_1 ¤

parse_lang_to_iso639_1(value: str | None) -> str | None

Normalize language identifier to ISO 639-1.

core.nlp.guess_language ¤

guess_language(
    string: str,
    stopwords_threshold: float = 0.05,
    letters_threshold: float = 0.8,
) -> str | None

Basic language guesser based on stopwords detection.

Stopwords are the most common words of a language: for each language, we count how many stopwords we found and return the language having the most matches. It is accurate for paragraphs and long documents, not so much for short sentences.

PARAMETER DESCRIPTION
string

the string to analyze. Needs to be lowercased but to retain accents and diacritics.

TYPE: str

stopwords_threshold

the minimum ratio of stopwords to total words in the string required to conclude on a language. For example, Japanese companies often have technical reports written in Japanese but still containing some English. If fewer than 5% of the words are known English stopwords, we can conclude it’s not English.

TYPE: float DEFAULT: 0.05

letters_threshold

the minimum ratio of roman (latin) characters among all characters (including numbers, symbols and non-latin alphabets) required to conclude on a language.

TYPE: float DEFAULT: 0.8

RETURNS DESCRIPTION
str | None

ISO 639-1 language code. Defaults to “en” if nothing found.
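
The stopword-counting idea can be sketched as follows. The stopword lists are tiny hypothetical samples, the letters_threshold check is omitted, and this sketch returns None when the ratio is too low, which is an assumption and may differ from the fallback of the real function.

```python
# Tiny hypothetical stopword samples; real lists are much longer
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "les"},
}


def guess(string, stopwords_threshold=0.05):
    """Return the language whose stopwords match the most words, or None."""
    words = string.split()
    if not words:
        return None
    hits = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(hits, key=hits.get)
    if hits[best] / len(words) < stopwords_threshold:
        return None  # too few stopwords to conclude (assumption of this sketch)
    return best


english = guess("the cat is in the garden")
french = guess("le chat est dans le jardin")
```

Short inputs contain few stopwords, which is why this approach is more reliable on paragraphs than on single sentences.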

core.nlp.detect_language ¤

detect_language(text: str) -> str | None

Detect language from arbitrary text safely.

RETURNS DESCRIPTION
str | None

ISO 639-1 language code.

core.nlp.tokenize_document_to_words ¤

tokenize_document_to_words(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into single words

PARAMETER DESCRIPTION
language

ISO 639-1 language code.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[str]

Bag of words for the whole document. Sentence delimiters are removed.

core.nlp.split_document_to_sentences ¤

split_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[str]

Split a text into a list of sentences.

PARAMETER DESCRIPTION
language

ISO 639-1 language code.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[str]

List of sentences as full text.

core.nlp.tokenize_document_to_sentences ¤

tokenize_document_to_sentences(
    text: str, language: str | None = None, backend: str = "blingfire"
) -> list[list[str]]

Split a text into single words as a list of lists

PARAMETER DESCRIPTION
language

ISO 639-1 language code.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[list[str]]

List of sentences, each sentence is itself a list of words.
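
The three tokenization functions differ mainly in the shape of their output. The sketch below reproduces those shapes with naive regular expressions instead of the blingfire backend, so the splitting rules are only an approximation and all function names are hypothetical.

```python
import re


def sentences_of(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words_of(text):
    """Naive word extraction: runs of word characters, punctuation dropped."""
    return re.findall(r"\w+", text)


text = "Open the file. Then export it!"

# Like split_document_to_sentences: sentences as full text
as_sentences = sentences_of(text)

# Like tokenize_document_to_sentences: one list of words per sentence
as_nested_words = [words_of(s) for s in as_sentences]

# Like tokenize_document_to_words: flat bag of words for the whole document
as_flat_words = words_of(text)
```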