NLP¤

nlp ¤

High-level natural language processing module for message-like (emails, comments, posts) input.

Supports automatic language detection, word tokenization and stemming for 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'spanish', 'swedish'.

© 2023 - Aurélien Pierre

Classes¤

Tokenizer ¤

Tokenizer(
    meta_tokens: dict[re.Pattern, str] = None,
    abbreviations: dict[str, str] = None,
    stopwords: list[str] = None,
)

Pre-processing pipeline and tokenizer, splitting a string into normalized word tokens.

PARAMETER DESCRIPTION
meta_tokens

the pipeline of regular expressions to replace with meta-tokens. Keys must be re.Pattern objects declared with re.compile(), values must be meta-tokens, assumed to be wrapped in underscores. The pipeline dictionary is processed in the order of declaration, which relies on Python >= 3.7 (where dict is ordered by default). If not provided, it is initialized by default with a pipeline suitable for bilingual English/French language processing on technical writings (see notes).

abbreviations

the pipeline of abbreviations to replace, as a to_replace: replacement dictionary. Will be processed in order of declaration.

TYPE: dict[str, str] DEFAULT: None

Attributes¤
characters_cleanup class-attribute instance-attribute ¤
characters_cleanup: dict[re.Pattern, str] = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}

Dictionary of regular expressions (keys) to find and replace with the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBcode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).

internal_meta_tokens class-attribute instance-attribute ¤
internal_meta_tokens: dict[re.Pattern, str] = {
    HASH_PATTERN_FAST: "_HASH_",
    NUMBER_PATTERN_FAST: "_NUMBER_",
}

Dictionary of regular expressions (keys) to find in full tokens and replace with meta-tokens. Uses simplified regex patterns for performance.

Functions¤
prefilter ¤
prefilter(string: str, meta_tokens: bool = True) -> str

Tokenizers split words based on unsupervised machine-learned models. Sometimes they behave in unexpected ways: in emails and user handles like @user, they would split @ and user into 2 different tokens, making it impossible to detect usernames as single tokens later.

To avoid that, we replace data of interest with meta-tokens before tokenization, using regular expressions.
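
A minimal usage sketch; the exact meta-token names depend on the configured pipeline, so those shown in the comment are only illustrative:

from core import nlp

tokenizer = nlp.Tokenizer()  # default bilingual English/French pipeline

raw = "thanks @user, ping me at john.doe@example.com"
prepared = tokenizer.prefilter(raw)

# The handle and the email address are expected to come out as single
# meta-tokens (something like "_USER_" / "_EMAIL_", depending on the pipeline)
# instead of being split apart by the tokenizer later on.
print(prepared)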

lemmatize ¤
lemmatize(word: str) -> str

Find the root (lemma) of words to help topical generalization.

normalize_token ¤
normalize_token(word: str, language: str, meta_tokens: bool = True)

Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.), punctuation, etc. are replaced by None, which should be filtered out at the next step.

PARAMETER DESCRIPTION
word

tokenized word in lower case only.

TYPE: str

language

the language used to detect dates. Supports "french", "english" or "any".

TYPE: str

vocabulary

a token: list mapping where token is the stemmed token and list stores all the words from the corpus that share this stem. Because stemmed tokens are no longer user-friendly, this vocabulary can be used to build a reverse mapping normalized token -> natural-language keyword for a GUI.

TYPE: dict

Examples:

10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token. feb, February, feb., monday will all be replaced by a _DATE_ meta-token.
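
A hedged sketch of the behaviour described above, assuming a default Tokenizer instance:

from core import nlp

tokenizer = nlp.Tokenizer()

for word in ("10:00", "10am", "feb", "monday", "the"):
    print(word, "->", tokenizer.normalize_token(word, "english"))

# Expected: times map to "_TIME_", dates to "_DATE_", and the stopword "the"
# to None (to be filtered out at the next step).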

tokenize_sentence ¤
tokenize_sentence(
    sentence: str, language: str, meta_tokens: bool = True
) -> list[str]

Split a sentence into normalized word tokens and meta-tokens.

PARAMETER DESCRIPTION
sentence

the input single sentence.

TYPE: str

language

the language string to be used by the tokenizer. It needs to be one of those supported by the module core.nlp.

TYPE: str

meta_tokens

find meta-tokens through regular expressions and replace them in the text. This helps tokenization keep similar objects together, especially dates that would otherwise be split.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
tokens

the list of normalized tokens.

TYPE: list[str]
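
A minimal sketch, assuming a default Tokenizer instance:

from core import nlp

tokenizer = nlp.Tokenizer()
tokens = tokenizer.tokenize_sentence("the meeting is on monday at 10 am.", "english")

# Expected: a flat list of normalized tokens where the date and the time are
# kept whole as "_DATE_" / "_TIME_" meta-tokens.
print(tokens)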

split_sentences ¤
split_sentences(document: str, language: str) -> list[str]

Split a document into sentences using an unsupervised machine learning model.

PARAMETER DESCRIPTION
text

the paragraph to break into sentences.

TYPE: str

language

the language of the text, used to select what pre-trained model will be used.

TYPE: str
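
A minimal sketch, assuming a default Tokenizer instance:

from core import nlp

tokenizer = nlp.Tokenizer()
paragraph = "Hello there. The meeting is on Monday. See you at 10 am."

# Expected: one string per sentence, ready to be tokenized individually.
print(tokenizer.split_sentences(paragraph, "english"))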

tokenize_document ¤
tokenize_document(
    document: str, language: str = None, meta_tokens: bool = True
) -> list[str]

Cleanup and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with utils.typography_undo(),
  • cleaned up for sequences of whitespaces with utils.cleanup_whitespaces()
Note

The language is detected internally if it is not provided as an optional argument. When processing a single sentence extracted from a document, rather than the whole document, it is more accurate to run language detection on the whole document ahead of calling this method and to pass the result here.

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

language

the language of the document. Will be internally inferred if not given.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
tokens

a 1D list of normalized tokens and meta-tokens.

TYPE: list[str]
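
A hedged end-to-end sketch of the preparation steps listed above, assuming the utils helpers referenced in this module:

from core import utils
from core import nlp

tokenizer = nlp.Tokenizer()

raw = "Rendez-vous on February 3rd at 10 am: https://example.com"
text = raw.lower()                      # optional but recommended
text = utils.typography_undo(text)      # Unicode -> ASCII
text = utils.cleanup_whitespaces(text)  # collapse whitespace sequences

# The language is inferred internally since it is not passed explicitly.
tokens = tokenizer.tokenize_document(text)
print(tokens)  # 1D list of normalized tokens and meta-tokens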

tokenize_per_sentence ¤
tokenize_per_sentence(
    document: str, meta_tokens: bool = True
) -> list[list[str]]

Cleanup and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so that each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with utils.typography_undo(),
  • cleaned up for sequences of whitespaces with utils.cleanup_whitespaces()
Note

the language is detected internally.

RETURNS DESCRIPTION
tokens

a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

TYPE: list[list[str]]
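
A minimal sketch of per-sentence tokenization, e.g. to prepare the input of an embedding model:

from core import nlp

tokenizer = nlp.Tokenizer()
document = "hello there. the meeting is on monday. see you at 10 am."

sentences = tokenizer.tokenize_per_sentence(document)

# Expected: a 2D list with one inner list of tokens and meta-tokens per
# sentence, suitable as training input for a Word2Vec model.
print(sentences)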

Data ¤

Data(text: str, label: str)

Represent an item of tagged training data.

PARAMETER DESCRIPTION
text

the content to label, which will be vectorized

TYPE: str

label

the category of the content, which will be predicted by the model

TYPE: str

LossLogger ¤

LossLogger()

Bases: CallbackAny2Vec

Output loss at each epoch

Word2Vec ¤

Word2Vec(
    sentences: list[str],
    name: str = "word2vec",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count=5,
    sample=0.0005,
    tokenizer: Tokenizer = None,
)

Bases: gensim.models.Word2Vec

Train, re-train or retrieve an existing word2vec word embedding model

PARAMETER DESCRIPTION
name

filename of the model to save and retrieve. If the model exists already, we automatically load it. Note that this will override the vector_size with the parameter defined in the saved model.

TYPE: str DEFAULT: 'word2vec'

vector_size

number of dimensions of the word vectors

TYPE: int DEFAULT: 300

epochs

number of training iterations for the machine learning. Small corpora need 2000 epochs or more. Higher values increase the training time.

TYPE: int DEFAULT: 200

window

size of the token collocation window to detect

TYPE: int DEFAULT: 5

Functions¤
load_model classmethod ¤
load_model(name: str)

Load a trained model saved in the models folder

get_word ¤
get_word(word: str) -> str | None

Find out if a word is in the dictionary, optionally attempting spell-checking if it is not found.

PARAMETER DESCRIPTION
word

word to find

TYPE: str

RETURNS DESCRIPTION
str | None
  • the original word if found in the dictionary,
  • the spell-checked correction if the word was not found but a close match exists,
  • None if both previous conditions were not matched.
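
A hedged usage sketch, assuming a model trained or loaded as in the examples below:

from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")

print(w2v.get_word("linux"))   # the word itself if it is in the vocabulary
print(w2v.get_word("qwzrtx"))  # None if nothing matches, even after spell-checking
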
get_wordvec ¤
get_wordvec(word: str, embed: str = 'IN') -> np.ndarray[np.float32] | None

Return the vector associated to a word, through a dictionary of words.

PARAMETER DESCRIPTION
word

the word to convert to a vector.

TYPE: str

embed
  • IN uses the input embedding matrix gensim.models.Word2Vec.wv, useful to vectorize queries and documents for classification training.
  • OUT uses the output embedding matrix gensim.models.Word2Vec.syn1neg, useful for the dual-space embedding scheme, to train search engines. [^1]

TYPE: str DEFAULT: 'IN'


  1. A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana https://arxiv.org/pdf/1602.01137.pdf 

RETURNS DESCRIPTION
np.ndarray[np.float32] | None

the nD vector if the word was found in the dictionary, or None.
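
A hedged sketch of the IN/OUT embedding spaces, assuming a previously saved model:

from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")

vec_in = w2v.get_wordvec("linux", embed="IN")    # query/document side
vec_out = w2v.get_wordvec("linux", embed="OUT")  # ranking side (syn1neg)

if vec_in is not None and vec_out is not None:
    print(vec_in.shape, vec_out.shape)  # both are vector_size-dimensional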

get_features ¤
get_features(tokens: list[str], embed: str = 'IN') -> np.ndarray[np.float32]

Calls core.nlp.Word2Vec.get_wordvec over a list of tokens and returns a single vector representing the whole list.

PARAMETER DESCRIPTION
tokens

list of text tokens.

TYPE: list[str]

embed

TYPE: str DEFAULT: 'IN'

RETURNS DESCRIPTION
np.ndarray[np.float32]

the centroid of the word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
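
An illustrative sketch of what the centroid means here, assuming a previously saved model (this is not the actual implementation, just the idea):

import numpy as np

from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")
tokens = ["install", "on", "linux"]

centroid = w2v.get_features(tokens, embed="IN")

# Roughly equivalent by hand: average the vectors of the words that were found,
# fall back to the null vector when none were found.
found = [v for v in (w2v.get_wordvec(t) for t in tokens) if v is not None]
manual = np.mean(found, axis=0) if found else np.zeros(w2v.vector_size, dtype=np.float32)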

Classifier ¤

Classifier(
    training_set: list[Data],
    name: str,
    word2vec: Word2Vec,
    validate: bool = True,
    variant: str = "svm",
)

Bases: nltk.classify.SklearnClassifier

Handle the word2vec and SVM machine-learning

PARAMETER DESCRIPTION
training_set

list of Data elements. If the list is empty, it will try to find a pre-trained model matching the path name.

TYPE: list[Data]

path

path to save the trained model for reuse, as a Python joblib.

name

name under which the model will be saved for later reuse.

TYPE: str

word2vec

the instance of word embedding model.

TYPE: Word2Vec

validate

if True, split the feature_list between a training set (95%) and a testing set (5%) and print in terminal the predictive performance of the model on the testing set. This is useful to choose a classifier.

TYPE: bool DEFAULT: True

variant
  • svm: use a Support Vector Machine with a radial-basis kernel. This is a well-rounded classifier, robust and stable, that performs well for all kinds of training sample sizes.
  • linear svm: uses a linear Support Vector Machine. It runs faster than the previous one and may generalize better for high numbers of features (high dimensionality).
  • forest: Random Forest Classifier, which is a set of decision trees. It runs about 15-20% faster than linear SVM and tends to perform marginally better in some contexts; however, it produces very large models (several GB to save on disk, where SVM needs a few dozen MB).

TYPE: str DEFAULT: 'svm'

features

the number of model features (dimensions) to retain. This sets the number of dimensions for word vectors found by word2vec, which will also be the dimensions in the last training layer.

TYPE: int

Functions¤
get_features_parallel ¤
get_features_parallel(post: Data) -> tuple[str, str]

Thread-safe call to .get_features() to be called in multiprocessing.Pool map

load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

classify ¤
classify(post: str) -> str

Apply a label on a post based on the trained model.

prob_classify ¤
prob_classify(post: str) -> tuple[str, float]

Apply a label on a post based on the trained model and output the probability too.

search_methods ¤

Bases: IntEnum

Search methods available

Indexer ¤

Indexer(data_set: list, name: str, word2vec: Word2Vec)

Search engine based on word similarity.

PARAMETER DESCRIPTION
training_set

list of Data elements. If the list is empty, it will try to find a pre-trained model matching the path name.

TYPE: list

path

path to save the trained model for reuse, as a Python joblib.

name

name under which the model will be saved for later reuse.

TYPE: str

word2vec

the instance of word embedding model.

TYPE: Word2Vec

validate

if True, split the feature_list between a training set (95%) and a testing set (5%) and print in terminal the predictive performance of the model on the testing set. This is useful to choose a classifier.

TYPE: bool

variant
  • svm: use a Support Vector Machine with a radial-basis kernel. This is a well-rounded classifier, robust and stable, that performs well for all kinds of training sample sizes.
  • linear svm: uses a linear Support Vector Machine. It runs faster than the previous one and may generalize better for high numbers of features (high dimensionality).
  • forest: Random Forest Classifier, which is a set of decision trees. It runs about 15-20% faster than linear SVM and tends to perform marginally better in some contexts; however, it produces very large models (several GB to save on disk, where SVM needs a few dozen MB).

TYPE: str

features

the number of model features (dimensions) to retain. This sets the number of dimensions for word vectors found by word2vec, which will also be the dimensions in the last training layer.

TYPE: int

Functions¤
get_features_parallel ¤
get_features_parallel(tokens: list[str]) -> tuple[str, str]

Thread-safe call to .get_features() to be called in multiprocessing.Pool map

load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

vectorize_query ¤
vectorize_query(
    tokenized_query: list[str],
) -> tuple[np.ndarray, float, list[str]]

Prepare a text search query: cleanup, tokenize and get the centroid vector.

RETURNS DESCRIPTION
tuple[np.ndarray, float, list[str]]

tuple[vector, norm, tokens]

rank ¤
rank(
    query: str | tuple | re.Pattern,
    method: search_methods,
    filter_callback: callable = None,
    pattern: str | re.Pattern = None,
    **kargs
) -> list[tuple[str, float]]

Rank the indexed documents against the query and return the best-matching results.

PARAMETER DESCRIPTION
query

the query to search. re.Pattern is available only with the grep method.

TYPE: str | tuple | re.Pattern

method

ai, fuzzy or grep. ai uses word embeddings and meta-tokens with the dual-embedding space, fuzzy uses meta-tokens with a BM25Okapi statistical model, grep uses direct string and regex search.

TYPE: str

filter_callback

a function returning a boolean to filter in/out the results of the ranker.

TYPE: callable DEFAULT: None

pattern

optional pattern/text search to add on top of AI search

TYPE: str | re.Pattern DEFAULT: None

**kargs

arguments passed as-is to the filter_callback

DEFAULT: {}

RETURNS DESCRIPTION
list

the list of best-matching results as (url, similarity) tuples.

TYPE: list[tuple[str, float]]

get_page ¤
get_page(url: str) -> dict

Retrieve the requested page data object from the index by url.

Warning

For performance’s sake, it doesn’t check whether the url exists in the index. This is not an issue if you feed it the output of self.rank(), but be careful otherwise.

get_related ¤

get_related(post: tuple, n: int = 15) -> list

Get the n closest keywords from the query.

Functions¤

guess_language ¤

guess_language(string: str) -> str

Basic language guesser based on stopwords detection.

Stopwords are the most common words of a language: for each language, we count how many stopwords are found and return the language with the most matches. It is accurate for paragraphs and long documents, less so for short sentences.

RETURNS DESCRIPTION
str

2-letter ISO language code.
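
A minimal sketch; the codes shown in comments are the expected results:

from core import nlp

print(nlp.guess_language("The quick brown fox jumps over the lazy dog, then keeps running."))
# expected: "en"
print(nlp.guess_language("Le renard brun saute par-dessus le chien paresseux, puis continue de courir."))
# expected: "fr"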

Examples¤

Training a language model¤

Assuming you followed the example of the crawler module, write another user script with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import utils
from core import nlp

embedding_set = []

# Open an existing dataset
for post in utils.open_data("ansel"):
    # Use only the content field of a `crawler.web_page` object
    embedding_set.append(post["content"])

# Build the word2vec language model
w2v = nlp.Word2Vec(embedding_set, "word2vec", epochs=200, window=15, min_count=32, sample=0.0005)

# Test word2vec: get the closest words from "free"
print(w2v.wv.most_similar("free"))

This will save a word2vec file into VirtualSecretary/models. To retrieve it later, use:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")

Training an AI-based search engine indexer¤

Assuming you built the word2vec model above, create another user script with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import utils
from core import nlp

# Create an index with pre-computed word embedding for each page
indexer = nlp.Indexer(utils.open_data("ansel"),
                      "search_engine",
                      nlp.Word2Vec.load_model("word2vec"))

# Do a test search
text_request = "install on linux"
tokenized_request = indexer.tokenize_query(text_request)
vectorized_request = indexer.vectorize_query(tokenized_request)
results = indexer.rank(vectorized_request, nlp.search_methods.AI)

# Display only the 25 best results
for url, similarity in results[0:25]:
    page = indexer.get_page(url)
    print(page["title"], page["excerpt"], page["url"], page["date"], similarity)

The Indexer object is automatically saved to VirtualSecretary/models as a compressed joblib object containing its own Word2Vec language model, so the indexer is standalone. To retrieve it later, use:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import nlp

indexer = nlp.Indexer.load("search_engine")

Training an AI classifier¤

We will use the NPS Chat text corpus. It’s a text corpus of chat messages, labelled by category (like “Yes-No question”, “Wh- question”, “Greeting”, “Statement”, “No answer”, “Yes answer”, etc.). The purpose of the classifier will be to automatically find the label of a new message, by learning the properties of each label from the training corpus.

Create a new user script with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
import nltk
from core import nlp

# Download the training set
nltk.download('nps_chat')

# Build the word2vec language model
embedding_set = [post.text
                 for post in nltk.corpus.nps_chat.xml_posts()]
w2v = nlp.Word2Vec(embedding_set, "word2vec_chat", epochs=2000, window=3)

# Test word2vec
print("free:\n",
      w2v.wv.most_similar(w2v.tokenizer.normalize_token("free", "english")))

# Build the classifier model
training_set = [nlp.Data(post.text, post.get('class')) # (content, label)
                for post in nltk.corpus.nps_chat.xml_posts()]
model = nlp.Classifier(training_set, "chat", w2v, validate=True)

# Classify test messages
test_messages = ["Do you have time to meet at 5 pm ?",
                 "Come with me !",
                 "Nope",
                 "What do you think ?"]

for item in test_messages:
    print(item, model.prob_classify(item))

Output:

free:
[('xbox', 0.3570968210697174),
 ('gam', 0.3551534414291382),
 ('wz', 0.3535629212856293),
 ('howdi', 0.3532298803329468),
 ('anybodi', 0.340751051902771),
 ('against', 0.33561158180236816),
 ('hb', 0.32573479413986206),
 ('yawn', 0.3226745128631592),
 ('tx', 0.32188209891319275),
 ('hiya', 0.31899407505989075)]

accuracy against test set: 0.803030303030303
accuracy against train set: 0.9188166152007172

Do you have time to meet at 5 pm ? ('whQuestion', 0.39620921476180504)
Come with me ! ('Emphasis', 0.46625803160949525)
Nope ('nAnswer', 0.48401087375968443)
What do you think ? ('whQuestion', 0.9756292257900939)

Note

The classifier above was trained with validate=True, which splits the training corpus into 2 sets: an actual training set, used to learn the properties of the labels, and a test set, excluded from training and only used at the end to check whether the model's predictions match the actual labels. This helps with tuning the hyper-parameters of the model.

The accuracies of the model against each set are shown in the terminal output. Values close to 1.0 mean the model gets it right every time. It is expected that the accuracy against the training set will be higher than against the test set. However, if there is a large difference between the two (like 0.65/0.95), it means your model is over-fitting and will lack generality.

When satisfying accuracies have been found, you can retrain the model with validate=False to use all available data for maximum accuracy before using it in production.

The chat model will again be saved automatically in the VirtualSecretary/models folder. Similarly to the previous objects, once saved, it can be retrieved later with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import nlp

model = nlp.Classifier.load("chat")

Conclusion¤

The NLP objects are designed to be easily saved and later retrieved to be used in filters. The general workflow is as follows:

  1. gather data, either from the Secretary (using the learn filter mode to aggregate email bodies, contacts, comments, etc.) or from the Crawler, to scrape web pages and local documents,
  2. train (offline) a language model (Word2Vec) with the data, with user scripts,
  3. train (offline) a search engine Indexer model or a content Classifier model, depending on your needs, with user scripts,
  4. from your processing filters, retrieve the trained models and process (online) the email contents to decide what actions should be taken. For the Classifier, the probability (confidence) of the label is returned as well and can be used in filters to act only when the confidence is above a threshold.