NLP¤

nlp ¤

High-level natural language processing module for message-like (emails, comments, posts) input.

Supports automatic language detection, word tokenization and stemming for 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'spanish', 'swedish'.

© 2023 - Aurélien Pierre

Classes¤

Tokenizer ¤

Tokenizer(
    meta_tokens: dict[re.Pattern, str] = None,
    abbreviations: dict[str, str] = None,
    stopwords: list[str] = None,
)

Pre-processing pipeline and tokenizer, splitting a string into normalized word tokens.

PARAMETER DESCRIPTION
meta_tokens

the pipeline of regular expressions to replace with meta-tokens. Keys must be re.Pattern objects declared with re.compile(), values must be meta-tokens, assumed to be wrapped in underscores. The pipeline dictionary is processed in the order of declaration, which relies on Python >= 3.7 (where dict is ordered by default). If not provided, it is initialized by default with a pipeline suitable for bilingual English/French language processing on technical writings (see notes).

abbreviations

the pipeline of abbreviations to replace, as a to_replace: replacement dictionary. Will be processed in order of declaration.

TYPE: dict[str, str] DEFAULT: None

Attributes¤
characters_cleanup class-attribute instance-attribute ¤
characters_cleanup: dict[re.Pattern, str] = {
    MULTIPLE_DOTS: "...",
    MULTIPLE_DASHES: "-",
    MULTIPLE_QUESTIONS: "?",
    REPEATED_CHARACTERS: " ",
    BB_CODE: " ",
    MARKUP: " \\1 ",
    BASE_64: " ",
}

Dictionary of regular expressions (keys) to find and replace with the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBcode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).

internal_meta_tokens class-attribute instance-attribute ¤
internal_meta_tokens: dict[re.Pattern, str] = {
    HASH_PATTERN_FAST: "_HASH_",
    NUMBER_PATTERN_FAST: "_NUMBER_",
}

Dictionary of regular expressions (keys) to find in full tokens and replace with meta-tokens. Uses simplified regex patterns for performance.

Functions¤
prefilter ¤
prefilter(string: str, meta_tokens: bool = True) -> str

Tokenizers split words based on unsupervised machine-learned models. Sometimes they behave in unexpected ways: in emails and user handles like @user, they would split @ and user into 2 different tokens, making it impossible to detect usernames as single tokens later.

To avoid that, we replace data of interest with meta-tokens before tokenization, using regular expressions.
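
A minimal usage sketch; the exact meta-token names depend on the configured pipeline, so those shown in the comment are only illustrative:

from core import nlp

tokenizer = nlp.Tokenizer()  # default bilingual English/French pipeline

raw = "thanks @user, ping me at john.doe@example.com"
prepared = tokenizer.prefilter(raw)

# The handle and the email address are expected to come out as single
# meta-tokens (something like "_USER_" / "_EMAIL_", depending on the pipeline)
# instead of being split apart by the tokenizer later on.
print(prepared)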

lemmatize ¤
lemmatize(word: str) -> str

Find the root (lemma) of words to help topical generalization.

normalize_token ¤
normalize_token(word: str, language: str, meta_tokens: bool = True)

Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.), punctuation, etc. are replaced by None, which should be filtered out at the next step.

PARAMETER DESCRIPTION
word

tokenized word in lower case only.

TYPE: str

language

the language used to detect dates. Supports "french", "english" or "any".

TYPE: str

vocabulary

a token: list mapping where token is the stemmed token and list stores all the words from the corpus that share this stem. Because stemmed tokens are no longer user-friendly, this vocabulary can be used to build a reverse mapping normalized token -> natural-language keyword for a GUI.

TYPE: dict

Examples:

10:00 or 10 h or 10am or 10 am will all be replaced by a _TIME_ meta-token. feb, February, feb., monday will all be replaced by a _DATE_ meta-token.
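
A hedged sketch of the behaviour described above, assuming a default Tokenizer instance:

from core import nlp

tokenizer = nlp.Tokenizer()

for word in ("10:00", "10am", "feb", "monday", "the"):
    print(word, "->", tokenizer.normalize_token(word, "english"))

# Expected: times map to "_TIME_", dates to "_DATE_", and the stopword "the"
# to None (to be filtered out at the next step).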

tokenize_sentence ¤
tokenize_sentence(
    sentence: str, language: str, meta_tokens: bool = True
) -> list[str]

Split a sentence into normalized word tokens and meta-tokens.

PARAMETER DESCRIPTION
sentence

the input single sentence.

TYPE: str

language

the language string to be used by the tokenizer. It needs to be one of those supported by the module core.nlp.

TYPE: str

meta_tokens

find meta-tokens through regular expressions and replace them in the text. This helps tokenization keep similar objects together, especially dates that would otherwise be split.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
tokens

the list of normalized tokens.

TYPE: list[str]
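
A minimal sketch, assuming a default Tokenizer instance:

from core import nlp

tokenizer = nlp.Tokenizer()
tokens = tokenizer.tokenize_sentence("the meeting is on monday at 10 am.", "english")

# Expected: a flat list of normalized tokens where the date and the time are
# kept whole as "_DATE_" / "_TIME_" meta-tokens.
print(tokens)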

split_sentences ¤
split_sentences(document: str, language: str) -> list[str]

Split a document into sentences using an unsupervised machine learning model.

PARAMETER DESCRIPTION
text

the paragraph to break into sentences.

TYPE: str

language

the language of the text, used to select what pre-trained model will be used.

TYPE: str
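
A minimal sketch, assuming a default Tokenizer instance:

from core import nlp

tokenizer = nlp.Tokenizer()
paragraph = "Hello there. The meeting is on Monday. See you at 10 am."

# Expected: one string per sentence, ready to be tokenized individually.
print(tokenizer.split_sentences(paragraph, "english"))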

tokenize_document ¤
tokenize_document(
    document: str, language: str = None, meta_tokens: bool = True
) -> list[str]

Cleanup and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with utils.typography_undo(),
  • cleaned up for sequences of whitespaces with utils.cleanup_whitespaces()
Note

The language is detected internally if it is not provided as an optional argument. When processing a single sentence extracted from a document, rather than the whole document, it is more accurate to run language detection on the whole document ahead of calling this method and to pass the result here.

PARAMETER DESCRIPTION
document

the text of the document to tokenize

TYPE: str

language

the language of the document. Will be internally inferred if not given.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
tokens

a 1D list of normalized tokens and meta-tokens.

TYPE: list[str]
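
A hedged end-to-end sketch of the preparation steps listed above, assuming the utils helpers referenced in this module:

from core import utils
from core import nlp

tokenizer = nlp.Tokenizer()

raw = "Rendez-vous on February 3rd at 10 am: https://example.com"
text = raw.lower()                      # optional but recommended
text = utils.typography_undo(text)      # Unicode -> ASCII
text = utils.cleanup_whitespaces(text)  # collapse whitespace sequences

# The language is inferred internally since it is not passed explicitly.
tokens = tokenizer.tokenize_document(text)
print(tokens)  # 1D list of normalized tokens and meta-tokens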

tokenize_per_sentence ¤
tokenize_per_sentence(
    document: str, meta_tokens: bool = True
) -> list[list[str]]

Cleanup and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so that each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:

  • lowercased (optional but recommended) with str.lower(),
  • translated from Unicode to ASCII (optional but recommended) with utils.typography_undo(),
  • cleaned up for sequences of whitespaces with utils.cleanup_whitespaces()
Note

the language is detected internally.

RETURNS DESCRIPTION
tokens

a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).

TYPE: list[list[str]]
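
A minimal sketch of per-sentence tokenization, e.g. to prepare the input of an embedding model:

from core import nlp

tokenizer = nlp.Tokenizer()
document = "hello there. the meeting is on monday. see you at 10 am."

sentences = tokenizer.tokenize_per_sentence(document)

# Expected: a 2D list with one inner list of tokens and meta-tokens per
# sentence, suitable as training input for a Word2Vec model.
print(sentences)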

Data ¤

Data(text: str, label: str)

Represent an item of tagged training data.

PARAMETER DESCRIPTION
text

the content to label, which will be vectorized

TYPE: str

label

the category of the content, which will be predicted by the model

TYPE: str

LossLogger ¤

LossLogger()

Bases: CallbackAny2Vec

Output loss at each epoch

Word2Vec ¤

Word2Vec(
    sentences: list[str],
    name: str = "word2vec",
    vector_size: int = 300,
    epochs: int = 200,
    window: int = 5,
    min_count=5,
    sample=0.0005,
    tokenizer: Tokenizer = None,
)

Bases: gensim.models.Word2Vec

Train, re-train or retrieve an existing word2vec word embedding model

PARAMETER DESCRIPTION
name

filename of the model to save and retrieve. If the model exists already, we automatically load it. Note that this will override the vector_size with the parameter defined in the saved model.

TYPE: str DEFAULT: 'word2vec'

vector_size

number of dimensions of the word vectors

TYPE: int DEFAULT: 300

epochs

number of training iterations for the machine learning. Small corpora need 2000 epochs or more. Higher values increase the training time.

TYPE: int DEFAULT: 200

window

size of the token collocation window to detect

TYPE: int DEFAULT: 5

Functions¤
load_model classmethod ¤
load_model(name: str)

Load a trained model saved in the models folder

get_word ¤
get_word(word: str) -> str | None

Find out if a word is in the dictionary, optionally attempting spell-checking if it is not found.

PARAMETER DESCRIPTION
word

word to find

TYPE: str

RETURNS DESCRIPTION
str | None
  • the original word if found in the dictionary,
  • the spell-checked correction if the word was not found but a close match exists,
  • None if both previous conditions were not matched.
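
A hedged usage sketch, assuming a model trained or loaded as in the examples below:

from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")

print(w2v.get_word("linux"))   # the word itself if it is in the vocabulary
print(w2v.get_word("qwzrtx"))  # None if nothing matches, even after spell-checking
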
get_wordvec ¤
get_wordvec(word: str, embed: str = 'IN') -> np.ndarray[np.float32] | None

Return the vector associated to a word, through a dictionary of words.

PARAMETER DESCRIPTION
word

the word to convert to a vector.

TYPE: str

embed
  • IN uses the input embedding matrix gensim.models.Word2Vec.wv, useful to vectorize queries and documents for classification training.
  • OUT uses the output embedding matrix gensim.models.Word2Vec.syn1neg, useful for the dual-space embedding scheme, to train search engines. [^1]

TYPE: str DEFAULT: 'IN'


  1. A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana https://arxiv.org/pdf/1602.01137.pdf 

RETURNS DESCRIPTION
np.ndarray[np.float32] | None

the nD vector if the word was found in the dictionary, or None.
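
A hedged sketch of the IN/OUT embedding spaces, assuming a previously saved model:

from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")

vec_in = w2v.get_wordvec("linux", embed="IN")    # query/document side
vec_out = w2v.get_wordvec("linux", embed="OUT")  # ranking side (syn1neg)

if vec_in is not None and vec_out is not None:
    print(vec_in.shape, vec_out.shape)  # both are vector_size-dimensional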

get_features ¤
get_features(tokens: list[str], embed: str = 'IN') -> np.ndarray[np.float32]

Calls core.nlp.Word2Vec.get_wordvec over a list of tokens and returns a single vector representing the whole list.

PARAMETER DESCRIPTION
tokens

list of text tokens.

TYPE: list[str]

embed

TYPE: str DEFAULT: 'IN'

RETURNS DESCRIPTION
np.ndarray[np.float32]

the centroid of the word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
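
An illustrative sketch of what the centroid means here, assuming a previously saved model (this is not the actual implementation, just the idea):

import numpy as np

from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")
tokens = ["install", "on", "linux"]

centroid = w2v.get_features(tokens, embed="IN")

# Roughly equivalent by hand: average the vectors of the words that were found,
# fall back to the null vector when none were found.
found = [v for v in (w2v.get_wordvec(t) for t in tokens) if v is not None]
manual = np.mean(found, axis=0) if found else np.zeros(w2v.vector_size, dtype=np.float32)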

Classifier ¤

Classifier(
    training_set: list[Data],
    name: str,
    word2vec: Word2Vec,
    validate: bool = True,
    variant: str = "svm",
)

Bases: nltk.classify.SklearnClassifier

Handle the word2vec and SVM machine-learning

PARAMETER DESCRIPTION
training_set

list of Data elements. If the list is empty, it will try to find a pre-trained model matching the path name.

TYPE: list[Data]

path

path to save the trained model for reuse, as a Python joblib.

name

name under which the model will be saved for later reuse.

TYPE: str

word2vec

the instance of word embedding model.

TYPE: Word2Vec

validate

if True, split the feature_list between a training set (95%) and a testing set (5%) and print in terminal the predictive performance of the model on the testing set. This is useful to choose a classifier.

TYPE: bool DEFAULT: True

variant
  • svm: use a Support Vector Machine with a radial-basis kernel. This is a well-rounded classifier, robust and stable, that performs well for all kinds of training sample sizes.
  • linear svm: uses a linear Support Vector Machine. It runs faster than the previous one and may generalize better for high numbers of features (high dimensionality).
  • forest: Random Forest Classifier, which is a set of decision trees. It runs about 15-20% faster than linear SVM and tends to perform marginally better in some contexts; however, it produces very large models (several GB to save on disk, where SVM needs a few dozen MB).

TYPE: str DEFAULT: 'svm'

features

the number of model features (dimensions) to retain. This sets the number of dimensions for word vectors found by word2vec, which will also be the dimensions in the last training layer.

TYPE: int

Functions¤
get_features_parallel ¤
get_features_parallel(post: Data) -> tuple[str, str]

Thread-safe call to .get_features() to be called in multiprocessing.Pool map

load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

classify ¤
classify(post: str) -> str

Apply a label on a post based on the trained model.

prob_classify ¤
prob_classify(post: str) -> tuple[str, float]

Apply a label on a post based on the trained model and output the probability too.

search_methods ¤

Bases: IntEnum

Search methods available

Indexer ¤

Indexer(data_set: list, name: str, word2vec: Word2Vec)

Search engine based on word similarity.

PARAMETER DESCRIPTION
training_set

list of Data elements. If the list is empty, it will try to find a pre-trained model matching the path name.

TYPE: list

path

path to save the trained model for reuse, as a Python joblib.

name

name under which the model will be saved for later reuse.

TYPE: str

word2vec

the instance of word embedding model.

TYPE: Word2Vec

validate

if True, split the feature_list between a training set (95%) and a testing set (5%) and print in terminal the predictive performance of the model on the testing set. This is useful to choose a classifier.

TYPE: bool

variant
  • svm: use a Support Vector Machine with a radial-basis kernel. This is a well-rounded classifier, robust and stable, that performs well for all kinds of training sample sizes.
  • linear svm: uses a linear Support Vector Machine. It runs faster than the previous one and may generalize better for high numbers of features (high dimensionality).
  • forest: Random Forest Classifier, which is a set of decision trees. It runs about 15-20% faster than linear SVM and tends to perform marginally better in some contexts; however, it produces very large models (several GB to save on disk, where SVM needs a few dozen MB).

TYPE: str

features

the number of model features (dimensions) to retain. This sets the number of dimensions for word vectors found by word2vec, which will also be the dimensions in the last training layer.

TYPE: int

Functions¤
get_features_parallel ¤
get_features_parallel(tokens: list[str]) -> tuple[str, str]

Thread-safe call to .get_features() to be called in multiprocessing.Pool map

load classmethod ¤
load(name: str)

Load an existing trained model by its name from the ../models folder.

vectorize_query ¤
vectorize_query(
    tokenized_query: list[str],
) -> tuple[np.ndarray, float, list[str]]

Prepare a text search query: cleanup, tokenize and get the centroid vector.

RETURNS DESCRIPTION
tuple[np.ndarray, float, list[str]]

tuple[vector, norm, tokens]

rank ¤
rank(
    query: str | tuple | re.Pattern,
    method: search_methods,
    filter_callback: callable = None,
    pattern: str | re.Pattern = None,
    **kargs
) -> list[tuple[str, float]]

Rank the indexed documents against the query and return the best-matching results.

PARAMETER DESCRIPTION
query

the query to search. re.Pattern is available only with the grep method.

TYPE: str | tuple | re.Pattern

method

ai, fuzzy or grep. ai uses word embeddings and meta-tokens with the dual-embedding space, fuzzy uses meta-tokens with a BM25Okapi statistical model, grep uses direct string and regex search.

TYPE: str

filter_callback

a function returning a boolean to filter in/out the results of the ranker.

TYPE: callable DEFAULT: None

pattern

optional pattern/text search to add on top of AI search

TYPE: str | re.Pattern DEFAULT: None

**kargs

arguments passed as-is to the filter_callback

DEFAULT: {}

RETURNS DESCRIPTION
list

the list of best-matching results as (url, similarity) tuples.

TYPE: list[tuple[str, float]]

get_page ¤
get_page(url: str) -> dict

Retrieve the requested page data object from the index by url.

Warning

For performance’s sake, it doesn’t check whether the url exists in the index. This is not an issue if you feed it the output of self.rank(), but be careful otherwise.

get_related ¤

get_related(post: tuple, n: int = 15) -> list

Get the n closest keywords from the query.

Functions¤

guess_language ¤

guess_language(string: str) -> str

Basic language guesser based on stopwords detection.

Stopwords are the most common words of a language: for each language, we count how many stopwords are found and return the language with the most matches. It is accurate for paragraphs and long documents, less so for short sentences.

RETURNS DESCRIPTION
str

2-letter ISO language code.
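
A minimal sketch; the codes shown in comments are the expected results:

from core import nlp

print(nlp.guess_language("The quick brown fox jumps over the lazy dog, then keeps running."))
# expected: "en"
print(nlp.guess_language("Le renard brun saute par-dessus le chien paresseux, puis continue de courir."))
# expected: "fr"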

Examples¤

Training a language model¤

Assuming you followed the example of the crawler module, write another user script with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import utils
from core import nlp

embedding_set = []

# Open an existing dataset
for post in utils.open_data("ansel"):
    # Use only the content field of a `crawler.web_page` object
    embedding_set.append(post["content"])

# Build the word2vec language model
w2v = nlp.Word2Vec(embedding_set, "word2vec", epochs=200, window=15, min_count=32, sample=0.0005)

# Test word2vec: get the closest words from "free"
print(w2v.wv.most_similar("free"))

This will save a word2vec file into VirtualSecretary/models. To retrieve it later, use:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import nlp

w2v = nlp.Word2Vec.load_model("word2vec")

Training an AI-based search engine indexer¤

Assuming you built the word2vec model above, create another user script with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import utils
from core import nlp

# Create an index with pre-computed word embedding for each page
indexer = nlp.Indexer(utils.open_data("ansel"),
                      "search_engine",
                      nlp.Word2Vec.load_model("word2vec"))

# Do a test search
text_request = "install on linux"
tokenized_request = indexer.tokenize_query(text_request)
vectorized_request = indexer.vectorize_query(tokenized_request)
results = indexer.rank(vectorized_request, nlp.search_methods.AI)

# Display only the 25 best results
for url, similarity in results[0:25]:
    page = indexer.get_page(url)
    print(page["title"], page["excerpt"], page["url"], page["date"], similarity)

The Indexer object is automatically saved to VirtualSecretary/models as a compressed joblib object containing its own Word2Vec language model, so the indexer is standalone. To retrieve it later, use:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import nlp

indexer = nlp.Indexer.load("search_engine")

Training an AI classifier¤

We will use the NPS Chat text corpus. It’s a text corpus of chat messages, labelled by category (like “Yes-No question”, “Wh- question”, “Greeting”, “Statement”, “No answer”, “Yes answer”, etc.). The purpose of the classifier will be to automatically find the label of a new message, by learning the properties of each label from the training corpus.

Create a new user script with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
import nltk
from core import nlp

# Download the training set
nltk.download('nps_chat')

# Build the word2vec language model
embedding_set = [post.text
                 for post in nltk.corpus.nps_chat.xml_posts()]
w2v = nlp.Word2Vec(embedding_set, "word2vec_chat", epochs=2000, window=3)

# Test word2vec
print("free:\n",
      w2v.wv.most_similar(w2v.tokenizer.normalize_token("free", "english")))

# Build the classifier model
training_set = [nlp.Data(post.text, post.get('class')) # (content, label)
                for post in nltk.corpus.nps_chat.xml_posts()]
model = nlp.Classifier(training_set, "chat", w2v, validate=True)

# Classify test messages
test_messages = ["Do you have time to meet at 5 pm ?",
                 "Come with me !",
                 "Nope",
                 "What do you think ?"]

for item in test_messages:
    print(item, model.prob_classify(item))

Output:

free:
[('xbox', 0.3570968210697174),
 ('gam', 0.3551534414291382),
 ('wz', 0.3535629212856293),
 ('howdi', 0.3532298803329468),
 ('anybodi', 0.340751051902771),
 ('against', 0.33561158180236816),
 ('hb', 0.32573479413986206),
 ('yawn', 0.3226745128631592),
 ('tx', 0.32188209891319275),
 ('hiya', 0.31899407505989075)]

accuracy against test set: 0.803030303030303
accuracy against train set: 0.9188166152007172

Do you have time to meet at 5 pm ? ('whQuestion', 0.39620921476180504)
Come with me ! ('Emphasis', 0.46625803160949525)
Nope ('nAnswer', 0.48401087375968443)
What do you think ? ('whQuestion', 0.9756292257900939)

Note

The classifier above was trained with validate=True, which splits the training corpus into 2 sets: an actual training set, used to learn the properties of the labels, and a test set, excluded from training and only used at the end to check whether the model's predictions match the actual labels. This helps with tuning the hyper-parameters of the model.

The accuracies of the model against each set are shown in the terminal output. Values close to 1.0 mean the model gets it right every time. It is expected that the accuracy against the training set will be higher than against the test set. However, if there is a large difference between the two (like 0.65/0.95), it means your model is over-fitting and will lack generality.

When satisfying accuracies have been found, you can retrain the model with validate=False to use all available data for maximum accuracy before using it in production.

The chat model will again be saved automatically in the VirtualSecretary/models folder. Similarly to the previous objects, once saved, it can be retrieved later with:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import nlp

model = nlp.Classifier.load("chat")

Conclusion¤

The NLP objects are designed to be easily saved and later retrieved to be used in filters. The general workflow is as follows:

  1. gather data, either from the Secretary (using the learn filter mode to aggregate email bodies, contacts, comments, etc.) or from the Crawler, to scrape web pages and local documents,
  2. train (offline) a language model (Word2Vec) with the data, with user scripts,
  3. train (offline) a search engine Indexer model or a content Classifier model, depending on your needs, with user scripts,
  4. from your processing filters, retrieve the trained models and process (online) the email contents to decide what actions should be taken. For the Classifier, the probability (confidence) of the label is returned as well and can be used in filters to act only when the confidence is above a threshold.