NLP¤
nlp ¤
High-level natural language processing module for message-like (emails, comments, posts) input.
Supports automatic language detection, word tokenization and stemming for 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'norwegian', 'portuguese', 'spanish' and 'swedish'.
© 2023 - Aurélien Pierre
Classes¤
Tokenizer ¤
Tokenizer(
meta_tokens: dict[re.Pattern : str] = None,
abbreviations: dict[str:str] = None,
stopwords: list[str] = None,
)
Pre-processing pipeline and tokenizer, splitting a string into normalized word tokens.
PARAMETER | DESCRIPTION
---|---
meta_tokens | the pipeline of regular expressions to replace with meta-tokens. Keys must be compiled regular expressions (re.Pattern), values the replacement strings. TYPE: dict[re.Pattern, str]
abbreviations | pipeline of abbreviations to replace. TYPE: dict[str, str]
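For illustration, a custom pipeline can be passed at construction time. This is only a sketch: the pattern, abbreviations and stopwords below are made-up examples, not the module's defaults.

```python
import re
from core import nlp

# Hypothetical customization: replace ticket references with a meta-token,
# expand a couple of abbreviations and declare a few stopwords.
tokenizer = nlp.Tokenizer(
    meta_tokens={re.compile(r"#\d+"): " _TICKET_ "},
    abbreviations={"w/": "with", "approx.": "approximately"},
    stopwords=["the", "a", "an"],
)
```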
Attributes¤
characters_cleanup (class-attribute, instance-attribute) ¤
characters_cleanup: dict[re.Pattern : str] = {
MULTIPLE_DOTS: "...",
MULTIPLE_DASHES: "-",
MULTIPLE_QUESTIONS: "?",
REPEATED_CHARACTERS: " ",
BB_CODE: " ",
MARKUP: " \\1 ",
BASE_64: " ",
}
Dictionary of regular expressions (keys) to find and replace with the provided strings (values). Cleans up repeated characters, including ellipses and question marks, leftover BBCode and XML markup, base64-encoded strings and French pronominal contractions (e.g. “me + a” contracted into “m’a”).
internal_meta_tokens (class-attribute, instance-attribute) ¤
internal_meta_tokens: dict[re.Pattern : str] = {
HASH_PATTERN_FAST: "_HASH_",
NUMBER_PATTERN_FAST: "_NUMBER_",
}
Dictionary of regular expressions (keys) to find in full tokens and replace with meta-tokens (values). Uses simplified regex patterns for performance.
Functions¤
prefilter ¤
Tokenizers split words based on unsupervised machine-learned models and sometimes produce unexpected splits. For example, in emails and user handles like @user, they would split @ and user into 2 different tokens, making it impossible to detect usernames in single tokens later. To avoid that, we replace data of interest with meta-tokens before tokenization, using regular expressions.
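As a minimal sketch of the idea, assuming a hypothetical _USER_ meta-token and a simplified handle regex (the module's actual patterns differ):

```python
import re

# Hypothetical pattern and meta-token, for illustration only
USER_HANDLE = re.compile(r"@[\w.-]+")

def prefilter_sketch(text: str) -> str:
    # Replace the whole handle with one meta-token so the tokenizer
    # cannot split "@" and "user" apart later.
    return USER_HANDLE.sub(" _USER_ ", text)

print(prefilter_sketch("thanks @user for the report"))
# thanks  _USER_  for the report
```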
lemmatize ¤
Find the root (lemma) of words to help topical generalization.
normalize_token ¤
Return normalized, lemmatized and stemmed word tokens, where dates, times, digits, monetary units and URLs have their actual value replaced by meta-tokens designating their type. Stopwords (“the”, “a”, etc.) and punctuation are replaced by None, which should be filtered out at the next step.
PARAMETER | DESCRIPTION
---|---
word | tokenized word, in lower case only. TYPE: str
language | the language used to detect dates. TYPE: str
vocabulary |
Examples:
- 10:00, 10 h, 10am or 10 am will all be replaced by a _TIME_ meta-token.
- feb, February, feb. or monday will all be replaced by a _DATE_ meta-token.
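A short usage sketch, following the two-argument call used in the examples at the bottom of this page; the output values are illustrative, not guaranteed:

```python
from core import nlp

tokenizer = nlp.Tokenizer()

# Normalize the tokens of an already-split, lower-cased sentence, then drop
# the None entries produced for stopwords and punctuation.
raw_tokens = ["meeting", "at", "10am", "on", "monday", "."]
normalized = (tokenizer.normalize_token(t, "en") for t in raw_tokens)
tokens = [t for t in normalized if t is not None]
# Illustrative result: something like ["meet", "_TIME_", "_DATE_"]
```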
tokenize_sentence ¤
Split a sentence into normalized word tokens and meta-tokens.
PARAMETER | DESCRIPTION
---|---
sentence | the input single sentence. TYPE: str
language | the language string to be used by the tokenizer. It needs to be one of those supported by the core.nlp module. TYPE: str
meta_tokens | find meta-tokens through regular expressions and replace them in the text. This helps tokenization keep similar objects together, especially dates that would otherwise be split.

RETURNS | DESCRIPTION
---|---
tokens | the list of normalized tokens.
split_sentences ¤
tokenize_document ¤
Cleanup and tokenize a document or a sentence as an atomic element, meaning we don’t split it into sentences. Use this either for search-engine purposes (into a document’s body) or if the document is already split into sentences. The document text needs to have been prepared and cleaned, which means:
- lowercased (optional but recommended) with str.lower(),
- translated from Unicode to ASCII (optional but recommended) with [utils.typography_undo()][],
- cleaned up for sequences of whitespaces with [utils.cleanup_whitespaces()][].
Note
the language is detected internally if not provided as an optional argument. When processing a single sentence extracted from a document, instead of the whole document, it is more accurate to run the language detection on the whole document ahead of calling this method, and pass the result here.
PARAMETER | DESCRIPTION
---|---
document | the text of the document to tokenize. TYPE: str
language | the language of the document. Will be internally inferred if not given. TYPE: str

RETURNS | DESCRIPTION
---|---
tokens | a 1D list of normalized tokens and meta-tokens.
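For instance, a document can be prepared with the steps listed above and then tokenized in one call; a sketch, assuming the utils helpers take and return plain strings:

```python
from core import nlp, utils

tokenizer = nlp.Tokenizer()

raw = "Rendez-vous at the café on Monday at 10am, see https://example.com"

# Preparation steps recommended above
text = raw.lower()                      # lowercase (optional but recommended)
text = utils.typography_undo(text)      # translate Unicode typography to ASCII
text = utils.cleanup_whitespaces(text)  # collapse whitespace sequences

# Tokenize the whole document as one atomic element (no sentence splitting);
# the language is detected internally since it is not passed here.
tokens = tokenizer.tokenize_document(text)
```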
tokenize_per_sentence ¤
Cleanup and tokenize a whole document as a list of sentences, meaning we split it into sentences before tokenizing. Use this to train a Word2Vec (embedding) model so each token is properly embedded into its syntactic context. The document text needs to have been prepared and cleaned, which means:
- lowercased (optional but recommended) with str.lower(),
- translated from Unicode to ASCII (optional but recommended) with [utils.typography_undo()][],
- cleaned up for sequences of whitespaces with [utils.cleanup_whitespaces()][].
Note
the language is detected internally.
RETURNS | DESCRIPTION
---|---
tokens | a 2D list of sentences (1st axis), each containing a list of normalized tokens and meta-tokens (2nd axis).
Data ¤
Word2Vec ¤
Word2Vec(
sentences: list[str],
name: str = "word2vec",
vector_size: int = 300,
epochs: int = 200,
window: int = 5,
min_count=5,
sample=0.0005,
tokenizer: Tokenizer = None,
)
Bases: gensim.models.Word2Vec
Train, re-train or retrieve an existing word2vec word embedding model
PARAMETER | DESCRIPTION
---|---
name | filename of the model to save and retrieve. If the model exists already, we automatically load it. TYPE: str
vector_size | number of dimensions of the word vectors. TYPE: int
epochs | number of training iterations for the machine learning. Small corpora need 2000 epochs or more. Increases the learning time. TYPE: int
window | size of the token collocation window to detect. TYPE: int
Functions¤
get_word ¤
get_wordvec ¤
Return the vector associated with a word, through a dictionary of words.
PARAMETER | DESCRIPTION
---|---
word | the word to convert to a vector. TYPE: str
embed |

A Dual Embedding Space Model for Document Ranking (2016), Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana. https://arxiv.org/pdf/1602.01137.pdf

RETURNS | DESCRIPTION
---|---
np.ndarray[np.float32] or None | the nD vector if the word was found in the dictionary, or None.
get_features ¤
Calls core.nlp.Word2Vec.get_wordvec over a list of tokens and returns a single vector representing the whole list.
PARAMETER | DESCRIPTION
---|---
tokens | list of text tokens.
embed |

RETURNS | DESCRIPTION
---|---
np.ndarray[np.float32] | the centroid of the word embedding vectors associated with the input tokens (aka the average vector), or the null vector if no word from the list was found in the dictionary.
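A sketch of the centroid computation described above, reimplemented with NumPy purely for illustration; the actual method may differ in details such as normalization:

```python
import numpy as np

def features_sketch(tokens: list[str], wordvecs: dict[str, np.ndarray],
                    size: int = 300) -> np.ndarray:
    # Collect the embedding of every token found in the vocabulary
    vectors = [wordvecs[t] for t in tokens if t in wordvecs]
    if not vectors:
        # No known word: return the null vector
        return np.zeros(size, dtype=np.float32)
    # Centroid = element-wise average of the word vectors
    return np.mean(vectors, axis=0).astype(np.float32)
```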
Classifier ¤
Classifier(
training_set: list[Data],
name: str,
word2vec: Word2Vec,
validate: bool = True,
variant: str = "svm",
)
Bases: nltk.classify.SklearnClassifier
Handle the word2vec and SVM machine-learning
PARAMETER | DESCRIPTION
---|---
training_set | list of Data elements. If the list is empty, it will try to find a pre-trained model matching the name parameter. TYPE: list[Data]
path | path to save the trained model for reuse, as a Python joblib.
name | name under which the model will be saved for later reuse. TYPE: str
word2vec | the instance of word embedding model. TYPE: Word2Vec
validate | if True, split the training set into a training subset and a test subset to measure the accuracy of the model (see the note in the examples below). TYPE: bool
variant | TYPE: str. DEFAULT: 'svm'
features | the number of model features (dimensions) to retain. This sets the number of dimensions for the word vectors found by word2vec, which will also be the number of dimensions in the last training layer. TYPE: int
Indexer ¤
Search engine based on word similarity.
PARAMETER | DESCRIPTION
---|---
training_set | list of Data elements. If the list is empty, it will try to find a pre-trained model matching the name parameter. TYPE: list[Data]
path | path to save the trained model for reuse, as a Python joblib.
name | name under which the model will be saved for later reuse. TYPE: str
word2vec | the instance of word embedding model. TYPE: Word2Vec
validate | if True, split the training set into a training subset and a test subset to measure the accuracy of the model. TYPE: bool
variant | TYPE: str
features | the number of model features (dimensions) to retain. This sets the number of dimensions for the word vectors found by word2vec, which will also be the number of dimensions in the last training layer. TYPE: int
Functions¤
get_features_parallel ¤
Thread-safe call to .get_features(), to be called in a multiprocessing.Pool map.
load (classmethod) ¤
Load an existing trained model by its name from the ../models folder.
vectorize_query ¤
rank ¤
rank(
query: str | tuple | re.Pattern,
method: search_methods,
filter_callback: callable = None,
pattern: str | re.Pattern = None,
**kargs
) -> list[tuple[str, float]]
Rank the indexed documents against a query, based on the trained model.
PARAMETER | DESCRIPTION
---|---
query | the query to search. TYPE: str, tuple or re.Pattern
method | TYPE: search_methods
filter_callback | a function returning a boolean to filter in/out the results of the ranker. TYPE: callable
pattern | optional pattern/text search to add on top of the AI search. TYPE: str or re.Pattern
**kargs | arguments passed as-is. DEFAULT: {}

RETURNS | DESCRIPTION
---|---
list | the list of best-matching results as (url, similarity) tuples.
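As an illustrative sketch, the results could be narrowed with a filter callback. Note that the argument(s) the ranker actually passes to filter_callback are an assumption here (the candidate's URL); check the core.nlp source before relying on this:

```python
from core import nlp

# Load a previously trained index (see the examples at the bottom of this page)
indexer = nlp.Indexer.load("search_engine")
query = indexer.vectorize_query(indexer.tokenize_query("install on linux"))

def docs_only(url, **kargs) -> bool:
    # Hypothetical filter: keep only results whose URL contains "/doc/"
    return "/doc/" in url

results = indexer.rank(query, nlp.search_methods.AI, filter_callback=docs_only)
```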
get_page ¤
Retrieve the requested page data object from the index by url.
Warning
For performance’s sake, it doesn’t check whether the url exists in the index. This is no issue if you feed it the output of self.rank(), but be careful otherwise.
get_related ¤
Get the n closest keywords from the query.
Functions¤
guess_language ¤
Basic language guesser based on stopwords detection.
Stopwords are the most common words of a language: for each language, we count how many stopwords we found and return the language having the most matches. It is accurate for paragraphs and long documents, not so much for short sentences.
RETURNS | DESCRIPTION
---|---
str | 2-letter ISO 639-1 language code.
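A sketch of the counting approach described above. The stopword lists here come from NLTK purely as a stand-in, and the sketch returns a language name rather than a 2-letter code; the module's own implementation and word lists differ:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stand-in word lists

def guess_language_sketch(text: str) -> str:
    words = set(text.lower().split())
    # Count, for each candidate language, how many of its stopwords appear
    scores = {lang: len(words & set(stopwords.words(lang)))
              for lang in ("english", "french", "german", "spanish")}
    # Return the language with the most matches
    return max(scores, key=scores.get)

print(guess_language_sketch("le chat est sur la table et il dort"))  # french
```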
Examples¤
Training a language model¤
Assuming you followed the example of the crawler module, write another user script with:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import utils
from core import nlp
embedding_set = []
# Open an existing dataset
for post in utils.open_data("ansel"):
# Use only the content field of a `crawler.web_page` object
embedding_set.append(post["content"])
# Build the word2vec language model
w2v = nlp.Word2Vec(embedding_set, "word2vec", epochs=200, window=15, min_count=32, sample=0.0005)
# Test word2vec: get the closest words from "free"
print(w2v.wv.most_similar("free"))
This will save a word2vec file into VirtualSecretary/models. To retrieve it later, use:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import nlp
w2v = nlp.Word2Vec.load_model("word2vec")
Training an AI-based search engine indexer¤
Assuming you built the word2vec
model above, create another user script with:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import utils
from core import nlp
# Create an index with pre-computed word embedding for each page
indexer = nlp.Indexer(utils.open_data("ansel"),
"search_engine",
nlp.Word2Vec.load_model("word2vec"))
# Do a test search
text_request = "install on linux"
tokenized_request = indexer.tokenize_query(text_request)
vectorized_request = indexer.vectorize_query(tokenized_request)
results = indexer.rank(vectorized_request, nlp.search_methods.AI)
# Display only the 25 best results
for url, similarity in results[0:25]:
page = indexer.get_page(url)
print(page["title"], page["excerpt"], page["url"], page["date"], similarity)
The Indexer object is automatically saved to VirtualSecretary/models
as a compressed joblib object containing its own Word2Vec language model, so the indexer is standalone. To retrieve it later, use:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import nlp
indexer = nlp.Indexer.load("search_engine")
Training an AI classifier¤
We will use the NPS Chat text corpus. It’s a text corpus of chat messages, labelled by category (like “Yes-No question”, “Wh- question”, “Greeting”, “Statement”, “No answer”, “Yes answer”, etc.). The purpose of the classifier will be to automatically find the label of a new message, by learning the properties of each label into the training corpus.
Create a new user script with:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
import nltk
from core import nlp
# Download the training set
nltk.download('nps_chat')
# Build the word2vec language model
embedding_set = [post.text
                 for post in nltk.corpus.nps_chat.xml_posts()]
w2v = nlp.Word2Vec(embedding_set, "word2vec_chat", epochs=2000, window=3)
# Test word2vec
print("free:\n",
w2v.wv.most_similar(w2v.tokenizer.normalize_token("free", "en")))
# Build the classifier model
training_set = [nlp.Data(post.text, post.get('class')) # (content, label)
for post in nltk.corpus.nps_chat.xml_posts()]
model = nlp.Classifier(training_set, "chat", w2v, validate=True)
# Classify test messages
test_messages = ["Do you have time to meet at 5 pm ?",
"Come with me !",
"Nope",
"What do you think ?"]
for item in test_messages:
print(item, model.prob_classify(item))
Output:
free:
[('xbox', 0.3570968210697174),
('gam', 0.3551534414291382),
('wz', 0.3535629212856293),
('howdi', 0.3532298803329468),
('anybodi', 0.340751051902771),
('against', 0.33561158180236816),
('hb', 0.32573479413986206),
('yawn', 0.3226745128631592),
('tx', 0.32188209891319275),
('hiya', 0.31899407505989075)]
accuracy against test set: 0.803030303030303
accuracy against train set: 0.9188166152007172
Do you have time to meet at 5 pm ? ('whQuestion', 0.39620921476180504)
Come with me ! ('Emphasis', 0.46625803160949525)
Nope ('nAnswer', 0.48401087375968443)
What do you think ? ('whQuestion', 0.9756292257900939)
Note
The classifier above was trained with validate=True, which splits the training corpus into 2 sets: an actual training set, used to extract the properties of labels, and a test set, discarded from the training and only used at the end to check whether the model prediction matches the actual label. This helps with tuning the hyper-parameters of the model.
The accuracies of the model against each set are shown in the terminal output. Values close to 1.0 mean the model gets it right every time. It is expected that the accuracy against the training set will be higher than against the test set. However, if there is a large difference between the two (like 0.65/0.95), it means your model is over-fitting and will lack generality.
When satisfactory accuracies have been reached, you can retrain the model with validate=False to use all available data for maximum accuracy before using it in production.
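For instance, reusing the training set and word2vec model from the script above, the final model would be retrained with something like:

```python
# Retrain on all available data once the hyper-parameters are settled
model = nlp.Classifier(training_set, "chat", w2v, validate=False)
```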
The chat model will again be saved automatically in the VirtualSecretary/models folder. As with the previous objects, once saved, it can be retrieved later with:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import nlp
model = nlp.Classifier.load("chat")
Conclusion¤
The NLP objects are designed to be easily saved and later retrieved to be used in filters. The general workflow is as follows:
- gather data, either from the Secretary (using the learn filter mode to aggregate email bodies, contacts, comments, etc.) or from the Crawler, to scrape web pages and local documents,
- train (offline) a language model (Word2Vec) with the data, with user scripts,
- train (offline) a search engine Indexer model or a content Classifier model, depending on your needs, with user scripts,
- from your processing filters, retrieve the trained models and process (online) the email contents to decide what actions should be taken. For the Classifier, the probability (confidence) of the label is returned as well and can be used in filters to act only when the confidence is above a threshold.