core.search
Classes
core.search.BM25PlusCSR
BM25PlusCSR(
corpus: list[list[int]],
word2vec,
k1: float = 1.7,
b: float = 0.3,
delta: float = 0.65,
)
BM25+ with CSR inverted index:

- doc_ids / tfs stored in contiguous arrays
- indptr for token → posting list slicing
- fully vectorized scoring
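As an illustration of the layout described above, here is a minimal sketch of a CSR inverted index with BM25+ scoring. It is not the class's actual implementation: the helper names `build_csr` and `bm25plus_scores`, the `vocab_size` argument and the IDF variant `ln((N + 1) / df)` are assumptions made for the example.

```python
import numpy as np

def build_csr(corpus, vocab_size):
    """Postings of token t live in doc_ids[indptr[t]:indptr[t + 1]] (same for tfs)."""
    postings = [dict() for _ in range(vocab_size)]
    for doc_id, doc in enumerate(corpus):
        for tok in doc:
            postings[tok][doc_id] = postings[tok].get(doc_id, 0) + 1
    indptr = np.zeros(vocab_size + 1, dtype=np.int64)
    indptr[1:] = np.cumsum([len(p) for p in postings])
    total = int(indptr[-1])
    doc_ids = np.fromiter((d for p in postings for d in p), dtype=np.int32, count=total)
    tfs = np.fromiter((c for p in postings for c in p.values()), dtype=np.float32, count=total)
    doc_len = np.array([len(d) for d in corpus], dtype=np.float32)
    return indptr, doc_ids, tfs, doc_len

def bm25plus_scores(query, indptr, doc_ids, tfs, doc_len, k1=1.7, b=0.3, delta=0.65):
    n_docs = len(doc_len)
    # Length normalization term, computed once for every document.
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    scores = np.zeros(n_docs, dtype=np.float32)
    for tok in query:
        lo, hi = indptr[tok], indptr[tok + 1]
        if lo == hi:  # token absent from the corpus
            continue
        ids, tf = doc_ids[lo:hi], tfs[lo:hi]
        idf = np.log((n_docs + 1) / (hi - lo))  # assumed IDF variant
        # Each document appears once per posting list, so += is safe here.
        scores[ids] += idf * (tf * (k1 + 1.0) / (tf + norm[ids]) + delta)
    return scores

corpus = [[0, 1, 1], [2, 3], [1, 4]]
indptr, doc_ids, tfs, doc_len = build_csr(corpus, vocab_size=5)
scores = bm25plus_scores([1], indptr, doc_ids, tfs, doc_len)
```

Only the posting lists of the query tokens are touched, which is what makes the contiguous layout attractive for sparse queries.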
core.search.Indexer
Indexer(
db: sqlite3.Connection,
name: str,
word2vec: Word2Vec,
strip_collocations: bool = False,
principal_components: int = 1,
)
Search engine based on word similarity.
| PARAMETER | DESCRIPTION |
|---|---|
| db | Opened SQLite database containing at least a […] TYPE: sqlite3.Connection |
| name | name under which the model will be saved for later reuse. TYPE: str |
| word2vec | the instance of the word embedding model. TYPE: Word2Vec |
| strip_collocations | remove the matrix of collocations in documents, which is the list of word tokens represented by their index in the word2vec dictionary. It is used for core.search.Indexer.find_query_pattern, which is optional and significantly slower (but not significantly better), so if you don’t plan on using it, removing collocations saves some RAM and I/O. TYPE: bool |
| principal_components | number of principal components to compute and remove from the index dataset. This helps to make queries more selective and specific in the presence of boilerplate text and formatting language in the sampling. TYPE: int |
NOTE
The class is optimized to run online, on a server: it loads fast when spawning a new server-side worker and uses RAM sparingly.
Attributes
core.search.Indexer.sql
instance-attribute
sql: str = ''
Cache the previous SQL filtering conditions
core.search.Indexer.word2vec
instance-attribute
word2vec: Word2Vec = word2vec
Word2Vec embedding language model
core.search.Indexer.collocations
instance-attribute
Store the list of document tokens encoded by their index number in the Word2Vec vocabulary. Unknown tokens are discarded. This gives a symbolic and more compact representation of token collocations in documents (32 bits/token).
Documents are on the first axis.
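A minimal sketch of that encoding, assuming a plain dictionary as the vocabulary. The `vocab` contents and the `encode_tokens` helper are hypothetical stand-ins, not the library's own conversion routine.

```python
import numpy as np

# Hypothetical vocabulary: token -> row index in the embedding matrix.
vocab = {"lens": 0, "sharpness": 1, "raw": 2, "denoise": 3}

def encode_tokens(tokens, vocab):
    """Encode tokens as int32 vocabulary indices, dropping unknown tokens."""
    return np.array([vocab[t] for t in tokens if t in vocab], dtype=np.int32)

docs = [["raw", "denoise", "the", "lens"], ["sharpness", "of", "lens"]]
# One int32 array per document: documents are on the first axis.
collocations = [encode_tokens(d, vocab) for d in docs]
```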
core.search.Indexer.ranker
instance-attribute
ranker: BM25PlusCSR = BM25PlusCSR(
corpus_token_indices, self.word2vec, k1=1.8, b=0.4, delta=0.8
)
BM25+ CSR ranker (TF-IDF).
core.search.Indexer.pc
instance-attribute
Principal component(s) of the dataset vectors (normalized)
core.search.Indexer.vectors
instance-attribute
Store the list of document-wise vector embeddings, where each vector represents the normalized centroid of the token vectors contained in the document. Documents are on the first axis.
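The normalized centroid can be sketched as follows. The random `embeddings` matrix and the `document_vector` helper are assumptions for the example; the real class may weight or filter tokens differently.

```python
import numpy as np

def document_vector(token_vectors):
    """Mean of the token vectors, rescaled to unit L2 norm."""
    centroid = token_vectors.mean(axis=0)
    norm = np.linalg.norm(centroid)
    return centroid / norm if norm > 0 else centroid

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 4))      # hypothetical 5-token, 4-dimensional model
docs = [np.array([2, 3, 0]), np.array([1, 0])]  # documents as vocabulary indices
vectors = np.stack([document_vector(embeddings[d]) for d in docs])
```

With unit-norm rows, the cosine similarity between a query vector and every document reduces to a single matrix-vector product.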
core.search.Indexer.index
instance-attribute
LUT of document URLs as ordered when building the ranker, lazily loaded from the database.
core.search.Indexer.url_to_index
instance-attribute
Reverse LUT of self.index
Functions
core.search.Indexer.init_search_table
init_search_table(db: sqlite3.Connection)
Create or migrate the one-row search cache table.
core.search.Indexer.init_stats_table
init_stats_table(db: sqlite3.Connection)
Create or migrate the plain-SQLite stats table.
Scalar stats use item = ''. Grouped stats use name for the metric
family and item for the domain/category/language key.
core.search.Indexer.init_categories_table
init_categories_table(db: sqlite3.Connection)
Create a queryable catalog of all page categories.
core.search.Indexer.init_pages_search_indexes
init_pages_search_indexes(db: sqlite3.Connection)
Add persistent indexes for the user-facing search filters.
The URL primary key already gives us an index for URL lookups. The
category indexes help equality filters and provide a stable ordering
path for category-only queries. Substring filters such as
NOT LIKE '%github.com%', instr(parsed, ?), and REGEXP cannot use a
normal B-tree index, so filter_contents() narrows those to ranked
candidate URLs before SQLite evaluates them.
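The difference between the two kinds of filters can be checked with EXPLAIN QUERY PLAN. The sketch below uses an assumed, simplified schema: the `category` column and the index name `idx_pages_category` are hypothetical, and the real table has more columns.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified, assumed schema for illustration only.
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, category TEXT, parsed TEXT)")
db.execute("CREATE INDEX idx_pages_category ON pages (category)")

def plan(sql, params=()):
    """Return the query planner's description of how it would run `sql`."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)

# Equality filter: the planner can seek into the B-tree index.
equality_plan = plan("SELECT url FROM pages WHERE category = ?", ("docs",))
# Substring filter: every row has to be visited.
substring_plan = plan("SELECT url FROM pages WHERE url NOT LIKE '%github.com%'")
```

Printing both plans shows a SEARCH step for the equality filter and a SCAN step for the substring filter, which is why narrowing to candidate URLs first pays off.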
core.search.Indexer.save_search_values
save_search_values(db: sqlite3.Connection, values: dict)
Store one or more precomputed search cache values in the database.
core.search.Indexer.clear_search_cache
clear_search_cache(db: sqlite3.Connection)
Clear current heavy cache values before writing the refreshed cache.
core.search.Indexer.get_search_values
get_search_values(db: sqlite3.Connection, columns: list[str]) -> any
Fetch one or more precomputed search cache values from the database.
core.search.Indexer.get_legacy_search_values
get_legacy_search_values(db: sqlite3.Connection, columns: list[str]) -> any
Fetch legacy cache columns when loading an old DB before migration.
core.search.Indexer.save_search_index
save_search_index(db: sqlite3.Connection, index: list[str])
Store the URL lookup table into search.doc_index.
core.search.Indexer.save_search_ranker
save_search_ranker(db: sqlite3.Connection, ranker: BM25PlusCSR)
Store the BM25+ CSR ranker state into search.
core.search.Indexer.save_search_vectors
save_search_vectors(db: sqlite3.Connection, vectors: np.ndarray)
Store the single global vector matrix into search.vectors.
core.search.Indexer.save_search_word2vec
save_search_word2vec(db: sqlite3.Connection)
Store Word2Vec vocabulary and embedding matrices into the search cache.
core.search.Indexer.save_search_tokenizer
save_search_tokenizer(db: sqlite3.Connection)
Store the tokenizer’s large n-gram state in the search cache.
core.search.Indexer.save_search_stats
save_search_stats(db: sqlite3.Connection, stats: dict)
Store cheap display and diagnostic metadata in plain SQLite rows.
core.search.Indexer.save_categories_index
save_categories_index(db: sqlite3.Connection, category_counts: dict[str, int])
Store all existing non-empty categories and their page counts.
core.search.Indexer.build_stats
build_stats(db: sqlite3.Connection) -> dict
Compute index metadata while the database is writable/prepared.
core.search.Indexer.array_to_raw
staticmethod
Store arrays without the .npy wrapper used by SQLite converters.
Raw contiguous blobs hydrate faster with np.frombuffer() at runtime.
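A minimal round trip through a raw blob might look like this. The helpers `to_raw` and `from_raw` are hypothetical, and storing the dtype and shape next to the blob is an assumption about how the metadata is kept.

```python
import numpy as np

def to_raw(arr):
    """Return the raw bytes plus the metadata needed to rebuild the array."""
    arr = np.ascontiguousarray(arr)
    return arr.tobytes(), str(arr.dtype), arr.shape

def from_raw(blob, dtype, shape):
    """Zero-copy view over the blob: no .npy header to parse."""
    return np.frombuffer(blob, dtype=dtype).reshape(shape)

original = np.arange(12, dtype=np.float32).reshape(3, 4)
blob, dtype, shape = to_raw(original)
restored = from_raw(blob, dtype, shape)
```

The array returned by np.frombuffer() shares memory with the bytes object and is read-only, so copy it before modifying it in place.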
core.search.Indexer.normalize_pc
Remove the principal component of the dataset from the vector. This helps remove stopwords, webpage boilerplate (menus, sidebars), formatting language and SEO junk, and makes cosine similarity between query and documents more specific.
Taken from A simple but tough-to-beat baseline for sentence embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx
| PARAMETER | DESCRIPTION |
|---|---|
| vector | can be a single vector (1D) or a document-wise stack of vectors (2D). We always consider the embedding vector to be on the last axis, document-wise vectors should be vertically stacked. |

Return: normalized vector
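The projection step from the cited paper can be sketched with a singular value decomposition. The helpers `principal_components` and `remove_pc` are hypothetical names, and this sketch does not rescale the result to unit length afterwards, which the actual method may do.

```python
import numpy as np

def principal_components(vectors, n=1):
    """First n right singular vectors of the dataset (unit norm, one per row)."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    return vt[:n]

def remove_pc(vector, pc):
    """Subtract the projection onto each component; works for 1D and 2D input."""
    return vector - (vector @ pc.T) @ pc

rng = np.random.default_rng(0)
common = rng.normal(size=8)                    # direction shared by all documents
dataset = rng.normal(size=(20, 8)) + 3.0 * common
pc = principal_components(dataset, n=1)
cleaned = remove_pc(dataset, pc)               # 2D: stack of document vectors
cleaned_query = remove_pc(dataset[0], pc)      # 1D: single vector
```

After the subtraction, every vector is orthogonal to the removed component, so whatever all documents share no longer inflates their mutual cosine similarity.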
core.search.Indexer.filter_contents
filter_contents(
db: sqlite3.Connection,
sql_query: str = "",
sql_params: list[str] | None = None,
candidate_indices: np.ndarray | list[int] | None = None,
) -> list[int]
Filter pages by arbitrary SQL queries
core.search.Indexer.load
classmethod
load(name: str, db: sqlite3.Connection)
Load an existing trained model by its name from the ../models folder.
core.search.Indexer.tokenize_query
tokenize_query(
query: str,
language: str | None = None,
meta_tokens: bool = True,
n_grams: bool = True,
) -> list[str]
Tokenize a query string, returning only tokens known to our vocabulary.
core.search.Indexer.vectorize_query
core.search.Indexer.find_query_pattern
find_query_pattern(
indexed_query: np.ndarray[np.int32],
documents: list[tuple[int, str, float]],
fast: bool = False,
) -> list[tuple[int, str, float]]
The ranking methods treat documents as continuous bags of words (CBOW). As such, they are good for topic extraction (aboutness), but they ignore word collocations and ordering, and therefore lose syntactic meaning.
This method adds an additional layer of detection using convolution filters that detect word sequences, direct or reversed, and corrects the similarity factor set by the other ranking methods using that collocation factor.
Its major drawback is that it is 100 to 500 times slower than the other rankers, due to 2D convolutions. It therefore needs to run on a subset of the search index, after the other methods, to refine a previous ranking.
| PARAMETER | DESCRIPTION |
|---|---|
| indexed_query | the search query tokens translated into their integer indices in the Word2Vec vocabulary. Use core.nlp.Word2Vec.tokens_to_indices to convert the tokenized query. |
| documents | a symbolic list of documents, as a […] |
| fast | if […] TYPE: bool |
References
Text Matching as Image Recognition, Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. (2016). https://arxiv.org/pdf/1602.06359.pdf
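To make the idea concrete, here is a deliberately simplified sketch of sequence detection on a query/document interaction matrix. It is not the method's actual filter bank: the `collocation_score` helper, the exact-match interaction matrix and the two diagonal kernels are assumptions chosen to keep the example short.

```python
import numpy as np

def collocation_score(query_idx, doc_idx):
    """Best fraction of the query found as a contiguous run, direct or reversed."""
    q, d = np.asarray(query_idx), np.asarray(doc_idx)
    n = len(q)
    if n == 0 or len(d) < n:
        return 0.0
    # Interaction matrix: match[i, j] is 1 when query token i equals doc token j.
    match = (q[:, None] == d[None, :]).astype(np.float32)
    # Every n x n window sliding along the document axis.
    windows = np.lib.stride_tricks.sliding_window_view(match, (n, n))
    direct = np.eye(n, dtype=np.float32)   # diagonal kernel: same word order
    reverse = direct[::-1]                 # anti-diagonal kernel: reversed order
    direct_hits = (windows * direct).sum(axis=(-1, -2))
    reverse_hits = (windows * reverse).sum(axis=(-1, -2))
    return float(max(direct_hits.max(), reverse_hits.max())) / n

query = [3, 7, 9]
in_order = collocation_score(query, [1, 3, 7, 9, 2])
reversed_order = collocation_score(query, [1, 9, 7, 3, 2])
scattered = collocation_score(query, [3, 1, 7, 2, 9])
```

Even this reduced version builds one matrix per document, which hints at why the full method is only practical on a short list of candidates.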
core.search.Indexer.rrf
Reciprocal Rank Fusion
Aggregate 2 sets of page rankings obtained from different semantic geometries and weighted differently.
From Reciprocal rank fusion outperforms condorcet and individual rank learning methods, Gordon V. Cormack, Charles L A Clarke, Stefan Buettcher. https://dl.acm.org/doi/10.1145/1571941.1572114
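A minimal sketch of the fusion rule from the cited paper, where each document receives the sum of `weight / (k + rank)` over the rankings in which it appears. The function name, the `weights` argument and the default `k = 60` (the constant used in the paper) are assumptions about how this method is parameterized.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document ids into one list of (doc, score)."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

lexical = ["a", "b", "c"]     # e.g. order produced by a BM25+ ranker
semantic = ["b", "c", "a"]    # e.g. order produced by embedding similarity
fused = reciprocal_rank_fusion([lexical, semantic])
fused_order = [doc for doc, _ in fused]
```

Because only ranks are used, the two input scores do not need to share a scale, which is convenient when mixing a TF-IDF-style score with a cosine similarity.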
core.search.Indexer.rank
rank(
db: sqlite3.Connection,
tokens: list[str],
method: search_methods,
n_results: int = 500,
fine_search: bool = False,
sql_query: str = "",
sql_params: list[str] = [],
) -> list[tuple[int, str, float]]
Rank the indexed documents against a tokenized query and return the best matches.
| PARAMETER | DESCRIPTION |
|---|---|
| db | the SQLite database holding the indexed set of documents. This database must be in sync with the one used to instantiate this class regarding the row ordering of documents, otherwise rowid mismatches are to be expected between fuzzy, AI and regex searches. TYPE: sqlite3.Connection |
| tokens | the tokenized query. TYPE: list[str] |
| method | TYPE: search_methods |
| n_results | number of results to retain. TYPE: int |
| fine_search | optionally refine the search using a 2D interaction matrix. See [1] TYPE: bool |
| sql_query | SQL query to narrow down the search, for example […] TYPE: str |
| sql_params | the SQL parameters such that: […] where each sql_params item is matched in the sql_query by a ?. For example: sql_params = ['google.com'] will filter all URLs from Google. |
Note
Both SQL search into the database and Python filtering into the index are supported, and can be combined. The local index is a partial copy of the database and is already a Python object, so it will be faster to filter if you only need to parse the copied data to filter in/out.
| RETURNS | DESCRIPTION |
|---|---|
| list | the list of best-matching results as (rank, url, similarity) tuples. |
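The pairing between sql_query placeholders and sql_params can be illustrated with plain sqlite3. This is only a sketch: the one-column `pages` table, the `LIKE` condition and the way the condition is spliced into a SELECT are assumptions, not the statement that rank() actually builds.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY)")  # assumed minimal schema
db.executemany(
    "INSERT INTO pages VALUES (?)",
    [("https://google.com/maps",), ("https://example.org/",), ("https://docs.google.com/x",)],
)

# Hypothetical filter: one ? placeholder per item in sql_params.
sql_query = "url LIKE '%' || ? || '%'"
sql_params = ["google.com"]

rows = db.execute(
    f"SELECT url FROM pages WHERE {sql_query} ORDER BY url", sql_params
).fetchall()
urls = [row[0] for row in rows]
```

Binding values through placeholders keeps user input out of the SQL text itself, which matters when the filter values come from a search form.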