core.search¤

Classes¤

core.search.BM25PlusCSR ¤

BM25PlusCSR(
    corpus: list[list[int]],
    word2vec,
    k1: float = 1.7,
    b: float = 0.3,
    delta: float = 0.65,
)

BM25+ with CSR inverted index:

- doc_ids / tfs stored in contiguous arrays
- indptr for token → posting list slicing
- fully vectorized scoring
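
To make the layout above concrete, here is a minimal sketch of a CSR-style inverted index with BM25+ scoring, assuming documents are lists of integer token ids. The class name `TinyBM25PlusCSR`, the `vocab_size` argument and the internals are illustrative assumptions, not the actual `BM25PlusCSR` implementation; only the default parameter values are copied from the signature above. Scoring is vectorized per posting list, with a plain loop over query tokens.

```python
import numpy as np


class TinyBM25PlusCSR:
    """Illustrative BM25+ ranker over a CSR inverted index (hypothetical)."""

    def __init__(self, corpus, vocab_size, k1=1.7, b=0.3, delta=0.65):
        self.k1, self.delta = k1, delta
        self.n_docs = len(corpus)
        doc_len = np.array([len(doc) for doc in corpus], dtype=np.float64)
        # Document-length normalization, precomputed once per document
        self.norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())

        # Collect (token, doc, term frequency) triplets, sorted by token
        triplets = []
        for doc_id, doc in enumerate(corpus):
            tokens, counts = np.unique(doc, return_counts=True)
            triplets.extend((int(t), doc_id, int(c)) for t, c in zip(tokens, counts))
        triplets.sort()

        tokens = np.array([t[0] for t in triplets], dtype=np.int64)
        self.doc_ids = np.array([t[1] for t in triplets], dtype=np.int64)
        self.tfs = np.array([t[2] for t in triplets], dtype=np.float64)
        # indptr[t]:indptr[t + 1] slices the posting list of token t
        self.indptr = np.searchsorted(tokens, np.arange(vocab_size + 1))

        df = np.diff(self.indptr)  # document frequency of each token
        self.idf = np.log((self.n_docs + 1.0) / np.maximum(df, 1))

    def score(self, query):
        """Return one BM25+ score per document for a list of token ids."""
        scores = np.zeros(self.n_docs)
        for token in set(query):
            start, stop = self.indptr[token], self.indptr[token + 1]
            docs, tf = self.doc_ids[start:stop], self.tfs[start:stop]
            gain = (self.k1 + 1.0) * tf / (self.norm[docs] + tf) + self.delta
            scores[docs] += self.idf[token] * gain
        return scores


corpus = [[0, 1, 1, 2], [2, 3], [1, 1, 1, 4, 4, 4]]
ranker = TinyBM25PlusCSR(corpus, vocab_size=5)
scores = ranker.score([1])
```

Documents that do not contain any query token keep a score of zero, since only the posting lists of the query tokens are visited.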

core.search.search_methods ¤

Bases: IntEnum

Search methods available

core.search.Indexer ¤

Indexer(
    db: sqlite3.Connection,
    name: str,
    word2vec: Word2Vec,
    strip_collocations: bool = False,
    principal_components: int = 1,
)

Search engine based on word similarity.

PARAMETER DESCRIPTION
db

Open SQLite database containing at least a pages table of core.types.web_page items.

TYPE: sqlite3.Connection

name

name under which the model will be saved for later reuse.

TYPE: str

word2vec

the instance of word embedding model.

TYPE: Word2Vec

strip_collocations

remove the matrix of collocations in documents, which is the list of word tokens represented by their index in the word2vec dictionary. It is used by core.search.Indexer.find_query_pattern, which is optional and significantly slower (but not significantly better), so if you don’t plan on using it, removing collocations saves some RAM and I/O.

TYPE: bool DEFAULT: False

principal_components

number of principal components to compute and remove from the index dataset. This helps to make queries more selective and specific in the presence of boilerplate text and formatting language in the sampling.

TYPE: int DEFAULT: 1

NOTE

The class is optimized to run online, on a server: it loads fast when spawning a new server-side worker and uses RAM sparingly.

Attributes¤
core.search.Indexer.sql instance-attribute ¤
sql: str = ''

Cache the previous SQL filtering conditions

core.search.Indexer.word2vec instance-attribute ¤
word2vec: Word2Vec = word2vec

Word2Vec embedding language model

core.search.Indexer.collocations instance-attribute ¤
collocations: np.ndarray | None = None

Store the list of document tokens encoded by their index number in the Word2Vec vocabulary. Unknown tokens are discarded. This gives a symbolic and more compact representation of token collocations in documents (32 bits/token).

Documents are on the first axis.

core.search.Indexer.ranker instance-attribute ¤
ranker: BM25PlusCSR = BM25PlusCSR(
    corpus_token_indices, self.word2vec, k1=1.8, b=0.4, delta=0.8
)

BM25+ CSR ranker (TF-IDF).

core.search.Indexer.pc instance-attribute ¤
pc: np.ndarray = pca.components_

Principal component(s) of the dataset vectors (normalized)

core.search.Indexer.vectors instance-attribute ¤
vectors = self.normalize_pc(self.vectors)

Store the list of document-wise vector embeddings, where each vector represents the normalized centroid of the token vectors contained in the document. Documents are on the first axis.

core.search.Indexer.index instance-attribute ¤
index: list[str] = self.build_index(db)

LUT of document URLs as ordered when building the ranker, lazily loaded from the database.

core.search.Indexer.url_to_index instance-attribute ¤
url_to_index: dict[str, int] = self.build_index_reverse()

Reverse LUT of self.index

Functions¤
core.search.Indexer.init_search_table ¤
init_search_table(db: sqlite3.Connection)

Create or migrate the one-row search cache table.

core.search.Indexer.init_stats_table ¤
init_stats_table(db: sqlite3.Connection)

Create or migrate the plain-SQLite stats table.

Scalar stats use item = ''. Grouped stats use name for the metric family and item for the domain/category/language key.
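
A minimal sketch of the scalar/grouped convention described above. Only the `name` and `item` columns come from the description; the `value` column, the primary key and the metric names are hypothetical, so the real table may look different.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: only `name` and `item` are documented
db.execute(
    "CREATE TABLE stats ("
    "name TEXT NOT NULL, item TEXT NOT NULL DEFAULT '', value REAL, "
    "PRIMARY KEY (name, item))"
)
db.executemany(
    "INSERT INTO stats (name, item, value) VALUES (?, ?, ?)",
    [
        ("pages_total", "", 1200),         # scalar stat: empty item
        ("pages_by_language", "en", 900),  # grouped stat: one row per key
        ("pages_by_language", "fr", 300),
    ],
)

# Scalar lookup: the metric name plus an empty item
total = db.execute(
    "SELECT value FROM stats WHERE name = ? AND item = ''", ("pages_total",)
).fetchone()[0]

# Grouped lookup: all items of one metric family
by_language = dict(
    db.execute(
        "SELECT item, value FROM stats WHERE name = ? ORDER BY item",
        ("pages_by_language",),
    )
)
```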

core.search.Indexer.init_categories_table ¤
init_categories_table(db: sqlite3.Connection)

Create a queryable catalog of all page categories.

core.search.Indexer.init_pages_search_indexes ¤
init_pages_search_indexes(db: sqlite3.Connection)

Add persistent indexes for the user-facing search filters.

The URL primary key already gives us an index for URL lookups. The category indexes help equality filters and provide a stable ordering path for category-only queries. Substring filters such as NOT LIKE '%github.com%', instr(parsed, ?), and REGEXP cannot use a normal B-tree index, so filter_contents() narrows those to ranked candidate URLs before SQLite evaluates them.
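
The difference between indexable and non-indexable filters can be checked with SQLite's `EXPLAIN QUERY PLAN`. The sketch below assumes a simplified `pages` table and a hypothetical index name `idx_pages_category`; the real schema and index names may differ.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified stand-in for the real pages table
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, category TEXT, parsed TEXT)")
# Hypothetical index name
db.execute("CREATE INDEX idx_pages_category ON pages (category)")


def query_plan(sql, params=()):
    """Return the human-readable steps SQLite plans for a statement."""
    rows = db.execute(f"EXPLAIN QUERY PLAN {sql}", params).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description


# Equality filter on an indexed column: SQLite can seek into the B-tree
equality_plan = query_plan("SELECT url FROM pages WHERE category = ?", ("docs",))

# Substring filter: every row has to be visited
substring_plan = query_plan(
    "SELECT url FROM pages WHERE url NOT LIKE ?", ("%github.com%",)
)

print(equality_plan)
print(substring_plan)
```

The exact wording of the plan varies between SQLite versions, but the equality filter is reported as a `SEARCH` through the index while the substring filter is reported as a `SCAN`.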

core.search.Indexer.save_search_values ¤
save_search_values(db: sqlite3.Connection, values: dict)

Store one or more precomputed search cache values in the database.

core.search.Indexer.clear_search_cache ¤
clear_search_cache(db: sqlite3.Connection)

Clear current heavy cache values before writing the refreshed cache.

core.search.Indexer.get_search_values ¤
get_search_values(db: sqlite3.Connection, columns: list[str]) -> any

Fetch one or more precomputed search cache values from the database.

core.search.Indexer.get_legacy_search_values ¤
get_legacy_search_values(db: sqlite3.Connection, columns: list[str]) -> any

Fetch legacy cache columns when loading an old DB before migration.

core.search.Indexer.save_search_index ¤
save_search_index(db: sqlite3.Connection, index: list[str])

Store the URL lookup table into search.doc_index.

core.search.Indexer.save_search_ranker ¤
save_search_ranker(db: sqlite3.Connection, ranker: BM25PlusCSR)

Store the BM25+ CSR ranker state into search.

core.search.Indexer.save_search_vectors ¤
save_search_vectors(db: sqlite3.Connection, vectors: np.ndarray)

Store the single global vector matrix into search.vectors.

core.search.Indexer.save_search_word2vec ¤
save_search_word2vec(db: sqlite3.Connection)

Store Word2Vec vocabulary and embedding matrices into the search cache.

core.search.Indexer.save_search_tokenizer ¤
save_search_tokenizer(db: sqlite3.Connection)

Store the tokenizer’s large n-gram state in the search cache.

core.search.Indexer.save_search_stats ¤
save_search_stats(db: sqlite3.Connection, stats: dict)

Store cheap display and diagnostic metadata in plain SQLite rows.

core.search.Indexer.save_categories_index ¤
save_categories_index(db: sqlite3.Connection, category_counts: dict[str, int])

Store all existing non-empty categories and their page counts.

core.search.Indexer.build_stats ¤
build_stats(db: sqlite3.Connection) -> dict

Compute index metadata while the database is writable/prepared.

core.search.Indexer.array_to_raw staticmethod ¤
array_to_raw(array: np.ndarray) -> sqlite3.Binary

Store arrays without the .npy wrapper used by SQLite converters. Raw contiguous blobs hydrate faster with np.frombuffer() at runtime.
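
A minimal round-trip sketch of the raw-blob idea, assuming a hypothetical `cache` table. The helper names `to_raw` and `from_raw` are illustrative; since a raw blob carries no header, the dtype and shape have to be known or stored elsewhere.

```python
import sqlite3

import numpy as np


def to_raw(array):
    """Serialize an array as a raw contiguous blob (no .npy header)."""
    return sqlite3.Binary(np.ascontiguousarray(array).tobytes())


def from_raw(blob, dtype, shape):
    """Rebuild a read-only array on top of the blob, without copying it."""
    return np.frombuffer(blob, dtype=dtype).reshape(shape)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cache (name TEXT PRIMARY KEY, data BLOB)")

vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
db.execute("INSERT INTO cache VALUES (?, ?)", ("vectors", to_raw(vectors)))

blob = db.execute("SELECT data FROM cache WHERE name = 'vectors'").fetchone()[0]
restored = from_raw(blob, np.float32, (3, 4))
```

The restored array is a read-only view on the bytes returned by SQLite, which is what makes hydration cheap; call `.copy()` on it if it needs to be modified.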

core.search.Indexer.normalize_pc ¤
normalize_pc(vector: np.ndarray) -> np.ndarray

Remove the principal component(s) of the dataset from the vector. This helps remove stopwords, webpage boilerplate (menus, sidebars), formatting language and SEO junk, and makes cosine similarity between query and documents more specific.

Taken from A simple but tough-to-beat baseline for sentence embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx

PARAMETER DESCRIPTION
vector

can be a single vector (1D) or a document-wise stack of vectors (2D). We always consider the embedding vector to be on the last axis, document-wise vectors should be vertically stacked.

TYPE: np.ndarray

Return: normalized vector
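
The core operation from the cited paper is to subtract from each vector its projection on the leading component(s). The sketch below illustrates that step only, under assumptions: the function name `remove_components` is hypothetical, the components are obtained from a plain SVD rather than from the `pca.components_` attribute used by the class, and any further rescaling the real method may apply is not shown.

```python
import numpy as np


def remove_components(vectors, components):
    """Subtract from each vector its projection on the given components.

    `vectors` is (n_features,) or (n_docs, n_features);
    `components` is (n_components, n_features) with orthonormal rows.
    """
    return vectors - (vectors @ components.T) @ components


rng = np.random.default_rng(0)
# Toy document vectors sharing a strong common direction ("boilerplate")
docs = rng.normal(size=(50, 8)) + 3.0 * np.ones(8)

# First right-singular vector of the document matrix, as in Arora et al.
_, _, vt = np.linalg.svd(docs, full_matrices=False)
pc = vt[:1]

cleaned = remove_components(docs, pc)    # 2D: stack of document vectors
single = remove_components(docs[0], pc)  # 1D: a single vector
```

Because the embedding axis is the last one in both cases, the same expression handles a single vector and a vertically stacked matrix.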

core.search.Indexer.filter_contents ¤
filter_contents(
    db: sqlite3.Connection,
    sql_query: str = "",
    sql_params: list[str] | None = None,
    candidate_indices: np.ndarray | list[int] | None = None,
) -> list[int]

Filter pages by arbitrary SQL queries

core.search.Indexer.load classmethod ¤
load(name: str, db: sqlite3.Connection)

Load an existing trained model by its name from the ../models folder.

core.search.Indexer.tokenize_query ¤
tokenize_query(
    query: str,
    language: str | None = None,
    meta_tokens: bool = True,
    n_grams: bool = True,
) -> list[str]

Tokenize a query string, returning only tokens known to our vocabulary.

core.search.Indexer.vectorize_query ¤
vectorize_query(tokenized_query: list[str]) -> np.ndarray

Prepare a text search query: cleanup, tokenize and get the centroid vector.

RETURNS DESCRIPTION
np.ndarray

tuple[vector, norm, tokens]
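
As an illustration of the centroid step, here is a minimal sketch with a hypothetical toy embedding table. The unit-length scaling is an assumption borrowed from the description of the document vectors; the actual method relies on the Word2Vec model and may differ.

```python
import numpy as np

# Hypothetical toy embedding table: token -> vector
embeddings = {
    "raw": np.array([1.0, 0.0, 0.0]),
    "photo": np.array([0.0, 1.0, 0.0]),
    "editing": np.array([0.0, 1.0, 1.0]),
}


def centroid(tokens):
    """Average the vectors of known tokens and scale the result to unit length."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(3)
    mean = np.mean(known, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean


# Tokens missing from the table are simply ignored
query_vector = centroid(["raw", "photo", "unknown-token"])
```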

core.search.Indexer.find_query_pattern ¤
find_query_pattern(
    indexed_query: np.ndarray[np.int32],
    documents: list[tuple[int, str, float]],
    fast: bool = False,
) -> list[tuple[int, str, float]]

The ranking methods treat documents as a continuous bag of words (CBOW). As such, they are good for topic extraction (aboutness), but they ignore word collocations and ordering, and therefore lose syntactic meaning.

This method adds an additional layer of detection using convolution filters that will detect word sequences, direct or reversed, and correct the similarity factor set by the other ranking methods using that collocation factor.

Its major drawback is that it is 100 to 500 times slower than the other rankers, due to 2D convolutions, which means it needs to run on a subset of the search index, to refine a ranking already produced by the other methods.

PARAMETER DESCRIPTION
indexed_query

the search query tokens translated into their integer indices in the Word2Vec vocabulary. Use core.nlp.Word2Vec.tokens_to_indices to convert the tokenized query.

TYPE: np.ndarray[np.int32]

documents

a symbolic list of documents, as a (index, url, similarity) tuple.

TYPE: list[tuple[int, str, float]]

fast

if True, uses a simplified variant that is 6 times faster and only uses local averages. Results from this method are rather inaccurate, for example, for a request like token_1 token_2, sentences repeating token_1 twice will score as much as sentences containing the desired sequence token_1 token_2. If False, use the convolutional filter.

TYPE: bool DEFAULT: False

References

Text Matching as Image Recognition, Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. (2016). https://arxiv.org/pdf/1602.06359.pdf
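
To illustrate the idea of detecting word sequences on an interaction matrix, here is a deliberately simplified sketch. It is not the method's convolutional filter: the function `sequence_score` is hypothetical and only sums the diagonals of a binary query/document match matrix, which is equivalent to sliding an identity-shaped kernel over it.

```python
import numpy as np


def sequence_score(query, document):
    """Best fraction of query tokens found at aligned positions in the
    document, trying the query in direct and in reversed order."""
    query = np.asarray(query)
    document = np.asarray(document)
    # Interaction matrix: match[i, j] is 1 where query token i equals document token j
    match = (query[:, None] == document[None, :]).astype(np.float64)
    n = len(query)
    best = 0.0
    for m in (match, match[::-1]):  # direct order, then reversed order
        # A diagonal summing to n means the whole query sequence was found
        for offset in range(-(n - 1), len(document)):
            best = max(best, np.trace(m, offset=offset))
    return best / n


exact = sequence_score([1, 2], [3, 1, 2, 4])     # token_1 token_2 present
reverse = sequence_score([1, 2], [2, 1])         # reversed sequence
repeated = sequence_score([1, 2], [1, 1])        # token_1 repeated twice
```

Unlike a local average, this score separates a document repeating `token_1` twice from one containing the sequence `token_1 token_2`.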

core.search.Indexer.rrf ¤
rrf(ranks_1: np.ndarray, ranks_2: np.ndarray, coeff: float = 60) -> np.ndarray

Reciprocal Rank Fusion

Aggregate 2 sets of page rankings obtained from different semantic geometries and weighted differently.

From Reciprocal rank fusion outperforms condorcet and individual rank learning methods, Gordon V. Cormack, Charles L A Clarke, Stefan Buettcher. https://dl.acm.org/doi/10.1145/1571941.1572114
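
A minimal sketch of reciprocal rank fusion as defined in the cited paper, where each document receives the sum of `1 / (coeff + rank)` over the rankings. The helper `scores_to_ranks` and the sample scores are illustrative assumptions, and any per-ranking weighting applied by the actual method is not shown.

```python
import numpy as np


def reciprocal_rank_fusion(ranks_1, ranks_2, coeff=60.0):
    """Fuse two rankings of the same documents.

    `ranks_1[i]` and `ranks_2[i]` are the 1-based ranks of document i
    under each method. Higher fused scores are better.
    """
    ranks_1 = np.asarray(ranks_1, dtype=np.float64)
    ranks_2 = np.asarray(ranks_2, dtype=np.float64)
    return 1.0 / (coeff + ranks_1) + 1.0 / (coeff + ranks_2)


def scores_to_ranks(scores):
    """Convert scores (higher is better) to 1-based ranks."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    ranks = np.empty(len(order), dtype=np.int64)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks


bm25_scores = np.array([2.1, 0.3, 1.7, 0.0])
cosine_scores = np.array([0.2, 0.8, 0.9, 0.1])

fused = reciprocal_rank_fusion(
    scores_to_ranks(bm25_scores), scores_to_ranks(cosine_scores)
)
best = int(np.argmax(fused))
```

Working on ranks rather than raw scores is what lets the fusion combine methods whose scores live on unrelated scales.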

core.search.Indexer.rank ¤
rank(
    db: sqlite3.Connection,
    tokens: list[str],
    method: search_methods,
    n_results: int = 500,
    fine_search: bool = False,
    sql_query: str = "",
    sql_params: list[str] = [],
) -> list[tuple[int, str, float]]

Rank the indexed documents against a tokenized query and return the best matches.

PARAMETER DESCRIPTION
db

the SQLite database holding the indexed set of documents. This database must absolutely be up-to-date with the one used to instantiate this class, regarding row ordering of documents, otherwise rowid mismatches are to be expected between fuzzy, AI and regex searches.

TYPE: sqlite3.Connection

tokens

the tokenized query.

TYPE: list[str]

method

ai, fuzzy or grep:

- ai uses word embedding and meta-tokens with dual-embedding space,
- fuzzy uses meta-tokens with BM25Okapi stats model,
- grep uses direct string and regex search.

TYPE: search_methods

n_results

number of results to retain

TYPE: int DEFAULT: 500

fine_search

optionally refine the search using a 2D interaction matrix. See core.search.Indexer.find_query_pattern.

TYPE: bool DEFAULT: False

sql_query

SQL query to narrow down the search, for example WHERE field = value. Supports PCRE regex with WHERE field REGEXP 'pattern'.

TYPE: str DEFAULT: ''

sql_params

the SQL parameters such that:

    cursor = db.execute(
        f"SELECT url FROM pages {sql_query}",
        sql_params
    )

where each sql_params item is matched in the sql_query by a ?. For example:

    SELECT url              -- imposed by the search API
    FROM pages              -- imposed by the search API
    WHERE instr(url, ?) > 0 -- implementation-side `sql_query`
    ORDER BY url            -- imposed by the search API

and sql_params = ['google.com'] will keep only the URLs containing google.com.

TYPE: list[str] DEFAULT: []
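
The way `sql_query` and `sql_params` combine can be tried on a throwaway in-memory database. This is a sketch under assumptions: the `pages` table is reduced to its `url` column, and the `REGEXP` function is registered here with Python's `re` module as a stand-in, since stock SQLite ships without a `REGEXP` implementation and how the project registers its own is not shown.

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY)")
db.executemany(
    "INSERT INTO pages VALUES (?)",
    [
        ("https://google.com/maps",),
        ("https://example.org/",),
        ("https://docs.google.com/x",),
    ],
)

# `x REGEXP y` calls the user function regexp(y, x): pattern first, value second
db.create_function(
    "REGEXP", 2, lambda pattern, value: re.search(pattern, value) is not None
)


def filter_urls(sql_query, sql_params):
    """Run the imposed SELECT with a caller-provided WHERE clause."""
    rows = db.execute(
        f"SELECT url FROM pages {sql_query} ORDER BY url", sql_params
    ).fetchall()
    return [row[0] for row in rows]


google_urls = filter_urls("WHERE instr(url, ?) > 0", ["google.com"])
docs_urls = filter_urls("WHERE url REGEXP ?", [r"^https://docs\."])
```

Passing values through `sql_params` rather than formatting them into `sql_query` lets SQLite handle the quoting of user input.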

Note

Both SQL search into the database and Python filtering into the index are supported, and can be combined. The local index is a partial copy of the database and is already a Python object, so it will be faster to filter if you only need to parse the copied data to filter in/out.

RETURNS DESCRIPTION
list

the list of best-matching results as (rank, url, similarity) tuples.

TYPE: list[tuple[int, str, float]]

get_related(tokens: list[str], n: int = 15, k: int = 5) -> list

Get the n closest keywords from the query.

Functions¤