core.search¤

Classes¤

core.search.BM25PlusCSR ¤

BM25PlusCSR(
    corpus: list[list[int]],
    word2vec,
    k1: float = 1.7,
    b: float = 0.3,
    delta: float = 0.65,
)

BM25+ with CSR inverted index:

- doc_ids / tfs stored in contiguous arrays
- indptr for token → posting list slicing
- fully vectorized scoring
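The CSR layout can be illustrated with a minimal sketch. It is not the class's actual implementation: the helper names `build_csr` and `bm25_plus`, the `vocab_size` argument and the IDF variant `log((N + 1) / df)` are assumptions made for the example.

```python
import numpy as np


def build_csr(corpus: list[list[int]], vocab_size: int):
    """Build a CSR inverted index: token t owns the slice indptr[t]:indptr[t + 1]."""
    postings = [dict() for _ in range(vocab_size)]
    for doc_id, doc in enumerate(corpus):
        for tok in doc:
            postings[tok][doc_id] = postings[tok].get(doc_id, 0) + 1
    indptr = np.zeros(vocab_size + 1, dtype=np.int64)
    indptr[1:] = np.cumsum([len(p) for p in postings])
    doc_ids = np.array([d for p in postings for d in p], dtype=np.int32)
    tfs = np.array([tf for p in postings for tf in p.values()], dtype=np.float32)
    return indptr, doc_ids, tfs


def bm25_plus(query, corpus, vocab_size, k1=1.7, b=0.3, delta=0.65):
    """Score every document against the query token indices (hypothetical helper)."""
    indptr, doc_ids, tfs = build_csr(corpus, vocab_size)
    n_docs = len(corpus)
    doc_len = np.array([len(d) for d in corpus], dtype=np.float32)
    # Length normalization term, computed once per document.
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    scores = np.zeros(n_docs, dtype=np.float32)
    for tok in query:
        ids = doc_ids[indptr[tok]:indptr[tok + 1]]
        tf = tfs[indptr[tok]:indptr[tok + 1]]
        if ids.size == 0:
            continue
        idf = np.log((n_docs + 1) / ids.size)
        # One vectorized update per query token; ids are unique within a posting list.
        scores[ids] += idf * (tf * (k1 + 1.0) / (tf + norm[ids]) + delta)
    return scores


corpus = [[0, 1, 1], [2, 3], [1, 2, 2, 2]]
scores = bm25_plus([1], corpus, vocab_size=4)
```

Only the posting lists of the query tokens are read, so the cost depends on how many documents contain those tokens rather than on the corpus size.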

core.search.search_methods ¤

Bases: IntEnum

Search methods available

Attributes¤

core.search.search_methods.AI `class-attribute` `instance-attribute` ¤

AI = 1

Vector-based similarity on document centroid in embedding space

core.search.search_methods.FUZZY `class-attribute` `instance-attribute` ¤

FUZZY = 2

BM25+ keyword statistics on normalized and stemmed content

core.search.search_methods.MIXED `class-attribute` `instance-attribute` ¤

MIXED = 3

Combination of AI and FUZZY aggregated by Reciprocal Rank Fusion.

core.search.Indexer ¤

Indexer(
    db: sqlite3.Connection,
    name: str,
    word2vec: WordEmbedding,
    strip_collocations: bool = False,
    principal_components: int = 1,
)

Search engine based on word similarity.

PARAMETER	DESCRIPTION
`db`	Open SQLite database containing at least a `pages` table holding the core.types.web_page items. TYPE: `sqlite3.Connection`
`name`	name under which the model will be saved for later reuse. TYPE: `str`
`word2vec`	the instance of word embedding model. TYPE: `WordEmbedding`
`strip_collocations`	remove the matrix of collocations in documents, which is the list of word tokens represented by their index in the word2vec dictionary. It is used for core.search.Indexer.find_query_pattern, which is optional and significantly slower (but not significantly better), so if you don’t plan on using it, removing collocations saves some RAM and I/O. TYPE: `bool` DEFAULT: `False`
`principal_components`	number of principal components to compute and remove from the index dataset. This helps to make queries more selective and specific in the presence of boilerplate text and formatting language in the sampling. TYPE: `int` DEFAULT: `1`

NOTE

The class is optimized to run online, on a server: it loads fast when spawning a new server-side worker and uses RAM sparingly.

Attributes¤

core.search.Indexer.sql `instance-attribute` ¤

sql: str = ''

Cache the previous SQL filtering conditions

core.search.Indexer.word2vec `instance-attribute` ¤

word2vec: WordEmbedding = word2vec

Word embedding model (Word2Vec or FastText), via the WordEmbedding interface

core.search.Indexer.collocations `instance-attribute` ¤

collocations: np.ndarray | None = None

Store the list of document tokens encoded by their index number in the Word2Vec vocabulary. Unknown tokens are discarded. This gives a symbolic and more compact representation of token collocations in documents (32 bits/token).

Documents are on the first axis.

core.search.Indexer.ranker `instance-attribute` ¤

ranker: BM25PlusCSR = BM25PlusCSR(
    corpus_token_indices, self.word2vec, k1=1.8, b=0.4, delta=0.8
)

BM25+ CSR ranker (TF-IDF).

core.search.Indexer.pc `instance-attribute` ¤

pc: np.ndarray = pca.components_

Principal component(s) of the dataset vectors (normalized)

core.search.Indexer.vectors `instance-attribute` ¤

vectors = self.normalize_pc(self.vectors)

Store the list of document-wise vector embeddings, where each vector represents the normalized centroid of the token vectors contained in the document. Documents are on the first axis.

Methods:¤

core.search.Indexer.init_stats_table ¤

init_stats_table(db: sqlite3.Connection)

Create or migrate the plain-SQLite stats table.

Scalar stats use item = ''. Grouped stats use name for the metric family and item for the domain/category/language key.

core.search.Indexer.init_categories_table ¤

init_categories_table(db: sqlite3.Connection)

Create a queryable catalog of all page categories.

core.search.Indexer.init_pages_search_indexes ¤

init_pages_search_indexes(db: sqlite3.Connection)

Create or migrate all persistent indexes needed by the search layer.

search_rowid is an explicit INTEGER column we assign (0, 1, 2 …) to every page in ORDER BY url order at build time. Because it is a real column value, VACUUM cannot renumber it — unlike SQLite’s implicit rowid for tables with a TEXT primary key.

The covering index idx_pages_search_rowid_category is the key performance fix for filter_contents: SQLite can answer the entire candidate-filter query from the compact index without ever touching the main pages rows (which are large due to stored embeddings / content). Benchmark: 500-candidate category filter 439 ms → < 2 ms.
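The covering-index behaviour can be checked on a toy table. This is a sketch under assumptions: the `pages` schema, the index column order `(search_rowid, category)` and the sample data are invented for the example, and only the index name comes from the description above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pages ("
    "url TEXT PRIMARY KEY, search_rowid INTEGER, category TEXT, content TEXT)"
)
# Toy rows: search_rowid follows url order, content stands in for the large columns.
rows = [
    (f"https://example.org/{i:03d}", i, "doc" if i % 2 else "blog", "x" * 1000)
    for i in range(100)
]
db.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)
db.execute(
    "CREATE INDEX idx_pages_search_rowid_category ON pages (search_rowid, category)"
)

candidates = [3, 4, 5, 8]
placeholders = ",".join("?" * len(candidates))
sql = (
    f"SELECT search_rowid FROM pages "
    f"WHERE search_rowid IN ({placeholders}) AND category = ?"
)
params = [*candidates, "doc"]

# The fourth column of EXPLAIN QUERY PLAN holds the human-readable plan detail.
plan = " ".join(row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql, params))
hits = sorted(row[0] for row in db.execute(sql, params))
```

When the plan mentions a covering index, every column the query needs is read from the index b-tree and the wide table rows are never fetched.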

core.search.Indexer.save_search_stats ¤

save_search_stats(db: sqlite3.Connection, stats: dict)

Store cheap display and diagnostic metadata in plain SQLite rows.

core.search.Indexer.save_categories_index ¤

save_categories_index(db: sqlite3.Connection, category_counts: dict[str, int])

Store all existing non-empty categories and their page counts.

core.search.Indexer.build_stats ¤

build_stats(db: sqlite3.Connection) -> dict

Compute index metadata while the database is writable/prepared.

core.search.Indexer.array_to_raw `staticmethod` ¤

array_to_raw(array: np.ndarray) -> sqlite3.Binary

Store arrays without the .npy wrapper used by SQLite converters. Raw contiguous blobs hydrate faster with np.frombuffer() at runtime.
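A minimal sketch of the raw-blob round trip follows. The helper names `to_raw` and `from_raw`, the toy `search` table and its `n_rows` / `n_cols` columns are assumptions. The point is that a raw buffer carries no dtype or shape, so both must be stored or known separately.

```python
import sqlite3

import numpy as np


def to_raw(array: np.ndarray) -> sqlite3.Binary:
    """Serialize the array buffer as-is, without any .npy header."""
    return sqlite3.Binary(np.ascontiguousarray(array).tobytes())


def from_raw(blob: bytes, dtype, shape) -> np.ndarray:
    """Wrap the blob in an array view; no parsing and no copy of the data."""
    return np.frombuffer(blob, dtype=dtype).reshape(shape)


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE search ("
    "name TEXT PRIMARY KEY, vectors_raw BLOB, n_rows INTEGER, n_cols INTEGER)"
)
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
db.execute(
    "INSERT INTO search VALUES (?, ?, ?, ?)",
    ("demo", to_raw(vectors), *vectors.shape),
)

blob, n_rows, n_cols = db.execute(
    "SELECT vectors_raw, n_rows, n_cols FROM search WHERE name = ?", ("demo",)
).fetchone()
restored = from_raw(blob, np.float32, (n_rows, n_cols))
```

Arrays returned by `np.frombuffer()` over a `bytes` object are read-only, which suits an index that is only queried at runtime.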

core.search.Indexer.verify_db_integrity ¤

verify_db_integrity(db: sqlite3.Connection, full: bool = False)

Raise RuntimeError if the DB has changed since this Indexer was built.

PARAMETER	DESCRIPTION
`full`	`False` (default) — page count + boundary-URL hash, three O(log N) index seeks, effectively O(1). Catches all insertions, deletions, and boundary-URL edits. `True` — hashes every URL in rowid order, O(N), detects any mid-corpus mutation. TYPE: `bool` DEFAULT: `False`

Called automatically by core.search.Indexer.load.
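The two verification levels can be sketched as follows. This is an illustration only: the `fingerprint` and `verify` helpers, the SHA-256 digest, the fingerprint format and the toy `pages` schema are assumptions, not the method's actual implementation.

```python
import hashlib
import sqlite3


def fingerprint(db: sqlite3.Connection, full: bool = False) -> str:
    """Return 'count:sha256' over the boundary URLs (quick) or every URL (full)."""
    count = db.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    if full:
        urls = [r[0] for r in db.execute("SELECT url FROM pages ORDER BY search_rowid")]
    else:
        first = db.execute(
            "SELECT url FROM pages ORDER BY search_rowid ASC LIMIT 1"
        ).fetchone()
        last = db.execute(
            "SELECT url FROM pages ORDER BY search_rowid DESC LIMIT 1"
        ).fetchone()
        urls = [row[0] for row in (first, last) if row is not None]
    digest = hashlib.sha256("\n".join(urls).encode("utf-8")).hexdigest()
    return f"{count}:{digest}"


def verify(db: sqlite3.Connection, expected: str, full: bool = False) -> None:
    """Raise RuntimeError when the stored fingerprint no longer matches."""
    if fingerprint(db, full) != expected:
        raise RuntimeError("database changed since the index was built")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, search_rowid INTEGER)")
db.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [(f"https://example.org/{i}", i) for i in range(5)],
)
quick, complete = fingerprint(db), fingerprint(db, full=True)

# A mid-corpus edit keeps the count and both boundary URLs unchanged.
db.execute("UPDATE pages SET url = 'https://example.org/edited' WHERE search_rowid = 2")
verify(db, quick)  # passes: the quick check cannot see this mutation
```

In this sketch, `verify(db, complete, full=True)` raises after the same edit, which is the trade-off between the two modes.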

core.search.Indexer.normalize_pc ¤

normalize_pc(vector: np.ndarray) -> np.ndarray

Remove the principal component(s) of the dataset from the vector. This helps remove stopwords, webpage boilerplate (menus, sidebars), formatting language and SEO junk, and makes cosine similarity between query and documents more specific.

Taken from A simple but tough-to-beat baseline for sentence embeddings, Sanjeev Arora, Yingyu Liang, Tengyu Ma. https://openreview.net/pdf?id=SyK00v5xx

PARAMETER	DESCRIPTION
`vector`	can be a single vector (1D) or a document-wise stack of vectors (2D). We always consider the embedding vector to be on the last axis, document-wise vectors should be vertically stacked. TYPE: `np.ndarray`

Returns: normalized vector
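A minimal sketch of the operation is to project each vector on the principal components, subtract the projection, then rescale to unit length. It assumes `pc` has shape `(n_components, D)` with orthonormal rows (as `pca.components_` does) and that the result is L2-normalized. The function name `remove_pc` is hypothetical.

```python
import numpy as np


def remove_pc(vector: np.ndarray, pc: np.ndarray) -> np.ndarray:
    """Remove the principal directions from a (D,) vector or an (N, D) stack."""
    # (vector @ pc.T) holds the coordinates along each component;
    # multiplying back by pc rebuilds the part of the vector to discard.
    projected = vector - (vector @ pc.T) @ pc
    norm = np.linalg.norm(projected, axis=-1, keepdims=True)
    return projected / np.maximum(norm, 1e-12)  # guard against zero vectors


pc = np.array([[1.0, 0.0, 0.0]])  # toy "boilerplate" direction
docs = np.array([[3.0, 4.0, 0.0], [1.0, 0.0, 1.0]])
cleaned = remove_pc(docs, pc)
single = remove_pc(docs[0], pc)
```

After removal, every vector is orthogonal to the removed directions, so content shared by most documents no longer inflates their cosine similarity.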

core.search.Indexer.filter_contents ¤

filter_contents(
    db: sqlite3.Connection,
    sql_query: str = "",
    sql_params: list[str] | None = None,
    candidate_indices: np.ndarray | list[int] | None = None,
) -> list[int]

Filter pages by an arbitrary SQL predicate, returning search_rowid integers.

With candidate_indices, the predicate is evaluated only over that small set via WHERE search_rowid IN (...). This avoids a full table scan and, combined with the covering index idx_pages_search_rowid_category, avoids touching the main pages rows entirely for category/url filters — the most common case.

Without candidate_indices, the full table is scanned (used for building candidate sets outside of core.search.Indexer.rank).

core.search.Indexer.load `classmethod` ¤

load(name: str, db: sqlite3.Connection)

Load an existing trained model by its name from the ../models folder.

core.search.Indexer.tokenize_query ¤

tokenize_query(
    query: str,
    language: str | None = None,
    meta_tokens: bool = True,
    n_grams: bool = True,
) -> list[str]

Tokenize a query string, returning only tokens known to our vocabulary.

core.search.Indexer.vectorize_query ¤

vectorize_query(
    tokenized_query: list[str],
    use_sif: bool = True,
    sif_smoothing: float = 0.001,
) -> np.ndarray

Prepare a text search query: cleanup, tokenize and get the centroid vector.

RETURNS	DESCRIPTION
`np.ndarray`	tuple[vector, norm, tokens]
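For context, the `use_sif` and `sif_smoothing` parameters suggest smooth inverse frequency (SIF) weighting, where a token of relative frequency p(w) gets the weight a / (a + p(w)). The sketch below assumes that `sif_smoothing` plays the role of `a`. The helper name, the toy vectors and the frequencies are all hypothetical.

```python
import numpy as np


def sif_centroid(tokens, vectors, frequency, a=0.001):
    """Weighted centroid of token vectors; expects a non-empty token list."""
    weights = np.array([a / (a + frequency[t]) for t in tokens])
    stacked = np.stack([vectors[t] for t in tokens])
    centroid = weights @ stacked / weights.sum()
    return centroid / np.linalg.norm(centroid)


# Toy 2D embeddings and relative corpus frequencies.
vectors = {"the": np.array([1.0, 0.0]), "lens": np.array([0.0, 1.0])}
frequency = {"the": 0.05, "lens": 0.00001}

query_vector = sif_centroid(["the", "lens"], vectors, frequency)
```

With these numbers the frequent token weighs about 0.02 and the rare one about 0.99, so the centroid points almost entirely along the rare token's direction.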

core.search.Indexer.find_query_pattern ¤

find_query_pattern(
    indexed_query: np.ndarray[np.int32],
    documents: list[tuple[int, str, float]],
    fast: bool = False,
) -> list[tuple[int, str, float]]

The ranker methods treat documents as a continuous bag of words (CBOW). As such, they are good for topic extraction (aboutness), but they ignore word collocations and ordering, and therefore lose syntactic meaning.

This method adds an additional layer of detection using convolution filters that will detect word sequences, direct or reversed, and correct the similarity factor set by the other ranking methods using that collocation factor.

Its major drawback is that it is 100 to 500 times slower than the other rankers, due to 2D convolutions, so it needs to run on a subset of the search index to refine a ranking produced by the other methods.

PARAMETER	DESCRIPTION
`indexed_query`	the search query tokens translated into their integer indices in the Word2Vec vocabulary. Use core.nlp.Word2Vec.tokens_to_indices to convert the tokenized query. TYPE: `np.ndarray[np.int32]`
`documents`	a symbolic list of documents, as a `(index, url, similarity)` tuple. TYPE: `list[tuple[int, str, float]]`
`fast`	if `True`, uses a simplified variant that is 6 times faster and only uses local averages. Results from this method are rather inaccurate, for example, for a request like `token_1 token_2`, sentences repeating `token_1` twice will score as much as sentences containing the desired sequence `token_1 token_2`. If `False`, use the convolutional filter. TYPE: `bool` DEFAULT: `False`

References

Text Matching as Image Recognition, Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. (2016). https://arxiv.org/pdf/1602.06359.pdf
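The interaction-matrix idea can be illustrated with a simplified sketch: an exact-match matrix between query and document token indices, where two consecutive matches on a diagonal reveal a direct sequence and two on an anti-diagonal reveal a reversed one. It stands in for the convolution filters described above and is not the method's actual code. The function name `bigram_hits` is hypothetical.

```python
import numpy as np


def bigram_hits(query: np.ndarray, document: np.ndarray) -> tuple[int, int]:
    """Count direct and reversed occurrences of consecutive query tokens."""
    # match[i, j] is True when query token i equals document token j.
    match = query[:, None] == document[None, :]
    # Query tokens i, i + 1 found at document positions j, j + 1.
    direct = match[:-1, :-1] & match[1:, 1:]
    # Query tokens i, i + 1 found at document positions j + 1, j.
    reverse = match[:-1, 1:] & match[1:, :-1]
    return int(direct.sum()), int(reverse.sum())


query = np.array([7, 9], dtype=np.int32)
in_order = bigram_hits(query, np.array([1, 7, 9, 4], dtype=np.int32))
repeated = bigram_hits(query, np.array([7, 7, 3, 2], dtype=np.int32))
swapped = bigram_hits(query, np.array([9, 7, 5], dtype=np.int32))
```

Unlike a local average, this test gives no credit to a document that merely repeats the first token twice.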

core.search.Indexer.rank_ai ¤

rank_ai(
    tokens: list[str],
    fast: bool = False,
    clip: bool = False,
    coverage: float = 0.2,
    use_sif: bool = True,
    sif_smoothing: float = 0.001,
) -> np.ndarray

Cosine-similarity ranking against document centroid vectors.

PARAMETER	DESCRIPTION
`tokens`	tokenised query (output of `tokenize_query`). TYPE: `list[str]`
`fast`	use a single dot-product against the aggregate query vector instead of the per-token dual-embedding loop. TYPE: `bool` DEFAULT: `False`
`clip`	clamp scores to [0, 1]. Off by default: clipping saturates the SIF-weighted cosine and destroys rank resolution, which is fatal for the rank-based RRF fusion. Only enable when blending raw scores. TYPE: `bool` DEFAULT: `False`
`coverage`	if cluster data has been loaded, restrict the matmul to the nearest clusters that together cover at least this fraction of the corpus. Documents outside the selected clusters keep a score of 0; RRF fusion with BM25+ in `rank()` still surfaces them. A small fixed handful of clusters (the old behaviour) was far too aggressive for broad queries — the right cluster was easily missed. TYPE: `float` DEFAULT: `0.2`
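The `coverage` rule can be sketched as follows: sort clusters by centroid similarity to the query, then keep adding clusters until their cumulated size reaches the requested fraction of the corpus. The function name `select_clusters` and the `sizes` array are assumptions made for the example.

```python
import numpy as np


def select_clusters(query, centroids, sizes, coverage=0.2):
    """Return the labels of the nearest clusters covering >= coverage of the corpus."""
    similarity = centroids @ query
    order = np.argsort(-similarity)        # nearest cluster first
    covered = np.cumsum(sizes[order])      # documents covered so far
    # First position where the cumulated size reaches the target, inclusive.
    needed = int(np.searchsorted(covered, coverage * sizes.sum())) + 1
    return order[:needed]


centroids = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
sizes = np.array([10, 30, 40, 20])
query = np.array([1.0, 0.0])

selected = select_clusters(query, centroids, sizes, coverage=0.2)
```

Here the nearest cluster holds only 10% of the corpus, so the second nearest is added as well. The number of clusters searched therefore adapts to their sizes instead of being fixed.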

core.search.Indexer.rrf ¤

rrf(
    ranks_1: np.ndarray,
    ranks_2: np.ndarray,
    coeff: float = 60,
    weight_1: float = 1.0,
    weight_2: float = 1.0,
) -> np.ndarray

Reciprocal Rank Fusion

Aggregate 2 sets of page rankings obtained from different semantic geometries and weighted differently.

Reference

Reciprocal rank fusion outperforms condorcet and individual rank learning methods, Gordon V. Cormack, Charles L A Clarke, Stefan Buettcher. https://dl.acm.org/doi/10.1145/1571941.1572114

PARAMETER	DESCRIPTION
`ranks_1`	0-based ranks from the first ranker (best = 0). TYPE: `np.ndarray`
`ranks_2`	0-based ranks from the second ranker (best = 0). TYPE: `np.ndarray`
`coeff`	RRF smoothing constant `k`; larger flattens the contribution of top ranks. TYPE: `float` DEFAULT: `60`
`weight_1`	vote weight of `ranks_1`. TYPE: `float` DEFAULT: `1.0`
`weight_2`	vote weight of `ranks_2`. Plain RRF (both weights `1.0`) gives each ranker an equal say, which is only sound when both input rankings are individually trustworthy. Here the AI centroid ranker is not: on this small, domain-specific corpus it confidently places off-topic documents in its own top-10 whenever a query word is semantically generic (e.g. “waterfall” pulling in paintings/3D-renders, “backup” pulling in a generic encyclopedia article), and it simultaneously buries canonical but short/link-heavy pages whose centroid is diluted toward the corpus mean. Down-weighting its vote lets BM25 — the higher-precision lexical signal for keyword queries — own the top of the ranking while the AI ranker re-orders within the lexically-supported set. See core.search.Indexer.rank. TYPE: `float` DEFAULT: `1.0`
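As an illustration of the weighting, a weighted RRF score can be written as `weight_1 / (coeff + rank_1) + weight_2 / (coeff + rank_2)`. The sketch below assumes that 0-based ranks are shifted by one to match the 1-based ranks of the reference paper, which may differ from the method's exact formula. The function name `weighted_rrf` is hypothetical.

```python
import numpy as np


def weighted_rrf(ranks_1, ranks_2, coeff=60.0, weight_1=1.0, weight_2=1.0):
    """Fuse two 0-based rank arrays into one score per document (higher is better)."""
    return weight_1 / (coeff + ranks_1 + 1.0) + weight_2 / (coeff + ranks_2 + 1.0)


# Three documents; the two rankers disagree on the best one.
bm25_ranks = np.array([0, 1, 2])
ai_ranks = np.array([2, 0, 1])

plain = weighted_rrf(bm25_ranks, ai_ranks)
down_weighted = weighted_rrf(bm25_ranks, ai_ranks, weight_2=0.33)

plain_order = np.argsort(-plain)
down_weighted_order = np.argsort(-down_weighted)
```

With equal weights the second document comes first, because both rankers place it reasonably high. With the second ranker down-weighted, the fused order follows the first ranker.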

core.search.Indexer.rank ¤

rank(
    db: sqlite3.Connection,
    tokens: list[str],
    method: search_methods,
    n_results: int = 500,
    fine_search: bool = False,
    sql_query: str = "",
    sql_params: list[str] = [],
    ai_weight: float = 0.33,
) -> list[tuple[int, str, float]]

Rank the indexed documents against a tokenized query with the selected search method.

PARAMETER	DESCRIPTION
`db`	the SQLite database holding the indexed set of documents. This database must be up to date with the one used to instantiate this class, regarding row ordering of documents, otherwise rowid mismatches are to be expected between fuzzy, AI and regex searches. TYPE: `sqlite3.Connection`
`tokens`	the tokenized query. TYPE: `list[str]`
`method`	`ai`, `fuzzy` or `mixed`: - `ai` uses word embedding and meta-tokens with dual-embedding space, - `fuzzy` uses meta-tokens with the BM25+ stats model, - `mixed` uses a combination of `ai` and `fuzzy` merged by Reciprocal Rank Fusion, using the `ai_weight` factor. TYPE: `search_methods`
`n_results`	number of results to retain TYPE: `int` DEFAULT: `500`
`fine_search`	optionally refine the search using a 2D interaction matrix. See [1] TYPE: `bool` DEFAULT: `False`
`sql_query`	SQL query to narrow-down the search, for example `WHERE field = value`. Supports PCRE regex with `WHERE field REGEXP 'pattern'`. TYPE: `str` DEFAULT: `''`
`sql_params`	the SQL parameters such that: `cursor = db.execute( f"SELECT url FROM pages {sql_query}", sql_params )` where each `sql_params` item is matched in the `sql_query` by a `?`. For example: SELECT url // imposed by the search API FROM pages // imposed by the search API WHERE instr(url, ?) > 0 // implementation-side `sql_query` ORDER BY url // imposed by the search API and `sql_params = ['google.com']` will filter all URLs from Google. TYPE: `list[str]` DEFAULT: `[]`
`ai_weight`	vote weight of the AI (embedding) ranker in the weighted RRF fusion with BM25+ (which keeps weight 1.0), for `method=MIXED`. - `0.0` effectively disables the AI part and is equivalent to `method=FUZZY`, - `< 0.5` makes BM25 the primary signal and lets the noisier centroid ranker only re-order within lexically-supported candidates, - `0.33` is the tuned default (drives top-10 junk to zero), - `0.5` uses plain symmetric RRF: AI and FUZZY contribute equally, - `1.0` effectively disables the FUZZY part and is equivalent to `method=AI`. TYPE: `float` DEFAULT: `0.33`
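The `sql_query` / `sql_params` pairing described above can be tried on a toy database. The table content is invented for the example, and the surrounding `SELECT ... ORDER BY url` mirrors the parts described as imposed by the search API.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY)")
db.executemany(
    "INSERT INTO pages VALUES (?)",
    [
        ("https://google.com/maps",),
        ("https://example.org/a",),
        ("https://docs.google.com/x",),
    ],
)

# Caller-side filter: each ? in sql_query is bound to one sql_params item.
sql_query = "WHERE instr(url, ?) > 0"
sql_params = ["google.com"]

cursor = db.execute(f"SELECT url FROM pages {sql_query} ORDER BY url", sql_params)
urls = [row[0] for row in cursor]
```

Binding values through `?` placeholders keeps user-supplied strings out of the SQL text itself.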

Note

Both SQL search into the database and Python filtering into the index are supported, and can be combined. The local index is a partial copy of the database and is already a Python object, so it will be faster to filter if you only need to parse the copied data to filter in/out.

RETURNS	DESCRIPTION
`list`	the list of best-matching results as (rank, url, similarity) tuples. TYPE: `list[tuple[int, str, float]]`

core.search.Indexer.get_related ¤

get_related(
    tokens: list[str],
    n: int = 15,
    k: int = 5,
    use_sif: bool = True,
    sif_smoothing: float = 0.001,
) -> list

Get the n closest keywords to the query.

core.search.Indexer.get_clusters ¤

get_clusters(db: sqlite3.Connection)

Find document latent topics modelled as clusters of document centroids.

Writes to the database:

- clusters table: one row per cluster, with label (PK), human-legible keyword labels, centroid BLOB, and max cosine radius so callers can gauge cluster tightness.
- pages.cluster: integer FK into clusters.label for each page.
- search table: three new BLOB columns (cluster_labels_raw, cluster_centroids_raw, cluster_centroids_shape) that mirror the pattern used for vectors_raw so the data loads at the same speed on startup without touching pages at all.

Sets on self (immediately usable without reloading):

- self.cluster_centroids: (K, D) float32, one centroid per cluster.
- self.cluster_doc_indices: dict[int, np.ndarray[int32]], maps each cluster label to its member row indices in self.vectors / self.ranker (positions = search_rowid).

core.search.Indexer.compute_ctfidf_labels ¤

compute_ctfidf_labels(
    labels: np.ndarray, top_n: int = 10
) -> dict[int, list[str]]

Compute c-TF-IDF topic keywords for each cluster using the existing BM25+ ranker.

Returns a dict mapping cluster label → list of top_n discriminative keywords.
