core.deduplicator¤

Find and remove duplicates and near-duplicates in a list of core.types.web_page

Classes¤

core.deduplicator.Deduplicator ¤

Deduplicator(
    threshold: float = 0.9,
    distance: int = 50,
    discard_params: bool = True,
    n_min: int = 0,
    fix_urls: bool = True,
)

Instantiate a deduplicator object.

Duplicate factorization takes a list of core.types.web_page elements.

Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.
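As a rough illustration of the two normalizations described above, here is a minimal sketch using only the standard library. The helper names `canonical_url` and `comparable_text` are hypothetical and are not part of core.deduplicator; the actual implementation may normalize more aggressively.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str, discard_params: bool = True) -> str:
    """Drop the anchor, and optionally the query string, from a URL."""
    parts = urlsplit(url)
    query = "" if discard_params else parts.query
    # The fragment (anchor) is always dropped.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def comparable_text(content: str) -> str:
    """Lowercase the text and drop characters with no ASCII equivalent."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # so encoding to ASCII with errors ignored keeps the base letter only.
    decomposed = unicodedata.normalize("NFKD", content)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()
```

Two pages whose `canonical_url` values match, or whose `comparable_text` values match, would be treated as duplicates under this simplified model.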

You can edit (append to or replace) the list of URL substrings to ignore, core.deduplicator.Deduplicator.urls_to_ignore, before running the deduplication.

Optionally, near-duplicates are also detected by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.

PARAMETER	DESCRIPTION
`threshold`	the minimum Levenshtein ratio between the contents of two pages for those pages to be considered near-duplicates and factorized. If set to 1.0, near-duplicate detection is bypassed, which results in a large speed-up. TYPE: `float` DEFAULT: `0.9`
`distance`	for performance, the near-duplicate search is performed only on the nearest elements after the core.types.web_page list has been sorted alphabetically by URL, assuming near-duplicates will most likely be found on the same domain and at a similar path. The distance parameter defines how many elements ahead are examined. TYPE: `int` DEFAULT: `50`
`discard_params`	on modern CMS that enable “pretty URLs” (URL rewriting), pages are indexed by a `domain/section/subsection/page` path, and URL query parameters are most likely used by meaningless pages such as social sharing links or search result pages, so this parameter can be set to `True` to discard them. On REST-API-driven websites, streaming websites and old CMS using “ugly URLs”, pages are indexed by `domain?content=id` and the query parameters need to be kept by setting this parameter to `False`. TYPE: `bool` DEFAULT: `True`
`n_min`	domains that have a number of indexed pages below this threshold will be discarded entirely. This avoids indexing random dude’s website, under the assumption that relevant and reliable domains will have several pages indexed. TYPE: `int` DEFAULT: `0`
`fix_urls`	attempt to convert `http` URLs to `https` and remove the leading `www.`. This sends DNS requests to check whether the `https` and `www.`-less variants can be reached, which takes at most 2 s per URL. Set to `False` to speed things up. TYPE: `bool` DEFAULT: `True`
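To make the effect of `n_min` concrete, the following sketch filters a list of page dicts by domain frequency. It assumes each page carries a `url` key and takes the domain to be the URL's network location; the function name `filter_by_domain_frequency` is hypothetical and the real filter may define the domain differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def filter_by_domain_frequency(pages: list[dict], n_min: int) -> list[dict]:
    """Keep only pages whose domain appears at least n_min times."""
    if n_min <= 0:
        # A threshold of 0 disables the filter.
        return list(pages)
    counts = Counter(urlsplit(page["url"]).netloc for page in pages)
    return [page for page in pages if counts[urlsplit(page["url"]).netloc] >= n_min]
```

With `n_min=2`, a domain that contributes a single page is dropped entirely, while a domain with two or more pages keeps all of them.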

Attributes¤

core.deduplicator.Deduplicator.urls_to_ignore `class-attribute` `instance-attribute` ¤

urls_to_ignore: list[str] = [
    "/tag/",
    "/tags/",
    "/category/",
    "/categories/",
    "/author/",
    "/authors/",
    "/profil/",
    "/profiles/",
    "/user/",
    "/users/",
    "/login/",
    "/signup/",
    "/member/",
    "/members/",
    "/cart/",
    "/shop/",
    "/register",
]

Substrings to look for in URLs; web pages whose URL contains any of them are removed. These are mostly WordPress archive pages, user profiles and login pages.
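A minimal sketch of how such a substring filter could behave, assuming plain substring matching on the full URL. The `is_ignored` helper and the local `ignore_list` are illustrative only and are not the class's own code.

```python
def is_ignored(url: str, urls_to_ignore: list[str]) -> bool:
    """Return True if the URL contains any of the ignored substrings."""
    return any(fragment in url for fragment in urls_to_ignore)


# Start from a few of the defaults, then append a project-specific entry.
ignore_list = ["/tag/", "/login/", "/register"]
ignore_list.append("/print/")
```

Because matching is by substring, an entry without a trailing slash such as `/register` also matches URLs like `/register?next=/home`.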

Methods:¤

core.deduplicator.Deduplicator.prepare_posts_parallel `classmethod` ¤

prepare_posts_parallel(elem, discard_params, urls_to_ignore, fix_urls)

Canonicalize a :class:~core.types.web_page dict for the list path.

Delegates URL normalization to :meth:_canonicalize_url and adds list-path-specific fallbacks for length and datetime (which are guaranteed to be pre-computed on the DB path by batch_parse_web_page but may be absent on hand-assembled lists).

Returns the mutated elem dict, or None if the URL must be discarded.

core.deduplicator.Deduplicator.get_unique_urls ¤

get_unique_urls(posts: list[web_page]) -> list[web_page]

Pick the most recent, or otherwise the longer, candidate for each canonical URL.
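The election rule can be sketched as a single pass that keeps the best candidate per URL. This assumes each page dict already holds a canonical `url`, a comparable `datetime` and an integer `length`; the key names and the `pick_unique_urls` function are hypothetical, and the real method may apply further tie-breakers.

```python
from datetime import datetime


def rank(post: dict) -> tuple:
    """Newer pages win; on equal dates, longer pages win."""
    return (post["datetime"], post["length"])


def pick_unique_urls(posts: list[dict]) -> list[dict]:
    """Keep one page per canonical URL, chosen by rank()."""
    best: dict[str, dict] = {}
    for post in posts:
        current = best.get(post["url"])
        if current is None or rank(post) > rank(current):
            best[post["url"]] = post
    return list(best.values())
```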

core.deduplicator.Deduplicator.run_on_db ¤

run_on_db(db: sqlite3.Connection, chunksize: int = 4096) -> None

Deduplicate the pages table in-place, matching the full __call__ pipeline.

Runs six sequential phases:

1. URL canonicalization – stream every row through :meth:_canonicalize_url (threaded, I/O-bound), normalise URLs, populate _prepared with canonical URL, domain, and metadata copied verbatim from pages (pre-computed by batch_parse_web_page).
2. URL deduplication – for each canonical URL keep the single best row via SQL window functions ordered by :attr:_ELECTION_ORDER_URL.
3. Exact-content deduplication – among URL winners, collapse rows that share the same content_hash using :attr:_ELECTION_ORDER_CONTENT. Rows without a hash (archival stubs) pass through unchanged.
4. Near-duplicate removal (skipped when threshold == 1.0) – load survivors with non-NULL parsed text into memory, run the parallel Levenshtein window scan, write the final winner set back. Archival stubs bypass this phase entirely.
5. Domain frequency filter (skipped when n_min == 0) – drop every row whose canonical domain appears fewer than :attr:n_min times in the survivor set. Rows with NULL domain are kept unconditionally.
6. Table rebuild – atomically replace pages with the winner rows, writing back canonicalised url, domain, and wayback.

All intermediate temp tables are cleaned up on success.

PARAMETER	DESCRIPTION
`db`	Open `sqlite3.Connection` to the database. TYPE: `sqlite3.Connection`
`chunksize`	Rows fetched per batch during Phase 1. TYPE: `int` DEFAULT: `4096`
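The URL deduplication phase relies on SQL window functions. The sketch below shows the general technique on a simplified `pages` table. The column names (`url`, `date`, `length`) and the ordering (newest first, then longest) are assumptions made for illustration, since the contents of _ELECTION_ORDER_URL are not shown here. Window functions require SQLite 3.25 or later.

```python
import sqlite3


def elect_url_winners(db: sqlite3.Connection) -> list[tuple]:
    """Return one row per URL: the newest, then the longest, candidate."""
    query = """
        SELECT url, date, length
        FROM (
            SELECT url, date, length,
                   ROW_NUMBER() OVER (
                       PARTITION BY url
                       ORDER BY date DESC, length DESC
                   ) AS rn
            FROM pages
        )
        WHERE rn = 1
        ORDER BY url
    """
    return db.execute(query).fetchall()
```

Ranking inside the database avoids loading every candidate row into Python, which matters when the table is too large to hold in memory.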

core.deduplicator.Deduplicator.add_content_hash_column `staticmethod` ¤

add_content_hash_column(db: sqlite3.Connection) -> None

Add (or refresh) a content_hash column on the pages table.

Computes a SHA-1 digest of each row’s parsed field and stores it in content_hash. The column is created if it does not yet exist. Rows with a NULL parsed value are skipped and left with a NULL hash.

A covering index idx_pages_content_hash is created (or left in place) after the update so that subsequent deduplication queries are cheap.

This method is a standalone maintenance utility. The deduplication pipeline (:meth:run_on_db) computes hashes inline during Phase 1 and does not require this method to be called first.

Assumption: parsed values fit in memory individually (they are fetched one batch at a time, not all at once).

PARAMETER	DESCRIPTION
`db`	Open `sqlite3.Connection` to the target database. TYPE: `sqlite3.Connection`
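A possible shape for such a maintenance routine, written as a standalone sketch and not as the method's actual code. It assumes a `pages` table with a text `parsed` column, walks the table in rowid order one batch at a time, and uses the hypothetical name `add_content_hash` with a hypothetical `batch_size` parameter.

```python
import hashlib
import sqlite3


def add_content_hash(db: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Create or refresh pages.content_hash with SHA-1 digests of pages.parsed."""
    columns = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    if "content_hash" not in columns:
        db.execute("ALTER TABLE pages ADD COLUMN content_hash TEXT")

    last_rowid = 0
    while True:
        # Keyset pagination: each batch is fully fetched before any update runs.
        batch = db.execute(
            "SELECT rowid, parsed FROM pages "
            "WHERE parsed IS NOT NULL AND rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, batch_size),
        ).fetchall()
        if not batch:
            break
        db.executemany(
            "UPDATE pages SET content_hash = ? WHERE rowid = ?",
            [(hashlib.sha1(text.encode("utf-8")).hexdigest(), rowid) for rowid, text in batch],
        )
        last_rowid = batch[-1][0]

    db.execute("CREATE INDEX IF NOT EXISTS idx_pages_content_hash ON pages (content_hash)")
    db.commit()
```

Rows whose `parsed` value is NULL are never selected, so their hash stays NULL, and calling the function a second time simply recomputes the same digests.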

core.deduplicator.Deduplicator.get_unique_content ¤

get_unique_content(posts: list[web_page]) -> list[web_page]

Pick the most recent candidate for each canonical content.

RETURNS	DESCRIPTION
`list[web_page]`	`canonical content: web_page` dictionary

core.deduplicator.Deduplicator.get_close_content ¤

get_close_content(
    posts: list[web_page], threshold: float = 0.9, distance: int = 50
) -> list[web_page]

Find and remove near-duplicates using the Levenshtein ratio.

Delegates the actual scan to :meth:_close_content_scan, which parallelises comparisons within each window via a :class:~concurrent.futures.ThreadPoolExecutor. This method is the list-path counterpart to :meth:_elect_near_duplicates; both call the same shared scan implementation.

The election among near-duplicate candidates honours the same priority rules as URL and content deduplication (non-external > newer > longer > shorter URL) via :meth:_elect_group.

PARAMETER	DESCRIPTION
`posts`	List of :class:`core.types.web_page` dicts after URL and exact-content deduplication. TYPE: `list[web_page]`
`threshold`	Minimum Levenshtein ratio for two pages to be considered near-duplicates. Defaults to :attr:`self.threshold`. TYPE: `float` DEFAULT: `0.9`
`distance`	Positions ahead to scan from each row after sorting by URL. Defaults to :attr:`self.distance`. TYPE: `int` DEFAULT: `50`

RETURNS	DESCRIPTION
`list[web_page]`	Filtered list with near-duplicates removed; one survivor per group.
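The windowed scan can be illustrated with a simplified, single-threaded sketch. Several things here are assumptions: the `url` and `content` keys, the hand-written edit distance, the ratio defined as one minus the distance divided by the longer length (the library used by the real code may define its ratio differently), and the rule that the first page in URL order survives, whereas the real method elects a winner via :meth:_elect_group.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def drop_near_duplicates(posts: list[dict], threshold: float = 0.9, distance: int = 50) -> list[dict]:
    """Sort by URL, then compare each page only with the next `distance` pages."""
    ordered = sorted(posts, key=lambda post: post["url"])
    removed: set[int] = set()
    for i, post in enumerate(ordered):
        if i in removed:
            continue
        for j in range(i + 1, min(i + 1 + distance, len(ordered))):
            if j not in removed and ratio(post["content"], ordered[j]["content"]) >= threshold:
                removed.add(j)
    return [post for k, post in enumerate(ordered) if k not in removed]
```

Limiting comparisons to a window keeps the cost proportional to the number of pages times `distance`, instead of comparing every pair of pages.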

core.deduplicator.Deduplicator.run_on_list ¤

run_on_list(posts: list[web_page]) -> list[web_page]

Deduplicate an in-memory list of web pages, matching the full pipeline.

This is the list-based counterpart to :meth:run_on_db. The two methods are kept symmetrical: both run the same four phases (URL canonicalization, exact-URL deduplication, exact-content deduplication, optional near-duplicate removal) and honour the same election rules.

Note

posts is consumed and partially destroyed during processing to avoid keeping two copies in memory simultaneously.

PARAMETER	DESCRIPTION
`posts`	Flat list of :class:`~core.types.web_page` dicts. The list is modified in-place; callers should not rely on its contents after this call returns. TYPE: `list[web_page]`

RETURNS	DESCRIPTION
`list[web_page]`	Deduplicated list of sanitised :class:`~core.types.web_page` dicts, ready for downstream use. Also writes a `domains` frequency file via core.utils.get_models_folder.
