core.deduplicator¤

Find and remove duplicates and near-duplicates in a list of core.types.web_page

© 2024 - Aurélien Pierre.

Classes¤

core.deduplicator.Deduplicator ¤

Deduplicator(
    threshold: float = 0.9,
    distance: int = 50,
    discard_params: bool = True,
    n_min: int = 0,
    fix_urls: bool = True,
)

Instantiate a deduplicator object.

Duplicates are factorized from a list of core.types.web_page elements.

Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.
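As a rough illustration of the two normalisations described above, the following sketch uses only the Python standard library. The function names `canonical_url` and `normalise_content` are hypothetical, and the ASCII conversion shown here (Unicode decomposition, then dropping whatever is not ASCII) is an assumption: the library may transliterate characters differently.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str, discard_params: bool = True) -> str:
    """Drop the anchor, and optionally the query string, from a URL."""
    parts = urlsplit(url)
    query = "" if discard_params else parts.query
    # Assumption: the fragment (anchor) is dropped in both modes.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def normalise_content(text: str) -> str:
    """Lowercase the text and strip accents and other non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


print(canonical_url("https://example.com/blog/post?utm_source=feed#comments"))
print(canonical_url("https://example.com/?content=42#top", discard_params=False))
print(normalise_content("Déjà Vu"))
```

Two pages whose canonical URL or normalised content compare equal after such a step would count as exact duplicates.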

You can edit (append to or replace) the list of URLs to ignore, core.deduplicator.Deduplicator.urls_to_ignore, before running the deduplication.

Optionally, near-duplicates are detected too by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.
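To make the idea of a similarity threshold concrete, here is a minimal pure-Python sketch of an edit distance turned into a score between 0 and 1. It is a stand-in for illustration only: the names are hypothetical, and normalising by the longer string's length is an assumption that may not match the ratio computed by the Levenshtein library the module relies on.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the two contents are at least `threshold` similar."""
    return similarity(a, b) >= threshold
```

The nested loops are quadratic in content length, which gives a feel for why near-duplicate detection is the expensive part of the process.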

PARAMETER DESCRIPTION
threshold

the minimum Levenshtein ratio between the contents of 2 pages for those pages to be considered near-duplicates and be factorized. If set to 1.0, near-duplicate detection is bypassed, which results in a large speed-up.

TYPE: float DEFAULT: 0.9

distance

the near-duplicates search is performed on the nearest elements after the core.types.web_page list has been ordered alphabetically by URL. This is done for performance, assuming near-duplicates will most likely be found on the same domain and at a similar path. The distance parameter defines how many elements ahead we look at.

TYPE: int DEFAULT: 50

discard_params

on modern CMS that enable “pretty URLs” (URL rewriting), pages are indexed as domain/section/subsection/page, and URL query parameters are most likely used by meaningless pages such as social sharing links or search result pages, so this parameter can be set to True to discard them. On REST-API-driven websites, streaming websites and old CMS using “ugly URLs”, pages are indexed as domain?content=id and the query parameters need to be kept by setting this parameter to False.

TYPE: bool DEFAULT: True

n_min

domains that have a number of indexed pages below this threshold will be discarded entirely. This avoids indexing random dude’s website, under the assumption that relevant and reliable domains will have several pages indexed.

TYPE: int DEFAULT: 0

fix_urls

attempt to convert http URLs to https and remove the leading www. prefix. This sends DNS requests to check whether the https and www.-less variants can be reached, which takes at most 2 s per URL. Set to False to speed things up.

TYPE: bool DEFAULT: True
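The last two parameters, n_min and fix_urls, can be sketched roughly as follows. Both functions are hypothetical illustrations rather than the module's code: `is_reachable` is an injected callback standing in for whatever network check the library performs, and pages are assumed to be dicts with a `url` key.

```python
from collections import Counter
from typing import Callable
from urllib.parse import urlsplit, urlunsplit


def fix_url(url: str, is_reachable: Callable[[str], bool]) -> str:
    """Prefer the https and www.-less variants when they can be reached."""
    parts = urlsplit(url)
    scheme, host = parts.scheme, parts.netloc

    def rebuild(new_scheme: str, new_host: str) -> str:
        return urlunsplit(
            (new_scheme, new_host, parts.path, parts.query, parts.fragment)
        )

    if scheme == "http" and is_reachable(rebuild("https", host)):
        scheme = "https"
    if host.startswith("www.") and is_reachable(rebuild(scheme, host[4:])):
        host = host[4:]
    return rebuild(scheme, host)


def drop_small_domains(pages: list[dict], n_min: int = 0) -> list[dict]:
    """Keep only pages whose domain has at least n_min pages in the list."""
    counts = Counter(urlsplit(page["url"]).netloc for page in pages)
    return [p for p in pages if counts[urlsplit(p["url"]).netloc] >= n_min]
```

Passing the reachability check as a callback keeps the sketch testable offline; with a real network probe, each call may block, which is where the per-URL cost mentioned above would come from.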

Attributes¤
core.deduplicator.Deduplicator.urls_to_ignore class-attribute instance-attribute ¤
urls_to_ignore: list[str] = [
    "/tag/",
    "/tags/",
    "/category/",
    "/categories/",
    "/author/",
    "/authors/",
    "/profil/",
    "/profiles/",
    "/user/",
    "/users/",
    "/login/",
    "/signup/",
    "/member/",
    "/members/",
    "/cart/",
    "/shop/",
    "/register",
]

URL substrings to look for: web pages whose URL contains any of them are removed. These are mostly WordPress archive pages, user profiles and login pages.
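Substring matching of this kind can be sketched in a couple of lines. The helper name `should_ignore` is hypothetical, and plain case-sensitive `in` matching is an assumption about how the list is applied.

```python
def should_ignore(url: str, urls_to_ignore: list[str]) -> bool:
    """True when the URL contains any of the ignored substrings."""
    return any(fragment in url for fragment in urls_to_ignore)


ignored = ["/tag/", "/login/", "/register"]
print(should_ignore("https://example.com/tag/python/", ignored))
print(should_ignore("https://example.com/tagged/python/", ignored))
```

Note that an entry without a trailing slash, such as "/register", also matches URLs like /register?next=/ under this kind of matching.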

Functions¤
core.deduplicator.Deduplicator.prepare_posts_parallel classmethod ¤
prepare_posts_parallel(elem, discard_params, urls_to_ignore, fix_urls)

Canonicalize a :class:~core.types.web_page dict for the list path.

Delegates URL normalization to :meth:_canonicalize_url and adds list-path-specific fallbacks for length and datetime (which are guaranteed to be pre-computed on the DB path by batch_parse_web_page but may be absent on hand-assembled lists).

Returns the mutated elem dict, or None if the URL must be discarded.

core.deduplicator.Deduplicator.get_unique_urls ¤
get_unique_urls(posts: list[web_page]) -> list[web_page]

Pick the most recent, or otherwise the longer, candidate for each canonical URL.
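A minimal sketch of this "most recent, otherwise longer" rule, assuming each page is a dict with `url`, `datetime` (ISO-8601 string or None) and `length` keys, and that URLs are already canonical. The function name is hypothetical.

```python
def unique_by_url(pages: list[dict]) -> list[dict]:
    """Keep one page per URL: the most recent, then the longest."""
    best: dict[str, tuple[tuple, dict]] = {}
    for page in pages:
        # ISO-8601 strings in a uniform format sort chronologically;
        # a missing datetime becomes "" and therefore loses to any date.
        key = (page.get("datetime") or "", page.get("length", 0))
        current = best.get(page["url"])
        if current is None or key > current[0]:
            best[page["url"]] = (key, page)
    return [page for _, page in best.values()]
```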

core.deduplicator.Deduplicator.run_on_db ¤
run_on_db(db: sqlite3.Connection, chunksize: int = 4096) -> None

Deduplicate the pages table in-place, matching the full __call__ pipeline.

The method runs four sequential phases that mirror __call__:

  1. URL canonicalization – stream every row through :meth:prepare_posts_parallel (threaded, I/O-bound), normalise URLs, compute a SHA-1 content hash, and write results to the temporary _prepared table.
  2. URL deduplication – for each canonical URL keep the single best row using SQL window functions with :attr:_ELECTION_ORDER.
  3. Exact-content deduplication – among URL winners, collapse rows that share the same SHA-1 hash using the same election order.
  4. Near-duplicate removal (skipped when threshold == 1.0) – load the survivors into memory, run the Levenshtein window scan with parallelised comparisons (threaded; python-Levenshtein releases the GIL), write the final winner set back to a temp table.

The pages table is atomically replaced by the winner set at the end. All intermediate _prepared / _url_winners / _content_winners / _near_winners temp tables are cleaned up on success.

Assumptions:

  - pages has at least the columns: url, title, content, date, datetime, parsed, category.
  - datetime values, when present, are ISO-8601 strings (SQLite TEXT). NULL is treated as “oldest possible” in the election.
  - The external category label means the page was crawled by following external links and contains the full <body>; any other category means it was crawled from a sitemap / REST-API and contains cleaner markup. Non-external therefore wins over external in the election.
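The URL election of phase 2 can be approximated with a SQL window function, as in the sketch below. The `candidates` table, its `length` column and the exact ORDER BY clause are assumptions made for this example; the real ordering lives in _ELECTION_ORDER and may differ. Window functions require SQLite 3.25 or later.

```python
import sqlite3

ROWS = [
    ("https://a.example/post", "external", "2024-05-01T00:00:00", 900),
    ("https://a.example/post", "sitemap", "2024-01-01T00:00:00", 400),
    ("https://b.example/doc", "sitemap", None, 700),
    ("https://b.example/doc", "sitemap", "2023-03-01T00:00:00", 300),
    ("https://c.example/x", "external", "2024-01-01T00:00:00", 100),
    ("https://c.example/x", "external", "2024-01-01T00:00:00", 250),
]


def elect_url_winners(db: sqlite3.Connection) -> list[tuple]:
    """Return the best row per URL: non-external, then newest, then longest."""
    return db.execute(
        """
        SELECT url, category, datetime, length FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY url
                ORDER BY (category = 'external') ASC,  -- non-external first
                         datetime DESC,                -- NULL sorts last here
                         length DESC
            ) AS rn
            FROM candidates
        ) WHERE rn = 1
        ORDER BY url
        """
    ).fetchall()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE candidates (url TEXT, category TEXT, datetime TEXT, length INTEGER)"
)
db.executemany("INSERT INTO candidates VALUES (?, ?, ?, ?)", ROWS)
winners = elect_url_winners(db)
```

SQLite treats NULL as smaller than any other value when sorting, so a descending sort on datetime places undated rows last, which matches the “oldest possible” convention.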

PARAMETER DESCRIPTION
db

Open sqlite3.Connection to the database.

TYPE: sqlite3.Connection

chunksize

Number of rows fetched per batch during Phase 1.

TYPE: int DEFAULT: 4096

core.deduplicator.Deduplicator.add_content_hash_column staticmethod ¤
add_content_hash_column(db: sqlite3.Connection) -> None

Add (or refresh) a content_hash column on the pages table.

Computes a SHA-1 digest of each row’s parsed field and stores it in content_hash. The column is created if it does not yet exist. Rows with a NULL parsed value are skipped and left with a NULL hash.

A covering index idx_pages_content_hash is created (or left in place) after the update so that subsequent deduplication queries are cheap.

This method is a standalone maintenance utility. The deduplication pipeline (:meth:run_on_db) computes hashes inline during Phase 1 and does not require this method to be called first.

Assumption: parsed values fit in memory individually (they are fetched one batch at a time, not all at once).
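A comparable maintenance routine could look like the sketch below. It is not the method's actual code: the function name and `batch_size` parameter are hypothetical, `parsed` is assumed to hold text, and the table is assumed to be a regular rowid table.

```python
import hashlib
import sqlite3


def add_sha1_column(db: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Create or refresh pages.content_hash from the parsed column."""
    columns = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    if "content_hash" not in columns:
        db.execute("ALTER TABLE pages ADD COLUMN content_hash TEXT")

    last_rowid = 0
    while True:
        # Page through the table by rowid so only one batch is in memory.
        batch = db.execute(
            "SELECT rowid, parsed FROM pages "
            "WHERE parsed IS NOT NULL AND rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, batch_size),
        ).fetchall()
        if not batch:
            break
        db.executemany(
            "UPDATE pages SET content_hash = ? WHERE rowid = ?",
            [
                (hashlib.sha1(parsed.encode("utf-8")).hexdigest(), rowid)
                for rowid, parsed in batch
            ],
        )
        last_rowid = batch[-1][0]

    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_pages_content_hash ON pages (content_hash)"
    )
    db.commit()
```

Each batch is fully fetched before the updates run, which avoids modifying the table while a SELECT on it is still being iterated.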

PARAMETER DESCRIPTION
db

Open sqlite3.Connection to the target database.

TYPE: sqlite3.Connection

core.deduplicator.Deduplicator.get_unique_content ¤
get_unique_content(posts: list[web_page]) -> list[web_page]

Pick the most recent candidate for each canonical content.

Return

canonical content: web_page dictionary

core.deduplicator.Deduplicator.get_close_content ¤
get_close_content(
    posts: list[web_page], threshold: float = 0.9, distance: int = 50
) -> list[web_page]

Find and remove near-duplicates using the Levenshtein ratio.

Delegates the actual scan to :meth:_close_content_scan, which parallelises comparisons within each window via a :class:~concurrent.futures.ThreadPoolExecutor. This method is the list-path counterpart to :meth:_elect_near_duplicates; both call the same shared scan implementation.

The election among near-duplicate candidates honours the same priority rules as URL and content deduplication (non-external > newer > longer > shorter URL) via :meth:_elect_group.
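The priority rules can be expressed as a single comparison key, as in this sketch. It is an illustration of the stated ordering only, not the body of _elect_group; the `category`, `datetime`, `length` and `url` keys are assumed to be present on each page dict.

```python
def elect(group: list[dict]) -> dict:
    """Pick the survivor: non-external > newer > longer > shorter URL."""
    return max(
        group,
        key=lambda page: (
            page.get("category") != "external",  # True beats False
            page.get("datetime") or "",          # missing date counts as oldest
            page.get("length", 0),
            -len(page["url"]),                   # shorter URL wins the last tie
        ),
    )
```

Tuples compare element by element, so each criterion only matters when all the previous ones are tied.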

PARAMETER DESCRIPTION
posts

List of :class:core.types.web_page dicts after URL and exact-content deduplication.

TYPE: list[web_page]

threshold

Minimum Levenshtein ratio for two pages to be considered near-duplicates. Defaults to :attr:self.threshold.

TYPE: float DEFAULT: 0.9

distance

Positions ahead to scan from each row after sorting by URL. Defaults to :attr:self.distance.

TYPE: int DEFAULT: 50

RETURNS DESCRIPTION
list[web_page]

Filtered list with near-duplicates removed; one survivor per group.
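The effect of the distance parameter on the amount of work can be seen with a small sketch of the windowed pairing. This only generates the candidate pairs, sequentially; the function name is hypothetical and the threaded comparison done by the real scan is left out.

```python
from typing import Iterator


def windowed_pairs(
    pages: list[dict], distance: int = 50
) -> Iterator[tuple[dict, dict]]:
    """Yield each page with the next `distance` pages in URL order."""
    ordered = sorted(pages, key=lambda page: page["url"])
    for i, page in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + distance]:
            yield page, other


pages = [{"url": f"https://example.com/post-{n}"} for n in range(5)]
print(len(list(windowed_pairs(pages, distance=2))))   # windowed comparisons
print(len(list(windowed_pairs(pages, distance=10))))  # all pairs
```

With n pages, the window caps the number of comparisons at roughly n × distance instead of growing with n².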

core.deduplicator.Deduplicator.run_on_list ¤
run_on_list(posts: list[web_page]) -> list[web_page]

Deduplicate an in-memory list of web pages, matching the full pipeline.

This is the list-based counterpart to :meth:run_on_db. The two methods are kept symmetrical: both run the same four phases (URL canonicalization, exact-URL deduplication, exact-content deduplication, optional near-duplicate removal) and honour the same election rules.

Note

posts is consumed and partially destroyed during processing to avoid keeping two copies in memory simultaneously.

PARAMETER DESCRIPTION
posts

Flat list of :class:~core.types.web_page dicts. The list is modified in-place; callers should not rely on its contents after this call returns.

TYPE: list[web_page]

RETURNS DESCRIPTION
list[web_page]

Deduplicated list of sanitised :class:~core.types.web_page dicts, ready for downstream use. Also writes a domains frequency file.

Functions¤