core.deduplicator
Find and remove duplicates and near-duplicates in a list of core.types.web_page
© 2024 - Aurélien Pierre.
Classes
core.deduplicator.Deduplicator
Deduplicator(
threshold: float = 0.9,
distance: int = 50,
discard_params: bool = True,
n_min: int = 0,
fix_urls: bool = True,
)
Instantiate a deduplicator object.
Deduplication takes a list of core.types.web_page.
Duplicates are detected using canonical URLs (query parameters and anchors removed) and lowercased, ASCII-converted content.
You can edit (append to or replace) the list of URLs to ignore, core.deduplicator.Deduplicator.urls_to_ignore, before running the actual process.
Optionally, near-duplicates are also detected by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.
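The two canonical forms described above can be illustrated with a minimal, standard-library sketch. The helper names `canonical_url` and `canonical_content` and the `drop_query` flag are hypothetical and are not part of core.deduplicator; the actual implementation may normalise more aggressively.

```python
import unicodedata
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str, drop_query: bool = True) -> str:
    """Drop the anchor (fragment) and, optionally, the query string."""
    parts = urlsplit(url)
    query = "" if drop_query else parts.query
    # Rebuild the URL with an empty fragment in all cases.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def canonical_content(text: str) -> str:
    """Lowercase the text and reduce it to plain ASCII."""
    # NFKD splits accented letters into base letter + combining mark,
    # so the ASCII encoder can drop the mark and keep the letter.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Two pages whose canonical URL or canonical content compare equal would then be candidates for exact deduplication.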
| PARAMETER | DESCRIPTION |
|---|---|
| threshold | The minimum Levenshtein ratio between the contents of two pages for those pages to be considered near-duplicates and be factorized. If set to 1.0, near-duplicate detection is bypassed, which results in a huge speed-up. TYPE: float |
| distance | For performance, the near-duplicate search is performed on the nearest elements after the core.types.web_page list has been ordered alphabetically by URL, assuming near-duplicates will most likely be found on the same domain and at a resembling path. The distance parameter defines how many elements ahead we look into. TYPE: int |
| discard_params | On modern CMS that enable “pretty URLs” (URL rewriting), pages will be indexed by a […] TYPE: bool |
| n_min | Domains that have a number of indexed pages below this threshold will be discarded entirely. This avoids indexing a random dude’s website, under the assumption that relevant and reliable domains will have several pages indexed. TYPE: int |
| fix_urls | Attempt to convert […] TYPE: bool |
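As an illustration of the `n_min` rule, here is a minimal sketch that counts pages per domain and drops the domains falling below the threshold. The function name `drop_small_domains` is hypothetical, and it works on bare URL strings rather than on core.types.web_page dictionaries.

```python
from collections import Counter
from urllib.parse import urlsplit


def drop_small_domains(urls: list[str], n_min: int) -> list[str]:
    """Keep only URLs whose domain appears at least n_min times."""
    counts = Counter(urlsplit(url).netloc for url in urls)
    return [url for url in urls if counts[urlsplit(url).netloc] >= n_min]
```

With the default `n_min = 0`, every domain passes the test and nothing is discarded.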
Attributes
core.deduplicator.Deduplicator.urls_to_ignore
class-attribute
instance-attribute
urls_to_ignore: list[str] = [
"/tag/",
"/tags/",
"/category/",
"/categories/",
"/author/",
"/authors/",
"/profil/",
"/profiles/",
"/user/",
"/users/",
"/login/",
"/signup/",
"/member/",
"/members/",
"/cart/",
"/shop/",
"/register",
]
Substrings to look for in URLs; web pages whose URL contains any of them are removed. These are mostly WordPress archive pages, user profiles and login pages.
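A minimal sketch of substring-based filtering, assuming a plain `in` test against each pattern (the real matching logic may differ). `URLS_TO_IGNORE` is a shortened, hypothetical copy of the list above, and `is_ignored` is not part of the module.

```python
# Shortened stand-in for Deduplicator.urls_to_ignore (illustration only).
URLS_TO_IGNORE = ["/tag/", "/category/", "/login/"]


def is_ignored(url: str, patterns: list[str]) -> bool:
    """Return True when the URL contains at least one ignored substring."""
    return any(pattern in url for pattern in patterns)


# Appending a custom pattern before filtering, as the class allows
# for its own list.
patterns = URLS_TO_IGNORE + ["/feed/"]
urls = [
    "https://example.com/tag/python/",
    "https://example.com/blog/tagging-photos/",
    "https://example.com/feed/",
]
kept = [url for url in urls if not is_ignored(url, patterns)]
```

Because the patterns include their surrounding slashes, a path such as `/blog/tagging-photos/` is not caught by `/tag/`.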
Functions
core.deduplicator.Deduplicator.prepare_posts_parallel
classmethod
Canonicalize a `core.types.web_page` dict for the list path.
Delegates URL normalization to `_canonicalize_url` and adds list-path-specific fallbacks for length and datetime (which are guaranteed to be pre-computed on the DB path by batch_parse_web_page but may be absent on hand-assembled lists).
Returns the mutated elem dict, or None if the URL must be discarded.
core.deduplicator.Deduplicator.get_unique_urls
Pick the most recent, or otherwise the longer, candidate for each canonical URL.
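One way to read this rule is "newest wins, and content length breaks ties". The sketch below implements that reading on plain dictionaries. The keys `url`, `datetime` and `length`, and the helpers `rank` and `pick_best_per_url`, are assumptions made for illustration and may not match the actual core.types.web_page layout or the method's exact tie-breaking.

```python
from datetime import datetime


def rank(page: dict) -> tuple:
    """Sort key: newer first, then longer; a missing date counts as oldest."""
    return (page.get("datetime") or datetime.min, page.get("length", 0))


def pick_best_per_url(pages: list[dict]) -> list[dict]:
    """Keep a single page per URL, electing the highest-ranked candidate."""
    best: dict[str, dict] = {}
    for page in pages:
        current = best.get(page["url"])
        if current is None or rank(page) > rank(current):
            best[page["url"]] = page
    return list(best.values())
```

The sketch assumes all datetimes are naive; mixing naive and timezone-aware values would make the comparison fail.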
core.deduplicator.Deduplicator.run_on_db
run_on_db(db: sqlite3.Connection, chunksize: int = 4096) -> None
Deduplicate the pages table in-place, matching the full __call__ pipeline.
The method runs four sequential phases that mirror __call__:
- URL canonicalization – stream every row through `prepare_posts_parallel` (threaded, I/O-bound), normalise URLs, compute a SHA-1 content hash, and write results to the temporary `_prepared` table.
- URL deduplication – for each canonical URL keep the single best row using SQL window functions with `_ELECTION_ORDER`.
- Exact-content deduplication – among URL winners, collapse rows that share the same SHA-1 hash using the same election order.
- Near-duplicate removal (skipped when `threshold == 1.0`) – load the survivors into memory, run the Levenshtein window scan with parallelised comparisons (threaded; `python-Levenshtein` releases the GIL), and write the final winner set back to a temp table.
The pages table is atomically replaced by the winner set at the end.
All intermediate _prepared / _url_winners / _content_winners /
_near_winners temp tables are cleaned up on success.
Assumptions:
- pages has at least the columns: url, title, content, date,
datetime, parsed, category.
- datetime values, when present, are ISO-8601 strings (SQLite TEXT).
NULL is treated as “oldest possible” in the election.
- The external category label means the page was crawled by following
external links and contains the full <body>; any other category means
it was crawled from a sitemap / REST-API and contains cleaner markup.
Non-external therefore wins over external in the election.
| PARAMETER | DESCRIPTION |
|---|---|
| db | Open […] TYPE: sqlite3.Connection |
| chunksize | Number of rows fetched per batch during Phase 1. TYPE: int |
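The URL-deduplication phase can be sketched with `ROW_NUMBER()` over a partition per canonical URL. This is an illustration only: the table `prepared`, its columns, and the `ORDER BY` clause are assumptions derived from the election rules stated in this page (non-external, then newer, then longer, then shorter URL); the real `_ELECTION_ORDER` is not shown here. Window functions require SQLite 3.25 or later.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE prepared ("
    "url TEXT, canonical_url TEXT, category TEXT, "
    "datetime TEXT, content_length INTEGER)"
)
db.executemany(
    "INSERT INTO prepared VALUES (?, ?, ?, ?, ?)",
    [
        ("https://a.org/post?x=1", "https://a.org/post", "external", "2024-05-01T00:00:00", 900),
        ("https://a.org/post", "https://a.org/post", "sitemap", "2024-01-01T00:00:00", 500),
        ("https://b.org/p#top", "https://b.org/p", "sitemap", None, 100),
        ("https://b.org/p", "https://b.org/p", "sitemap", "2023-01-01T00:00:00", 100),
    ],
)

# Rank candidates inside each canonical URL and keep the first one.
# SQLite sorts NULL last under DESC, so a missing datetime behaves
# as "oldest possible".
rows = db.execute(
    """
    SELECT url FROM (
        SELECT url,
               ROW_NUMBER() OVER (
                   PARTITION BY canonical_url
                   ORDER BY (category = 'external') ASC,
                            datetime DESC,
                            content_length DESC,
                            LENGTH(url) ASC
               ) AS rn
        FROM prepared
    )
    WHERE rn = 1
    ORDER BY url
    """
).fetchall()
winner_urls = [row[0] for row in rows]
```

In this sample data, the sitemap row wins over the newer external one for the first URL, and the dated row wins over the undated one for the second.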
core.deduplicator.Deduplicator.add_content_hash_column
staticmethod
add_content_hash_column(db: sqlite3.Connection) -> None
Add (or refresh) a content_hash column on the pages table.
Computes a SHA-1 digest of each row’s parsed field and stores it in
content_hash. The column is created if it does not yet exist. Rows
with a NULL parsed value are skipped and left with a NULL hash.
A covering index idx_pages_content_hash is created (or left in place)
after the update so that subsequent deduplication queries are cheap.
This method is a standalone maintenance utility. The deduplication
pipeline (`run_on_db`) computes hashes inline during Phase 1 and
does not require this method to be called first.
Assumption: parsed values fit in memory individually (they are fetched
one batch at a time, not all at once).
| PARAMETER | DESCRIPTION |
|---|---|
| db | Open […] TYPE: sqlite3.Connection |
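A minimal sketch of the behaviour described for this utility, written against the standard `sqlite3` and `hashlib` modules. The function name `add_hash_column`, the `batch_size` parameter and the rowid-based pagination are assumptions for illustration; only the table name, column names and index name come from the description above.

```python
import hashlib
import sqlite3


def add_hash_column(db: sqlite3.Connection, batch_size: int = 1000) -> None:
    """Store the SHA-1 of `parsed` in `content_hash`, one batch at a time."""
    columns = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    if "content_hash" not in columns:
        db.execute("ALTER TABLE pages ADD COLUMN content_hash TEXT")

    last_rowid = 0
    while True:
        # Paginate by rowid so only one batch is held in memory and the
        # table is never updated while a cursor is iterating over it.
        rows = db.execute(
            "SELECT rowid, parsed FROM pages "
            "WHERE rowid > ? AND parsed IS NOT NULL "
            "ORDER BY rowid LIMIT ?",
            (last_rowid, batch_size),
        ).fetchall()
        if not rows:
            break
        db.executemany(
            "UPDATE pages SET content_hash = ? WHERE rowid = ?",
            [
                (hashlib.sha1(parsed.encode("utf-8")).hexdigest(), rowid)
                for rowid, parsed in rows
            ],
        )
        last_rowid = rows[-1][0]

    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_pages_content_hash "
        "ON pages (content_hash)"
    )
    db.commit()
```

Rows whose `parsed` value is NULL are never selected, so their `content_hash` is left untouched.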
core.deduplicator.Deduplicator.get_unique_content
Pick the most recent candidate for each canonical content.
Return
canonical content: web_page dictionary
core.deduplicator.Deduplicator.get_close_content
get_close_content(
posts: list[web_page], threshold: float = 0.9, distance: int = 50
) -> list[web_page]
Find and remove near-duplicates using the Levenshtein ratio.
Delegates the actual scan to `_close_content_scan`, which parallelises comparisons within each window via a `concurrent.futures.ThreadPoolExecutor`. This method is the list-path counterpart to `_elect_near_duplicates`; both call the same shared scan implementation.
The election among near-duplicate candidates honours the same priority rules as URL and content deduplication (non-external > newer > longer > shorter URL) via `_elect_group`.
| PARAMETER | DESCRIPTION |
|---|---|
| posts | List of […] TYPE: list[web_page] |
| threshold | Minimum Levenshtein ratio for two pages to be considered near-duplicates. Defaults to […] TYPE: float |
| distance | Positions ahead to scan from each row after sorting by URL. Defaults to […] TYPE: int |

| RETURNS | DESCRIPTION |
|---|---|
| list[web_page] | Filtered list with near-duplicates removed; one survivor per group. |
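The windowed scan can be sketched as follows. To stay within the standard library, this example swaps in `difflib.SequenceMatcher.ratio()` as a stand-in similarity measure instead of the Levenshtein ratio, runs sequentially rather than in a thread pool, and simply keeps the first page of each group in URL order instead of applying the election rules. The function name `drop_near_duplicates` and the `url` and `content` keys are assumptions.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(
    pages: list[dict], threshold: float = 0.9, distance: int = 50
) -> list[dict]:
    """Remove pages too similar to an earlier page within a sliding window."""
    if threshold >= 1.0:
        # Mirrors the documented bypass: no near-duplicate detection at 1.0.
        return list(pages)

    ordered = sorted(pages, key=lambda page: page["url"])
    dropped: set[int] = set()
    for i, page in enumerate(ordered):
        if i in dropped:
            continue
        # Only look at the next `distance` neighbours in URL order.
        for j in range(i + 1, min(i + 1 + distance, len(ordered))):
            if j in dropped:
                continue
            ratio = SequenceMatcher(
                None, page["content"], ordered[j]["content"]
            ).ratio()
            if ratio >= threshold:
                dropped.add(j)
    return [page for i, page in enumerate(ordered) if i not in dropped]
```

Limiting the comparison to a window keeps the cost roughly proportional to `len(pages) * distance` comparisons instead of growing quadratically with the list size.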
core.deduplicator.Deduplicator.run_on_list
Deduplicate an in-memory list of web pages, matching the full pipeline.
This is the list-based counterpart to `run_on_db`. The two methods
are kept symmetrical: both run the same four phases (URL canonicalization,
exact-URL deduplication, exact-content deduplication, optional
near-duplicate removal) and honour the same election rules.
Note
posts is consumed and partially destroyed during processing to
avoid keeping two copies in memory simultaneously.
| PARAMETER | DESCRIPTION |
|---|---|
| posts | Flat list of […] |

| RETURNS | DESCRIPTION |
|---|---|
| list[web_page] | Deduplicated list of sanitised […] ready for downstream use. Also writes a […] |