
core.types¤

Classes¤

core.types.web_page ¤

Bases: TypedDict

Typed dictionary representing a web page and its metadata. It can also be used for any text document that has a URL/URI.

The database module automatically uses this dictionary's keys to create DB columns when generating the web_page DB. Keys therefore need to be kept in sync across all modules: modules should not add their own custom keys.

PEP 705 adds the ability to declare read-only typed dictionaries, which should be used here as soon as it is available in stable Python, to forbid custom key definitions in web_page instances from other modules.
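As a rough illustration of how the keys of a typed dictionary could drive column creation, the sketch below reads them from the class annotations. The reduced `web_page_sketch` class, the `column_definitions` helper and the `SQL_TYPES` mapping are all hypothetical and are not the database module's actual code.

```python
from typing import TypedDict, get_type_hints


class web_page_sketch(TypedDict):
    """Reduced, hypothetical stand-in for core.types.web_page."""
    title: str
    url: str
    length: int


# Assumed mapping from Python types to SQL column types (illustrative only).
SQL_TYPES = {str: "TEXT", int: "INTEGER"}


def column_definitions(typed_dict: type) -> list[str]:
    """Build `name TYPE` column definitions from the declared keys, in declaration order."""
    hints = get_type_hints(typed_dict)
    # Types without a known SQL equivalent fall back to BLOB in this sketch.
    return [f"{key} {SQL_TYPES.get(hint, 'BLOB')}" for key, hint in hints.items()]


print(column_definitions(web_page_sketch))
```

Because the columns are derived from the declared keys, a key added ad hoc by another module would have no matching column, which is why the key set has to stay fixed.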

Attributes¤
core.types.web_page.title instance-attribute ¤
title: str

Title of the page

core.types.web_page.url instance-attribute ¤
url: str

Where to find the page on the network. Can be a local or remote URI, with or without protocol, or even a unique identifier.

core.types.web_page.domain instance-attribute ¤
domain: str

Domain of the url
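The source does not say how `domain` is derived from `url`. One possible approach, shown here only as a sketch, uses the standard library's `urllib.parse`; the `domain_of` helper is a hypothetical name.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the network location of a URL, with or without protocol."""
    # urlparse only fills `netloc` when a scheme or a leading `//` is present,
    # so protocol-less URLs get a `//` prefix first.
    parsed = urlparse(url if "//" in url else "//" + url)
    return parsed.netloc


print(domain_of("https://example.com/a/b"))
print(domain_of("example.com/a/b"))
```

Local URIs and bare identifiers have no network location, so this sketch returns an empty string for them.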

core.types.web_page.date instance-attribute ¤
date: str

Date of the last modification of the page, as a string, used to assess the relevance of the content.

core.types.web_page.content instance-attribute ¤
content: str

The actual content of the page in a human-readable way.

core.types.web_page.excerpt instance-attribute ¤
excerpt: str

Shortened version of the content for search result previews. Typically provided by websites as the description meta tag.

core.types.web_page.h1 instance-attribute ¤
h1: list

Title of the post if any. There should be only one h1 per page, matching title, but some templates wrongly use h1 for section titles.

core.types.web_page.h2 instance-attribute ¤
h2: list

Section titles if any

core.types.web_page.lang instance-attribute ¤
lang: str

Two-letter ISO code of the page language. Not used internally; it matters only if your own implementation needs it.

core.types.web_page.category instance-attribute ¤
category: str

Arbitrary category, tag or label set by user, to be reused for example in AI document tagging.

core.types.web_page.datetime instance-attribute ¤
datetime: dt

The page date as a directly usable datetime.datetime object.
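The relation between the `date` string and the `datetime` object can be sketched as follows. This assumes the string is in ISO 8601 format and that naive values should be read as UTC; neither is stated in the source, and `to_datetime` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_datetime(date: str) -> datetime:
    """Parse an ISO 8601 date string into a timezone-aware datetime."""
    parsed = datetime.fromisoformat(date)
    if parsed.tzinfo is None:
        # Assumption: dates without an explicit offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


print(to_datetime("2023-05-17"))
print(to_datetime("2023-05-17T10:30:00+02:00"))
```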

core.types.web_page.length instance-attribute ¤
length: int

Length of content, in characters.

core.types.web_page.parsed instance-attribute ¤
parsed: str

The normalized content of the page (lowercase, possibly converted to plain ASCII characters), meant for machine processing.

core.types.web_page.tokenized instance-attribute ¤
tokenized: list

Tokens of the parsed content, including meta-tokens if needed, stored as a list of sentences where each sentence is itself a list of string tokens. This is a basic tokenization meant to train n-grams, so it should retain enough semantics.
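The nested shape of `tokenized` (a list of sentences, each a list of string tokens) can be illustrated with a deliberately naive tokenizer. The `tokenize` helper and its regular expressions are assumptions made for this example; the project's actual tokenizer, including its meta-tokens, is not shown here.

```python
import re


def tokenize(parsed: str) -> list[list[str]]:
    """Split normalized text into sentences, then each sentence into word tokens."""
    # Naive sentence boundary: whitespace following ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", parsed.strip())
    tokens = [re.findall(r"\w+", sentence) for sentence in sentences]
    # Drop sentences that contained no word characters at all.
    return [sentence for sentence in tokens if sentence]


print(tokenize("the lens is sharp. it is also heavy!"))
```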

core.types.web_page.stemmed instance-attribute ¤
stemmed: list

Same as tokenized but stemmed and normalized on top.

core.types.web_page.vectorized instance-attribute ¤
vectorized: np.ndarray

Precomputed vector representation of the tokenized content.

core.types.web_page.wayback instance-attribute ¤
wayback: str

URL of the page accessed through web.archive.org/Wayback Machine, so we can store the canonical URL in the url key.

core.types.web_page.crawled instance-attribute ¤
crawled: dt

Date and time of the last crawl of this page.

core.types.web_page.content_hash instance-attribute ¤
content_hash: str

SHA-1 hash of the parsed (normalized) content.
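A minimal sketch of how `parsed` and `content_hash` relate, assuming normalization means lowercasing plus stripping diacritics to ASCII, and that the hash is taken over the UTF-8 bytes of the result. Both helpers are hypothetical and the project's real normalization may differ.

```python
import hashlib
import unicodedata


def normalize(content: str) -> str:
    """Lowercase the text and reduce it to plain ASCII characters."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # which the ASCII encoding step then drops.
    decomposed = unicodedata.normalize("NFKD", content.lower())
    return decomposed.encode("ascii", "ignore").decode("ascii")


def content_hash(parsed: str) -> str:
    """Return the hexadecimal SHA-1 digest of the normalized content."""
    return hashlib.sha1(parsed.encode("utf-8")).hexdigest()


parsed = normalize("Café Noir")
print(parsed, content_hash(parsed))
```

Hashing the normalized text rather than the raw content means that two pages differing only in case or accents get the same hash.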

Functions¤

core.types.sanitize_web_page ¤

sanitize_web_page(page: web_page) -> web_page

Ensure existence and validity of web_page keys/values.

core.types.db_row_to_web_page ¤

db_row_to_web_page(row: list[tuple[any]]) -> web_page

Turn an SQL extraction of a full row into a web_page. Columns are matched to keys in the same order, so the database needs to have been saved with columns in the right order: call core.types.sanitize_web_page before saving.
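Order-based matching of columns to keys can be sketched as below, using an in-memory SQLite table. The reduced `web_page_sketch` class, the `row_to_page` helper and the `pages` table are hypothetical, and the sketch handles a single row rather than the exact input type of `db_row_to_web_page`.

```python
import sqlite3
from typing import Any, TypedDict


class web_page_sketch(TypedDict, total=False):
    """Reduced, hypothetical stand-in for core.types.web_page."""
    title: str
    url: str
    length: int


def row_to_page(row: tuple[Any, ...]) -> web_page_sketch:
    """Pair each column value with the key declared at the same position."""
    keys = list(web_page_sketch.__annotations__)
    if len(row) != len(keys):
        raise ValueError(f"expected {len(keys)} columns, got {len(row)}")
    return web_page_sketch(**dict(zip(keys, row)))


# Columns are created in the same order as the keys are declared.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, url TEXT, length INTEGER)")
conn.execute("INSERT INTO pages VALUES (?, ?, ?)", ("Home", "https://example.com", 1200))
page = row_to_page(conn.execute("SELECT * FROM pages").fetchone())
conn.close()
print(page)
```

Since nothing but position ties a value to its key, a table saved with a different column order would silently produce mislabeled values, hence the requirement on ordering.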

core.types.get_web_page_ram ¤

get_web_page_ram(item: web_page) -> int

Get RAM usage of a web_page in bytes

core.types.get_web_pages_ram ¤

get_web_pages_ram(web_pages: list[web_page]) -> int

Get RAM usage of a list of web_pages in bytes
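How these functions measure memory is not described here. One simple way to estimate it, given only as a sketch, is `sys.getsizeof` applied to the dictionary and to each of its keys and values. The `page_ram` and `pages_ram` names are hypothetical, and the estimate is shallow: the items inside nested lists are not counted.

```python
import sys


def page_ram(item: dict) -> int:
    """Shallow estimate, in bytes, of a dictionary plus its keys and values."""
    total = sys.getsizeof(item)
    for key, value in item.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total


def pages_ram(pages: list) -> int:
    """Estimate, in bytes, of a list of pages: the list itself plus every page."""
    return sys.getsizeof(pages) + sum(page_ram(page) for page in pages)


pages = [
    {"title": "Home", "url": "https://example.com"},
    {"title": "About", "url": "https://example.com/about"},
]
print(page_ram(pages[0]), pages_ram(pages))
```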