core.types
Classes
core.types.web_page
Bases: TypedDict
Typed dictionary representing a web page and its metadata. It can also be used for any text document that has a URL/URI.
The database module automatically uses this dictionary’s keys to create DB columns when generating the web_page DB.
Therefore, keys need to be kept in sync across all modules: modules should not add their own custom keys.
PEP 705 adds the ability to declare read-only typed dictionaries, which should be used here as soon as it lands in stable Python, to forbid custom key definitions in web_page instances from other modules.
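As an illustration of why the keys must stay in sync, here is a minimal sketch of how a database layer could derive its columns from the typed dictionary. The reduced `web_page` class, the `SQL_TYPES` mapping and `create_table_statement` are hypothetical names introduced for this example; they are not the actual database module's code.

```python
from typing import TypedDict, get_type_hints


class web_page(TypedDict):
    """Reduced stand-in for core.types.web_page (only a few keys shown)."""
    url: str
    date: str
    content: str
    excerpt: str
    h1: list
    lang: str


# Hypothetical mapping from Python types to SQL column types (assumption).
SQL_TYPES = {str: "TEXT", list: "TEXT"}


def create_table_statement(table: str = "web_page") -> str:
    """Build a CREATE TABLE statement from the keys, in declaration order."""
    columns = ", ".join(
        f"{key} {SQL_TYPES.get(py_type, 'BLOB')}"
        for key, py_type in get_type_hints(web_page).items()
    )
    return f"CREATE TABLE IF NOT EXISTS {table} ({columns})"


print(create_table_statement())
```

Because the columns follow the declared keys, a custom key added by another module would have no matching column under this scheme.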
Attributes
core.types.web_page.url
instance-attribute
url: str
Where to find the page on the network. Can be a local or remote URI, with or without protocol, or even a unique identifier.
core.types.web_page.date
instance-attribute
date: str
Date of the last modification of the page, to assess relevance of the content, as a string.
core.types.web_page.content
instance-attribute
content: str
The actual content of the page in a human-readable way.
core.types.web_page.excerpt
instance-attribute
excerpt: str
Shortened version of the content for search result previews. Typically provided by websites as the description meta tag.
core.types.web_page.h1
instance-attribute
h1: list
Title of the post if any. There should be only one h1 per page, matching title, but some templates wrongly use h1 for section titles.
core.types.web_page.lang
instance-attribute
lang: str
2-letter ISO code (ISO 639-1) of the page language. Not used internally; it matters only if your implementation needs it.
core.types.web_page.category
instance-attribute
category: str
Arbitrary category, tag or label set by user, to be reused for example in AI document tagging.
core.types.web_page.datetime
instance-attribute
datetime: dt
The page date as a directly usable datetime.datetime object.
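The relation between the date string and the datetime key could look like the sketch below. It assumes the date string is in ISO 8601 format, which this page does not state; `parse_page_date` is a hypothetical helper, not part of core.types.

```python
from datetime import datetime


def parse_page_date(date: str) -> datetime:
    """Convert an ISO 8601 date string (assumed format) into a datetime object."""
    return datetime.fromisoformat(date)


page_datetime = parse_page_date("2023-05-17T08:30:00")
print(page_datetime.year, page_datetime.hour)
```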
core.types.web_page.parsed
instance-attribute
parsed: str
The normalized content of the page (lowercase, possibly converted to simple ASCII characters), intended for machine processing of the content.
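A minimal sketch of this kind of normalization, using only the standard library. The `normalize` function is a hypothetical helper and may differ from what the project actually does; note that characters without an ASCII decomposition are simply dropped by this approach.

```python
import unicodedata


def normalize(content: str) -> str:
    """Lowercase the text and fold accented characters to plain ASCII."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", content.lower())
    # Encoding to ASCII with "ignore" drops the combining marks
    # (and any character that has no ASCII equivalent).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(normalize("Café Noël"))
```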
core.types.web_page.tokenized
instance-attribute
tokenized: list
List of parsed content text tokens, including metatokens if needed, as a list of sentences, where each sentence is itself a list of string tokens. This is a basic tokenization meant to train n-grams, so it should retain enough semantics.
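To make the nested shape concrete, here is a deliberately naive sketch that produces a list of sentences, each a list of tokens. The `tokenize` function and its regular expressions are assumptions for illustration; the project's tokenizer (including its metatokens) is not shown here.

```python
import re


def tokenize(parsed: str) -> list[list[str]]:
    """Split text into sentences, then each sentence into word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", parsed.strip())
    tokenized = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence)
        if tokens:  # skip empty or punctuation-only sentences
            tokenized.append(tokens)
    return tokenized


print(tokenize("the cat sat. it purred!"))
```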
core.types.web_page.stemmed
instance-attribute
stemmed: list
Same as tokenized but stemmed and normalized on top.
core.types.web_page.vectorized
instance-attribute
Precomputed vector representation of the tokenized content.
core.types.web_page.wayback
instance-attribute
wayback: str
URL of the page accessed through web.archive.org/Wayback Machine, so we can store the canonical URL in the url key.
Functions
core.types.sanitize_web_page
Ensure existence and validity of web_page keys/values.
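The signature and exact rules of this function are not documented on this page. As a rough idea of what such a check might involve, the sketch below fills in missing keys, replaces values of the wrong type with an empty default, and drops unknown keys. The reduced `web_page` class and the `sanitize` helper are hypothetical.

```python
from typing import TypedDict, get_type_hints


class web_page(TypedDict, total=False):
    """Reduced stand-in for core.types.web_page (only a few keys shown)."""
    url: str
    date: str
    content: str
    h1: list


def sanitize(page: dict) -> web_page:
    """Return a copy holding exactly the declared keys, with valid types."""
    clean = {}
    for key, expected in get_type_hints(web_page).items():
        value = page.get(key)
        # Keep valid values; otherwise fall back to an empty str or list.
        clean[key] = value if isinstance(value, expected) else expected()
    return clean


print(sanitize({"url": "https://example.com", "h1": "oops", "extra": 1}))
```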
core.types.db_row_to_web_page
Turn an SQL extraction of a full row into a web_page. Columns are matched to keys in the same order.
The database therefore needs to be saved with columns in the right order: call core.types.sanitize_web_page first.
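Positional matching of columns to keys can be sketched as follows, using an in-memory SQLite database. The reduced `web_page` class, the `pages` table and the `row_to_page` helper are hypothetical and only show why column order matters.

```python
import sqlite3
from typing import TypedDict, get_type_hints


class web_page(TypedDict):
    """Reduced stand-in for core.types.web_page (only a few keys shown)."""
    url: str
    date: str
    content: str


def row_to_page(row: tuple) -> web_page:
    """Pair each column value with the key declared at the same position."""
    keys = list(get_type_hints(web_page))
    if len(row) != len(keys):
        raise ValueError(f"expected {len(keys)} columns, got {len(row)}")
    return dict(zip(keys, row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, date TEXT, content TEXT)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("https://example.com", "2023-05-17", "Hello"),
)
row = conn.execute("SELECT * FROM pages").fetchone()
page = row_to_page(row)
conn.close()
print(page)
```

If the table had been created with the columns in a different order, the values would silently land under the wrong keys, which is why the order has to be fixed before saving.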
core.types.get_web_page_ram
Get the RAM usage of a web_page, in bytes.
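How the function measures memory is not described here. One common approach, shown below as an assumption-laden sketch, is to sum `sys.getsizeof` recursively over the dictionary and its nested containers. `deep_sizeof` is a hypothetical name, and the result is an approximation that depends on the Python build.

```python
import sys


def deep_sizeof(obj, seen=None) -> int:
    """Approximate the memory footprint of an object and its contents, in bytes."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


page = {"url": "https://example.com", "tokenized": [["hello", "world"]]}
print(deep_sizeof(page))
```

A plain `sys.getsizeof(page)` would only count the dictionary container itself, not the strings and nested lists it refers to.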