core.crawler¤

Module containing utilities to crawl websites for the text content of their HTML, XML and PDF pages. PDFs can be read from their embedded text, if any, or through optical character recognition (OCR) for scans. Websites can be crawled from a sitemap.xml file or by following internal links recursively from an index page. Pages are aggregated into a list of core.types.web_page objects, meant to be used as input to train natural-language AI models and to be indexed and ranked by search engines.

© 2023-2024 - Aurélien Pierre

Classes¤

core.crawler.Crawler ¤

Crawler(
    delay: float = 1.0,
    no_follow: list[str] = [],
    known_urls: dict[str, datetime.datetime] | None = None,
    since: datetime.datetime | None = None,
)

Bases: DelayedClass

Crawl a website from its sitemap or by following internal links recursively from an index page. This class must be used within a with statement, which takes care of allocating and releasing resources in the background.

PARAMETER DESCRIPTION
delay

time in seconds to wait between two consecutive HTTP requests. The right delay prevents the crawler from being throttled by anti-DoS rules while keeping it as fast as possible. Set to 0.0 if you are crawling your own servers and they have no DoS protection.

TYPE: float DEFAULT: 1.0

no_follow

list of URL fragments to ignore completely: matching URLs are neither indexed nor crawled for internal links.

TYPE: list[str] DEFAULT: []

known_urls

mapping of url → last crawled datetime for pages already in the index. When provided, crawling methods use it to skip pages that have not changed since they were last indexed. Populate it conveniently with load_known_urls.

TYPE: dict[str, datetime.datetime] | None DEFAULT: None

since

global freshness cut-off for recursive crawling. Any URL present in known_urls and last crawled on or after this datetime will be skipped entirely. Has no effect when known_urls is empty or when a URL is not yet known. For sitemap crawling, the sitemap’s own <lastmod> field takes precedence; since is only used as a fallback for entries that have no <lastmod>.

TYPE: datetime.datetime | None DEFAULT: None
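
The skipping rules described for known_urls and since can be sketched as a small standalone function. This is an illustration under stated assumptions, not the module's actual code: should_skip is a hypothetical name, and the behaviour when a URL is known but neither <lastmod> nor since is available (here: fetch again) is a guess.

```python
import datetime


def should_skip(url, known_urls, since=None, lastmod=None):
    """Return True when a page can be skipped as still fresh (illustrative only)."""
    # URL keys are compared without leading/trailing slashes.
    last_crawled = known_urls.get(url.strip("/"))
    if last_crawled is None:
        # Unknown URL: always fetch it.
        return False
    if lastmod is not None:
        # Sitemap entry with <lastmod>: skip unless modified after the last crawl.
        return lastmod <= last_crawled
    if since is not None:
        # No <lastmod>: skip if the page was crawled on or after the cut-off.
        return last_crawled >= since
    # Assumption: without any freshness information, fetch again.
    return False


utc = datetime.timezone.utc
known = {"https://domain.com/page": datetime.datetime(2025, 3, 1, tzinfo=utc)}
cutoff = datetime.datetime(2025, 1, 1, tzinfo=utc)
print(should_skip("https://domain.com/page/", known, since=cutoff))
```

All datetimes should be timezone-aware (or all naive), since Python refuses to compare the two kinds.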

Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")

with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)      # populate incremental-update map
    cr.since = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc)

    # Only re-fetches pages whose <lastmod> is newer than the stored crawl date.
    pages = cr.get_website_from_sitemap("https://domain.com", "en")

    # Only re-fetches pages not yet in the index, or crawled before since.
    pages += cr.get_website_from_crawling("https://forum.domain.com", "en")
Attributes¤
core.crawler.Crawler.no_follow class-attribute instance-attribute ¤
no_follow: list[str] = [
    "api.whatsapp.com/share",
    "api.whatsapp.com/send",
    "pinterest.fr/pin/create",
    "pinterest.com/pin/create",
    "facebook.com/sharer",
    "twitter.com/intent/tweet",
    "twitter.com/share",
    "x.com/share",
    "reddit.com/submit",
    "t.me/share",
    "linkedin.com/share",
    "vk.com/share.php",
    "bufferapp.com/add",
    "getpocket.com/edit",
    "tumblr.com/share",
    "www.addtoany.com/add_to",
    "share.flipboard.com/bookmarklet/",
    "?share=",
    "?replytocom=",
    "translate.google.com/translate",
    "flickr.com",
    "//flic.kr/",
    "instagram.com",
    "threads.com",
    "facebook.com",
    "linkedin.com",
    "twitter.com",
    "//t.co/tiktok.com",
    "pinterest.com",
    "//x.com/",
    "reddit.com",
    "sciprofiles.com",
    "www.citeulike.org",
    "linktr.ee",
    "mailto:",
    "/profile/",
    "/login/",
    "/login.php",
    "/wp-login.php/signup/",
    "/signup.php",
    "/login?",
    "/signup?/user/",
    "/member/",
    "/register?",
    "login.microsoftonline.com",
    ".css",
    ".js",
    ".json",
    ".jpg",
    ".png",
    ".jpeg",
    ".gif",
    ".webp",
    ".heif",
    ".tif",
]

List of URL substrings: any URL containing one of them is not crawled. Mostly social-network sharing links.

core.crawler.Crawler.crawled_URL instance-attribute ¤
crawled_URL: list[str] = []

List of { URL + category } hashes already visited. When a website is crawled both from its sitemap and by following internal links recursively, the recursively-crawled pages are tagged with an external category, which core.deduplicator.Deduplicator later treats with a lower priority than any other category. Sitemap crawling may restrict content to selected HTML tags and thus produce better-quality data with less noise. We therefore keep crawling everything from the sitemap, whether or not it was already crawled from internal links earlier, and let deduplication sort it out.

core.crawler.Crawler.crawled_content instance-attribute ¤
crawled_content: list[str] = []

List of hashes of content already known

core.crawler.Crawler.known_urls instance-attribute ¤
known_urls: dict[str, datetime.datetime] = (
    dict(known_urls) if known_urls else {}
)

Mapping of URL → last-crawled datetime for incremental updates. Populated at construction time or via load_known_urls.

Note

Leading and trailing / are stripped from URL keys, for generality.

core.crawler.Crawler.since instance-attribute ¤
since: datetime.datetime | None = since

Global freshness cut-off for recursive and API-based crawling. Pages in known_urls last crawled on or after this datetime are skipped.

core.crawler.Crawler.errors instance-attribute ¤
errors = []

URLs that couldn’t be accessed due to blocking or throttling

core.crawler.Crawler.notfound instance-attribute ¤
notfound = []

URLs returning error 404 - not found

Functions¤
core.crawler.Crawler.load_known_urls ¤
load_known_urls(db: sqlite3.Connection) -> int

Populate the incremental-update map from an existing index database.

After calling this, all crawling methods will skip pages whose stored crawl timestamp indicates they are still fresh (see self.since and the <lastmod> logic in get_website_from_sitemap).

PARAMETER DESCRIPTION
db

an open SQLite connection to a Virtual Secretary database (as returned by core.database.open_db or core.database.create_db).

TYPE: sqlite3.Connection

RETURNS DESCRIPTION
int

Number of URL entries loaded.

Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)
    cr.since = datetime.datetime(2025, 6, 1, tzinfo=datetime.timezone.utc)
    pages = cr.get_website_from_sitemap("https://domain.com", "en")
db.close()
core.crawler.Crawler.get_most_recent_page ¤
get_most_recent_page(db: sqlite3.Connection) -> datetime.datetime | None

Get the datetime of the most recent web_page indexed in the db database

core.crawler.Crawler.get_most_recent_crawl ¤
get_most_recent_crawl(db: sqlite3.Connection) -> datetime.datetime | None

Get the datetime of the most recently crawled web_page indexed in the db database

core.crawler.Crawler.get_crawling_threshold ¤
get_crawling_threshold(db: sqlite3.Connection) -> datetime.datetime | None

Get the safe date from which incremental crawling of a website should restart. We use the older of the last crawling date and the date of the most recent page, to account for possibly ill-formed page dates that were set in the future at the time of crawling.
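
The "older of the two dates" rule can be sketched as follows. This is a minimal illustration, assuming both dates were already read from the database; crawling_threshold is a hypothetical helper and the handling of missing values is an assumption.

```python
import datetime


def crawling_threshold(last_crawl, most_recent_page):
    """Return the older of the two datetimes, ignoring missing values."""
    candidates = [d for d in (last_crawl, most_recent_page) if d is not None]
    # Assumption: an empty database yields None, meaning "crawl everything".
    return min(candidates) if candidates else None


utc = datetime.timezone.utc
last_crawl = datetime.datetime(2025, 6, 1, tzinfo=utc)
# A page carrying a bogus date in the future must not push the threshold forward.
bogus_page = datetime.datetime(2031, 1, 1, tzinfo=utc)
print(crawling_threshold(last_crawl, bogus_page))
```

Taking the minimum means a wrongly future-dated page can only make the next crawl start earlier than strictly needed, never later.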

discard_link(url)

Returns True if url contains any of the substrings listed in self.no_follow.
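
A minimal sketch of such a substring test, written as a free function that takes the list as an argument (the real method reads self.no_follow, and its exact matching rules may differ):

```python
def discard_link(url: str, no_follow: list[str]) -> bool:
    """Return True if any no_follow fragment occurs anywhere in the URL."""
    return any(fragment in url for fragment in no_follow)


no_follow = ["twitter.com/intent/tweet", "?share=", ".js"]
print(discard_link("https://twitter.com/intent/tweet?text=hi", no_follow))
print(discard_link("https://domain.com/article", no_follow))
```

With plain substring matching, short fragments are broad: under this sketch, ".js" also discards a URL such as https://domain.com/node.js-tutorial.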

get_immediate_links(
    links: list[str],
    domain,
    default_lang,
    langs,
    category,
    contains_str,
    internal_links: str = "any",
    mine_pdf=False,
) -> list[web_page]

Follow the internal and external links contained in a webpage, to a single level of recursion only, including PDF files and HTML pages. This is useful to index reference documents linked from a page.

PARAMETER DESCRIPTION
internal_links

defines what to do with links found inside the HTML page body/content:

  • any: follow and include all links found in page content, no matter what domain they point to,
  • internal: follow and include links found in page content only if they point to the same domain as the current page,
  • external: follow and include links found in page content only if they point to a different domain than the current page,
  • ignore: don’t follow links found in page content.

TYPE: str DEFAULT: 'any'

RETURNS DESCRIPTION
list[web_page]

list of the pages that the links point to.
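
The four internal_links modes amount to a filter on the host name of each link. The sketch below assumes absolute URLs and treats "same domain" as an identical host name; filter_links is a hypothetical helper, and the module may compare domains differently (for example with respect to subdomains).

```python
from urllib.parse import urlparse


def filter_links(links, current_url, internal_links="any"):
    """Keep the links allowed by the internal_links mode (illustrative only)."""
    if internal_links == "ignore":
        return []
    if internal_links == "any":
        return list(links)
    if internal_links not in ("internal", "external"):
        raise ValueError(f"unknown internal_links mode: {internal_links!r}")

    # Assumption: "same domain" means an identical host name.
    current_host = urlparse(current_url).netloc.lower()
    keep_internal = internal_links == "internal"
    return [
        link for link in links
        if (urlparse(link).netloc.lower() == current_host) == keep_internal
    ]


links = ["https://domain.com/a", "https://other.org/b.pdf"]
print(filter_links(links, "https://domain.com/index", "external"))
```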

update_link(
    old_link: str, new_link: str | None, category: str, status_code: int
) -> str

Update target link with possible HTTP redirections

PARAMETER DESCRIPTION
old_link

original URL followed, found in HTML

TYPE: str

new_link

destination URL retrieved, possibly after HTTP redirections.

TYPE: str | None

category

tagged category of the page

TYPE: str

status_code

HTTP status code returned

TYPE: int

core.crawler.Crawler.get_website_from_crawling ¤
get_website_from_crawling(
    website: str,
    default_lang: str = "en",
    child: str = "/",
    langs: tuple = ("en", "fr"),
    markup: str = "body",
    contains_str: str | list[str] = "",
    max_recurse_level: int = -1,
    category: str = "",
    restrict_section: bool = False,
    mine_pdf: bool = False,
    _recursion_level: int = 0,
    _mainthread: bool = True,
) -> list[web_page]

Recursively crawl all pages of a website from internal links found starting from the child page. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain.

PARAMETER DESCRIPTION
website

root of the website, including https:// or http:// without trailing slash.

TYPE: str

default_lang

provided or guessed main language of the website content. Not used internally.

TYPE: str DEFAULT: 'en'

child

page of the website to use as index to start crawling for internal links.

TYPE: str DEFAULT: '/'

langs

ISO 639-1 two-letter codes of the languages for which we attempt to fetch the translation if available, looking for the HTML <link rel="alternate" hreflang="..."> tag.

TYPE: tuple DEFAULT: ('en', 'fr')
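
As an illustration of the hreflang lookup, the sketch below collects the alternate links of a page for the requested languages using only the standard library. AlternateLinks is a hypothetical class; the module relies on its own ParsedHTML handler, whose behaviour may differ (for instance with regional codes such as fr-CA, which this sketch reduces to their first two letters).

```python
from html.parser import HTMLParser


class AlternateLinks(HTMLParser):
    """Collect <link rel="alternate" hreflang="..."> targets for some languages."""

    def __init__(self, langs):
        super().__init__()
        self.langs = langs
        self.found = {}  # language code -> URL of the translated page

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        lang = (attributes.get("hreflang") or "").lower()[:2]
        href = attributes.get("href")
        if "alternate" in rel and href and lang in self.langs:
            # Keep the first URL declared for each language.
            self.found.setdefault(lang, href)


html = (
    '<head><link rel="alternate" hreflang="fr" href="https://domain.com/fr/page"/>'
    '<link rel="alternate" hreflang="de" href="https://domain.com/de/page">'
    '<link rel="stylesheet" href="/style.css"></head>'
)
parser = AlternateLinks(("en", "fr"))
parser.feed(html)
print(parser.found)
```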

contains_str

a string or a list of strings that should be contained in a page URL for the page to be indexed. On a forum, you could for example restrict pages to URLs containing "discussion" to get only the threads and avoid user profiles or archive pages.

TYPE: str | list[str] DEFAULT: ''

markup

TYPE: str DEFAULT: 'body'

max_recurse_level

this method calls itself recursively on each internal link found in the current page, starting from the website/child page. max_recurse_level defines how many times it may call itself before it is stopped, if it is stopped at all. When set to -1, it stops when all the internal links have been crawled.

TYPE: int DEFAULT: -1

category

arbitrary category or label set by user for classification. Will be automatically set to external for URLs followed outside of the main domain.

TYPE: str DEFAULT: ''

restrict_section

set to True to limit crawling to the website section defined by ://website/child/*. This is useful when indexing parts of very large websites when you are only interested in a small subset.

TYPE: bool DEFAULT: False

mine_pdf

set to True to aggressively try to crawl PDF files linked on external HTML pages. This may increase RAM consumption dramatically.

TYPE: bool DEFAULT: False

_recursion_level

DON’T USE IT. Every time this method calls itself recursively, it increments this variable internally, and recursion stops when the level is equal to max_recurse_level.

TYPE: int DEFAULT: 0
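
The interplay between max_recurse_level and _recursion_level can be illustrated on an in-memory link graph, with no network access. This is a sketch under assumptions: crawl and the _visited argument are hypothetical, and whether the real method still indexes the page reached at the last level (as done here) is a guess.

```python
def crawl(graph, page, max_recurse_level=-1, _recursion_level=0, _visited=None):
    """Depth-limited recursive walk over {page: [linked pages]} (illustrative only)."""
    visited = set() if _visited is None else _visited
    if page in visited:
        # Already crawled: this is what prevents infinite loops on circular links.
        return []
    visited.add(page)
    found = [page]
    if max_recurse_level != -1 and _recursion_level >= max_recurse_level:
        # Depth limit reached: keep the page but do not follow its links.
        return found
    for link in graph.get(page, []):
        found += crawl(graph, link, max_recurse_level, _recursion_level + 1, visited)
    return found


graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": ["/"]}
print(crawl(graph, "/"))
print(crawl(graph, "/", max_recurse_level=1))
```

With -1, the walk ends only when every reachable page has been visited; the set of visited pages plays the role that crawled_URL plays in the class.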

RETURNS DESCRIPTION
list[web_page]

a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> with crawler.Crawler() as cr:
...     pages = cr.get_website_from_crawling("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))
core.crawler.Crawler.get_website_from_sitemap ¤
get_website_from_sitemap(
    website: str,
    default_lang: str,
    sitemap: str = "/sitemap.xml",
    langs: tuple[str] = ("en", "fr"),
    markup: str | tuple | list[str] | list[tuple] = "body",
    category: str = "",
    contains_str: str | list[str] = "",
    internal_links: str = "any",
    mine_pdf: bool = False,
    _recursion_level: int = 0,
) -> list[web_page]

Recursively crawl all pages of a website from links found in a sitemap. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain. Sitemaps of sitemaps are followed recursively.
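
To make the distinction between a sitemap and a sitemap of sitemaps concrete, here is a minimal sketch that parses both document kinds with the standard library, following the sitemaps.org XML schema. read_sitemap is a hypothetical helper working on XML text already downloaded; the module's own parsing may differ.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text):
    """Return (child sitemap URLs, [(page URL, lastmod or None), ...])."""
    root = ET.fromstring(xml_text)
    kind = root.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
    if kind == "sitemapindex":
        # Sitemap of sitemaps: each child has to be fetched and parsed in turn.
        children = [(loc.text or "").strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]
        return children, []
    pages = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default=None, namespaces=NS)
        pages.append((loc, lastmod.strip() if lastmod else None))
    return [], pages


urlset = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://domain.com/a</loc><lastmod>2025-02-01</lastmod></url>"
    "<url><loc>https://domain.com/b</loc></url>"
    "</urlset>"
)
print(read_sitemap(urlset))
```

Entries without <lastmod> come back with None, which is the case where the since cut-off would serve as a fallback.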

PARAMETER DESCRIPTION
website

root of the website, including https:// or http:// without trailing slash.

TYPE: str

default_lang

provided or guessed main language of the website content. Not used internally.

TYPE: str

sitemap

relative path of the XML sitemap.

TYPE: str DEFAULT: '/sitemap.xml'

langs

ISO 639-1 two-letter codes of the languages for which we attempt to fetch the translation if available, looking for the HTML <link rel="alternate" hreflang="..."> tag.

TYPE: tuple[str] DEFAULT: ('en', 'fr')

markup

TYPE: str | tuple | list[str] | list[tuple] DEFAULT: 'body'

category

arbitrary category or label

TYPE: str DEFAULT: ''

contains_str

limit recursive crawling from sitemap-defined pages to pages containing this string or list of strings. Will get passed as-is to core.crawler.Crawler.get_website_from_crawling.

TYPE: str | list[str] DEFAULT: ''

internal_links

defines what to do with links found inside the HTML page body/content:

  • any: follow and include all links found in page, no matter what domain they point to,
  • internal: follow and include links found in page only if they point to the same domain as the current page,
  • external: follow and include links found in page only if they point to a different domain than the current page,
  • ignore: don’t follow internal links.

TYPE: str DEFAULT: 'any'

RETURNS DESCRIPTION
list[web_page]

a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> with crawler.Crawler() as cr:
...     pages = cr.get_website_from_sitemap("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))
core.crawler.Crawler.get_unique_internal_url ¤
get_unique_internal_url(
    page: ParsedHTML, domain: str, currentURL: str
) -> list[str]

Grab the internal links found in page, except PDF, and return only the ones we don’t already know

core.crawler.Crawler.get_youtube_channels ¤
get_youtube_channels(
    channel_ids: list[str],
    api_key: str,
    default_lang: str = "en",
    category: str = "video",
    since: datetime.datetime | None = None,
) -> list[web_page]

Index YouTube channels via the Data API v3 (no OAuth required).

Retrieves the full upload list for each channel by walking the channel’s uploads playlist, then fetches the complete snippet for each video. The result mirrors what get_website_from_sitemap produces for a normal website: one core.types.web_page per video, with title, content (video description), date, lang, and category populated.

Incremental update logic:

  • If since is provided, any video URL already present in self.known_urls and last crawled on or after since is skipped.
  • If since is None but self.since is set, self.since is used as the cut-off.
  • Videos not yet in self.known_urls are always fetched.

Rate limiting respects self.delay and the www.googleapis.com domain bucket, consistent with the rest of the crawler.
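
For readers unfamiliar with the Data API v3, the sketch below builds the two kinds of request URLs involved in walking an uploads playlist: channels.list (whose response exposes the uploads playlist ID under contentDetails.relatedPlaylists.uploads) and playlistItems.list (paginated through nextPageToken). The helper names are hypothetical and nothing is sent over the network; whether the crawler issues exactly these requests is an assumption.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.googleapis.com/youtube/v3"


def channel_request(channel_id, api_key):
    """URL asking for the channel details that contain the uploads playlist ID."""
    query = urlencode({"part": "contentDetails", "id": channel_id, "key": api_key})
    return f"{API_ROOT}/channels?{query}"


def uploads_request(playlist_id, api_key, page_token=None):
    """URL asking for one page (at most 50 items) of the uploads playlist."""
    params = {"part": "snippet", "playlistId": playlist_id, "maxResults": 50, "key": api_key}
    if page_token:
        # Value of nextPageToken taken from the previous response.
        params["pageToken"] = page_token
    return f"{API_ROOT}/playlistItems?{urlencode(params)}"


print(channel_request("UCmsSn3fujI81EKEr4NLxrcg", "YOUR_KEY"))
```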

PARAMETER DESCRIPTION
channel_ids

list of YouTube channel IDs — the UC… string visible in any channel URL (youtube.com/channel/UC…).

TYPE: list[str]

api_key

Google Cloud API key with YouTube Data API v3 enabled. See https://developers.google.com/youtube/v3/getting-started.

TYPE: str

default_lang

fallback language code when the video metadata does not declare one.

TYPE: str DEFAULT: 'en'

category

label applied to every indexed video, reused by search filters.

TYPE: str DEFAULT: 'video'

since

skip videos whose URL is already known and was crawled on or after this datetime. Pass None (default) to (re-)index everything.

TYPE: datetime.datetime | None DEFAULT: None

RETURNS DESCRIPTION
list[web_page]

list of core.types.web_page objects, one per video.

Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=0.5) as cr:
    cr.load_known_urls(db)
    pages = cr.get_youtube_channels(
        channel_ids = ["UCmsSn3fujI81EKEr4NLxrcg",
                       "UCkqe4BYsllmcxo2dsF-rFQw"],
        api_key     = "YOUR_KEY",
        default_lang = "en",
        category    = "video",
        since       = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc),
    )
database.populate_db(db, pages)
core.crawler.Crawler.get_github_repositories ¤
get_github_repositories(
    repositories: list[tuple[str, str]],
    api_key: str,
    features: list[str] | None = None,
    langs: tuple[str, ...] = ("en", "fr"),
    category: str = "Github",
    since: datetime.datetime | None = None,
    mine_pdf: bool = True,
) -> list[web_page]

Index GitHub repository content via the REST API.

Supported features: "issues", "pulls", "commits", "discussions". Issue and pull-request comments are concatenated with the parent body. External links found in Markdown bodies are followed at one recursion level (same behaviour as get_website_from_crawling with max_recurse_level=1), and PDF files linked from those pages are mined when mine_pdf is True.

Incremental update:

  • For issues, pulls, and commits, the GitHub API’s native ?since= query parameter is used when since is provided, so only items updated after that date are fetched — minimising API quota usage.
  • For discussions, the REST API has no since filter; client-side filtering by created_at is applied instead.
  • 429 / 403 rate-limit responses are handled automatically: the crawler reads the Retry-After header and waits accordingly.
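
Two small pieces of this logic can be sketched in isolation: formatting the since value as the ISO 8601 UTC timestamp that the GitHub API expects, and deriving a waiting time from a rate-limited response. github_since and retry_wait are hypothetical helpers, and the 60-second fallback used when Retry-After is absent is an arbitrary assumption.

```python
import datetime


def github_since(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SSZ (UTC)."""
    return moment.astimezone(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def retry_wait(status_code, headers, default=60.0):
    """Seconds to wait before retrying; 0.0 when the response is not rate-limited."""
    if status_code not in (403, 429):
        return 0.0
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        # Header missing or not a number of seconds: fall back to a fixed pause.
        return default


since = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc)
print(github_since(since))
print(retry_wait(429, {"Retry-After": "30"}))
```

The headers argument is a plain dict here, so the key is case-sensitive; HTTP client libraries usually expose a case-insensitive mapping instead.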
PARAMETER DESCRIPTION
repositories

list of (owner, repo) tuples, e.g. [("aurelienpierreeng", "ansel"), ("darktable-org", "darktable")].

TYPE: list[tuple[str, str]]

api_key

GitHub personal access token (classic or fine-grained, read-only repo scope is sufficient). See https://docs.github.com/en/rest/authentication.

TYPE: str

features

subset of ["issues", "pulls", "commits", "discussions"] to index. Defaults to all four when None.

TYPE: list[str] | None DEFAULT: None

langs

language codes passed through to get_immediate_links when following external links from item bodies.

TYPE: tuple[str, ...] DEFAULT: ('en', 'fr')

category

label applied to every indexed item.

TYPE: str DEFAULT: 'Github'

since

only fetch items created or updated after this datetime. Overrides self.since when provided.

TYPE: datetime.datetime | None DEFAULT: None

mine_pdf

whether to follow and extract PDF files linked from item bodies.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[web_page]

list of core.types.web_page objects.

Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=0.72) as cr:
    cr.load_known_urls(db)
    pages = cr.get_github_repositories(
        repositories = [("aurelienpierreeng", "ansel"),
                        ("darktable-org", "rawspeed")],
        api_key      = "ghp_…",
        features     = ["issues", "pulls", "commits"],
        since        = datetime.datetime(2025, 1, 1,
                                         tzinfo=datetime.timezone.utc),
        mine_pdf     = True,
    )
database.populate_db(db, pages)
core.crawler.Crawler.get_stackexchange_posts ¤
get_stackexchange_posts(
    site: str,
    api_key: str | None = None,
    category: str = "forum",
    langs: tuple[str, ...] = ("en",),
    since: datetime.datetime | None = None,
    window_days: int = 90,
    earliest_date: datetime.datetime | None = None,
    se_filter: str = "!14e92L7CSAvro*ufn5-s.s23LqfumIAci09lv0z)*cLWPr",
) -> list[web_page]

Index a Stack Exchange community via the public API v2.3.

Retrieves all posts (questions, answers) together with their embedded comments from the posts endpoint. Each post’s body and its comments are concatenated into a single core.types.web_page and external links found in the Markdown bodies are followed at one recursion level (PDFs included).

Pagination and rate limits. Without an API key the SE API allows 300 requests/day and a maximum of 25 pages per date window. With a key, the daily quota rises to 10 000 requests. The method handles both cases: it pages through 25-page windows, each covering window_days days of posts, sliding backward in time until earliest_date is reached. When since is provided the window collapses to a single forward pass from since to now, which is the efficient path for incremental updates. The API’s backoff field is always respected.
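
The backward-sliding date windows of a full crawl can be sketched as a generator of (fromdate, todate) pairs, expressed in Unix epoch seconds as the Stack Exchange API expects. backward_windows is a hypothetical helper; the way the module aligns and clips its windows may differ.

```python
import datetime


def backward_windows(now, earliest_date, window_days=90):
    """Yield (fromdate, todate) epoch pairs, newest window first."""
    step = datetime.timedelta(days=window_days)
    end = now
    while end > earliest_date:
        # The last window is clipped so that it never starts before earliest_date.
        start = max(end - step, earliest_date)
        yield int(start.timestamp()), int(end.timestamp())
        end = start


utc = datetime.timezone.utc
now = datetime.datetime(2025, 1, 1, tzinfo=utc)
earliest = datetime.datetime(2024, 7, 1, tzinfo=utc)
for fromdate, todate in backward_windows(now, earliest):
    print(fromdate, todate)
```

Each pair would then be paged through, up to the per-window page cap, before moving to the next older window.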

Incremental update. Two complementary mechanisms combine:

  • since (or self.since) is passed as fromdate to the API, so the server only returns posts created or edited after that point.
  • self.known_urls provides per-URL precision: for each post the last_edit_date field is compared with the stored crawl timestamp, and the post is skipped when the stored timestamp is more recent — catching the case where a post was fetched as part of a wide window but not actually changed.
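
The per-post comparison can be sketched as follows, using the epoch-second dates returned by the API. post_is_unchanged is a hypothetical helper, and falling back to creation_date for posts that were never edited (and thus carry no last_edit_date) is an assumption.

```python
import datetime


def post_is_unchanged(post, known_urls):
    """Return True when the stored crawl is more recent than the post's last change."""
    stored = known_urls.get(post["link"].strip("/"))
    if stored is None:
        # Unknown post: it must be fetched.
        return False
    # Assumption: posts that were never edited only carry creation_date.
    stamp = post.get("last_edit_date", post["creation_date"])
    changed = datetime.datetime.fromtimestamp(stamp, tz=datetime.timezone.utc)
    return stored > changed


utc = datetime.timezone.utc
known = {"https://photo.stackexchange.com/q/1": datetime.datetime(2025, 3, 1, tzinfo=utc)}
post = {
    "link": "https://photo.stackexchange.com/q/1",
    "creation_date": int(datetime.datetime(2024, 5, 1, tzinfo=utc).timestamp()),
    "last_edit_date": int(datetime.datetime(2025, 2, 1, tzinfo=utc).timestamp()),
}
print(post_is_unchanged(post, known))
```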

SE filter. The default se_filter string was built at api.stackexchange.com/docs/filters and requests the following fields: body_markdown, comments, comments.body_markdown, comments.link, creation_date, last_edit_date, link, title. Pass a custom filter string if you need additional fields.

PARAMETER DESCRIPTION
site

Stack Exchange site name as used in the API, e.g. "photo", "stackoverflow", "unix", "electronics". Sites with standalone domains (stackoverflow.com, superuser.com, serverfault.com, askubuntu.com, mathoverflow.net) are resolved automatically.

TYPE: str

api_key

Optional Stack Exchange API key. Raises daily quota from 300 to 10 000 requests/day. Obtain one free at https://stackapps.com/apps/oauth/register.

TYPE: str | None DEFAULT: None

category

Label applied to every indexed post.

TYPE: str DEFAULT: 'forum'

langs

Language codes passed to get_immediate_links when following external links from post bodies.

TYPE: tuple[str, ...] DEFAULT: ('en',)

since

Only fetch posts whose creation_date or last_edit_date is at or after this datetime. Passed as fromdate to the API. Overrides self.since when provided.

TYPE: datetime.datetime | None DEFAULT: None

window_days

Size (in days) of each date window used when doing a full crawl (i.e. when since is None). Smaller windows mean more API requests but fewer items per page, reducing the chance of hitting the 25-page cap. Default 90.

TYPE: int DEFAULT: 90

earliest_date

Stop the full-crawl backward walk when this date is reached. Defaults to 2010-01-01 (SE’s approximate launch date).

TYPE: datetime.datetime | None DEFAULT: None

se_filter

Opaque SE filter string defining which fields are returned. Override only when you need fields beyond the defaults.

TYPE: str DEFAULT: '!14e92L7CSAvro*ufn5-s.s23LqfumIAci09lv0z)*cLWPr'

RETURNS DESCRIPTION
list[web_page]

list of core.types.web_page objects.

Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)
    pages = cr.get_stackexchange_posts(
        site     = "photo",
        api_key  = "YOUR_SE_APP_KEY",
        category = "forum",
        since    = datetime.datetime(2025, 1, 1,
                                     tzinfo=datetime.timezone.utc),
    )
database.populate_db(db, pages)

Functions¤

core.crawler.get_content_type ¤

get_content_type(
    url: str, delay: DelayedClass, bypass_robots_txt=False
) -> tuple[str, bool, str, dict | None, int]

Probe a URL for its HTTP headers only, to see what type of content it returns. Tries to sanitize partly-invalid URLs, for example when the protocol is not handled or redirected (http vs. https), or when invalid trailing characters, URL parameters or anchors are passed.

PARAMETER DESCRIPTION
url

fully-formed link to try (including protocol)

TYPE: str

delay

time to wait before re-trying different sanitized URLs

TYPE: DelayedClass

RETURNS DESCRIPTION
type

the type of content, like text/html, application/pdf, etc.

TYPE: str

status

the state flag:

  • True if the URL exists, but it might be unreachable or forbidden,
  • False if the URL raises an HTTP 404 error (not found).

TYPE: bool

response_url

the actual target URL, sanitized and possibly redirected.

TYPE: str

states

HTTP return code

TYPE: int

core.crawler.relative_to_absolute ¤

relative_to_absolute(URL: str, domain: str, current_page: str) -> str

Convert a relative URL to absolute by prepending the domain.

PARAMETER DESCRIPTION
URL

the URL string to normalize to absolute,

TYPE: str

domain

the domain name of the website, without protocol (http://) nor trailing slash. It is prepended to relative links starting with /.

TYPE: str

current_page

the URL of the page from which we analyze links. It is prepended to relative links starting with ./.

TYPE: str

RETURNS DESCRIPTION
str

The normalized, absolute URL on this website.

Examples:

>>> relative_to_absolute("folder/page", "me.com")
"://me.com/folder/page"

core.crawler.radical_url ¤

radical_url(URL: str) -> str

Trim a URL to its page (radical) part, removing the anchor, if any (internal links).
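
One possible way to obtain this behaviour with the standard library is urllib.parse.urldefrag. The sketch below is an equivalent under that assumption, not necessarily how the module implements it.

```python
from urllib.parse import urldefrag


def radical_url(url: str) -> str:
    """Return the URL without its #fragment part."""
    return urldefrag(url).url


print(radical_url("http://me.com/page#section-1"))
```

Query parameters are left untouched by this approach; only the part after # is dropped.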

Examples:

>>> radical_url("http://me.com/page#section-1")
"http://me.com/page"

core.crawler.get_page_content ¤

get_page_content(
    url: str | None,
    delay: DelayedClass,
    content: str | None = None,
    custom_header={},
) -> tuple[ParsedHTML | None, str | None, int]

Request an (x)HTML page through the network with HTTP GET and feed its response to a ParsedHTML handler. This needs a functional network connection.

The DOM is pre-filtered as follows to keep only natural language and avoid duplicate strings:

  • media tags are removed (<iframe>, <embed>, <img>, <svg>, <audio>, <video>, etc.),
  • code and machine language tags are removed (<script>, <style>, <math>),
  • menus and sidebars are removed (<nav>, <aside>),
  • forms, fields and buttons are removed (<select>, <input>, <button>, <textarea>, etc.).

The HTML is un-minified to help end-of-sentence detection in cases where sentences don’t end with punctuation (e.g. in titles).
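
The effect of this pre-filtering can be imitated with the standard library alone. The sketch below is a simplified stand-in for ParsedHTML, under assumptions: TextOnly and visible_text are hypothetical names, only a subset of the removed tags is listed, and emitting one text chunk per line only loosely mimics the un-minification step.

```python
from html.parser import HTMLParser

# Elements whose whole content is dropped. Void elements such as <img>,
# <embed> or <input> carry no text, so they need no special handling here.
REMOVED = {"iframe", "svg", "audio", "video", "script", "style", "math",
           "nav", "aside", "select", "button", "textarea"}


class TextOnly(HTMLParser):
    """Collect text found outside of the removed elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a removed element
        self.chunks = []  # pieces of natural-language text, in page order

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # One chunk per line, so that titles without punctuation stay separated.
    return "\n".join(parser.chunks)


page = ('<body><nav><a href="/">Home</a></nav><h1>Title</h1>'
        '<script>var x = "<p>no</p>";</script>'
        '<p>Hello <img src="a.png">world</p><aside>Ads</aside></body>')
print(visible_text(page))
```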

PARAMETER DESCRIPTION
url

a valid URL that can be fetched with an HTTP GET request.

TYPE: str | None

content

a string buffer used as HTML source. If this argument is passed, we don’t fetch url from network and directly use this input.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
tuple[ParsedHTML | None, str | None, int]

a tuple with:

  1. a core.parser.ParsedHTML object initialized with the page DOM for further text mining, or None if the HTML response was empty or the URL could not be reached. The list of URLs found in the page before removing meaningless markup is stored as a list of strings in the object.links member. object.h1 and object.h2 contain the sets of level-1 and level-2 headers found in the page before removing any markup. object.date contains the best guess for the date.
  2. the final URL of the retrieved page, which might be different from the input URL if HTTP redirections happened.

core.crawler.parse_page ¤

parse_page(
    page: ParsedHTML,
    url: str,
    lang: str | None,
    markup: str | tuple | list[str] | list[tuple] | None,
    date: str | None = None,
    category: str | None = None,
) -> list[web_page]

Get the requested markup from the requested page URL.

This chains in a single call:

PARAMETER DESCRIPTION
page

a core.parser.ParsedHTML handler with pre-filtered DOM,

TYPE: ParsedHTML

url

the valid URL accessible by HTTP GET request of the page

TYPE: str

lang

the provided or guessed language of the page,

TYPE: str | None

markup

the markup to search for. See core.parser.ParsedHTML.get_page_markup for details.

TYPE: str | tuple | list[str] | list[tuple] | None

date

if the page was retrieved from a sitemap, usually the date is available in ISO format (yyyy-mm-ddTHH:MM:SS) and can be passed directly here. Otherwise, several attempts will be made to extract it from the page content (see core.parser.ParsedHTML.get_date).

TYPE: str | None DEFAULT: None

category

arbitrary category or label defined by user

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
list[web_page]

The content of the page, including metadata, in a core.types.web_page singleton.

core.crawler.hash_with_category ¤

hash_with_category(data: str, category: str) -> str

Produce a unique identifier mixing data and category.
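
A sketch of one way to build such an identifier, so that the same URL visited under two categories yields two distinct entries in crawled_URL. The choice of SHA-256 and of a separator character is an assumption made for illustration; the module may use another hash function or combination scheme.

```python
import hashlib


def hash_with_category(data: str, category: str) -> str:
    """Return a hexadecimal digest that depends on both data and category."""
    # Assumption: a separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    payload = f"{category}\x1f{data}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


same_url = "https://domain.com/page"
print(hash_with_category(same_url, "blog") == hash_with_category(same_url, "external"))
```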