core.crawler ¤

Module containing utilities to crawl websites for the text content of their HTML, XML and PDF pages. PDF files can be read from their embedded text, if any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml file or by following internal links recursively from an index page. Each page is appended to a list of core.types.web_page objects, meant to be used as input to train natural language AI models and to index and rank pages for search engines.

Classes¤

core.crawler.Crawler ¤

Crawler(
    delay: float = 1.0,
    no_follow: list[str] = [],
    known_urls: dict[str, datetime.datetime] | None = None,
    since: datetime.datetime | None = None,
)

Bases: DelayedClass

Crawl a website from its sitemap or by following internal links recursively from an index page. This class must be used within a with statement, which takes care of allocating and releasing resources in the background.

PARAMETER	DESCRIPTION
`delay`	time in seconds to wait between 2 HTTP requests. The right delay will prevent the crawler from being throttled by anti-DoS rules while making it as fast as possible. Set to `0.0` if you are crawling your own servers and they have no DoS protection. TYPE: `float` DEFAULT: `1.0`
`no_follow`	list of URL parts to ignore completely: matching URLs are neither indexed nor crawled for internal links. TYPE: `list[str]` DEFAULT: `[]`
`known_urls`	mapping of `url → last crawled datetime` for pages already in the index. When provided, crawling methods use it to skip pages that have not changed since they were last indexed. Populate it conveniently with load_known_urls. TYPE: `dict[str, datetime.datetime] \| None` DEFAULT: `None`
`since`	global freshness cut-off for recursive crawling. Any URL present in known_urls and last crawled on or after this datetime will be skipped entirely. Has no effect when known_urls is empty or when a URL is not yet known. For sitemap crawling, the sitemap’s own `<lastmod>` field takes precedence; since is only used as a fallback for entries that have no `<lastmod>`. TYPE: `datetime.datetime \| None` DEFAULT: `None`

Example

import datetime
from core import crawler, database

db = database.open_db("my-engine.db")

with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)      # populate incremental-update map
    cr.since = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc)

    # Only re-fetches pages whose <lastmod> is newer than the stored crawl date.
    pages = cr.get_website_from_sitemap("https://domain.com", "en")

    # Only re-fetches pages not yet in the index, or crawled before since.
    pages += cr.get_website_from_crawling("https://forum.domain.com", "en")

Attributes¤

core.crawler.Crawler.no_follow `class-attribute` `instance-attribute` ¤

no_follow: list[str] = [
    "api.whatsapp.com/share",
    "api.whatsapp.com/send",
    "pinterest.fr/pin/create",
    "pinterest.com/pin/create",
    "facebook.com/sharer",
    "twitter.com/intent/tweet",
    "twitter.com/share",
    "x.com/share",
    "reddit.com/submit",
    "t.me/share",
    "linkedin.com/share",
    "vk.com/share.php",
    "bufferapp.com/add",
    "getpocket.com/edit",
    "tumblr.com/share",
    "www.addtoany.com/add_to",
    "share.flipboard.com/bookmarklet/",
    "?share=",
    "?replytocom=",
    "translate.google.com/translate",
    "flickr.com",
    "//flic.kr/",
    "instagram.com",
    "threads.com",
    "facebook.com",
    "linkedin.com",
    "twitter.com",
    "//t.co/tiktok.com",
    "pinterest.com",
    "//x.com/",
    "reddit.com",
    "sciprofiles.com",
    "www.citeulike.org",
    "linktr.ee",
    "mailto:",
    "/profile/",
    "/login/",
    "/login.php",
    "/wp-login.php/signup/",
    "/signup.php",
    "/login?",
    "/signup?/user/",
    "/member/",
    "/register?",
    "login.microsoftonline.com",
    ".css",
    ".js",
    ".json",
    ".jpg",
    ".png",
    ".jpeg",
    ".gif",
    ".webp",
    ".heif",
    ".tif",
]

List of URL substrings that disable crawling when they are found in a URL. Mostly social network sharing links.

core.crawler.Crawler.crawled_URL `instance-attribute` ¤

crawled_URL: list[str] = []

List of { URL + category } hashes already visited. When a website is crawled both from its sitemap and by following internal links recursively, the recursively-crawled pages are tagged with an external category, which core.deduplicator.Deduplicator later treats with a lower priority than any other category. Sitemap crawling may restrict content to selected HTML tags and produce better-quality data, with less noise. We therefore keep crawling everything from the sitemap, whether or not it was already crawled from internal links earlier, and let deduplication sort it out.

core.crawler.Crawler.crawled_content `instance-attribute` ¤

crawled_content: list[str] = []

List of hashes of content already known

core.crawler.Crawler.known_urls `instance-attribute` ¤

known_urls: dict[str, datetime.datetime] = (
    dict(known_urls) if known_urls else {}
)

Mapping of URL → last-crawled datetime for incremental updates. Populated at construction time or via load_known_urls.

Note

Leading and trailing `/` are stripped from URL keys, for generality.

core.crawler.Crawler.since `instance-attribute` ¤

since: datetime.datetime | None = since

Global freshness cut-off for recursive and API-based crawling. Pages in known_urls last crawled on or after this datetime are skipped.

core.crawler.Crawler.errors `instance-attribute` ¤

errors = []

URLs that couldn’t be accessed due to blocking or throttling

core.crawler.Crawler.notfound `instance-attribute` ¤

notfound = []

URLs returning error 404 - not found

Methods:¤

core.crawler.Crawler.load_known_urls ¤

load_known_urls(db: sqlite3.Connection) -> int

Populate the incremental-update map from an existing index database.

After calling this, all crawling methods will skip pages whose stored crawl timestamp indicates they are still fresh (see self.since and the <lastmod> logic in get_website_from_sitemap).

PARAMETER	DESCRIPTION
`db`	an open SQLite connection to a Virtual Secretary database (as returned by core.database.open_db or core.database.create_db). TYPE: `sqlite3.Connection`

RETURNS	DESCRIPTION
`int`	Number of URL entries loaded.

Example

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)
    cr.since = datetime.datetime(2025, 6, 1, tzinfo=datetime.timezone.utc)
    pages = cr.get_website_from_sitemap("https://domain.com", "en")
db.close()

core.crawler.Crawler.get_most_recent_page ¤

get_most_recent_page(db: sqlite3.Connection) -> datetime.datetime | None

Get the datetime of the most recent web_page indexed in the db database

core.crawler.Crawler.get_most_recent_crawl ¤

get_most_recent_crawl(db: sqlite3.Connection) -> datetime.datetime | None

Get the datetime of the most recently crawled web_page indexed in the db database

core.crawler.Crawler.get_crawling_threshold ¤

get_crawling_threshold(db: sqlite3.Connection) -> datetime.datetime | None

Get the safe date from which incremental crawling of a website should restart. We use the older of the last crawl date and the date of the most recent page, to account for possibly malformed page dates set in the future at the time of crawling.
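The rule above amounts to taking the minimum of two datetimes. A minimal sketch, assuming both values are timezone-aware; the name `safe_threshold` and its handling of missing values are illustrative assumptions, not the library's code:

```python
import datetime


def safe_threshold(last_crawl, most_recent_page):
    """Return the older of two datetimes, ignoring missing (None) values.

    Hypothetical helper: how the real method treats None is an assumption.
    """
    candidates = [d for d in (last_crawl, most_recent_page) if d is not None]
    return min(candidates) if candidates else None


utc = datetime.timezone.utc
last_crawl = datetime.datetime(2025, 3, 1, tzinfo=utc)
# A page whose date was wrongly set in the future at crawl time.
bogus_page = datetime.datetime(2031, 1, 1, tzinfo=utc)

threshold = safe_threshold(last_crawl, bogus_page)
```

Taking the older date means a bogus future date can never push the restart point past the last real crawl.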

core.crawler.Crawler.discard_link ¤

discard_link(url)

Returns True if url contains any of the substrings listed in self.no_follow.
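Since no_follow holds URL substrings, the check can be pictured as a plain substring test. This is a sketch under that assumption, with a hypothetical standalone function rather than the actual method:

```python
def contains_blocked_part(url: str, no_follow: list[str]) -> bool:
    """Return True when any blocked substring occurs anywhere in the URL."""
    return any(part in url for part in no_follow)


# A short excerpt of the default list, for illustration only.
blocked = ["facebook.com/sharer", "?share=", ".png"]

share_link = "https://www.facebook.com/sharer/sharer.php?u=https://me.com"
article = "https://me.com/blog/my-article/"
```

Note that a substring test is deliberately broad: an entry like `.js` also matches `.json` and any path that merely contains `.js`.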

core.crawler.Crawler.get_immediate_links ¤

get_immediate_links(
    links: list[str],
    domain,
    default_lang,
    langs,
    category,
    contains_str,
    internal_links: str = "any",
    mine_pdf=False,
) -> list[web_page]

Follow internal and external links contained in a webpage to a single recursion level only, including PDF files and HTML pages. This is useful to index reference documents linked from a page.

PARAMETER	DESCRIPTION
`internal_links`	defines what to do with links found inside the HTML page body/content: - `any`: follow and include all links found in page content, no matter what domain they point to, - `internal`: follow and include links found in page content only if they point to the same domain as the current page, - `external`: follow and include links found in page content only if they point to a different domain than the current page, - `ignore`: don’t follow links found in page content. TYPE: `str` DEFAULT: `'any'`

RETURNS	DESCRIPTION
`list[web_page]`	content of the link targets
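The four internal_links modes can be sketched as a filter over a list of absolute links. The function below is hypothetical and assumes that "same domain" means an exact host match, which may differ from what the crawler actually does:

```python
from urllib.parse import urlparse


def filter_links(links: list[str], current_domain: str,
                 internal_links: str = "any") -> list[str]:
    """Keep the links allowed by the given mode (illustrative only)."""
    if internal_links == "ignore":
        return []
    if internal_links == "any":
        return list(links)
    if internal_links not in ("internal", "external"):
        raise ValueError(f"unknown mode: {internal_links!r}")

    keep_internal = internal_links == "internal"
    # Assumption: a link is internal when its host equals the current domain.
    return [link for link in links
            if (urlparse(link).netloc == current_domain) == keep_internal]


links = ["https://me.com/a", "https://other.org/b.pdf", "https://me.com/c"]
```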

core.crawler.Crawler.update_link ¤

update_link(
    old_link: str, new_link: str | None, category: str, status_code: int
) -> str

Update target link with possible HTTP redirections

PARAMETER	DESCRIPTION
`old_link`	original URL followed, found in HTML TYPE: `str`
`new_link`	destination URL retrieved, possibly after HTTP redirections. TYPE: `str \| None`
`category`	tagged category of the page TYPE: `str`
`status_code`	HTTP status code returned TYPE: `int`

core.crawler.Crawler.get_website_from_crawling ¤

get_website_from_crawling(
    website: str,
    default_lang: str = "en",
    child: str = "/",
    langs: tuple = ("en", "fr"),
    markup: str | list[str] | tuple | list[tuple] = "body",
    contains_str: str | list[str] = "",
    max_recurse_level: int = -1,
    category: str = "",
    restrict_section: bool = False,
    mine_pdf: bool = False,
    _recursion_level: int = 0,
    _mainthread: bool = True,
) -> list[web_page]

Recursively crawl all pages of a website from internal links found starting from the child page. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain.

PARAMETER	DESCRIPTION
`website`	root of the website, including `https://` or `http://` without trailing slash. TYPE: `str`
`default_lang`	provided or guessed main language of the website content. Not used internally. TYPE: `str` DEFAULT: `'en'`
`child`	page of the website to use as index to start crawling for internal links. TYPE: `str` DEFAULT: `'/'`
`langs`	ISO 639-1 two-letter codes of the languages for which we attempt to fetch the translation if available, looking for the HTML `<link rel="alternate" hreflang="...">` tag. TYPE: `tuple` DEFAULT: `('en', 'fr')`
`contains_str`	a string or a list of strings that should be contained in a page URL for the page to be indexed. On a forum, you could for example restrict pages to URLs containing `"discussion"` to get only the threads and avoid user profiles or archive pages. TYPE: `str \| list[str]` DEFAULT: `''`
`markup`	see core.parser.ParsedHTML.get_page_markup TYPE: `str \| list[str] \| tuple \| list[tuple]` DEFAULT: `'body'`
`max_recurse_level`	this method will call itself recursively on each internal link found in the current page, starting from the `website/child` page. The `max_recurse_level` defines how many times it calls itself until it is stopped, if it is stopped. When set to `-1`, it stops when all the internal links have been crawled. TYPE: `int` DEFAULT: `-1`
`category`	arbitrary category or label set by user for classification. Will be automatically set to `external` for URLs followed outside of the main domain. TYPE: `str` DEFAULT: `''`
`restrict_section`	set to `True` to limit crawling to the website section defined by `://website/child/`. This is useful when indexing parts of very large websites when you are only interested in a small subset. TYPE: `bool` DEFAULT: `False`
`mine_pdf`	set to `True` to aggressively try to crawl PDF files linked on external HTML pages. This may increase RAM consumption dramatically. TYPE: `bool` DEFAULT: `False`
`_recursion_level`	DON’T USE IT. Every time this method calls itself recursively, it increments this variable internally, and recursion stops when the level is equal to the `max_recurse_level`. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`list[web_page]`	a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> with crawler.Crawler() as cr:
...     pages = cr.get_website_from_crawling("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))

core.crawler.Crawler.get_website_from_sitemap ¤

get_website_from_sitemap(
    website: str,
    default_lang: str,
    sitemap: str = "/sitemap.xml",
    langs: tuple[str] = ("en", "fr"),
    markup: str | tuple | list[str] | list[tuple] = "body",
    category: str = "",
    contains_str: str | list[str] = "",
    internal_links: str = "any",
    mine_pdf: bool = False,
    _recursion_level: int = 0,
) -> list[web_page]

Recursively crawl all pages of a website from links found in a sitemap. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain. Sitemaps of sitemaps are followed recursively.

PARAMETER	DESCRIPTION
`website`	root of the website, including `https://` or `http://` without trailing slash. TYPE: `str`
`default_lang`	provided or guessed main language of the website content. Not used internally. TYPE: `str`
`sitemap`	relative path of the XML sitemap. TYPE: `str` DEFAULT: `'/sitemap.xml'`
`langs`	ISO 639-1 two-letter codes of the languages for which we attempt to fetch the translation if available, looking for the HTML `<link rel="alternate" hreflang="...">` tag. TYPE: `tuple[str]` DEFAULT: `('en', 'fr')`
`markup`	see core.parser.ParsedHTML.get_page_markup TYPE: `str \| tuple \| list[str] \| list[tuple]` DEFAULT: `'body'`
`category`	arbitrary category or label TYPE: `str` DEFAULT: `''`
`contains_str`	limit recursive crawling from sitemap-defined pages to pages containing this string or list of strings. Will get passed as-is to core.crawler.Crawler.get_website_from_crawling. TYPE: `str \| list[str]` DEFAULT: `''`
`internal_links`	defines what to do with links found inside the HTML page body/content. - `any`: follow and include all links found in page, no matter what domain they point to, - `internal`: follow and include links found in page only if they point to the same domain as the current page, - `external`: follow and include links found in page only if they point to a different domain than the current page, - `ignore`: don’t follow internal links TYPE: `str` DEFAULT: `'any'`

RETURNS	DESCRIPTION
`list[web_page]`	a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.
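The interplay between `<lastmod>`, known_urls and since described for the Crawler class can be sketched with the standard library alone. Everything below is illustrative: `urls_to_fetch` is a hypothetical helper, naive dates are assumed to be UTC, and entries without `<lastmod>` are assumed to be re-fetched when no since cut-off is given:

```python
import datetime
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
UTC = datetime.timezone.utc


def urls_to_fetch(sitemap_xml: str, known_urls: dict, since=None) -> list[str]:
    """List the sitemap URLs that look new or stale."""
    todo = []
    for entry in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default=None, namespaces=NS)
        crawled = known_urls.get(loc.strip("/"))  # keys stored without slashes

        if crawled is None:
            todo.append(loc)  # never indexed before
        elif lastmod:
            modified = datetime.datetime.fromisoformat(lastmod.strip())
            if modified.tzinfo is None:
                modified = modified.replace(tzinfo=UTC)  # assumption: UTC
            if modified > crawled:
                todo.append(loc)  # changed since the stored crawl date
        elif since is None or crawled < since:
            todo.append(loc)  # no <lastmod>: fall back on the global cut-off
    return todo


sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://domain.com/a/</loc><lastmod>2025-02-01</lastmod></url>
  <url><loc>https://domain.com/b/</loc><lastmod>2025-01-01</lastmod></url>
  <url><loc>https://domain.com/c/</loc></url>
  <url><loc>https://domain.com/d/</loc></url>
</urlset>"""

crawl_date = datetime.datetime(2025, 1, 15, tzinfo=UTC)
known = {f"https://domain.com/{name}": crawl_date for name in "abc"}
stale = urls_to_fetch(sitemap, known,
                      since=datetime.datetime(2025, 1, 10, tzinfo=UTC))
```

Here page `a` is re-fetched because its `<lastmod>` is newer than the stored crawl date, `b` and `c` are skipped, and `d` is fetched because it is not yet known.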

Examples:

>>> from core import crawler
>>> with crawler.Crawler() as cr:
...     pages = cr.get_website_from_sitemap("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))

core.crawler.Crawler.get_unique_internal_url ¤

get_unique_internal_url(
    page: ParsedHTML, domain: str, currentURL: str
) -> list[str]

Grab the internal links found in page, except PDF, and return only the ones we don’t already know

core.crawler.Crawler.get_youtube_channels ¤

get_youtube_channels(
    channel_ids: list[str],
    api_key: str,
    default_lang: str = "en",
    category: str = "video",
    since: datetime.datetime | None = None,
) -> list[web_page]

Index YouTube channels via the Data API v3 (no OAuth required).

Retrieves the full upload list for each channel by walking the channel’s uploads playlist, then fetches the complete snippet for each video. The result mirrors what get_website_from_sitemap produces for a normal website: one core.types.web_page per video, with title, content (video description), date, lang, and category populated.

Incremental update logic:

- If since is provided, any video URL already present in self.known_urls and last crawled on or after since is skipped.
- If since is None but self.since is set, self.since is used as the cut-off.
- Videos not yet in self.known_urls are always fetched.
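The three rules above fit in a few lines. A minimal sketch, where `is_fresh` is a hypothetical helper and `default_since` stands in for self.since:

```python
import datetime


def is_fresh(url: str, known_urls: dict, since=None, default_since=None) -> bool:
    """Return True when the URL can be skipped."""
    cutoff = since if since is not None else default_since
    if cutoff is None:
        return False  # no cut-off at all: (re-)index everything
    crawled = known_urls.get(url)
    # Unknown URLs are always fetched; known ones only if crawled too early.
    return crawled is not None and crawled >= cutoff


utc = datetime.timezone.utc
known = {"https://www.youtube.com/watch?v=abc": datetime.datetime(2025, 3, 1, tzinfo=utc)}
cutoff = datetime.datetime(2025, 1, 1, tzinfo=utc)
```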

Rate limiting respects self.delay and the www.googleapis.com domain bucket, consistent with the rest of the crawler.

PARAMETER	DESCRIPTION
`channel_ids`	list of YouTube channel IDs — the `UC…` string visible in any channel URL (`youtube.com/channel/UC…`). TYPE: `list[str]`
`api_key`	Google Cloud API key with YouTube Data API v3 enabled. See https://developers.google.com/youtube/v3/getting-started. TYPE: `str`
`default_lang`	fallback language code when the video metadata does not declare one. TYPE: `str` DEFAULT: `'en'`
`category`	label applied to every indexed video, reused by search filters. TYPE: `str` DEFAULT: `'video'`
`since`	skip videos whose URL is already known and was crawled on or after this datetime. Pass `None` (default) to (re-)index everything. TYPE: `datetime.datetime \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[web_page]`	list of core.types.web_page objects, one per video.

Example

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=0.5) as cr:
    cr.load_known_urls(db)
    pages = cr.get_youtube_channels(
        channel_ids = ["UCmsSn3fujI81EKEr4NLxrcg",
                       "UCkqe4BYsllmcxo2dsF-rFQw"],
        api_key     = "YOUR_KEY",
        default_lang = "en",
        category    = "video",
        since       = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc),
    )
database.populate_db(db, pages)

core.crawler.Crawler.get_github_repositories ¤

get_github_repositories(
    repositories: list[tuple[str, str]],
    api_key: str,
    features: list[str] | None = None,
    langs: tuple[str, ...] = ("en", "fr"),
    category: str = "Github",
    since: datetime.datetime | None = None,
    mine_pdf: bool = True,
) -> list[web_page]

Index GitHub repository content via the REST API.

Supported features: "issues", "pulls", "commits", "discussions". Issue and pull-request comments are concatenated with the parent body. External links found in Markdown bodies are followed at one recursion level (same behaviour as get_website_from_crawling with max_recurse_level=1), and PDF files linked from those pages are mined when mine_pdf is True.

Incremental update:

- For issues, pulls, and commits, the GitHub API’s native ?since= query parameter is used when since is provided, so only items updated after that date are fetched, which minimises API quota usage.
- For discussions, the REST API has no since filter; client-side filtering by created_at is applied instead.
- 429 / 403 rate-limit responses are handled automatically: the crawler reads the Retry-After header and waits accordingly.

PARAMETER	DESCRIPTION
`repositories`	list of `(owner, repo)` tuples, e.g. `[("aurelienpierreeng", "ansel"), ("darktable-org", "darktable")]`. TYPE: `list[tuple[str, str]]`
`api_key`	GitHub personal access token (classic or fine-grained, read-only `repo` scope is sufficient). See https://docs.github.com/en/rest/authentication. TYPE: `str`
`features`	subset of `["issues", "pulls", "commits", "discussions"]` to index. Defaults to all four when `None`. TYPE: `list[str] \| None` DEFAULT: `None`
`langs`	language codes passed through to get_immediate_links when following external links from item bodies. TYPE: `tuple[str, ...]` DEFAULT: `('en', 'fr')`
`category`	label applied to every indexed item. TYPE: `str` DEFAULT: `'Github'`
`since`	only fetch items created or updated after this datetime. Overrides `self.since` when provided. TYPE: `datetime.datetime \| None` DEFAULT: `None`
`mine_pdf`	whether to follow and extract PDF files linked from item bodies. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list[web_page]`	list of core.types.web_page objects.

Example

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=0.72) as cr:
    cr.load_known_urls(db)
    pages = cr.get_github_repositories(
        repositories = [("aurelienpierreeng", "ansel"),
                        ("darktable-org", "rawspeed")],
        api_key      = "ghp_…",
        features     = ["issues", "pulls", "commits"],
        since        = datetime.datetime(2025, 1, 1,
                                         tzinfo=datetime.timezone.utc),
        mine_pdf     = True,
    )
database.populate_db(db, pages)

core.crawler.Crawler.get_stackexchange_posts ¤

get_stackexchange_posts(
    site: str,
    api_key: str | None = None,
    category: str = "forum",
    langs: tuple[str, ...] = ("en",),
    since: datetime.datetime | None = None,
    window_days: int = 90,
    earliest_date: datetime.datetime | None = None,
    se_filter: str = "!14e92L7CSAvro*ufn5-s.s23LqfumIAci09lv0z)*cLWPr",
) -> list[web_page]

Index a Stack Exchange community via the public API v2.3.

Retrieves all posts (questions, answers) together with their embedded comments from the posts endpoint. Each post’s body and its comments are concatenated into a single core.types.web_page and external links found in the Markdown bodies are followed at one recursion level (PDFs included).

Pagination and rate limits. Without an API key the SE API allows 300 requests/day and a maximum of 25 pages per date window. With a key, the daily quota rises to 10 000 requests. The method handles both cases: it pages through 25-page windows, each covering window_days days of posts, sliding backward in time until earliest_date is reached. When since is provided the window collapses to a single forward pass from since to now, which is the efficient path for incremental updates. The API’s backoff field is always respected.
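The backward walk over date windows can be pictured with a small generator. This is a sketch, not the method's code: `backward_windows` is a hypothetical name, and the exact window boundaries used by the crawler may differ:

```python
import datetime


def backward_windows(now: datetime.datetime, earliest: datetime.datetime,
                     window_days: int = 90):
    """Yield (start, end) pairs sliding backward from now to earliest."""
    step = datetime.timedelta(days=window_days)
    end = now
    while end > earliest:
        start = max(end - step, earliest)  # clamp the last window
        yield start, end
        end = start


utc = datetime.timezone.utc
windows = list(backward_windows(
    now=datetime.datetime(2025, 1, 1, tzinfo=utc),
    earliest=datetime.datetime(2024, 7, 1, tzinfo=utc),
    window_days=90,
))
```

Each pair would then be sent as the date range of one paged query, so that no single window needs more than the 25 pages allowed without a key.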

Incremental update. Two complementary mechanisms combine:

- since (or self.since) is passed as fromdate to the API, so the server only returns posts created or edited after that point.
- self.known_urls provides per-URL precision: for each post the last_edit_date field is compared with the stored crawl timestamp, and the post is skipped when the stored timestamp is more recent. This catches the case where a post was fetched as part of a wide window but not actually changed.
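The per-URL check can be sketched as follows. The Stack Exchange API returns dates as Unix epoch seconds; the helper name `post_is_unchanged` and the fallback on creation_date for posts that were never edited are assumptions:

```python
import datetime


def post_is_unchanged(post: dict, known_urls: dict) -> bool:
    """Return True when the stored crawl is more recent than the last edit."""
    stored = known_urls.get(post["link"].strip("/"))
    if stored is None:
        return False  # never crawled: must be fetched
    # Assumption: never-edited posts are dated by their creation.
    stamp = post.get("last_edit_date", post.get("creation_date"))
    if stamp is None:
        return False
    edited = datetime.datetime.fromtimestamp(stamp, tz=datetime.timezone.utc)
    return stored > edited


utc = datetime.timezone.utc
link = "https://photo.stackexchange.com/q/1"
known = {link: datetime.datetime(2025, 3, 1, tzinfo=utc)}

old_edit = int(datetime.datetime(2025, 2, 1, tzinfo=utc).timestamp())
new_edit = int(datetime.datetime(2025, 4, 1, tzinfo=utc).timestamp())
```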

SE filter. The default se_filter string was built at api.stackexchange.com/docs/filters and requests the following fields: body_markdown, comments, comments.body_markdown, comments.link, creation_date, last_edit_date, link, title. Pass a custom filter string if you need additional fields.

PARAMETER	DESCRIPTION
`site`	Stack Exchange site name as used in the API, e.g. `"photo"`, `"stackoverflow"`, `"unix"`, `"electronics"`. Sites with standalone domains (`stackoverflow.com`, `superuser.com`, `serverfault.com`, `askubuntu.com`, `mathoverflow.net`) are resolved automatically. TYPE: `str`
`api_key`	Optional Stack Exchange API key. Raises daily quota from 300 to 10 000 requests/day. Obtain one free at https://stackapps.com/apps/oauth/register. TYPE: `str \| None` DEFAULT: `None`
`category`	Label applied to every indexed post. TYPE: `str` DEFAULT: `'forum'`
`langs`	Language codes passed to get_immediate_links when following external links from post bodies. TYPE: `tuple[str, ...]` DEFAULT: `('en',)`
`since`	Only fetch posts whose `creation_date` or `last_edit_date` is at or after this datetime. Passed as `fromdate` to the API. Overrides `self.since` when provided. TYPE: `datetime.datetime \| None` DEFAULT: `None`
`window_days`	Size (in days) of each date window used when doing a full crawl (i.e. when since is `None`). Smaller windows mean more API requests but fewer items per page, reducing the chance of hitting the 25-page cap. TYPE: `int` DEFAULT: `90`
`earliest_date`	Stop the full-crawl backward walk when this date is reached. Defaults to 2010-01-01 (SE’s approximate launch date). TYPE: `datetime.datetime \| None` DEFAULT: `None`
`se_filter`	Opaque SE filter string defining which fields are returned. Override only when you need fields beyond the defaults. TYPE: `str` DEFAULT: `'!14e92L7CSAvro*ufn5-s.s23LqfumIAci09lv0z)*cLWPr'`

RETURNS	DESCRIPTION
`list[web_page]`	list of core.types.web_page objects.

Example

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)
    pages = cr.get_stackexchange_posts(
        site     = "photo",
        api_key  = "YOUR_SE_APP_KEY",
        category = "forum",
        since    = datetime.datetime(2025, 1, 1,
                                     tzinfo=datetime.timezone.utc),
    )
database.populate_db(db, pages)

Functions:¤

core.crawler.get_content_type ¤

get_content_type(
    url: str, delay: DelayedClass, bypass_robots_txt=False
) -> tuple[str, bool, str, dict | None, int]

Probe a URL for its HTTP headers only, to see what type of content it returns. Try to sanitize partly-invalid URLs, such as when protocols are not handled or redirected (http vs. https), or when invalid trailing characters, URL parameters or anchors are passed.

PARAMETER	DESCRIPTION
`url`	fully-formed link to try (including protocol) TYPE: `str`
`delay`	time to wait before re-trying different sanitized URLs TYPE: `DelayedClass`

RETURNS	DESCRIPTION
`type`	the type of content, like `text/html`, `application/pdf`, etc. TYPE: `str`
`status`	the state flag: `True` if the URL exists, although it might be unreachable or forbidden, `False` if the URL raises an HTTP 404 error (not found). TYPE: `bool`
`response_url`	the actual target URL, sanitized and possibly redirected. TYPE: `str`
`states`	HTTP return code TYPE: `int`

core.crawler.relative_to_absolute ¤

relative_to_absolute(URL: str, domain: str, current_page: str) -> str

Convert a relative URL to absolute by prepending the domain.

PARAMETER	DESCRIPTION
`URL`	the URL string to normalize to absolute, TYPE: `str`
`domain`	the domain name of the website, without protocol (`http://`) nor trailing slash. It will be prepended to the relative links starting with `/`. TYPE: `str`
`current_page`	the URL of the page from which we analyze links. It will be prepended to the relative links starting with `./`. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The normalized, absolute URL on this website.

Examples:

>>> relative_to_absolute("folder/page", "me.com")
"://me.com/folder/page"
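For comparison, the standard library resolves relative links against the current page with `urllib.parse.urljoin`. This is a different technique from the function documented here: it needs a full URL with protocol and returns one, and its results are not claimed to match those of relative_to_absolute:

```python
from urllib.parse import urljoin

current_page = "https://me.com/blog/post"

# Links starting with "/" are resolved against the domain root.
from_root = urljoin(current_page, "/folder/page")

# Links starting with "./" are resolved against the current page's folder.
from_page = urljoin(current_page, "./image")

# Absolute links are returned unchanged.
untouched = urljoin(current_page, "https://other.org/doc.pdf")
```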

core.crawler.radical_url ¤

radical_url(URL: str) -> str

Trim a URL to its page (radical) part, removing anchors, if any (internal links).

Examples:

>>> radical_url("http://me.com/page#section-1")
"http://me.com/page"
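The same trimming can be reproduced with the standard library, which may help when checking results outside of the crawler. The sketch below uses `urllib.parse.urldefrag` and is not necessarily how radical_url is implemented:

```python
from urllib.parse import urldefrag

# urldefrag splits a URL into its page part and its anchor (fragment).
page, anchor = urldefrag("http://me.com/page#section-1")

# URLs without anchor come back unchanged, with an empty fragment.
same_page, no_anchor = urldefrag("http://me.com/page")
```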

core.crawler.get_page_content ¤

get_page_content(
    url: str | None,
    delay: DelayedClass,
    content: str | None = None,
    custom_header={},
) -> tuple[ParsedHTML | None, str | None, int]

Request an (x)HTML page through the network with HTTP GET and feed its response to a ParsedHTML handler. This needs a functional network connection.

The DOM is pre-filtered as follows to keep only natural language and avoid duplicate strings:

- media tags are removed (<iframe>, <embed>, <img>, <svg>, <audio>, <video>, etc.),
- code and machine language tags are removed (<script>, <style>, <math>),
- menus and sidebars are removed (<nav>, <aside>),
- forms, fields and buttons are removed (<select>, <input>, <button>, <textarea>, etc.)

The HTML is un-minified to help end-of-sentences detections in cases where sentences don’t end with punctuation (e.g. in titles).
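To give an idea of what such a pre-filter does, here is a sketch built on the standard `html.parser` module. It is not the ParsedHTML implementation: the `TextOnly` class and the `SKIPPED` set are hypothetical, and only cover tags that enclose text (void tags like <img> or <input> carry no text and need no special handling):

```python
from html.parser import HTMLParser

# Tags whose whole content is dropped (illustrative subset).
SKIPPED = {"script", "style", "math", "nav", "aside", "iframe", "svg",
           "audio", "video", "select", "button", "textarea"}


class TextOnly(HTMLParser):
    """Collect the text found outside of the skipped tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped tags we are currently inside
        self.chunks = []  # text fragments kept, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


html = ('<body><nav><a href="/">Home</a></nav><h1>Title</h1>'
        '<script>var x = 1;</script>'
        '<p>Hello <img src="a.png"> world</p><button>Click</button></body>')

parser = TextOnly()
parser.feed(html)
parser.close()
```

Only the heading and the paragraph text survive; the menu entry, the script body and the button label are dropped.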

PARAMETER	DESCRIPTION
`url`	a valid URL that can be fetched with an HTTP GET request. TYPE: `str \| None`
`content`	a string buffer used as HTML source. If this argument is passed, we don’t fetch `url` from network and directly use this input. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple[ParsedHTML \| None, str \| None, int]`	a tuple with: 1. a core.parser.ParsedHTML object initialized with the page DOM for further text mining, or None if the HTML response was empty or the URL could not be reached. The list of URLs found in the page before removing meaningless markup is stored as a list of strings in the object.links member. object.h1 and object.h2 contain the sets of level-1 and level-2 headers found in the page before removing any markup. object.date contains the best guess for the date. 2. the final URL of the retrieved page, which might differ from the input URL if HTTP redirections happened.

core.crawler.parse_page ¤

parse_page(
    page: ParsedHTML,
    url: str,
    lang: str | None,
    markup: str | tuple | list[str] | list[tuple] | None,
    date: str | None = None,
    category: str | None = None,
) -> list[web_page]

Get the requested markup from the requested page URL.

This chains in a single call:

PARAMETER	DESCRIPTION
`page`	a core.parser.ParsedHTML handler with pre-filtered DOM, TYPE: `ParsedHTML`
`url`	the valid URL accessible by HTTP GET request of the page TYPE: `str`
`lang`	the provided or guessed language of the page, TYPE: `str \| None`
`markup`	the markup to search for. See core.parser.ParsedHTML.get_page_markup for details. TYPE: `str \| tuple \| list[str] \| list[tuple] \| None`
`date`	if the page was retrieved from a sitemap, usually the date is available in ISO format (`yyyy-mm-ddTHH:MM:SS`) and can be passed directly here. Otherwise, several attempts will be made to extract it from the page content (see core.parser.ParsedHTML.get_date). TYPE: `str \| None` DEFAULT: `None`
`category`	arbitrary category or label defined by user TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`list[web_page]`	The content of the page, including metadata, in a core.types.web_page singleton.

core.crawler.hash_with_category ¤

hash_with_category(data: str, category: str) -> str

Produce a unique identifier mixing data and category.
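The actual hashing scheme is not documented here. As an illustration of the idea only, a digest mixing both strings could be built with `hashlib`; the function name, the choice of SHA-256 and the separator byte are all assumptions:

```python
import hashlib


def hash_url_and_category(data: str, category: str) -> str:
    """Return a hexadecimal digest that depends on both inputs."""
    digest = hashlib.sha256()
    digest.update(data.encode("utf-8"))
    # The separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    digest.update(b"\x00")
    digest.update(category.encode("utf-8"))
    return digest.hexdigest()


from_sitemap = hash_url_and_category("https://me.com/page", "blog")
from_links = hash_url_and_category("https://me.com/page", "external")
```

Because the category is part of the identifier, the same URL crawled once from the sitemap and once from internal links yields two distinct entries, which is what lets the deduplicator choose between them later.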
