core.crawler¤
Module containing utilities to crawl websites for the text content of their HTML, XML and PDF pages. PDF files can be read from their embedded text, if any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml file or by following internal links recursively from an index page. Each page is aggregated into a list of core.types.web_page objects, meant to be used as input to train natural language AI models and to index and rank pages for search engines.
© 2023-2024 - Aurélien Pierre
Classes¤
core.crawler.Crawler
¤
Crawler(
delay: float = 1.0,
no_follow: list[str] = [],
known_urls: dict[str, datetime.datetime] | None = None,
since: datetime.datetime | None = None,
)
Bases: DelayedClass
Crawl a website from its sitemap or by following internal links recursively from an index page.
This class must therefore be used within a with statement, which takes care of allocating
and releasing resources in the background.
| PARAMETER | DESCRIPTION |
|---|---|
delay
|
time in seconds to wait between two HTTP requests.
The right delay prevents the crawler from being throttled by anti-DoS rules while keeping it as fast as possible.
Set to
TYPE:
|
no_follow
|
list of URL parts to ignore completely: matching URLs are neither indexed nor crawled for internal links. |
known_urls
|
mapping of |
since
|
global freshness cut-off for recursive crawling. Any URL present in known_urls
and last crawled on or after this datetime will be skipped entirely.
Has no effect when known_urls is empty or when a URL is not yet known.
For sitemap crawling, the sitemap’s own |
Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)  # populate incremental-update map
    cr.since = datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc)
    # Only re-fetches pages whose <lastmod> is newer than the stored crawl date.
    pages = cr.get_website_from_sitemap("https://domain.com", "en")
    # Only re-fetches pages not yet in the index, or crawled before since.
    pages += cr.get_website_from_crawling("https://forum.domain.com", "en")
Attributes¤
core.crawler.Crawler.no_follow
class-attribute
instance-attribute
¤
no_follow: list[str] = [
"api.whatsapp.com/share",
"api.whatsapp.com/send",
"pinterest.fr/pin/create",
"pinterest.com/pin/create",
"facebook.com/sharer",
"twitter.com/intent/tweet",
"twitter.com/share",
"x.com/share",
"reddit.com/submit",
"t.me/share",
"linkedin.com/share",
"vk.com/share.php",
"bufferapp.com/add",
"getpocket.com/edit",
"tumblr.com/share",
"www.addtoany.com/add_to",
"share.flipboard.com/bookmarklet/",
"?share=",
"?replytocom=",
"translate.google.com/translate",
"flickr.com",
"//flic.kr/",
"instagram.com",
"threads.com",
"facebook.com",
"linkedin.com",
"twitter.com",
"//t.co/tiktok.com",
"pinterest.com",
"//x.com/",
"reddit.com",
"sciprofiles.com",
"www.citeulike.org",
"linktr.ee",
"mailto:",
"/profile/",
"/login/",
"/login.php",
"/wp-login.php/signup/",
"/signup.php",
"/login?",
"/signup?/user/",
"/member/",
"/register?",
"login.microsoftonline.com",
".css",
".js",
".json",
".jpg",
".png",
".jpeg",
".gif",
".webp",
".heif",
".tif",
]
List of URL substrings: any URL containing one of them is not crawled. Mostly social network sharing links.
core.crawler.Crawler.crawled_URL
instance-attribute
¤
List of { URL + category } hashes already visited.
When a website is crawled both from its sitemap and by following internal links recursively,
the recursively-crawled pages are tagged with an external category, which
core.deduplicator.Deduplicator later treats with a lower priority than any other category.
Sitemap crawling may restrict content to selected HTML tags and so produce better-quality data
with less noise. We therefore keep crawling everything from the sitemap, whether or not it was
already crawled from internal links earlier, and let deduplication sort it out.
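The bookkeeping described above could be sketched as follows. This is only an illustration: the hash function (SHA-1), the helper names `url_key` and `visit_once`, and the use of a set rather than a list are assumptions, since the page does not say how the hash is computed.

```python
import hashlib

def url_key(url: str, category: str) -> str:
    """Hypothetical helper: hash a URL together with its category."""
    return hashlib.sha1(f"{url}\n{category}".encode("utf-8")).hexdigest()

seen: set[str] = set()

def visit_once(url: str, category: str) -> bool:
    """Return True the first time a (URL, category) pair is seen."""
    key = url_key(url, category)
    if key in seen:
        return False
    seen.add(key)
    return True

first = visit_once("https://domain.com/page", "external")
again = visit_once("https://domain.com/page", "external")
# Same URL under another category: it is crawled again, as described above.
from_sitemap = visit_once("https://domain.com/page", "docs")
```

Because the category is part of the key, a page reached through internal links does not prevent the same page from being fetched again from the sitemap.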
core.crawler.Crawler.crawled_content
instance-attribute
¤
List of hashes of content already known
core.crawler.Crawler.known_urls
instance-attribute
¤
Mapping of URL → last-crawled datetime for incremental updates. Populated at construction time or via load_known_urls.
Note
Leading and trailing / are stripped from URL keys, for generality.
core.crawler.Crawler.since
instance-attribute
¤
Global freshness cut-off for recursive and API-based crawling. Pages in known_urls last crawled on or after this datetime are skipped.
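A minimal sketch of how this cut-off could combine with known_urls is given below. The function names `normalize` and `is_fresh` are hypothetical, and the exact comparison made by the crawler is an assumption based on the descriptions on this page.

```python
import datetime

def normalize(url: str) -> str:
    """Strip leading and trailing slashes, as described for known_urls keys."""
    return url.strip("/")

def is_fresh(url, known_urls, since):
    """True when the URL is known and was last crawled on or after `since`."""
    if since is None or not known_urls:
        return False  # no cut-off or empty map: nothing is skipped
    last_crawled = known_urls.get(normalize(url))
    return last_crawled is not None and last_crawled >= since

utc = datetime.timezone.utc
known = {"https://domain.com/a": datetime.datetime(2025, 3, 1, tzinfo=utc)}
cutoff = datetime.datetime(2025, 1, 1, tzinfo=utc)

skip_a = is_fresh("https://domain.com/a/", known, cutoff)  # known and recent
skip_b = is_fresh("https://domain.com/b", known, cutoff)   # not yet known
skip_old = is_fresh("https://domain.com/a", known,
                    datetime.datetime(2025, 6, 1, tzinfo=utc))
```

Only `skip_a` is true here: unknown URLs and URLs crawled before the cut-off are fetched again.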
core.crawler.Crawler.errors
instance-attribute
¤
URLs that couldn’t be accessed due to blocking or throttling
core.crawler.Crawler.notfound
instance-attribute
¤
URLs returning error 404 - not found
Functions¤
core.crawler.Crawler.load_known_urls
¤
load_known_urls(db: sqlite3.Connection) -> int
Populate the incremental-update map from an existing index database.
After calling this, all crawling methods will skip pages whose stored crawl
timestamp indicates they are still fresh (see self.since and the
<lastmod> logic in get_website_from_sitemap).
| PARAMETER | DESCRIPTION |
|---|---|
db
|
an open SQLite connection to a Virtual Secretary database (as returned by core.database.open_db or core.database.create_db).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
int
|
Number of URL entries loaded. |
core.crawler.Crawler.get_most_recent_page
¤
get_most_recent_page(db: sqlite3.Connection) -> datetime.datetime | None
Get the datetime of the most recent web_page indexed in the db database
core.crawler.Crawler.get_most_recent_crawl
¤
get_most_recent_crawl(db: sqlite3.Connection) -> datetime.datetime | None
Get the datetime of the most recently crawled web_page indexed in the db database
core.crawler.Crawler.get_crawling_threshold
¤
get_crawling_threshold(db: sqlite3.Connection) -> datetime.datetime | None
Get the safe date from which incremental crawling of a website should restart. We use the older of the last crawl date and the date of the most recent page, to guard against badly-formed page dates that were set in the future at the time of crawling.
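The rule can be illustrated with a short sketch. The function below is a stand-in working on plain datetimes rather than on a database, and its handling of missing values (None) is an assumption.

```python
import datetime

def crawling_threshold(most_recent_page, most_recent_crawl):
    """Return the older of the two dates, ignoring missing ones."""
    candidates = [d for d in (most_recent_page, most_recent_crawl) if d is not None]
    return min(candidates) if candidates else None

utc = datetime.timezone.utc
last_crawl = datetime.datetime(2025, 1, 10, tzinfo=utc)
bogus_page = datetime.datetime(2031, 5, 1, tzinfo=utc)  # page dated in the future

threshold = crawling_threshold(bogus_page, last_crawl)
```

With a page wrongly dated in 2031, the threshold falls back to the last crawl date, so pages published after that crawl are not missed.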
core.crawler.Crawler.discard_link
¤
Returns True if url contains any of the substrings from the self.no_follow list
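A minimal sketch of such a substring test, assuming plain `in` matching as suggested by the description of no_follow; the standalone function and the shortened list are for illustration only.

```python
def discard_link(url: str, no_follow: list[str]) -> bool:
    """True when any no_follow substring occurs anywhere in the URL."""
    return any(part in url for part in no_follow)

no_follow = ["twitter.com/intent/tweet", "mailto:", ".css", "?share="]

share = discard_link("https://twitter.com/intent/tweet?text=hi", no_follow)
style = discard_link("https://domain.com/theme/main.css", no_follow)
post = discard_link("https://domain.com/blog/post", no_follow)
```

Substring matching is deliberately broad: an entry matches wherever it appears in the URL, not only in the host name or the file extension.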
core.crawler.Crawler.get_immediate_links
¤
get_immediate_links(
links: list[str],
domain,
default_lang,
langs,
category,
contains_str,
internal_links: str = "any",
mine_pdf=False,
) -> list[web_page]
Follow internal and external links contained in a webpage to one recursion level only, including PDF files and HTML pages. This is useful to index reference documents linked from a page.
| PARAMETER | DESCRIPTION |
|---|---|
internal_links
|
defines what to do with links found inside the HTML page body/content:
-
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
list of the contents of the link targets |
core.crawler.Crawler.update_link
¤
Update target link with possible HTTP redirections
| PARAMETER | DESCRIPTION |
|---|---|
old_link
|
original URL followed, found in HTML
TYPE:
|
new_link
|
destination URL retrieved, possibly after HTTP redirections.
TYPE:
|
category
|
tagged category of the page
TYPE:
|
status_code
|
HTTP status code returned
TYPE:
|
core.crawler.Crawler.get_website_from_crawling
¤
get_website_from_crawling(
website: str,
default_lang: str = "en",
child: str = "/",
langs: tuple = ("en", "fr"),
markup: str = "body",
contains_str: str | list[str] = "",
max_recurse_level: int = -1,
category: str = "",
restrict_section: bool = False,
mine_pdf: bool = False,
_recursion_level: int = 0,
_mainthread: bool = True,
) -> list[web_page]
Recursively crawl all pages of a website from internal links found starting from the child page. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain.
| PARAMETER | DESCRIPTION |
|---|---|
website
|
root of the website, including
TYPE:
|
default_lang
|
provided or guessed main language of the website content. Not used internally.
TYPE:
|
child
|
page of the website to use as index to start crawling for internal links.
TYPE:
|
langs
|
two-letter ISO 639-1 codes of the languages for which we attempt to fetch the translation
if available, looking for the HTML
TYPE:
|
contains_str
|
a string or a list of strings that should be contained in a page URL for the page to be indexed.
On a forum, you could for example restrict pages to URLs containing |
markup
|
TYPE:
|
max_recurse_level
|
this method will call itself recursively on each internal link found in the current page,
starting from the
TYPE:
|
category
|
arbitrary category or label set by user for classification. Will be automatically set to
TYPE:
|
restrict_section
|
set to
TYPE:
|
mine_pdf
|
set to
TYPE:
|
_recursion_level
|
DON’T USE IT. Every time this method calls itself recursively,
it increments this variable internally, and recursion stops when the level is equal to the
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored. |
Examples:
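The depth-limited recursion controlled by max_recurse_level and _recursion_level can be sketched on an in-memory link graph, which stands in for real HTTP requests. The function `crawl`, the `_seen` set and the reading of a negative limit as "no limit" are assumptions, not the library's implementation.

```python
def crawl(graph, page, max_recurse_level=-1, _recursion_level=0, _seen=None):
    """Depth-first walk of a {page: [links]} mapping, visiting each page once."""
    seen = _seen if _seen is not None else set()
    if page in seen:
        return []
    seen.add(page)
    found = [page]
    # Assumption: a negative max_recurse_level means "no depth limit".
    if max_recurse_level >= 0 and _recursion_level >= max_recurse_level:
        return found
    for link in graph.get(page, []):
        found += crawl(graph, link, max_recurse_level, _recursion_level + 1, seen)
    return found

site = {"/": ["/a", "/b"], "/a": ["/a/deep", "/"], "/b": [], "/a/deep": []}

everything = crawl(site, "/")
one_level = crawl(site, "/", max_recurse_level=1)
```

Here `everything` reaches "/a/deep", while `one_level` stops at the pages linked directly from the index.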
core.crawler.Crawler.get_website_from_sitemap
¤
get_website_from_sitemap(
website: str,
default_lang: str,
sitemap: str = "/sitemap.xml",
langs: tuple[str] = ("en", "fr"),
markup: str | tuple | list[str] | list[tuple] = "body",
category: str = "",
contains_str: str | list[str] = "",
internal_links: str = "any",
mine_pdf: bool = False,
_recursion_level: int = 0,
) -> list[web_page]
Recursively crawl all pages of a website from links found in a sitemap.
This applies to all HTML pages hosted on the domain of website and to PDF documents either from
the current domain or from external domains but referenced on HTML pages of the current domain.
Sitemaps of sitemaps are followed recursively.
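As an illustration of how nested sitemaps can be followed, the sketch below parses sitemap XML with the standard library. The generator `sitemap_entries` and the `fetch` callable are hypothetical, and the sample documents are made up; the crawler's own parsing may differ.

```python
import datetime
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_entries(xml_text, fetch):
    """Yield (url, lastmod) pairs, following <sitemapindex> files recursively.

    `fetch` is a hypothetical callable mapping a sitemap URL to its XML text.
    """
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            yield from sitemap_entries(fetch(loc.text.strip()), fetch)
    else:
        for node in root.findall("sm:url", NS):
            loc = node.findtext("sm:loc", namespaces=NS).strip()
            lastmod = node.findtext("sm:lastmod", namespaces=NS)
            # <lastmod> is optional; when present it is parsed as ISO 8601.
            date = datetime.datetime.fromisoformat(lastmod.strip()) if lastmod else None
            yield loc, date

INDEX = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://domain.com/posts.xml</loc></sitemap>"
    "</sitemapindex>"
)
POSTS = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://domain.com/a</loc>"
    "<lastmod>2025-02-01T00:00:00+00:00</lastmod></url>"
    "<url><loc>https://domain.com/b</loc></url>"
    "</urlset>"
)
files = {"https://domain.com/posts.xml": POSTS}
entries = list(sitemap_entries(INDEX, files.__getitem__))
```

Each entry carries the page URL and its optional <lastmod> date, which is the value an incremental update would compare with the stored crawl date.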
| PARAMETER | DESCRIPTION |
|---|---|
website
|
root of the website, including
TYPE:
|
default_lang
|
provided or guessed main language of the website content. Not used internally.
TYPE:
|
sitemap
|
relative path of the XML sitemap.
TYPE:
|
langs
|
two-letter ISO 639-1 codes of the languages for which we attempt to fetch the translation if available,
looking for the HTML |
markup
|
|
category
|
arbitrary category or label
TYPE:
|
contains_str
|
limit recursive crawling from sitemap-defined pages to pages containing this string or list of strings. Will get passed as-is to core.crawler.Crawler.get_website_from_crawling. |
internal_links
|
defines what to do with links found inside the HTML page body/content.
-
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored. |
Examples:
core.crawler.Crawler.get_unique_internal_url
¤
get_unique_internal_url(
page: ParsedHTML, domain: str, currentURL: str
) -> list[str]
Grab the internal links found in page, except PDF, and return only the ones we don’t already know
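The filtering can be sketched on a plain list of href values. The function below is hypothetical: it takes raw strings instead of a ParsedHTML object, and the way internal links and PDF files are recognised is an assumption.

```python
from urllib.parse import urljoin, urlparse

def unique_internal_urls(hrefs, domain, current_url, known):
    """Keep absolute, same-domain, non-PDF URLs that are not already known."""
    result = []
    for href in hrefs:
        url = urljoin(current_url, href).split("#")[0]  # absolute, no anchor
        parsed = urlparse(url)
        if parsed.netloc != domain:
            continue  # external link
        if parsed.path.lower().endswith(".pdf"):
            continue  # PDF files are excluded here
        if url in known or url in result:
            continue  # already visited or already listed
        result.append(url)
    return result

links = unique_internal_urls(
    ["/about", "guide.pdf", "https://other.org/x", "/about#team", "/seen"],
    domain="domain.com",
    current_url="https://domain.com/docs/",
    known={"https://domain.com/seen"},
)
```

Of the five links, only "https://domain.com/about" survives: the others are a PDF, an external page, a duplicate once its anchor is removed, and an already-known URL.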
core.crawler.Crawler.get_youtube_channels
¤
get_youtube_channels(
channel_ids: list[str],
api_key: str,
default_lang: str = "en",
category: str = "video",
since: datetime.datetime | None = None,
) -> list[web_page]
Index YouTube channels via the Data API v3 (no OAuth required).
Retrieves the full upload list for each channel by walking the channel’s
uploads playlist, then fetches the complete snippet for each video. The
result mirrors what get_website_from_sitemap
produces for a normal
website: one core.types.web_page per video, with title,
content (video description), date, lang, and category
populated.
Incremental update logic:
- If since is provided, any video URL already present in self.known_urls and last crawled on or after since is skipped.
- If since is None but self.since is set, self.since is used as the cut-off.
- Videos not yet in self.known_urls are always fetched.
Rate limiting respects self.delay and the www.googleapis.com domain
bucket, consistent with the rest of the crawler.
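A per-domain delay bucket could look like the sketch below. `DomainThrottle` is a hypothetical class written for this illustration and is not the library's DelayedClass; the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Hypothetical per-domain delay bucket (not the library's DelayedClass)."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # domain -> time of the last request

    def wait(self, url):
        """Pause until `delay` seconds have passed since the last hit on this domain."""
        domain = urlparse(url).netloc
        now = self.clock()
        last = self.last_request.get(domain)
        waited = 0.0
        if last is not None and now - last < self.delay:
            waited = self.delay - (now - last)
            self.sleep(waited)
        self.last_request[domain] = now + waited
        return waited

# Demonstration with a fake clock, so that no real time is spent sleeping.
fake_now = [100.0]
slept = []
throttle = DomainThrottle(1.0, clock=lambda: fake_now[0], sleep=slept.append)
w1 = throttle.wait("https://www.googleapis.com/youtube/v3/videos")
fake_now[0] = 100.25
w2 = throttle.wait("https://www.googleapis.com/youtube/v3/playlistItems")
w3 = throttle.wait("https://domain.com/")  # another domain, another bucket
```

The second request to www.googleapis.com waits for the remainder of the delay, while the request to a different domain goes through immediately.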
| PARAMETER | DESCRIPTION |
|---|---|
channel_ids
|
list of YouTube channel IDs — the |
api_key
|
Google Cloud API key with YouTube Data API v3 enabled. See https://developers.google.com/youtube/v3/getting-started.
TYPE:
|
default_lang
|
fallback language code when the video metadata does not declare one.
TYPE:
|
category
|
label applied to every indexed video, reused by search filters.
TYPE:
|
since
|
skip videos whose URL is already known and was crawled on or after
this datetime. Pass |
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
list of core.types.web_page objects, one per video. |
Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=0.5) as cr:
    cr.load_known_urls(db)
    pages = cr.get_youtube_channels(
        channel_ids=["UCmsSn3fujI81EKEr4NLxrcg",
                     "UCkqe4BYsllmcxo2dsF-rFQw"],
        api_key="YOUR_KEY",
        default_lang="en",
        category="video",
        since=datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc),
    )
    database.populate_db(db, pages)
core.crawler.Crawler.get_github_repositories
¤
get_github_repositories(
repositories: list[tuple[str, str]],
api_key: str,
features: list[str] | None = None,
langs: tuple[str, ...] = ("en", "fr"),
category: str = "Github",
since: datetime.datetime | None = None,
mine_pdf: bool = True,
) -> list[web_page]
Index GitHub repository content via the REST API.
Supported features: "issues", "pulls", "commits",
"discussions". Issue and pull-request comments are concatenated with
the parent body. External links found in Markdown bodies are followed at
one recursion level (same behaviour as
get_website_from_crawling with
max_recurse_level=1), and PDF files linked from those pages are mined
when mine_pdf is True.
Incremental update:
- For issues, pulls, and commits, the GitHub API’s native ?since= query parameter is used when since is provided, so only items updated after that date are fetched, minimising API quota usage.
- For discussions, the REST API has no since filter; client-side filtering by created_at is applied instead.
- 429 / 403 rate-limit responses are handled automatically: the crawler reads the Retry-After header and waits accordingly.
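The Retry-After handling mentioned in the last item can be sketched as a pure function. `retry_delay`, its `default` fallback and the plain-dict headers are assumptions made for illustration; the crawler's actual logic is not shown on this page.

```python
import datetime
from email.utils import parsedate_to_datetime

def retry_delay(status, headers, now=None, default=60.0):
    """Seconds to wait before retrying, or 0.0 when the response is not rate-limited."""
    if status not in (403, 429):
        return 0.0
    value = headers.get("Retry-After")
    if value is None:
        return default  # assumption: fall back to a fixed pause
    if value.strip().isdigit():
        return float(value)  # Retry-After given as a number of seconds
    # Otherwise Retry-After is an HTTP date: wait until that moment.
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return max(0.0, (parsedate_to_datetime(value) - now).total_seconds())

utc = datetime.timezone.utc
in_seconds = retry_delay(429, {"Retry-After": "30"})
as_date = retry_delay(
    403,
    {"Retry-After": "Wed, 01 Jan 2025 00:01:00 GMT"},
    now=datetime.datetime(2025, 1, 1, tzinfo=utc),
)
not_limited = retry_delay(200, {})
```

Both header forms allowed by HTTP are covered: a delay in seconds and an absolute date.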
| PARAMETER | DESCRIPTION |
|---|---|
repositories
|
list of |
api_key
|
GitHub personal access token (classic or fine-grained, read-only
TYPE:
|
features
|
subset of |
langs
|
language codes passed through to get_immediate_links when following external links from item bodies. |
category
|
label applied to every indexed item.
TYPE:
|
since
|
only fetch items created or updated after this datetime.
Overrides |
mine_pdf
|
whether to follow and extract PDF files linked from item bodies.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
list of core.types.web_page objects. |
Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=0.72) as cr:
    cr.load_known_urls(db)
    pages = cr.get_github_repositories(
        repositories=[("aurelienpierreeng", "ansel"),
                      ("darktable-org", "rawspeed")],
        api_key="ghp_…",
        features=["issues", "pulls", "commits"],
        since=datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc),
        mine_pdf=True,
    )
    database.populate_db(db, pages)
core.crawler.Crawler.get_stackexchange_posts
¤
get_stackexchange_posts(
site: str,
api_key: str | None = None,
category: str = "forum",
langs: tuple[str, ...] = ("en",),
since: datetime.datetime | None = None,
window_days: int = 90,
earliest_date: datetime.datetime | None = None,
se_filter: str = "!14e92L7CSAvro*ufn5-s.s23LqfumIAci09lv0z)*cLWPr",
) -> list[web_page]
Index a Stack Exchange community via the public API v2.3.
Retrieves all posts (questions, answers) together with their embedded
comments from the posts endpoint. Each post’s body and its comments
are concatenated into a single core.types.web_page and external
links found in the Markdown bodies are followed at one recursion level
(PDFs included).
Pagination and rate limits. Without an API key the SE API allows
300 requests/day and a maximum of 25 pages per date window. With a key,
the daily quota rises to 10 000 requests. The method handles both
cases: it pages through 25-page windows, each covering window_days
days of posts, sliding backward in time until earliest_date is
reached. When since is provided the window collapses to a single
forward pass from since to now, which is the efficient path for
incremental updates. The API’s backoff field is always respected.
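The windowing scheme can be sketched independently of the API. The function `date_windows` is hypothetical and only computes the date ranges; paging within each window and the backoff handling are left out.

```python
import datetime

def date_windows(now, earliest_date, window_days=90, since=None):
    """Return (fromdate, todate) pairs to query, most recent first."""
    if since is not None:
        return [(since, now)]  # incremental update: a single forward pass
    windows = []
    end = now
    step = datetime.timedelta(days=window_days)
    while end > earliest_date:
        start = max(end - step, earliest_date)  # never go past earliest_date
        windows.append((start, end))
        end = start
    return windows

utc = datetime.timezone.utc
now = datetime.datetime(2025, 1, 1, tzinfo=utc)
earliest = datetime.datetime(2024, 7, 1, tzinfo=utc)

full = date_windows(now, earliest, window_days=90)
incremental = date_windows(now, earliest, since=datetime.datetime(2024, 12, 1, tzinfo=utc))
```

In this example the full crawl needs three contiguous windows to walk back from 2025-01-01 to 2024-07-01, whereas the incremental update needs only one.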
Incremental update. Two complementary mechanisms combine:
- since (or self.since) is passed as fromdate to the API, so the server only returns posts created or edited after that point.
- self.known_urls provides per-URL precision: for each post the last_edit_date field is compared with the stored crawl timestamp, and the post is skipped when the stored timestamp is more recent. This catches the case where a post was fetched as part of a wide window but not actually changed.
SE filter. The default se_filter string was built at
api.stackexchange.com/docs/filters and requests the following fields:
body_markdown, comments, comments.body_markdown,
comments.link, creation_date, last_edit_date, link,
title. Pass a custom filter string if you need additional fields.
| PARAMETER | DESCRIPTION |
|---|---|
site
|
Stack Exchange site name as used in the API, e.g.
TYPE:
|
api_key
|
Optional Stack Exchange API key. Raises daily quota from 300 to 10 000 requests/day. Obtain one free at https://stackapps.com/apps/oauth/register.
TYPE:
|
category
|
Label applied to every indexed post.
TYPE:
|
langs
|
Language codes passed to get_immediate_links when following external links from post bodies. |
since
|
Only fetch posts whose |
window_days
|
Size (in days) of each date window used when doing a full crawl
(i.e. when since is
TYPE:
|
earliest_date
|
Stop the full-crawl backward walk when this date is reached. Defaults to 2010-01-01 (SE’s approximate launch date). |
se_filter
|
Opaque SE filter string defining which fields are returned. Override only when you need fields beyond the defaults.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
list of core.types.web_page objects. |
Example
import datetime
from core import crawler, database

db = database.open_db("my-engine.db")
with crawler.Crawler(delay=1.0) as cr:
    cr.load_known_urls(db)
    pages = cr.get_stackexchange_posts(
        site="photo",
        api_key="YOUR_SE_APP_KEY",
        category="forum",
        since=datetime.datetime(2025, 1, 1, tzinfo=datetime.timezone.utc),
    )
    database.populate_db(db, pages)
Functions¤
core.crawler.get_content_type
¤
get_content_type(
url: str, delay: DelayedClass, bypass_robots_txt=False
) -> tuple[str, bool, str, dict | None, int]
Probe a URL for its HTTP headers only, to see what type of content it returns.
Try to sanitize partly-invalid URLs, for example when a protocol is not handled or redirected
(http vs https), or when invalid trailing characters, URL parameters or anchors are passed.
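One way to picture this sanitization is to build a list of progressively simplified variants of a URL and probe them in turn. The sketch below only builds that list; the function `candidate_urls`, the set of trailing characters it strips and the order of the variants are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

def candidate_urls(url):
    """List progressively sanitized variants of `url`, most faithful first."""
    # Assumption: these trailing characters are stray punctuation, not URL parts.
    parts = urlsplit(url.strip().rstrip(".,;)"))
    other_scheme = "http" if parts.scheme == "https" else "https"
    variants = [
        parts,                                  # trailing junk removed
        parts._replace(fragment=""),            # without #anchor
        parts._replace(query="", fragment=""),  # without ?parameters
        parts._replace(scheme=other_scheme, query="", fragment=""),
    ]
    unique = []
    for variant in variants:
        candidate = urlunsplit(variant)
        if candidate not in unique:
            unique.append(candidate)
    return unique

messy = candidate_urls("http://me.com/page?x=1#top).")
clean = candidate_urls("https://me.com/page")
```

A caller would try each candidate with a header-only request until one of them answers, waiting between attempts.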
| PARAMETER | DESCRIPTION |
|---|---|
url
|
fully-formed link to try (including protocol)
TYPE:
|
delay
|
time to wait before re-trying different sanitized URLs
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
type
|
the type of content, like
TYPE:
|
status
|
the state flag:
TYPE:
|
response_url
|
the actual target URL, sanitized and possibly redirected.
TYPE:
|
states
|
HTTP return code
TYPE:
|
core.crawler.relative_to_absolute
¤
Convert a relative URL to absolute by prepending the domain.
| PARAMETER | DESCRIPTION |
|---|---|
URL
|
the URL string to normalize to absolute,
TYPE:
|
domain
|
the domain name of the website, without protocol (
TYPE:
|
current_page
|
the URL of the page from which we analyze links.
It will be appended to the relative links starting by
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
The normalized, absolute URL on this website. |
Examples:
>>> relative_to_absolute("folder/page", "me.com")
"://me.com/folder/page"
core.crawler.radical_url
¤
core.crawler.get_page_content
¤
get_page_content(
url: str | None,
delay: DelayedClass,
content: str | None = None,
custom_header={},
) -> tuple[ParsedHTML | None, str | None, int]
Request an (x)HTML page through the network with HTTP GET and feed its response to a ParsedHTML handler. This needs a functional network connection.
The DOM is pre-filtered as follows to keep only natural language and avoid duplicate strings:
- media tags are removed (<iframe>, <embed>, <img>, <svg>, <audio>, <video>, etc.),
- code and machine language tags are removed (<script>, <style>, <math>),
- menus and sidebars are removed (<nav>, <aside>),
- forms, fields and buttons are removed (<select>, <input>, <button>, <textarea>, etc.)
The HTML is un-minified to help end-of-sentence detection in cases where sentences don’t end with punctuation (e.g. in titles).
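The effect of this pre-filtering can be illustrated with the standard library's html.parser, which is used here only as a stand-in: the class `TextOnly`, the function `visible_text` and the shortened tag list are assumptions, and the real ParsedHTML handler may work differently.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is dropped (a subset of the categories listed above).
# Void tags such as <img>, <embed> and <input> carry no text, so they need no entry.
SKIPPED = {"iframe", "svg", "audio", "video", "script", "style", "math",
           "nav", "aside", "select", "button", "textarea"}

class TextOnly(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of skipped tags currently open
        self.chunks = []  # one entry per text node kept

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks

sample = (
    "<body><nav>Menu</nav><h1>Title</h1><script>var a = 1;</script>"
    "<p>Hello <img src='x.png'>world</p><aside>Ads</aside></body>"
)
text = visible_text(sample)
```

Keeping each text node as a separate chunk also shows why un-minifying helps: the title stays apart from the paragraph that follows it even though it has no final punctuation.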
| PARAMETER | DESCRIPTION |
|---|---|
url
|
a valid URL that can be fetched with an HTTP GET request.
TYPE:
|
content
|
a string buffer used as HTML source. If this argument is passed, we don’t fetch
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[ParsedHTML | None, str | None, int]
|
a tuple with:
1. core.parser.ParsedHTML object initialized with the page DOM for further text mining. |
core.crawler.parse_page
¤
parse_page(
page: ParsedHTML,
url: str,
lang: str | None,
markup: str | tuple | list[str] | list[tuple] | None,
date: str | None = None,
category: str | None = None,
) -> list[web_page]
Get the requested markup from the requested page URL.
This chains in a single call:
- core.parser.ParsedHTML.get_page_markup
- core.parser.ParsedHTML.get_date
- core.parser.ParsedHTML.get_excerpt
| PARAMETER | DESCRIPTION |
|---|---|
page
|
a core.parser.ParsedHTML handler with pre-filtered DOM,
TYPE:
|
url
|
the valid URL accessible by HTTP GET request of the page
TYPE:
|
lang
|
the provided or guessed language of the page,
TYPE:
|
markup
|
the markup to search for. See core.parser.ParsedHTML.get_page_markup for details. |
date
|
if the page was retrieved from a sitemap, usually the date is available in ISO format (
TYPE:
|
category
|
arbitrary category or label defined by user
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[web_page]
|
The content of the page, including metadata, as a single-element list of core.types.web_page. |