Crawler¤

crawler ¤

Module containing utilities to crawl websites for the text content of their HTML, XML and PDF pages. PDFs can be read from their embedded text content if any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml file or by following internal links recursively from an index page. Each page is aggregated into a list of core.crawler.web_page objects, meant to be used as input to train natural language AI models and to index and rank pages for search engines.

© 2023-2024 - Aurélien Pierre

Classes¤

web_page ¤

Bases: TypedDict

Typed dictionary representing a web page and its metadata. It can also be used for any text document having a URL/URI.

Attributes¤
title instance-attribute ¤
title: str

Title of the page

url instance-attribute ¤
url: str

Where to find the page on the network. Can be a local or distant URI, with or without protocol, or even a unique identifier.

date instance-attribute ¤
date: str

Date of the last modification of the page, to assess relevance of the content.

content instance-attribute ¤
content: str

The actual content of the page.

excerpt instance-attribute ¤
excerpt: str

Shortened version of the content for search results previews. Typically provided as description meta tag by websites.

h1 instance-attribute ¤
h1: set[str]

Title of the post if any. There should be only one h1 per page, matching the title, but some templates wrongly use h1 for section titles.

h2 instance-attribute ¤
h2: set[str]

Section titles if any

lang instance-attribute ¤
lang: str

2-letter code of the page language. Not used internally; it matters only if you need it in implementations.

category instance-attribute ¤
category: str

Arbitrary category or label set by user
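
Being a TypedDict, a web_page is built like a regular dict. A minimal sketch, with made-up values for illustration:

from core import crawler

page = crawler.web_page(
    title="Hello world",
    url="https://example.com/hello-world",     # hypothetical URL
    date="2024-01-01T00:00:00",
    content="Full text of the page...",
    excerpt="Short description for previews.",
    h1={"Hello world"},
    h2={"First section", "Second section"},
    lang="en",
    category="blog",
)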

Deduplicator ¤

Deduplicator(threshold: float = 0.9, distance: int = 500)

Instantiate a deduplicator object.

Duplicate factorization takes a list of core.crawler.web_page objects and happens when calling core.crawler.Deduplicator.process.

Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.

You can edit (append to or replace) the list of URLs to ignore, core.crawler.Deduplicator.urls_to_ignore, before running the actual process.

Optionally, near-duplicates are detected too by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.

PARAMETER DESCRIPTION
threshold

the minimum Levenshtein distance ratio between 2 page contents for those pages to be considered near-duplicates and factorized. If set to 1.0, near-duplicate detection is bypassed, which results in a huge speed-up.

TYPE: float DEFAULT: 0.9

distance

the near-duplicate search is performed on the nearest elements after the core.crawler.web_page list has been ordered alphabetically by URL, for performance, assuming near-duplicates will most likely be found on the same domain and at a similar path. The distance parameter defines how many elements ahead we look.

TYPE: int DEFAULT: 500
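
For instance, a minimal sketch of a fast deduplication pass that skips near-duplicate detection, assuming pages is a list of core.crawler.web_page objects already collected (the extra ignored URL substring is made up):

from core import crawler

# pages: list of core.crawler.web_page objects previously collected by a Crawler
dedup = crawler.Deduplicator(threshold=1.0)   # 1.0 bypasses the near-duplicate detection
dedup.urls_to_ignore += ["/print/"]           # hypothetical extra URL substring to drop
pages = dedup.process(pages)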

Attributes¤
urls_to_ignore class-attribute instance-attribute ¤
urls_to_ignore: list[str] = [
    "/tag/",
    "/tags/",
    "/category/",
    "/categories/",
    "/author/",
    "/authors/",
    "/archive/",
    "/archives/",
    "/profil/",
    "/profiles/",
    "/user/",
    "/users/",
    "/login/",
    "/signup/",
    "/member/",
    "/members/",
    "/cart/",
    "/shop/",
]

URL substrings used to discard matching web pages: mostly WordPress archive pages, user profiles and login pages.

Functions¤
prepare_urls ¤
prepare_urls(posts: list[web_page]) -> dict[str:list[web_page]]

Find the canonical URL of each post and aggregate a list of matching pages in a canonical_url: [candidates] dict.

Precompute the datetime object and a minified ASCII content variant for later processing.

get_unique_urls ¤
get_unique_urls(posts: dict[str:list[web_page]]) -> dict[str:list[web_page]]

Pick the most recent, or otherwise the longest, candidate for each canonical URL.

Return

canonical_url: web_page dictionary

prepare_content ¤
prepare_content(posts: dict[str:list[web_page]]) -> dict[str:list[web_page]]

Find the canonical content for each post and aggregate a list of matching pages.

RETURNS DESCRIPTION
dict[str:list[web_page]]

a canonical_content: list[web_page] dictionary.

get_unique_content ¤
get_unique_content(posts: dict[str:list[web_page]]) -> dict[str:list[web_page]]

Pick the most recent candidate for each canonical content.

Return

canonical_content: web_page dictionary

get_close_content ¤
get_close_content(
    posts: dict[str : list[web_page]],
    threshold: float = 0.9,
    distance: float = 500,
) -> dict[str : list[web_page]]

Find near-duplicates by computing the Levenshtein distance between page contents.

PARAMETER DESCRIPTION
posts

dictionary mapping an unused key to a list of crawler.web_page

TYPE: dict[str:list[web_page]]

threshold

the minimum Levenshtein distance ratio for 2 contents to be assumed duplicates

TYPE: float DEFAULT: 0.9

distance

for efficiency, the list of web_page is first sorted alphabetically by URL, assuming near-duplicates will most likely be found on the same domain and at a similar path; only the next distance elements are compared to each page

TYPE: float DEFAULT: 500

process ¤
process(posts: list[web_page])

Launch the actual duplicate finder

Crawler ¤

Crawler()

Crawl a website from its sitemap or by following internal links recursively from an index page.

Attributes¤
no_follow class-attribute instance-attribute ¤
no_follow: list[str] = [
    "api.whatsapp.com/share",
    "api.whatsapp.com/send",
    "pinterest.fr/pin/create",
    "pinterest.com/pin/create",
    "facebook.com/sharer",
    "twitter.com/intent/tweet",
    "reddit.com/submit",
    "t.me/share",
    "linkedin.com/share",
    "bufferapp.com/add",
    "getpocket.com/edit",
    "tumblr.com/share",
    "mailto:",
    "/profile/",
    "/login/",
    "/signup/",
    "/login?",
    "/signup?/user/",
    "/member/",
    ".css",
    ".js",
    ".json",
]

List of URL substrings that disable crawling when found in a URL. Mostly social network sharing links.

crawled_URL instance-attribute ¤
crawled_URL: list[str] = []

List of URLs already visited

Functions¤
discard_link ¤
discard_link(url)

Returns True if the url is found in the self.no_follow list

get_immediate_links ¤
get_immediate_links(
    page,
    domain,
    currentURL,
    default_lang,
    langs,
    category,
    contains_str,
    external_only: bool = False,
) -> list[web_page]

Follow internal and external links contained in a web page to only one recursion level, including PDF files and HTML pages. This is useful to index reference docs linked from a page.
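
A sketch of a possible call, assuming page is the bs4.BeautifulSoup handler of the page being processed and all values are illustrative:

from core import crawler

cr = crawler.Crawler()
# page: bs4.BeautifulSoup handler of the page being processed (see get_page_content below)
references = cr.get_immediate_links(page,
                                    domain="example.com",                    # hypothetical domain
                                    currentURL="https://example.com/docs",   # page the links come from
                                    default_lang="en",
                                    langs=("en", "fr"),
                                    category="references",
                                    contains_str="",
                                    external_only=True)   # only follow links pointing to other domains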

get_website_from_crawling ¤
get_website_from_crawling(
    website: str,
    default_lang: str = "en",
    child: str = "/",
    langs: tuple = ("en", "fr"),
    markup: str = "body",
    contains_str: str | list[str] = "",
    max_recurse_level: int = -1,
    category: str = None,
    restrict_section: bool = False,
    _recursion_level: int = 0,
) -> list[web_page]

Recursively crawl all pages of a website from internal links found starting from the child page. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain.

PARAMETER DESCRIPTION
website

root of the website, including https:// or http:// without trailing slash.

TYPE: str

default_lang

provided or guessed main language of the website content. Not used internally.

TYPE: str DEFAULT: 'en'

child

page of the website to use as index to start crawling for internal links.

TYPE: str DEFAULT: '/'

langs

ISO-something 2-letter codes of the languages for which we attempt to fetch a translation if available, looking for the HTML <link rel="alternate" hreflang="..."> tag.

TYPE: tuple DEFAULT: ('en', 'fr')

contains_str

a string or a list of strings that should be contained in a page URL for the page to be indexed. On a forum, you could for example restrict pages to URLs containing "discussion" to get only the threads and avoid user profiles or archive pages.

TYPE: str | list[str] DEFAULT: ''

markup

the markup to search for. See core.crawler.get_page_markup for details.

TYPE: str DEFAULT: 'body'

max_recurse_level

this method calls itself recursively on each internal link found in the current page, starting from the website/child page. The max_recurse_level parameter defines how many times it may call itself before being stopped, if it is stopped at all. When set to -1, it stops only when all internal links have been crawled.

TYPE: int DEFAULT: -1

category

arbitrary category or label set by user.

TYPE: str DEFAULT: None

restrict_section

set to True to limit crawling to the website section defined by ://website/child/*. This is useful when indexing parts of very large websites when you are only interested in a small subset.

TYPE: bool DEFAULT: False

_recursion_level

DON’T USE IT. Every time this method calls itself recursively, it increments this variable internally, and recursion stops when the level equals max_recurse_level.

TYPE: int DEFAULT: 0

RETURNS DESCRIPTION
list[web_page]

a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> cr = crawler.Crawler()
>>> pages = cr.get_website_from_crawling("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))
get_website_from_sitemap ¤
get_website_from_sitemap(
    website: str,
    default_lang: str,
    sitemap: str = "/sitemap.xml",
    langs: tuple[str] = ("en", "fr"),
    markup: str | tuple[str] = "body",
    category: str = None,
    contains_str: str | list[str] = "",
    external_only: bool = False,
) -> list[web_page]

Recursively crawl all pages of a website from links found in a sitemap. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain. Sitemaps of sitemaps are followed recursively.

PARAMETER DESCRIPTION
website

root of the website, including https:// or http:// without trailing slash.

TYPE: str

default_lang

provided or guessed main language of the website content. Not used internally.

TYPE: str

sitemap

relative path of the XML sitemap.

TYPE: str DEFAULT: '/sitemap.xml'

langs

ISO-something 2-letter codes of the languages for which we attempt to fetch a translation if available, looking for the HTML <link rel="alternate" hreflang="..."> tag.

TYPE: tuple[str] DEFAULT: ('en', 'fr')

markup

the markup to search for. See core.crawler.get_page_markup for details.

TYPE: str | tuple[str] DEFAULT: 'body'

category

arbitrary category or label

TYPE: str DEFAULT: None

contains_str

limit recursive crawling from sitemap-defined pages to pages containing this string or list of strings. Will be passed as-is to get_website_from_crawling.

TYPE: str | list[str] DEFAULT: ''

external_only

follow links from internal pages (outside of sitemaps) only if they point to an external domain.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
list[web_page]

a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> cr = crawler.Crawler()
>>> pages = cr.get_website_from_sitemap("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))

Functions¤

get_content_type ¤

get_content_type(url: str) -> tuple[str, bool]

Probe a URL for HTTP headers only to see what type of content it returns.

RETURNS DESCRIPTION
type

the type of content, like text/html, application/pdf, etc.

TYPE: str

status

the state flag:

  • True if the URL can be reached and fetched,
  • False if there is some kind of error or empty response.

TYPE: bool
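
A minimal usage sketch (the URL is hypothetical):

from core import crawler

mime_type, ok = crawler.get_content_type("https://example.com/whitepaper.pdf")
if ok and "pdf" in mime_type:
    print("This URL serves a PDF document")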

relative_to_absolute ¤

relative_to_absolute(URL: str, domain: str, current_page: str) -> str

Convert a relative URL to absolute by prepending the domain.

PARAMETER DESCRIPTION
URL

the URL string to normalize to absolute,

TYPE: str

domain

the domain name of the website, without protocol (http://) nor trailing slash. It will be prepended to relative links starting with /.

TYPE: str

current_page

the URL of the page from which we analyze links. It will be prepended to relative links starting with ./.

TYPE: str

RETURNS DESCRIPTION
str

The normalized, absolute URL on this website.

Examples:

>>> relative_to_absolute("folder/page", "me.com")
"://me.com/folder/page"

radical_url ¤

radical_url(URL: str) -> str

Trim a URL to its page (radical) part, removing anchors if any (internal links).

Examples:

>>> radical_url("http://me.com/page#section-1")
"http://me.com/page"

ocr_pdf ¤

ocr_pdf(
    document: bytes,
    output_images: bool = False,
    path: str = None,
    repair: int = 1,
    upscale: int = 3,
    contrast: float = 1.5,
    sharpening: float = 1.2,
    threshold: float = 0.4,
    tesseract_lang: str = "eng+fra+equ",
    tesseract_bin: str = None,
) -> str

Extract text from PDF using OCR through Tesseract. Both the binding Python package PyTesseract and the Tesseract binaries need to be installed.

To run on a server where you don’t have sudo access to install packages, you will need to download the AppImage package and pass its path to the tesseract_bin argument.

Tesseract uses machine learning to identify words and needs the relevant language models to be installed on the system as well. Linux packaged versions of Tesseract generally seem to ship French, English and equation (math) models by default. Other languages need to be installed manually; see the Tesseract docs for available packages. Use pytesseract.get_languages(config='') to list the language packages installed locally.

The OCR is preceded by an image-processing step aiming at text reconstruction: sharpening, increasing contrast and iteratively reconstructing holes in letters using an inpainting method in wavelet space. This is computationally expensive and may not be suitable to run on a server.

PARAMETER DESCRIPTION
document

the PDF document to open.

TYPE: bytes

output_images

if set to True, each page of the document is saved as PNG in the path directory before and after contrast enhancement. This is useful to tune the image contrast and sharpness enhancements prior to OCR.

TYPE: bool DEFAULT: False

repair

number of enhancement iterations (sharpening, contrast and inpainting) to perform. More iterations take longer, and too many iterations might oversimplify letter geometry (as if letters were fluid and dripping, removing corners and pointy ends) in a way that actually degrades OCR.

TYPE: int DEFAULT: 1

upscale

upscaling factor to apply before enhancement. This can help recover ink leaks but takes more memory and time to compute.

TYPE: int DEFAULT: 3

contrast

1.0 is the neutral value. Moves RGB values farther away from the threshold.

TYPE: float DEFAULT: 1.5

sharpening

1.0 is the neutral value. Increases sharpness. Values too high can produce ringing (replicated ghost edges).

TYPE: float DEFAULT: 1.2

threshold

the reference value (fulcrum) for contrast enhancement. Good values are typically in the range 0.20-0.50.

TYPE: float DEFAULT: 0.4

tesseract_lang

the Tesseract command argument -l defining the language models to use for OCR. Languages are referenced by their 3-letter ISO-something code. See the Tesseract doc for syntax and meaning. You can mix several languages by joining them with +.

TYPE: str DEFAULT: 'eng+fra+equ'

tesseract_bin

the path to the Tesseract executable if it is not in the global CLI path. This is passed as-is to pytesseract.pytesseract.tesseract_cmd of the PyTesseract binding library.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
str

All the retrieved text from all the PDF pages as a single string. No pagination is done.

RAISES DESCRIPTION
RuntimeError

when a language model is requested while Tesseract has no such package installed.
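
A sketch of a typical call on a local scan; the file path and tuning values are only illustrative:

from core import crawler

# Read the scanned PDF as raw bytes
with open("scan.pdf", "rb") as f:   # hypothetical file
    document = f.read()

text = crawler.ocr_pdf(document,
                       upscale=3,             # heavier upscaling can help damaged scans
                       contrast=1.5,
                       threshold=0.4,
                       tesseract_lang="eng+fra")
print(text[:500])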

get_pdf_content ¤

get_pdf_content(
    url: str,
    lang: str,
    file_path: str = None,
    process_outline: bool = True,
    category: str = None,
    ocr: int = 1,
    **kwargs
) -> list[web_page]

Retrieve a PDF document through the network with HTTP GET or from the local filesystem, and parse its text content, using OCR if needed. This needs a functional network connection if file_path is not provided.

PARAMETER DESCRIPTION
url

the online address of the document, or of the download page if the doc is not directly accessible from a GET request (for some old-school websites where downloads are initiated from a POST request to some PHP form handler, or publications behind a paywall).

TYPE: str

lang

the ISO code of the language.

TYPE: str

file_path

local path to the PDF file if the URL can’t be directly fetched by GET request. The content will be extracted from the local file but the original/remote URL will still be referenced as the source.

TYPE: str DEFAULT: None

process_outline

set to True to split the document according to its outline (table of contents), so each section is in effect a document in itself. PDF pages are processed in full, so sections are at least one page long and there will be some overlap.

TYPE: bool DEFAULT: True

category

arbitrary category or label set by user

TYPE: str DEFAULT: None

ocr
  • 0 disables any attempt at using OCR,
  • 1 enables OCR through Tesseract if no text was found in the PDF document
  • 2 forces OCR through Tesseract even when text was found in the PDF document.

TYPE: int DEFAULT: 1

PARAMETER DESCRIPTION
**kwargs

passed through directly to core.crawler.ocr_pdf. See that function’s documentation for more info.

RETURNS DESCRIPTION
list[web_page]

a list of core.crawler.web_page objects holding the text content and the PDF metadata
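
A sketch of fetching a remote PDF and forcing OCR; the URL is made up:

from core import crawler

pages = crawler.get_pdf_content("https://example.com/report.pdf",   # hypothetical URL
                                lang="en",
                                process_outline=True,   # one web_page per outline section
                                ocr=2)                   # force OCR even if embedded text exists
for page in pages:
    print(page["title"], len(page["content"]))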

get_page_content ¤

get_page_content(
    url: str, content: str = None
) -> [BeautifulSoup | None, list[str]]

Request an (x)HTML page through the network with HTTP GET and feed its response to a BeautifulSoup handler. This needs a functional network connection.

The DOM is pre-filtered as follows to keep only natural language and avoid duplicate strings:

  • media tags are removed (<iframe>, <embed>, <img>, <svg>, <audio>, <video>, etc.),
  • code and machine language tags are removed (<script>, <style>, <code>, <pre>, <math>),
  • menus and sidebars are removed (<nav>, <aside>),
  • forms, fields and buttons are removed (<select>, <input>, <button>, <textarea>, etc.),
  • quote tags are removed (<quote>, <blockquote>).

The HTML is un-minified to help end-of-sentence detection in cases where sentences don’t end with punctuation (e.g. in titles).

PARAMETER DESCRIPTION
url

a valid URL that can be fetched with an HTTP GET request.

TYPE: str

content

a string buffer used as HTML source. If this argument is passed, we don’t fetch url from the network and use this input directly.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
[BeautifulSoup | None, list[str]]

a bs4.BeautifulSoup object initialized with the page DOM for further text mining, or None if the HTML response was empty or the URL could not be reached. The list of URLs found in the page before removing meaningless markup is stored as a list of strings in the object.links member. object.h1 and object.h2 contain the sets of h1 and h2 headers found in the page before removing any markup.
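
A minimal sketch, assuming the returned handler carries the links, h1 and h2 members described above (the URL is hypothetical):

from core import crawler

page = crawler.get_page_content("https://example.com/article")   # hypothetical URL
if page is not None:
    print(page.h1, page.h2)    # header sets collected before markup removal
    print(len(page.links))     # URLs found in the page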

get_page_markup ¤

get_page_markup(
    page: BeautifulSoup, markup: str | tuple | list[str] | list[tuple] | None
) -> str

Extract the text content of an HTML page DOM by targeting only specific tags.

PARAMETER DESCRIPTION
page

a bs4.BeautifulSoup handler with pre-filtered DOM,

TYPE: BeautifulSoup

markup

any kind of tags supported by bs4.BeautifulSoup.find_all:

  • (str): the single tag to select. For example, "body" will select <body>...</body>.
  • (tuple): the tag and properties to select. For example, ("div", { "class": "right" }) will select <div class="right">...</div>.
  • all combinations of the above can be chained in lists.
  • None: don’t parse the page internal content. Links, h1 and h2 headers will still be parsed.

TYPE: str | tuple | list[str] | list[tuple] | None

RETURNS DESCRIPTION
str

The text content of all instances of all tags in markup as a single string, if any, else an empty string.

Examples:

>>> get_page_markup(page, "article")
>>> get_page_markup(page, ["h1", "h2", "h3", "article"])
>>> get_page_markup(page, [("div", {"id": "content"}), "details", ("div", {"class": "comment-reply"})])

get_excerpt ¤

get_excerpt(html: BeautifulSoup) -> str | None

Find HTML tags possibly containing the shortened version of the page content.

Looks for HTML tags:

  • <meta name="description" content="...">
  • <meta property="og:description" content="...">

PARAMETER DESCRIPTION
html

a bs4.BeautifulSoup handler with pre-filtered DOM,

RETURNS DESCRIPTION
str | None

The content of the meta tag if any.

get_date ¤

get_date(html: BeautifulSoup)

Find HTML tags possibly containing the page date.

Looks for HTML tags:

  • <meta property="article:modified_time" content="...">
  • <time datetime="...">
  • <relative-time datetime="...">
  • <div class="dateline">...</div>

PARAMETER DESCRIPTION
html

a bs4.BeautifulSoup handler with pre-filtered DOM,

RETURNS DESCRIPTION

The content of the matching tag, if any.
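
Both get_excerpt and get_date take the same pre-filtered handler; a sketch assuming page was obtained through core.crawler.get_page_content:

from core import crawler

excerpt = crawler.get_excerpt(page)   # None if no description meta tag was found
date = crawler.get_date(page)         # falls back on <time> and dateline markup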

parse_page ¤

parse_page(
    page: BeautifulSoup,
    url: str,
    lang: str,
    markup: str | list[str],
    date: str = None,
    category: str = None,
) -> list[web_page]

Get the requested markup from the requested page URL.

This chains the markup extraction, excerpt extraction and date detection steps in a single call.

PARAMETER DESCRIPTION
page

a bs4.BeautifulSoup handler with pre-filtered DOM,

TYPE: BeautifulSoup

url

the valid URL of the page, accessible by an HTTP GET request

TYPE: str

lang

the provided or guessed language of the page,

TYPE: str

markup

the markup to search for. See core.crawler.get_page_markup for details.

TYPE: str | list[str]

date

if the page was retrieved from a sitemap, usually the date is available in ISO format (yyyy-mm-ddTHH:MM:SS) and can be passed directly here. Otherwise, several attempts will be made to extract it from the page content (see core.crawler.get_date).

TYPE: str DEFAULT: None

category

arbitrary category or label defined by user

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
list[web_page]

The content of the page, including metadata, as a single-element list of core.crawler.web_page.
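
A sketch of a manual call, assuming page is a bs4.BeautifulSoup handler (e.g. obtained through core.crawler.get_page_content) and the URL and markup are illustrative:

from core import crawler

results = crawler.parse_page(page,
                             url="https://example.com/blog/post",   # hypothetical URL
                             lang="en",
                             markup=("div", {"class": "post-content"}),
                             category="blog")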

Examples¤

The best way to use the crawler is by adding a script in src/user_scripts, since it is meant to be used as an offline training step (and not in filters).

To crawl a website where some content has a sitemap and the rest does not:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import crawler
from core import utils

# Init a crawler object
cr = crawler.Crawler()

# Add URLs to ignore when crawling, for performance,
# for example if those URLs are crawled in another data set.
cr.no_follow += [
  "aurelienpierre.com",
  "persons-profile-",
  "/view-album/",
]

# Scrape one site using the sitemap
output = cr.get_website_from_sitemap("https://ansel.photos",
                                      "en",
                                      markup="article")

# Scrape another site recursively
output += cr.get_website_from_crawling("https://community.ansel.photos",
                                       "en",
                                       child="/discussions-home",
                                       markup=[("div", {"class": "bx-content-description"}),
                                               ("div", {"class": "cmt-body"})],
                                       contains_str="view-discussion")

# ... can keep appending as many websites as you want to `output` list

dedup = crawler.Deduplicator()
dedup.urls_to_ignore += [
  # Can add other urls to ignore here too.
]
output = dedup.process(output)

utils.save_data(output, "ansel")
# You will find the dataset saved as a pickle Python object in a .tar.gz in ./data/

Note

In the above example, we reuse the cr object between the “sitemap” and the “recurse” calls. This means the second call inherits the Crawler.crawled_URL list from the first one, which contains all the URLs already processed. All URLs from this list will be ignored in subsequent calls. This can be good to avoid duplicates, but can be bad for some use cases. In those cases, instantiate a new Crawler object instead of reusing the previous one.

The core.utils.save_data method will directly save the list of core.crawler.web_page objects as a pickle file compressed in a .tar.gz archive, into the VirtualSecretary/data folder. To re-open, decompress and decode it later, use core.utils.open_data:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import crawler
from core import utils

pages = utils.open_data("ansel")

for page in pages:
  # do stuff...
  print(page["title"], page["url"])