Crawler¤
crawler ¤
Module containing utilities to crawl websites for the text content of their HTML, XML and PDF pages. PDFs can be read from their text content, if any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml
file or by following internal links recursively from an index page. Each page is aggregated into a list of core.crawler.web_page objects, meant to be used as input to train natural language AI models and to index and rank pages for search engines.
© 2023-2024 - Aurélien Pierre
Classes¤
web_page ¤
Bases: TypedDict
Typed dictionary representing a web page and its metadata. It can also be used for any text document having a URL/URI.
Attributes¤
url (instance-attribute) ¤
Where to find the page on the network. Can be a local or distant URI, with or without protocol, or even a unique identifier.
date (instance-attribute) ¤
Date of the last modification of the page, to assess relevance of the content.
excerpt (instance-attribute) ¤
Shortened version of the content for search results previews. Typically provided by websites as the `description` meta tag.
h1 (instance-attribute) ¤
Title of the post, if any. There should be only one h1 per page, matching the `<title>` tag, but some templates wrongly use h1 for section titles.
lang (instance-attribute) ¤
2-letter code of the page language. Not used internally; it matters only if you need it in implementations.
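For illustration, a minimal sketch of a web_page entry built by hand, using only the attributes documented above (the actual TypedDict likely defines more fields, such as the full text content; the values shown are hypothetical):

```python
from core import crawler

# Hypothetical web_page entry; only the documented attributes are shown.
page: crawler.web_page = {
    "url": "https://example.com/blog/my-post",
    # Assumption: the date may be stored as an ISO string or a datetime,
    # depending on the crawling step that produced the entry.
    "date": "2024-01-15T10:00:00Z",
    "excerpt": "A short summary used in search results previews.",
    "h1": "My post title",
    "lang": "en",
}
```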
Deduplicator ¤
Instantiate a deduplicator object.
Duplicate factorization takes a list of core.crawler.web_page objects and happens when calling core.crawler.Deduplicator.process.
Duplicate detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.
You can edit (append to or replace) the list of URLs to ignore, core.crawler.Deduplicator.urls_to_ignore, before running the actual process.
Optionally, near-duplicates are detected too by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.
PARAMETER | DESCRIPTION |
---|---|
`threshold` | the minimum Levenshtein distance ratio between 2 page contents for those pages to be considered near-duplicates and be factorized. If set to `1.0`, the near-duplicates detection is bypassed, which results in a huge speed-up. |
`distance` | the near-duplicates search is performed on the nearest elements after the core.crawler.web_page list has been ordered alphabetically by URL, for performance, assuming near-duplicates will most likely be found on the same domain and at a resembling path. The `distance` parameter defines how many elements ahead we will look into. |
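A minimal usage sketch, following the workflow described above. The `pages` list is assumed to come from one of the Crawler methods documented below, and the parameter values are illustrative:

```python
from core import crawler

# pages: list[crawler.web_page], e.g. returned by Crawler.get_website_from_sitemap()
dedup = crawler.Deduplicator(threshold=0.9, distance=500)

# Optionally drop additional archive/profile pages before processing
dedup.urls_to_ignore += ["/print/", "/feed/"]

# Factorize exact and near-duplicates into a deduplicated list
pages = dedup.process(pages)
```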
Attributes¤
urls_to_ignore
class-attribute
instance-attribute
¤
urls_to_ignore: list[str] = [
"/tag/",
"/tags/",
"/category/",
"/categories/",
"/author/",
"/authors/",
"/archive/",
"/archives/",
"/profil/",
"/profiles/",
"/user/",
"/users/",
"/login/",
"/signup/",
"/member/",
"/members/",
"/cart/",
"/shop/",
]
URL substrings to look for in page URLs; matching web pages are removed. These are mostly WordPress archive pages, user profiles and login pages.
Functions¤
prepare_urls ¤
Find the canonical URL of each post and aggregate the matching pages into a `canonical_url: [candidates]` dict.
Precompute the datetime object and a minified ASCII content variant for later processing.
get_unique_urls ¤
Pick the most recent, or otherwise the longer, candidate for each canonical URL.
Return a `canonical_url: web_page` dictionary.
prepare_content ¤
get_unique_content ¤
Pick the most recent candidate for each canonical content.
Return a `canonical content: web_page` dictionary.
get_close_content ¤
get_close_content(
posts: dict[str : list[web_page]],
threshold: float = 0.9,
distance: float = 500,
) -> dict[str : list[web_page]]
Find near-duplicates by computing the Levenshtein distance between page contents.
PARAMETER | DESCRIPTION |
---|---|
`posts` | dictionary mapping an unused key to a list of core.crawler.web_page objects. |
`threshold` | the minimum Levenshtein distance ratio for 2 contents to be assumed duplicates. |
`distance` | for efficiency, the list of web_page is first sorted alphabetically by URL, assuming duplicates will most likely be found on the same domain and at a resembling path; `distance` defines how many elements ahead to compare. |
Crawler ¤
Crawl a website from its sitemap or by following internal links recursively from an index page.
Attributes¤
no_follow (class-attribute, instance-attribute) ¤
no_follow: list[str] = [
"api.whatsapp.com/share",
"api.whatsapp.com/send",
"pinterest.fr/pin/create",
"pinterest.com/pin/create",
"facebook.com/sharer",
"twitter.com/intent/tweet",
"reddit.com/submit",
"t.me/share",
"linkedin.com/share",
"bufferapp.com/add",
"getpocket.com/edit",
"tumblr.com/share",
"mailto:",
"/profile/",
"/login/",
"/signup/",
"/login?",
"/signup?/user/",
"/member/",
".css",
".js",
".json",
]
List of URL substrings that disable crawling when found in URLs. Mostly social-network sharing links.
Functions¤
get_immediate_links ¤
get_immediate_links(
page,
domain,
currentURL,
default_lang,
langs,
category,
contains_str,
external_only: bool = False,
) -> list[web_page]
Follow internal and external links contained in a web page to only one level of recursion, including PDF files and HTML pages. This is useful to index reference documents linked from a page.
get_website_from_crawling ¤
get_website_from_crawling(
website: str,
default_lang: str = "en",
child: str = "/",
langs: tuple = ("en", "fr"),
markup: str = "body",
contains_str: str | list[str] = "",
max_recurse_level: int = -1,
category: str = None,
restrict_section: bool = False,
_recursion_level: int = 0,
) -> list[web_page]
Recursively crawl all pages of a website from internal links found starting from the `child` page. This applies to all HTML pages hosted on the domain of `website`, and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain.
PARAMETER | DESCRIPTION |
---|---|
`website` | root of the website, including the protocol (e.g. `https://`). |
`default_lang` | provided or guessed main language of the website content. Not used internally. |
`child` | page of the website to use as index to start crawling for internal links. |
`langs` | 2-letter ISO codes of the languages for which we attempt to fetch the translation, if available, from the HTML language metadata. |
`contains_str` | a string or a list of strings that should be contained in a page URL for the page to be indexed. On a forum, for example, you could restrict indexing to URLs containing a given substring. |
`markup` | the markup to search for. See core.crawler.get_page_markup for details. |
`max_recurse_level` | this method calls itself recursively on each internal link found in the current page, starting from the `child` page; recursion stops when `_recursion_level` reaches this value. |
`category` | arbitrary category or label set by user. |
`restrict_section` | set to `True` to restrict crawling to the section of the website defined by `child`. |
`_recursion_level` | DON'T USE IT. Every time this method calls itself recursively, it increments this variable internally; recursion stops when the level equals `max_recurse_level`. |
RETURNS | DESCRIPTION |
---|---|
`list[web_page]` | a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored. |
Examples:
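A sketch adapted from the full example at the bottom of this page (domain, markup and filter string are illustrative):

```python
from core import crawler

cr = crawler.Crawler()

# Crawl a forum-like site recursively, starting from its discussion index,
# keeping only discussion pages and extracting text from two kinds of <div>.
pages = cr.get_website_from_crawling(
    "https://community.ansel.photos",
    "en",
    child="/discussions-home",
    markup=[("div", {"class": "bx-content-description"}),
            ("div", {"class": "cmt-body"})],
    contains_str="view-discussion",
)
```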
get_website_from_sitemap ¤
get_website_from_sitemap(
website: str,
default_lang: str,
sitemap: str = "/sitemap.xml",
langs: tuple[str] = ("en", "fr"),
markup: str | tuple[str] = "body",
category: str = None,
contains_str: str | list[str] = "",
external_only: bool = False,
) -> list[web_page]
Recursively crawl all pages of a website from links found in a sitemap. This applies to all HTML pages hosted on the domain of `website`, and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain. Sitemaps of sitemaps are followed recursively.
PARAMETER | DESCRIPTION |
---|---|
`website` | root of the website, including the protocol (e.g. `https://`). |
`default_lang` | provided or guessed main language of the website content. Not used internally. |
`sitemap` | relative path of the XML sitemap. |
`langs` | 2-letter ISO codes of the languages for which we attempt to fetch the translation, if available, from the HTML language metadata. |
`markup` | the markup to search for. See core.crawler.get_page_markup for details. |
`category` | arbitrary category or label. |
`contains_str` | limit recursive crawling from sitemap-defined pages to pages containing this string or list of strings. Passed as-is to core.crawler.get_website_from_crawling. |
`external_only` | follow links from internal pages (outside of sitemaps) only if they point to an external domain. |
RETURNS | DESCRIPTION |
---|---|
`list[web_page]` | a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored. |
Examples:
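A sketch adapted from the full example at the bottom of this page (domain and markup are illustrative):

```python
from core import crawler

cr = crawler.Crawler()

# Crawl a site from its sitemap (fetched from /sitemap.xml by default),
# extracting text from <article> tags only.
pages = cr.get_website_from_sitemap(
    "https://ansel.photos",
    "en",
    markup="article",
)
```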
Functions¤
get_content_type ¤
Probe a URL for its HTTP headers only, to see what type of content it returns.
RETURNS | DESCRIPTION |
---|---|
`type` | the type of content, as declared in the HTTP `Content-Type` header. |
`status` | the state flag of the response. |
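A hedged sketch, assuming the function takes the URL to probe and returns the two values documented above, in that order (the URL is illustrative):

```python
from core import crawler

# Assumption: get_content_type(url) returns (type, status) in the order
# documented above.
content_type, status = crawler.get_content_type("https://example.com/whitepaper.pdf")
if status and "pdf" in content_type:
    print("This URL serves a PDF document")
```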
relative_to_absolute ¤
Convert a relative URL to absolute by prepending the domain.
PARAMETER | DESCRIPTION |
---|---|
`URL` | the URL string to normalize to absolute. |
`domain` | the domain name of the website, without protocol. |
`current_page` | the URL of the page from which we analyze links. It is used to complete relative links that need to be resolved against the current page. |
RETURNS | DESCRIPTION |
---|---|
`str` | The normalized, absolute URL on this website. |
Examples:
>>> relative_to_absolute("folder/page", "me.com")
"://me.com/folder/page"
radical_url ¤
ocr_pdf ¤
ocr_pdf(
document: bytes,
output_images: bool = False,
path: str = None,
repair: int = 1,
upscale: int = 3,
contrast: float = 1.5,
sharpening: float = 1.2,
threshold: float = 0.4,
tesseract_lang: str = "eng+fra+equ",
tesseract_bin: str = None,
) -> str
Extract text from a PDF using OCR through Tesseract. Both the Python binding package PyTesseract and the Tesseract binaries need to be installed.
To run on a server where you don't have sudo access to install packages, you will need to download the AppImage package and pass its path to the tesseract_bin argument.
Tesseract uses machine learning to identify words and needs the relevant language models to be installed on the system as well. Linux packaged versions of Tesseract seem to generally ship French, English and equations (math) models by default. Other languages need to be installed manually; see the Tesseract docs for available packages. Use pytesseract.get_languages(config='') to list the language packages installed locally.
The OCR is preceded by an image processing step aimed at text reconstruction, by sharpening, increasing contrast and iteratively reconstructing holes in letters using an inpainting method in wavelets space. This is computationally expensive, which may not be suitable to run on a server.
PARAMETER | DESCRIPTION |
---|---|
`document` | the PDF document to open. |
`output_images` | set to `True` to also save the enhanced page images. |
`repair` | number of iterations of enhancements (sharpening, contrast and inpainting) to perform. More iterations take longer; too many iterations might simplify letter shapes (as if they were fluid and would drip, removing corners and pointy ends) in a way that actually degrades OCR. |
`upscale` | upscaling factor to apply before enhancement. This can help recover ink leaks but takes more memory and time to compute. |
`contrast` | contrast enhancement factor. |
`sharpening` | sharpening factor. |
`threshold` | the reference value (fulcrum) for contrast enhancement. Good values are typically in the range 0.20-0.50. |
`tesseract_lang` | the language models passed to the Tesseract lang argument, e.g. `eng+fra+equ`. |
`tesseract_bin` | the path to the Tesseract executable if it is not in the global CLI path. This is passed as-is to PyTesseract. |
RETURNS | DESCRIPTION |
---|---|
`str` | All the retrieved text from all the PDF pages as a single string. No pagination is done. |
RAISES | DESCRIPTION |
---|---|
`RuntimeError` | when a language package is requested while Tesseract has no such package installed. |
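A minimal sketch, assuming a scanned PDF read from the local filesystem (the path is illustrative; Tesseract, its language models and PyTesseract must be installed as described above):

```python
from core import crawler

# Read the raw PDF bytes from disk (hypothetical path)
with open("scans/archive-1987.pdf", "rb") as f:
    document = f.read()

# Run the enhancement + OCR pipeline with the default settings shown above
text = crawler.ocr_pdf(document, tesseract_lang="eng+fra+equ")
print(text[:500])
```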
get_pdf_content ¤
get_pdf_content(
url: str,
lang: str,
file_path: str = None,
process_outline: bool = True,
category: str = None,
ocr: int = 1,
**kwargs
) -> list[web_page]
Retrieve a PDF document through the network with HTTP GET or from the local filesystem, and parse its text content, using OCR if needed. This needs a functional network connection if file_path is not provided.
PARAMETER | DESCRIPTION |
---|---|
`url` | the online address of the document, or the download page if the document is not directly accessible from a GET request (for some old-school websites where downloads are initiated by a POST request to some PHP form handler, or for publications behind a paywall). |
`lang` | the ISO code of the language. |
`file_path` | local path to the PDF file if the URL can't be directly fetched by GET request. The content will be extracted from the local file but the original/remote URL will still be referenced as the source. |
`process_outline` | set to `True` to use the PDF outline (table of contents), when available, to split the document into sections. |
`category` | arbitrary category or label set by user. |
`ocr` | whether to run OCR (core.crawler.ocr_pdf) when the PDF has no usable text content. |
`**kwargs` | passed directly through to core.crawler.ocr_pdf. See that function's documentation for more info. |
RETURNS | DESCRIPTION |
---|---|
`list[web_page]` | a list of core.crawler.web_page objects holding the text content and the PDF metadata. |
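A minimal sketch of both retrieval modes (URLs and file path are illustrative):

```python
from core import crawler

# Fetch and parse a PDF directly from its URL
pages = crawler.get_pdf_content("https://example.com/papers/report.pdf", "en")

# Or parse a local copy while keeping the remote URL referenced as the source
pages += crawler.get_pdf_content(
    "https://example.com/paywalled/report.pdf",
    "en",
    file_path="downloads/report.pdf",
    category="papers",
)

for page in pages:
    print(page["url"])
```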
get_page_content ¤
Request an (x)HTML page through the network with HTTP GET and feed its response to a BeautifulSoup handler. This needs a functional network connection.
The DOM is pre-filtered as follows to keep only natural language and avoid duplicate strings:

- media tags are removed (`<iframe>`, `<embed>`, `<img>`, `<svg>`, `<audio>`, `<video>`, etc.),
- code and machine language tags are removed (`<script>`, `<style>`, `<code>`, `<pre>`, `<math>`),
- menus and sidebars are removed (`<nav>`, `<aside>`),
- forms, fields and buttons are removed (`<select>`, `<input>`, `<button>`, `<textarea>`, etc.),
- quote tags are removed (`<quote>`, `<blockquote>`).
The HTML is un-minified to help end-of-sentence detection in cases where sentences don't end with punctuation (e.g. in titles).
PARAMETER | DESCRIPTION |
---|---|
`url` | a valid URL that can be fetched with an HTTP GET request. |
`content` | a string buffer used as HTML source. If this argument is passed, `url` is not fetched and this content is parsed instead. |
RETURNS | DESCRIPTION |
---|---|
`[BeautifulSoup \| None, list[str]]` | a bs4.BeautifulSoup object initialized with the page DOM for further text mining. |
get_page_markup ¤
Extract the text content of an HTML page DOM by targeting only the specified tags.
PARAMETER | DESCRIPTION |
---|---|
`page` | a bs4.BeautifulSoup handler with pre-filtered DOM, e.g. the output of core.crawler.get_page_content. |
`markup` | any kind of tags supported by bs4.BeautifulSoup.find_all: a single tag name, a `(tag, {attribute: value})` tuple, or a list of those (see the examples below). |
RETURNS | DESCRIPTION |
---|---|
`str` | The text content of all instances of all tags in `markup` as a single string, if any, else an empty string. |
Examples:
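A sketch combining it with core.crawler.get_page_content; the unpacking of get_page_content's return value follows its documented `[BeautifulSoup | None, list[str]]` signature, and the URL and markup values are illustrative:

```python
from core import crawler

# Fetch and pre-filter the DOM, then extract text from the targeted tags
page, _ = crawler.get_page_content("https://ansel.photos/en/doc/")
if page:
    # Single tag name
    text = crawler.get_page_markup(page, markup="article")

    # Several tags at once, with or without attribute filters
    text = crawler.get_page_markup(
        page, markup=[("div", {"class": "post-content"}), "article"]
    )
```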
get_excerpt ¤
Find HTML tags possibly containing the shortened version of the page content.
Looks for HTML tags:

- `<meta name="description" content="...">`
- `<meta property="og:description" content="...">`
PARAMETER | DESCRIPTION |
---|---|
`page` | a bs4.BeautifulSoup handler with pre-filtered DOM, e.g. the output of core.crawler.get_page_content. |
RETURNS | DESCRIPTION |
---|---|
`str \| None` | The content of the meta tag, if any. |
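A small sketch using an inline HTML snippet instead of a fetched page (core.crawler.get_date can be used the same way):

```python
from bs4 import BeautifulSoup
from core import crawler

html = '<html><head><meta name="description" content="A short summary."></head><body></body></html>'
page = BeautifulSoup(html, "html.parser")

excerpt = crawler.get_excerpt(page)  # "A short summary.", or None if no meta tag is found
```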
get_date ¤
Find HTML tags possibly containing the page date.
Looks for HTML tags:

- `<meta property="article:modified_time" content="...">`
- `<time datetime="...">`
- `<relative-time datetime="...">`
- `<div class="dateline">...</div>`
PARAMETER | DESCRIPTION |
---|---|
`page` | a bs4.BeautifulSoup handler with pre-filtered DOM, e.g. the output of core.crawler.get_page_content. |

RETURNS | DESCRIPTION |
---|---|
 | The content of the meta tag, if any. |
parse_page ¤
parse_page(
page: BeautifulSoup,
url: str,
lang: str,
markup: str | list[str],
date: str = None,
category: str = None,
) -> list[web_page]
Get the requested markup from the requested page URL.
This chains the markup extraction, excerpt and date detection in a single call.
PARAMETER | DESCRIPTION |
---|---|
`page` | a bs4.BeautifulSoup handler with pre-filtered DOM, e.g. the output of core.crawler.get_page_content. |
`url` | the valid URL of the page, accessible by HTTP GET request. |
`lang` | the provided or guessed language of the page. |
`markup` | the markup to search for. See core.crawler.get_page_markup for details. |
`date` | if the page was retrieved from a sitemap, the date is usually available there in ISO format and can be passed here. |
`category` | arbitrary category or label defined by user. |
RETURNS | DESCRIPTION |
---|---|
`list[web_page]` | The content of the page, including metadata, in a core.crawler.web_page singleton. |
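A sketch chaining it with core.crawler.get_page_content (URL, markup and category are illustrative; the unpacking of get_page_content's return value follows its documented signature):

```python
from core import crawler

url = "https://ansel.photos/en/doc/views"
page, _ = crawler.get_page_content(url)
if page:
    results = crawler.parse_page(
        page,
        url,
        lang="en",
        markup="article",
        category="documentation",
    )
    # results is a core.crawler.web_page singleton (single-element list)
```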
Examples¤
The best way to use the crawler is by adding a script in `src/user_scripts`, since it is meant to be used as an offline training step (and not in filters).
To crawl a website where some content has a sitemap and the rest does not:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import crawler
from core import utils
# Init a crawler object
cr = crawler.Crawler()
# Add URLs to ignore when crawling, for performance.
# for example if those URLs are crawled in another data set.
cr.no_follow += [
"aurelienpierre.com",
"persons-profile-",
"/view-album/",
]
# Scrape one site using the sitemap
output = cr.get_website_from_sitemap("https://ansel.photos",
"en",
markup="article")
# Scrape another site recursively
output += cr.get_website_from_crawling("https://community.ansel.photos",
"en",
child="/discussions-home",
markup=[("div", {"class": "bx-content-description"}),
("div", {"class": "cmt-body"})],
contains_str="view-discussion")
# ... can keep appending as many websites as you want to `output` list
dedup = crawler.Deduplicator()
dedup.urls_to_ignore += [
# Can add other urls to ignore here too.
]
output = dedup.process(output)
utils.save_data(output, "ansel")
# You will find the dataset saved as a pickle Python object in a .tar.gz in ./data/
Note
In the above example, we reuse the cr
object between the "sitemap" and the "recursive" calls. This means that the second call inherits the Crawler.crawled_URL list from the first one, which contains all the URLs already processed. All URLs from this list will be ignored in the next calls. This can be good to avoid duplicates, but can be bad for some use cases. For those cases, instantiate a new Crawler
object instead of reusing the previous one.
The core.utils.save_data method will directly save the list of core.crawler.web_page objects as a pickle file compressed in a .tar.gz
archive, into the VirtualSecretary/data
folder. To re-open, decompress and decode it later, use core.utils.open_data:
# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))
# Here starts the real code
from core import crawler
from core import utils
pages = utils.open_data("ansel")
for page in pages:
    # do stuff...
    print(page["title"], page["url"])