Crawler¤

crawler ¤

Module containing utilities to crawl websites for the text content of their HTML, XML and PDF pages. PDFs can be read from their embedded text content if any, or through optical character recognition for scans. Websites can be crawled from a sitemap.xml file or by following internal links recursively from an index page. Each page is aggregated into a list of core.crawler.web_page objects, meant to be used as input to train natural language AI models and to index and rank pages for search engines.

© 2023-2024 - Aurélien Pierre

Classes¤

web_page ¤

Bases: TypedDict

Typed dictionary representing a web page and its metadata. It can also be used for any text document having a URL/URI.

Attributes¤
title instance-attribute ¤
title: str

Title of the page

url instance-attribute ¤
url: str

Where to find the page on the network. Can be a local or distant URI, with or without protocol, or even a unique identifier.

date instance-attribute ¤
date: str

Date of the last modification of the page, to assess relevance of the content.

content instance-attribute ¤
content: str

The actual content of the page.

excerpt instance-attribute ¤
excerpt: str

Shortened version of the content for search results previews. Typically provided as description meta tag by websites.

h1 instance-attribute ¤
h1: set[str]

Title of the post if any. There should be only one h1 per page, matching the title, but some templates wrongly use h1 for section titles.

h2 instance-attribute ¤
h2: set[str]

Section titles if any

lang instance-attribute ¤
lang: str

2-letter code of the page language. Not used internally; it matters only if you need it in implementations.

category instance-attribute ¤
category: str

Arbitrary category or label set by user
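
Being a TypedDict, a web_page is built like a regular dict. A minimal sketch, with made-up values for illustration:

from core import crawler

page = crawler.web_page(
    title="Hello world",
    url="https://example.com/hello-world",     # hypothetical URL
    date="2024-01-01T00:00:00",
    content="Full text of the page...",
    excerpt="Short description for previews.",
    h1={"Hello world"},
    h2={"First section", "Second section"},
    lang="en",
    category="blog",
)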

Deduplicator ¤

Deduplicator(threshold: float = 0.9, distance: int = 500)

Instantiate a deduplicator object.

Duplicate factorization takes a list of core.crawler.web_page objects and happens when calling core.crawler.Deduplicator.process.

Duplication detection is done using canonical URLs (removing query parameters and anchors) and lowercased, ASCII-converted content.

You can edit (append to or replace) the list of URLs to ignore, core.crawler.Deduplicator.urls_to_ignore, before running the actual process.

Optionally, near-duplicates are detected too by computing the Levenshtein distance between page contents (lowercased and ASCII-converted). This brings a significant performance penalty on large datasets.

PARAMETER DESCRIPTION
threshold

the minimum Levenshtein distance ratio between 2 page contents for those pages to be considered near-duplicates and factorized. If set to 1.0, near-duplicate detection is bypassed, which results in a huge speed-up.

TYPE: float DEFAULT: 0.9

distance

the near-duplicate search is performed on the nearest elements after the core.crawler.web_page list has been ordered alphabetically by URL, for performance, assuming near-duplicates will most likely be found on the same domain and at a similar path. The distance parameter defines how many elements ahead we look.

TYPE: int DEFAULT: 500
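
For instance, a minimal sketch of a fast deduplication pass that skips near-duplicate detection, assuming pages is a list of core.crawler.web_page objects already collected (the extra ignored URL substring is made up):

from core import crawler

# pages: list of core.crawler.web_page objects previously collected by a Crawler
dedup = crawler.Deduplicator(threshold=1.0)   # 1.0 bypasses the near-duplicate detection
dedup.urls_to_ignore += ["/print/"]           # hypothetical extra URL substring to drop
pages = dedup.process(pages)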

Attributes¤
urls_to_ignore class-attribute instance-attribute ¤
urls_to_ignore: list[str] = [
    "/tag/",
    "/tags/",
    "/category/",
    "/categories/",
    "/author/",
    "/authors/",
    "/archive/",
    "/archives/",
    "/profil/",
    "/profiles/",
    "/user/",
    "/users/",
    "/login/",
    "/signup/",
    "/member/",
    "/members/",
    "/cart/",
    "/shop/",
]

URL substrings used to discard matching web pages: mostly WordPress archive pages, user profiles and login pages.

Functions¤
prepare_urls ¤
prepare_urls(posts: list[web_page]) -> dict[str:list[web_page]]

Find the canonical URL of each post and aggregate a list of matching pages in a canonical_url: [candidates] dict.

Precompute the datetime object and a minified ASCII content variant for later processing.

get_unique_urls ¤
get_unique_urls(posts: dict[str:list[web_page]]) -> dict[str:list[web_page]]

Pick the most recent, or otherwise the longest, candidate for each canonical URL.

Return

canonical_url: web_page dictionary

prepare_content ¤
prepare_content(posts: dict[str:list[web_page]]) -> dict[str:list[web_page]]

Find the canonical content for each post and aggregate a list of matching pages.

RETURNS DESCRIPTION
dict[str:list[web_page]]

a canonical_content: list[web_page] dictionary.

get_unique_content ¤
get_unique_content(posts: dict[str:list[web_page]]) -> dict[str:list[web_page]]

Pick the most recent candidate for each canonical content.

Return

canonical_content: web_page dictionary

get_close_content ¤
get_close_content(
    posts: dict[str : list[web_page]],
    threshold: float = 0.9,
    distance: float = 500,
) -> dict[str : list[web_page]]

Find near-duplicates by computing the Levenshtein distance between page contents.

PARAMETER DESCRIPTION
posts

dictionary mapping an unused key to a list of crawler.web_page

TYPE: dict[str:list[web_page]]

threshold

the minimum Levenshtein distance ratio for 2 contents to be assumed duplicates

TYPE: float DEFAULT: 0.9

distance

for efficiency, the list of web_page is first sorted alphabetically by URL, assuming near-duplicates will most likely be found on the same domain and at a similar path; only the next distance elements are compared to each page

TYPE: float DEFAULT: 500

process ¤
process(posts: list[web_page])

Launch the actual duplicate finder

Crawler ¤

Crawler()

Crawl a website from its sitemap or by following internal links recursively from an index page.

Attributes¤
no_follow class-attribute instance-attribute ¤
no_follow: list[str] = [
    "api.whatsapp.com/share",
    "api.whatsapp.com/send",
    "pinterest.fr/pin/create",
    "pinterest.com/pin/create",
    "facebook.com/sharer",
    "twitter.com/intent/tweet",
    "reddit.com/submit",
    "t.me/share",
    "linkedin.com/share",
    "bufferapp.com/add",
    "getpocket.com/edit",
    "tumblr.com/share",
    "mailto:",
    "/profile/",
    "/login/",
    "/signup/",
    "/login?",
    "/signup?/user/",
    "/member/",
    ".css",
    ".js",
    ".json",
]

List of URL substrings that disable crawling when found in a URL. Mostly social network sharing links.

crawled_URL instance-attribute ¤
crawled_URL: list[str] = []

List of URLs already visited

Functions¤
discard_link ¤
discard_link(url)

Returns True if the url is found in the self.no_follow list

get_immediate_links ¤
get_immediate_links(
    page,
    domain,
    currentURL,
    default_lang,
    langs,
    category,
    contains_str,
    external_only: bool = False,
) -> list[web_page]

Follow internal and external links contained in a web page to only one recursion level, including PDF files and HTML pages. This is useful to index reference docs linked from a page.
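
A sketch of a possible call, assuming page is the bs4.BeautifulSoup handler of the page being processed and all values are illustrative:

from core import crawler

cr = crawler.Crawler()
# page: bs4.BeautifulSoup handler of the page being processed (see get_page_content below)
references = cr.get_immediate_links(page,
                                    domain="example.com",                    # hypothetical domain
                                    currentURL="https://example.com/docs",   # page the links come from
                                    default_lang="en",
                                    langs=("en", "fr"),
                                    category="references",
                                    contains_str="",
                                    external_only=True)   # only follow links pointing to other domains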

get_website_from_crawling ¤
get_website_from_crawling(
    website: str,
    default_lang: str = "en",
    child: str = "/",
    langs: tuple = ("en", "fr"),
    markup: str = "body",
    contains_str: str | list[str] = "",
    max_recurse_level: int = -1,
    category: str = None,
    restrict_section: bool = False,
    _recursion_level: int = 0,
) -> list[web_page]

Recursively crawl all pages of a website from internal links found starting from the child page. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain.

PARAMETER DESCRIPTION
website

root of the website, including https:// or http:// without trailing slash.

TYPE: str

default_lang

provided or guessed main language of the website content. Not used internally.

TYPE: str DEFAULT: 'en'

child

page of the website to use as index to start crawling for internal links.

TYPE: str DEFAULT: '/'

langs

ISO-something 2-letter codes of the languages for which we attempt to fetch a translation if available, looking for the HTML <link rel="alternate" hreflang="..."> tag.

TYPE: tuple DEFAULT: ('en', 'fr')

contains_str

a string or a list of strings that should be contained in a page URL for the page to be indexed. On a forum, you could for example restrict pages to URLs containing "discussion" to get only the threads and avoid user profiles or archive pages.

TYPE: str | list[str] DEFAULT: ''

markup

the markup to search for. See core.crawler.get_page_markup for details.

TYPE: str DEFAULT: 'body'

max_recurse_level

this method calls itself recursively on each internal link found in the current page, starting from the website/child page. The max_recurse_level parameter defines how many times it may call itself before being stopped, if it is stopped at all. When set to -1, it stops only when all internal links have been crawled.

TYPE: int DEFAULT: -1

category

arbitrary category or label set by user.

TYPE: str DEFAULT: None

restrict_section

set to True to limit crawling to the website section defined by ://website/child/*. This is useful when indexing parts of very large websites when you are only interested in a small subset.

TYPE: bool DEFAULT: False

_recursion_level

DON’T USE IT. Every time this method calls itself recursively, it increments this variable internally, and recursion stops when the level equals max_recurse_level.

TYPE: int DEFAULT: 0

RETURNS DESCRIPTION
list[web_page]

a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> cr = crawler.Crawler()
>>> pages = cr.get_website_from_crawling("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))
get_website_from_sitemap ¤
get_website_from_sitemap(
    website: str,
    default_lang: str,
    sitemap: str = "/sitemap.xml",
    langs: tuple[str] = ("en", "fr"),
    markup: str | tuple[str] = "body",
    category: str = None,
    contains_str: str | list[str] = "",
    external_only: bool = False,
) -> list[web_page]

Recursively crawl all pages of a website from links found in a sitemap. This applies to all HTML pages hosted on the domain of website and to PDF documents either from the current domain or from external domains but referenced on HTML pages of the current domain. Sitemaps of sitemaps are followed recursively.

PARAMETER DESCRIPTION
website

root of the website, including https:// or http:// without trailing slash.

TYPE: str

default_lang

provided or guessed main language of the website content. Not used internally.

TYPE: str

sitemap

relative path of the XML sitemap.

TYPE: str DEFAULT: '/sitemap.xml'

langs

ISO-something 2-letter codes of the languages for which we attempt to fetch a translation if available, looking for the HTML <link rel="alternate" hreflang="..."> tag.

TYPE: tuple[str] DEFAULT: ('en', 'fr')

markup

the markup to search for. See core.crawler.get_page_markup for details.

TYPE: str | tuple[str] DEFAULT: 'body'

category

arbitrary category or label

TYPE: str DEFAULT: None

contains_str

limit recursive crawling from sitemap-defined pages to pages containing this string or list of strings. Will be passed as-is to get_website_from_crawling.

TYPE: str | list[str] DEFAULT: ''

external_only

follow links from internal pages (outside of sitemaps) only if they point to an external domain.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
list[web_page]

a list of all valid pages found. Invalid pages (wrong markup, empty HTML response, 404 errors) will be silently ignored.

Examples:

>>> from core import crawler
>>> cr = crawler.Crawler()
>>> pages = cr.get_website_from_sitemap("https://aurelienpierre.com", default_lang="fr", markup=("div", { "class": "post-content" }))

Functions¤

get_content_type ¤

get_content_type(url: str) -> tuple[str, bool]

Probe a URL for HTTP headers only to see what type of content it returns.

RETURNS DESCRIPTION
type

the type of content, like text/html, application/pdf, etc.

TYPE: str

status

the state flag:

  • True if the URL can be reached and fetched,
  • False if there is some kind of error or empty response.

TYPE: bool
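
A minimal usage sketch (the URL is hypothetical):

from core import crawler

mime_type, ok = crawler.get_content_type("https://example.com/whitepaper.pdf")
if ok and "pdf" in mime_type:
    print("This URL serves a PDF document")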

relative_to_absolute ¤

relative_to_absolute(URL: str, domain: str, current_page: str) -> str

Convert a relative URL to absolute by prepending the domain.

PARAMETER DESCRIPTION
URL

the URL string to normalize to absolute,

TYPE: str

domain

the domain name of the website, without protocol (http://) nor trailing slash. It will be prepended to relative links starting with /.

TYPE: str

current_page

the URL of the page from which we analyze links. It will be prepended to relative links starting with ./.

TYPE: str

RETURNS DESCRIPTION
str

The normalized, absolute URL on this website.

Examples:

>>> relative_to_absolute("folder/page", "me.com")
"://me.com/folder/page"

radical_url ¤

radical_url(URL: str) -> str

Trim a URL to its page (radical) part, removing anchors if any (internal links).

Examples:

>>> radical_url("http://me.com/page#section-1")
"http://me.com/page"

ocr_pdf ¤

ocr_pdf(
    document: bytes,
    output_images: bool = False,
    path: str = None,
    repair: int = 1,
    upscale: int = 3,
    contrast: float = 1.5,
    sharpening: float = 1.2,
    threshold: float = 0.4,
    tesseract_lang: str = "eng+fra+equ",
    tesseract_bin: str = None,
) -> str

Extract text from PDF using OCR through Tesseract. Both the binding Python package PyTesseract and the Tesseract binaries need to be installed.

To run on a server where you don’t have sudo access to install packages, you will need to download the AppImage package and pass its path to the tesseract_bin argument.

Tesseract uses machine learning to identify words and needs the relevant language models to be installed on the system as well. Linux packaged versions of Tesseract generally seem to ship French, English and equation (math) models by default. Other languages need to be installed manually; see the Tesseract docs for available packages. Use pytesseract.get_languages(config='') to list the language packages installed locally.

The OCR is preceded by an image-processing step aiming at text reconstruction: sharpening, increasing contrast and iteratively reconstructing holes in letters using an inpainting method in wavelet space. This is computationally expensive and may not be suitable to run on a server.

PARAMETER DESCRIPTION
document

the PDF document to open.

TYPE: bytes

output_images

if set to True, each page of the document is saved as PNG in the path directory before and after contrast enhancement. This is useful to tune the image contrast and sharpness enhancements prior to OCR.

TYPE: bool DEFAULT: False

repair

number of enhancement iterations (sharpening, contrast and inpainting) to perform. More iterations take longer, and too many iterations might oversimplify letter geometry (as if letters were fluid and dripping, removing corners and pointy ends) in a way that actually degrades OCR.

TYPE: int DEFAULT: 1

upscale

upscaling factor to apply before enhancement. This can help recover ink leaks but takes more memory and time to compute.

TYPE: int DEFAULT: 3

contrast

1.0 is the neutral value. Moves RGB values farther away from the threshold.

TYPE: float DEFAULT: 1.5

sharpening

1.0 is the neutral value. Increases sharpness. Values too high can produce ringing (replicated ghost edges).

TYPE: float DEFAULT: 1.2

threshold

the reference value (fulcrum) for contrast enhancement. Good values are typically in the range 0.20-0.50.

TYPE: float DEFAULT: 0.4

tesseract_lang

the Tesseract command argument -l defining the language models to use for OCR. Languages are referenced by their 3-letter ISO-something code. See the Tesseract doc for syntax and meaning. You can mix several languages by joining them with +.

TYPE: str DEFAULT: 'eng+fra+equ'

tesseract_bin

the path to the Tesseract executable if it is not in the global CLI path. This is passed as-is to pytesseract.pytesseract.tesseract_cmd of the PyTesseract binding library.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
str

All the retrieved text from all the PDF pages as a single string. No pagination is done.

RAISES DESCRIPTION
RuntimeError

when a language model is requested while Tesseract has no such package installed.
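
A sketch of a typical call on a local scan; the file path and tuning values are only illustrative:

from core import crawler

# Read the scanned PDF as raw bytes
with open("scan.pdf", "rb") as f:   # hypothetical file
    document = f.read()

text = crawler.ocr_pdf(document,
                       upscale=3,             # heavier upscaling can help damaged scans
                       contrast=1.5,
                       threshold=0.4,
                       tesseract_lang="eng+fra")
print(text[:500])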

get_pdf_content ¤

get_pdf_content(
    url: str,
    lang: str,
    file_path: str = None,
    process_outline: bool = True,
    category: str = None,
    ocr: int = 1,
    **kwargs
) -> list[web_page]

Retrieve a PDF document through the network with HTTP GET or from the local filesystem, and parse its text content, using OCR if needed. This needs a functional network connection if file_path is not provided.

PARAMETER DESCRIPTION
url

the online address of the document, or of the download page if the doc is not directly accessible from a GET request (for some old-school websites where downloads are initiated from a POST request to some PHP form handler, or publications behind a paywall).

TYPE: str

lang

the ISO code of the language.

TYPE: str

file_path

local path to the PDF file if the URL can’t be directly fetched by GET request. The content will be extracted from the local file but the original/remote URL will still be referenced as the source.

TYPE: str DEFAULT: None

process_outline

set to True to split the document according to its outline (table of contents), so each section is in effect a document in itself. PDF pages are processed in full, so sections are at least one page long and there will be some overlap.

TYPE: bool DEFAULT: True

category

arbitrary category or label set by user

TYPE: str DEFAULT: None

ocr
  • 0 disables any attempt at using OCR,
  • 1 enables OCR through Tesseract if no text was found in the PDF document
  • 2 forces OCR through Tesseract even when text was found in the PDF document.

TYPE: int DEFAULT: 1

PARAMETER DESCRIPTION
**kwargs

passed through directly to core.crawler.ocr_pdf. See that function’s documentation for more info.

RETURNS DESCRIPTION
list[web_page]

a list of core.crawler.web_page objects holding the text content and the PDF metadata
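
A sketch of fetching a remote PDF and forcing OCR; the URL is made up:

from core import crawler

pages = crawler.get_pdf_content("https://example.com/report.pdf",   # hypothetical URL
                                lang="en",
                                process_outline=True,   # one web_page per outline section
                                ocr=2)                   # force OCR even if embedded text exists
for page in pages:
    print(page["title"], len(page["content"]))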

get_page_content ¤

get_page_content(
    url: str, content: str = None
) -> [BeautifulSoup | None, list[str]]

Request an (x)HTML page through the network with HTTP GET and feed its response to a BeautifulSoup handler. This needs a functional network connection.

The DOM is pre-filtered as follows to keep only natural language and avoid duplicate strings:

  • media tags are removed (<iframe>, <embed>, <img>, <svg>, <audio>, <video>, etc.),
  • code and machine language tags are removed (<script>, <style>, <code>, <pre>, <math>),
  • menus and sidebars are removed (<nav>, <aside>),
  • forms, fields and buttons are removed (<select>, <input>, <button>, <textarea>, etc.),
  • quote tags are removed (<quote>, <blockquote>).

The HTML is un-minified to help end-of-sentence detection in cases where sentences don’t end with punctuation (e.g. in titles).

PARAMETER DESCRIPTION
url

a valid URL that can be fetched with an HTTP GET request.

TYPE: str

content

a string buffer used as HTML source. If this argument is passed, we don’t fetch url from the network and use this input directly.

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
[BeautifulSoup | None, list[str]]

a bs4.BeautifulSoup object initialized with the page DOM for further text mining, or None if the HTML response was empty or the URL could not be reached. The list of URLs found in the page before removing meaningless markup is stored as a list of strings in the object.links member. object.h1 and object.h2 contain the sets of h1 and h2 headers found in the page before removing any markup.
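
A minimal sketch, assuming the returned handler carries the links, h1 and h2 members described above (the URL is hypothetical):

from core import crawler

page = crawler.get_page_content("https://example.com/article")   # hypothetical URL
if page is not None:
    print(page.h1, page.h2)    # header sets collected before markup removal
    print(len(page.links))     # URLs found in the page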

get_page_markup ¤

get_page_markup(
    page: BeautifulSoup, markup: str | tuple | list[str] | list[tuple] | None
) -> str

Extract the text content of an HTML page DOM by targeting only specific tags.

PARAMETER DESCRIPTION
page

a bs4.BeautifulSoup handler with pre-filtered DOM,

TYPE: BeautifulSoup

markup

any kind of tags supported by bs4.BeautifulSoup.find_all:

  • (str): the single tag to select. For example, "body" will select <body>...</body>.
  • (tuple): the tag and properties to select. For example, ("div", { "class": "right" }) will select <div class="right">...</div>.
  • all combinations of the above can be chained in lists.
  • None: don’t parse the page internal content. Links, h1 and h2 headers will still be parsed.

TYPE: str | tuple | list[str] | list[tuple] | None

RETURNS DESCRIPTION
str

The text content of all instances of all tags in markup as a single string, if any, else an empty string.

Examples:

>>> get_page_markup(page, "article")
>>> get_page_markup(page, ["h1", "h2", "h3", "article"])
>>> get_page_markup(page, [("div", {"id": "content"}), "details", ("div", {"class": "comment-reply"})])

get_excerpt ¤

get_excerpt(html: BeautifulSoup) -> str | None

Find HTML tags possibly containing the shortened version of the page content.

Looks for HTML tags:

  • <meta name="description" content="...">
  • <meta property="og:description" content="...">

PARAMETER DESCRIPTION
html

a bs4.BeautifulSoup handler with pre-filtered DOM,

RETURNS DESCRIPTION
str | None

The content of the meta tag if any.

get_date ¤

get_date(html: BeautifulSoup)

Find HTML tags possibly containing the page date.

Looks for HTML tags:

  • <meta property="article:modified_time" content="...">
  • <time datetime="...">
  • <relative-time datetime="...">
  • <div class="dateline">...</div>

PARAMETER DESCRIPTION
html

a bs4.BeautifulSoup handler with pre-filtered DOM,

RETURNS DESCRIPTION

The content of the matching tag, if any.
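
Both get_excerpt and get_date take the same pre-filtered handler; a sketch assuming page was obtained through core.crawler.get_page_content:

from core import crawler

excerpt = crawler.get_excerpt(page)   # None if no description meta tag was found
date = crawler.get_date(page)         # falls back on <time> and dateline markup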

parse_page ¤

parse_page(
    page: BeautifulSoup,
    url: str,
    lang: str,
    markup: str | list[str],
    date: str = None,
    category: str = None,
) -> list[web_page]

Get the requested markup from the requested page URL.

This chains the markup extraction, excerpt extraction and date detection steps in a single call.

PARAMETER DESCRIPTION
page

a bs4.BeautifulSoup handler with pre-filtered DOM,

TYPE: BeautifulSoup

url

the valid URL of the page, accessible by an HTTP GET request

TYPE: str

lang

the provided or guessed language of the page,

TYPE: str

markup

the markup to search for. See core.crawler.get_page_markup for details.

TYPE: str | list[str]

date

if the page was retrieved from a sitemap, usually the date is available in ISO format (yyyy-mm-ddTHH:MM:SS) and can be passed directly here. Otherwise, several attempts will be made to extract it from the page content (see core.crawler.get_date).

TYPE: str DEFAULT: None

category

arbitrary category or label defined by user

TYPE: str DEFAULT: None

RETURNS DESCRIPTION
list[web_page]

The content of the page, including metadata, as a single-element list of core.crawler.web_page.
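
A sketch of a manual call, assuming page is a bs4.BeautifulSoup handler (e.g. obtained through core.crawler.get_page_content) and the URL and markup are illustrative:

from core import crawler

results = crawler.parse_page(page,
                             url="https://example.com/blog/post",   # hypothetical URL
                             lang="en",
                             markup=("div", {"class": "post-content"}),
                             category="blog")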

Examples¤

The best way to use the crawler is by adding a script in src/user_scripts, since it is meant to be used as an offline training step (and not in filters).

To crawl a website where some content has a sitemap and the rest does not:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import crawler
from core import utils

# Init a crawler object
cr = crawler.Crawler()

# Add URLs to ignore when crawling, for performance,
# for example if those URLs are crawled in another data set.
cr.no_follow += [
  "aurelienpierre.com",
  "persons-profile-",
  "/view-album/",
]

# Scrape one site using the sitemap
output = cr.get_website_from_sitemap("https://ansel.photos",
                                      "en",
                                      markup="article")

# Scrape another site recursively
output += cr.get_website_from_crawling("https://community.ansel.photos",
                                       "en",
                                       child="/discussions-home",
                                       markup=[("div", {"class": "bx-content-description"}),
                                               ("div", {"class": "cmt-body"})],
                                       contains_str="view-discussion")

# ... can keep appending as many websites as you want to `output` list

dedup = crawler.Deduplicator()
dedup.urls_to_ignore += [
  # Can add other urls to ignore here too.
]
output = dedup.process(output)

utils.save_data(output, "ansel")
# You will find the dataset saved as a pickle Python object in a .tar.gz in ./data/

Note

In the above example, we reuse the cr object between the “sitemap” and the “recurse” calls. This means the second call inherits the Crawler.crawled_URL list from the first one, which contains all the URLs already processed. All URLs from this list will be ignored in subsequent calls. This can be good to avoid duplicates, but can be bad for some use cases. In those cases, instantiate a new Crawler object instead of reusing the previous one.

The core.utils.save_data method will directly save the list of core.crawler.web_page objects as a pickle file compressed in a .tar.gz archive, into the VirtualSecretary/data folder. To re-open, decompress and decode it later, use core.utils.open_data:

# Boilerplate stuff to access src/core from src/user_scripts
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(SCRIPT_DIR))

# Here starts the real code
from core import crawler
from core import utils

pages = utils.open_data("ansel")

for page in pages:
  # do stuff...
  print(page["title"], page["url"])