core.parser

Classes

core.parser.ParsedHTML `dataclass`

ParsedHTML(
    soup: BeautifulSoup,
    links: list[str] = list(),
    h1: set[str] = set(),
    h2: set[str] = set(),
    content: str | None = str(),
    title: str | None = str(),
    excerpt: str | None = str(),
    date: str | None = str(),
    lang: str | None = str(),
    scripts: list[str] = list(),
)

Wrapper around BeautifulSoup with precomputed crawler metadata.

Methods:

core.parser.ParsedHTML.from_html `classmethod`

from_html(html: str, parser: str = 'html5lib') -> ParsedHTML

Build a ParsedHTML instance from a raw HTML string.
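
The pattern described here, a dataclass whose metadata fields are filled once by a classmethod constructor, can be sketched with the standard library alone. Everything in this block is hypothetical: `PageSummary` and `_Collector` are illustration names, and `html.parser` stands in for BeautifulSoup so the example runs without third-party packages. It is not the actual ParsedHTML implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Hypothetical helper: gathers link targets and <h1> text in one pass."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.h1: set[str] = set()
        self._in_h1 = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h1:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            text = "".join(self._buffer).strip()
            if text:
                self.h1.add(text)
            self._in_h1 = False


@dataclass
class PageSummary:
    """Hypothetical stand-in for ParsedHTML with only two metadata fields."""

    html: str
    # Mutable defaults need default_factory so instances do not share state.
    links: list[str] = field(default_factory=list)
    h1: set[str] = field(default_factory=set)

    @classmethod
    def from_html(cls, html: str) -> PageSummary:
        # Parse once, then store the results on the instance.
        collector = _Collector()
        collector.feed(html)
        collector.close()
        return cls(html=html, links=collector.links, h1=collector.h1)


page = PageSummary.from_html('<h1>Title</h1><a href="/about">About</a>')
print(page.links, page.h1)
```

Doing the extraction in the constructor means callers read plain attributes afterwards instead of re-walking the document on every access.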

core.parser.ParsedHTML.get_page_markup

get_page_markup(
    markup: str | tuple | list[str] | list[tuple] | None,
) -> str | None

Extract the text content of an HTML page by targeting only the specified tags.

PARAMETER DESCRIPTION

markup

Any tag selector supported by bs4.Tag.find_all:

(str): a single tag to select. For example, "body" will select <body>...</body>.
(tuple): a tag and the attributes to match. For example, ("div", { "class": "right" }) will select <div class="right">...</div>.
(list): any combination of the above.
None: don’t parse the page's internal content. Links, h1 and h2 headers will still be parsed.

TYPE: str | tuple | list[str] | list[tuple] | None

RETURNS	DESCRIPTION
`str | None`	The text content of all instances of all tags in markup as a single string, if any, else an empty string.

Examples:

>>> page.get_page_markup("article")

>>> page.get_page_markup(["h1", "h2", "h3", "article"])

>>> page.get_page_markup([("div", {"id": "content"}), "details", ("div", {"class": "comment-reply"})])
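
One way to read the accepted markup forms is as shorthand for a flat list of (tag, attributes) pairs. The sketch below illustrates that reading only; `normalize_markup` and `Selector` are hypothetical names, and the library may handle these cases differently internally.

```python
from __future__ import annotations

from typing import Any

# Hypothetical alias: a tag name plus the attributes it must match.
Selector = tuple[str, dict[str, Any]]


def normalize_markup(markup) -> list[Selector]:
    """Turn any accepted markup value into a flat list of (tag, attrs) pairs."""
    if markup is None:
        return []  # None: no content selectors at all
    if isinstance(markup, str):
        return [(markup, {})]  # bare tag name, no attribute filter
    if isinstance(markup, tuple):
        name, attrs = markup
        return [(name, dict(attrs))]
    # A list may mix bare tag names and (tag, attrs) tuples.
    selectors: list[Selector] = []
    for item in markup:
        selectors.extend(normalize_markup(item))
    return selectors


print(normalize_markup([("div", {"id": "content"}), "details"]))
```

Each resulting pair could then be passed to BeautifulSoup's `find_all(name, attrs)`, with the text of the matches concatenated into one string.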

core.parser.ParsedHTML.get_excerpt

get_excerpt() -> str | None

Find HTML tags that may contain a shortened version of the page content.

Looks for HTML tags:

<meta name="description" content="...">
<meta property="og:description" content="...">

RETURNS	DESCRIPTION
`str | None`	The content of the meta tag, if any.
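
The lookup above can be sketched without BeautifulSoup, using only the standard library. `find_excerpt` and `_ExcerptFinder` are hypothetical names, and returning the first matching tag in document order is an assumption rather than documented behavior.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _ExcerptFinder(HTMLParser):
    """Hypothetical helper: records the first description-like <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.excerpt: str | None = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.excerpt is not None:
            return
        attributes = dict(attrs)
        is_description = (
            attributes.get("name") == "description"
            or attributes.get("property") == "og:description"
        )
        # Ignore matching tags whose content attribute is missing or empty.
        if is_description and attributes.get("content"):
            self.excerpt = attributes["content"]


def find_excerpt(html: str) -> str | None:
    """Return the description text, or None when no such meta tag exists."""
    finder = _ExcerptFinder()
    finder.feed(html)
    finder.close()
    return finder.excerpt


print(find_excerpt('<meta name="description" content="Short summary">'))
```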

core.parser.ParsedHTML.get_date

get_date() -> str | None

Find HTML tags that may contain the page date.

Looks for HTML tags:

<meta name="date" content="...">
<meta name="publishDate" content="...">
<meta property="article:published_time" content="...">
<meta property="article:modified_time" content="...">
<meta name="dc.date" content="...">
<time datetime="...">
<relative-time datetime="...">
<div class="dateline">...</div>
<script type="application/ld+json">{"dateModified":"...", }</script> (Wikipedia)

RETURNS	DESCRIPTION
`str | None`	The date string found in one of the tags above, if any.
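
A rough, standard-library-only sketch of this kind of multi-source lookup follows. `find_date` and `_DateFinder` are hypothetical names; taking the first hit in document order and matching meta names case-insensitively are assumptions, and the `<div class="dateline">` case is left out for brevity.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser

# Compared in lowercase, so "publishDate" is listed as "publishdate".
META_NAMES = {"date", "publishdate", "dc.date"}
META_PROPERTIES = {"article:published_time", "article:modified_time"}


class _DateFinder(HTMLParser):
    """Hypothetical helper: keeps the first date-like value it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.date: str | None = None
        self._in_json_ld = False
        self._json_chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if self.date is not None:
            return  # a date was already found earlier in the document
        attributes = dict(attrs)
        if tag == "meta":
            name = (attributes.get("name") or "").lower()
            prop = (attributes.get("property") or "").lower()
            if name in META_NAMES or prop in META_PROPERTIES:
                self.date = attributes.get("content") or None
        elif tag in ("time", "relative-time"):
            self.date = attributes.get("datetime") or None
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._json_chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._json_chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_json_ld:
            return
        self._in_json_ld = False
        try:
            payload = json.loads("".join(self._json_chunks))
        except json.JSONDecodeError:
            return  # malformed JSON-LD: ignore this block
        if self.date is None and isinstance(payload, dict):
            value = payload.get("dateModified")
            if isinstance(value, str):
                self.date = value


def find_date(html: str) -> str | None:
    """Return the raw date string, or None when nothing matches."""
    finder = _DateFinder()
    finder.feed(html)
    finder.close()
    return finder.date


print(find_date('<time datetime="2022-01-15">Jan 15</time>'))
```

The value is returned as found in the page; turning it into a `datetime` object would be a separate step.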

core.parser.ParsedHTML.get_lang

get_lang() -> str | None

Attempt to find the page language.