core.parser¤

Classes¤

core.parser.ParsedHTML dataclass ¤

ParsedHTML(
    soup: BeautifulSoup,
    links: list[str] = list(),
    h1: set[str] = set(),
    h2: set[str] = set(),
    content: str | None = str(),
    title: str | None = str(),
    excerpt: str | None = str(),
    date: str | None = str(),
    lang: str | None = str(),
    scripts: list[str] = list(),
)

Wrapper around BeautifulSoup with precomputed crawler metadata.

Functions¤
core.parser.ParsedHTML.from_html classmethod ¤
from_html(html: str, parser: str = 'html5lib') -> ParsedHTML

Build ParsedHTML from raw HTML string.

core.parser.ParsedHTML.get_page_markup ¤
get_page_markup(
    markup: str | tuple | list[str] | list[tuple] | None,
) -> str | None

Extract the text content of an HTML page by targeting only the specified tags.

PARAMETER DESCRIPTION
markup

Any tag selector supported by bs4.Tag.find_all:

  • (str): a single tag to select. For example, "body" will select <body>...</body>.
  • (tuple): a tag and the attributes to match. For example, ("div", { "class": "right" }) will select <div class="right">...</div>.
  • (list): any combination of the above, chained in a list.
  • None: don’t parse the page’s internal content. Links, h1 and h2 headers will still be parsed.

TYPE: str | tuple | list[str] | list[tuple] | None

RETURNS DESCRIPTION
str | None

The text content of all instances of all tags in markup as a single string, if any, else an empty string.

Examples:

>>> parsed = ParsedHTML.from_html(html)  # html: raw HTML string
>>> parsed.get_page_markup("article")
>>> parsed.get_page_markup(["h1", "h2", "h3", "article"])
>>> parsed.get_page_markup([("div", {"id": "content"}), "details", ("div", {"class": "comment-reply"})])

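The accepted shapes of markup can be easier to see once they are reduced to a single form. The following is a minimal sketch, not the library's implementation: normalize_markup is a hypothetical helper that turns any accepted value into a list of (tag, attrs) pairs, each of which could then be passed to find_all as find_all(tag, attrs=attrs).

```python
from typing import Any

# Hypothetical alias: one selector is a tag name plus an attribute filter.
Selector = tuple[str, dict[str, Any]]


def normalize_markup(markup) -> list[Selector]:
    """Turn any accepted `markup` value into a list of (tag, attrs) pairs.

    Illustrative helper only; it is not part of core.parser.
    """
    if markup is None:
        return []  # no content selectors requested
    if isinstance(markup, (str, tuple)):
        markup = [markup]  # a single selector becomes a one-item list
    selectors: list[Selector] = []
    for item in markup:
        if isinstance(item, str):
            selectors.append((item, {}))  # bare tag name, no attribute filter
        else:
            tag, attrs = item
            selectors.append((tag, dict(attrs)))
    return selectors


print(normalize_markup(["h1", ("div", {"id": "content"})]))
```

Normalizing up front keeps the extraction loop free of type checks, which is one plausible way to support this many input shapes.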
core.parser.ParsedHTML.get_excerpt ¤
get_excerpt() -> str | None

Find HTML tags possibly containing the shortened version of the page content.

Looks for HTML tags:

  • <meta name="description" content="...">
  • <meta property="og:description" content="...">

RETURNS DESCRIPTION
str | None

The content of the meta tag if any.

core.parser.ParsedHTML.get_date ¤
get_date() -> str | None

Find HTML tags possibly containing the page date.

Looks for HTML tags:

  • <meta name="date" content="...">
  • <meta name="publishDate" content="...">
  • <meta property="article:published_time" content="...">
  • <meta property="article:modified_time" content="...">
  • <meta name="dc.date" content="...">
  • <time datetime="...">
  • <relative-time datetime="...">
  • <div class="dateline">...</div>
  • <script type="application/ld+json">{"dateModified":"...", }</script> (Wikipedia)

RETURNS DESCRIPTION
str | None

The date string found in one of the tags above, if any.

core.parser.ParsedHTML.get_lang ¤
get_lang() -> str | None

Attempt to find the page language.

Functions¤