core.parser
Classes
core.parser.ParsedHTML
dataclass
ParsedHTML(
soup: BeautifulSoup,
links: list[str] = list(),
h1: set[str] = set(),
h2: set[str] = set(),
content: str | None = str(),
title: str | None = str(),
excerpt: str | None = str(),
date: str | None = str(),
lang: str | None = str(),
scripts: list[str] = list(),
)
Wrapper around BeautifulSoup with precomputed crawler metadata.
Functions
core.parser.ParsedHTML.from_html
classmethod
from_html(html: str, parser: str = 'html5lib') -> ParsedHTML
Build ParsedHTML from raw HTML string.
core.parser.ParsedHTML.get_page_markup
Extract the text content of an HTML page DOM by targeting only the specified tags.
| PARAMETER | DESCRIPTION |
|---|---|
| markup | Any kind of tags supported by bs4.Tag.find_all. |
| RETURNS | DESCRIPTION |
|---|---|
| str \| None | The text content of every instance of the tags in markup, joined into a single string, or an empty string if there is none. |
Examples:
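The following is a minimal sketch of the idea, not the actual implementation. It assumes the method hands markup to find_all on the parsed soup and joins the text of every match. The helper name page_markup_text, the single-space separator, and the sample HTML are illustrative assumptions. The built-in html.parser backend is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup


def page_markup_text(soup: BeautifulSoup, markup) -> str:
    """Join the text of every element matched by find_all(markup).

    Hypothetical stand-in for ParsedHTML.get_page_markup; the real
    method may differ in separator and whitespace handling.
    """
    elements = soup.find_all(markup)
    # get_text(" ", strip=True) flattens nested tags into one clean string.
    texts = [element.get_text(" ", strip=True) for element in elements]
    # Drop empty matches; the result is "" when nothing matched.
    return " ".join(text for text in texts if text)


html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> item.</p><p>Second.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

paragraphs = page_markup_text(soup, "p")                       # a single tag name
headings_and_paragraphs = page_markup_text(soup, ["h1", "p"])  # a list of tag names
nothing = page_markup_text(soup, "table")                      # no match

print(paragraphs)
print(headings_and_paragraphs)
```

Both a single tag name and a list of names are accepted because find_all accepts either form; matches come back in document order.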
core.parser.ParsedHTML.get_excerpt
get_excerpt() -> str | None
Find HTML tags possibly containing the shortened version of the page content.
| RETURNS | DESCRIPTION |
|---|---|
| str \| None | The content of the meta tag, if any. |
core.parser.ParsedHTML.get_date
get_date() -> str | None
Find HTML tags possibly containing the page date.
Looks for HTML tags:
- <meta name="date" content="...">
- <meta name="publishDate" content="...">
- <meta property="article:published_time" content="...">
- <meta property="article:modified_time" content="...">
- <meta name="dc.date" content="...">
- <time datetime="...">
- <relative-time datetime="...">
- <div class="dateline">...</div>
- <script type="application/ld+json">{"dateModified":"...", }</script> (Wikipedia)
| RETURNS | DESCRIPTION |
|---|---|
| str \| None | The date string found in one of the tags above, if any. |
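As a rough illustration of this kind of lookup, the sketch below scans a page for the meta, time, and relative-time tags listed above and returns the first date string it finds. It is not the ParsedHTML implementation: it uses the standard library html.parser instead of BeautifulSoup so that it runs without extra dependencies, it skips the dateline and JSON-LD cases, and the names DateFinder and find_date as well as the "first match in document order" rule are assumptions.

```python
from html.parser import HTMLParser
from typing import Optional

# (attribute, value) pairs that mark a <meta> tag as carrying a date.
META_DATE_KEYS = {
    ("name", "date"),
    ("name", "publishDate"),
    ("property", "article:published_time"),
    ("property", "article:modified_time"),
    ("name", "dc.date"),
}


class DateFinder(HTMLParser):
    """Collect candidate date strings in document order (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            # A meta tag qualifies when one of its attributes matches a known key.
            is_date = any(attributes.get(key) == value for key, value in META_DATE_KEYS)
            if is_date and attributes.get("content"):
                self.candidates.append(attributes["content"])
        elif tag in ("time", "relative-time") and attributes.get("datetime"):
            self.candidates.append(attributes["datetime"])


def find_date(html: str) -> Optional[str]:
    """Return the first date-like string found, or None (assumed policy)."""
    finder = DateFinder()
    finder.feed(html)
    return finder.candidates[0] if finder.candidates else None


page = (
    '<head><meta property="article:published_time" content="2024-05-01T10:00:00Z"></head>'
    '<body><time datetime="2024-05-02">May 2</time></body>'
)
print(find_date(page))
print(find_date("<p>No date here</p>"))
```

The returned value is the raw attribute string; parsing it into a datetime object is left to the caller, since the formats used by these tags vary between sites.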