core.network¤

Classes¤

core.network.DelayedClass ¤

DelayedClass(protocol: str, domain: str, delay: float, timeout: float)

Bases: ABC

Abstract base class for any implementation holding an internal timer that ensures a given action is not triggered more often than once every `delay` seconds.

Attributes¤

core.network.DelayedClass.robots_txt `class-attribute` `instance-attribute` ¤

robots_txt: dict[str, Protego] = {}

Dictionary mapping each domain to its parsed robots.txt

core.network.DelayedClass.last_requests `class-attribute` `instance-attribute` ¤

last_requests: dict[str, float] = {}

Dictionary mapping each domain to the timestamp of the last request sent to it

core.network.DelayedClass.domain_thresholds `class-attribute` `instance-attribute` ¤

domain_thresholds: dict[str, float] = {}

Dictionary of known domains, remembering the rate thresholds set by their robots.txt

core.network.DelayedClass.main_domain `class-attribute` `instance-attribute` ¤

main_domain: str | None = domain

Main domain from where we crawl, either the one holding the sitemap or the one at the root of the recursion.

core.network.DelayedClass.delay `class-attribute` `instance-attribute` ¤

delay: float = self.get_crawling_rate(domain, robots_txt)

Minimum delay, in seconds, between two requests
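As a rough illustration of how a per-domain delay can be derived from robots.txt, here is a minimal sketch. It uses the standard library's `urllib.robotparser` as a stand-in for Protego, and the helper name `crawl_delay_from_robots` as well as the 1-second default are assumptions made for the example, not part of `core.network`.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_from_robots(
    robots_body: str, user_agent: str = "*", default: float = 1.0
) -> float:
    """Return the Crawl-delay declared for `user_agent`, or `default` if none.

    Hypothetical helper: the real get_crawling_rate() may behave differently.
    """
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    declared = parser.crawl_delay(user_agent)
    return float(declared) if declared is not None else default


robots = """User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

# A robots.txt that declares a crawl delay, then one that does not.
print(crawl_delay_from_robots(robots))
print(crawl_delay_from_robots("User-agent: *\nDisallow:\n"))
```

Note that the standard-library parser only accepts integer `Crawl-delay` values, so fractional delays fall back to the default in this sketch.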

Methods:¤

core.network.DelayedClass.sleep ¤

sleep(domain: str, overwrite: float | None = None)

Sleep for at most the time remaining before the delay expires.

PARAMETER	DESCRIPTION
`overwrite`	temporary value overriding the delay for this call TYPE: `float \| None` DEFAULT: `None`
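The behaviour described above can be sketched as follows. `MinimalDelay` and its `remaining` method are hypothetical names introduced for the example; the sketch assumes that the time already elapsed since the last request to a domain is subtracted from the delay, which is one plausible reading of "at most the remaining time".

```python
from __future__ import annotations

import time


class MinimalDelay:
    """Hypothetical, simplified stand-in for a DelayedClass implementation."""

    def __init__(self, delay: float):
        self.delay = delay
        # domain -> monotonic timestamp of the last request
        self.last_requests: dict[str, float] = {}

    def remaining(self, domain: str, overwrite: float | None = None) -> float:
        """Seconds still to wait before `domain` may be requested again."""
        last = self.last_requests.get(domain)
        if last is None:
            return 0.0  # never requested: no need to wait
        threshold = overwrite if overwrite is not None else self.delay
        return max(0.0, threshold - (time.monotonic() - last))

    def sleep(self, domain: str, overwrite: float | None = None) -> None:
        """Wait out the remaining delay, then record the new request time."""
        wait = self.remaining(domain, overwrite)
        if wait > 0:
            time.sleep(wait)
        self.last_requests[domain] = time.monotonic()


limiter = MinimalDelay(delay=0.05)
limiter.sleep("example.org")  # first call returns immediately
limiter.sleep("example.org")  # second call waits up to 0.05 s
```

Keeping timestamps per domain means requests to different domains never delay each other.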

core.network.HTTPResponse `dataclass` ¤

HTTPResponse(
    url: str,
    status_code: int,
    headers: dict,
    content: bytes | None,
    encoding: str,
    apparent_encoding: str,
    text: str,
    raw_response: object,
    error_type: str | None = None,
)

Attributes¤

core.network.HTTPResponse.url `instance-attribute` ¤

url: str

Final URL after redirection, if any; otherwise the initial URL

core.network.HTTPResponse.status_code `instance-attribute` ¤

status_code: int

HTTP response status code

core.network.HTTPResponse.headers `instance-attribute` ¤

headers: dict

Server response HTTP headers

core.network.HTTPResponse.content `instance-attribute` ¤

content: bytes | None

Page content as bytes

core.network.HTTPResponse.text `instance-attribute` ¤

text: str

Page content as text, if possible

core.network.HTTPResponse.raw_response `instance-attribute` ¤

raw_response: object

Original curl_cffi or httpx Response object

core.network.HTTPResponse.error_type `class-attribute` `instance-attribute` ¤

error_type: str | None = None

`'dns'` | `'connection'` | `None`

Functions:¤

core.network.is_hard_error ¤

is_hard_error(response: HTTPResponse) -> bool

True when retrying URL variants or spoofing headers is pointless.
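One way such a check could work is sketched below. It assumes that a "hard" error is one where `error_type` is set to one of the documented values (`'dns'` or `'connection'`); the actual criteria used by `is_hard_error` are not stated here, and `MiniResponse` is a hypothetical, trimmed-down stand-in for `HTTPResponse`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MiniResponse:
    """Hypothetical stand-in carrying only the fields this sketch needs."""

    status_code: int
    error_type: str | None = None


def is_hard_error_sketch(response: MiniResponse) -> bool:
    """Assumed rule: DNS and connection failures cannot be fixed by retrying
    other URL variants or other headers, since no server was reached."""
    return response.error_type in ("dns", "connection")


print(is_hard_error_sketch(MiniResponse(status_code=0, error_type="dns")))
print(is_hard_error_sketch(MiniResponse(status_code=403)))
```

A 403 is treated as "soft" in this sketch because a different set of headers may still be accepted by the server.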

core.network.wrap_response ¤

wrap_response(r: requests.Response | httpx.Response) -> HTTPResponse

Unify responses from httpx and curl_cffi into a single `HTTPResponse` type

core.network.request ¤

request(method, url, timeout=30, headers=None) -> HTTPResponse

Try curl_cffi first, fall back to httpx on ANY transport-level failure.
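The try-then-fall-back pattern can be sketched independently of the two HTTP clients. In the example below, `request_with_fallback`, `flaky_primary` and `steady_fallback` are hypothetical names, and plain callables stand in for the curl_cffi and httpx transports.

```python
from typing import Callable

Transport = Callable[[str, str], str]


def request_with_fallback(
    primary: Transport, fallback: Transport, method: str, url: str
) -> str:
    """Call `primary`; on any exception, retry once with `fallback`."""
    try:
        return primary(method, url)
    except Exception:
        # Deliberately broad: any failure of the first transport
        # triggers the second one.
        return fallback(method, url)


def flaky_primary(method: str, url: str) -> str:
    raise ConnectionError("simulated TLS handshake failure")


def steady_fallback(method: str, url: str) -> str:
    return f"{method} {url} via fallback"


print(request_with_fallback(flaky_primary, steady_fallback, "HEAD", "https://example.org/"))
```

Errors raised by the fallback itself are left to propagate, so the caller still sees a failure when both transports fail.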

core.network.try_url ¤

try_url(
    url,
    delay: DelayedClass,
    timeout: int | float = 30,
    bypass_robots_txt: bool = False,
) -> tuple[HTTPResponse | None, dict | None, str]

Probe the URL head, without getting the content.

This will:

- resolve redirections,
- check with robots.txt (if any) whether we have permission to crawl, and at what rate,
- fall back to web.archive.org when hitting a 404 (not found) error,
- handle request rate limiting,
- find out which header spoofing combination is accepted by the server (or by Cloudflare) when robots.txt didn’t block us explicitly but the server/proxy returned a 403 (forbidden) error.

PARAMETER	DESCRIPTION
`delay`	object holding the rate-limiting timer/delay method. TYPE: `DelayedClass`
`timeout`	abort any connection that takes longer than this (in seconds) to finish. Too small a value might cancel the loading of large PDFs. TYPE: `int \| float` DEFAULT: `30`
`bypass_robots_txt`	don’t check whether robots.txt allows us to crawl the current page, which saves some requests. When crawling pages from `sitemap.xml`, we can reasonably assume that all pages listed there are allowed. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`response`	the HTTP response object, TYPE: `HTTPResponse \| None`
`headers`	the HTTP client headers that succeeded in spoofing the server, if any, TYPE: `dict \| None`
`url`	the final, redirected, URL (can be the same as the input one). TYPE: `str`
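The header spoofing step listed above amounts to trying candidate header sets until the server stops answering 403. The sketch below illustrates that loop only; `find_accepted_headers`, `fake_server` and the candidate list are assumptions made for the example and do not reflect the combinations actually tried by `try_url`.

```python
from typing import Callable, Iterable, Optional


def find_accepted_headers(
    probe: Callable[[dict], int], candidates: Iterable[dict]
) -> Optional[dict]:
    """Return the first header set for which `probe` does not answer 403."""
    for headers in candidates:
        if probe(headers) != 403:
            return headers
    return None  # every combination was rejected


def fake_server(headers: dict) -> int:
    # Hypothetical server rejecting clients without a browser-like User-Agent.
    return 200 if "Mozilla" in headers.get("User-Agent", "") else 403


CANDIDATES = [
    {},
    {"User-Agent": "my-crawler/1.0"},
    {"User-Agent": "Mozilla/5.0", "Accept-Language": "en"},
]

print(find_accepted_headers(fake_server, CANDIDATES))
```

Returning the winning header set, as `try_url` does with its `headers` value, lets later requests to the same server reuse it without probing again.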

core.network.get_url ¤

get_url(
    url: str, delay: DelayedClass, timeout=60, custom_header={}
) -> tuple[bytes | None, str, int, str, str]

Get the content of a URL using requests. `.pdf`, `.xml` and `.txt` URLs always use requests.

RETURNS	DESCRIPTION
`content`	the raw DOM, TYPE: `bytes \| None`
`url`	the final URL after possible redirections TYPE: `str`
`status`	the HTTP status code (integer) TYPE: `int`
`encoding`	the response encoding TYPE: `str`
`apparent_encoding`	the apparent encoding TYPE: `str`
