core.network ¤

Classes¤

core.network.DelayedClass ¤

DelayedClass(protocol: str, domain: str, delay: float, timeout: float)

Bases: ABC

Abstract class for any implementation with an internal timer that won’t trigger a given action more often than once every delay seconds.

Attributes¤
core.network.DelayedClass.robots_txt class-attribute instance-attribute ¤
robots_txt: dict[str, Protego] = {}

Dictionary mapping each domain to its parsed robots.txt

core.network.DelayedClass.last_requests class-attribute instance-attribute ¤
last_requests: dict[str, float] = {}

Dictionary mapping each domain to the timestamp of the last request sent to it

core.network.DelayedClass.domain_thresholds class-attribute instance-attribute ¤
domain_thresholds: dict[str, float] = {}

Dictionary of known domains, remembering the rate thresholds read from their robots.txt

core.network.DelayedClass.main_domain class-attribute instance-attribute ¤
main_domain: str | None = domain

Main domain from where we crawl, either the one holding the sitemap or the one at the root of the recursion.

core.network.DelayedClass.delay class-attribute instance-attribute ¤
delay: float = self.get_crawling_rate(domain, robots_txt)

Minimum delay, in seconds, between two requests

Functions¤
core.network.DelayedClass.sleep ¤
sleep(domain: str, overwrite: float | None = None)

Sleep for at most the time remaining before the next request is allowed.

PARAMETER DESCRIPTION
overwrite

temporary timeout overwrite

TYPE: float | None DEFAULT: None
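
As an illustration of the timer described above, here is a minimal sketch of a per-domain throttle. It is not the actual DelayedClass implementation: the class name SimpleDelay, the remaining helper and the use of time.monotonic are assumptions made for the example; only the delay, last_requests and overwrite names come from the reference.

```python
from __future__ import annotations

import time


class SimpleDelay:
    """Hypothetical stand-in for a DelayedClass-like per-domain timer."""

    def __init__(self, delay: float):
        self.delay = delay
        # Maps each domain to the monotonic timestamp of its last request.
        self.last_requests: dict[str, float] = {}

    def remaining(self, domain: str, overwrite: float | None = None) -> float:
        """Seconds left to wait before `domain` may be requested again."""
        wait = self.delay if overwrite is None else overwrite
        last = self.last_requests.get(domain)
        if last is None:
            return 0.0  # Never requested: nothing to wait for.
        return max(0.0, wait - (time.monotonic() - last))

    def sleep(self, domain: str, overwrite: float | None = None) -> None:
        """Sleep for at most the remaining delay, then record the request."""
        time.sleep(self.remaining(domain, overwrite))
        self.last_requests[domain] = time.monotonic()


throttle = SimpleDelay(delay=0.2)
throttle.sleep("example.org")  # First call: no previous request, returns at once.
start = time.monotonic()
throttle.sleep("example.org")  # Second call: waits for the rest of the delay.
elapsed = time.monotonic() - start
```

Because only the remaining time is slept, time already spent parsing the previous page counts toward the delay, so the crawler is never slowed down more than necessary.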

core.network.HTTPResponse dataclass ¤

HTTPResponse(
    url: str,
    status_code: int,
    headers: dict,
    content: bytes | None,
    encoding: str,
    apparent_encoding: str,
    text: str,
    raw_response: object,
    error_type: str | None = None,
)
Attributes¤
core.network.HTTPResponse.url instance-attribute ¤
url: str

Redirected URL, if redirection, else initial URL

core.network.HTTPResponse.status_code instance-attribute ¤
status_code: int

HTTP response return code

core.network.HTTPResponse.headers instance-attribute ¤
headers: dict

Server response HTTP headers

core.network.HTTPResponse.content instance-attribute ¤
content: bytes | None

Page content as bytes

core.network.HTTPResponse.text instance-attribute ¤
text: str

Page content as text, if possible

core.network.HTTPResponse.raw_response instance-attribute ¤
raw_response: object

Original curl_cffi or httpx Response object

core.network.HTTPResponse.error_type class-attribute instance-attribute ¤
error_type: str | None = None

'dns' | 'connection' | None

Functions¤

core.network.is_hard_error ¤

is_hard_error(response: HTTPResponse) -> bool

True when retrying URL variants or spoofing headers is pointless.
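
One plausible reading of this check, given the error_type values documented on HTTPResponse, is sketched below. This is an assumption, not the library's actual logic: MiniResponse and looks_like_hard_error are hypothetical names, and the real function may take other conditions into account.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MiniResponse:
    """Hypothetical, trimmed-down stand-in for HTTPResponse."""

    url: str
    status_code: int
    error_type: str | None = None


def looks_like_hard_error(response: MiniResponse) -> bool:
    # Assumption: a DNS or connection failure happens before the server can
    # see any header, so changing headers or URL variants cannot help.
    return response.error_type in ("dns", "connection")


dns_failure = MiniResponse("https://example.invalid", 0, error_type="dns")
forbidden = MiniResponse("https://example.org", 403)
```

Under that assumption, a 403 response is not a hard error: the server answered, so another header combination may still succeed.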

core.network.wrap_response ¤

wrap_response(r: requests.Response | httpx.Response) -> HTTPResponse

Unify responses from httpx and curl_cffi into a single HTTPResponse type.

core.network.request ¤

request(method, url, timeout=30, headers=None) -> HTTPResponse

Try curl_cffi first, then fall back to httpx on any transport-level failure.
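
The try-first-then-fall-back pattern can be sketched without any network access by injecting the two clients as callables. Everything below is hypothetical (request_with_fallback, flaky_client, backup_client), and treating OSError as the transport-level error class is an assumption made for the example, not the set of exceptions the real function catches.

```python
from __future__ import annotations

from typing import Callable


def request_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    url: str,
    transport_errors: tuple[type[BaseException], ...] = (OSError,),
) -> str:
    """Call `primary`; on a transport-level error, retry once with `fallback`."""
    try:
        return primary(url)
    except transport_errors:
        # Errors raised by the fallback are deliberately left to propagate.
        return fallback(url)


def flaky_client(url: str) -> str:
    # Simulates a client whose connection fails (ConnectionError is an OSError).
    raise ConnectionError("simulated transport failure")


def backup_client(url: str) -> str:
    return f"fetched {url} via backup"


result = request_with_fallback(flaky_client, backup_client, "https://example.org")
```

Only transport-level exceptions trigger the fallback here; anything else raised by the primary client, such as a programming error, still surfaces to the caller.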

core.network.try_url ¤

try_url(
    url,
    delay: DelayedClass,
    timeout: int | float = 30,
    bypass_robots_txt: bool = False,
) -> tuple[HTTPResponse | None, dict | None, str]

Probe the URL head, without getting the content.

This will:
  1. resolve redirections,
  2. check robots.txt (if any) to see whether we have permission to crawl and at what rate,
  3. fall back to web.archive.org on a 404 (not found) error,
  4. handle request rate limiting,
  5. find out which header-spoofing combination is accepted by the server (or by Cloudflare) when robots.txt didn’t block us explicitly but the server/proxy returned a 403 (forbidden) error.
PARAMETER DESCRIPTION
delay

class holding a thresholding timer/delay method,

TYPE: DelayedClass

timeout

abort any connection that takes longer than this (in seconds) to finish. A value that is too small may cancel the loading of large PDFs.

TYPE: int | float DEFAULT: 30

bypass_robots_txt

don’t check if robots.txt allows us to crawl the current page. This spares us some requests. When crawling pages from sitemap.xml, we can safely assume that all pages listed there are allowed, or the webmaster is an idiot.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
response

the HTTP response object,

TYPE: HTTPResponse | None

headers

the HTTP client headers that succeeded in spoofing the server, if any,

TYPE: dict | None

url

the final, redirected, URL (can be the same as the input one).

TYPE: str
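
The robots.txt step in the list above (permission and crawl rate) can be illustrated with a small, offline sketch. The library stores Protego objects in robots_txt; to keep the example dependency-free, this sketch swaps in the standard library's urllib.robotparser instead. The robots.txt body, the "my-crawler" user agent and the example.org URLs are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt body; a crawler would download it from the domain root.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Permission check for two URLs on the same (hypothetical) domain.
allowed = parser.can_fetch("my-crawler", "https://example.org/docs/page.html")
blocked = parser.can_fetch("my-crawler", "https://example.org/private/page.html")

# Requested delay between requests, or None if robots.txt does not set one.
crawl_delay = parser.crawl_delay("my-crawler")
```

A crawler following this approach would skip the blocked URL and use crawl_delay, when it is not None, as the minimum delay for that domain.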

core.network.get_url ¤

get_url(
    url: str, delay: DelayedClass, timeout=60, custom_header={}
) -> tuple[bytes | None, str, int, str, str]

Get the content of a URL using requests. .pdf, .xml and .txt URLs always use requests.

Returns, in order:
  1. content: the raw DOM,
  2. url: the final URL after possible redirections,
  3. status: the HTTP code (integer),
  4. encoding,
  5. apparent encoding.