core.network¤
Classes¤
core.network.DelayedClass
¤
Bases: ABC
Abstract class for any implementation with an internal timer that won’t trigger a given action more often than once every delay seconds.
Attributes¤
core.network.DelayedClass.robots_txt
class-attribute
instance-attribute
¤
Dictionary mapping each domain to its robots.txt
core.network.DelayedClass.last_requests
class-attribute
instance-attribute
¤
Dictionary mapping each domain to the timestamp of the last request sent to it
core.network.DelayedClass.domain_thresholds
class-attribute
instance-attribute
¤
Dictionary of known domains, remembering the rate thresholds read from their robots.txt
core.network.DelayedClass.main_domain
class-attribute
instance-attribute
¤
main_domain: str | None = domain
Main domain from which we crawl: either the one holding the sitemap or the one at the root of the recursion.
core.network.DelayedClass.delay
class-attribute
instance-attribute
¤
delay: float = self.get_crawling_rate(domain, robots_txt)
Minimum delay (in seconds) between two requests
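To illustrate how the attributes above can work together, here is a minimal sketch of a per-domain throttle. It is not the actual DelayedClass implementation: the `SimpleThrottle` class, its `wait` method and the injectable `clock`/`sleep` arguments are hypothetical names introduced for this example only. Only `delay` and `last_requests` mirror documented attributes.

```python
import time
from urllib.parse import urlparse


class SimpleThrottle:
    """Hypothetical stand-in for a DelayedClass-like rate limiter."""

    def __init__(self, delay: float, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay        # minimum seconds between two requests to one domain
        self.last_requests = {}   # domain -> timestamp of the last request
        self._clock = clock       # injectable for testing (assumption of this sketch)
        self._sleep = sleep

    def wait(self, url: str) -> float:
        """Pause until `delay` seconds have passed since the last request
        to the URL's domain. Return the number of seconds slept."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self.last_requests.get(domain)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.delay - (now - last))
            if pause > 0:
                self._sleep(pause)
        # Record the moment the request is actually allowed to go out.
        self.last_requests[domain] = now + pause
        return pause
```

Calling `wait(url)` right before each request keeps the per-domain rate under the threshold, while requests to different domains are not slowed down by each other.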
Functions¤
core.network.HTTPResponse
dataclass
¤
HTTPResponse(
url: str,
status_code: int,
headers: dict,
content: bytes | None,
encoding: str,
apparent_encoding: str,
text: str,
raw_response: object,
error_type: str | None = None,
)
Attributes¤
core.network.HTTPResponse.url
instance-attribute
¤
url: str
Final URL after redirection, if any; otherwise the initial URL
core.network.HTTPResponse.status_code
instance-attribute
¤
status_code: int
HTTP response status code
Functions¤
core.network.is_hard_error
¤
is_hard_error(response: HTTPResponse) -> bool
True when retrying URL variants or spoofing headers is pointless.
core.network.wrap_response
¤
wrap_response(r: requests.Response | httpx.Response) -> HTTPResponse
Wrap responses from HTTPX and cURL into a single, uniform HTTPResponse type
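The wrapping idea can be sketched as follows. The dataclass fields mirror the documented `HTTPResponse` signature, but `wrap_sketch` is a hypothetical helper, not the real `wrap_response`. The `"utf-8"` default for a missing encoding and the `getattr` fallbacks are assumptions of this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class HTTPResponse:
    # Fields copied from the documented constructor signature.
    url: str
    status_code: int
    headers: dict
    content: bytes | None
    encoding: str
    apparent_encoding: str
    text: str
    raw_response: object
    error_type: str | None = None


def wrap_sketch(r) -> HTTPResponse:
    """Copy the attributes that common HTTP client responses share."""
    # Assumption: default to UTF-8 when the client reports no encoding.
    encoding = getattr(r, "encoding", None) or "utf-8"
    return HTTPResponse(
        url=str(r.url),
        status_code=r.status_code,
        headers=dict(r.headers),
        content=r.content,
        encoding=encoding,
        # Not every client exposes apparent_encoding, hence the fallback.
        apparent_encoding=getattr(r, "apparent_encoding", None) or encoding,
        text=r.text,
        raw_response=r,
    )


# A fake response object stands in for a real client response.
fake = SimpleNamespace(
    url="https://example.com/",
    status_code=200,
    headers={"content-type": "text/html"},
    content=b"<html></html>",
    encoding=None,
    text="<html></html>",
)
wrapped = wrap_sketch(fake)
```

Downstream code can then rely on one set of attribute names regardless of which client produced the response.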
core.network.request
¤
request(method, url, timeout=30, headers=None) -> HTTPResponse
Try curl_cffi first, fall back to httpx on ANY transport-level failure.
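The try-first, fall-back-second pattern can be sketched generically. This is an illustration under assumptions, not the module's code: `request_with_fallback` and the `clients` argument are hypothetical, and each client is assumed to be a callable taking `(method, url, **kwargs)`.

```python
from typing import Callable, Sequence


def request_with_fallback(method: str, url: str,
                          clients: Sequence[Callable], **kwargs):
    """Try each client in order; return the first successful result."""
    if not clients:
        raise ValueError("at least one client is required")
    errors = []
    for client in clients:
        try:
            return client(method, url, **kwargs)
        except Exception as exc:  # any transport-level failure triggers the fallback
            errors.append(exc)
    # Every client failed: surface the last error as the cause.
    raise ConnectionError(
        f"all {len(errors)} clients failed for {url}"
    ) from errors[-1]
```

Catching the broad `Exception` here is deliberate: the point of the fallback is that the second client gets a chance whatever the first one choked on.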
core.network.try_url
¤
try_url(
url,
delay: DelayedClass,
timeout: int | float = 30,
bypass_robots_txt: bool = False,
) -> tuple[HTTPResponse | None, dict | None, str]
Probe the URL head, without getting the content.
This will:
- resolve redirections,
- check with robots.txt (if any) whether we have permission to crawl, and at what rate,
- fall back to web.archive.org when hitting a 404 (not found) error,
- handle request rate thresholding,
- find out which header-spoofing combination is accepted by the server (or by Cloudflare) when robots.txt didn’t block us explicitly but the server/proxy returned a 403 (forbidden) error.
| PARAMETER | DESCRIPTION |
|---|---|
| delay | class holding a thresholding timer/delay method. TYPE: DelayedClass |
| timeout | abort any connection that takes longer than this (in seconds) to finish. That might cancel loading large PDFs if too small. |
| bypass_robots_txt | don’t check if robots.txt allows us to crawl the current page. That makes us spare some requests. When crawling pages from TYPE: bool |
| RETURNS | DESCRIPTION |
|---|---|
| response | the HTTP response object. TYPE: HTTPResponse \| None |
| headers | the HTTP client headers that succeeded in spoofing the server, if any. TYPE: dict \| None |
| url | the final, redirected, URL (can be the same as the input one). TYPE: str |
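The robots.txt step (permission plus crawl rate) can be sketched with the standard library's `urllib.robotparser`. This is only an illustration of the idea, not how `try_url` is implemented: `robots_policy`, its `default_delay` argument and the sample `ROBOTS_TXT` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private
Crawl-delay: 5
"""


def robots_policy(robots_txt: str, url: str, user_agent: str = "*",
                  default_delay: float = 1.0) -> tuple[bool, float]:
    """Return (allowed, delay_in_seconds) for `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when no Crawl-delay rule applies.
    crawl_delay = parser.crawl_delay(user_agent)
    delay = float(crawl_delay) if crawl_delay is not None else default_delay
    return allowed, delay
```

For instance, `robots_policy(ROBOTS_TXT, "https://example.com/private/page")` reports the page as disallowed, and the returned delay could feed a rate limiter such as the `delay` attribute of DelayedClass.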
core.network.get_url
¤
get_url(
url: str, delay: DelayedClass, timeout=60, custom_header={}
) -> tuple[bytes | None, str, int, str, str]
Get the content of a URL using requests.
.pdf, .xml and .txt URLs always use requests.
Returns:
- content: the raw DOM,
- url: the final URL after possible redirections,
- status: the HTTP code (integer),
- encoding:
- apparent encoding:
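Since the content comes back as raw bytes (or None) alongside two encoding hints, a caller will typically need to decode it. One possible way is sketched below. `decode_content` is a hypothetical helper, not part of core.network, and the order of the fallbacks (declared encoding, then apparent encoding, then UTF-8) is an assumption.

```python
from __future__ import annotations


def decode_content(content: bytes | None, encoding: str,
                   apparent_encoding: str) -> str:
    """Decode raw bytes, trying each encoding hint in turn."""
    if content is None:
        return ""
    for candidate in (encoding, apparent_encoding, "utf-8"):
        if not candidate:
            continue  # skip empty hints
        try:
            return content.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid for it: try the next hint.
            continue
    # Last resort: never fail, replace undecodable bytes instead.
    return content.decode("utf-8", errors="replace")
```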