core.utils
Logging and filter finding utilities.
© 2022-2023 - Aurélien Pierre
Attributes
core.utils.filter_entry
module-attribute
core.utils.filter_bank
module-attribute
filter_bank = dict[int, filter_entry]
Dictionary type of core.utils.filter_entry elements associated with their priority in the bank.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| key | priority. TYPE: int |
| value | filter data. TYPE: filter_entry |
Classes
Functions
core.utils.match_filter_name
match_filter_name(file: str, mode: filter_mode)
Check if the current filter file matches the requested mode.
| PARAMETER | DESCRIPTION |
|---|---|
| file | filter file to test. TYPE: str |
| mode | filter type. TYPE: filter_mode |

| RETURNS | DESCRIPTION |
|---|---|
| match | whether the file matches the requested mode |
core.utils.find_filters
find_filters(path: str, filters: filter_bank, mode: filter_mode) -> filter_bank
Find all the filter files in a directory (that is, filenames matching the filter name pattern) and append them to the dictionary of filters based on their priority. If two filters with the same priority are found, the first-defined one takes precedence and the other is discarded.
| PARAMETER | DESCRIPTION |
|---|---|
| path | the folder where to find filter files. TYPE: str |
| filters | the dictionary to which the filters found here are appended. Its keys are the integer priorities of the filters (their running order). If filters with the same priority are found in the current path, former filters are overridden. TYPE: filter_bank |
| mode | the type of filter. TYPE: filter_mode |
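The priority-keyed lookup described above can be sketched as follows. This is a minimal illustration, not the module's code: the `<priority>-<mode>-<name>.py` filename pattern, the `find_filters_sketch` name, and the fields stored in each entry are all assumptions, since the actual name pattern and the layout of `filter_entry` are not documented here. The sketch follows the first-defined-wins rule, taking "first" to mean alphabetical order.

```python
import os
import re

# Hypothetical naming scheme "<priority>-<mode>-<name>.py";
# the real filter name pattern is not documented on this page.
FILTER_PATTERN = re.compile(r"^(\d+)-(\w+)-.+\.py$")


def find_filters_sketch(path: str, filters: dict, mode: str) -> dict:
    """Add the filter files found in `path` to `filters`, keyed by integer priority."""
    for name in sorted(os.listdir(path)):
        match = FILTER_PATTERN.match(name)
        if match is None or match.group(2) != mode:
            continue  # not a filter file, or a filter for another mode
        priority = int(match.group(1))
        # setdefault() keeps an entry that already exists for this priority,
        # so the first-defined filter wins and later duplicates are discarded.
        filters.setdefault(
            priority,
            {"path": os.path.join(path, name), "filter": name},  # hypothetical fields
        )
    return filters
```

Running the filters in order then amounts to iterating over `sorted(filters)`.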
core.utils.lock_subfolder
lock_subfolder(lockfile: str)
Write a .lock text file in the subfolder being currently processed, with the PID of the current Virtual Secretary instance.
Override the lock if it contains a PID that doesn’t exist anymore on the system (Linux-only).
| PARAMETER | DESCRIPTION |
|---|---|
| lockfile | absolute path of the target lockfile. TYPE: str |
Todo
Make it work for Windows PID too.
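A PID-based lock with stale-lock detection could look like the sketch below. It is an illustration under assumptions: the function names and the boolean return value are hypothetical, and the liveness probe relies on POSIX semantics, where signal 0 only checks that the process exists. On Windows, `os.kill(pid, 0)` does something else entirely, so this is not portable as written.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Probe a PID with signal 0, which performs error checking only (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the lock is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_subfolder_sketch(lockfile: str) -> bool:
    """Write our PID to `lockfile`; return False if a live process already holds it."""
    if os.path.exists(lockfile):
        with open(lockfile) as f:
            content = f.read().strip()
        if content.isdigit() and pid_is_running(int(content)):
            return False
    # No lock, unreadable lock, or stale PID: take (or override) the lock.
    with open(lockfile, "w") as f:
        f.write(str(os.getpid()))
    return True


def unlock_subfolder_sketch(lockfile: str) -> None:
    """Remove the lockfile if it exists."""
    if os.path.exists(lockfile):
        os.remove(lockfile)
```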
core.utils.unlock_subfolder
unlock_subfolder(lockfile: str)
Remove the .lock file in current subfolder.
| PARAMETER | DESCRIPTION |
|---|---|
| lockfile | absolute path of the target lockfile. TYPE: str |
core.utils.imap_encode
Encode a Python string into IMAP-compliant UTF-7 bytes, as described in RFC 3501.
IMAP4rev1 uses a modified variant of UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII to the base64 encoding context: the modified convention uses “&” because “+” is considered common in mailbox names. See RFC 3501, section 5.1.3, for the full description.
Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.
| PARAMETER | DESCRIPTION |
|---|---|
| value | IMAP mailbox path as string |

| RETURNS | DESCRIPTION |
|---|---|
| path | IMAP-encoded path as UTF-7 |
core.utils.imap_decode
Decode IMAP-compliant UTF-7 bytes into a Python string, as described in RFC 3501.
IMAP4rev1 uses a modified variant of UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII to the base64 encoding context: the modified convention uses “&” because “+” is considered common in mailbox names. See RFC 3501, section 5.1.3, for the full description.
Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.
| PARAMETER | DESCRIPTION |
|---|---|
| value | IMAP-encoded path as UTF-7 modified for IMAP |

| RETURNS | DESCRIPTION |
|---|---|
| path | IMAP path decoded as Python string |
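To make the modified UTF-7 convention concrete, here is a compact sketch of both directions. It is written from the rules in RFC 3501, section 5.1.3 (printable ASCII passes through, “&” becomes “&-”, everything else is UTF-16BE encoded in base64 with “,” instead of “/” and no padding). It is not the imap_tools code credited above, and the `_sketch` names are hypothetical.

```python
import base64


def _shift_encode(chunk: str) -> bytes:
    """Encode a run of non-ASCII characters as '&<modified base64>-'."""
    b64 = base64.b64encode(chunk.encode("utf-16-be")).rstrip(b"=")
    return b"&" + b64.replace(b"/", b",") + b"-"


def imap_encode_sketch(value: str) -> bytes:
    out = bytearray()
    pending = ""  # current run of characters that need base64
    for ch in value:
        if 0x20 <= ord(ch) <= 0x7E:  # printable ASCII is sent as-is
            if pending:
                out += _shift_encode(pending)
                pending = ""
            out += b"&-" if ch == "&" else ch.encode("ascii")
        else:
            pending += ch
    if pending:
        out += _shift_encode(pending)
    return bytes(out)


def imap_decode_sketch(value: bytes) -> str:
    out = []
    i = 0
    while i < len(value):
        if value[i:i + 1] == b"&":
            end = value.index(b"-", i)  # the shift sequence ends at the next "-"
            chunk = value[i + 1:end]
            if not chunk:
                out.append("&")  # "&-" is a literal ampersand
            else:
                b64 = chunk.replace(b",", b"/")
                b64 += b"=" * (-len(b64) % 4)  # restore the stripped padding
                out.append(base64.b64decode(b64).decode("utf-16-be"))
            i = end + 1
        else:
            out.append(chr(value[i]))
            i += 1
    return "".join(out)


print(imap_encode_sketch("Entwürfe"))       # b'Entw&APw-rfe'
print(imap_decode_sketch(b"Entw&APw-rfe"))  # Entwürfe
```

The sketch assumes well-formed input when decoding; a production decoder would also need to handle an unterminated shift sequence.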
core.utils.typography_undo
Break correct typographic Unicode entities into dummy computer characters (ASCII) to produce computer-standard vocabulary and help word tokenizers to properly detect word boundaries.
This is useful when parsing:
1. **properly composed** text, like the output of LaTeX or SmartyPants[^1]/WP Scholar[^2],
2. text typed with Dvorak-like keyboard layouts (using proper Unicode entities where needed).
For example, the proper … ellipsis entity (Unicode U+2026 symbol) will be converted into three regular dots (...).
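The ellipsis example generalizes to a character translation table. The mapping below is a small illustrative subset chosen for this sketch, not the list of substitutions the function actually performs, and `typography_undo_sketch` is a hypothetical name.

```python
# Illustrative subset only; the actual function may cover more entities.
TYPOGRAPHY_MAP = {
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / typographic apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u00a0": " ",    # no-break space
}
_TABLE = str.maketrans(TYPOGRAPHY_MAP)


def typography_undo_sketch(text: str) -> str:
    """Replace typographic Unicode characters with plain ASCII equivalents."""
    return text.translate(_TABLE)


print(typography_undo_sketch("Wait\u2026 it\u2019s \u201cfine\u201d"))
# Wait... it's "fine"
```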
core.utils.clean_whitespaces
Collapse repeated spaces and newlines in text.
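One plausible reading of this behavior, sketched with regular expressions: runs of spaces and tabs become a single space, and runs of newlines (with any spaces between them) become a single newline. Whether the real function keeps paragraph breaks or strips the ends is not stated here, so treat those details as assumptions.

```python
import re


def clean_whitespaces_sketch(text: str) -> str:
    """Collapse runs of spaces/tabs into one space and runs of newlines into one newline."""
    text = re.sub(r"[ \t]+", " ", text)
    # A newline, optionally preceded by a space, swallows any following spaces and newlines.
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()


print(repr(clean_whitespaces_sketch("a   b\n\n\n c")))  # 'a b\nc'
```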
core.utils.sanitize_unicode
sanitize_unicode(text) -> str
Normalize arbitrary string-like objects into safe Python UTF-8 text.
core.utils.guess_date
Best-effort datetime parsing.
Always returns
- timezone-aware UTC datetime
- or None
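The "UTC datetime or None" contract can be illustrated with standard-library parsers only. This is a sketch under assumptions: the real function may try more formats, the `guess_date_sketch` name is hypothetical, and treating naive datetimes as UTC is a choice made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def guess_date_sketch(value) -> Optional[datetime]:
    """Try ISO 8601, then RFC 2822 (email) dates; return an aware UTC datetime or None."""
    if isinstance(value, datetime):
        parsed = value
    else:
        parsed = None
        for parser in (datetime.fromisoformat, parsedate_to_datetime):
            try:
                parsed = parser(str(value).strip())
                break
            except (TypeError, ValueError):
                continue  # this parser does not understand the input, try the next one
        if parsed is None:
            return None
    if parsed.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(guess_date_sketch("Mon, 01 May 2023 12:00:00 +0200"))  # 2023-05-01 10:00:00+00:00
print(guess_date_sketch("not a date"))                       # None
```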
core.utils.get_data_folder
Resolve the path of a training data file saved under filename. These files are stored in ../../data/.
Warning
This does not check the existence of the file and root folder.
core.utils.save_data
save_data(data: list[web_page] | sqlite3.Connection, filename: str)
Save scraped data to a compressed archive.
The destination folder and file extension are handled automatically.
| PARAMETER | DESCRIPTION |
|---|---|
| data | Data to save. TYPE: list[web_page] \| sqlite3.Connection |
| filename | Base filename to use. The output extension is added automatically depending on the type of data. TYPE: str |
core.utils.open_data
Open data stored in a tar.gz archive. We probe for sql and pickle datasets,
in this order, and return the first we find.
| PARAMETER | DESCRIPTION |
|---|---|
| filename | Extension-less name of the dataset (no path). |
| scheme | |

| RETURNS | DESCRIPTION |
|---|---|
| list[web_page] \| sqlite3.Connection | will not be saved, so the caller needs to copy/dump it, and close the connection. |
If the archive does not exist, returns an empty list.
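For the pickle case, a round trip through a tar.gz archive might look like the sketch below. The member name `data.pickle`, the function names, and the explicit archive path are assumptions made for illustration. The actual archive layout, the SQL branch, and the path resolution through get_data_folder are not reproduced.

```python
import io
import os
import pickle
import tarfile


def save_pickled_sketch(data: list, archive_path: str, member: str = "data.pickle") -> None:
    """Pickle `data` in memory and store it as a single member of a tar.gz archive."""
    payload = pickle.dumps(data)
    info = tarfile.TarInfo(name=member)
    info.size = len(payload)  # tarfile needs the member size up front
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def open_pickled_sketch(archive_path: str, member: str = "data.pickle") -> list:
    """Load the pickled member back; return an empty list if the archive is missing."""
    if not os.path.exists(archive_path):
        return []
    with tarfile.open(archive_path, "r:gz") as tar:
        with tar.extractfile(member) as f:
            return pickle.load(f)
```

As with any pickle-based storage, archives should only be opened when they come from a trusted source.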
core.utils.get_data_mtime
Return the modification date of the tar.gz archive.
| RETURNS | DESCRIPTION |
|---|---|
| datetime \| None | datetime of the archive modification time, or None if it does not exist. |
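The "datetime or None" behavior can be sketched with `os.path.getmtime`. The function name, the explicit path argument, and the choice of UTC are assumptions, since the timezone of the returned datetime is not specified here.

```python
import os
from datetime import datetime, timezone
from typing import Optional


def archive_mtime_sketch(archive_path: str) -> Optional[datetime]:
    """Return the file's modification time as an aware datetime, or None if it is missing."""
    try:
        timestamp = os.path.getmtime(archive_path)
    except OSError:
        return None  # missing file or inaccessible path
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```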
core.utils.get_models_folder
Resolve the path of a machine-learning model saved under filename. These are stored in ../../models/.
Warning
This does not check the existence of the file and root folder.
core.utils.get_stopwords_file
Get a dictionary file containing lines of “word: frequency” stored in ../../models/.
By default, core.nlp.Word2Vec stores such a file when the word embedding is learned.
Manually-validated files can be used for search engine purposes, since stopwords add noise to the searches.
core.utils.timeit
timeit(runs: int = 1)
Provide a @timeit decorator to profile the wall-clock performance of a function.
| PARAMETER | DESCRIPTION |
|---|---|
| runs | how many times the function should be executed. The average and standard deviation of the runtimes are reported. TYPE: int |
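A parameterized timing decorator of this kind can be sketched as follows. The output format, the `timeit_sketch` name, and the use of `time.perf_counter` are assumptions. The sketch expects `runs >= 1` and returns the result of the last run.

```python
import functools
import statistics
import time


def timeit_sketch(runs: int = 1):
    """Decorator factory: run the function `runs` times and print timing statistics."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            durations = []
            result = None
            for _ in range(runs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                durations.append(time.perf_counter() - start)
            mean = statistics.mean(durations)
            # The standard deviation needs at least two samples.
            std = statistics.stdev(durations) if runs > 1 else 0.0
            print(f"{func.__name__}: {mean:.6f} s +/- {std:.6f} s over {runs} run(s)")
            return result
        return wrapper
    return decorator


@timeit_sketch(runs=3)
def busy_sum(n: int) -> int:
    return sum(range(n))
```

Because the function is re-executed, this only makes sense for functions without side effects that matter.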
core.utils.exit_after
exit_after(s: int)
Define a decorator exit_after(n) that stops a function after n seconds.
Mostly intended for text parsing functions that get fed unchecked text inputs from the web. In that case, some really bad XML or super-long log files can make the parsing loop hang forever. This decorator will skip them without breaking the loop.
| PARAMETER | DESCRIPTION |
|---|---|
| s | number of seconds. TYPE: int |
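One way to build such a timeout decorator is with a POSIX interval timer, sketched below. This is a swapped-in technique chosen for illustration, not necessarily how the module implements it: it relies on `SIGALRM`, so it only works on Unix-like systems and in the main thread. Raising `TimeoutError` is also an assumption.

```python
import functools
import signal
import time


def exit_after_sketch(s: float):
    """Decorator factory: raise TimeoutError if the function runs longer than `s` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_timeout(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {s} s")

            previous = signal.signal(signal.SIGALRM, on_timeout)
            signal.setitimer(signal.ITIMER_REAL, s)  # arm the timer
            try:
                return func(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
        return wrapper
    return decorator


@exit_after_sketch(0.2)
def parse(document: str) -> str:
    if document == "pathological":
        time.sleep(5)  # stands in for a parser that hangs
    return document.upper()


results = []
for doc in ["first", "pathological", "last"]:
    try:
        results.append(parse(doc))
    except TimeoutError:
        continue  # skip the offending input, keep looping
print(results)  # ['FIRST', 'LAST']
```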
core.utils.get_past_n_months
Get the current date minus n months.
core.utils.get_past_n_weeks
Get the current date minus n weeks.
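Subtracting weeks is plain `timedelta` arithmetic, while months need calendar handling because their lengths differ. The sketch below shows one way to do both. The function names and the optional `now` parameter are hypothetical, and clamping the day to the end of a shorter month is a choice made for this example, so the real functions may behave differently.

```python
import calendar
from datetime import datetime, timedelta
from typing import Optional


def past_n_weeks_sketch(n: int, now: Optional[datetime] = None) -> datetime:
    now = now or datetime.now()
    return now - timedelta(weeks=n)


def past_n_months_sketch(n: int, now: Optional[datetime] = None) -> datetime:
    now = now or datetime.now()
    # Count months from year 0 so that crossing a year boundary is a simple subtraction.
    total = now.year * 12 + (now.month - 1) - n
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day, e.g. March 31 minus one month gives February 28 (or 29).
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day)


print(past_n_months_sketch(1, datetime(2023, 3, 31)))  # 2023-02-28 00:00:00
print(past_n_weeks_sketch(2, datetime(2023, 1, 15)))   # 2023-01-01 00:00:00
```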