Utils¤
utils ¤
Logging and filter finding utilities.
© 2022-2023 - Aurélien Pierre
Attributes¤
filter_entry module-attribute ¤
filter_bank module-attribute ¤
Dictionary type of [utils.filter_entry][] elements associated with their priority in the bank.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| key | priority |
| value | filter data |
Classes¤
filter_mode ¤
Functions¤
match_filter_name ¤
Check if the current filter file matches the requested mode.
| PARAMETER | DESCRIPTION |
|---|---|
| file | filter file to test |
| mode | filter type |

| RETURNS | DESCRIPTION |
|---|---|
| match | |
find_filters ¤
Find all the filter files in a directory (i.e. filenames matching the filter name pattern) and append them to the dictionary of filters based on their priority. If two filters with the same priority are found, the first-defined one takes precedence and the other is discarded.
| PARAMETER | DESCRIPTION |
|---|---|
| path | the folder where to find filter files |
| filters | the dictionary where the filters found here will be appended. This dictionary has the integer priority of filters (order of running) as keys. If filters with the same priority are found in the current path, former filters are overridden. |
| mode | the type of filter. |
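The priority handling described above can be sketched as follows. This is a minimal illustration, not the module's code: the `NN-mode-name.py` filename pattern, the `collect_filters` helper and the way the two precedence rules are combined (first-defined wins within one folder, then the folder's entries override earlier ones in the bank) are all assumptions.

```python
import re

# Hypothetical filename convention: "<2-digit priority>-<mode>-<name>.py"
FILTER_PATTERN = re.compile(r"^(\d{2})-([a-z]+)-.+\.py$")


def collect_filters(filenames, filters, mode):
    """Merge the filter files of one folder into the `filters` bank."""
    found_here = {}
    for name in sorted(filenames):
        match = FILTER_PATTERN.match(name)
        if match is None or match.group(2) != mode:
            continue  # not a filter file, or a filter of another type
        priority = int(match.group(1))
        # Within the same folder, the first file seen keeps its priority slot.
        found_here.setdefault(priority, name)
    # Entries of the current folder replace former ones with equal priority.
    filters.update(found_here)
    return filters


bank = collect_filters(
    ["01-imap-b.py", "01-imap-a.py", "02-processing-x.py", "notes.txt"],
    {1: "old-filter.py", 5: "05-imap-keep.py"},
    "imap",
)
```

Keys of the resulting dictionary can then be sorted to run the filters in priority order.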
lock_subfolder ¤
Write a .lock text file in the subfolder being currently processed, with the PID of the current Virtual Secretary instance.
Override the lock if it contains a PID that doesn’t exist anymore on the system (Linux-only).
| PARAMETER | DESCRIPTION |
|---|---|
| lockfile | absolute path of the target lockfile |
Todo
Make it work for Windows PID too.
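A PID-based lock with stale-lock detection could look like the sketch below. It is an illustration under assumptions: the helper names (`pid_is_running`, `try_lock`, `unlock`) and the boolean return value are hypothetical, and the liveness check relies on `os.kill(pid, 0)`, which behaves as described on POSIX systems only.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_lock(lockfile: str) -> bool:
    """Write our PID into `lockfile` unless a live process already holds it."""
    if os.path.exists(lockfile):
        try:
            with open(lockfile) as f:
                owner = int(f.read().strip())
        except ValueError:
            owner = None  # unreadable content: treat the lock as stale
        if owner is not None and owner != os.getpid() and pid_is_running(owner):
            return False
    with open(lockfile, "w") as f:
        f.write(str(os.getpid()))
    return True


def unlock(lockfile: str) -> None:
    """Remove the lock file if it is present."""
    if os.path.exists(lockfile):
        os.remove(lockfile)
```

Checking then writing is not atomic, so two instances starting at the same instant could both acquire the lock; a stricter variant would create the file with `os.open(..., os.O_CREAT | os.O_EXCL)`.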
unlock_subfolder ¤
Remove the .lock file in current subfolder.
| PARAMETER | DESCRIPTION |
|---|---|
| lockfile | absolute path of the target lockfile |
imap_encode ¤
Encode a Python string into IMAP-compliant UTF-7 bytes, as described in RFC 3501.
IMAP4rev1 uses a modified variant of UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, used to switch from the ASCII to the base64 encoding context. It is “&” in this modified UTF-7 convention, since “+” is commonly used in mailbox names. Full description in RFC 3501, section 5.1.3.
Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.
| PARAMETER | DESCRIPTION |
|---|---|
| value | IMAP mailbox path as string |

| RETURNS | DESCRIPTION |
|---|---|
| path | IMAP-encoded path as UTF-7 |
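The encoding rules of RFC 3501, section 5.1.3 can be illustrated with the sketch below. It is written for this page and is not the code shipped in the module; the name `encode_modified_utf7` is hypothetical. Printable ASCII passes through, “&” becomes “&-”, and every run of other characters is encoded as UTF-16BE, then base64 without padding and with “,” in place of “/”, wrapped between “&” and “-”.

```python
import binascii


def _modified_base64(text: str) -> bytes:
    # UTF-16BE bytes, base64 without padding, "," replacing "/"
    encoded = binascii.b2a_base64(text.encode("utf-16-be")).rstrip(b"\n=")
    return encoded.replace(b"/", b",")


def encode_modified_utf7(value: str) -> bytes:
    out = []
    pending = []  # characters waiting to be base64-encoded as one run

    def flush():
        if pending:
            out.append(b"&" + _modified_base64("".join(pending)) + b"-")
            pending.clear()

    for char in value:
        if char == "&":
            flush()
            out.append(b"&-")  # literal ampersand
        elif 0x20 <= ord(char) <= 0x7E:
            flush()
            out.append(char.encode("ascii"))  # printable ASCII is kept as-is
        else:
            pending.append(char)
    flush()
    return b"".join(out)


# Mailbox name taken from the example in RFC 3501, section 5.1.3
encoded = encode_modified_utf7("~peter/mail/台北/日本語")
```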
imap_decode ¤
Decode IMAP-compliant UTF-7 bytes into a Python string, as described in RFC 3501.
IMAP4rev1 uses a modified variant of UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, used to switch from the ASCII to the base64 encoding context. It is “&” in this modified UTF-7 convention, since “+” is commonly used in mailbox names. Full description in RFC 3501, section 5.1.3.
Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.
| PARAMETER | DESCRIPTION |
|---|---|
| value | IMAP-encoded path as UTF-7 modified for IMAP |

| RETURNS | DESCRIPTION |
|---|---|
| path | IMAP path decoded as Python string |
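The reverse operation can be sketched in the same spirit. Again, this is an illustration and not the module's code; `decode_modified_utf7` is a hypothetical name. Bytes between “&” and “-” are treated as modified base64 of UTF-16BE text, and the empty sequence “&-” stands for a literal ampersand.

```python
import binascii


def _modified_unbase64(chunk: bytes) -> str:
    # Restore standard base64: "/" instead of "," and "=" padding to a multiple of 4
    standard = chunk.replace(b",", b"/")
    standard += b"=" * (-len(standard) % 4)
    return binascii.a2b_base64(standard).decode("utf-16-be")


def decode_modified_utf7(value: bytes) -> str:
    out = []
    shifted = None  # bytearray of base64 characters while inside "&...-", else None
    for byte in value:
        if shifted is None:
            if byte == ord("&"):
                shifted = bytearray()  # enter the base64 context
            else:
                out.append(chr(byte))
        elif byte == ord("-"):
            # "&-" is a literal ampersand, anything else is an encoded run
            out.append(_modified_unbase64(bytes(shifted)) if shifted else "&")
            shifted = None
        else:
            shifted.append(byte)
    if shifted:
        # Tolerate a run that is missing its closing "-"
        out.append(_modified_unbase64(bytes(shifted)))
    return "".join(out)


# Mailbox name taken from the example in RFC 3501, section 5.1.3
decoded = decode_modified_utf7(b"~peter/mail/&U,BTFw-/&ZeVnLIqe-")
```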
typography_undo ¤
Break correct typographic Unicode entities into dummy computer characters (ASCII) to produce computer-standard vocabulary and help word tokenizers to properly detect word boundaries.
This is useful when parsing:
1. **properly composed** text, like the output of LaTeX or SmartyPants[^1]/WP Scholar[^2],
2. text typed with Dvorak-like keyboard layouts (using proper Unicode entities where needed).
For example, the proper ellipsis entity … (Unicode U+2026) is converted into three regular dots (...).
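One simple way to perform such a conversion is a character translation table, as in the sketch below. Only the ellipsis mapping comes from the description above; the other entries and the name `undo_typography_sketch` are assumptions chosen for illustration, and the real function may cover a different set of characters.

```python
# Translation table from typographic characters to ASCII stand-ins.
# Only the ellipsis entry is documented above; the rest are illustrative guesses.
TYPOGRAPHY_TABLE = str.maketrans({
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u00a0": " ",    # no-break space
})


def undo_typography_sketch(text: str) -> str:
    """Replace typographic Unicode characters with plain ASCII equivalents."""
    return text.translate(TYPOGRAPHY_TABLE)


plain = undo_typography_sketch("Wait\u2026 it\u2019s \u201cfine\u201d")
```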
guess_date ¤
Best effort to guess a date from a string using typical date/time formats.
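A best-effort parser of this kind is often built by trying a list of candidate formats in turn. The sketch below shows that approach; the format list, the `None` fallback and the name `guess_date_sketch` are assumptions, since the formats actually tried by the module are not documented here.

```python
from datetime import datetime

# Hypothetical list of candidate formats, tried in order
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
)


def guess_date_sketch(text: str):
    """Return the first successful parse of `text`, or None if nothing matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
    return None


parsed = guess_date_sketch(" 2023-01-15 ")
```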
get_data_folder ¤
Resolve the path of the training data saved under filename. These are stored in ../../data/.
The .pickle extension is added automatically to the filename.
Warning
This does not check the existence of the file and root folder.
save_data ¤
Save scraped data to a pickle file inside a tar.gz archive in data folder. Folder and file extension are handled automatically.
open_data ¤
Open scraped data from a pickle file inside a tar.gz archive stored in data folder. Folder and file extension are handled automatically. An empty list is returned if the file does not exist.
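Storing a pickle inside a tar.gz archive and reading it back can be done with the standard `pickle` and `tarfile` modules, as sketched below. The helper names, the explicit `archive_path` argument and the `data.pickle` member name are assumptions; the real functions resolve the folder and extension themselves.

```python
import io
import os
import pickle
import tarfile


def save_pickle_tar(data, archive_path, member="data.pickle"):
    """Pickle `data` in memory and store it as one member of a tar.gz archive."""
    payload = pickle.dumps(data)
    info = tarfile.TarInfo(name=member)
    info.size = len(payload)  # tarfile needs the member size up front
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def open_pickle_tar(archive_path, member="data.pickle"):
    """Load the pickled member back, or return an empty list if the archive is missing."""
    if not os.path.exists(archive_path):
        return []
    with tarfile.open(archive_path, "r:gz") as tar:
        return pickle.load(tar.extractfile(member))
```

As with any pickle, such archives should only be opened when they come from a trusted source, because unpickling can execute arbitrary code.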
get_models_folder ¤
Resolve the path of a machine-learning model saved under filename. These are stored in ../../models/.
Warning
This does not check the existence of the file and root folder.
get_stopwords_file ¤
Get a dictionary file containing lines of “word: frequency” stored in ../../models/.
By default, [core.nlp.Word2Vec.init][core.nlp.Word2Vec.__init__] stores such a file when the word embedding is learned.
Manually-validated files can be used for search engine purposes, since stopwords add noise to the searches.
timeit ¤
Provide a @timeit decorator to profile the wall-clock performance of a function.
| PARAMETER | DESCRIPTION |
|---|---|
| - | how many times the function should be re-executed. The average and standard deviation of the runtimes are reported. |
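A decorator of this kind can be sketched as below. The name `timeit_sketch`, the `runs` parameter name, the printed report format and the choice to return the result of the last run are all assumptions made for illustration.

```python
import functools
import statistics
import time


def timeit_sketch(runs: int = 1):
    """Decorator factory: run the function `runs` times (at least once) and report timings."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            durations = []
            result = None
            for _ in range(max(1, runs)):
                start = time.perf_counter()  # monotonic wall-clock timer
                result = func(*args, **kwargs)
                durations.append(time.perf_counter() - start)
            mean = statistics.mean(durations)
            # The standard deviation needs at least two samples
            spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
            print(f"{func.__name__}: {mean:.6f} s ± {spread:.6f} s over {len(durations)} run(s)")
            return result
        return wrapper
    return decorator


@timeit_sketch(runs=3)
def total(n):
    return sum(range(n))


value = total(1000)
```

Because the function is really executed several times, this pattern only suits functions whose side effects are safe to repeat.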
exit_after ¤
Define a decorator exit_after(n) that stops a function after n seconds.
Mostly intended for text parsing functions that get fed unchecked text inputs from the web. In that case, some really bad XML or super-long log files can make the parsing loop hang forever. This decorator will skip them without breaking the loop.
| PARAMETER | DESCRIPTION |
|---|---|
| s | number of seconds |

| RETURNS | DESCRIPTION |
|---|---|
| | the output of the function or None if it timed out. |
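One possible way to build such a timeout decorator is an interval timer that raises an exception in the running function. The sketch below uses `signal.setitimer`, which is an assumption about the technique and not necessarily how the module does it; it works only on Unix and only when called from the main thread. The names `exit_after_sketch` and `_Timeout` are hypothetical.

```python
import functools
import signal
import time


class _Timeout(Exception):
    """Raised by the alarm handler when the time budget is spent."""


def exit_after_sketch(s: float):
    """Decorator factory: return None if the function runs longer than `s` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise _Timeout()

            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, s)  # start the countdown
            try:
                return func(*args, **kwargs)
            except _Timeout:
                return None  # the caller's loop can simply skip this input
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
        return wrapper
    return decorator


@exit_after_sketch(0.2)
def parse_forever():
    time.sleep(2)  # stands in for a parser stuck on bad input
    return "parsed"


@exit_after_sketch(2)
def parse_quickly():
    return "parsed"


slow_result = parse_forever()
fast_result = parse_quickly()
```

In a parsing loop, a `None` result can then be treated as "skip this document" without interrupting the rest of the batch.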