
Utils¤

utils ¤

Logging and filter finding utilities.

© 2022-2023 - Aurélien Pierre

Attributes¤

filter_entry module-attribute ¤

filter_entry = TypedDict(
    "filter_entry", {"path": str, "filter": str, "protocol": str}
)

Dictionary type representing a Virtual Secretary filter.

ATTRIBUTE DESCRIPTION
path

absolute path of the filter file.

TYPE: str

filter

name of the filter file, i.e. the name of the filter itself.

TYPE: str

protocol

server protocol, matching the name of one of the [protocols][].

TYPE: str

filter_bank module-attribute ¤

filter_bank = dict[int, filter_entry]

Dictionary type of [utils.filter_entry][] elements associated with their priority in the bank.

ATTRIBUTE DESCRIPTION
key

priority

TYPE: int

value

filter data

TYPE: filter_entry
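As a minimal sketch of how these two types fit together, the snippet rebuilds the documented definitions and fills a small bank. The paths and filter names are made up for illustration, and reading ascending keys as the running order is an assumption based on the description of `find_filters`.

```python
from typing import TypedDict

# Same definitions as documented above.
filter_entry = TypedDict(
    "filter_entry", {"path": str, "filter": str, "protocol": str}
)
filter_bank = dict[int, filter_entry]

# Hypothetical entries: these paths and names are illustrative only.
bank: filter_bank = {
    10: {"path": "/home/user/config/10-imap-spam.py",
         "filter": "spam", "protocol": "imap"},
    0: {"path": "/home/user/config/00-imap-invoices.py",
        "filter": "invoices", "protocol": "imap"},
}

# ASSUMPTION: filters run by ascending priority key.
ordered = [bank[priority]["filter"] for priority in sorted(bank)]
print(ordered)
```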

Classes¤

filter_mode ¤

Bases: Enum

Available filter types

Attributes¤
PROCESS class-attribute instance-attribute ¤
PROCESS = 'process'

Filter applying write, edit or move actions

LEARN class-attribute instance-attribute ¤
LEARN = 'learn'

Filter applying machine-learning or read-only actions

Functions¤

now ¤

now() -> str

Return current time for log lines

match_filter_name ¤

match_filter_name(file: str, mode: filter_mode)

Check if the current filter file matches the requested mode.

PARAMETER DESCRIPTION
file

filter file to test

TYPE: str

mode

filter type

TYPE: filter_mode

RETURNS DESCRIPTION
match

TYPE: re.Match.group
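The filename pattern itself is not documented here, so the following is only a sketch of the general idea: one regular expression per mode, matched against the file name. The naming scheme (`NN-protocol-name.py` for processing filters, `LEARN-protocol-name.py` for learning filters), the `PATTERNS` table and the `match_filter_name_sketch` helper are all assumptions made for illustration.

```python
import re
from enum import Enum


class filter_mode(Enum):
    PROCESS = "process"
    LEARN = "learn"


# ASSUMPTION: hypothetical naming scheme, not taken from the module source.
PATTERNS = {
    filter_mode.PROCESS: re.compile(r"^(\d{2})-([a-z]+)-(.+)\.py$"),
    filter_mode.LEARN: re.compile(r"^(LEARN)-([a-z]+)-(.+)\.py$"),
}


def match_filter_name_sketch(file: str, mode: filter_mode):
    """Return a match object if `file` looks like a filter of `mode`, else None."""
    return PATTERNS[mode].match(file)


match = match_filter_name_sketch("00-imap-invoices.py", filter_mode.PROCESS)
if match:
    priority, protocol, name = match.groups()
    print(priority, protocol, name)
```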

find_filters ¤

find_filters(path: str, filters: filter_bank, mode: filter_mode) -> filter_bank

Find all filter files in a directory (i.e. filenames matching the filter name pattern) and add them to the dictionary of filters, keyed by their priority. If two filters share the same priority, the first-defined one takes precedence and the other is discarded.

PARAMETER DESCRIPTION
path

the folder where to find filter files

TYPE: str

filters

the dictionary to which the filters found here are added. Its keys are the integer priorities of the filters (their running order). If filters with the same priority are found in the current path, former filters are overridden.

TYPE: filter_bank

mode

the type of filter.

TYPE: filter_mode

lock_subfolder ¤

lock_subfolder(lockfile: str)

Write a .lock text file in the subfolder being currently processed, with the PID of the current Virtual Secretary instance.

Override the lock if it contains a PID that doesn’t exist anymore on the system (Linux-only).

PARAMETER DESCRIPTION
lockfile

absolute path of the target lockfile

TYPE: str

Todo

Make it work for Windows PID too.
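A PID-based lock of this kind could be sketched as follows. This is not the module's implementation: the helper names and the boolean return value are assumptions, and the liveness check relies on `os.kill(pid, 0)`, which only behaves this way on POSIX systems.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks for errors
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_subfolder_sketch(lockfile: str) -> bool:
    """Try to take the lock. Return True on success (hypothetical return value)."""
    if os.path.exists(lockfile):
        with open(lockfile) as f:
            content = f.read().strip()
        if content.isdigit() and pid_is_alive(int(content)):
            return False  # lock held by a live instance
    # No lock, or stale lock: (over)write it with our own PID.
    with open(lockfile, "w") as f:
        f.write(str(os.getpid()))
    return True
```

Do not run this check on Windows: there, `os.kill` with an arbitrary signal number terminates the target process instead of probing it.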

unlock_subfolder ¤

unlock_subfolder(lockfile: str)

Remove the .lock file in current subfolder.

PARAMETER DESCRIPTION
lockfile

absolute path of the target lockfile

TYPE: str

imap_encode ¤

imap_encode(value: str) -> bytes

Encode a Python string into IMAP-compliant UTF-7 bytes, as described in RFC 3501.

IMAP4rev1 uses a modified UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII to the base64 encoding context: modified UTF-7 uses “&” because “+” is commonly used in mailbox names. See RFC 3501, section 5.1.3, for the full description.

Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.

PARAMETER DESCRIPTION
value

IMAP mailbox path as string

TYPE: str

RETURNS DESCRIPTION
path

IMAP-encoded path as UTF-7

TYPE: bytes
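The encoding rules can be illustrated with a short standalone sketch. It follows RFC 3501, section 5.1.3 (printable ASCII passes through, “&” becomes “&-”, everything else is UTF-16BE encoded, base64'd without padding and with “,” in place of “/”), but it is a reimplementation for illustration, not the imap_tools code; `imap_encode_sketch` and `modified_base64` are hypothetical names.

```python
import base64


def modified_base64(text: str) -> bytes:
    """Base64 of the UTF-16BE bytes, without padding, using ',' instead of '/'."""
    raw = base64.b64encode(text.encode("utf-16-be"))
    return raw.rstrip(b"=").replace(b"/", b",")


def imap_encode_sketch(value: str) -> bytes:
    out = []
    pending = []  # non-ASCII characters waiting to be base64-encoded

    def flush():
        if pending:
            out.append(b"&" + modified_base64("".join(pending)) + b"-")
            pending.clear()

    for char in value:
        if char == "&":
            flush()
            out.append(b"&-")  # a literal ampersand is escaped
        elif 0x20 <= ord(char) <= 0x7E:
            flush()
            out.append(char.encode("ascii"))  # printable ASCII is kept as-is
        else:
            pending.append(char)
    flush()
    return b"".join(out)


print(imap_encode_sketch("Entwürfe"))
```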

imap_decode ¤

imap_decode(value: bytes) -> str

Decode IMAP-compliant UTF-7 bytes into a Python string, as described in RFC 3501.

IMAP4rev1 uses a modified UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII to the base64 encoding context: modified UTF-7 uses “&” because “+” is commonly used in mailbox names. See RFC 3501, section 5.1.3, for the full description.

Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.

PARAMETER DESCRIPTION
value

IMAP-encoded path as UTF-7 modified for IMAP

TYPE: bytes

RETURNS DESCRIPTION
path

IMAP path decoded as a Python string

TYPE: str
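Decoding reverses those rules: bytes between “&” and “-” are modified base64 of UTF-16BE text, and an empty “&-” pair stands for a literal ampersand. The sketch below is a standalone illustration under those RFC 3501 rules, not the imap_tools code; `imap_decode_sketch` and `modified_unbase64` are hypothetical names.

```python
import base64


def modified_unbase64(chunk: bytes) -> str:
    """Undo the ','-for-'/' swap, restore padding, then decode UTF-16BE."""
    standard = chunk.replace(b",", b"/")
    standard += b"=" * (-len(standard) % 4)
    return base64.b64decode(standard).decode("utf-16-be")


def imap_decode_sketch(value: bytes) -> str:
    out = []
    shifted = None  # bytearray collecting base64 bytes after '&', else None
    for byte in value:
        if shifted is None:
            if byte == ord("&"):
                shifted = bytearray()  # enter the base64 context
            else:
                out.append(chr(byte))
        elif byte == ord("-"):
            # '&-' alone is a literal ampersand, otherwise decode the run
            out.append(modified_unbase64(bytes(shifted)) if shifted else "&")
            shifted = None
        else:
            shifted.append(byte)
    if shifted:  # tolerate a missing final '-'
        out.append(modified_unbase64(bytes(shifted)))
    return "".join(out)


print(imap_decode_sketch(b"Entw&APw-rfe"))
```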

typography_undo ¤

typography_undo(string: str) -> str

Convert typographically correct Unicode characters into plain ASCII equivalents to produce computer-standard vocabulary and help word tokenizers properly detect word boundaries.

This is useful when parsing:

1. **properly composed** text, like the output of LaTeX or SmartyPants[^1]/WP Scholar[^2],
2. text typed with Dvorak-like keyboard layouts (using proper Unicode entities where needed).

For example, the proper ellipsis entity (Unicode U+2026 symbol) will be converted into three regular dots (...).
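One simple way to perform such substitutions is a translation table. The table here is a small illustrative sample, not the list of characters actually handled by `typography_undo`, and `typography_undo_sketch` is a hypothetical name.

```python
# ASSUMPTION: illustrative subset of replacements, not the module's real table.
REPLACEMENTS = str.maketrans({
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark, typographic apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u00a0": " ",    # non-breaking space
    "\u2013": "-",    # en dash
})


def typography_undo_sketch(string: str) -> str:
    """Replace a few typographic characters with ASCII stand-ins."""
    return string.translate(REPLACEMENTS)


print(typography_undo_sketch("Wait\u2026 it\u2019s \u201cdone\u201d"))
```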

guess_date ¤

guess_date(string: str | datetime) -> datetime

Best effort to guess a date from a string using typical date/time formats
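A best-effort parser of this kind is often a loop over candidate formats. The sketch below assumes a short, hypothetical format list and raises `ValueError` when nothing matches; the formats actually tried by `guess_date` and its failure behaviour are not documented here.

```python
from datetime import datetime
from typing import Union

# ASSUMPTION: hypothetical candidate formats, tried in order.
FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
)


def guess_date_sketch(string: Union[str, datetime]) -> datetime:
    if isinstance(string, datetime):
        return string  # already parsed, nothing to guess
    text = string.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"no known date format matches {string!r}")


print(guess_date_sketch("2023-01-15 08:30:00"))
```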

get_data_folder ¤

get_data_folder(filename: str) -> str

Resolve the path of a training data file saved under filename. These files are stored in ../../data/. The .pickle extension is added automatically to the filename.

Warning

This does not check the existence of the file and root folder.

save_data ¤

save_data(data: list, filename: str)

Save scraped data to a pickle file inside a tar.gz archive in data folder. Folder and file extension are handled automatically.

open_data ¤

open_data(filename: str) -> list

Open scraped data from a pickle file inside a tar.gz archive stored in the data folder. Folder and file extension are handled automatically. An empty list is returned if the file does not exist.
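The save and load round trip, a pickle stored as the single member of a tar.gz archive, can be sketched with the standard library. The member name `data.pickle` and the two function names are assumptions, and the functions take a full archive path instead of resolving the data folder themselves.

```python
import io
import os
import pickle
import tarfile

MEMBER = "data.pickle"  # ASSUMPTION: hypothetical member name inside the archive


def save_data_sketch(data: list, archive_path: str) -> None:
    payload = pickle.dumps(data)
    info = tarfile.TarInfo(name=MEMBER)
    info.size = len(payload)  # tarfile needs the size before writing
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def open_data_sketch(archive_path: str) -> list:
    if not os.path.exists(archive_path):
        return []  # mirrors the documented behaviour for missing files
    with tarfile.open(archive_path, "r:gz") as tar:
        member = tar.extractfile(MEMBER)
        return pickle.load(member)
```

As with any pickle-based storage, only archives from a trusted source should be loaded, since unpickling can execute arbitrary code.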

get_models_folder ¤

get_models_folder(filename: str) -> str

Resolve the path of a machine-learning model saved under filename. These are stored in ../../models/.

Warning

This does not check the existence of the file and root folder.

get_stopwords_file ¤

get_stopwords_file(filename: str) -> dict

Get a dictionary file containing lines of “word: frequency” stored in ../../models/. By default, [core.nlp.Word2Vec.__init__][] stores such a file when the word embedding is learned. Manually validated files can be used for search engine purposes, since stopwords add noise to the searches.
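Parsing “word: frequency” lines into a dictionary could look like the sketch below. Storing the frequency as a float, skipping lines without a colon, and the `parse_stopwords` name are all assumptions made for illustration.

```python
def parse_stopwords(lines) -> dict:
    """Turn an iterable of 'word: frequency' lines into {word: frequency}."""
    stopwords = {}
    for line in lines:
        # Split on the last colon so that words containing ':' stay intact.
        word, separator, frequency = line.rpartition(":")
        if not separator:
            continue  # blank or malformed line
        stopwords[word.strip()] = float(frequency)  # ASSUMPTION: numeric value
    return stopwords


print(parse_stopwords(["the: 0.05\n", "\n", "of: 0.03"]))
```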

timeit ¤

timeit(runs: int = 1)

Provide a @timeit decorator to profile the wall-clock performance of a function.

PARAMETER DESCRIPTION
runs

how many times the function should be re-executed. Runtimes will give average and standard deviation.

TYPE: int
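A parameterized timing decorator along these lines can be written with `time.perf_counter` and `statistics`. This is a sketch of the general pattern rather than the module's code; the name `timeit_sketch`, the printed format and returning the result of the last run are assumptions.

```python
import functools
import statistics
import time


def timeit_sketch(runs: int = 1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            durations = []
            result = None
            for _ in range(runs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                durations.append(time.perf_counter() - start)
            mean = statistics.mean(durations)
            # The standard deviation needs at least two samples.
            spread = statistics.stdev(durations) if runs > 1 else 0.0
            print(f"{func.__name__}: {mean:.6f} s ± {spread:.6f} s over {runs} run(s)")
            return result  # result of the last run
        return wrapper
    return decorator


@timeit_sketch(runs=3)
def add(a, b):
    return a + b


print(add(1, 2))
```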

exit_after ¤

exit_after(s: int)

Define a decorator exit_after(s) that stops a function after s seconds.

Mostly intended for text parsing functions that get fed unchecked text inputs from the web. In that case, some really bad XML or super-long log files can make the parsing loop hang forever. This decorator will skip them without breaking the loop.

PARAMETER DESCRIPTION
s

number of seconds

TYPE: int

RETURNS DESCRIPTION

the output of the function or None if it timed out.
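The calling side of such a timeout can be approximated with a worker thread. Note the important difference: this simplified sketch only stops waiting after `s` seconds and returns None, while the abandoned call keeps running in a daemon thread until it finishes or the program exits. How the real decorator interrupts the function is not documented here, and `exit_after_sketch` is a hypothetical name.

```python
import functools
import threading


def exit_after_sketch(s: float):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                outcome["value"] = func(*args, **kwargs)

            # Daemon thread: it will not keep the interpreter alive on exit.
            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(timeout=s)
            if worker.is_alive():
                return None  # timed out: give up on this input
            # Also None if func raised inside the worker thread.
            return outcome.get("value")
        return wrapper
    return decorator
```

A parsing loop can then treat None as "skip this document" and move on to the next input.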