Utils¤

utils ¤

Logging and filter finding utilities.

Attributes¤

filter_entry `module-attribute` ¤

filter_entry = TypedDict(
    "filter_entry", {"path": str, "filter": str, "protocol": str}
)

Dictionary type representing a Virtual Secretary filter.

ATTRIBUTE	DESCRIPTION
`path`	absolute path of the filter file. TYPE: `str`
`filter`	name of the filter itself. TYPE: `str`
`protocol`	server protocol, matching the name of one of the [protocols][]. TYPE: `str`

filter_bank `module-attribute` ¤

filter_bank = dict[int, filter_entry]

Dictionary type of [utils.filter_entry][] elements associated with their priority in the bank.

ATTRIBUTE	DESCRIPTION
`key`	priority TYPE: `int`
`value`	filter data TYPE: `filter_entry`
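As a minimal sketch of how these two types fit together, the snippet below rebuilds them and fills a small bank. The paths and filter names are hypothetical, and running filters in ascending priority order is an assumption made for illustration.

```python
from typing import TypedDict

# Same shape as the documented types.
filter_entry = TypedDict(
    "filter_entry", {"path": str, "filter": str, "protocol": str}
)
filter_bank = dict[int, filter_entry]

# Hypothetical entries, keyed by integer priority.
bank: filter_bank = {
    10: {"path": "/tmp/filters/10-imap-spam.py", "filter": "spam", "protocol": "imap"},
    0: {"path": "/tmp/filters/00-imap-invoices.py", "filter": "invoices", "protocol": "imap"},
}

# Assumption: lower priority numbers run first.
ordered = [bank[priority]["filter"] for priority in sorted(bank)]
```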

Classes¤

filter_mode ¤

Bases: Enum

Available filter types

Attributes¤

PROCESS `class-attribute` `instance-attribute` ¤

PROCESS = 'process'

Filter applying write, edit or move actions

LEARN `class-attribute` `instance-attribute` ¤

LEARN = 'learn'

Filter applying machine-learning or read-only actions
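For illustration, an enum with the two documented members can be looked up by value or by name. This is a standalone re-declaration, not an import of the real module.

```python
from enum import Enum


class filter_mode(Enum):
    PROCESS = "process"  # filters that write, edit or move
    LEARN = "learn"      # machine-learning or read-only filters


# Lookup by value (e.g. from a command-line argument) and by name.
mode_from_value = filter_mode("learn")
mode_from_name = filter_mode["PROCESS"]
```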

Functions¤

now ¤

now() -> str

Return current time for log lines

match_filter_name ¤

match_filter_name(file: str, mode: filter_mode)

Check if the current filter file matches the requested mode.

PARAMETER	DESCRIPTION
`file`	filter file to test TYPE: `str`
`mode`	filter type TYPE: `filter_mode`

RETURNS	DESCRIPTION
`match`	TYPE: `re.Match.group`

find_filters ¤

find_filters(path: str, filters: filter_bank, mode: filter_mode) -> filter_bank

Find all the filter files in a directory (i.e. filenames matching the filter name pattern) and add them to the dictionary of filters, keyed by their priority. If two filters share the same priority, the first-defined one gets precedence and the other is discarded.

PARAMETER	DESCRIPTION
`path`	the folder where to find filter files TYPE: `str`
`filters`	the dictionary to which the filters found here are added. Its keys are the integer priorities of the filters (their running order). If filters with the same priority are found in the current path, former filters are overridden. TYPE: `filter_bank`
`mode`	the type of filter. TYPE: `filter_mode`
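The discovery logic can be sketched as follows. The filename pattern `<priority>-<protocol>-<name>.py` is a hypothetical convention chosen for the example, the `mode` argument is left out, and the sketch applies the "first one found keeps its priority" rule. The real pattern and conflict handling may differ.

```python
import os
import re
import tempfile

# Hypothetical naming convention: "<priority>-<protocol>-<name>.py".
FILTER_PATTERN = re.compile(r"^(\d+)-([a-z]+)-([\w-]+)\.py$")


def find_filters_sketch(path: str, filters: dict) -> dict:
    """Return a copy of `filters` extended with the filter files found in `path`."""
    found = dict(filters)
    for file in sorted(os.listdir(path)):
        match = FILTER_PATTERN.match(file)
        if match is None:
            continue  # not a filter file
        priority = int(match.group(1))
        if priority in found:
            continue  # assumption: the first filter at a priority wins
        found[priority] = {
            "path": os.path.join(path, file),
            "filter": match.group(3),
            "protocol": match.group(2),
        }
    return found


# Demo on a throwaway folder with two filters competing for priority 0.
with tempfile.TemporaryDirectory() as folder:
    for name in ("00-imap-invoices.py", "00-imap-duplicate.py",
                 "10-imap-spam.py", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    bank = find_filters_sketch(folder, {})
```

Because filenames are visited in sorted order, `00-imap-duplicate.py` is seen before `00-imap-invoices.py` and keeps priority 0 in this sketch.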

lock_subfolder ¤

lock_subfolder(lockfile: str)

Write a .lock text file in the subfolder being currently processed, with the PID of the current Virtual Secretary instance.

Override the lock if it contains a PID that doesn’t exist anymore on the system (Linux-only).

PARAMETER	DESCRIPTION
`lockfile`	absolute path of the target lockfile TYPE: `str`

Todo

Make it work for Windows PID too.
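A PID-based lock with stale-lock detection might look like the sketch below. The helper names and the boolean return value are hypothetical, the liveness check relies on POSIX `os.kill(pid, 0)`, and the race between checking and writing the file is not handled.

```python
import os
import tempfile


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_sketch(lockfile: str) -> bool:
    """Take the lock unless another live process holds it."""
    if os.path.exists(lockfile):
        with open(lockfile) as handle:
            content = handle.read().strip()
        if content.isdigit():
            owner = int(content)
            if owner != os.getpid() and pid_is_running(owner):
                return False  # lock is held by a live process
    # No lock, stale lock, or our own lock: (re)write it with our PID.
    with open(lockfile, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def unlock_sketch(lockfile: str) -> None:
    """Remove the lock file if it exists."""
    if os.path.exists(lockfile):
        os.remove(lockfile)


# Demo: a lock left behind by a PID assumed not to exist is overridden.
folder = tempfile.mkdtemp()
lockfile = os.path.join(folder, ".lock")
with open(lockfile, "w") as handle:
    handle.write("99999999")
took_lock = lock_sketch(lockfile)
```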

unlock_subfolder ¤

unlock_subfolder(lockfile: str)

Remove the .lock file in current subfolder.

PARAMETER	DESCRIPTION
`lockfile`	absolute path of the target lockfile TYPE: `str`

imap_encode ¤

imap_encode(value: str) -> bytes

Encode a Python string into IMAP-compliant UTF-7 bytes, as described in RFC 3501.

IMAP4rev1 uses a modified UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII context to the base64 context: modified UTF-7 uses “&” instead of “+”, because “+” is commonly used in mailbox names. See RFC 3501, section 5.1.3, for the full description.

Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.

PARAMETER	DESCRIPTION
`value`	IMAP mailbox path as string TYPE: `str`

RETURNS	DESCRIPTION
`path`	IMAP-encoded path as UTF-7 TYPE: `bytes`

imap_decode ¤

imap_decode(value: bytes) -> str

Decode IMAP-compliant UTF-7 bytes into a Python string, as described in RFC 3501.

IMAP4rev1 uses a modified UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII context to the base64 context: modified UTF-7 uses “&” instead of “+”, because “+” is commonly used in mailbox names. See RFC 3501, section 5.1.3, for the full description.

Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.

PARAMETER	DESCRIPTION
`value`	IMAP-encoded path as UTF-7 modified for IMAP TYPE: `bytes`

RETURNS	DESCRIPTION
`path`	IMAP path decoded as a Python string TYPE: `str`
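To make the modified UTF-7 rules concrete, here is an independent sketch of both directions. It is not the imap_tools code referenced above. It assumes well-formed input and implements the rules from RFC 3501, section 5.1.3: printable ASCII passes through, “&” becomes “&-”, and everything else is UTF-16BE encoded, base64 encoded without padding, with “,” in place of “/”, between “&” and “-”.

```python
import base64


def _encode_chunk(chunk: str) -> bytes:
    """Encode a run of non-ASCII characters as "&<modified base64>-"."""
    b64 = base64.b64encode(chunk.encode("utf-16-be")).rstrip(b"=")
    return b"&" + b64.replace(b"/", b",") + b"-"


def imap_encode_sketch(value: str) -> bytes:
    out = bytearray()
    pending = ""  # characters waiting to be base64-encoded together
    for char in value:
        if 0x20 <= ord(char) <= 0x7E:
            if pending:
                out += _encode_chunk(pending)
                pending = ""
            out += b"&-" if char == "&" else char.encode("ascii")
        else:
            pending += char
    if pending:
        out += _encode_chunk(pending)
    return bytes(out)


def imap_decode_sketch(value: bytes) -> str:
    out = []
    i = 0
    while i < len(value):
        if value[i:i + 1] == b"&":
            end = value.index(b"-", i)  # "-" never appears inside base64
            chunk = value[i + 1:end]
            if not chunk:
                out.append("&")  # "&-" is a literal ampersand
            else:
                b64 = chunk.replace(b",", b"/")
                b64 += b"=" * (-len(b64) % 4)  # restore stripped padding
                out.append(base64.b64decode(b64).decode("utf-16-be"))
            i = end + 1
        else:
            out.append(chr(value[i]))
            i += 1
    return "".join(out)


encoded = imap_encode_sketch("Entwürfe")  # b"Entw&APw-rfe"
decoded = imap_decode_sketch(encoded)
```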

typography_undo ¤

typography_undo(string: str) -> str

Replace typographically correct Unicode characters with plain ASCII equivalents, to produce a normalized vocabulary and help word tokenizers detect word boundaries properly.

This is useful when parsing:

1. **properly composed** text, like the output of LaTeX or SmartyPants[^1]/WP Scholar[^2],
2. text typed with Dvorak-like keyboard layouts (using proper Unicode entities where needed).

For example, the proper ellipsis character … (Unicode U+2026) is converted into three regular dots (`...`).
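A translation table is one simple way to sketch this kind of substitution. Only the ellipsis mapping comes from the description above. The other entries (curly quotes, dashes, non-breaking space) are assumptions added for illustration.

```python
# Unicode character -> ASCII replacement. Only the ellipsis is documented;
# the remaining mappings are illustrative guesses.
TYPOGRAPHY_TABLE = str.maketrans({
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u00a0": " ",    # non-breaking space
})


def typography_undo_sketch(string: str) -> str:
    """Replace typographic characters with their ASCII stand-ins."""
    return string.translate(TYPOGRAPHY_TABLE)


plain = typography_undo_sketch("Wait\u2026 it\u2019s \u201cfine\u201d")
```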

guess_date ¤

guess_date(string: str | datetime) -> datetime

Best effort to guess a date from a string using typical date/time formats
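A best-effort parser can be sketched by trying a list of formats in turn. The format list below and the choice to raise `ValueError` when nothing matches are assumptions. The actual function may recognize other formats and fail differently.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
)


def guess_date_sketch(value) -> datetime:
    """Return `value` as a datetime, trying each known format in turn."""
    if isinstance(value, datetime):
        return value  # already parsed, nothing to guess
    text = value.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"no known date format matches {value!r}")


parsed = guess_date_sketch(" 2023-01-15 08:30:00 ")
```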

get_data_folder ¤

get_data_folder(filename: str) -> str

Resolve the path of a training data file saved under filename. These files are stored in ../../data/. The .pickle extension is added automatically to the filename.

Warning

This does not check the existence of the file and root folder.

save_data ¤

save_data(data: list, filename: str)

Save scraped data to a pickle file inside a tar.gz archive in data folder. Folder and file extension are handled automatically.

open_data ¤

open_data(filename: str) -> list

Open scraped data from a pickle file inside a tar.gz archive stored in the data folder. Folder and file extension are handled automatically. An empty list is returned if the file does not exist.
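The save/open round trip through a pickle stored in a tar.gz archive can be sketched with the standard library alone. The function names, the explicit archive path and the member name `data.pickle` are hypothetical, since the real functions resolve folder and extension themselves. As with any pickle, only archives from a trusted source should be opened.

```python
import io
import os
import pickle
import tarfile
import tempfile

MEMBER_NAME = "data.pickle"  # hypothetical name of the file inside the archive


def save_data_sketch(data: list, archive_path: str) -> None:
    """Pickle `data` in memory and store it as one member of a tar.gz."""
    payload = pickle.dumps(data)
    info = tarfile.TarInfo(name=MEMBER_NAME)
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))


def open_data_sketch(archive_path: str) -> list:
    """Load the pickled list, or return [] if the archive is missing."""
    if not os.path.exists(archive_path):
        return []
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.extractfile(MEMBER_NAME)
        return pickle.load(member)


# Demo in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    archive_path = os.path.join(folder, "emails.pickle.tar.gz")
    save_data_sketch(["first mail", "second mail"], archive_path)
    restored = open_data_sketch(archive_path)
    missing = open_data_sketch(os.path.join(folder, "absent.pickle.tar.gz"))
```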

get_models_folder ¤

get_models_folder(filename: str) -> str

Resolve the path of a machine-learning model saved under filename. These are stored in ../../models/.

Warning

This does not check the existence of the file and root folder.

get_stopwords_file ¤

get_stopwords_file(filename: str) -> dict

Get a dictionary file containing lines of “word: frequency”, stored in ../../models/. By default, [core.nlp.Word2Vec.__init__][core.nlp.Word2Vec.__init__] stores such a file when the word embedding is learned. Manually validated files can be used for search engine purposes, since stopwords add noise to the searches.

timeit ¤

timeit(runs: int = 1)

Provide a @timeit decorator to profile the wall-clock performance of a function.

PARAMETER	DESCRIPTION
`runs`	how many times the function should be re-executed. Runtimes are reported as average and standard deviation. TYPE: `int`
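A parameterized timing decorator of this kind can be sketched as below. The printed report format is made up for the example, and the sketch assumes `runs` is at least 1.

```python
import functools
import statistics
import time


def timeit_sketch(runs: int = 1):
    """Return a decorator that times `runs` executions of the function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            durations = []
            result = None
            for _ in range(runs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                durations.append(time.perf_counter() - start)
            mean = statistics.mean(durations)
            # The standard deviation needs at least two samples.
            std = statistics.stdev(durations) if len(durations) > 1 else 0.0
            print(f"{func.__name__}: {mean:.6f} s ± {std:.6f} s over {runs} run(s)")
            return result  # result of the last run
        return wrapper
    return decorator


calls = []


@timeit_sketch(runs=3)
def add(a, b):
    calls.append((a, b))
    return a + b


total = add(1, 2)
```

Note that the decorated function really runs `runs` times, so side effects are repeated as well.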

exit_after ¤

exit_after(s: int)

Define a decorator exit_after(s) that stops a function after s seconds.

Mostly intended for text parsing functions that get fed unchecked text inputs from the web. In that case, some really bad XML or super-long log files can make the parsing loop hang forever. This decorator will skip them without breaking the loop.

PARAMETER	DESCRIPTION
`s`	number of seconds TYPE: `int`

RETURNS	DESCRIPTION
	the output of the function or None if it timed out.
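One way to sketch a timeout decorator is to run the function in a daemon thread and stop waiting after the deadline. This is a swapped-in technique chosen for portability and is not necessarily how the real decorator works. In particular, this version abandons the timed-out call rather than interrupting it, and it also returns None if the function raises.

```python
import functools
import threading
import time


def exit_after_sketch(s: float):
    """Return a decorator that gives up on the function after `s` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = []

            def target():
                result.append(func(*args, **kwargs))

            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(timeout=s)
            if worker.is_alive():
                return None  # timed out; the worker is left running
            return result[0] if result else None
        return wrapper
    return decorator


@exit_after_sketch(0.2)
def slow_parse(text):
    time.sleep(1)  # stands in for a parser stuck on bad input
    return len(text)


@exit_after_sketch(0.2)
def fast_parse(text):
    return len(text)


# A loop over inputs keeps going even when one of them times out.
outputs = [fast_parse("abc"), slow_parse("abc")]
```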
