core.utils¤

Logging and filter finding utilities.

Attributes¤

core.utils.filter_entry `module-attribute` ¤

filter_entry = TypedDict(
    "filter_entry", {"path": str, "filter": str, "protocol": str}
)

Dictionary type representing a Virtual Secretary filter.

ATTRIBUTE	DESCRIPTION
`path`	absolute path of the filter file. TYPE: `str`
`filter`	name of the filter file, that is, the name of the filter itself. TYPE: `str`
`protocol`	server protocol, matching the name of one of the protocols. TYPE: `str`

core.utils.filter_bank `module-attribute` ¤

filter_bank = dict[int, filter_entry]

Dictionary type of core.utils.filter_entry elements associated with their priority in the bank.

ATTRIBUTE	DESCRIPTION
`key`	priority TYPE: `int`
`value`	filter data TYPE: `filter_entry`
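
A minimal sketch of how the two types fit together, assuming the definitions above are reproduced locally rather than imported. The paths and filter names are invented for illustration, and the example assumes that lower priority numbers run first.

```python
from typing import TypedDict

# Local copies of the two documented types, so the example is self-contained.
filter_entry = TypedDict(
    "filter_entry", {"path": str, "filter": str, "protocol": str}
)
filter_bank = dict[int, filter_entry]

# Hypothetical filters: these paths and names do not come from the project.
bank: filter_bank = {
    20: {"path": "/home/user/filters/20-archive.py", "filter": "20-archive", "protocol": "imap"},
    10: {"path": "/home/user/filters/10-spam.py", "filter": "10-spam", "protocol": "imap"},
}

# Keys are priorities, so sorting them yields the assumed running order.
running_order = [bank[priority]["filter"] for priority in sorted(bank)]
print(running_order)
```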

Classes¤

core.utils.filter_mode ¤

Bases: Enum

Available filter types

Attributes¤

core.utils.filter_mode.PROCESS `class-attribute` `instance-attribute` ¤

PROCESS = 'process'

Filter applying write, edit or move actions

core.utils.filter_mode.LEARN `class-attribute` `instance-attribute` ¤

LEARN = 'learn'

Filter applying machine-learning or read-only actions
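
As a short illustration, the enumeration can be looked up by value, which is convenient when the mode comes from a filename or a command-line argument. The class below is a local copy of the documented members, not an import from the project.

```python
from enum import Enum


class filter_mode(Enum):
    """Local copy of the documented enumeration."""
    PROCESS = "process"
    LEARN = "learn"


# Looking a member up by its string value returns the member itself.
mode = filter_mode("learn")
print(mode is filter_mode.LEARN, mode.value)
```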

Functions:¤

core.utils.now ¤

now() -> str

Return current time for log lines

core.utils.match_filter_name ¤

match_filter_name(file: str, mode: filter_mode)

Check if the current filter file matches the requested mode.

PARAMETER	DESCRIPTION
`file`	filter file to test TYPE: `str`
`mode`	filter type TYPE: `filter_mode`

RETURNS	DESCRIPTION
`match`	TYPE: `re.Match.group`

core.utils.find_filters ¤

find_filters(path: str, filters: filter_bank, mode: filter_mode) -> filter_bank

Find all the filter files in a directory (that is, filenames matching the filter name pattern) and add them to the dictionary of filters, keyed by priority. If two filters with the same priority are found, the first-defined one takes precedence and the other is discarded.

PARAMETER	DESCRIPTION
`path`	the folder where to find filter files TYPE: `str`
`filters`	the dictionary to which the filters found here are added. Its keys are the integer priorities of the filters (their running order). If filters with the same priority are found in the current path, former filters are overridden. TYPE: `filter_bank`
`mode`	the type of filter. TYPE: `filter_mode`

core.utils.lock_subfolder ¤

lock_subfolder(lockfile: str)

Write a .lock text file in the subfolder being currently processed, with the PID of the current Virtual Secretary instance.

Override the lock if it contains a PID that doesn’t exist anymore on the system (Linux-only).

PARAMETER	DESCRIPTION
`lockfile`	absolute path of the target lockfile TYPE: `str`

Todo

Make it work for Windows PID too.
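
The stale-lock check described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the helper names and the boolean return value are hypothetical, the PID test relies on POSIX `os.kill` with signal 0, and the check-then-write sequence is not race-free.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is killed
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def sketch_lock(lockfile: str) -> bool:
    """Try to take the lock; return False if a living process holds it."""
    if os.path.exists(lockfile):
        with open(lockfile) as f:
            content = f.read().strip()
        # Only strictly positive PIDs are meaningful for os.kill here.
        if content.isdigit() and int(content) > 0 and pid_is_alive(int(content)):
            return False
    # No lock, unreadable lock, or stale PID: (over)write with our own PID.
    with open(lockfile, "w") as f:
        f.write(str(os.getpid()))
    return True
```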

core.utils.unlock_subfolder ¤

unlock_subfolder(lockfile: str)

Remove the .lock file in current subfolder.

PARAMETER	DESCRIPTION
`lockfile`	absolute path of the target lockfile TYPE: `str`

core.utils.imap_encode ¤

imap_encode(value: str) -> bytes

Encode a Python string into IMAP-compliant UTF-7 bytes, as described in RFC 3501.

IMAP4rev1 uses a modified variant of UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII context to the base64 encoding context: it is “&” in this modified UTF-7 convention, since “+” is considered to be commonly used in mailbox names. The full description is in RFC 3501, section 5.1.3.

Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.

PARAMETER	DESCRIPTION
`value`	IMAP mailbox path as string TYPE: `str`

RETURNS	DESCRIPTION
`path`	IMAP-encoded path as UTF-7 TYPE: `bytes`
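
To make the convention concrete, here is an independent minimal encoder sketch. It is not the imap_tools code credited above, and the function name is hypothetical. It assumes the rules of RFC 3501, section 5.1.3: printable ASCII is copied as-is, “&” becomes `&-`, and any other run of characters is encoded as UTF-16BE, then base64 without padding and with “,” in place of “/”, wrapped between “&” and “-”.

```python
import base64


def sketch_imap_encode(value: str) -> bytes:
    out = bytearray()
    pending = []  # non-ASCII characters waiting to be base64-encoded

    def flush() -> None:
        # Emit the pending run as &<modified base64 of UTF-16BE>-
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw).rstrip(b"=").replace(b"/", b",")
            out.extend(b"&" + b64 + b"-")
            pending.clear()

    for char in value:
        if char == "&":
            flush()
            out.extend(b"&-")  # a literal ampersand is escaped as "&-"
        elif 0x20 <= ord(char) <= 0x7E:
            flush()
            out.extend(char.encode("ascii"))  # printable ASCII is kept as-is
        else:
            pending.append(char)
    flush()
    return bytes(out)


print(sketch_imap_encode("Entwürfe"))
```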

core.utils.imap_decode ¤

imap_decode(value: bytes) -> str

Decode IMAP-compliant UTF-7 bytes into a Python string, as described in RFC 3501.

IMAP4rev1 uses a modified variant of UTF-7, so the built-in Python UTF-7 codec can’t be used. The main difference is the shift character, which switches from the ASCII context to the base64 encoding context: it is “&” in this modified UTF-7 convention, since “+” is considered to be commonly used in mailbox names. The full description is in RFC 3501, section 5.1.3.

Code from imap_tools/imap_utf7.py by ikvk under Apache 2.0 license.

PARAMETER	DESCRIPTION
`value`	IMAP-encoded path as UTF-7 modified for IMAP TYPE: `bytes`

RETURNS	DESCRIPTION
`path`	IMAP path decoded as a Python string TYPE: `str`
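
The reverse direction can be sketched in the same spirit. Again, this is an independent illustration with a hypothetical function name, not the imap_tools code. It assumes well-formed input and treats a missing closing “-” as running to the end of the string.

```python
import base64


def sketch_imap_decode(value: bytes) -> str:
    parts = []
    i = 0
    while i < len(value):
        if value[i:i + 1] == b"&":
            end = value.find(b"-", i)
            if end == -1:
                end = len(value)  # unterminated shift: read to the end
            chunk = value[i + 1:end]
            if not chunk:
                parts.append("&")  # "&-" is an escaped literal ampersand
            else:
                b64 = chunk.replace(b",", b"/")
                b64 += b"=" * (-len(b64) % 4)  # restore stripped padding
                parts.append(base64.b64decode(b64).decode("utf-16-be"))
            i = end + 1
        else:
            parts.append(chr(value[i]))  # plain ASCII byte
            i += 1
    return "".join(parts)


print(sketch_imap_decode(b"~peter/mail/&U,BTFw-/&ZeVnLIqe-"))
```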

core.utils.typography_undo ¤

typography_undo(string: str) -> str

Break correct typographic Unicode entities into dummy computer characters (ASCII) to produce computer-standard vocabulary and help word tokenizers to properly detect word boundaries.

This is useful when parsing:

1. **properly composed** text, like the output of LaTeX or SmartyPants[^1]/WP Scholar[^2],
2. text typed with Dvorak-like keyboard layouts (using proper Unicode entities where needed).

For example, the proper ellipsis entity … (Unicode symbol U+2026) will be converted into three regular dots (`...`).
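
A minimal sketch of the idea, using `str.translate`. The replacement table is a hypothetical subset chosen for illustration; the actual function may handle more characters or map them differently.

```python
# Hypothetical replacement table, written with escapes to keep the source ASCII.
REPLACEMENTS = {
    "\u2026": "...",  # horizontal ellipsis
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark, typographic apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u00a0": " ",    # no-break space
}
TABLE = str.maketrans(REPLACEMENTS)


def sketch_typography_undo(string: str) -> str:
    """Replace typographic characters with plain ASCII equivalents."""
    return string.translate(TABLE)


print(sketch_typography_undo("Wait\u2026 it\u2019s \u201cfine\u201d"))
```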

core.utils.clean_whitespaces ¤

clean_whitespaces(string: str) -> str

Collapse repeated spaces and newlines in text.
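
One possible way to do this with regular expressions is sketched below. The exact rules are assumptions, not taken from the source: tabs are treated like spaces, spaces around newlines are dropped, and the result is stripped.

```python
import re


def sketch_clean_whitespaces(string: str) -> str:
    string = re.sub(r"[ \t]+", " ", string)   # runs of spaces or tabs -> one space
    string = re.sub(r" ?\n ?", "\n", string)  # drop spaces hugging a newline
    string = re.sub(r"\n{2,}", "\n", string)  # runs of newlines -> one newline
    return string.strip()


print(repr(sketch_clean_whitespaces("a   b\n\n\n c  ")))
```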

core.utils.sanitize_unicode ¤

sanitize_unicode(text) -> str

Normalize arbitrary string-like objects into safe Python UTF-8 text.

core.utils.guess_date ¤

guess_date(string: str | datetime) -> datetime | None

Best-effort datetime parsing.

Always returns either a timezone-aware UTC datetime or None.
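
The contract above (UTC-aware datetime or None, never an exception) can be sketched with the standard library only. The choice of parsers, ISO 8601 first and then RFC 2822 email dates, and the decision to treat naive datetimes as UTC are assumptions made for this example; the actual function may try other formats.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def sketch_guess_date(value):
    """Return a timezone-aware UTC datetime, or None if parsing fails."""
    if isinstance(value, datetime):
        parsed = value
    else:
        parsed = None
        # Try each parser in turn and keep the first one that succeeds.
        for parser in (datetime.fromisoformat, parsedate_to_datetime):
            try:
                parsed = parser(value.strip())
                break
            except (TypeError, ValueError):
                continue
        if parsed is None:
            return None
    if parsed.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(sketch_guess_date("Tue, 02 Jan 2024 03:04:05 +0200"))
```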

core.utils.get_data_folder ¤

get_data_folder(filename: str, scheme: str, ext: str) -> str

Resolve the path of a training dataset saved under filename. These are stored in ../../data/.

Warning

This does not check the existence of the file and root folder.

core.utils.save_data ¤

save_data(data: list[web_page] | sqlite3.Connection, filename: str)

Save scraped data to a compressed archive.

The destination folder and file extension are handled automatically.

PARAMETER	DESCRIPTION
`data`	Data to save. Supported types: `list[web_page]`: saved as a `.pickle.tar.gz` archive using Python pickling. `sqlite3.Connection`: saved as a `.sql.tar.gz` archive using an SQLite SQL dump. TYPE: `list[web_page] \| sqlite3.Connection`
`filename`	Base filename to use. The output extension is added automatically depending on the type of `data`. TYPE: `str`
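
For the pickle case, the archive layout can be sketched as follows. The member name `data.pickle`, the explicit archive path argument, and the helper names are hypothetical, and plain dictionaries stand in for `web_page` objects. Note that unpickling executes arbitrary code, so such archives should only be opened when they come from a trusted source.

```python
import io
import os
import pickle
import tarfile


def sketch_save_pickle(data: list, archive_path: str) -> None:
    """Pickle `data` in memory and store it as one member of a tar.gz archive."""
    payload = pickle.dumps(data)
    info = tarfile.TarInfo(name="data.pickle")  # hypothetical member name
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))


def sketch_open_pickle(archive_path: str) -> list:
    """Read the pickled list back, or return an empty list if there is no archive."""
    if not os.path.exists(archive_path):
        return []
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.extractfile("data.pickle")
        return pickle.loads(member.read())
```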

core.utils.open_data ¤

open_data(
    filename: str, scheme: str = "auto"
) -> list[web_page] | sqlite3.Connection

Open data stored in a tar.gz archive. We probe for sql and pickle datasets, in this order, and return the first we find.

PARAMETER	DESCRIPTION
`filename`	Extension-less name of the dataset (no path). TYPE: `str`
`scheme`	`sql` for data saved as SQL dumps, `pickle` for data saved as lists of `web_page`. `auto` will probe both in this order and return the first one found. TYPE: `str` DEFAULT: `'auto'`

RETURNS	DESCRIPTION
`list[web_page] \| sqlite3.Connection`	list of `web_page` for pickle archives, or `sqlite3.Connection` for database archives. The database lives in memory and will not be saved, so the caller needs to copy or dump it, then close the connection.

If the archive does not exist, returns an empty list.
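
For the SQL case, the round trip and the caller's responsibility for the in-memory database can be sketched like this. The member name `data.sql` and all helper names are hypothetical. `Connection.iterdump`, `Connection.executescript` and `Connection.backup` are standard `sqlite3` methods.

```python
import io
import sqlite3
import tarfile


def sketch_save_sql(conn: sqlite3.Connection, archive_path: str) -> None:
    """Store the SQL dump of `conn` as one member of a tar.gz archive."""
    dump = "\n".join(conn.iterdump()).encode("utf-8")
    info = tarfile.TarInfo(name="data.sql")  # hypothetical member name
    info.size = len(dump)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(dump))


def sketch_open_sql(archive_path: str) -> sqlite3.Connection:
    """Rebuild the database in memory from the archived SQL dump."""
    with tarfile.open(archive_path, "r:gz") as archive:
        script = archive.extractfile("data.sql").read().decode("utf-8")
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn


def sketch_persist(conn: sqlite3.Connection, db_path: str) -> None:
    """Copy an in-memory database to disk, then close both connections."""
    disk = sqlite3.connect(db_path)
    conn.backup(disk)
    disk.close()
    conn.close()
```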

core.utils.get_data_mtime ¤

get_data_mtime(filename: str, scheme: str) -> datetime | None

Return the modification date of the tar.gz archive.

RETURNS	DESCRIPTION
`datetime \| None`	datetime of the archive modification time, or None if it does not exist.

core.utils.get_models_folder ¤

get_models_folder(filename: str) -> str

Resolve the path of a machine-learning model saved under filename. These are stored in ../../models/.

Warning

This does not check the existence of the file and root folder.

core.utils.ensure_decompressed ¤

ensure_decompressed(path: str) -> str

Inflate a gzip sibling path + ".gz" into path if it is newer, then return path.

Deploying over FTP gives us no way to run a remote command, so the heavy search_engine.joblib / chantal-slim.db deploy artifacts are gzipped locally (the .db shrinks ~60%) and inflated here instead: whichever worker handles the first request after a deploy pays the one-time gunzip cost via an atomic replace, and every later worker sees an up-to-date plain file and just returns immediately (a single stat).

If path + ".gz" is missing, older than path, or fails to decompress (e.g. caught mid-upload), this is a no-op and the existing path – if any – is left untouched, so a request never breaks because of a deploy in flight.
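
The behaviour described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name is hypothetical, and the temporary file is created in the same directory as the target so that `os.replace` stays an atomic rename on the same filesystem.

```python
import gzip
import os
import shutil
import tempfile
import zlib


def sketch_ensure_decompressed(path: str) -> str:
    gz_path = path + ".gz"
    try:
        gz_mtime = os.path.getmtime(gz_path)
    except OSError:
        return path  # no gzip sibling: nothing to do
    if os.path.exists(path) and os.path.getmtime(path) >= gz_mtime:
        return path  # plain file is already up to date
    tmp_path = None
    try:
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as dst, gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, dst)
        os.replace(tmp_path, path)  # atomic swap: readers never see a partial file
    except (OSError, EOFError, zlib.error):
        # Corrupt or partially uploaded archive: leave the existing file untouched.
        if tmp_path and os.path.exists(tmp_path):
            os.remove(tmp_path)
    return path
```

After a successful inflate, the new plain file is newer than the archive, so subsequent calls return right after the modification-time comparison.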

core.utils.get_stopwords_file ¤

get_stopwords_file(filename: str) -> dict

Get a dictionary file containing lines of “word: frequency” stored in ../../models/. By default, core.nlp.Word2Vec stores such a file when the word embedding is learned. Manually-validated files can be used for search engine purposes, since stopwords add noise to the searches.

core.utils.timeit ¤

timeit(runs: int = 1)

Provide a @timeit decorator to profile the wall-clock performance of a function.

PARAMETER	DESCRIPTION
`runs`	how many times the function should be re-executed. Runtimes will give average and standard deviation. TYPE: `int` DEFAULT: `1`
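
A decorator of this kind can be sketched as below. The name, the output format, and the choice to return the result of the last run are assumptions for illustration; `runs` is expected to be at least 1.

```python
import functools
import statistics
import time


def sketch_timeit(runs: int = 1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            durations = []
            result = None
            for _ in range(runs):
                start = time.perf_counter()
                result = func(*args, **kwargs)
                durations.append(time.perf_counter() - start)
            mean = statistics.mean(durations)
            # The standard deviation needs at least two samples.
            spread = statistics.stdev(durations) if runs > 1 else 0.0
            print(f"{func.__name__}: {mean:.6f} s +/- {spread:.6f} s over {runs} run(s)")
            return result  # result of the last run
        return wrapper
    return decorator


calls = []


@sketch_timeit(runs=3)
def square(x):
    calls.append(x)
    return x * x


answer = square(4)
```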

core.utils.exit_after ¤

exit_after(s: int)

Define a decorator exit_after(s) that stops a function after s seconds.

Mostly intended for text parsing functions that get fed unchecked text inputs from the web. In that case, some really bad XML or super-long log files can make the parsing loop hang forever. This decorator will skip them without breaking the loop.

PARAMETER	DESCRIPTION
`s`	number of seconds TYPE: `int`
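
One way to build such a decorator is with an alarm signal, sketched below. This is not necessarily how the project implements it: the name is hypothetical, returning None on timeout is an assumption, and the approach only works on Unix and in the main thread, since `signal.SIGALRM` is unavailable on Windows and handlers can only be installed from the main thread.

```python
import functools
import signal
import time


class _Timeout(Exception):
    """Raised inside the decorated function when the timer fires."""


def sketch_exit_after(seconds: float):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise _Timeout()

            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, seconds)
            try:
                return func(*args, **kwargs)
            except _Timeout:
                return None  # assumption: a timed-out call yields None
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
        return wrapper
    return decorator


@sketch_exit_after(0.2)
def slow_parse():
    time.sleep(5)  # stands in for a parser stuck on bad input
    return "done"


@sketch_exit_after(1)
def fast_parse():
    return "done"
```

In a loop over many documents, a call such as `slow_parse()` would then be skipped after the timeout while the loop carries on with the next item.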

core.utils.get_past_n_months ¤

get_past_n_months(n: int) -> datetime

Get the date of now minus n months

core.utils.get_past_n_weeks ¤

get_past_n_weeks(n: int) -> datetime

Get the date of now minus n weeks

core.utils.get_past_n_days ¤

get_past_n_days(n: int) -> datetime

Get the date of now minus n days
