core.patterns¤

Contains global regular expression patterns re-used in the app. You can use https://regex101.com/ to test these conveniently.

Attributes¤

core.patterns.regex_starter `module-attribute` ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;|\\>)'

Start of line, or start of document, or start of markup

core.patterns.regex_stopper `module-attribute` ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|\\<)'

End of line, or end of document, or end of markup

core.patterns.end_of_word `module-attribute` ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

End of word, or end of line, or end of document, or end of markup

core.patterns.regex_algebra `module-attribute` ¤

regex_algebra = '[\\+\\-\\=\\≠\\±]'

Algebraic signs

core.patterns.IP_PATTERN `module-attribute` ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

core.patterns.EMAIL_PATTERN `module-attribute` ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
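As a quick illustration, the expression shown above can be compiled on its own with the standard library and applied to a line of text. The helper name `extract_emails` and the sample string are made up for this sketch; only the regular expression comes from the module.

```python
import re

# Same expression as EMAIL_PATTERN above, written as a raw string.
EMAIL_PATTERN = re.compile(
    r"<?([0-9a-z\-\_\+\.]+?@[0-9a-z\-\_\+]+(\.[0-9a-z\_\-]{2,})+)>?",
    re.IGNORECASE,
)


def extract_emails(text: str) -> list[str]:
    """Hypothetical helper: return each address without angle brackets."""
    return [match.group(1) for match in EMAIL_PATTERN.finditer(text)]


found = extract_emails("From <me@mail.com>, cc: You@Example.org")
print(found)
```

Reading group 1 rather than the whole match is what drops the optional `<` and `>` delimiters.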

core.patterns.URL_PATTERN `module-attribute` ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URLs must follow RFC 3986, meaning that query parameters come before the anchor, if any. Relying on this assumption allows faster regex parsing.

the protocol (ftp, ftps, http, https) is captured as the first group,
domain.ext is captured as the second group,
/page/etc is captured as the third group, including the leading and trailing /,
the page query parameters ?s=x&r=0, including the ?, are the fourth group if the URL declares ...?params#anchor,
the anchor #anchor, including the #, is the fifth group if the URL declares ...?params#anchor.

URLs are captured if they are:

alone on their own line,
enclosed in {}, [], (),
enclosed in whitespace.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name, since () and [] are valid in URL paths and parameters. This pattern works on plain text only: Markdown, XML, HTML and JSON must be parsed beforehand.

core.patterns.MEMBERS_PATTERN `module-attribute` ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.
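A minimal sketch of one possible use: replacing the captured dot with a space so that each member becomes a separate token. The helper name `split_members` and the sample sentence are assumptions made for the example.

```python
import re

# Same expression as MEMBERS_PATTERN above.
MEMBERS_PATTERN = re.compile(r"(?<=[a-z])(\.)(?=[a-z])", re.IGNORECASE)


def split_members(text: str) -> str:
    """Hypothetical helper: turn letter.letter dots into spaces."""
    return MEMBERS_PATTERN.sub(" ", text)


result = split_members("params.cookies.client costs 3.14.")
print(result)
```

Dots between digits and the sentence-final dot are left alone because the lookarounds require a letter on both sides.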

core.patterns.DATE_PATTERN `module-attribute` ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22, where the whole date is captured in the first group, then each group of digits is captured in order of appearance in the next 3 groups.

core.patterns.TIME_PATTERN `module-attribute` ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, like:

12h15
12:15
12:15:00
12am
12 am
12 h
12:15:00Z
12:15:00+01
12:15:00 UTC+1
11:27:45+0000

RETURNS	DESCRIPTION
`0`	1- or 2-digit hour. TYPE: `str`
`1`	hour/minutes separator or half-day marker among `["h", ":", "am", "pm"]` (case-insensitive). TYPE: `str`
`2`	2-digit minutes, if any, or `None`. TYPE: `str`
`3`	2-digit seconds, if any. TYPE: `str`
`4`	hour marker (`h` or `H`), half-day marker (case-insensitive `["am", "pm"]`), or time zone marker (case-sensitive `["Z", "UTC"]`). TYPE: `str`
`5`	1- or 2-digit signed integer time zone shift (relative to UTC). TYPE: `str`

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

core.patterns.DOMAIN_PATTERN `module-attribute` ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from (domain.ext) in the RFC 822 Received header of emails, where the parenthesized domain.ext is the captured group.
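The following sketch applies the same expression to a made-up Received header. The header content is hypothetical and only serves to show what ends up in the first group.

```python
import re

# Same expression as DOMAIN_PATTERN above.
DOMAIN_PATTERN = re.compile(
    r"from ((?:[a-z0-9\-_]{0,61}\.)+[a-z]{2,})", re.IGNORECASE
)

# Hypothetical header line, for illustration only.
header = "Received: from mail.example.com (unknown [192.0.2.1]) by mx.test.org"

match = DOMAIN_PATTERN.search(header)
sender_domain = match.group(1) if match else None
print(sender_domain)
```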

core.patterns.UID_PATTERN `module-attribute` ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

core.patterns.FLAGS_PATTERN `module-attribute` ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
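A short sketch of how `UID_PATTERN` and `FLAGS_PATTERN` might be combined on one line of an IMAP fetch response. The response line below is a hand-written sample, not output captured from a real server, and splitting the flags on whitespace is an assumption of this example.

```python
import re

# Same expressions as UID_PATTERN and FLAGS_PATTERN above.
UID_PATTERN = re.compile(r"UID ([0-9]+)")
FLAGS_PATTERN = re.compile(r"FLAGS \((.*?)\)")

# Hand-written sample of a fetch response line.
line = r"12 (UID 4827 FLAGS (\Seen \Answered))"

uid_match = UID_PATTERN.search(line)
flags_match = FLAGS_PATTERN.search(line)

uid = int(uid_match.group(1)) if uid_match else None
flags = flags_match.group(1).split() if flags_match else []
print(uid, flags)
```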

core.patterns.PATH_PATTERN `module-attribute` ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

core.patterns.PARTIAL_PATH_REGEX `module-attribute` ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

This will collide with date detection, so run it after date detection in the pipeline.

core.patterns.RESOLUTION_PATTERN `module-attribute` ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.
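For example, the pattern can be used with `findall` to list every resolution in a sentence. The sample sentence is made up; because the pattern has no capture group, `findall` returns whole matches.

```python
import re

# Same expression as RESOLUTION_PATTERN above.
RESOLUTION_PATTERN = re.compile(r"\d+(?:×|x|X)\d+")

resolutions = RESOLUTION_PATTERN.findall(
    "Exported at 1920x1080 px, thumbnail 10×20."
)
print(resolutions)
```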

core.patterns.NUMBER_PATTERN `module-attribute` ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with starting or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

core.patterns.HASH_PATTERN `module-attribute` ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.

core.patterns.MULTIPLE_LINES `module-attribute` ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect 2 or more consecutive newlines and tabs, possibly mixed with spaces.
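One plausible use, sketched below, is collapsing such runs into a single paragraph break. The replacement string `"\n\n"` and the helper name `collapse_blank_runs` are choices made for this example, not something the module prescribes.

```python
import re

# Same expression as MULTIPLE_LINES above.
MULTIPLE_LINES = re.compile(r"(?: ?[\t\r\n]{2,} ?)+")


def collapse_blank_runs(text: str) -> str:
    """Hypothetical helper: squeeze newline/tab runs into one blank line."""
    return MULTIPLE_LINES.sub("\n\n", text)


cleaned = collapse_blank_runs("Title\n\n\n \n\nBody\nmore")
print(repr(cleaned))
```

A single newline, as between "Body" and "more", is not touched because the character class must repeat at least twice.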

core.patterns.MULTIPLE_NEWLINES `module-attribute` ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.

core.patterns.INTERNAL_NEWLINE `module-attribute` ¤

INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')

Detect single newline characters nested inside text. Mostly useful for parsed PDFs, where line wrapping is quite literal (a newline is used instead of a space).
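A minimal sketch of un-wrapping such text by replacing the matched character with a space. The helper name `unwrap` and the sample text are assumptions of this example.

```python
import re

# Same expression as INTERNAL_NEWLINE above.
INTERNAL_NEWLINE = re.compile(r"(?<=\w)[\n\t\r]{1}(?=\w)")


def unwrap(text: str) -> str:
    """Hypothetical helper: join lines that were wrapped mid-sentence."""
    return INTERNAL_NEWLINE.sub(" ", text)


unwrapped = unwrap("a wrapped\nline.\nNext")
print(repr(unwrapped))
```

The newline after "line." is kept, since the character before it is punctuation rather than a word character.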

core.patterns.EXPOSURE `module-attribute` ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

core.patterns.PHOTOSPEED `module-attribute` ¤

PHOTOSPEED = re.compile(
    "%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
    flags=re.IGNORECASE,
)

Exposure values in EV or IL

core.patterns.SENSIBILITY `module-attribute` ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensibility in ISO or ASA

core.patterns.LUMINANCE `module-attribute` ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or Cd/m²

core.patterns.DIAPHRAGM `module-attribute` ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

core.patterns.GAIN `module-attribute` ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

core.patterns.FILE_SIZE `module-attribute` ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

core.patterns.DISTANCE `module-attribute` ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

core.patterns.PERCENT `module-attribute` ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

core.patterns.WEIGHT `module-attribute` ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

core.patterns.ANGLE `module-attribute` ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

core.patterns.TEMPERATURE `module-attribute` ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

core.patterns.FREQUENCY `module-attribute` ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

core.patterns.TEXT_DATES `module-attribute` ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual dates formats:

English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.

RETURNS	DESCRIPTION
`0`	2 digits (day number or year number, depending on language) TYPE: `str`
`1`	month (full-form or abbreviated) TYPE: `str`
`2`	2 digits (day number or year number, depending on language) TYPE: `str`
`3`	4 digits (full year) TYPE: `str`

core.patterns.BASE_64 `module-attribute` ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64 encoding
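Note that the expression requires at least 64 groups of 4 characters, so only blobs of 256 characters or more are detected. The sketch below builds such a blob with the standard `base64` module; the surrounding text and variable names are made up for illustration.

```python
import base64
import re

# Same expression as BASE_64 above.
BASE_64 = re.compile(
    r"((?:[A-Za-z0-9+\/]{4}){64,}(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?)"
)

# 256 bytes encode to 344 base64 characters, padding included.
blob = base64.b64encode(bytes(range(256))).decode("ascii")
text = f"attachment: {blob} end"

match = BASE_64.search(text)
print(len(blob), match is not None)

# A short encoded string stays below the 64-group threshold.
short_match = BASE_64.search("SGVsbG8gd29ybGQ=")
```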

core.patterns.BB_CODE `module-attribute` ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]
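A short sketch of stripping such left-overs from a forum post. Removing the whole element (tags and content) is an assumption of this example; the sample text and the helper name `strip_bb_code` are hypothetical.

```python
import re

# Same expression as BB_CODE above.
BB_CODE = re.compile(r'\[(img|quote)[a-zA-Z0-9 =\"]*?\].*?\[\/\1\]')


def strip_bb_code(text: str) -> str:
    """Hypothetical helper: drop [img] and [quote] elements entirely."""
    return BB_CODE.sub("", text)


stripped = strip_bb_code('See [img]pic.png[/img] and [quote="bob"]hi[/quote]!')
print(repr(stripped))
```

The backreference `\1` ensures that the closing tag has the same name as the opening one.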

core.patterns.MARKUP `module-attribute` ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]

core.patterns.USER `module-attribute` ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails

core.patterns.REPEATED_CHARACTERS `module-attribute` ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character repeated more than 9 times
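In practice, "more than 9 times" means the first character plus at least 9 repeats, that is 10 identical characters in a row. The sketch below checks this boundary and shows one possible clean-up (collapsing the run to a single character), which is an assumption of this example.

```python
import re

# Same expression as REPEATED_CHARACTERS above.
REPEATED_CHARACTERS = re.compile(r"(.)\1{9,}")

nine = REPEATED_CHARACTERS.search("-" * 9)   # too short to match
ten = REPEATED_CHARACTERS.search("-" * 10)   # shortest matching run

# Collapse a long run down to its first character.
collapsed = REPEATED_CHARACTERS.sub(r"\1", "wow" + "!" * 12)
print(collapsed)
```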

core.patterns.UNFINISHED_SENTENCES `module-attribute` ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences that end with 2 newline characters without any closing punctuation

core.patterns.MULTIPLE_DOTS `module-attribute` ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies dots repeated twice or more

core.patterns.MULTIPLE_DASHES `module-attribute` ¤

MULTIPLE_DASHES = re.compile('[-~]{1,}')

Identifies runs of dashes or tildes. As written, the pattern also matches a single occurrence.

core.patterns.MULTIPLE_QUESTIONS `module-attribute` ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of question marks. As written, the pattern also matches a single occurrence.

core.patterns.ORDINAL_FR `module-attribute` ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)

core.patterns.FRANCAIS `module-attribute` ¤

FRANCAIS = re.compile(
    "%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

core.patterns.DASHES `module-attribute` ¤

DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.
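A minimal sketch of splitting compound words on the matched separators. Replacing them with a space is a choice made for this example, and the helper name `split_compounds` is hypothetical.

```python
import re

# Same expression as DASHES above.
DASHES = re.compile(r"(?<=\w)(-|_|=)+(?=\w)", re.IGNORECASE)


def split_compounds(text: str) -> str:
    """Hypothetical helper: separate the parts of compound words."""
    return DASHES.sub(" ", text)


split_text = split_compounds("well-known snake_case a--b - alone")
print(split_text)
```

The free-standing dash surrounded by spaces is preserved, since both lookarounds require a word character.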

core.patterns.ALTERNATIVES `module-attribute` ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or, mr/mrs

core.patterns.PLURAL_S `module-attribute` ¤

PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

core.patterns.FEMININE_E `module-attribute` ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

core.patterns.DOUBLE_CONSONANTS `module-attribute` ¤

DOUBLE_CONSONANTS = re.compile(
    "(?<=\\w{2,})([bcfghjklmnpqrstvwxz])\\1", re.IGNORECASE
)

Identify double consonants in the middle of words.

core.patterns.FEMININE_TRICE `module-attribute` ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice.

core.patterns.ADVERB_MENT `module-attribute` ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

core.patterns.SUBSTANTIVE_TION `module-attribute` ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

core.patterns.SUBSTANTIVE_AT `module-attribute` ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

core.patterns.PARTICIPLE_ING `module-attribute` ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing
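The suffix patterns in this group rely on a variable-width lookbehind such as `(?<=\w{4,})`, which the standard library `re` module rejects, so they presumably need a compatible engine (for instance the third-party `regex` package, which is an assumption here). The sketch below shows a standard-library equivalent that captures the stem instead. `END` is a deliberately simplified stand-in for `end_of_word`, and `strip_ing` is a hypothetical helper.

```python
import re

# Simplified stand-in for end_of_word (assumption: fewer terminators).
END = r"(?=$|\s|[.,;:!?])"

# Capture the stem (4+ word characters) instead of using a lookbehind.
ING = re.compile(r"(\w{4,})ing" + END)


def strip_ing(text: str) -> str:
    """Hypothetical helper: remove a trailing -ing from long enough words."""
    return ING.sub(r"\1", text)


stemmed = strip_ing("processing images while we sing.")
print(stemmed)
```

Short words such as "sing" are left untouched because fewer than 4 characters precede the suffix.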

core.patterns.ADJECTIVE_ED `module-attribute` ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed

core.patterns.ADJECTIVE_TIF `module-attribute` ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

core.patterns.SUBSTANTIVE_Y `module-attribute` ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

core.patterns.VERB_IZ `module-attribute` ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American verbs ending in -iz that the French and the British write with -is

core.patterns.STUFF_ER `module-attribute` ¤

STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

core.patterns.BRITISH_OUR `module-attribute` ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

core.patterns.SUBSTANTIVE_ITY `module-attribute` ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

core.patterns.SUBSTANTIVE_IST `module-attribute` ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

core.patterns.SUBSTANTIVE_IQU `module-attribute` ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu

core.patterns.SUBSTANTIVE_EUR `module-attribute` ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur

core.patterns.HYPHENIZED `module-attribute` ¤

HYPHENIZED = re.compile('(?<=\\w{3,})[-–—]+ *[\\n\\r]{1,2}(?=\\w)')

Detect hyphenated words at the end of a PDF text line.

core.patterns.WAYBACK_RE `module-attribute` ¤

WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')

Find the canonical URL from web.archive.org (Wayback Machine) URLs
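A short sketch of recovering the original address from an archived link. The helper name `canonical_url`, the fallback to the input when nothing matches, and the sample URLs are assumptions of this example.

```python
import re

# Same expression as WAYBACK_RE above.
WAYBACK_RE = re.compile(r"https?://web\.archive\.org/web/[^/]+/(https?://.+)")


def canonical_url(url: str) -> str:
    """Hypothetical helper: unwrap a Wayback Machine URL if it is one."""
    match = WAYBACK_RE.match(url)
    return match.group(1) if match else url


archived = "https://web.archive.org/web/20200101000000/https://example.com/page"
original = canonical_url(archived)
print(original)
```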

Functions:¤

core.patterns.split_url ¤

split_url(url: str) -> tuple[str, str, str, str, str] | None

Split a well-formed URL following RFC3986 into base elements.

RETURNS	DESCRIPTION
`tuple[str, str, str, str, str] \| None`	A tuple of `(protocol, domain, page, parameters, anchor)`.
`tuple[str, str, str, str, str] \| None`	Empty or missing fields are initialized with empty strings, so there is no need for individual `None` checks.
`tuple[str, str, str, str, str] \| None`	If the `url` input doesn’t match a URL format, `None` is returned.
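The return contract described above can be sketched with a much simplified expression. This is not the module's implementation: `split_url_sketch` and `_SIMPLE_URL` are hypothetical names, the regex ignores the boundary handling of `URL_PATTERN`, and whether the real function keeps `://` in the protocol field is not stated in this page.

```python
from __future__ import annotations

import re

# Simplified stand-in for URL_PATTERN (assumption, for illustration only).
_SIMPLE_URL = re.compile(
    r"^(?:(ftps?|https?):)?//"  # optional protocol, then //
    r"([^/?#\s]+)"              # domain.ext
    r"(/[^?#\s]*)?"             # /page/subpage
    r"(\?[^#\s]*)?"             # ?query, before any anchor
    r"(#\S*)?$",                # #anchor
    re.IGNORECASE,
)


def split_url_sketch(url: str) -> tuple[str, str, str, str, str] | None:
    """Return (protocol, domain, page, parameters, anchor) or None."""
    match = _SIMPLE_URL.match(url)
    if match is None:
        return None
    # Missing optional groups come back as None: replace them with "".
    protocol, domain, page, params, anchor = (g or "" for g in match.groups())
    return protocol, domain, page, params, anchor


parts = split_url_sketch("https://domain.ext/page/subpage?q=x&r=0:1#anchor")
print(parts)
```

Because missing fields are empty strings, a caller only needs a single `None` check on the result before unpacking it.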
