Patterns¤

patterns ¤

Contains global regular expression patterns re-used in the app. You can use https://regex101.com/ to test these conveniently.

Attributes¤

regex_starter `module-attribute` ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;)'

Look-behind for the start of a line or document, whitespace, or an opening markup sign (bracket, brace, angle bracket, quote, backtick or semicolon).

regex_stopper `module-attribute` ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;)'

Look-ahead for the end of a line or document, whitespace, or a closing markup sign (bracket, brace, angle bracket, quote, backtick or semicolon).

end_of_word `module-attribute` ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.)'

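Look-ahead for the end of a word: the end of a line or document, whitespace, a closing markup sign, or sentence punctuation (colon, comma, question mark, exclamation mark or period).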
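The difference between the two look-aheads can be sketched with the standard `re` module. The `digits_strict` and `digits_word` patterns below are hypothetical and exist only for this illustration; they are not part of the module.

```python
import re

# Copied from the module attributes documented above.
regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;)'
end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.)'

# Hypothetical patterns: a run of digits followed by each look-ahead.
digits_strict = re.compile(r"\d+" + regex_stopper)
digits_word = re.compile(r"\d+" + end_of_word)

text = "Got 42."
strict_match = digits_strict.search(text)  # None: "." is not a stopper
word_match = digits_word.search(text)      # matches "42": "." ends a word
```

`end_of_word` is therefore the more permissive choice when the token may be followed directly by sentence punctuation.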

regex_algebra `module-attribute` ¤

regex_algebra = '[\\+\\-\\=\\≠\\±]'

Algebraic signs

IP_PATTERN `module-attribute` ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

EMAIL_PATTERN `module-attribute` ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
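A minimal usage sketch, assuming the pattern is applied to plain text with `finditer`. The sample text is made up for the illustration.

```python
import re

# Copied from the module attribute documented above.
EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

text = "Contact <me@mail.com> or you@example.org for details"

# The pattern has two groups, so read group 1 explicitly
# rather than relying on findall(), which would return tuples.
addresses = [m.group(1) for m in EMAIL_PATTERN.finditer(text)]
print(addresses)
```

The angle brackets, when present, are consumed by the match but left out of the first group.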

URL_PATTERN `module-attribute` ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URLs must follow RFC 3986, meaning query parameters must come before the anchor, if any. Relying on this assumption allows faster regex matching.

the protocol (ftp, ftps, http, https) is captured as the first group,
domain.ext is captured as the second group,
/page/etc is the third group, including leading and trailing /,
page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.

URLs are captured if they are:

alone on their own line,
enclosed in {}, [] or (),
surrounded by whitespace.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name, since () and [] are valid in URL paths and parameters. This pattern works on plain text only: Markdown, XML, HTML and JSON need to be parsed beforehand.

MEMBERS_PATTERN `module-attribute` ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.
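As an illustration, the captured dot can be replaced to split such tokens into words. How the app actually uses the pattern is not stated here, so the substitution below is only an assumed use.

```python
import re

# Copied from the module attribute documented above.
MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

# A dot between two letters is replaced by a space.
members = MEMBERS_PATTERN.sub(" ", "params.cookies.client")
domain = MEMBERS_PATTERN.sub(" ", "cdn.company.com")

# Dots between digits or at the end of a sentence are left alone.
prose = MEMBERS_PATTERN.sub(" ", "Version 1.2 is out. Done")
```

Because both the look-behind and the look-ahead require a letter, decimal numbers and sentence-ending periods are not affected.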

DATE_PATTERN `module-attribute` ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22. The whole date is captured in the first group, then each group of digits is captured in order of appearance in the next 3 groups.

TIME_PATTERN `module-attribute` ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, like:

12h15
12:15
12:15:00
12am
12 am
12 h
12:15:00Z
12:15:00+01
12:15:00 UTC+1
11:27:45+0000

RETURNS	DESCRIPTION
`0`	1- or 2-digit hour. TYPE: `str`
`1`	hour/minutes separator or half-day marker among `["h", ":", "am", "pm"]` (case-insensitive). TYPE: `str`
`2`	2-digit minutes, if any, or `None`. TYPE: `str`
`3`	2-digit seconds, if any. TYPE: `str`
`4`	hour marker (`h` or `H`), half-day marker (case-insensitive `["am", "pm"]`), or time zone marker (case-sensitive `["Z", "UTC"]`). TYPE: `str`
`5`	1- or 2-digit signed integer time zone shift (relative to UTC). TYPE: `str`

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

DOMAIN_PATTERN `module-attribute` ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from domain.ext in the RFC 822 Received header of emails, where domain.ext is captured in the first group.
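A short sketch of the extraction, using a made-up Received header with reserved example domains.

```python
import re

# Copied from the module attribute documented above.
DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

# Hypothetical header value, for illustration only.
header = (
    "Received: from mail.example.com (mail.example.com [192.0.2.10]) "
    "by mx.example.org with ESMTP"
)

match = DOMAIN_PATTERN.search(header)
sender_domain = match.group(1) if match else None
print(sender_domain)
```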

UID_PATTERN `module-attribute` ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

FLAGS_PATTERN `module-attribute` ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
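`UID_PATTERN` and `FLAGS_PATTERN` can be applied to the same response line. The line below is a made-up example of an IMAP FETCH response decoded to `str`; real responses may carry additional items.

```python
import re

# Copied from the module attributes documented above.
UID_PATTERN = re.compile('UID ([0-9]+)')
FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

# Hypothetical response line, for illustration only.
line = "1 (UID 4827 FLAGS (\\Seen \\Answered))"

uid_match = UID_PATTERN.search(line)
flags_match = FLAGS_PATTERN.search(line)

# The UID is captured as a string and has to be converted explicitly.
uid = int(uid_match.group(1)) if uid_match else None
# The flags are captured as one space-separated string.
flags = flags_match.group(1).split() if flags_match else []
```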

PATH_PATTERN `module-attribute` ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

PARTIAL_PATH_REGEX `module-attribute` ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

This will collide with date detection, so run it after date detection in the pipeline.

RESOLUTION_PATTERN `module-attribute` ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.
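A minimal sketch on a made-up sentence. The pattern has no capturing group, so `findall` returns the whole matches.

```python
import re

# Copied from the module attribute documented above.
RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

text = "Export at 1920x1080 or 3840×2160 px"
found = RESOLUTION_PATTERN.findall(text)
print(found)
```

The trailing unit (`px`) is not part of either match.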

NUMBER_PATTERN `module-attribute` ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with leading or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

HASH_PATTERN `module-attribute` ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, with a minimum length of 8 characters.

MULTIPLE_LINES `module-attribute` ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect runs of 2 or more consecutive newlines and tabs, possibly mixed with spaces.

MULTIPLE_NEWLINES `module-attribute` ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.
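The difference between `MULTIPLE_LINES` and `MULTIPLE_NEWLINES` shows up when single spaces separate the newlines. The sketch below assumes both are used to collapse the matched run into one blank line; the replacement string is an assumption, not something this page states.

```python
import re

# Copied from the module attributes documented above.
MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')
MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

contiguous = "First paragraph.\n\n\n\nSecond paragraph."
broken = "a\n \n \nb"  # newlines separated by single spaces

# Contiguous newlines are collapsed by MULTIPLE_LINES.
collapsed = MULTIPLE_LINES.sub("\n\n", contiguous)
# The broken sequence never has 2 adjacent newlines, so it is left alone...
untouched = MULTIPLE_LINES.sub("\n\n", broken)
# ...while MULTIPLE_NEWLINES accepts the interleaved spaces.
repaired = MULTIPLE_NEWLINES.sub("\n\n", broken)
```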

EXPOSURE `module-attribute` ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

SENSIBILITY `module-attribute` ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensitivity in ISO or ASA

LUMINANCE `module-attribute` ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or cd/m²

DIAPHRAGM `module-attribute` ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

GAIN `module-attribute` ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

FILE_SIZE `module-attribute` ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

DISTANCE `module-attribute` ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

PERCENT `module-attribute` ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

WEIGHT `module-attribute` ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

ANGLE `module-attribute` ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

TEMPERATURE `module-attribute` ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

FREQUENCY `module-attribute` ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

TEXT_DATES `module-attribute` ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual date formats:

English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
French dates like 01 fév 20 or 01 fév. 2020 but avoid capturing adjacent time like 12:08.

RETURNS	DESCRIPTION
`0`	1 or 2 digits (day number or year number, depending on language), if any. TYPE: `str`
`1`	month (full-form or abbreviated). TYPE: `str`
`2`	1 or 2 digits (day number or year number, depending on language), if any, with a leading space. TYPE: `str`
`3`	2 to 4 digits (year), with a leading space. TYPE: `str`
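The group layout can be checked on two made-up strings. `Match.groups()` returns a 0-indexed tuple, which lines up with the numbering of the table above.

```python
import re

# Copied from the module attribute documented above.
TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

# Abbreviated month followed by a period and a 4-digit year.
full = TEXT_DATES.search("Sent on 01 Jan. 2020 at noon")
# 2-digit year followed by a time: the hour must not be swallowed.
with_time = TEXT_DATES.search("01 Jan 20 12:08")

full_groups = full.groups()
time_groups = with_time.groups()
matched_text = with_time.group(0)
```

In the second string the negative look-ahead rejects ` 12` as a year because it is followed by `:`, so the engine backtracks and takes ` 20` instead.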

BASE_64 `module-attribute` ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64-encoded blobs of at least 256 characters (64 groups of 4), with optional trailing padding.
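The length threshold can be illustrated with the standard `base64` module. The payloads below are arbitrary bytes chosen for the example.

```python
import base64
import re

# Copied from the module attribute documented above.
BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

# 256 bytes encode to 344 characters, above the 64 x 4 threshold.
long_blob = base64.b64encode(bytes(range(256))).decode("ascii")
# 5 bytes encode to 8 characters, far below the threshold.
short_blob = base64.b64encode(b"hello").decode("ascii")

long_match = BASE_64.search("payload: " + long_blob)
short_match = BASE_64.search("payload: " + short_blob)
```

Short encoded strings are ignored, which presumably keeps ordinary words from being flagged as base64.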

BB_CODE `module-attribute` ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]
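A sketch of stripping the left-over markup, assuming the matches are simply removed. The forum text is made up.

```python
import re

# Copied from the module attribute documented above.
BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

text = 'Look: [img]photo.png[/img] and [quote name="Bob"]hi[/quote] done'

# The back-reference \1 forces the closing tag to match the opening one.
cleaned = BB_CODE.sub("", text)
print(cleaned)
```

The spaces surrounding each removed tag stay in place, so a whitespace clean-up pass may be needed afterwards.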

MARKUP `module-attribute` ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]
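A minimal sketch on a made-up string. The single capturing group holds the content between the delimiters, so `findall` returns that content only.

```python
import re

# Copied from the module attribute documented above.
MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

text = "<b>bold</b> and [link] and {tpl}"
inner = MARKUP.findall(text)
print(inner)
```

The opening and closing signs are matched independently, so a mismatched pair such as `[x}` is matched as well.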

USER `module-attribute` ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails
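The three forms the pattern accepts (handle, email, numbered user) can be seen on a made-up sentence.

```python
import re

# Copied from the module attribute documented above.
USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

text = "ping @alice, bob@example.com and user-42"

# group(0) is the whole match, whichever alternative matched.
mentions = [m.group(0) for m in USER.finditer(text)]
print(mentions)
```

Since `.` belongs to the character class after `@`, a period directly following an email or handle would be included in the match.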

REPEATED_CHARACTERS `module-attribute` ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character occurring 10 or more times in a row (the character itself, then at least 9 repetitions).
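The threshold can be checked directly. Removing the matched run, as done below, is only an assumed use.

```python
import re

# Copied from the module attribute documented above.
REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

nine = REPEATED_CHARACTERS.search("a" * 9)   # too short: no match
ten = REPEATED_CHARACTERS.search("a" * 10)   # 1 character + 9 repetitions

# Typical target: ASCII separator lines in plain-text emails.
text = "Title\n" + "=" * 12 + "\nBody"
cleaned = REPEATED_CHARACTERS.sub("", text)
```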

UNFINISHED_SENTENCES `module-attribute` ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences ending with 2 newline characters without closing punctuation. As written, the second alternative also matches any \r\n sequence, whatever precedes it.
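The behaviour of both alternatives can be checked on three made-up strings. The negative look-behind belongs to the first alternative only.

```python
import re

# Copied from the module attribute documented above.
UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

# No punctuation before the blank line: matched.
unfinished = UNFINISHED_SENTENCES.search("A heading without punctuation\n\nBody text.")
# A period before the blank line: not matched.
finished = UNFINISHED_SENTENCES.search("A complete sentence.\n\nBody text.")
# A Windows line ending is matched by the second alternative,
# regardless of the punctuation before it.
crlf = UNFINISHED_SENTENCES.search("A complete sentence.\r\nBody text.")
```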

MULTIPLE_DOTS `module-attribute` ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies 2 or more consecutive dots

MULTIPLE_DASHES `module-attribute` ¤

MULTIPLE_DASHES = re.compile('-{1,}')

Identifies runs of dashes. As written, the {1,} quantifier also matches a single dash.

MULTIPLE_QUESTIONS `module-attribute` ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of question marks. As written, the {1,} quantifier also matches a single question mark.

ORDINAL_FR `module-attribute` ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)
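A minimal sketch on a made-up French sentence, with and without a space after `n°`.

```python
import re

# Copied from the module attribute documented above.
ORDINAL_FR = re.compile('n° ?([0-9]+)')

text = "Voir facture n° 42 et commande n°7."
numbers = ORDINAL_FR.findall(text)
print(numbers)

# The pattern is compiled without re.IGNORECASE,
# so an uppercase "N°" is not matched.
uppercase = ORDINAL_FR.search("Facture N° 3")
```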

FRANCAIS `module-attribute` ¤

FRANCAIS = re.compile(
    "%s(j|t|s|l|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

DASHES `module-attribute` ¤

DASHES = re.compile('(?<=\\w)(-)(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.

ALTERNATIVES `module-attribute` ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or, mr/mrs
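`DASHES` and `ALTERNATIVES` both capture a separator placed between two word characters. The sketch below assumes the separator is replaced by a space to split compounds; the actual replacement used in the app is not stated here.

```python
import re

# Copied from the module attributes documented above.
DASHES = re.compile('(?<=\\w)(-)(?=\\w)', re.IGNORECASE)
ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

text = "a well-known and/or half-baked 1/2 - idea"

# Dashes inside words are split; the free-standing " - " is kept.
no_dashes = DASHES.sub(" ", text)
# Slashes between letters are split; the fraction 1/2 is kept.
no_slashes = ALTERNATIVES.sub(" ", no_dashes)
```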

PLURAL_S `module-attribute` ¤

PLURAL_S = re.compile('(?<=\\w{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

FEMININE_E `module-attribute` ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

DOUBLE_CONSONANTS `module-attribute` ¤

DOUBLE_CONSONANTS = re.compile('(?<=\\w{2,})([^aeiouy])\\1')

Identify double consonants in the middle of words.

FEMININE_TRICE `module-attribute` ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice and their counterparts in -teur or -tor.

ADVERB_MENT `module-attribute` ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

SUBSTANTIVE_TION `module-attribute` ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

SUBSTANTIVE_AT `module-attribute` ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

PARTICIPLE_ING `module-attribute` ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing

ADJECTIVE_ED `module-attribute` ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed

ADJECTIVE_TIF `module-attribute` ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

SUBSTANTIVE_Y `module-attribute` ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

VERB_IZ `module-attribute` ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American verbs ending in -iz that French and Brits write in -is

STUFF_ER `module-attribute` ¤

STUFF_ER = re.compile('(?<=\\w{4,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

BRITISH_OUR `module-attribute` ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

SUBSTANTIVE_ITY `module-attribute` ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

SUBSTANTIVE_IST `module-attribute` ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

SUBSTANTIVE_IQU `module-attribute` ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu and their English counterparts in -ic

SUBSTANTIVE_EUR `module-attribute` ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur
