core.patterns¤
core.patterns
¤
Contains global regular expression patterns re-used in the app. You can use https://regex101.com/ to test these conveniently.
© 2023 - Aurélien Pierre
Attributes¤
core.patterns.regex_starter
module-attribute
¤
Start of line, or start of document, or start of markup
core.patterns.regex_stopper
module-attribute
¤
End of line, or end of document, or end of markup
core.patterns.end_of_word
module-attribute
¤
End of word, or end of line, or end of document, or end of markup
core.patterns.IP_PATTERN
module-attribute
¤
IP_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)
IPv4 and IPv6 patterns where the whole IP is captured in the first group.
core.patterns.EMAIL_PATTERN
module-attribute
¤
EMAIL_PATTERN = re.compile(
"<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
re.IGNORECASE,
)
Emails patterns like <me@mail.com> or me@mail.com where the whole address is captured in the first group.
core.patterns.URL_PATTERN
module-attribute
¤
URL_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)
URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page.
URL must follow RFC3986, meaning query parameters
should be before anchors, if any. Relying on this assumption allows a faster regex parsing.
- the protocol (ftp, ftps, http, https) is captured as the first group,
domain.extis captured as the second group,/page/etcis the third group, including leading and trailing/,- page query parameters
?s=x&r=0, including?, is the fourth group if the URL declares...?params#anchor, - anchor
#anchoris the fifth group, including#, if the URL declares...?params#anchor.
URLs are captured if they are:
- alone on their own line,
- enclosed in
{},[],() - enclosed in whitespaces.
Warning: URLs enclosed in (), [] and {} may retain the closing sign
as part of the page name since () and [] are valid in URL pathes
and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON
will need to be parsed ahead.
core.patterns.MEMBERS_PATTERN
module-attribute
¤
Domain patterns without leading protocol like cdn.company.com
or class members in object-oriented programming languages like params.cookies.client.
core.patterns.DATE_PATTERN
module-attribute
¤
Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22 where the whole date is captured in the first group, then each group of digits is captured in the order of appearance, in the next 3 groups
core.patterns.TIME_PATTERN
module-attribute
¤
Identify more or less standard time patterns, like :
- 12h15
- 12:15
- 12:15:00
- 12am
- 12 am
- 12 h
- 12:15:00Z
- 12:15:00+01
- 12:15:00 UTC+1
- 11:27:45+0000
| RETURNS | DESCRIPTION |
|---|---|
0
|
1- or 2-digits hour,
TYPE:
|
1
|
hour/minutes separator or half-day marker among
TYPE:
|
2
|
2-digits minutes, if any, or
TYPE:
|
3
|
2-digits seconds, if any.
TYPE:
|
4
|
hour marker (
TYPE:
|
5
|
1-or 2-digits signed integer timezone shift (referred to UTC).
TYPE:
|
Examples:
see https://regex101.com/r/QNtZAK/2
see src/tests/test-patterns.py
core.patterns.DOMAIN_PATTERN
module-attribute
¤
Matches patterns like from (domain.ext) from RFC-822 Received header in emails.
core.patterns.UID_PATTERN
module-attribute
¤
Matches email integer UID from IMAP headers.
core.patterns.FLAGS_PATTERN
module-attribute
¤
Matches email flags from IMAP headers.
core.patterns.PATH_PATTERN
module-attribute
¤
PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))
File path pattern like ~/file, /home/file, ./file or C:\windows
core.patterns.PARTIAL_PATH_REGEX
module-attribute
¤
PARTIAL_PATH_REGEX = re.compile(
"%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)
Partial, invalid path patterns missing the leading root, like home/user/stuff.
We start capturing after at least two folder separators (slash or backslash).
Warning
this will collide with date detection, so run it after in the pipeline.
core.patterns.RESOLUTION_PATTERN
module-attribute
¤
Pixel resolution like 10x20 or 10×20. Units are discarded.
core.patterns.NUMBER_PATTERN
module-attribute
¤
NUMBER_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_number, regex_stopper)
)
Signed integers and decimals, fractions and numeric IDs with interal dashes and underscores. Numbers with starting or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.
core.patterns.HASH_PATTERN
module-attribute
¤
HASH_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)
Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.
core.patterns.MULTIPLE_LINES
module-attribute
¤
Detect more than 2 newlines and tab, possibly mixed with spaces
core.patterns.MULTIPLE_NEWLINES
module-attribute
¤
Detect broken sequences of newlines and spaces.
core.patterns.INTERNAL_NEWLINE
module-attribute
¤
Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal ( used instead of space).
core.patterns.EXPOSURE
module-attribute
¤
EXPOSURE = re.compile(
"%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)
Exposure values in EV or IL
core.patterns.PHOTOSPEED
module-attribute
¤
PHOTOSPEED = re.compile(
"%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
flags=re.IGNORECASE,
)
Exposure values in EV or IL
core.patterns.SENSIBILITY
module-attribute
¤
SENSIBILITY = re.compile(
"%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
flags=re.IGNORECASE,
)
Photographic sensibility in ISO or ASA
core.patterns.LUMINANCE
module-attribute
¤
LUMINANCE = re.compile(
"%s%s%s" % (regex_starter, luminance_regex, end_of_word),
flags=re.IGNORECASE,
)
Luminance/radiance in nits or Cd/m²
core.patterns.DIAPHRAGM
module-attribute
¤
DIAPHRAGM = re.compile(
"%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)
Photographic diaph aperture values like f/2.8 or f/11
core.patterns.GAIN
module-attribute
¤
GAIN = re.compile(
"%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)
Gain, attenuation and PSNR in dB
core.patterns.FILE_SIZE
module-attribute
¤
FILE_SIZE = re.compile(
"%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)
File and memory size in bit, byte, or octet and their multiples
core.patterns.DISTANCE
module-attribute
¤
DISTANCE = re.compile(
"%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)
Distance in meter, inch, foot and their multiples
core.patterns.PERCENT
module-attribute
¤
PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))
Number followed by %
core.patterns.WEIGHT
module-attribute
¤
WEIGHT = re.compile(
"%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)
Weight (mass) in British and SI units and their multiples
core.patterns.ANGLE
module-attribute
¤
ANGLE = re.compile(
"%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)
Angles in radians, degrees and steradians
core.patterns.TEMPERATURE
module-attribute
¤
TEMPERATURE = re.compile(
"%s%s%s" % (regex_starter, temperature_regex, end_of_word),
flags=re.IGNORECASE,
)
Temperatures in °C, °F and K
core.patterns.FREQUENCY
module-attribute
¤
FREQUENCY = re.compile(
"%s%s%s" % (regex_starter, frequency_regex, end_of_word),
flags=re.IGNORECASE,
)
Frequencies in hertz and multiples
core.patterns.TEXT_DATES
module-attribute
¤
TEXT_DATES = re.compile(
"([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
flags=re.IGNORECASE | re.MULTILINE,
)
Find textual dates formats:
- English dates like
01 Jan 20or01 Jan. 2020but avoid capturing adjacent time like12:08. - French dates like
01 Jan 20or01 Jan. 2020but avoid capturing adjacent time like12:08.
| RETURNS | DESCRIPTION |
|---|---|
0
|
2 digits (day number or year number, depending on language)
TYPE:
|
1
|
month (full-form or abbreviated)
TYPE:
|
2
|
2 digits (day number or year number, depending on language)
TYPE:
|
3
|
4 digits (full year)
TYPE:
|
core.patterns.BASE_64
module-attribute
¤
BASE_64 = re.compile(
"((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)
Identifies base64 encoding
core.patterns.BB_CODE
module-attribute
¤
Identifies left-over BB code markup [img] and [quote]
core.patterns.MARKUP
module-attribute
¤
Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]
core.patterns.USER
module-attribute
¤
Identifies user handles or emails
core.patterns.REPEATED_CHARACTERS
module-attribute
¤
Identifies any character repeated more than 9 times
core.patterns.UNFINISHED_SENTENCES
module-attribute
¤
Identifies sentences finishing with 2 newlines characters without having ending punctuations
core.patterns.MULTIPLE_DOTS
module-attribute
¤
Identifies dots repeated more than twice
core.patterns.MULTIPLE_DASHES
module-attribute
¤
Identifies dashes repeated more than once
core.patterns.MULTIPLE_QUESTIONS
module-attribute
¤
Identifies question marks repeated more than once
core.patterns.ORDINAL_FR
module-attribute
¤
French ordinal numbers (numéros n°)
core.patterns.FRANCAIS
module-attribute
¤
FRANCAIS = re.compile(
"%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
% regex_starter,
flags=re.IGNORECASE,
)
French contractions of pronouns and determinants
core.patterns.DASHES
module-attribute
¤
Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.
core.patterns.ALTERNATIVES
module-attribute
¤
Slash-separated word alternatives like and/or mr/mrs
core.patterns.PLURAL_S
module-attribute
¤
PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)
Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.
core.patterns.FEMININE_E
module-attribute
¤
FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)
Identify feminine form of adjectives (French) in -e.
core.patterns.DOUBLE_CONSONANTS
module-attribute
¤
Identify double consonants in the middle of words.
core.patterns.FEMININE_TRICE
module-attribute
¤
FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)
Identify French feminine nouns in -trice.
core.patterns.ADVERB_MENT
module-attribute
¤
ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)
Identify French adverbs and English nouns ending en -ment
core.patterns.SUBSTANTIVE_TION
module-attribute
¤
SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)
Identify French and English substantives formed from verbs by adding -tion and -sion
core.patterns.SUBSTANTIVE_AT
module-attribute
¤
SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)
Identify French and English substantives formed from other nouns by adding -at
core.patterns.PARTICIPLE_ING
module-attribute
¤
PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)
Identify English substantives and present participles formed from verbs by adding -ing
core.patterns.ADJECTIVE_ED
module-attribute
¤
ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)
Identify English adjectives formed from verbs by adding -ed
core.patterns.ADJECTIVE_TIF
module-attribute
¤
ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)
Identify English and French adjectives formed from verbs by adding -tif or -tive
core.patterns.SUBSTANTIVE_Y
module-attribute
¤
SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)
Identify English substantives ending in -y
core.patterns.VERB_IZ
module-attribute
¤
VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)
Identify American verbs ending in -iz that French and Brits write in -is
core.patterns.STUFF_ER
module-attribute
¤
STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)
Identify French 1st group verb (infinitive) and English substantives ending in -er
core.patterns.BRITISH_OUR
module-attribute
¤
BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)
Identify British spelling ending in -our (colour, behaviour).
core.patterns.SUBSTANTIVE_ITY
module-attribute
¤
SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)
Identify substantives in -ity (English) and -ite (French).
core.patterns.SUBSTANTIVE_IST
module-attribute
¤
SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)
Identify substantives in -ist and -ism.
core.patterns.SUBSTANTIVE_IQU
module-attribute
¤
SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)
Identify French substantives in -iqu
core.patterns.SUBSTANTIVE_EUR
module-attribute
¤
SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)
Identify French substantives -eur
core.patterns.HYPHENIZED
module-attribute
¤
Detect hyphenized words at the end of a PDF text line.
core.patterns.WAYBACK_RE
module-attribute
¤
Find the canonical URL from web.archive.org (Wayback Machine) URLs
Functions¤
core.patterns.split_url
¤
Split a well-formed URL following RFC3986 into base elements.
| RETURNS | DESCRIPTION |
|---|---|
tuple[str, str, str, str, str] | None
|
a tuple of |
tuple[str, str, str, str, str] | None
|
Empty/missing fields are inited with empty strings so there is no need for individual |
tuple[str, str, str, str, str] | None
|
If the |