Patterns¤
patterns ¤
Contains global regular expression patterns re-used in the app. You can use https://regex101.com/ to test these conveniently.
© 2023 - Aurélien Pierre
Attributes¤
regex_starter
module-attribute
¤
Start of line, or start of document, or start of markup
regex_stopper
module-attribute
¤
End of line, or end of document, or end of markup
end_of_word
module-attribute
¤
End of word, or end of line, or end of document, or end of markup
IP_PATTERN
module-attribute
¤
IPv4 and IPv6 patterns where the whole IP is captured in the first group.
EMAIL_PATTERN
module-attribute
¤
EMAIL_PATTERN = re.compile(
"<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
re.IGNORECASE,
)
Email patterns like <me@mail.com> or me@mail.com, where the whole address is captured in the first group.
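As a minimal usage sketch, the pattern shown above can be applied with `finditer` to collect bare addresses from the first group. The helper name `extract_emails` is hypothetical and only serves the illustration.

```python
import re

# Pattern copied verbatim from the EMAIL_PATTERN definition above.
EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)


def extract_emails(text):
    """Return the bare addresses (first group) found in text.

    Hypothetical helper: finditer is used instead of findall because the
    pattern has two groups and findall would return tuples.
    """
    return [match.group(1) for match in EMAIL_PATTERN.finditer(text)]


addresses = extract_emails("From: <me@mail.com>, cc: You@Mail.co.uk")
print(addresses)
```

The optional angle brackets are matched but stay outside the first group, so both notations yield the same kind of result.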
URL_PATTERN
module-attribute
¤
URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page.
URLs must follow RFC 3986, meaning query parameters should come before anchors, if any. Relying on this assumption allows faster regex parsing.
- the protocol (ftp, ftps, http, https) is captured as the first group,
- domain.ext is captured as the second group,
- /page/etc is the third group, including leading and trailing /,
- page query parameters ?s=x&r=0, including ?, are the fourth group if the URL declares ...?params#anchor,
- anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.
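The regular expression behind URL_PATTERN is not reproduced on this page. To illustrate the five-part decomposition, the sketch below uses the standard library's `urllib.parse.urlsplit` as a stand-in, not URL_PATTERN itself. The helper name `split_url` is hypothetical, and the `?` and `#` prefixes are re-added by hand to resemble the fourth and fifth groups.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into five parts ordered like the groups listed above.

    Stand-in based on urlsplit, NOT the module's URL_PATTERN. urlsplit
    drops the "?" and "#" delimiters, so they are re-added here.
    """
    parts = urlsplit(url)
    query = "?" + parts.query if parts.query else ""
    anchor = "#" + parts.fragment if parts.fragment else ""
    return (parts.scheme, parts.netloc, parts.path, query, anchor)


full = split_url("https://domain.ext/page/subpage?q=x&r=0:1#anchor")
relative = split_url("//domain.ext/page")
print(full)
print(relative)
```

Because the query is expected before the anchor, everything after the first `#` ends up in the last element.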
URLs are captured if they are:
- alone on their own line,
- enclosed in {}, [], (),
- enclosed in whitespaces.
Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name since () and [] are valid in URL paths and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON will need to be parsed ahead.
MEMBERS_PATTERN
module-attribute
¤
Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.
DATE_PATTERN
module-attribute
¤
Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22, where the whole date is captured in the first group, then each group of digits is captured in order of appearance in the next 3 groups.
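The group layout described here (whole date first, then the three digit groups) can be sketched with a simplified expression. `SIMPLE_DATE` and `split_date` are hypothetical names, and the regex is an assumption made for illustration, not the module's DATE_PATTERN.

```python
import re

# Hypothetical simplified stand-in, NOT the module's DATE_PATTERN.
# Group 1 wraps the whole date, groups 2 to 4 hold each run of digits.
SIMPLE_DATE = re.compile(r"(([0-9]{2,4})[-/]([0-9]{2})[-/]([0-9]{2,4}))")


def split_date(text):
    """Return (whole date, first, second, third digit group) or None."""
    match = SIMPLE_DATE.search(text)
    return match.groups() if match else None


iso_date = split_date("Released on 2022-12-01.")
short_date = split_date("01/12/22")
print(iso_date)
print(short_date)
```

The digit groups are returned in order of appearance, so deciding which one is the day, month or year is left to the caller.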
TIME_PATTERN
module-attribute
¤
Identify more or less standard time patterns, like:
- 12h15
- 12:15
- 12:15:00
- 12am
- 12 am
- 12 h
- 12:15:00Z
- 12:15:00+01
- 12:15:00 UTC+1
- 11:27:45+0000
RETURNS | DESCRIPTION |
---|---|
0 | 1- or 2-digits hour, |
1 | hour/minutes separator or half-day marker among |
2 | 2-digits minutes, if any, or |
3 | 2-digits seconds, if any. |
4 | hour marker ( |
5 | 1- or 2-digits signed integer timezone shift (referred to UTC). |
Examples:
see https://regex101.com/r/QNtZAK/2
see src/tests/test-patterns.py
DOMAIN_PATTERN
module-attribute
¤
Matches patterns like from (domain.ext) from RFC-822 Received header in emails.
UID_PATTERN
module-attribute
¤
Matches email integer UID from IMAP headers.
FLAGS_PATTERN
module-attribute
¤
Matches email flags from IMAP headers.
PATH_PATTERN
module-attribute
¤
File path pattern like ~/file, /home/file, ./file or C:\windows
PARTIAL_PATH_REGEX
module-attribute
¤
Partial, invalid path patterns missing the leading root, like home/user/stuff.
We start capturing after at least two folder separators (slash or backslash).
Warning: this will collide with date detection, so run it after date detection in the pipeline.
RESOLUTION_PATTERN
module-attribute
¤
Pixel resolution like 10x20 or 10×20. Units are discarded.
NUMBER_PATTERN
module-attribute
¤
Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with starting or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.
HASH_PATTERN
module-attribute
¤
Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.
MULTIPLE_LINES
module-attribute
¤
Detect more than 2 newlines and tabs, possibly mixed with spaces
MULTIPLE_NEWLINES
module-attribute
¤
Detect broken sequences of newlines and spaces.
EXPOSURE
module-attribute
¤
EXPOSURE = re.compile(
"%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)
Exposure values in EV or IL
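The unit patterns on this page are all assembled the same way: a starting anchor, a unit-specific body and an end-of-word anchor joined with `"%s%s%s"`. The sketch below shows that composition. The three `demo_` fragments are assumptions written for illustration; the module's real `regex_starter`, `exposure_regex` and `end_of_word` are not shown on this page and may differ.

```python
import re

# All three fragments are hypothetical stand-ins, NOT the module's values.
# Start of document or position right after whitespace or a closing ">".
demo_starter = r"(?:^|(?<=[\s>]))"
# Signed number, optional space, then the EV or IL unit.
demo_exposure = r"([+-]?[0-9]+(?:[.,][0-9]+)?)\s?(ev|il)"
# Word boundary, end of document, or start of markup.
demo_end_of_word = r"(?=\b|$|<)"

DEMO_EXPOSURE = re.compile(
    "%s%s%s" % (demo_starter, demo_exposure, demo_end_of_word),
    flags=re.IGNORECASE,
)

plain = DEMO_EXPOSURE.search("Raise exposure by +1.5 EV.")
markup = DEMO_EXPOSURE.search("<b>2 il</b>")
print(plain.groups())
print(markup.groups())
```

Keeping the anchors in shared fragments means every unit pattern gets the same boundary rules, and only the middle part changes per unit.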
SENSIBILITY
module-attribute
¤
SENSIBILITY = re.compile(
"%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
flags=re.IGNORECASE,
)
Photographic sensibility in ISO or ASA
LUMINANCE
module-attribute
¤
LUMINANCE = re.compile(
"%s%s%s" % (regex_starter, luminance_regex, end_of_word),
flags=re.IGNORECASE,
)
Luminance/radiance in nits or Cd/m²
DIAPHRAGM
module-attribute
¤
Photographic diaphragm aperture values like f/2.8 or f/11
GAIN
module-attribute
¤
Gain, attenuation and PSNR in dB
FILE_SIZE
module-attribute
¤
FILE_SIZE = re.compile(
"%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)
File and memory size in bit, byte, or octet and their multiples
DISTANCE
module-attribute
¤
DISTANCE = re.compile(
"%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)
Distance in meter, inch, foot and their multiples
PERCENT
module-attribute
¤
Number followed by %
WEIGHT
module-attribute
¤
Weight (mass) in British and SI units and their multiples
ANGLE
module-attribute
¤
Angles in radians, degrees and steradians
TEMPERATURE
module-attribute
¤
TEMPERATURE = re.compile(
"%s%s%s" % (regex_starter, temperature_regex, end_of_word),
flags=re.IGNORECASE,
)
Temperatures in °C, °F and K
FREQUENCY
module-attribute
¤
FREQUENCY = re.compile(
"%s%s%s" % (regex_starter, frequency_regex, end_of_word),
flags=re.IGNORECASE,
)
Frequencies in hertz and multiples
TEXT_DATES
module-attribute
¤
TEXT_DATES = re.compile(
"([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
flags=re.IGNORECASE | re.MULTILINE,
)
Find textual date formats:
- English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
- French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
RETURNS | DESCRIPTION |
---|---|
0 | 2 digits (day number or year number, depending on language) |
1 | month (full-form or abbreviated) |
2 | 2 digits (day number or year number, depending on language) |
3 | 4 digits (full year) |
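A short usage sketch for the TEXT_DATES pattern shown above follows. The pattern is copied from the definition, only split over several string literals for readability. The helper `parse_text_date` is hypothetical. Note that the two trailing numeric groups capture their leading space, so the helper strips it.

```python
import re

# Same pattern as the TEXT_DATES definition, split across string literals.
TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|"
    "jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|"
    "janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|"
    "novembre|décembre|"
    "january|february|march|april|may|june|july|august|september|october|"
    "november|december)"
    "\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)


def parse_text_date(text):
    """Return the four groups of the first textual date, spaces stripped.

    Hypothetical helper; groups that did not take part in the match stay None.
    """
    match = TEXT_DATES.search(text)
    if match is None:
        return None
    return tuple(g.strip() if g else None for g in match.groups())


with_time = parse_text_date("Sent 01 Jan 20 12:08")
french = parse_text_date("le 3 février 2021")
print(with_time)
print(french)
```

In the first call the negative lookahead rejects `12` as a year because it is followed by a colon, so the match stops at `01 Jan 20` and the adjacent time is left out.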
BASE_64
module-attribute
¤
BASE_64 = re.compile(
"((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)
Identifies base64 encoding
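The quantifier `{64,}` applies to groups of 4 characters, so the pattern shown above only fires on runs of at least 256 base64 characters and ignores short tokens. The sketch below checks that behaviour on generated data; the sample payloads are arbitrary and chosen only for the illustration.

```python
import base64
import re

# Pattern copied verbatim from the BASE_64 definition above.
BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

# 200 bytes encode to 268 base64 characters, above the 256 threshold.
long_blob = base64.b64encode(bytes(range(200))).decode("ascii")
# 5 bytes encode to 8 characters, far below the threshold.
short_blob = base64.b64encode(b"hello").decode("ascii")

found = BASE_64.search("attachment: " + long_blob + " (end)")
ignored = BASE_64.search("token: " + short_blob)
print(len(long_blob), found is not None, ignored is None)
```

The length threshold keeps ordinary words, which also use the base64 alphabet, from being reported as encoded data.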
BB_CODE
module-attribute
¤
Identifies left-over BB code markup [img] and [quote]
MARKUP
module-attribute
¤
Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]
USER
module-attribute
¤
Identifies user handles or emails
REPEATED_CHARACTERS
module-attribute
¤
Identifies any character repeated more than 9 times
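The expression behind REPEATED_CHARACTERS is not shown on this page. One way to express "repeated more than 9 times" is a backreference with a `{9,}` quantifier, as in the sketch below. `DEMO_REPEATED` and `collapse_runs` are hypothetical names, and collapsing the run to a single character is an assumption about how such a match might be used.

```python
import re

# Hypothetical stand-in, NOT the module's REPEATED_CHARACTERS.
# One character followed by at least 9 copies of itself (10 or more in total).
DEMO_REPEATED = re.compile(r"(.)\1{9,}")


def collapse_runs(text):
    """Replace each run of 10 or more identical characters with one copy."""
    return DEMO_REPEATED.sub(r"\1", text)


collapsed = collapse_runs("wow" + "!" * 12)
untouched = collapse_runs("a" * 9)
print(collapsed)
print(untouched)
```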
UNFINISHED_SENTENCES
module-attribute
¤
Identifies sentences that end with 2 newline characters without ending punctuation
MULTIPLE_DOTS
module-attribute
¤
Identifies dots repeated more than twice
MULTIPLE_DASHES
module-attribute
¤
Identifies dashes repeated more than once
MULTIPLE_QUESTIONS
module-attribute
¤
Identifies question marks repeated more than once
ORDINAL_FR
module-attribute
¤
French ordinal numbers (numéros n°)
FRANCAIS
module-attribute
¤
FRANCAIS = re.compile(
"%s(j|t|s|l|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
% regex_starter,
flags=re.IGNORECASE,
)
French contractions of pronouns and determiners
DASHES
module-attribute
¤
Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.
ALTERNATIVES
module-attribute
¤
Slash-separated word alternatives like and/or, mr/mrs
PLURAL_S
module-attribute
¤
Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.
FEMININE_E
module-attribute
¤
Identify feminine form of adjectives (French) in -e.
DOUBLE_CONSONANTS
module-attribute
¤
Identify double consonants in the middle of words.
FEMININE_TRICE
module-attribute
¤
Identify French feminine nouns in -trice.
ADVERB_MENT
module-attribute
¤
Identify French adverbs and English nouns ending in -ment
SUBSTANTIVE_TION
module-attribute
¤
Identify French and English substantives formed from verbs by adding -tion and -sion
SUBSTANTIVE_AT
module-attribute
¤
Identify French and English substantives formed from other nouns by adding -at
PARTICIPLE_ING
module-attribute
¤
Identify English substantives and present participles formed from verbs by adding -ing
ADJECTIVE_ED
module-attribute
¤
Identify English adjectives formed from verbs by adding -ed
ADJECTIVE_TIF
module-attribute
¤
Identify English and French adjectives formed from verbs by adding -tif or -tive
SUBSTANTIVE_Y
module-attribute
¤
Identify English substantives ending in -y
VERB_IZ
module-attribute
¤
Identify American verbs ending in -iz that the French and British write with -is
STUFF_ER
module-attribute
¤
Identify French first-group verbs (infinitive) and English substantives ending in -er
BRITISH_OUR
module-attribute
¤
Identify British spelling ending in -our (colour, behaviour).
SUBSTANTIVE_ITY
module-attribute
¤
Identify substantives in -ity (English) and -ite (French).
SUBSTANTIVE_IST
module-attribute
¤
Identify substantives in -ist and -ism.
SUBSTANTIVE_IQU
module-attribute
¤
Identify French substantives in -iqu
SUBSTANTIVE_EUR
module-attribute
¤
Identify French substantives in -eur