core.database¤
core.database
¤
Create an SQLite database of web_pages to be used by a search engine.
© 2024 - Aurélien Pierre
Attributes¤
core.database.regex_starter
module-attribute
¤
Start of line, or start of document, or start of markup
core.database.regex_stopper
module-attribute
¤
End of line, or end of document, or end of markup
core.database.end_of_word
module-attribute
¤
End of word, or end of line, or end of document, or end of markup
core.database.IP_PATTERN
module-attribute
¤
IP_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)
IPv4 and IPv6 patterns where the whole IP is captured in the first group.
core.database.EMAIL_PATTERN
module-attribute
¤
EMAIL_PATTERN = re.compile(
"<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
re.IGNORECASE,
)
Emails patterns like <me@mail.com> or me@mail.com where the whole address is captured in the first group.
core.database.URL_PATTERN
module-attribute
¤
URL_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)
URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page.
URL must follow RFC3986, meaning query parameters
should be before anchors, if any. Relying on this assumption allows a faster regex parsing.
- the protocol (ftp, ftps, http, https) is captured as the first group,
domain.extis captured as the second group,/page/etcis the third group, including leading and trailing/,- page query parameters
?s=x&r=0, including?, is the fourth group if the URL declares...?params#anchor, - anchor
#anchoris the fifth group, including#, if the URL declares...?params#anchor.
URLs are captured if they are:
- alone on their own line,
- enclosed in
{},[],() - enclosed in whitespaces.
Warning: URLs enclosed in (), [] and {} may retain the closing sign
as part of the page name since () and [] are valid in URL pathes
and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON
will need to be parsed ahead.
core.database.MEMBERS_PATTERN
module-attribute
¤
Domain patterns without leading protocol like cdn.company.com
or class members in object-oriented programming languages like params.cookies.client.
core.database.DATE_PATTERN
module-attribute
¤
Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22 where the whole date is captured in the first group, then each group of digits is captured in the order of appearance, in the next 3 groups
core.database.TIME_PATTERN
module-attribute
¤
Identify more or less standard time patterns, like :
- 12h15
- 12:15
- 12:15:00
- 12am
- 12 am
- 12 h
- 12:15:00Z
- 12:15:00+01
- 12:15:00 UTC+1
- 11:27:45+0000
| RETURNS | DESCRIPTION |
|---|---|
0
|
1- or 2-digits hour,
TYPE:
|
1
|
hour/minutes separator or half-day marker among
TYPE:
|
2
|
2-digits minutes, if any, or
TYPE:
|
3
|
2-digits seconds, if any.
TYPE:
|
4
|
hour marker (
TYPE:
|
5
|
1-or 2-digits signed integer timezone shift (referred to UTC).
TYPE:
|
Examples:
see https://regex101.com/r/QNtZAK/2
see src/tests/test-patterns.py
core.database.DOMAIN_PATTERN
module-attribute
¤
Matches patterns like from (domain.ext) from RFC-822 Received header in emails.
core.database.UID_PATTERN
module-attribute
¤
Matches email integer UID from IMAP headers.
core.database.FLAGS_PATTERN
module-attribute
¤
Matches email flags from IMAP headers.
core.database.PATH_PATTERN
module-attribute
¤
PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))
File path pattern like ~/file, /home/file, ./file or C:\windows
core.database.PARTIAL_PATH_REGEX
module-attribute
¤
PARTIAL_PATH_REGEX = re.compile(
"%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)
Partial, invalid path patterns missing the leading root, like home/user/stuff.
We start capturing after at least two folder separators (slash or backslash).
Warning
this will collide with date detection, so run it after in the pipeline.
core.database.RESOLUTION_PATTERN
module-attribute
¤
Pixel resolution like 10x20 or 10×20. Units are discarded.
core.database.NUMBER_PATTERN
module-attribute
¤
NUMBER_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_number, regex_stopper)
)
Signed integers and decimals, fractions and numeric IDs with interal dashes and underscores. Numbers with starting or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.
core.database.HASH_PATTERN
module-attribute
¤
HASH_PATTERN = re.compile(
"%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)
Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.
core.database.MULTIPLE_LINES
module-attribute
¤
Detect more than 2 newlines and tab, possibly mixed with spaces
core.database.MULTIPLE_NEWLINES
module-attribute
¤
Detect broken sequences of newlines and spaces.
core.database.INTERNAL_NEWLINE
module-attribute
¤
Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal ( used instead of space).
core.database.EXPOSURE
module-attribute
¤
EXPOSURE = re.compile(
"%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)
Exposure values in EV or IL
core.database.PHOTOSPEED
module-attribute
¤
PHOTOSPEED = re.compile(
"%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
flags=re.IGNORECASE,
)
Exposure values in EV or IL
core.database.SENSIBILITY
module-attribute
¤
SENSIBILITY = re.compile(
"%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
flags=re.IGNORECASE,
)
Photographic sensibility in ISO or ASA
core.database.LUMINANCE
module-attribute
¤
LUMINANCE = re.compile(
"%s%s%s" % (regex_starter, luminance_regex, end_of_word),
flags=re.IGNORECASE,
)
Luminance/radiance in nits or Cd/m²
core.database.DIAPHRAGM
module-attribute
¤
DIAPHRAGM = re.compile(
"%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)
Photographic diaph aperture values like f/2.8 or f/11
core.database.GAIN
module-attribute
¤
GAIN = re.compile(
"%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)
Gain, attenuation and PSNR in dB
core.database.FILE_SIZE
module-attribute
¤
FILE_SIZE = re.compile(
"%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)
File and memory size in bit, byte, or octet and their multiples
core.database.DISTANCE
module-attribute
¤
DISTANCE = re.compile(
"%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)
Distance in meter, inch, foot and their multiples
core.database.PERCENT
module-attribute
¤
PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))
Number followed by %
core.database.WEIGHT
module-attribute
¤
WEIGHT = re.compile(
"%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)
Weight (mass) in British and SI units and their multiples
core.database.ANGLE
module-attribute
¤
ANGLE = re.compile(
"%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)
Angles in radians, degrees and steradians
core.database.TEMPERATURE
module-attribute
¤
TEMPERATURE = re.compile(
"%s%s%s" % (regex_starter, temperature_regex, end_of_word),
flags=re.IGNORECASE,
)
Temperatures in °C, °F and K
core.database.FREQUENCY
module-attribute
¤
FREQUENCY = re.compile(
"%s%s%s" % (regex_starter, frequency_regex, end_of_word),
flags=re.IGNORECASE,
)
Frequencies in hertz and multiples
core.database.TEXT_DATES
module-attribute
¤
TEXT_DATES = re.compile(
"([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
flags=re.IGNORECASE | re.MULTILINE,
)
Find textual dates formats:
- English dates like
01 Jan 20or01 Jan. 2020but avoid capturing adjacent time like12:08. - French dates like
01 Jan 20or01 Jan. 2020but avoid capturing adjacent time like12:08.
| RETURNS | DESCRIPTION |
|---|---|
0
|
2 digits (day number or year number, depending on language)
TYPE:
|
1
|
month (full-form or abbreviated)
TYPE:
|
2
|
2 digits (day number or year number, depending on language)
TYPE:
|
3
|
4 digits (full year)
TYPE:
|
core.database.BASE_64
module-attribute
¤
BASE_64 = re.compile(
"((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)
Identifies base64 encoding
core.database.BB_CODE
module-attribute
¤
Identifies left-over BB code markup [img] and [quote]
core.database.MARKUP
module-attribute
¤
Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]
core.database.USER
module-attribute
¤
Identifies user handles or emails
core.database.REPEATED_CHARACTERS
module-attribute
¤
Identifies any character repeated more than 9 times
core.database.UNFINISHED_SENTENCES
module-attribute
¤
Identifies sentences finishing with 2 newlines characters without having ending punctuations
core.database.MULTIPLE_DOTS
module-attribute
¤
Identifies dots repeated more than twice
core.database.MULTIPLE_DASHES
module-attribute
¤
Identifies dashes repeated more than once
core.database.MULTIPLE_QUESTIONS
module-attribute
¤
Identifies question marks repeated more than once
core.database.ORDINAL_FR
module-attribute
¤
French ordinal numbers (numéros n°)
core.database.FRANCAIS
module-attribute
¤
FRANCAIS = re.compile(
"%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
% regex_starter,
flags=re.IGNORECASE,
)
French contractions of pronouns and determinants
core.database.DASHES
module-attribute
¤
Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.
core.database.ALTERNATIVES
module-attribute
¤
Slash-separated word alternatives like and/or mr/mrs
core.database.PLURAL_S
module-attribute
¤
PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)
Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.
core.database.FEMININE_E
module-attribute
¤
FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)
Identify feminine form of adjectives (French) in -e.
core.database.DOUBLE_CONSONANTS
module-attribute
¤
Identify double consonants in the middle of words.
core.database.FEMININE_TRICE
module-attribute
¤
FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)
Identify French feminine nouns in -trice.
core.database.ADVERB_MENT
module-attribute
¤
ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)
Identify French adverbs and English nouns ending en -ment
core.database.SUBSTANTIVE_TION
module-attribute
¤
SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)
Identify French and English substantives formed from verbs by adding -tion and -sion
core.database.SUBSTANTIVE_AT
module-attribute
¤
SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)
Identify French and English substantives formed from other nouns by adding -at
core.database.PARTICIPLE_ING
module-attribute
¤
PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)
Identify English substantives and present participles formed from verbs by adding -ing
core.database.ADJECTIVE_ED
module-attribute
¤
ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)
Identify English adjectives formed from verbs by adding -ed
core.database.ADJECTIVE_TIF
module-attribute
¤
ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)
Identify English and French adjectives formed from verbs by adding -tif or -tive
core.database.SUBSTANTIVE_Y
module-attribute
¤
SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)
Identify English substantives ending in -y
core.database.VERB_IZ
module-attribute
¤
VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)
Identify American verbs ending in -iz that French and Brits write in -is
core.database.STUFF_ER
module-attribute
¤
STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)
Identify French 1st group verb (infinitive) and English substantives ending in -er
core.database.BRITISH_OUR
module-attribute
¤
BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)
Identify British spelling ending in -our (colour, behaviour).
core.database.SUBSTANTIVE_ITY
module-attribute
¤
SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)
Identify substantives in -ity (English) and -ite (French).
core.database.SUBSTANTIVE_IST
module-attribute
¤
SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)
Identify substantives in -ist and -ism.
core.database.SUBSTANTIVE_IQU
module-attribute
¤
SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)
Identify French substantives in -iqu
core.database.SUBSTANTIVE_EUR
module-attribute
¤
SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)
Identify French substantives -eur
core.database.HYPHENIZED
module-attribute
¤
Detect hyphenized words at the end of a PDF text line.
core.database.WAYBACK_RE
module-attribute
¤
Find the canonical URL from web.archive.org (Wayback Machine) URLs
Classes¤
core.database.SQLitePageCorpus
¤
SQLitePageCorpus(
db,
query,
params=(),
atomic_types=(str, bytes),
max_depth=None,
yield_rows=False,
)
Lazily stream rows from an SQLite request, avoiding full copy.
Example
corpus = SQLitePageCorpus(
db,
"""
SELECT tokenized
FROM pages
WHERE lang IN ('fr', 'en')
""",
max_depth=0
)
max_depth=0 will not flatten the content, so it will return
the original list[list[str]] (list of sentences, aka list of list of words),
- max_depth=1 flattens documents, to it will return
list[str] (list of words)
Functions¤
core.database.split_url
¤
Split a well-formed URL following RFC3986 into base elements.
| RETURNS | DESCRIPTION |
|---|---|
tuple[str, str, str, str, str] | None
|
a tuple of |
tuple[str, str, str, str, str] | None
|
Empty/missing fields are inited with empty strings so there is no need for individual |
tuple[str, str, str, str, str] | None
|
If the |
core.database.adapt_array
¤
http://stackoverflow.com/a/31312102/190597 (SoulNibbler)
core.database.create_db
¤
create_db(name: str) -> sqlite3.Connection
Create the pages table if needed and add any missing columns.
This doesn’t destroy existing tables, rows or columns, so it’s safe
to run on any database.
Warning
Columns are inferred directly from web_page.__annotations__.
Existing columns are preserved unchanged.
The url column is used as the PRIMARY KEY.
core.database.create_temp_db
¤
create_temp_db(
min_free: float = 2.0, filename: str | None = None
) -> sqlite3.Connection
Create a temporary SQLite database file (in /dev/shm when available) and
initialize the pages table according to web_page annotations.
| PARAMETER | DESCRIPTION |
|---|---|
min_free
|
minimum available disk space in GiB required to create the temporary database. This is checked at runtime and the function will raise an error if the condition is not met.
TYPE:
|
filename
|
the full path and filename to save the temporary database, if it needs to be reused at some point.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
sqlite3.Connection
|
the sqlite3.Connection opened in bulk mode. |
WARNING
the temporary SQLite database doesn’t use web_page URL as primary key, to allow
later deduplication.
core.database.delete_temp_db
¤
delete_temp_db(db: sqlite3.Connection)
Close and delete a temporary database in one shot.
core.database.open_db
¤
open_db(name: str, mode: str = 'rw') -> sqlite3.Connection
Open an SQLite database with workload-specific optimizations.
| PARAMETER | DESCRIPTION |
|---|---|
name
|
Database identifier/path passed to
TYPE:
|
mode
|
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
sqlite3.Connection
|
sqlite3.Connection |
core.database.compress_db
¤
compress_db(
db: sqlite3.Connection,
delete_query: str | None = None,
delete_params: tuple | None = None,
delete_columns: list[str] | None = None,
)
Optionally delete rows, then reclaim SQLite disk space.
| PARAMETER | DESCRIPTION |
|---|---|
db
|
SQLite connection
TYPE:
|
delete_query
|
full DELETE SQL query
TYPE:
|
delete_params
|
optional SQL parameters
TYPE:
|
core.database.is_primary_key
¤
is_primary_key(db: sqlite3.Connection, table: str, column: str) -> bool
Check whether column is part of the PRIMARY KEY of table.
core.database.populate_db
¤
populate_db(
db: sqlite3.Connection, pages: list[web_page], batch_size: int = 4096
)
Insert or update web_page records into the SQLite database.
Existing rows are matched using the PRIMARY KEY url.
Warning
Array-like Python values are converted to bytearray
then to bytes in order to be handled as BLOB
by SQLite.
core.database.db_to_list
¤
db_to_list(db: sqlite3.Connection) -> list[web_page]
Extract all web_page rows from the pages table in db as a list of web_page
core.database.migrate_url_to_primary_key
¤
migrate_url_to_primary_key(db: sqlite3.Connection)
Rebuild the pages table using url as PRIMARY KEY
for older databases that didn’t use a primary key.
core.database.merge_databases
¤
merge_databases(old_db: sqlite3.Connection, new_db: sqlite3.Connection)
Merge two pages databases.
Rows from old_db are inserted into new_db
only if their URL does not already exist.
Existing rows in new_db are preserved unchanged.
Only columns existing in BOTH databases are copied.
core.database.update_pages_from_database
¤
update_pages_from_database(
target_db: sqlite3.Connection, source_db: sqlite3.Connection
) -> list[str]
Update rows in target_db.pages from source_db.pages
using url as PRIMARY KEY.
Only shared columns are updated.
Returns missing_urls: URLs present in target_db but absent from source_db.
core.database.import_pages
¤
import_pages(
source_db: str | sqlite3.Connection,
destination_db: str | sqlite3.Connection,
where_clause: str = "1=1",
params: tuple = (),
) -> int
Import rows from one SQLite database into another.
Both source_db and destination_db may be either a filesystem
path (str) or an active sqlite3.Connection handle. Passing a
Connection is the only way to target a :memory: database, since
those cannot be addressed by path.
Connection lifecycle - Path supplied – the function opens, commits, and closes the connection itself (original behaviour). - Connection supplied – the caller retains full control; the connection is neither committed nor closed here, so the import can participate in a larger transaction.
Rows are copied from source.pages into destination.pages.
Existing rows are updated on conflict of the url primary key.
Columns present in the destination but absent from the source receive
NULL. Both schemas are discovered at runtime, so the function adapts
automatically if either evolves.
| PARAMETER | DESCRIPTION |
|---|---|
source_db
|
Path to, or an open connection for, the source SQLite database.
TYPE:
|
destination_db
|
Path to, or an open connection for, the destination SQLite database.
TYPE:
|
where_clause
|
SQL WHERE clause applied to
TYPE:
|
params
|
Positional parameters bound to where_clause.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
int
|
Number of affected rows. |
Examples::
# File → file (unchanged from before)
import_pages("old.db", "new.db", "domain = ?", ("example.com",))
# In-memory source → file destination
import_pages(mem_conn, "new.db")
# File source → in-memory destination (e.g. for tests)
import_pages("prod.db", mem_conn, "date >= ?", ("2024-01-01",))
# Both in-memory
import_pages(src_conn, dst_conn)
core.database.inspect_db
¤
inspect_db(db: sqlite3.Connection, message: str = '') -> None
Print useful metadata and statistics about a SQLite database.
| PARAMETER | DESCRIPTION |
|---|---|
db
|
active database connection
TYPE:
|
message
|
optional additional message to indentify several inspections if any.
TYPE:
|