core.database¤

Create an SQLite database of web_pages to be used by a search engine.

Attributes¤

core.database.regex_starter `module-attribute` ¤

regex_starter = '(?<=^|\\s|\\[|\\(|\\{|\\<|\\\'|\\"|`|;|\\>)'

Start of line, or start of document, or start of markup

core.database.regex_stopper `module-attribute` ¤

regex_stopper = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|\\<)'

End of line, or end of document, or end of markup

core.database.end_of_word `module-attribute` ¤

end_of_word = '(?=$|\\s|\\]|\\)|\\}|\\>|\\\'|\\"|`|;|:|,|\\?|\\!|\\.|\\<)'

End of word, or end of line, or end of document, or end of markup

core.database.regex_algebra `module-attribute` ¤

regex_algebra = '[\\+\\-\\=\\≠\\±]'

Algebraic signs

core.database.IP_PATTERN `module-attribute` ¤

IP_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_ip, regex_stopper), re.IGNORECASE
)

IPv4 and IPv6 patterns where the whole IP is captured in the first group.

core.database.EMAIL_PATTERN `module-attribute` ¤

EMAIL_PATTERN = re.compile(
    "<?([0-9a-z\\-\\_\\+\\.]+?@[0-9a-z\\-\\_\\+]+(\\.[0-9a-z\\_\\-]{2,})+)>?",
    re.IGNORECASE,
)

Emails patterns like <me@mail.com> or me@mail.com where the whole address is captured in the first group.
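A minimal usage sketch, assuming the pattern is compiled exactly as shown above. It is recompiled locally with the standard `re` module so the snippet is self-contained, and the sample text is made up:

```python
import re

# Local copy of the pattern shown above, for illustration only.
EMAIL_PATTERN = re.compile(
    r"<?([0-9a-z\-\_\+\.]+?@[0-9a-z\-\_\+]+(\.[0-9a-z\_\-]{2,})+)>?",
    re.IGNORECASE,
)

text = "Write to <me@mail.com> or you@sub.example.org for details."

# Group 1 holds the address without the optional angle brackets.
addresses = [m.group(1) for m in EMAIL_PATTERN.finditer(text)]
print(addresses)
```

`finditer` is used rather than `findall` because the pattern has a second, inner group (the last domain suffix), which would make `findall` return tuples.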

core.database.URL_PATTERN `module-attribute` ¤

URL_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_url, end_of_word), re.IGNORECASE
)

URL patterns like http(s)://domain.ext/page/subpage?q=x&r=0:1#anchor or //domain.ext/page. URLs must follow RFC 3986, meaning query parameters come before the anchor, if any. Relying on this assumption allows faster regex parsing.

- the protocol (ftp, ftps, http, https) is captured as the first group,
- domain.ext is captured as the second group,
- /page/etc is the third group, including leading and trailing /,
- page query parameters ?s=x&r=0, including ?, is the fourth group if the URL declares ...?params#anchor,
- anchor #anchor is the fifth group, including #, if the URL declares ...?params#anchor.

URLs are captured if they are:

- alone on their own line,
- enclosed in {}, [], (),
- surrounded by whitespace.

Warning: URLs enclosed in (), [] and {} may retain the closing sign as part of the page name, since () and [] are valid in URL paths and parameters. This pattern will work on plain text only: Markdown, XML, HTML and JSON will need to be parsed beforehand.

core.database.MEMBERS_PATTERN `module-attribute` ¤

MEMBERS_PATTERN = re.compile('(?<=[a-z])(\\.)(?=[a-z])', re.IGNORECASE)

Domain patterns without leading protocol like cdn.company.com or class members in object-oriented programming languages like params.cookies.client.

core.database.DATE_PATTERN `module-attribute` ¤

DATE_PATTERN = re.compile(date_regex, re.IGNORECASE)

Dates like 2022-12-01, 01-12-2022, 01-12-22, 01/12/2022, 01/12/22, where the whole date is captured in the first group, then each group of digits is captured in order of appearance in the next 3 groups.

core.database.TIME_PATTERN `module-attribute` ¤

TIME_PATTERN = re.compile(time_regex, re.IGNORECASE)

Identify more or less standard time patterns, such as:

12h15
12:15
12:15:00
12am
12 am
12 h
12:15:00Z
12:15:00+01
12:15:00 UTC+1
11:27:45+0000

RETURNS	DESCRIPTION
`0`	1- or 2-digit hour. TYPE: `str`
`1`	hour/minutes separator or half-day marker among `["h", ":", "am", "pm"]` (case-insensitive). TYPE: `str`
`2`	2-digit minutes, if any, or `None`. TYPE: `str`
`3`	2-digit seconds, if any. TYPE: `str`
`4`	hour marker (`h` or `H`), half-day marker (case-insensitive `["am", "pm"]`), or time zone marker (case-sensitive `["Z", "UTC"]`). TYPE: `str`
`5`	1- or 2-digit signed integer time zone shift (relative to UTC). TYPE: `str`

Examples:

see https://regex101.com/r/QNtZAK/2

see src/tests/test-patterns.py

core.database.DOMAIN_PATTERN `module-attribute` ¤

DOMAIN_PATTERN = re.compile(
    "from ((?:[a-z0-9\\-_]{0,61}\\.)+[a-z]{2,})", re.IGNORECASE
)

Matches patterns like from domain.ext in the RFC 822 Received header of emails; domain.ext is captured in the first group.
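A short sketch of the pattern applied to a header line. The pattern is recompiled locally as shown above; the header content is hypothetical and uses documentation-reserved names and addresses:

```python
import re

# Local copy of the pattern shown above, for illustration only.
DOMAIN_PATTERN = re.compile(
    r"from ((?:[a-z0-9\-_]{0,61}\.)+[a-z]{2,})", re.IGNORECASE
)

# Hypothetical Received header.
header = "Received: from mail.example.com (mail.example.com [192.0.2.1]) by mx.example.net"

match = DOMAIN_PATTERN.search(header)
sender_domain = match.group(1) if match else None
print(sender_domain)
```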

core.database.UID_PATTERN `module-attribute` ¤

UID_PATTERN = re.compile('UID ([0-9]+)')

Matches email integer UID from IMAP headers.

core.database.FLAGS_PATTERN `module-attribute` ¤

FLAGS_PATTERN = re.compile('FLAGS \\((.*?)\\)')

Matches email flags from IMAP headers.
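As an illustration, the sketch below applies UID_PATTERN and FLAGS_PATTERN to one metadata line. The line is a made-up example of what an IMAP FETCH response could look like once decoded to `str`; both patterns are recompiled locally as shown above:

```python
import re

# Local copies of the patterns shown above, for illustration only.
UID_PATTERN = re.compile(r"UID ([0-9]+)")
FLAGS_PATTERN = re.compile(r"FLAGS \((.*?)\)")

# Hypothetical decoded FETCH metadata.
response = r"1 (UID 42 FLAGS (\Seen \Flagged))"

uid = int(UID_PATTERN.search(response).group(1))
# The lazy group stops at the first closing parenthesis; flags are space-separated.
flags = FLAGS_PATTERN.search(response).group(1).split()
print(uid, flags)
```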

core.database.PATH_PATTERN `module-attribute` ¤

PATH_PATTERN = re.compile('%s%s%s' % (regex_starter, path_regex, end_of_word))

File path pattern like ~/file, /home/file, ./file or C:\windows

core.database.PARTIAL_PATH_REGEX `module-attribute` ¤

PARTIAL_PATH_REGEX = re.compile(
    "%s%s%s" % (regex_starter, partial_path_regex, end_of_word)
)

Partial, invalid path patterns missing the leading root, like home/user/stuff. We start capturing after at least two folder separators (slash or backslash).

Warning

This collides with date detection, so run it after date detection in the pipeline.

core.database.RESOLUTION_PATTERN `module-attribute` ¤

RESOLUTION_PATTERN = re.compile('\\d+(?:×|x|X)\\d+')

Pixel resolution like 10x20 or 10×20. Units are discarded.

core.database.NUMBER_PATTERN `module-attribute` ¤

NUMBER_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_number, regex_stopper)
)

Signed integers and decimals, fractions and numeric IDs with internal dashes and underscores. Numbers with leading or trailing units are not considered. Lazy decimals (.1 and 1.) are considered.

core.database.HASH_PATTERN `module-attribute` ¤

HASH_PATTERN = re.compile(
    "%s%s%s" % (regex_starter, regex_hash, end_of_word), re.IGNORECASE
)

Cryptographic hexadecimal hashes and fingerprints, of a min length of 8 characters.

core.database.MULTIPLE_LINES `module-attribute` ¤

MULTIPLE_LINES = re.compile('(?: ?[\\t\\r\\n]{2,} ?)+')

Detect runs of 2 or more newlines and tabs, possibly mixed with spaces.

core.database.MULTIPLE_NEWLINES `module-attribute` ¤

MULTIPLE_NEWLINES = re.compile('(?: ?[\\t\\r\\n]+ ?){2,}')

Detect broken sequences of newlines and spaces.

core.database.INTERNAL_NEWLINE `module-attribute` ¤

INTERNAL_NEWLINE = re.compile('(?<=\\w)[\\n\\t\\r]{1}(?=\\w)')

Detect single newline characters nested inside text. Mostly useful for parsed PDF where line wrapping is quite literal (a newline used instead of a space).
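A minimal sketch of how MULTIPLE_LINES and INTERNAL_NEWLINE could be combined to clean PDF-extracted text. Both patterns are recompiled locally as shown above; the order of the two passes and the replacement strings are assumptions, not necessarily what the module does:

```python
import re

# Local copies of the patterns shown above, for illustration only.
MULTIPLE_LINES = re.compile(r"(?: ?[\t\r\n]{2,} ?)+")
INTERNAL_NEWLINE = re.compile(r"(?<=\w)[\n\t\r]{1}(?=\w)")

raw = "A line wrapped\nby the PDF extractor.\n\n\nNext paragraph"

# Pass 1: normalize long runs of line breaks to a single paragraph break.
paragraphs = MULTIPLE_LINES.sub("\n\n", raw)
# Pass 2: a lone newline between two word characters is a wrapped line, so use a space.
cleaned = INTERNAL_NEWLINE.sub(" ", paragraphs)
print(repr(cleaned))
```

The paragraph break survives the second pass because its newlines are preceded by a period or another newline, not by a word character.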

core.database.EXPOSURE `module-attribute` ¤

EXPOSURE = re.compile(
    "%s%s%s" % (regex_starter, exposure_regex, end_of_word), flags=re.IGNORECASE
)

Exposure values in EV or IL

core.database.PHOTOSPEED `module-attribute` ¤

PHOTOSPEED = re.compile(
    "%s%s%s" % (regex_starter, photospeed_regex, end_of_word),
    flags=re.IGNORECASE,
)

Exposure values in EV or IL

core.database.SENSIBILITY `module-attribute` ¤

SENSIBILITY = re.compile(
    "%s%s%s" % (regex_starter, sensibility_regex, end_of_word),
    flags=re.IGNORECASE,
)

Photographic sensibility in ISO or ASA

core.database.LUMINANCE `module-attribute` ¤

LUMINANCE = re.compile(
    "%s%s%s" % (regex_starter, luminance_regex, end_of_word),
    flags=re.IGNORECASE,
)

Luminance/radiance in nits or Cd/m²

core.database.DIAPHRAGM `module-attribute` ¤

DIAPHRAGM = re.compile(
    "%s%s" % (regex_starter, diaphragm_regex), flags=re.IGNORECASE
)

Photographic diaphragm aperture values like f/2.8 or f/11

core.database.GAIN `module-attribute` ¤

GAIN = re.compile(
    "%s%s%s" % (regex_starter, gain_regex, end_of_word), flags=re.IGNORECASE
)

Gain, attenuation and PSNR in dB

core.database.FILE_SIZE `module-attribute` ¤

FILE_SIZE = re.compile(
    "%s%s%s" % (regex_starter, filesize_regex, end_of_word), flags=re.IGNORECASE
)

File and memory size in bit, byte, or octet and their multiples

core.database.DISTANCE `module-attribute` ¤

DISTANCE = re.compile(
    "%s%s%s" % (regex_starter, distance_regex, end_of_word), flags=re.IGNORECASE
)

Distance in meter, inch, foot and their multiples

core.database.PERCENT `module-attribute` ¤

PERCENT = re.compile('%s%s%s' % (regex_starter, percent_regex, end_of_word))

Number followed by %

core.database.WEIGHT `module-attribute` ¤

WEIGHT = re.compile(
    "%s%s%s" % (regex_starter, weight_regex, end_of_word), flags=re.IGNORECASE
)

Weight (mass) in British and SI units and their multiples

core.database.ANGLE `module-attribute` ¤

ANGLE = re.compile(
    "%s%s%s" % (regex_starter, angle_regex, end_of_word), flags=re.IGNORECASE
)

Angles in radians, degrees and steradians

core.database.TEMPERATURE `module-attribute` ¤

TEMPERATURE = re.compile(
    "%s%s%s" % (regex_starter, temperature_regex, end_of_word),
    flags=re.IGNORECASE,
)

Temperatures in °C, °F and K

core.database.FREQUENCY `module-attribute` ¤

FREQUENCY = re.compile(
    "%s%s%s" % (regex_starter, frequency_regex, end_of_word),
    flags=re.IGNORECASE,
)

Frequencies in hertz and multiples

core.database.TEXT_DATES `module-attribute` ¤

TEXT_DATES = re.compile(
    "([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

Find textual dates formats:

English dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.
French dates like 01 Jan 20 or 01 Jan. 2020 but avoid capturing adjacent time like 12:08.

RETURNS	DESCRIPTION
`0`	2 digits (day number or year number, depending on language) TYPE: `str`
`1`	month (full-form or abbreviated) TYPE: `str`
`2`	2 digits (day number or year number, depending on language) TYPE: `str`
`3`	4 digits (full year) TYPE: `str`
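A short sketch of the groups in practice, with the pattern recompiled locally as shown above and made-up sample sentences. Note that, as the pattern is written, the last two groups keep their leading space, so callers probably want to strip them:

```python
import re

# Local copy of the pattern shown above, for illustration only.
TEXT_DATES = re.compile(
    r"([0-9]{1,2})? (jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|jan|fév|mar|avr|mai|jui|jui|aou|sep|oct|nov|déc|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|january|february|march|april|may|june|july|august|september|october|november|december)\.?( [0-9]{1,2})?( [0-9]{2,4})(?!\:)",
    flags=re.IGNORECASE | re.MULTILINE,
)

# Day, abbreviated month with a dot, 4-digit year.
full = TEXT_DATES.search("Posted on 01 Jan. 2020 at noon")
full_groups = full.groups()

# The negative lookahead keeps the hour of "12:08" out of the date.
with_time = TEXT_DATES.search("Sent 01 Jan 20 12:08")
matched_text = with_time.group(0)

print(full_groups, matched_text)
```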

core.database.BASE_64 `module-attribute` ¤

BASE_64 = re.compile(
    "((?:[A-Za-z0-9+\\/]{4}){64,}(?:[A-Za-z0-9+\\/]{2}==|[A-Za-z0-9+\\/]{3}=)?)"
)

Identifies base64 encoding

core.database.BB_CODE `module-attribute` ¤

BB_CODE = re.compile('\\[(img|quote)[a-zA-Z0-9 =\\"]*?\\].*?\\[\\/\\1\\]')

Identifies left-over BB code markup [img] and [quote]
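A minimal sketch of stripping such markup, with the pattern recompiled locally as shown above and a made-up forum post. Since the pattern is compiled without `re.DOTALL`, blocks spanning several lines are not matched:

```python
import re

# Local copy of the pattern shown above, for illustration only.
BB_CODE = re.compile(r'\[(img|quote)[a-zA-Z0-9 =\"]*?\].*?\[\/\1\]')

post = 'Before [quote name="bob"]old message[/quote] after [img]pic.png[/img].'

# The backreference \1 forces the closing tag to match the opening one.
stripped = BB_CODE.sub("", post)
print(repr(stripped))
```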

core.database.MARKUP `module-attribute` ¤

MARKUP = re.compile('(?:\\[|\\{|\\<)([^\\n\\r]+?)(?:\\]|\\}|\\>)')

Identifies left-over HTML and Markdown markup, like <...>, {...}, [...]

core.database.USER `module-attribute` ¤

USER = re.compile('([\\w\\-\\+\\.]+)?@([\\w\\-\\+\\.]+)|(user\\-?\\d+)')

Identifies user handles or emails

core.database.REPEATED_CHARACTERS `module-attribute` ¤

REPEATED_CHARACTERS = re.compile('(.)\\1{9,}')

Identifies any character occurring 10 or more times in a row (the character itself, followed by at least 9 repeats).
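The threshold is easy to get wrong by one, so here is a small check with the pattern recompiled locally as shown above:

```python
import re

# Local copy of the pattern shown above, for illustration only.
REPEATED_CHARACTERS = re.compile(r"(.)\1{9,}")

# One captured character plus at least nine backreferences: ten in total.
matches_nine = REPEATED_CHARACTERS.search("=" * 9) is not None
matches_ten = REPEATED_CHARACTERS.search("=" * 10) is not None
print(matches_nine, matches_ten)
```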

core.database.UNFINISHED_SENTENCES `module-attribute` ¤

UNFINISHED_SENTENCES = re.compile('(?<![?!.;:])\\n\\n|\\r\\n')

Identifies sentences ending with 2 newline characters without closing punctuation. As written, the alternation also matches any \r\n sequence, whatever precedes it.

core.database.MULTIPLE_DOTS `module-attribute` ¤

MULTIPLE_DOTS = re.compile('\\.{2,}')

Identifies runs of 2 or more consecutive dots

core.database.MULTIPLE_DASHES `module-attribute` ¤

MULTIPLE_DASHES = re.compile('[-~]{1,}')

Identifies runs of one or more dashes or tildes (as written, a single dash also matches)

core.database.MULTIPLE_QUESTIONS `module-attribute` ¤

MULTIPLE_QUESTIONS = re.compile('\\?{1,}')

Identifies runs of one or more question marks (as written, a single question mark also matches)

core.database.ORDINAL_FR `module-attribute` ¤

ORDINAL_FR = re.compile('n° ?([0-9]+)')

French ordinal numbers (numéros n°)

core.database.FRANCAIS `module-attribute` ¤

FRANCAIS = re.compile(
    "%s(j|t|s|d|qu|lorsqu|quelqu|jusqu|m|c|n)\\'(?=[aeiouyéèàêâîôûïüäëöh][\\w\\s])"
    % regex_starter,
    flags=re.IGNORECASE,
)

French contractions of pronouns and determiners

core.database.DASHES `module-attribute` ¤

DASHES = re.compile('(?<=\\w)(-|_|=)+(?=\\w)', re.IGNORECASE)

Dashes in the middle of ASCII/Latin compounded words. Will not work if accented or Unicode characters are immediately surrounding the dash.

core.database.ALTERNATIVES `module-attribute` ¤

ALTERNATIVES = re.compile('(?<=[a-z])(\\/)(?=[a-z])', re.IGNORECASE)

Slash-separated word alternatives like and/or mr/mrs

core.database.PLURAL_S `module-attribute` ¤

PLURAL_S = re.compile('(?<=[a-zA-Z]{4,})s?e{0,2}s%s' % end_of_word)

Identify plural form of nouns (French and English), adjectives (French) and third-person present verbs (English) and second-person verbs (French) in -s.

core.database.FEMININE_E `module-attribute` ¤

FEMININE_E = re.compile('(?<=\\w{4,})e{1,2}%s' % end_of_word)

Identify feminine form of adjectives (French) in -e.

core.database.DOUBLE_CONSONANTS `module-attribute` ¤

DOUBLE_CONSONANTS = re.compile(
    "(?<=\\w{2,})([bcfghjklmnpqrstvwxz])\\1", re.IGNORECASE
)

Identify double consonants in the middle of words.

core.database.FEMININE_TRICE `module-attribute` ¤

FEMININE_TRICE = re.compile('(?<=\\w{4,})t(rice|eur|or)%s' % end_of_word)

Identify French feminine nouns in -trice, along with the matching -teur and -tor endings.

core.database.ADVERB_MENT `module-attribute` ¤

ADVERB_MENT = re.compile('(?<=\\w{4,})e?ment%s' % end_of_word)

Identify French adverbs and English nouns ending in -ment

core.database.SUBSTANTIVE_TION `module-attribute` ¤

SUBSTANTIVE_TION = re.compile('(?<=\\w{4,})(t|s)ion%s' % end_of_word)

Identify French and English substantives formed from verbs by adding -tion and -sion

core.database.SUBSTANTIVE_AT `module-attribute` ¤

SUBSTANTIVE_AT = re.compile('(?<=\\w{4,})at%s' % end_of_word)

Identify French and English substantives formed from other nouns by adding -at

core.database.PARTICIPLE_ING `module-attribute` ¤

PARTICIPLE_ING = re.compile('(?<=\\w{4,})ing%s' % end_of_word)

Identify English substantives and present participles formed from verbs by adding -ing

core.database.ADJECTIVE_ED `module-attribute` ¤

ADJECTIVE_ED = re.compile('(?<=\\w{4,})ed%s' % end_of_word)

Identify English adjectives formed from verbs by adding -ed

core.database.ADJECTIVE_TIF `module-attribute` ¤

ADJECTIVE_TIF = re.compile('(?<=\\w{2,})ti(f|v)%s' % end_of_word)

Identify English and French adjectives formed from verbs by adding -tif or -tive

core.database.SUBSTANTIVE_Y `module-attribute` ¤

SUBSTANTIVE_Y = re.compile('(?<=\\w{3,})y%s' % end_of_word)

Identify English substantives ending in -y

core.database.VERB_IZ `module-attribute` ¤

VERB_IZ = re.compile('(?<=\\w{4,})(i|y)z%s' % end_of_word)

Identify American verbs ending in -iz that French and Brits write in -is

core.database.STUFF_ER `module-attribute` ¤

STUFF_ER = re.compile('(?<=\\w{5,})er%s' % end_of_word)

Identify French 1st group verb (infinitive) and English substantives ending in -er

core.database.BRITISH_OUR `module-attribute` ¤

BRITISH_OUR = re.compile('(?<=\\w{3,})our%s' % end_of_word)

Identify British spelling ending in -our (colour, behaviour).

core.database.SUBSTANTIVE_ITY `module-attribute` ¤

SUBSTANTIVE_ITY = re.compile('(?<=\\w{4,})it(y|e)%s' % end_of_word)

Identify substantives in -ity (English) and -ite (French).

core.database.SUBSTANTIVE_IST `module-attribute` ¤

SUBSTANTIVE_IST = re.compile('(?<=\\w{3,})is(t|m)%s' % end_of_word)

Identify substantives in -ist and -ism.

core.database.SUBSTANTIVE_IQU `module-attribute` ¤

SUBSTANTIVE_IQU = re.compile('(?<=\\w{3,})i(qu|c)%s' % end_of_word)

Identify French substantives in -iqu

core.database.SUBSTANTIVE_EUR `module-attribute` ¤

SUBSTANTIVE_EUR = re.compile('(?<=\\w{3,})eur%s' % end_of_word)

Identify French substantives in -eur

core.database.HYPHENIZED `module-attribute` ¤

HYPHENIZED = re.compile('(?<=\\w{3,})[-–—]+ *[\\n\\r]{1,2}(?=\\w)')

Detect hyphenated words split at the end of a PDF text line.

core.database.WAYBACK_RE `module-attribute` ¤

WAYBACK_RE = re.compile('https?://web\\.archive\\.org/web/[^/]+/(https?://.+)')

Find the canonical URL from web.archive.org (Wayback Machine) URLs
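A minimal usage sketch, with the pattern recompiled locally as shown above. The `canonical_url` helper is hypothetical and only illustrates how the first group might be used:

```python
import re

# Local copy of the pattern shown above, for illustration only.
WAYBACK_RE = re.compile(r"https?://web\.archive\.org/web/[^/]+/(https?://.+)")


def canonical_url(url: str) -> str:
    """Return the archived page's original URL, or the input unchanged."""
    match = WAYBACK_RE.match(url)
    return match.group(1) if match else url


archived = "https://web.archive.org/web/20200101000000/https://example.com/page?x=1"
print(canonical_url(archived))
print(canonical_url("https://example.com/"))
```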

Classes¤

core.database.SQLitePageCorpus ¤

SQLitePageCorpus(
    db,
    query,
    params=(),
    atomic_types=(str, bytes),
    max_depth=None,
    yield_rows=False,
)

Lazily stream rows from an SQLite request, avoiding full copy.

Example

    corpus = SQLitePageCorpus(
        db,
        """
        SELECT tokenized
        FROM pages
        WHERE lang IN ('fr', 'en')
        """,
        max_depth=0
    )

- max_depth=0 will not flatten the content, so it will return the original list[list[str]] (list of sentences, aka list of lists of words),
- max_depth=1 flattens documents, so it will return list[str] (list of words).
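To illustrate the lazy-streaming idea, here is a simplified sketch, not the class itself. `LazyCorpus`, its `flatten` flag and the JSON storage of the `tokenized` column are all assumptions made for the example; the real class and storage format may differ:

```python
import json
import sqlite3
from typing import Iterator


class LazyCorpus:
    """Hypothetical stand-in: re-runs the query on every iteration."""

    def __init__(self, db: sqlite3.Connection, query: str, params: tuple = (), flatten: bool = False):
        self.db, self.query, self.params, self.flatten = db, query, params, flatten

    def __iter__(self) -> Iterator[list]:
        # The cursor fetches rows on demand, so only one document is decoded at a time.
        for (payload,) in self.db.execute(self.query, self.params):
            sentences = json.loads(payload)  # assumed storage: JSON list[list[str]]
            if self.flatten:
                # One level of flattening: list of sentences -> list of words.
                yield [word for sentence in sentences for word in sentence]
            else:
                yield sentences


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, lang TEXT, tokenized TEXT)")
db.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("https://example.com", "en", json.dumps([["hello", "world"], ["bye"]])),
)

nested = list(LazyCorpus(db, "SELECT tokenized FROM pages WHERE lang IN ('fr', 'en')"))
flat = list(LazyCorpus(db, "SELECT tokenized FROM pages", flatten=True))
print(nested, flat)
```

Because `__iter__` re-executes the query, the object can be iterated several times, which matters for consumers that make multiple passes over a corpus.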

Functions:¤

core.database.split_url ¤

split_url(url: str) -> tuple[str, str, str, str, str] | None

Split a well-formed URL following RFC3986 into base elements.

RETURNS	DESCRIPTION
`tuple[str, str, str, str, str] \| None`	a tuple of `(protocol, domain, page, parameters, anchor)`.
`tuple[str, str, str, str, str] \| None`	Empty or missing fields are initialized with empty strings, so there is no need for individual `None` checks.
`tuple[str, str, str, str, str] \| None`	If the `url` input doesn’t match a URL format, return `None`.
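The return contract can be sketched with the standard library. `split_url_sketch` is a hypothetical stand-in built on `urllib.parse.urlsplit`, not the module's regex-based implementation; in particular, whether the real function keeps the leading `?` and `#` in the last two fields is not stated here, and this sketch drops them:

```python
from __future__ import annotations

from urllib.parse import urlsplit


def split_url_sketch(url: str) -> tuple[str, str, str, str, str] | None:
    """Hypothetical stand-in returning (protocol, domain, page, parameters, anchor)."""
    parts = urlsplit(url)
    if not parts.netloc:
        return None  # no domain found: the input is not treated as a URL
    # urlsplit already yields empty strings for missing components.
    return (parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)


full = split_url_sketch("https://domain.ext/page/subpage?q=x&r=0:1#anchor")
relative = split_url_sketch("//domain.ext/page")
invalid = split_url_sketch("not a url")
print(full, relative, invalid)
```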

core.database.adapt_array ¤

adapt_array(arr: np.ndarray)

Convert a NumPy array into a binary value that SQLite can store. Adapted from http://stackoverflow.com/a/31312102/190597 (SoulNibbler)

core.database.create_db ¤

create_db(name: str) -> sqlite3.Connection

Create the pages table if needed and add any missing columns. This doesn’t destroy existing tables, rows or columns, so it’s safe to run on any database.

Warning

Columns are inferred directly from web_page.__annotations__. Existing columns are preserved unchanged.

The url column is used as the PRIMARY KEY.
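The non-destructive behaviour described above can be sketched as follows. `Page`, `SQL_TYPES` and `ensure_pages_table` are hypothetical names, and the Python-to-SQL type mapping is an assumption; only `CREATE TABLE IF NOT EXISTS`, `PRAGMA table_info` and `ALTER TABLE ... ADD COLUMN` are standard SQLite:

```python
import sqlite3
from dataclasses import dataclass

SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL", bytes: "BLOB"}  # assumed mapping


@dataclass
class Page:  # hypothetical stand-in for web_page
    url: str
    title: str
    length: int


def ensure_pages_table(db: sqlite3.Connection, model: type) -> None:
    columns = {name: SQL_TYPES.get(tp, "BLOB") for name, tp in model.__annotations__.items()}
    # Create the table only if it is missing; existing data is left alone.
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY)")
    # Row layout of table_info: (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in db.execute("PRAGMA table_info(pages)")}
    for name, sql_type in columns.items():
        if name not in existing:
            db.execute(f'ALTER TABLE pages ADD COLUMN "{name}" {sql_type}')
    db.commit()


# An older database that lacks the `length` column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO pages VALUES ('https://example.com', 'Home')")

ensure_pages_table(db, Page)
column_names = [row[1] for row in db.execute("PRAGMA table_info(pages)")]
row_count = db.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(column_names, row_count)
```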

core.database.create_temp_db ¤

create_temp_db(
    min_free: float = 2.0, filename: str | None = None
) -> sqlite3.Connection

Create a temporary SQLite database file (in /dev/shm when available) and initialize the pages table according to web_page annotations.

PARAMETER	DESCRIPTION
`min_free`	minimum available disk space in GiB required to create the temporary database. This is checked at runtime and the function will raise an error if the condition is not met. TYPE: `float` DEFAULT: `2.0`
`filename`	the full path and filename to save the temporary database, if it needs to be reused at some point. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`sqlite3.Connection`	the sqlite3.Connection opened in bulk mode.

WARNING

the temporary SQLite database doesn’t use web_page URL as primary key, to allow later deduplication.

core.database.delete_temp_db ¤

delete_temp_db(db: sqlite3.Connection)

Close and delete a temporary database in one shot.

core.database.open_db ¤

open_db(name: str, mode: str = 'rw') -> sqlite3.Connection

Open an SQLite database with workload-specific optimizations.

PARAMETER	DESCRIPTION
`name`	Database identifier/path passed to `get_models_folder()`. TYPE: `str`
`mode`	“rw”: Generic read/write mode. “ro”: Read-only immutable mode optimized for serving/search workloads. “bulk”: Bulk-ingestion mode optimized for large batch writes. TYPE: `str` DEFAULT: `'rw'`

RETURNS	DESCRIPTION
`sqlite3.Connection`	sqlite3.Connection
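A rough sketch of what mode-specific opening could look like. `open_sketch` is hypothetical and takes a plain file path; the specific URI flags and PRAGMA values are assumptions chosen for illustration, and the actual function may use different settings:

```python
import sqlite3
import tempfile
from pathlib import Path


def open_sketch(path: str, mode: str = "rw") -> sqlite3.Connection:
    if mode not in ("rw", "ro", "bulk"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "ro":
        # Read-only URI; immutable=1 tells SQLite the file will not change underneath it.
        uri = Path(path).resolve().as_uri() + "?mode=ro&immutable=1"
        return sqlite3.connect(uri, uri=True)
    db = sqlite3.connect(path)
    if mode == "bulk":
        # Trade durability for write speed during large imports (assumed settings).
        db.execute("PRAGMA synchronous = OFF")
        db.execute("PRAGMA journal_mode = MEMORY")
    return db


with tempfile.TemporaryDirectory() as folder:
    path = str(Path(folder) / "pages.db")
    writer = open_sketch(path, "bulk")
    writer.execute("CREATE TABLE pages (url TEXT PRIMARY KEY)")
    writer.commit()
    writer.close()

    reader = open_sketch(path, "ro")
    try:
        reader.execute("INSERT INTO pages VALUES ('https://example.com')")
        write_rejected = False
    except sqlite3.OperationalError:
        write_rejected = True
    reader.close()

print(write_rejected)
```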

core.database.compress_db ¤

compress_db(
    db: sqlite3.Connection,
    delete_query: str | None = None,
    delete_params: tuple | None = None,
    delete_columns: list[str] | None = None,
    repack: bool = False,
)

Optionally delete rows, then reclaim SQLite disk space.

Two reclaim strategies, picked automatically:

- Incremental (cheap, default): when the database was created with auto_vacuum = INCREMENTAL (see open_db), free pages are returned to the OS in place via PRAGMA incremental_vacuum. No full copy is made, so this needs no scratch space and cannot hit the “database or disk is full” trap. It does not defragment.
- Full repack (repack=True, or as a fallback when the DB predates the auto_vacuum setting): rewrites the whole DB tightly via VACUUM INTO + online backup. Defragments and, as a side effect, applies any pending auto_vacuum mode change so legacy DBs convert to incremental on their first full repack.

PARAMETER	DESCRIPTION
`db`	SQLite connection TYPE: `sqlite3.Connection`
`delete_query`	full DELETE SQL query TYPE: `str \| None` DEFAULT: `None`
`delete_params`	optional SQL parameters TYPE: `tuple \| None` DEFAULT: `None`
`delete_columns`	columns to NULL out before reclaiming space TYPE: `list[str] \| None` DEFAULT: `None`
`repack`	force a full defragmenting rewrite (use for slim deliverables) TYPE: `bool` DEFAULT: `False`
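The strategy selection can be sketched as below. `reclaim_space` is a hypothetical helper, and a plain `VACUUM` is used as a simplified stand-in for the VACUUM INTO + online backup repack described above:

```python
import sqlite3
import tempfile
from pathlib import Path


def reclaim_space(db: sqlite3.Connection, repack: bool = False) -> str:
    """Hypothetical helper: returns the name of the strategy that ran."""
    db.commit()  # VACUUM cannot run inside an open transaction
    auto_vacuum = db.execute("PRAGMA auto_vacuum").fetchone()[0]  # 0 none, 1 full, 2 incremental
    if auto_vacuum == 2 and not repack:
        # fetchall() steps the statement to completion so every free page is released.
        db.execute("PRAGMA incremental_vacuum").fetchall()
        return "incremental"
    db.execute("VACUUM")  # simplified stand-in for the full repack
    return "full"


with tempfile.TemporaryDirectory() as folder:
    db = sqlite3.connect(str(Path(folder) / "pages.db"))
    db.execute("PRAGMA auto_vacuum = INCREMENTAL")  # must be set before the first table exists
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html BLOB)")
    db.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        ((f"u{i}", b"x" * 8192) for i in range(100)),
    )
    db.commit()
    db.execute("DELETE FROM pages")
    db.commit()

    free_before = db.execute("PRAGMA freelist_count").fetchone()[0]
    strategy = reclaim_space(db)
    free_after = db.execute("PRAGMA freelist_count").fetchone()[0]
    db.close()

print(strategy, free_before, free_after)
```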

core.database.is_primary_key ¤

is_primary_key(db: sqlite3.Connection, table: str, column: str) -> bool

Check whether column is part of the PRIMARY KEY of table.
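One way to perform such a check, sketched under the assumption that it relies on `PRAGMA table_info` (the actual implementation is not shown here, and `is_primary_key_sketch` is a hypothetical name):

```python
import sqlite3


def is_primary_key_sketch(db: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk);
    # pk is 0 for ordinary columns and the 1-based key position otherwise.
    rows = db.execute(f'PRAGMA table_info("{table}")').fetchall()
    return any(name == column and pk > 0 for _, name, _, _, _, pk in rows)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")

url_is_key = is_primary_key_sketch(db, "pages", "url")
title_is_key = is_primary_key_sketch(db, "pages", "title")
print(url_is_key, title_is_key)
```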

core.database.populate_db ¤

populate_db(
    db: sqlite3.Connection, pages: list[web_page], batch_size: int = 4096
)

Insert or update web_page records into the SQLite database.

Existing rows are matched using the PRIMARY KEY url.

Warning

Array-like Python values are converted to bytearray then to bytes in order to be handled as BLOB by SQLite.
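A minimal sketch of a batched upsert keyed on url, including the array-to-BLOB conversion mentioned in the warning. `to_blob`, `upsert_pages` and the three-column table are hypothetical simplifications; the stdlib `array` type stands in for any array-like value exposing the buffer protocol:

```python
import sqlite3
from array import array
from itertools import islice


def to_blob(values) -> bytes:
    # Buffer-protocol objects become raw bytes, which SQLite stores as a BLOB.
    return bytes(bytearray(values))


def upsert_pages(db: sqlite3.Connection, rows, batch_size: int = 4096) -> None:
    sql = (
        "INSERT INTO pages (url, title, vector) VALUES (?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title, vector = excluded.vector"
    )
    iterator = iter(rows)
    # Consume the input in fixed-size chunks, one transaction per chunk.
    while batch := list(islice(iterator, batch_size)):
        db.executemany(sql, batch)
        db.commit()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, vector BLOB)")
upsert_pages(db, [("https://example.com", "Old title", to_blob(array("f", [0.5, 1.0])))])
upsert_pages(db, [("https://example.com", "New title", to_blob(array("f", [2.0, 4.0])))])

stored = db.execute("SELECT title, vector FROM pages").fetchall()
restored = array("f")
restored.frombytes(stored[0][1])
print(stored[0][0], restored.tolist())
```

The second call hits the same url, so the row is updated in place rather than duplicated.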

core.database.db_to_list ¤

db_to_list(db: sqlite3.Connection) -> list[web_page]

Extract all web_page rows from the pages table in db as a list of web_page

core.database.migrate_url_to_primary_key ¤

migrate_url_to_primary_key(db: sqlite3.Connection)

Rebuild the pages table using url as PRIMARY KEY for older databases that didn’t use a primary key.

core.database.merge_databases ¤

merge_databases(old_db: sqlite3.Connection, new_db: sqlite3.Connection)

Merge two pages databases.

Rows from old_db are inserted into new_db only if their URL does not already exist.

Existing rows in new_db are preserved unchanged.

Only columns existing in BOTH databases are copied.

core.database.update_pages_from_database ¤

update_pages_from_database(
    target_db: sqlite3.Connection, source_db: sqlite3.Connection
) -> list[str]

Update rows in target_db.pages from source_db.pages using url as PRIMARY KEY.

Only shared columns are updated.

Returns missing_urls: URLs present in target_db but absent from source_db.

core.database.import_pages ¤

import_pages(
    source_db: str | sqlite3.Connection,
    destination_db: str | sqlite3.Connection,
    where_clause: str = "1=1",
    params: tuple = (),
    preserve_derived: list[str] | None = None,
) -> int

Import rows from one SQLite database into another.

Both source_db and destination_db may be either a filesystem path (str) or an active sqlite3.Connection handle. Passing a Connection is the only way to target a :memory: database, since those cannot be addressed by path.

Connection lifecycle:

- Path supplied: the function opens, commits, and closes the connection itself (original behaviour).
- Connection supplied: the caller retains full control; the connection is neither committed nor closed here, so the import can participate in a larger transaction.

Rows are copied from source.pages into destination.pages. Existing rows are updated on conflict of the url primary key. Columns present in the destination but absent from the source receive NULL. Both schemas are discovered at runtime, so the function adapts automatically if either evolves.

PARAMETER	DESCRIPTION
`source_db`	Path to, or an open connection for, the source SQLite database. TYPE: `str \| sqlite3.Connection`
`destination_db`	Path to, or an open connection for, the destination SQLite database. TYPE: `str \| sqlite3.Connection`
`where_clause`	SQL WHERE clause applied to `source.pages`. Example: `"domain = ? AND date >= ?"` TYPE: `str` DEFAULT: `'1=1'`
`params`	Positional parameters bound to where_clause. TYPE: `tuple` DEFAULT: `()`
`preserve_derived`	columns whose existing value in the destination must be preserved when a conflicting (same-`url`) row’s content is unchanged, and only overwritten when the content changed (detected via `content_hash`). Use this when merging a freshly-crawled source that has not computed these derived columns yet, so re-crawling an unchanged page does not wipe its expensive artifacts (e.g. `["tokenized", "stemmed", "vectorized"]`). `None` keeps the plain “overwrite everything” upsert behaviour. TYPE: `list[str] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`int`	Number of affected rows.

Examples:

    # File → file (unchanged from before)
    import_pages("old.db", "new.db", "domain = ?", ("example.com",))

    # In-memory source → file destination
    import_pages(mem_conn, "new.db")

    # File source → in-memory destination (e.g. for tests)
    import_pages("prod.db", mem_conn, "date >= ?", ("2024-01-01",))

    # Both in-memory
    import_pages(src_conn, dst_conn)
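The `preserve_derived` behaviour can be pictured as a conditional upsert. The sketch below is one possible way to express it in SQL, assuming a `content_hash` column and a single derived column `tokenized`; the table layout and sample rows are made up, and the function's actual query may be built differently:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT, content_hash TEXT, tokenized TEXT)"
)
db.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [("a", "same text", "h1", "tok-a"), ("b", "old text", "h2", "tok-b")],
)

# Freshly crawled rows: `tokenized` has not been computed yet (NULL).
incoming = [("a", "same text", "h1", None), ("b", "new text", "h3", None)]

# Keep the existing derived value when the hash is unchanged, overwrite it otherwise.
upsert = """
INSERT INTO pages (url, content, content_hash, tokenized) VALUES (?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
    tokenized = CASE
        WHEN excluded.content_hash = pages.content_hash THEN pages.tokenized
        ELSE excluded.tokenized
    END,
    content = excluded.content,
    content_hash = excluded.content_hash
"""
db.executemany(upsert, incoming)
db.commit()

result = dict(db.execute("SELECT url, tokenized FROM pages ORDER BY url"))
print(result)
```

Page `a` keeps its expensive `tokenized` value because its hash did not change, while page `b` is reset to NULL and can be recomputed later.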

core.database.inspect_db ¤

inspect_db(db: sqlite3.Connection, message: str = '') -> None

Print useful metadata and statistics about a SQLite database.

PARAMETER	DESCRIPTION
`db`	active database connection TYPE: `sqlite3.Connection`
`message`	optional additional message to tell several inspections apart, if any. TYPE: `str` DEFAULT: `''`
