sqlitefts-python

sqlitefts-python provides bindings to the tokenizer interfaces of SQLite Full-Text Search (FTS3/4 and FTS5). It allows you to write tokenizers in Python.

SQLite ships with the Full-Text Search extensions FTS3/FTS4 and FTS5, each with a set of predefined tokenizers. They are easy to use and provide enough functionality for many applications. Since Python has a built-in SQLite module, full-text search is easy to use and deploy; you don't need anything else.

However, the predefined tokenizers are not sufficient for some languages, including Japanese, and writing your own tokenizer in C is not easy. This module lets you write tokenizers in Python using CFFI, so you don't need a C compiler to write your tokenizer.

It also includes ranking functions based on peewee, a utility function for adding FTS5 auxiliary functions, and an FTS5 auxiliary function implementation.

NOTE: all connections using this module should be closed explicitly. Due to GC behavior, the program can crash if a connection is left open when it terminates.
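A minimal sketch of that pattern, using only the standard library (the database file name is illustrative):

import contextlib
import sqlite3

# contextlib.closing guarantees conn.close() is called even if an exception
# is raised; a sqlite3.Connection used as a context manager only manages
# transactions, it does not close the connection.
with contextlib.closing(sqlite3.connect('fts.db')) as conn:
    pass  # register tokenizers and run queries here; conn is closed on exit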

Sample tokenizer

The tokenizer APIs of FTS3/4 and FTS5 differ, so two separate base classes are defined.

  • a tokenizer for FTS3/4 can be used with FTS5 by wrapping it in FTS3TokenizerAdaptor (see the sketch after this list).
  • a tokenizer for FTS5 can be used with FTS3/4 as long as it does not use 'flags'.
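A sketch of the adaptor usage, assuming FTS3TokenizerAdaptor is exposed by the sqlitefts.fts5 module (check the package for the exact import path) and SimpleTokenizer is the FTS3/4 tokenizer defined in the example below:

from sqlitefts import fts5

# wrap an FTS3/4 tokenizer so it can be registered as an FTS5 tokenizer
tk = fts5.make_fts5_tokenizer(fts5.FTS3TokenizerAdaptor(SimpleTokenizer()))
fts5.register_tokenizer(conn, 'simple_tokenizer', tk)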

FTS3/4:

import re

import sqlitefts as fts

class SimpleTokenizer(fts.Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text):
        for m in self._p.finditer(text):
            s, e = m.span()
            t = text[s:e]
            l = len(t.encode('utf-8'))
            p = len(text[:s].encode('utf-8'))
            # yield (token, start offset, end offset); offsets are in bytes
            yield t, p, p + l

tk = fts.make_tokenizer_module(SimpleTokenizer())
fts.register_tokenizer(conn, 'simple_tokenizer', tk)  # conn is an open sqlite3 connection
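Once registered, the tokenizer is referenced by name in the FTS4 table definition (the table and column names below are only illustrative):

conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=simple_tokenizer)")
conn.execute("INSERT INTO docs(body) VALUES (?)", ('full-text search in Python',))
print(conn.execute("SELECT body FROM docs WHERE docs MATCH ?", ('search',)).fetchall())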

FTS5:

import re

from sqlitefts import fts5

class SimpleTokenizer(fts5.FTS5Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text, flags=None):
        for m in self._p.finditer(text):
            s, e = m.span()
            t = text[s:e]
            l = len(t.encode('utf-8'))
            p = len(text[:s].encode('utf-8'))
            # yield (token, start offset, end offset); offsets are in bytes
            yield t, p, p + l

tk = fts5.make_fts5_tokenizer(SimpleTokenizer())
fts5.register_tokenizer(conn, 'simple_tokenizer', tk)  # conn is an open sqlite3 connection
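With FTS5 the tokenize option is given as a quoted string (the table and column names below are only illustrative):

conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='simple_tokenizer')")
conn.execute("INSERT INTO docs(body) VALUES (?)", ('full-text search in Python',))
print(conn.execute("SELECT body FROM docs WHERE docs MATCH ?", ('search',)).fetchall())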

Requirements

  • Python 2.7, Python 3.7+, and PyPy2.7, PyPy3.7+ (older versions may work, but are not tested)
  • CFFI
  • FTS3/4 and/or FTS5 enabled SQLite3, or APSW (the OS/Python bundled SQLite3 shared library may not work; building SQLite3 from source or using a pre-compiled binary may be required - see the quick check sketched after this list)
    • SQLite 3.11.x has to be compiled with -DSQLITE_ENABLE_FTS3_TOKENIZER to enable the 2-arg fts3_tokenizer
    • SQLite versions older or newer than 3.11.x have no extra requirements
Note for APSW users:
  • FTS3 should work the same as with the built-in sqlite3 module - sqlite3 (_sqlite3) is used to access SQLite internals
  • sqlitefts.fts5 does not support the APSW Amalgamation build, see GH-14
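A quick way to check whether the SQLite library used by your Python build has FTS5 enabled (a minimal sketch; FTS4 can be checked the same way with fts4):

import contextlib
import sqlite3

def has_fts5():
    # creating a throwaway FTS5 table fails if the module is not compiled in
    with contextlib.closing(sqlite3.connect(':memory:')) as conn:
        try:
            conn.execute("CREATE VIRTUAL TABLE temp.t USING fts5(x)")
            return True
        except sqlite3.OperationalError:
            return False

print(has_fts5())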

Licence

This software is released under the MIT License, see LICENSE.

Thanks
