CS207 Time Series Project

A package for storing and manipulating time series.

timeseries: Code for creating and manipulating time series

pype: A domain-specific language for time series analysis.

tsdb: A database for storing, searching, and manipulating time series.

procs: Procedures that can be executed on time series stored in the database.

fft: Cython-wrapped fftw for fast cross-correlation.

tests: Tests for other modules.

Installation/Run

The package includes an installable package file in dist/. Unpack the file, enter the directory, and run:

python setup.py install

The files needed to build the Cython FFT module for computing cross-correlation from scratch (fft.pyx, cfft.pxd, fft_defs.h, setup.py) are included, along with the generated fft.c and fft.so files. If the prebuilt fft.so works on your system, the easiest option is to use it as is. Otherwise, you must install version 3.3.4 of FFTW and compile the Cython files manually. This can be done by running:

python setup.py build_ext --inplace

from /procs/fft/.

To run the server, first make sure the documents folder is deleted (rm -r documents/), then run python go_server.py. To run the REST API, run python rest_api.py. Finally, to run the client, run python go_client.py.

Persistence, REST API, and Cython-Wrapped FFTW

Persistence

In our database, time series are each stored in separate files called {pk}.json, where {pk} is replaced by the primary key. We also have one pickled index file indices.pkl that stores data structures for all secondary indices. For example, one part of indices.pkl storing the secondary index for order might look like

'order': {B+-tree}

where {B+-tree} is a B+-tree whose nodes are lists of primary keys with the given value of order.
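The layout described above can be sketched as follows. This is a minimal illustration, not the tsdb code: the function names (write_series, build_index, save_indices, load_indices) are hypothetical, and a plain dict stands in for the B+-tree.

```python
import json
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path


def write_series(root, pk, times, values, meta):
    """Store one time series as {pk}.json under root and return the document."""
    doc = {"pk": pk, "t": list(times), "v": list(values), **meta}
    (Path(root) / f"{pk}.json").write_text(json.dumps(doc))
    return doc


def build_index(docs, field):
    """Map each value of `field` to the primary keys that carry that value.

    Assumption: a dict replaces the B+-tree used by the real database.
    """
    index = defaultdict(list)
    for doc in docs:
        if field in doc:
            index[doc[field]].append(doc["pk"])
    return dict(index)


def save_indices(root, indices):
    """Pickle all secondary indices into a single indices.pkl file."""
    with open(Path(root) / "indices.pkl", "wb") as handle:
        pickle.dump(indices, handle)


def load_indices(root):
    """Read the secondary indices back from indices.pkl."""
    with open(Path(root) / "indices.pkl", "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        docs = [
            write_series(root, "ts1", [1, 2], [3.0, 4.0], {"order": 1}),
            write_series(root, "ts2", [1, 2], [5.0, 6.0], {"order": 2}),
        ]
        save_indices(root, {"order": build_index(docs, "order")})
        print(load_indices(root))
```

Keeping every secondary index in one pickled file means a single read restores all of them, at the cost of rewriting the whole file whenever any index changes.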

To make our database persistent, we keep a serialized DocDB on disk, which is a document-based store. This is complemented by a DictDB in memory, which tracks uncommitted changes. When a commit occurs, the DictDB copy is flushed to disk, becoming part of the DocDB. If a rollback occurs, the DictDB is emptied and the persisted indices are loaded back into memory, replacing any in-memory indices that may have been altered by uncommitted changes. Time series files that are scheduled for deletion are given the extension .trash, and when a commit happens, all files with that extension are deleted. In the event of a rollback, the .trash extension is removed, restoring the original file name.
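The commit and rollback behaviour can be sketched as below. This is a simplified illustration under stated assumptions: StagedStore is a hypothetical class, not the DocDB or DictDB from tsdb, and the reloading of indices on rollback is left out.

```python
import json
import tempfile
from pathlib import Path

TRASH = ".trash"


class StagedStore:
    """Hypothetical stand-in: pending writes in memory, documents on disk."""

    def __init__(self, root):
        self.root = Path(root)
        self.pending = {}  # uncommitted documents, keyed by primary key

    def insert(self, pk, doc):
        """Stage a document; nothing is written until commit()."""
        self.pending[pk] = doc

    def delete(self, pk):
        """Drop any staged copy and mark the on-disk file for deletion."""
        self.pending.pop(pk, None)
        path = self.root / f"{pk}.json"
        if path.exists():
            path.rename(path.with_name(path.name + TRASH))

    def commit(self):
        """Flush staged documents to disk and purge files marked .trash."""
        for pk, doc in self.pending.items():
            (self.root / f"{pk}.json").write_text(json.dumps(doc))
        self.pending.clear()
        for path in list(self.root.glob("*" + TRASH)):
            path.unlink()

    def rollback(self):
        """Discard staged documents and restore files marked .trash."""
        self.pending.clear()
        for path in list(self.root.glob("*" + TRASH)):
            path.rename(path.with_name(path.name[: -len(TRASH)]))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        store = StagedStore(root)
        store.insert("ts1", {"pk": "ts1", "t": [1, 2], "v": [3, 4]})
        store.commit()
        store.delete("ts1")
        store.rollback()  # ts1.json is back under its original name
        print(sorted(p.name for p in Path(root).iterdir()))
```

Renaming instead of deleting keeps a delete cheap to undo: until the commit, the data is still on disk under a different name.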

REST API

We implemented a REST API using Tornado. Many of the routes use the following parameterization:

metadata_dict:

Query Parameter	Description
field(i)	Name of field
value(i)	Value of field
from(i)	Lower bound (inclusive) on the field's value; assumes the field is numeric
to(i)	Upper bound (inclusive) on the field's value; assumes the field is numeric
dtype(i)	Optional argument given with value(i); one of int, float, or bool. bool accepts true/false or 1/0. If no dtype is given, the value is treated as a string.

where (i) represents an ordered index starting from 1, e.g. field1=pk&value1=ts1&field2=order&from2=1&to2=5
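A query string in this format can be assembled programmatically. The helper below is a hypothetical client-side convenience (metadata_query is not part of the package); it numbers the conditions from 1 and emits only the keys each condition supplies.

```python
from urllib.parse import urlencode


def metadata_query(conditions):
    """Build field(i)/value(i)/from(i)/to(i)/dtype(i) parameters, numbered from 1."""
    params = []
    for i, cond in enumerate(conditions, start=1):
        params.append((f"field{i}", cond["field"]))
        for key in ("value", "from", "to", "dtype"):
            if key in cond:
                params.append((f"{key}{i}", cond[key]))
    return urlencode(params)


query = metadata_query([
    {"field": "pk", "value": "ts1"},
    {"field": "order", "from": 1, "to": 5},
])
print(query)  # field1=pk&value1=ts1&field2=order&from2=1&to2=5
```

The resulting string can be appended after the ? of any route that accepts metadata_dict.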

additional_params:

Query Parameter	Description
sort_by	Name of field to sort by
sort_by_increasing	true/false or 1/0 to indicate increasing or decreasing order respectively
limit	Integer argument for number of results returned

e.g. sort_by=order&sort_by_increasing=true&limit=10
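The effect of these parameters on a result set can be illustrated with a short sketch. It shows the semantics described in the table, not the server's implementation: parse_flag and apply_additional_params are hypothetical names, and defaulting to increasing order when sort_by_increasing is omitted is an assumption.

```python
def parse_flag(text):
    """Interpret 'true'/'false' or '1'/'0' as a boolean."""
    lowered = str(text).strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean flag: {text!r}")


def apply_additional_params(rows, sort_by=None, sort_by_increasing="true", limit=None):
    """Sort rows by a field, then keep at most `limit` of them."""
    rows = list(rows)
    if sort_by is not None:
        increasing = parse_flag(sort_by_increasing)  # assumed default: increasing
        rows.sort(key=lambda row: row[sort_by], reverse=not increasing)
    if limit is not None:
        rows = rows[: int(limit)]
    return rows


rows = [
    {"pk": "a", "order": 3},
    {"pk": "b", "order": 1},
    {"pk": "c", "order": 2},
]
top_two = apply_additional_params(rows, sort_by="order", sort_by_increasing="true", limit="2")
print([row["pk"] for row in top_two])  # ['b', 'c']
```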

Route	Method	Description	Parameters	Example (Using Python's requests package)
/api/timeseries	POST	Insert a timeseries	json body with fields pk, t, v corresponding to primary key (string), times (array), and values (array)	requests.post("http://localhost:5000/api/timeseries", json = {'t':list(range(1,10)), 'v':list(range(101,110)), 'pk':"1"})
/api/timeseries	GET	Select timeseries	metadata_dict, additional_params, optional fields parameter to specify what to return (default returns all fields)	requests.get("http://localhost:5000/api/timeseries?field1=pk&value1=1&fields=blarg&fields=order")
/api/timeseries	DELETE	Delete timeseries	pk argument for timeseries	requests.delete("http://localhost:5000/api/timeseries?pk=1")
/api/timeseries/upsert	POST	Upsert data into a timeseries	json body with field pk and data to upsert	requests.post("http://localhost:5000/api/timeseries/upsert", json = {'pk':"1", 'blarg':123, 'order':1})
/api/timeseries/augmented	GET	Augmented select for a timeseries	proc: name of procedure, target: fields to put results in, arg: any argument for the procedure, metadata_dict, additional_params	requests.get("http://localhost:5000/api/timeseries/augmented?proc=stats&target=mean&target=std&field1=pk&value1=1")
/api/timeseries/similarity	GET	Calculate similarity of timeseries to chosen pks	pk(i) for i=1..5 indicating vantage points for d_vp-i, metadata_dict, additional_params	requests.get("http://localhost:5000/api/timeseries/similarity?pk1=1&pk2=2&sort_by=d_vp-1&limit=2")
/api/commit	POST	Persist database to file		requests.post("http://localhost:5000/api/commit")
/api/rollback	POST	Revert to last commit		requests.post("http://localhost:5000/api/rollback")

Cython-Wrapped FFTW

For our extra feature, we wrapped FFTW in Cython for fast calculation of cross-correlation. To compute a discrete Fourier transform using FFTW, you first allocate memory for the input and output and create an fftw_plan variable specifying how long the input is and whether you want to do a forward or inverse transform. You then copy the desired input into the allocated input memory and execute the plan, leaving the result in the output memory. Finally, you free all the memory you allocated and destroy the plan. We wrote a Cython package that allows the calling function to leverage FFTW's speed without dealing with its complexity.
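For readers unfamiliar with how a Fourier transform yields a cross-correlation, the sketch below shows the transform, multiply, and invert pattern in pure Python. It is not the Cython wrapper: a naive O(n^2) DFT replaces FFTW, the function names (dft, ccor, ccor_direct) are hypothetical, and the circular convention r[k] = sum over m of x[(m + k) mod n] * y[m] is an assumption that may differ from the one used in procs/fft.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform; an O(n^2) stand-in for an FFTW plan."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def ccor(x, y):
    """Circular cross-correlation of two equal-length real sequences via the DFT."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    fx, fy = dft(x), dft(y)
    product = [a * b.conjugate() for a, b in zip(fx, fy)]
    return [value.real for value in dft(product, inverse=True)]


def ccor_direct(x, y):
    """Reference definition: r[k] = sum over m of x[(m + k) % n] * y[m]."""
    n = len(x)
    return [sum(x[(m + k) % n] * y[m] for m in range(n)) for k in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, 0.0, 0.0]
print([round(v, 6) for v in ccor(x, y)])
print(ccor_direct(x, y))
```

With a fast transform in place of dft, the same three steps cost O(n log n) instead of the O(n^2) of the direct sum.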

We decided to implement the cross-correlation as a callable function rather than as a class. This requires some sacrifices; for example, we can no longer allocate one plan and call it multiple times. However, we pass the flag FFTW_ESTIMATE when creating plans, which significantly speeds up plan creation at the expense of execution time. This tradeoff makes all cross-correlation queries reasonably fast and avoids the problem of having an extremely slow query during which the plan is set up. Furthermore, there is no guarantee that all time series in our database would be of the same length, so the time spent saving a plan could be wasted. Perhaps, in a more advanced version of our database, an advanced user who knew the sort of queries they would be making could choose which type of plan to create.
