smap-materializer

Currently works:

"Stream" focused operators (e.g. running subsample for every stream, lots of operators for one stream)
New streams are detected as they appear, and subsample(300) and subsample(3600) are run for all of them
Subsample(300) and subsample(3600) are updated every 5 minutes and every hour, respectively, for each stream
Proper metadata is set so that powerdb recognizes these as subsample streams (i.e. this can now actually be deployed, at least for auto-subsampling, barring any stability issues)
Graceful restart after failure (no need to recompute everything, just pick up from where we left off)
"Query" focused operators (e.g. queries used by berkeley.openbms, lots of streams for one set of operators)
Output metadata allows for powerdb to automatically recognize substreams
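
For illustration, here is a minimal sketch of what a windowed subsample such as subsample(300) might compute. It assumes readings are (timestamp, value) pairs with timestamps in seconds and that each window is reduced to its mean. Both are assumptions; the actual operator may pick a different representative (for example, the first reading in each window). `subsample_mean` is a hypothetical name, not a function from this repo.

```python
from collections import OrderedDict


def subsample_mean(readings, window):
    """Reduce (timestamp, value) readings to one mean value per window.

    `window` is the window length in seconds (e.g. 300 or 3600).
    Hypothetical helper: the real operator's aggregation may differ.
    """
    buckets = OrderedDict()
    for ts, value in sorted(readings):
        # Align each reading to the start of its window.
        start = ts - (ts % window)
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in buckets.items()]


readings = [(0, 1.0), (100, 3.0), (300, 5.0), (650, 7.0)]
print(subsample_mean(readings, 300))
```

Because each output point depends only on the readings in its own window, an update only needs to recompute the windows that received new data, which is what makes periodic updates and restarts cheap.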

Needs to be implemented:

Load from DB instead of shelves
What happens if a stream that we're computing for goes down?
Republisher (not sure this is actually needed; it seems simpler to keep polling readingdb directly and keep track of the latest time seen)
Propagating metadata
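
The polling alternative to the republisher could look roughly like the sketch below. It is only an outline under assumptions: `fetch(stream_id, start, end)` is a hypothetical callable standing in for whatever query is used against readingdb (it is not the readingdb API), and readings are assumed to be (timestamp, value) pairs.

```python
def poll_once(fetch, stream_id, last_seen, now):
    """Return (new_readings, updated_last_seen) for one polling pass.

    `fetch` is a hypothetical callable fetch(stream_id, start, end) that
    returns (timestamp, value) pairs with start <= timestamp <= end.
    """
    # Drop anything at or before last_seen so a reading is never processed twice.
    readings = [r for r in fetch(stream_id, last_seen, now) if r[0] > last_seen]
    if readings:
        last_seen = max(ts for ts, _ in readings)
    return readings, last_seen


# Usage with an in-memory stand-in for the database.
data = [(10, 1.0), (20, 2.0), (30, 3.0)]


def fake_fetch(stream_id, start, end):
    return [r for r in data if start <= r[0] <= end]


new, last = poll_once(fake_fetch, "stream-1", 10, 25)
print(new, last)
```

Persisting `last_seen` per stream is also what allows a restart to resume without recomputing history.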

How to use:

Adding a stream
- Start your driver
- Materializer will auto-detect the stream and start subsample(300) and subsample(3600) for it (including all historical data) with no need for user action
- If you need to add custom ops, stop materializer, edit the shelf with Python, and restart materializer (this is annoying and will be fixed once the DB is integrated)
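
Editing the shelf by hand might look something like the sketch below, which uses Python's standard `shelve` module. The shelf layout here (stream ID mapped to a list of operator strings) and the names `add_custom_op`, `ops_shelf`, and `subsample(900)` are assumptions made for illustration; check how materializer.py actually structures its shelf before changing anything, and only edit it while materializer is stopped.

```python
import shelve


def add_custom_op(shelf_path, stream_id, op):
    """Append `op` to the operator list stored for `stream_id`.

    Assumes (hypothetically) that the shelf maps stream IDs to lists of
    operator strings; the real layout may differ.
    """
    shelf = shelve.open(shelf_path, writeback=True)
    try:
        ops = shelf.setdefault(stream_id, [])
        if op not in ops:
            ops.append(op)
        return list(ops)
    finally:
        # close() writes the cached, modified entries back to disk.
        shelf.close()
```

A call would then be something like `add_custom_op("ops_shelf", "my-stream-id", "subsample(900)")`, followed by restarting materializer.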
