benchmarks

Benchmarks for comparing sqlite, MySQL, MongoDB, and InfluxDB for a common task when doing weather data analysis: create an array holding the max daily temperature and the time it was achieved, for each day of a year.

Uses the module gen_fake_data from the weewx test suites to create a year's worth of synthetic data, with a 5 minute archive interval. This is approximately 105,000 records.

Both the weewx test suites, and the weewx API must be in your PYTHONPATH. The following works for me:

PYTHONPATH="/home/tkeffer/git/weewx/bin:/home/tkeffer/git/weewx/bin/weewx/tests" python3 time_db.py

Results

Using an Intel NUC with Intel® Core™ i5-4250U CPU @ 1.30GHz × 4, and 16 GB of memory.

SQLITE

It took approximately 4 seconds to create the archive database as a single transaction, and another 18.25s to create the daily summaries.

The table was indexed on the timestamp dateTime.

Query strategies:

Using an explicit SQL query for each day of the year

SELECT dateTime, outTemp FROM archive WHERE dateTime > %d AND dateTime <= %d AND 
    outTemp = (SELECT MAX(outTemp) FROM archive WHERE dateTime > %d AND dateTime <= %d)

where the %d is filled in by the start and stop of each day in local time.

Using the weewx manager.getAggregate() function. This actually involves two queries: one for the max temperature, the other for the time of the max temperature.
Using the weewx daily manager.getAggregate() function. This takes advantage of the daily summaries built by the weewx class WXDaySummaryManager.

On /var/tmp, a conventional hard drive:

0.106
0.140
0.055

On /tmp, a solid-state disk:

0.075
0.127
0.054

MySQL

The MySQL database was on /var/lib, a conventional hard drive.

It took 15.4s to create the archive database as a single transaction, and another 25.6s to create the daily summaries.

The table was indexed on the timestamp dateTime.

Using the same three query strategies as used above with sqlite.

First run

0.16
0.27
0.18

Second run (presumably after caches have been filled):

0.026
0.048
0.068

MongoDB

It's worth noting that even though the 100,000+ records were inserted individually into the MongoDB database, it only took 25 seconds. While a bulk insert would be presumably faster, this was fast enough, so I didn't bother.

The collection was indexed on the timestamp dateTime.

The query strategy is similar to strategy #1 above, used with the SQL databases: do an aggregation for each day of the year (local time).

for span in weeutil.weeutil.genDaySpans(start_ts, stop_ts):
    rs = collection.aggregate([{"$match"   : {"dateTime" : {"$gt"  : datetime.datetime.utcfromtimestamp(span[0]), 
                                                            "$lte" : datetime.datetime.utcfromtimestamp(span[1])},
                                              "outTemp"  : {"$ne" : None}}}, 
                               {"$project" : {"dateTime" : 1, "outTemp" : 1}},
                               {"$sort" : {"outTemp":-1 } },
                               {"$limit" : 1}
                               ])

The match for non-null values of outTemp is not really necessary when you are looking for the largest value of something, but it is when you are looking for the smallest, because null sorts before numbers in MongoDB.

Total run time for the query was approximately 0.35s.

InfluxDB

The query

SELECT MAX(outTemp) FROM "wxpacket" WHERE time > %d AND time <=%d

gets the max temperature of each day and took 1.37 seconds. I couldn't find a way to get the time of each max in the same query (InfluxDB does not allow a SELECT of a SELECT, like SQL).

Using the GROUP BY query:

SELECT MAX(outTemp) FROM "wxpacket" WHERE time > %d AND time <= %d GROUP BY time(1d)

took 0.47 seconds. Unfortunately, it's giving the wrong answer because the days are being grouped by UTC time, not local time. It doesn't return the time of the max either.

Additional tests using a Sixth Normal Form schema

Later, I did some additional test using a Sixth Normal Form (6NF) schema. In 6NF, each row consists of just a key and one value. In this context, this means each row can hold only a single measurement value.

The table looked like this

dateTime	obstype	measurement
1262332800	`outTemp`	20.5
1262332800	`windSpeed`	12.5
1262332800	`windDir`	230.0
1262484000	`outTemp`	20.6
1262484000	`windSpeed`	9.5
1262484000	`windDir`	235.0
...	...	...

The primary key is the combination (obstype, dateTime).

The query looked like this:

    vec = []
    for span in weeutil.weeutil.genDaySpans(start_ts, stop_ts):
        query = "SELECT dateTime, measurement FROM bench WHERE obstype = 'outTemp' AND dateTime > ? AND dateTime <= ? AND " \
                    "measurement = (SELECT MAX(measurement) FROM bench WHERE obstype = 'outTemp' AND dateTime > ? AND dateTime <= ?)"
        cursor.execute(query, span + span)
        vec.append(tuple(cursor.fetchone()))

SQLITE using 6NF

Using /var/tmp, mounted on a conventional hard drive.

Test	time
Generate data in a single transaction	5.2s
Generate data in transaction batches of 32,000 records	8.3s
Run the query, first run	0.48s
Run the query, subsequent runs	0.48s

MySQL using 6NF

Run on a conventional hard drive.

Test	time
Generate data in a single transaction	80s
Generate data in transaction batches of 32,000 records	78s
Run the query, first run	0.47s
Run the query, subsequent runs	0.021s

Clearly, filling the caches makes a big difference for MySQL

Conclusion

Because the observation type is specified as a key, and not a schema attribute, 6NF has the advantage of flexibility. New types can be added easily.

Unfortunately, queries seem to take about 3 times as long as the standard schema.

It's worth noting that MongoDB also allows any mix of types, but takes about twice as long as MySQL using the standard schema.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.gitignore		.gitignore
README.md		README.md
time_db.py		time_db.py
time_db6nf.py		time_db6nf.py
time_influxdb.py		time_influxdb.py
time_mongo.py		time_mongo.py
time_xtypes.py		time_xtypes.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

time_db.py

time_db.py

time_db6nf.py

time_db6nf.py

time_influxdb.py

time_influxdb.py

time_mongo.py

time_mongo.py

time_xtypes.py

time_xtypes.py

Repository files navigation

benchmarks

Results

SQLITE

MySQL

MongoDB

InfluxDB

Additional tests using a Sixth Normal Form schema

SQLITE using 6NF

MySQL using 6NF

Conclusion

About

Releases

Packages

Languages

tkeffer/benchmarks

Folders and files

Latest commit

History

Repository files navigation

benchmarks

Results

SQLITE

MySQL

Additional tests using a Sixth Normal Form schema

SQLITE using 6NF

MySQL using 6NF

Conclusion

About

Resources

Stars

Watchers

Forks

Languages