A program that matches property listings to requirements and vice versa. The project has two interfaces: HTTP and CLI. The HTTP interface is currently a work in progress; the CLI is functional.
How to use:
DB_USER=radius DB_PASSWORD=abcdefgh DB_HOST=127.0.0.1 python cli.py
Screenshot: https://www.dropbox.com/s/762f5h2h2yjwv1x/Screenshot%202019-02-10%2023.32.55.png?dl=0
The Starbucks US locations dataset was used to generate dummy data for requirements and properties.
This program uses the bounding-box method to filter out most of the non-matching locations; once the working set is small, the Haversine formula is used to calculate exact distances. All lat/long values are stored in radians to avoid doing the conversions at runtime on the SQL server.
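The two-stage filter described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the function names and the choice of Earth radius are assumptions. Inputs are already in radians, matching the storage convention above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) in radians around a point.

    Coarse prefilter: the box fully contains the circle of radius_km, so no
    true match is discarded before the exact Haversine check.
    """
    ang = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlon = math.asin(math.sin(ang) / math.cos(lat))  # widen with latitude
    return lat - ang, lat + ang, lon - dlon, lon + dlon

def haversine_km(lat1, lon1, lat2, lon2):
    """Exact great-circle distance in km; all inputs already in radians."""
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The bounding box maps directly onto B-tree range predicates in SQL, while the Haversine check runs only on the small candidate set it returns.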
Both the properties and requirements tables have (long, lat, id) as the primary key. The B-tree indexes make the range queries extremely performant. Besides the primary index, the tables also have secondary B-tree indexes on price, bedroom, and bathroom, and a hash index on id for fast lookup by ID.
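A hedged sketch of that schema, using sqlite so the example is runnable; the exact column names are assumptions, and the hash index on id is a MySQL feature noted only in a comment (sqlite has no hash indexes). The bounding-box prefilter then becomes a plain B-tree range query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE properties (
    id       INTEGER NOT NULL,
    long     REAL    NOT NULL,  -- longitude in radians
    lat      REAL    NOT NULL,  -- latitude in radians
    price    INTEGER NOT NULL,
    bedroom  INTEGER NOT NULL,
    bathroom INTEGER NOT NULL,
    PRIMARY KEY (long, lat, id)  -- composite B-tree; drives bounding-box range scans
);
-- secondary B-tree indexes for filtering on listing attributes
CREATE INDEX idx_price    ON properties (price);
CREATE INDEX idx_bedroom  ON properties (bedroom);
CREATE INDEX idx_bathroom ON properties (bathroom);
-- in MySQL, a hash index on id would be added here for fast lookup by ID
""")

conn.execute("INSERT INTO properties VALUES (1, 0.10, 0.50, 1000, 2, 1)")
conn.execute("INSERT INTO properties VALUES (2, 0.90, 0.50, 1200, 3, 2)")

# Bounding-box prefilter expressed as a range query over the primary index
rows = conn.execute(
    "SELECT id FROM properties WHERE long BETWEEN ? AND ? AND lat BETWEEN ? AND ?",
    (0.05, 0.15, 0.45, 0.55),
).fetchall()
```

Only the candidates returned by this query need the exact Haversine distance computed.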
Tested with 8 million properties and 1 million requirements on a 1 GB / single-core / SSD server:
- (A) - Response time without cache: ~13.02 ms
- (B) - Response time with cache, on a cache miss: ~22.34 ms
- (C) - Response time on a cache hit: ~2 ms
- (A) - MySQL query performance can be improved to some extent by using int instead of float/double to store lat/long
- (B) - Response time on a cache miss can be improved by choosing a more compact and faster serialization format to and from Redis
- (C) - Response time on a cache hit can be improved by caching requirements by id, but that would need more memory
Currently the results from SQL are cached under the MD5 hash of the SQL query. Cache hits can be increased by careful, more granular caching of results.
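That query-keyed caching scheme can be sketched as follows. This is a simplified stand-in, not the project's code: a dict plays the role of Redis so the example runs without a server, and JSON stands in for whatever serialization is actually used (point (B) above notes a faster codec would cut miss latency).

```python
import hashlib
import json

cache = {}  # dict standing in for Redis in this sketch

def cached_query(sql, params, run_query):
    """Return query results, caching them under the MD5 of query + params."""
    key = hashlib.md5((sql + json.dumps(params)).encode("utf-8")).hexdigest()
    if key in cache:
        return json.loads(cache[key])   # cache hit: skip the SQL server entirely
    rows = run_query(sql, params)       # cache miss: fall through to MySQL
    cache[key] = json.dumps(rows)       # serialize before storing, as with Redis
    return rows
```

Because the key covers both the query text and its parameters, two requirements with identical filters share one cache entry, which is also why more granular keys would raise the hit rate.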
If blazing-fast queries are a necessity and there is scope for prewarming the cache, queries to the SQL server can be avoided entirely by storing everything in sorted sets and building the cache incrementally as requirements/properties come in. This works best if the system is not write-heavy. It is difficult to maintain, and after a server failure, warming the cache takes time, so it should only be used when another server is up and running on hot standby.
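A minimal sketch of that sorted-set approach, with a small Python class standing in for a Redis sorted set (score = longitude in radians) so the example is self-contained; the class and method names are assumptions, and `range` mirrors what ZRANGEBYSCORE would do against Redis:

```python
import bisect

class SortedIndex:
    """Stand-in for a Redis sorted set keyed by longitude in radians.

    Entries are added incrementally as properties arrive, so after the
    warm-up period reads never have to touch the SQL server.
    """
    def __init__(self):
        self._scores = []  # sorted longitudes (the "scores")
        self._items = []   # property ids, kept parallel to _scores

    def add(self, score, member):
        """Insert a member keeping the score order (like ZADD)."""
        i = bisect.bisect_left(self._scores, score)
        self._scores.insert(i, score)
        self._items.insert(i, member)

    def range(self, lo, hi):
        """Members with lo <= score <= hi (like ZRANGEBYSCORE lo hi)."""
        i = bisect.bisect_left(self._scores, lo)
        j = bisect.bisect_right(self._scores, hi)
        return self._items[i:j]
```

One such index per axis (longitude and latitude) reproduces the bounding-box prefilter entirely in memory, which is also why rebuilding it after a failure is slow.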
I have been using a variant of this approach in production to serve 2.5 billion responses per day in a geo-distributed system. It seems like overkill for the problem statement under consideration, so it was not chosen.