Non-Metric Space Library (NMSLIB): A similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

Non-Metric Space Library (NMSLIB), version 1.5

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are rarely efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods.
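To make the exact-versus-approximate trade-off concrete, here is a minimal Python 3 sketch of an exact (brute-force) k-NN search under a non-metric distance, plus the recall measure commonly used to judge an approximate answer against the exact one. It does not use NMSLIB. The function names and the choice of KL-divergence as the example distance are assumptions made for illustration only.

```python
import heapq
import math


def kl_divergence(p, q):
    """KL-divergence between two discrete distributions.

    It is not symmetric, so it is not a metric. Entries of q are
    assumed to be strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def brute_force_knn(data, query, k, dist):
    """Exact k-NN: score every data point and keep the k closest.

    The cost grows linearly with the data set size, which is why
    exact search becomes impractical for large collections.
    """
    scored = ((dist(x, query), i) for i, x in enumerate(data))
    return [i for _, i in heapq.nsmallest(k, scored)]


def recall(approx_ids, exact_ids):
    """Fraction of the true neighbors found by an approximate method."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

An approximate method is then judged by how much faster it is than the full scan at a given recall.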

NMSLIB is an extensible library, which means that it is possible to add new search methods and distance functions. NMSLIB can be used directly in C++ and Python (via Python bindings). In addition, it is also possible to build a query server, which can be used from Java (or other languages supported by Apache Thrift). Java has a native client, i.e., it works on many platforms without requiring a C++ library to be installed.

Main developers: Bilegsaikhan Naidan, Leonid Boytsov. With contributions from Yury Malkov, David Novak, Lawrence Cayton, Wei Dong, Avrelin Nikita, Dmitry Yashunin, Daniel Lemire, Alexander Ponomarenko.

Leo(nid) Boytsov is a maintainer.

Should you decide to modify the library (and, perhaps, create a pull request), please use the development branch.

NMSLIB is generic yet fast!

Even though our methods are generic (see, e.g., the evaluation results in Naidan and Boytsov 2015), they often outperform specialized methods for the Euclidean and/or angular distance (i.e., for the cosine similarity). Below are the results (as of May 2016) of NMSLIB compared to the best implementations that participated in a public evaluation code-named ann-benchmarks. Our main competitors are:

  1. A popular library Annoy, which uses a forest of random-projection KD-trees.
  2. A new library FALCONN, which is a highly-optimized implementation of the multiprobe LSH. It uses a novel type of random projections based on the fast Hadamard transform.

The benchmarks were run on a c4.2xlarge instance on EC2 using a previously unseen subset of 5K queries. The benchmarks employ the following data sets:

  1. GloVe : 1.2M 100-dimensional word embeddings trained on Tweets
  2. 1M of 128-dimensional SIFT features

[Benchmark plots not shown. Panels: 1.19M 100d GloVe, cosine similarity; 1M 128d SIFT features, Euclidean distance.]

What's new in version 1.5

  1. A new efficient method: a hierarchical (navigable) small-world graph (HNSW), contributed by Yury Malkov (@yurymalkov). Works with g++, Visual Studio, Intel Compiler, but doesn't work with Clang yet.
  2. A query server, which can have clients in C++, Java, Python, and other languages supported by Apache Thrift
  3. Python bindings for vector and non-vector spaces
  4. Improved performance of two core methods: SW-graph and NAPP
  5. Better handling of the gold standard data in the benchmarking utility experiment
  6. Updated API that permits search methods to serialize indices
  7. Improved documentation (e.g., we added tuning guidelines for best methods)

General information

A detailed description is given in the manual. The manual also contains instructions for building under Linux and Windows, extending the library, as well as for debugging the code using Eclipse.

Most of this code is released under the Apache License Version 2.0 http://www.apache.org/licenses/.

To acknowledge the use of the library, you could provide a link to this repository and/or cite our SISAP paper [BibTex]. Some other related papers are listed at the end.

The LSHKIT, which is embedded in our library, is distributed under the GNU General Public License, see http://www.gnu.org/licenses/. The k-NN graph construction algorithm NN-Descent due to Dong et al. 2011 (see the links below), which is also embedded in our library, seems to be covered by a free-to-use license, similar to Apache 2.

Prerequisites

  1. A modern compiler that supports C++11: G++ 4.7, Intel compiler 14, Clang 3.4, or Visual Studio 14 (version 12 can also be used, but the project files need to be downgraded).
  2. 64-bit Linux is recommended, but most of our code builds on 64-bit Windows as well.
  3. Boost (dev version). For Windows, the core library and the standalone sample application do not require Boost.
  4. Only for Linux: CMake (GNU make is also required)
  5. Only for Linux: GNU scientific library (dev version)
  6. Only for Linux: Eigen (dev version)
  7. An Intel or AMD processor that supports SSE 4.2 is recommended

Quick start on Linux

To compile, go to the directory similarity_search and type:

cmake .
make  

Note that the directory similarity_search contains an Eclipse project that can be imported into the Eclipse IDE for C/C++ Developers. A more detailed description is given in the manual, which also contains examples of using the software.

You can also download almost every data set used in our previous evaluations (see the section Data sets below). The downloaded data needs to be decompressed (you may need 7z, gzip, and bzip2). Old experimental scripts can be found in the directory previous_releases_scripts. However, they will work only with previous releases.

Note that the benchmarking utility supports caching of ground truth data, so that ground truth data is not recomputed every time this utility is re-run on the same data set.
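The idea behind ground-truth caching can be sketched in a few lines of Python 3. This is only an illustration of the principle: the helper name, the JSON file format, and the `compute_exact` callback are assumptions, and they do not reflect the cache format actually used by the benchmarking utility.

```python
import json
import os


def load_or_compute_gold(cache_path, queries, compute_exact):
    """Return exact answers for all queries, reusing a cache file if present.

    compute_exact is a hypothetical callback that returns the exact
    (gold standard) answer for one query, e.g., via a full scan.
    """
    if os.path.exists(cache_path):
        # A previous run already paid for the exact search.
        with open(cache_path) as f:
            return json.load(f)

    gold = [compute_exact(q) for q in queries]
    with open(cache_path, "w") as f:
        json.dump(gold, f)
    return gold
```

A real cache also has to be invalidated when the data set, the query set, or the distance changes; this sketch leaves that to the caller, who must pick a different cache path in that case.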

Query server (Linux-only)

The query server requires Apache Thrift. We used Apache Thrift 0.9.2; newer versions may work as well, but we have not verified this.
To install Apache Thrift, you need to build it from source. This may require additional libraries. On Ubuntu they can be installed as follows:

sudo apt-get install libboost-dev libboost-test-dev libboost-program-options-dev libboost-system-dev libboost-filesystem-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev libboost-thread-dev make

After Apache Thrift is installed, you need to build the library itself. Then, change the directory to query_server/cpp_client_server and type make (the makefile may need to be modified, if Apache Thrift is installed to a non-standard location). The query server has a similar set of parameters to the benchmarking utility experiment. For example, you can start the server as follows:

 ./query_server -i ../../sample_data/final8_10K.txt -s l2 -m sw-graph -c NN=10,efConstruction=200,initIndexAttempts=1 -p 10000
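If the server is started from a script, the command line above can be assembled programmatically. The following Python 3 sketch uses only the flags that appear in the example command (-i, -s, -m, -c, -p); the helper name and its parameter names are hypothetical, and other server options are not covered.

```python
def build_server_command(data_file, space, method, index_params, port,
                         binary="./query_server"):
    """Build the argument list for the query server.

    index_params is a list of (name, value) pairs; it is joined into the
    comma-separated name=value string passed via -c.
    """
    params = ",".join("%s=%s" % (name, value) for name, value in index_params)
    return [binary,
            "-i", data_file,   # input data file
            "-s", space,       # space (distance), e.g., l2
            "-m", method,      # search method, e.g., sw-graph
            "-c", params,      # index-time parameters
            "-p", str(port)]   # port to listen on


cmd = build_server_command(
    "../../sample_data/final8_10K.txt", "l2", "sw-graph",
    [("NN", 10), ("efConstruction", 200), ("initIndexAttempts", 1)],
    10000)
```

The resulting list can be passed to `subprocess.Popen(cmd)` from the directory query_server/cpp_client_server once the server binary has been built.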

There are also three sample clients implemented in C++, Python, and Java. A client reads a string representation of a query object from standard input. The format is the same as the format of objects in a data file. Here is an example of searching for the ten vectors closest to the first data set vector (stored in row one) of a provided sample data file:

export DATA_FILE=../../sample_data/final8_10K.txt
head -1 $DATA_FILE | ./query_client -p 10000 -a localhost  -k 10
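The same pipeline can be driven from a Python 3 script. The sketch below mirrors the shell example: it reads the first row of the data file and feeds it to the sample C++ client on standard input. The helper names are hypothetical, only the client flags shown above (-p, -a, -k) are used, and the client output is returned as raw text because its format is not described here.

```python
import subprocess


def first_row(data_file):
    """Return the first line of the data file (the equivalent of head -1)."""
    with open(data_file) as f:
        return f.readline()


def client_command(port, host, k, binary="./query_client"):
    """Build the argument list for the sample query client."""
    return [binary, "-p", str(port), "-a", host, "-k", str(k)]


def run_query(data_file, port, host, k):
    """Send the first data row as a query and return the client output.

    Requires a running query server and a built query_client binary.
    """
    result = subprocess.run(client_command(port, host, k),
                            input=first_row(data_file),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_query("../../sample_data/final8_10K.txt", 10000, "localhost", 10)` corresponds to the shell commands above.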

It is also possible to generate client classes for other languages supported by Thrift from the interface definition file, e.g., for C#. To this end, one should invoke the thrift compiler as follows:

thrift --gen csharp  protocol.thrift

For instructions on using generated code, please consult the Apache Thrift tutorial.

Python bindings (Linux-only)

We provide basic Python bindings (for Linux and Python 2.7). To build bindings for dense vector spaces, build the library first. Then, change the directory to python_vect_bindings and type:

sudo make install

For an example of using our library in Python, see the script test_nmslib_vect.py. Generic bindings can be found in the directory python_gen_bindings.

Quick start on Windows

Building on Windows is straightforward. Download Visual Studio 2015 Express for Desktop. Download and install respective Boost binaries (64-bit version 59). Please, use the default installation directory on disk c: (otherwise, it will be necessary to update project files).

Afterwards, you can simply use the provided Visual Studio solution file. The solution file references several project (*.vcxproj) files: NonMetricSpaceLib.vcxproj is the main project file that is used to build the library itself. The output is stored in the folder similarity_search\x64.

Also note that the core library, the test utilities, as well as examples of the standalone applications (projects sample_standalone_app1 and sample_standalone_app2) can be built without installing Boost.

Data sets

We use several data sets, which were created either by other people or using third-party software. If you use these data sets, please consider giving proper credit. The download scripts print the respective BibTex entries. More information can be found in the manual.

Here is the list of scripts to download major data sets:

The downloaded data needs to be decompressed (you may need 7z, gzip, and bzip2)

Related publications

The most important related papers are listed below in chronological order:
