Skip to content
/ umash Public
forked from backtrace-labs/umash

UMASH: a fast enough hash and fingerprint with collision bounds

License

Notifications You must be signed in to change notification settings

rurban/umash

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

UMASH: a fast almost universal 64-bit string hash

UMASH is a string hash function with throughput (22 GB/s on a 2.5 GHz Xeon 8175M) and latency (9-22 ns for input sizes up to 64 bytes on the same machine) comparable to that of performance-optimised hashes like MurmurHash3, XXH3, or farmhash. Its 64-bit output is almost universal, and it, as well as both its 32-bit halves, passes both Reini Urban's fork of SMHasher and Yves Orton's extended version (after expanding each seed to a 320-byte key for the latter).

Unlike most other non-cryptographic hash functions (CLHash is a rare exception) which do not prevent seed-independent collisions and thus usually suffer from such weaknesses, UMASH provably avoids parameter-independent collisions. For any two inputs of l bytes or fewer, the probability that a randomly parameterised UMASH assigns them the same 64 bit hash value is less than ceil(l / 2048) 2**-56. UMASH also offers a fingerprinting mode that simply computes two independent hashes at the same time. The resulting 128-bit fingerprint collides pairs of l-or-fewer-byte inputs with probability less than ceil(l / 2048)**2 * 2**-112; that's less than 2**-70 (1e-21) for up to 7.5 GB of data.

See umash_reference.py (pre-rendered in umash.pdf) for details and rationale about the design, and a proof sketch for the collision bound. The blog post announcing UMASH includes a higher level overview and may also provide useful context.

If you're not into details, you can also just copy umash.c and umash.h in your project: they're distributed under the MIT license.

The current implementation only build with gcc-compatible compilers that support the integer overflow builtins introduced by GCC 5 (April 2015) and targets x86-64 machines with the CLMUL extension (available since 2011 on Intel and AMD). That's simply because we only use UMASH on such platforms at Backtrace. There should be no reason we can't also target other compilers, or other architectures with carry-less multiplication instructions (e.g., VMULL on ARMv8).

Hacking on UMASH

The test suite calls into a shared object with test-only external symbols with Python 3, CFFI, and Hypothesis. As long as Python3 and venv are installed, you may execute t/run-tests.sh to download test dependencies, build the current version of UMASH and run all the pytests in the t/ directory. t/run-tests-public.sh only exercises the public interface, which may be helpful to test a production build or when making extensive internal changes.

The Python test code is automatically formatted with black. We try to make sure the C code sticks to something similar to the FreeBSD KNF; when in doubt, whatever t/format.sh does is good enough.

Help wanted

While the UMASH algorithm isn't frozen yet, we do not expect to change its overall structure (PH block compressor that feeds a polynomial string hash). There are plenty of lower hanging fruits.

  1. The short (8 or fewer bytes) input code can hopefully be simpler.
  2. The medium-length (9-15 bytes) input code path is a micro-optimised version of the general case, but does not actually share any machine code; can we improve the latency and maintain the collision bounds by replacing it with something completely different?
  3. It’s already nice that we can get away with a single round of xor-shift / multiply in the finaliser, but can we shave even more latency there?
  4. We only looked at x86-64 implementations; we will consider simple changes that improve performance on x86-64, or on other platforms as long they don't penalise x86-64.
  5. We currently only use incremental and one-shot hashing interfaces. If someone needs parallel hashing, we can collaborate to find out what that interface could look like.

And of course, portability to other C compilers or platforms is interesting.

About

UMASH: a fast enough hash and fingerprint with collision bounds

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 59.1%
  • C 39.3%
  • Shell 1.6%