//// Copyright 2005-2008 Daniel James Copyright 2022 Christian Mazakas Copyright 2022 Peter Dimov Distributed under the Boost Software License, Version 1.0. https://www.boost.org/LICENSE_1_0.txt //// [#notes] = Design and Implementation Notes :idprefix: notes_ == Quality of the Hash Function Many hash functions strive to have little correlation between the input and output values. They attempt to uniformally distribute the output values for very similar inputs. This hash function makes no such attempt. In fact, for integers, the result of the hash function is often just the input value. So similar but different input values will often result in similar but different output values. This means that it is not appropriate as a general hash function. For example, a hash table may discard bits from the hash function resulting in likely collisions, or might have poor collision resolution when hash values are clustered together. In such cases this hash function will perform poorly. But the standard has no such requirement for the hash function, it just requires that the hashes of two different values are unlikely to collide. Containers or algorithms designed to work with the standard hash function will have to be implemented to work well when the hash function's output is correlated to its input. Since they are paying that cost a higher quality hash function would be wasteful. == The hash_value Customization Point The way one customizes the standard `std::hash` function object for user types is via a specialization. `boost::hash` chooses a different mechanism -- an overload of a free function `hash_value` in the user namespace that is found via argument-dependent lookup. Both approaches have their pros and cons. Specializing the function object is stricter in that it only applies to the exact type, and not to derived or convertible types. Defining a function, on the other hand, is easier and more convenient, as it can be done directly in the type definition as an `inline` `friend`. The fact that overloads can be invoked via conversions did cause issues in an earlier iteration of the library that defined `hash_value` for all integral types separately, including `bool`. Especially under {cpp}03, which doesn't have `explicit` conversion operators, some types were convertible to `bool` to allow their being tested in e.g. `if` statements, which caused them to hash to 0 or 1, rarely what one expects or wants. This, however, was fixed by declaring the built-in `hash_value` overloads to be templates constrained on e.g. `std::is_integral` or its moral equivalent. This causes types convertible to an integral to no longer match, avoiding the problem. == Hash Value Stability In general, the library does not promise that the hash values will stay the same from release to release (otherwise improvements would be impossible). However, historically values have been quite stable. Before release 1.81, the previous changes have been in 1.56 (a better `hash_combine`) and 1.78 (macOS-specific change to `hash_combine`.) Code should generally not depend on specific hash values, but for those willing to take the risk of occasional breaks due to hash value changes, the library now has a test that checks hash values for a number of types against reference values (`test/hash_reference_values.cpp`), whose https://github.com/boostorg/container_hash/commits/develop/test/hash_reference_values.cpp[version history] can be used as a rough guide to when hash values have changed, and for what types. == hash_combine The initial implementation of the library was based on Issue 6.18 of the http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf[Library Extension Technical Report Issues List] (pages 63-67) which proposed the following implementation of `hash_combine`: [source] ---- template void hash_combine(size_t & seed, T const & v) { seed ^= hash_value(v) + (seed << 6) + (seed >> 2); } ---- taken from the paper "https://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf[Methods for Identifying Versioned and Plagiarised Documents]" by Timothy C. Hoad and Justin Zobel. During the Boost formal review, Dave Harris pointed out that this suffers from the so-called "zero trap"; if `seed` is initially 0, and all the inputs are 0 (or hash to 0), `seed` remains 0 no matter how many input values are combined. This is an undesirable property, because it causes containers of zeroes to have a zero hash value regardless of their sizes. To fix this, the arbitrary constant `0x9e3779b9` (the golden ratio in a 32 bit fixed point representation) was added to the computation, yielding [source] ---- template void hash_combine(size_t & seed, T const & v) { seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2); } ---- This is what shipped in Boost 1.33, the first release containing the library. This function was a reasonable compromise between quality and speed for its time, when the input consisted of ``char``s, but it's less suitable for combining arbitrary `size_t` inputs. In Boost 1.56, it was replaced by functions derived from Austin Appleby's https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[MurmurHash2 hash function round]. In Boost 1.81, it was changed again -- to the equivalent of `mix(seed + 0x9e3779b9 + hash_value(v))`, where `mix(x)` is a high quality mixing function that is a bijection over the `size_t` values, of the form [source] ---- x ^= x >> k1; x *= m1; x ^= x >> k2; x *= m2; x ^= x >> k3; ---- This type of mixing function was originally devised by Austin Appleby as the "final mix" part of his MurmurHash3 hash function. He used [source] ---- x ^= x >> 33; x *= 0xff51afd7ed558ccd; x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53; x ^= x >> 33; ---- as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[64 bit function `fmix64`] and [source] ---- x ^= x >> 16; x *= 0x85ebca6b; x ^= x >> 13; x *= 0xc2b2ae35; x ^= x >> 16; ---- as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L68-L77[32 bit function `fmix32`]. Several improvements of the 64 bit function have been subsequently proposed, by https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html[David Stafford], https://mostlymangling.blogspot.com/2019/12/stronger-better-morer-moremur-better.html[Pelle Evensen], and http://jonkagstrom.com/mx3/mx3_rev2.html[Jon Maiga]. We currently use Jon Maiga's function [source] ---- x ^= x >> 32; x *= 0xe9846af9b1a615d; x ^= x >> 32; x *= 0xe9846af9b1a615d; x ^= x >> 28; ---- Under 32 bit, we use a mixing function proposed by "TheIronBorn" in a https://github.com/skeeto/hash-prospector/issues/19[Github issue] in the https://github.com/skeeto/hash-prospector[repository] of https://nullprogram.com/blog/2018/07/31/[Hash Prospector] by Chris Wellons: [source] ---- x ^= x >> 16; x *= 0x21f0aaad; x ^= x >> 15; x *= 0x735a2d97; x ^= x >> 15; ---- With this improved `hash_combine`, `boost::hash` for strings now passes the https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby (for a 64 bit `size_t`). == hash_range The traditional implementation of `hash_range(seed, first, last)` has been [source] ---- for( ; first != last; ++first ) { boost::hash_combine::value_type>( seed, *first ); } ---- (the explicit template parameter is needed to support iterators with proxy return types such as `std::vector::iterator`.) This is logical, consistent and straightforward. In the common case where `typename std::iterator_traits::value_type` is `char` -- which it is in the common case of `boost::hash` -- this however leaves a lot of performance on the table, because processing each `char` individually is much less efficient than processing several in bulk. In Boost 1.81, `hash_range` was changed to process elements of type `char`, `signed char`, `unsigned char`, `std::byte`, or `char8_t`, four of a time. A `uint32_t` is composed from `first[0]` to `first[3]`, and that `uint32_t` is fed to `hash_combine`. In Boost 1.82, `hash_range` for these types was changed to use https://github.com/pdimov/mulxp_hash[`mulxp1_hash`]. This improves both quality and speed of string hashing. Note that `hash_range` has also traditionally guaranteed that the same element sequence yields the same hash value regardless of the iterator type. This property remains valid after the changes to `char` range hashing. `hash_range`, applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value whether the sequence comes from `char[3]`, `std::string`, `std::deque`, or `std::list`.