////
Copyright 2005-2008 Daniel James
Copyright 2022 Christian Mazakas
Copyright 2022 Peter Dimov
Distributed under the Boost Software License, Version 1.0.
https://www.boost.org/LICENSE_1_0.txt
////

[#notes]
= Design and Implementation Notes
:idprefix: notes_

== Quality of the Hash Function

Many hash functions strive to have little correlation between the input and
output values. They attempt to uniformly distribute the output values for
very similar inputs. This hash function makes no such attempt. In fact, for
integers, the result of the hash function is often just the input value. So
similar but different input values will often result in similar but different
output values. This means that it is not appropriate as a general hash
function. For example, a hash table may discard bits from the hash function
resulting in likely collisions, or might have poor collision resolution when
hash values are clustered together. In such cases this hash function will
perform poorly.

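To make the effect of discarded bits concrete, the following sketch models a hypothetical table that selects a bucket by keeping only the low bits of the hash. All names in it (`identity_hash`, `bucket_of`, `distinct_buckets`, `strided_keys`) are invented for illustration, and the identity function merely stands in for the integer behavior described above.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Stand-in for a hash function that returns integers unchanged.
inline std::size_t identity_hash( std::size_t v )
{
    return v;
}

// Hypothetical bucket selection: keep only the low `bits` bits of the hash.
inline std::size_t bucket_of( std::size_t h, unsigned bits )
{
    return h & ( ( std::size_t( 1 ) << bits ) - 1 );
}

// Counts how many distinct buckets a set of keys lands in.
inline std::size_t distinct_buckets( std::vector<std::size_t> const& keys, unsigned bits )
{
    std::set<std::size_t> used;

    for( std::size_t k: keys )
    {
        used.insert( bucket_of( identity_hash( k ), bits ) );
    }

    return used.size();
}

// Produces the keys 0, stride, 2 * stride, ...
inline std::vector<std::size_t> strided_keys( std::size_t n, std::size_t stride )
{
    std::vector<std::size_t> r;

    for( std::size_t i = 0; i < n; ++i )
    {
        r.push_back( i * stride );
    }

    return r;
}
```

With 8 bits kept, `distinct_buckets( strided_keys( 100, 1 ), 8 )` spreads the keys over 100 buckets, while `distinct_buckets( strided_keys( 100, 256 ), 8 )` puts all 100 keys into a single bucket.
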
But the standard has no such requirement for the hash function; it only
requires that the hashes of two different values are unlikely to collide.
Containers or algorithms designed to work with the standard hash function will
have to be implemented to work well when the hash function's output is
correlated to its input. Since they are paying that cost, a higher quality hash
function would be wasteful.

== The hash_value Customization Point

The way one customizes the standard `std::hash` function object for user
types is via a specialization. `boost::hash` chooses a different mechanism --
an overload of a free function `hash_value` in the user namespace that is
found via argument-dependent lookup.

Both approaches have their pros and cons. Specializing the function object
is stricter in that it only applies to the exact type, and not to derived
or convertible types. Defining a function, on the other hand, is easier
and more convenient, as it can be done directly in the type definition as
an `inline` `friend`.

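The mechanism can be sketched without the library itself. In the example below, `demo::hash` is a hypothetical stand-in for a function object that calls `hash_value` unqualified, and `user::point` and `user::labelled_point` are invented types; the combining formula inside `hash_value` is only a placeholder.

```cpp
#include <cstddef>
#include <functional>

namespace demo // hypothetical stand-in, not the library itself
{

// Forwards to an unqualified hash_value call, so that overloads in the
// namespace of the argument are found by argument-dependent lookup.
template<class T> struct hash
{
    std::size_t operator()( T const& v ) const
    {
        return hash_value( v );
    }
};

} // namespace demo

namespace user
{

class point
{
private:

    int x_;
    int y_;

public:

    point( int x, int y ): x_( x ), y_( y ) {}

    // Defined inside the class as an inline friend; has access to the
    // private members and is found only via argument-dependent lookup.
    friend std::size_t hash_value( point const& p )
    {
        // Placeholder combination built on std::hash, for illustration only.
        std::size_t seed = std::hash<int>()( p.x_ );
        seed ^= std::hash<int>()( p.y_ ) + 0x9e3779b9 + ( seed << 6 ) + ( seed >> 2 );
        return seed;
    }
};

// A derived type reaches the same overload through the
// derived-to-base conversion, without any extra code.
class labelled_point: public point
{
public:

    labelled_point( int x, int y ): point( x, y ) {}
};

} // namespace user
```

Here `demo::hash<user::labelled_point>` works without further declarations, which illustrates both the convenience of the overload approach and the looseness that a specialization would avoid.
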
The fact that overloads can be invoked via conversions did cause issues in
an earlier iteration of the library that defined `hash_value` for all
integral types separately, including `bool`. Especially under {cpp}03,
which doesn't have `explicit` conversion operators, some types were
convertible to `bool` to allow their being tested in e.g. `if` statements,
which caused them to hash to 0 or 1, rarely what one expects or wants.

This, however, was fixed by declaring the built-in `hash_value` overloads
to be templates constrained on e.g. `std::is_integral` or its moral
equivalent. This causes types convertible to an integral to no longer
match, avoiding the problem.

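The difference between the two designs can be reproduced in a few lines. The namespaces `unconstrained` and `constrained`, the type `handle`, and the trait `has_constrained_hash` below are all hypothetical; they only mimic the shape of the overloads discussed above and are not the library's declarations.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

namespace unconstrained // one plain overload per integral type
{
    inline std::size_t hash_value( bool v ) { return v; }
    inline std::size_t hash_value( int v ) { return static_cast<std::size_t>( v ); }
}

namespace constrained // a single template limited to integral types
{
    template<class T>
    typename std::enable_if<std::is_integral<T>::value, std::size_t>::type
    hash_value( T v )
    {
        return static_cast<std::size_t>( v );
    }
}

// A C++03-style handle that can be tested in `if` statements
// through an implicit conversion to bool.
struct handle
{
    void* p_;
    operator bool() const { return p_ != nullptr; }
};

// Detects whether constrained::hash_value can be called with a T.
template<class T, class = void>
struct has_constrained_hash: std::false_type {};

template<class T>
struct has_constrained_hash<T,
    decltype( void( constrained::hash_value( std::declval<T>() ) ) )>: std::true_type {};

// The template deduces T as handle, which is not integral, so it drops out.
static_assert( has_constrained_hash<int>::value, "int is integral" );
static_assert( !has_constrained_hash<handle>::value, "handle is not integral" );
```

Calling `unconstrained::hash_value` with a `handle` compiles and silently yields 0 or 1, whereas the constrained template is removed from overload resolution because deduction sees `handle` itself rather than a type it converts to.
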
== Hash Value Stability

In general, the library does not promise that the hash values will stay
the same from release to release (otherwise improvements would be
impossible). However, historically values have been quite stable. Before
release 1.81, the previous changes were in 1.56 (a better
`hash_combine`) and 1.78 (a macOS-specific change to `hash_combine`).

Code should generally not depend on specific hash values, but for those
willing to take the risk of occasional breaks due to hash value changes,
the library now has a test that checks hash values for a number of types
against reference values (`test/hash_reference_values.cpp`),
whose https://github.com/boostorg/container_hash/commits/develop/test/hash_reference_values.cpp[version history]
can be used as a rough guide to when hash values have changed, and for what
types.

== hash_combine

The initial implementation of the library was based on Issue 6.18 of the
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf[Library Extension Technical Report Issues List]
(pages 63-67) which proposed the following implementation of `hash_combine`:

[source]
----
template<class T>
void hash_combine(size_t & seed, T const & v)
{
    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
}
----

taken from the paper
"https://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf[Methods for Identifying Versioned and Plagiarised Documents]"
by Timothy C. Hoad and Justin Zobel.

During the Boost formal review, Dave Harris pointed out that this suffers
from the so-called "zero trap"; if `seed` is initially 0, and all the
inputs are 0 (or hash to 0), `seed` remains 0 no matter how many input
values are combined.

This is an undesirable property, because it causes containers of zeroes
to have a zero hash value regardless of their sizes.

To fix this, the arbitrary constant `0x9e3779b9` (the golden ratio in a
32 bit fixed point representation) was added to the computation, yielding

[source]
----
template<class T>
void hash_combine(size_t & seed, T const & v)
{
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
----

This is what shipped in Boost 1.33, the first release containing the library.

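The zero trap and its fix can be checked directly. The sketch below applies the two formulas above to element hashes that are passed in as plain numbers; the function names `combine_without_constant`, `combine_with_constant` and `hash_zeroes` are invented for this illustration.

```cpp
#include <cstddef>

// The formula from the issues list, with the element hash passed in directly.
inline void combine_without_constant( std::size_t& seed, std::size_t h )
{
    seed ^= h + ( seed << 6 ) + ( seed >> 2 );
}

// The same formula with the golden ratio constant added.
inline void combine_with_constant( std::size_t& seed, std::size_t h )
{
    seed ^= h + 0x9e3779b9 + ( seed << 6 ) + ( seed >> 2 );
}

// Combines `n` elements that all hash to zero, starting from a zero seed.
template<class Combine>
std::size_t hash_zeroes( std::size_t n, Combine combine )
{
    std::size_t seed = 0;

    for( std::size_t i = 0; i < n; ++i )
    {
        combine( seed, 0 );
    }

    return seed;
}
```

Without the constant, `hash_zeroes` returns 0 for every length. With it, a single zero already yields `0x9e3779b9`, and sequences of one and two zeroes produce different values.
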
This function was a reasonable compromise between quality and speed for its
time, when the input consisted of ``char``s, but it's less suitable for
combining arbitrary `size_t` inputs.

In Boost 1.56, it was replaced by functions derived from Austin Appleby's
https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[MurmurHash2 hash function round].

In Boost 1.81, it was changed again -- to the equivalent of
`mix(seed + 0x9e3779b9 + hash_value(v))`, where `mix(x)` is a high quality
mixing function that is a bijection over the `size_t` values, of the form

[source]
----
x ^= x >> k1;
x *= m1;
x ^= x >> k2;
x *= m2;
x ^= x >> k3;
----

This type of mixing function was originally devised by Austin Appleby as
the "final mix" part of his MurmurHash3 hash function. He used

[source]
----
x ^= x >> 33;
x *= 0xff51afd7ed558ccd;
x ^= x >> 33;
x *= 0xc4ceb9fe1a85ec53;
x ^= x >> 33;
----

as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L81-L90[64 bit function `fmix64`] and

[source]
----
x ^= x >> 16;
x *= 0x85ebca6b;
x ^= x >> 13;
x *= 0xc2b2ae35;
x ^= x >> 16;
----

as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L68-L77[32 bit function `fmix32`].

Several improvements of the 64 bit function have been subsequently proposed,
by https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html[David Stafford],
https://mostlymangling.blogspot.com/2019/12/stronger-better-morer-moremur-better.html[Pelle Evensen],
and http://jonkagstrom.com/mx3/mx3_rev2.html[Jon Maiga]. We currently use Jon
Maiga's function

[source]
----
x ^= x >> 32;
x *= 0xe9846af9b1a615d;
x ^= x >> 32;
x *= 0xe9846af9b1a615d;
x ^= x >> 28;
----

Under 32 bit, we use a mixing function proposed by "TheIronBorn" in a
https://github.com/skeeto/hash-prospector/issues/19[GitHub issue] in
the https://github.com/skeeto/hash-prospector[repository] of
https://nullprogram.com/blog/2018/07/31/[Hash Prospector] by Chris Wellons:

[source]
----
x ^= x >> 16;
x *= 0x21f0aaad;
x ^= x >> 15;
x *= 0x735a2d97;
x ^= x >> 15;
----

With this improved `hash_combine`, `boost::hash` for strings now passes the
https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
(for a 64 bit `size_t`).

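As a rough sketch, the two mixing functions and the combining step can be written as ordinary functions over fixed-width integers. The names `mix64`, `mix32` and `combine64` are invented here, and `combine64` takes an already computed element hash `h` in place of `hash_value(v)`; this illustrates the formulas above and is not the library's source.

```cpp
#include <cstdint>

// 64 bit mixing function with the constants quoted above.
inline std::uint64_t mix64( std::uint64_t x )
{
    x ^= x >> 32;
    x *= 0xe9846af9b1a615dULL;
    x ^= x >> 32;
    x *= 0xe9846af9b1a615dULL;
    x ^= x >> 28;
    return x;
}

// 32 bit mixing function with the constants quoted above.
inline std::uint32_t mix32( std::uint32_t x )
{
    x ^= x >> 16;
    x *= 0x21f0aaadU;
    x ^= x >> 15;
    x *= 0x735a2d97U;
    x ^= x >> 15;
    return x;
}

// Combining step for a 64 bit seed: mix( seed + 0x9e3779b9 + h ).
inline std::uint64_t combine64( std::uint64_t seed, std::uint64_t h )
{
    return mix64( seed + 0x9e3779b9 + h );
}
```

Each step is invertible (an xor with a right shift can be undone, and multiplication by an odd constant is invertible modulo a power of two), which is why the composition is a bijection. Both functions map 0 to 0, so the added constant is still what keeps a zero seed combined with a zero hash away from zero.
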
== hash_range

The traditional implementation of `hash_range(seed, first, last)` has been

[source]
----
for( ; first != last; ++first )
{
    boost::hash_combine<typename std::iterator_traits<It>::value_type>( seed, *first );
}
----

(the explicit template parameter is needed to support iterators with proxy
return types such as `std::vector<bool>::iterator`.)

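The role of the explicit template parameter can be seen in a small stand-alone sketch. Here `combine_sketch` and `hash_range_sketch` are hypothetical names, and `std::hash` replaces the `hash_value` lookup so that the example does not depend on the library.

```cpp
#include <cstddef>
#include <functional>
#include <iterator>
#include <list>
#include <vector>

// Stand-in for hash_combine; std::hash takes the place of hash_value.
template<class T>
void combine_sketch( std::size_t& seed, T const& v )
{
    seed ^= std::hash<T>()( v ) + 0x9e3779b9 + ( seed << 6 ) + ( seed >> 2 );
}

template<class It>
std::size_t hash_range_sketch( It first, It last )
{
    typedef typename std::iterator_traits<It>::value_type value_type;

    std::size_t seed = 0;

    for( ; first != last; ++first )
    {
        // Naming value_type converts a proxy such as
        // std::vector<bool>::reference to bool. With deduction,
        // T would be the proxy type rather than bool.
        combine_sketch<value_type>( seed, *first );
    }

    return seed;
}
```

With this form, a `std::vector<bool>` and a `std::list<bool>` holding the same values go through the same `bool` code path and produce the same result.
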
This is logical, consistent and straightforward. In the common case where
`typename std::iterator_traits<It>::value_type` is `char` -- which it is
in the common case of `boost::hash<std::string>` -- this however leaves a
lot of performance on the table, because processing each `char` individually
is much less efficient than processing several in bulk.

In Boost 1.81, `hash_range` was changed to process elements of type `char`,
`signed char`, `unsigned char`, `std::byte`, or `char8_t`, four at a time.
A `uint32_t` is composed from `first[0]` to `first[3]`, and that `uint32_t`
is fed to `hash_combine`.

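A minimal sketch of this chunked processing follows. Several details are assumptions made for the sake of a complete example and are not taken from the library: the bytes are packed in little-endian order, a short final chunk is padded with zeroes, the length is folded in at the end, and `combine_u32` is a simple stand-in for `hash_combine`. The names `combine_u32` and `hash_char_range_sketch` are invented.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <list>
#include <string>

// Stand-in for hash_combine applied to a 32 bit value.
inline void combine_u32( std::size_t& seed, std::uint32_t w )
{
    seed ^= static_cast<std::size_t>( w ) + 0x9e3779b9 + ( seed << 6 ) + ( seed >> 2 );
}

// Reads up to four chars per step and packs them into one uint32_t.
// Assumed here: little-endian packing, zero-padded final chunk,
// length folded in at the end.
template<class It>
std::size_t hash_char_range_sketch( It first, It last )
{
    std::size_t seed = 0;
    std::size_t length = 0;

    while( first != last )
    {
        std::uint32_t w = 0;

        for( int i = 0; i < 4 && first != last; ++i, ++first, ++length )
        {
            unsigned char byte = static_cast<unsigned char>( *first );
            w |= static_cast<std::uint32_t>( byte ) << ( 8 * i );
        }

        combine_u32( seed, w );
    }

    // Without the length, "a" and "a\0" would pad to the same chunk.
    combine_u32( seed, static_cast<std::uint32_t>( length ) );

    return seed;
}
```

Because the loop only uses `*first`, `++first` and `first != last`, the same code accepts pointers, `std::string`, `std::deque<char>` and `std::list<char>` iterators, and yields the same value for the same character sequence.
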
In Boost 1.82, `hash_range` for these types was changed to use
https://github.com/pdimov/mulxp_hash[`mulxp1_hash`]. This improves both
quality and speed of string hashing.

Note that `hash_range` has also traditionally guaranteed that the same element
sequence yields the same hash value regardless of the iterator type. This
property remains valid after the changes to `char` range hashing. `hash_range`,
applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value
whether the sequence comes from `char[3]`, `std::string`, `std::deque<char>`,
or `std::list<char>`.