Hash functions: An empirical comparison

Hash tables are popular data structures for storing key-value pairs. A hash function is used to map the key value (usually a string) to array index. The functions are different from cryptographic hash functions, because they should be much faster and don't need to be resistant to preimage attack. Hashing in large databases is also left out from this article; the benchmark includes medium-size hash tables such as:

symbol table in a parser,
IP address table for filtering network traffic,
the dictionary in a word counting program or a spellchecker.

There are two classes of the functions used in hash tables:

multiplicative hash functions, which are simple and fast, but have a high number of collisions;
more complex functions, which have better quality, but take more time to calculate.

Hash table benchmarks usually include theoretical metrics such as the number of collisions or distribution uniformity (see, for example, hash function comparison in the Red Dragon book). Obviously, you will have a better distribution with more complex functions, so they are winners in these benchmarks.

The question is whether using complex functions gives you a faster program. The complex functions require more operations per one key, so they can be slower. Is the price of collisions high enough to justify the additional operations?

Multiplicative hash functions

Any multiplicative hash function is a special case of the following algorithm:

UINT HashMultiplicative(const CHAR *key, SIZE_T len) {
   UINT hash = INITIAL_VALUE;
   for(UINT i = 0; i < len; ++i)
      hash = M * hash + key[i];
   return hash % TABLE_SIZE;
}

(Sometimes XOR operation is used instead of addition, but it does not make much difference.) The hash functions differ only by values of INITIAL_VALUE and multiplier (M). For example, the popular Bernstein's function uses INITIAL_VALUE of 5381 and M of 33; Kernighan and Ritchie's function uses INITIAL_VALUE of 0 and M of 31.

A multiplicative function works by adding together the letters weighted by powers of multiplier. For example, the hash for the word TONE will be:

  INITIAL_VALUE * M^4  +  'T' * M^3  +  'O' * M^2  +  'N' * M  +  'E'

Let's enter several similar strings and watch the output of the functions:

     Bernstein Kernighan
        (M=33)     (M=31)
 too   b88af17     1c154
 top   b88af18     1c155
 tor   b88af1a     1c157
 tpp   b88af39     1c174
a000  7c9312d6    2cd22f
a001  7c9312d7    2cd230
a002  7c9312d8    2cd231
a003  7c9312d9    2cd232
a004  7c9312da    2cd233
a005  7c9312db    2cd234
a006  7c9312dc    2cd235
a007  7c9312dd    2cd236
a008  7c9312de    2cd237
a009  7c9312df    2cd238
a010  7c9312f7    2cd24e
   a     2b606        61
  aa    597727       c20
 aaa   b885c68     17841

Too and top are different in the last letter only. The letter P is the next one after O, so the values of hash function are different by 1 (1c154 and 1c155, b88af17 and b88af18). Ditto for a000..a009.

Now let's compare top with tpp. Their hashes will be:

  INITIAL_VALUE * M^3 + 'T' * M^2 + 'O' * M + 'P'
  INITIAL_VALUE * M^3 + 'T' * M^2 + 'P' * M + 'P'

The hashes will be different by M * ('P' - 'O') = M. Similarly, when the first letters are different by x, their hashes will be different by x * M^2.

When there are less than 33 possible letters, Bernstein's function will pack them into a number (similar to Radix40 packing scheme). For example, hash table of size 33³ will provide perfect hashing (without any collisions) for all three-letter English words written in small letters. In practice, the words are longer and hash tables are smaller, so there will be some collisions (situations when different strings have the same hash value).

If the string is too long to fit into the 32-bit number, the first letters will still affect the value of the hash function, because the multiplication is done modulo 2^32 (in a 32-bit register), and the multiplier is chosen to have no common divisors with 2^32 (in other words, it must be odd), so the bits will not be just shifted away.

There are no exact rules for choosing the multiplier, only some heuristics:

the multiplier should be large enough to accommodate most of the possible letters (e.g., 3 or 5 is too small);
the multiplier should be fast to calculate with shifts and additions [e.g., 33 * hash can be calculated as (hash << 5) + hash];
the multiplier should be odd for the reason explained above;
prime numbers are good multipliers.

Complex hash functions

These functions do a good job of mixing together the bits of the source word. The change in one input bit changes a half of the bits in the output (see Avalanche_effect), so the result looks completely random:

     Paul Hsieh One At Time
 too   3ad11d33  3a9fad1e  
 top   78b5a877  4c5dd09a  
 tor   c09e2021  f2aa9d35  
 tpp   3058996d  d5e9e480  
a000   7552599f  ed3859d8  
a001   3cc1d896  fef7fd57  
a002   c6ff5c9b  08a610b3  
a003   dcab7b0c  1a88b478  
a004   780c7202  3621ebaa  
a005   7eb63e3a  47db8f1d  
a006   6b0a7a17  b901717b  
a007   cb5cb1ab  caec1550  
a008   5c2a15c0  e58d4a92  
a009   33339829  f75aee2d  
a010   eb1f336e  bd097a6b  
   a   115ea782  ca2e9442  
  aa   008ad357  7081738e  
 aaa   7dfdc310  ae4f22ec

To achieve this behavior, the hash functions perform a lot of shifts, XORs, and additions. But do we need a complex function? What is faster: tolerating the collisions and resolving them with chaining, or avoiding them with a more complex function?

Test conditions

The benchmark uses separate chaining algorithm for collision resolution. Memory allocation and other "heavy" functions were excluded from the benchmarked code. The RDTSC instruction was used for benchmarking. The test was performed on Pentium-M and Core i5 processors.

The benchmark inserts some keys in the table, then looks them up in the same order as they were inserted. The test data include:

the list of common words from Wiktionary (500 items);
the list of Win32 functions from Colorer syntax highlight scheme (1992 items);
500 names from a000 to a499 (imitates the names in auto-generated source code);
the list of common words with a long prefix and postfix;
all variable names from WordPress 2.3.2 source code in wp-includes folder (1842 names);
list of all words in Sonnets by W. Shakespeare (imitates a word counting program; 3228 words);
list of all words in La Peau de chagrin by Balzac (in French, UTF-8 encoding);
search engine IP addresses (binary).

Results

Core i5 processor

	Words	Win32	Numbers	Prefix	Postfix	Variables	Sonnets	UTF-8	IPv4	Avg
iSCSI CRC	65	[105]	329	[415]	36	[112]	84	[106]	83	[92]	280	[368]	408	[584]	1964	[2388]	322	[838]	1.01	[1.78]
Meiyan	64	[102]	328	[409]	45	[125]	87	[106]	85	[112]	274	[350]	411	[588]	1972	[2377]	353	[768]	1.05	[1.87]
Murmur2	72	[103]	378	[415]	48	[104]	109	[106]	106	[111]	315	[383]	450	[566]	2183	[2399]	399	[834]	1.21	[1.74]
XXHfast32	78	[110]	372	[420]	57	[102]	88	[103]	88	[106]	315	[347]	473	[491]	2323	[2494]	463	[838]	1.23	[1.71]
SBox	70	[91]	389	[431]	46	[116]	124	[108]	123	[91]	304	[347]	430	[526]	2182	[2442]	377	[836]	1.23	[1.78]
Larson	72	[99]	401	[416]	34	[16]	143	[99]	141	[105]	312	[366]	451	[583]	2230	[2447]	349	[755]	1.25	[1.10]
XXHstrong32	79	[109]	385	[429]	58	[102]	93	[102]	92	[112]	321	[355]	474	[491]	2332	[2496]	464	[838]	1.25	[1.72]
Sedgewick	73	[107]	417	[414]	36	[48]	143	[103]	143	[103]	319	[348]	446	[570]	2246	[2437]	349	[782]	1.26	[1.33]
Novak unrolled	76	[113]	404	[399]	43	[90]	127	[118]	125	[113]	322	[342]	459	[581]	2284	[2430]	379	[969]	1.26	[1.68]
CRC-32	70	[101]	429	[426]	40	[64]	146	[107]	143	[94]	320	[338]	443	[563]	2231	[2400]	357	[725]	1.28	[1.41]
Murmur3	78	[101]	391	[380]	54	[104]	108	[103]	107	[105]	331	[334]	492	[555]	2360	[2376]	433	[783]	1.28	[1.69]
x65599	74	[111]	407	[382]	45	[203]	144	[107]	144	[122]	316	[379]	449	[560]	2221	[2373]	349	[846]	1.29	[2.45]
FNV-1a	74	[124]	408	[428]	47	[108]	144	[94]	144	[105]	309	[374]	440	[555]	2193	[2446]	376	[807]	1.30	[1.77]
Murmur2A	79	[114]	410	[433]	53	[102]	117	[112]	114	[109]	337	[365]	494	[544]	2377	[2369]	429	[772]	1.31	[1.73]
Fletcher	71	[131]	352	[406]	80	[460]	104	[127]	100	[108]	312	[507]	481	[1052]	2477	[4893]	388	[1359]	1.31	[4.62]
K&R	73	[106]	429	[437]	47	[288]	149	[94]	149	[106]	324	[360]	450	[561]	2266	[2365]	343	[831]	1.32	[3.00]
Paul Hsieh	80	[114]	410	[420]	54	[118]	123	[101]	121	[100]	336	[341]	496	[600]	2351	[2380]	433	[847]	1.33	[1.83]
Bernstein	75	[114]	428	[412]	49	[288]	150	[100]	150	[102]	324	[353]	460	[572]	2312	[2380]	351	[703]	1.34	[2.99]
x17 unrolled	78	[109]	446	[415]	43	[24]	156	[113]	153	[102]	344	[368]	472	[589]	2361	[2392]	373	[829]	1.37	[1.19]
lookup3	83	[101]	459	[412]	55	[97]	140	[101]	137	[95]	359	[361]	526	[550]	2480	[2392]	427	[834]	1.42	[1.65]
MaPrime2c	79	[103]	459	[426]	50	[106]	155	[91]	155	[106]	349	[349]	486	[550]	2493	[2406]	406	[865]	1.42	[1.73]
Ramakrishna	80	[108]	513	[409]	44	[91]	189	[125]	186	[103]	370	[360]	483	[528]	2565	[2383]	380	[840]	1.51	[1.66]
One At Time	85	[105]	562	[421]	58	[110]	221	[97]	220	[103]	392	[364]	511	[545]	2659	[2346]	459	[795]	1.72	[1.75]
Arash Partow	83	[101]	560	[435]	71	[420]	215	[98]	212	[85]	392	[355]	507	[570]	2638	[2372]	407	[779]	1.72	[3.88]
Weinberger	87	[104]	590	[422]	37	[100]	254	[111]	273	[117]	398	[364]	541	[712]	2734	[2547]	419	[744]	1.78	[1.75]
Hanson	73	[118]	417	[649]	45	[112]	123	[118]	1207	[499]	318	[435]	448	[592]	2324	[2890]	370	[833]	2.70	[2.46]

Pentium-M processor

	Words	Win32	Numbers	Prefix	Postfix	Variables	Sonnets	UTF-8	IPv4	Avg
Meiyan	80	[102]	426	[409]	56	[125]	123	[106]	121	[112]	354	[350]	525	[588]	2443	[2377]	445	[768]	1.02	[1.87]
Novak unrolled	90	[113]	517	[399]	56	[90]	169	[118]	164	[113]	398	[342]	575	[581]	2716	[2430]	482	[969]	1.18	[1.68]
Fletcher	84	[131]	444	[406]	102	[460]	140	[127]	133	[108]	374	[507]	592	[1052]	2891	[4893]	513	[1359]	1.21	[4.62]
SBox	88	[91]	552	[431]	57	[116]	181	[108]	178	[91]	414	[347]	560	[526]	2814	[2442]	472	[836]	1.22	[1.78]
Murmur2	97	[103]	532	[415]	65	[104]	165	[106]	162	[111]	434	[383]	622	[566]	2948	[2399]	537	[834]	1.25	[1.74]
CRC-32	90	[101]	565	[426]	55	[64]	198	[107]	192	[94]	427	[338]	590	[563]	2842	[2400]	469	[725]	1.26	[1.41]
x17 unrolled	93	[109]	593	[415]	52	[24]	214	[113]	208	[102]	434	[368]	593	[589]	2867	[2392]	486	[829]	1.30	[1.19]
lookup3	94	[101]	565	[412]	70	[97]	189	[101]	182	[95]	432	[361]	631	[550]	2943	[2392]	572	[834]	1.32	[1.65]
K&R	93	[106]	619	[437]	58	[288]	221	[94]	218	[106]	442	[360]	587	[561]	2961	[2365]	447	[831]	1.33	[3.00]
Larson	95	[99]	631	[416]	49	[16]	231	[99]	228	[105]	455	[366]	599	[583]	3027	[2447]	469	[755]	1.35	[1.10]
XXHfast32	108	[110]	546	[420]	86	[102]	139	[103]	136	[106]	459	[347]	681	[491]	3259	[2494]	717	[838]	1.35	[1.71]
Murmur3	108	[101]	561	[380]	74	[104]	167	[103]	165	[105]	468	[334]	700	[555]	3259	[2376]	604	[783]	1.36	[1.69]
Bernstein	97	[114]	622	[412]	61	[288]	225	[100]	222	[102]	448	[353]	609	[572]	3053	[2380]	469	[703]	1.37	[2.99]
XXHstrong32	108	[109]	558	[429]	86	[102]	150	[102]	147	[112]	460	[355]	682	[491]	3262	[2496]	714	[838]	1.38	[1.72]
x65599	99	[111]	628	[382]	61	[203]	234	[107]	232	[122]	459	[379]	630	[560]	3097	[2373]	471	[846]	1.40	[2.45]
Paul Hsieh	106	[114]	576	[420]	82	[118]	183	[101]	178	[100]	456	[341]	678	[600]	3154	[2380]	670	[847]	1.41	[1.83]
Sedgewick	101	[107]	667	[414]	52	[48]	245	[103]	242	[103]	478	[348]	630	[570]	3204	[2437]	475	[782]	1.42	[1.33]
Murmur2A	113	[114]	598	[433]	78	[102]	183	[112]	178	[109]	488	[365]	719	[544]	3380	[2369]	651	[772]	1.44	[1.73]
FNV-1a	102	[124]	660	[428]	62	[108]	239	[94]	237	[105]	473	[374]	627	[555]	3140	[2446]	516	[807]	1.44	[1.77]
MaPrime2c	108	[103]	705	[426]	65	[106]	255	[91]	254	[106]	508	[349]	674	[550]	3413	[2406]	542	[865]	1.54	[1.73]
Ramakrishna	108	[108]	728	[409]	61	[91]	278	[125]	272	[103]	511	[360]	660	[528]	3378	[2383]	517	[840]	1.56	[1.66]
Arash Partow	106	[101]	739	[435]	93	[420]	280	[98]	275	[85]	514	[355]	671	[570]	3332	[2372]	543	[779]	1.65	[3.88]
One At Time	118	[105]	830	[421]	81	[110]	321	[97]	319	[103]	578	[364]	741	[545]	3809	[2346]	657	[795]	1.82	[1.75]
Weinberger	119	[104]	956	[422]	54	[100]	375	[111]	379	[117]	614	[364]	745	[712]	3973	[2547]	560	[744]	1.89	[1.75]
Hanson	86	[118]	531	[649]	55	[112]	168	[118]	1722	[499]	393	[435]	549	[592]	2742	[2890]	463	[833]	2.60	[2.46]

Each cell includes the execution time, then the number of collisions in square brackets. Execution time is expressed in thousands of clock cycles (a lower number is better). Avg column contains the average normalized execution time (and the number of collisions).

The function by Kernighan and Ritchie is from their famous book "The C programming Language", 3rd edition; Weinberger's hash and the hash with multiplier 65599 are from the Red Dragon book. The latter function is used in gawk, sdbm, and other Linux programs. x17 is the function by Peter Kankowski (multiplier = 17; 32 is subtracted from each letter code).

As you can see from the table, the function with the lowest number of collisions is not always the fastest one.

Results on a large data set (list of all words in English Wikipedia, 12.5 million words, from the benchmark by Georgi 'Sanmayce'):

Core i5 processor

	Wikipedia	Avg
iSCSI CRC	5725944	[2077725]	1.00	[1.00]
Meiyan	5829105	[2111271]	1.02	[1.02]
Murmur2	6313466	[2081476]	1.10	[1.00]
Larson	6403975	[2080111]	1.12	[1.00]
Murmur3	6492620	[2082084]	1.13	[1.00]
x65599	6479417	[2102893]	1.13	[1.01]
FNV-1a	6599423	[2081195]	1.15	[1.00]
SBox	6964673	[2084018]	1.22	[1.00]
Hanson	7007689	[2129832]	1.22	[1.03]
CRC-32	7016147	[2075088]	1.23	[1.00]
Sedgewick	7060691	[2080640]	1.23	[1.00]
XXHfast32	7078804	[2084164]	1.24	[1.00]
K&R	7109841	[2083145]	1.24	[1.00]
XXHstrong32	7168788	[2084514]	1.25	[1.00]
Bernstein	7247096	[2074237]	1.27	[1.00]
lookup3	7342986	[2084889]	1.28	[1.01]
Murmur2A	7376650	[2081370]	1.29	[1.00]
Paul Hsieh	7387317	[2180206]	1.29	[1.05]
x17 unrolled	7410443	[2410605]	1.29	[1.16]
Ramakrishna	8172670	[2093253]	1.43	[1.01]
One At Time	8338799	[2087861]	1.46	[1.01]
MaPrime2c	8428492	[2084467]	1.47	[1.00]
Arash Partow	8503299	[2084572]	1.49	[1.00]
Weinberger	9416340	[3541181]	1.64	[1.71]
Novak unrolled	21289919	[6318611]	3.72	[3.05]
Fletcher	22235133	[9063797]	3.88	[4.37]

Pentium-M processor

	Wikipedia	Avg
x17 unrolled	11321744	[2410605]	1.00	[1.16]
K&R	11666050	[2083145]	1.03	[1.00]
Bernstein	11833902	[2074237]	1.05	[1.00]
Larson	11888751	[2080111]	1.05	[1.00]
Sedgewick	12111839	[2080640]	1.07	[1.00]
x65599	12144777	[2102893]	1.07	[1.01]
Arash Partow	12235396	[2084572]	1.08	[1.00]
Ramakrishna	12185834	[2093253]	1.08	[1.01]
Meiyan	12269691	[2111271]	1.08	[1.02]
CRC-32	12604152	[2075088]	1.11	[1.00]
Murmur2	12713455	[2081476]	1.12	[1.00]
SBox	12716574	[2084018]	1.12	[1.00]
Hanson	12627597	[2129832]	1.12	[1.03]
lookup3	12791917	[2084889]	1.13	[1.01]
FNV-1a	12868991	[2081195]	1.14	[1.00]
Murmur3	12916960	[2082084]	1.14	[1.00]
XXHfast32	12936106	[2084164]	1.14	[1.00]
XXHstrong32	12950650	[2084514]	1.14	[1.00]
Murmur2A	13068746	[2081370]	1.15	[1.00]
Paul Hsieh	12992315	[2180206]	1.15	[1.05]
MaPrime2c	13348580	[2084467]	1.18	[1.00]
One At Time	13662010	[2087861]	1.21	[1.01]
Weinberger	14592843	[3541181]	1.29	[1.71]
Fletcher	37410790	[9063797]	3.30	[4.37]
Novak unrolled	37769882	[6318611]	3.34	[3.05]

Some functions were excluded from the benchmark because of very bad performance:

Adler-32 (slow filling, not suitable as a hash function);
TwoChars (bad for machine-generated names and variable names that are similar to each other, disastrous for large data sets such as Wikipedia).

The number of collisions depending on the hash table size (for the same data set, thanks to Ace for the idea):

For 28 bits: Novak unrolled - 5.9 million collisions, Fletcher - 4.9 million collisions, Weinberger - 1.1 million collisions, x17 unrolled - 0.8 million collisions, Paul Hsieh - about 0.4 million collisions, other functions - about 0.3 million collisions

Red Dragon Book proposes the following formula for evaluating hash function quality:

where b_j is the number of items in j-th slot, m is the number of slots, and n is the total number of items. The sum of b_j(b_j + 1) / 2 estimates the number of slots your program should visit to find the required value. The denominator (n / 2m)(n + 2m − 1) is the number of visited slots for an ideal function that puts each item into a random slot. So, if the function is ideal, the formula should give 1. In reality, a good function is somewhere between 0.95 and 1.05. If it's more, there is a high number of collisions (slow!). If it's less, the function gives less collisions than the randomly distributing function, which is not bad.

Here are the results for some of our functions:

Hash function quality (using the formula from Red Dragon book). In Numbers test: K&R and Bernstein - 1.6, x65599 - 1.2, x17 and Paul Larson - 0.8, CRC-32 - 0.9. Meiyan, FNV-1a, SBox, Murmur2, Paul Hsieh, XXHfast32, and lookup3 - between 0.95 and 1.05. In other tests all functions have the quality between 0.95 and 1.05.

Conclusion

Complex functions by Paul Hsieh and Bob Jenkins are tuned for long keys, such as the ones in postfix and prefix tests. Note that they do not provide the best number of collisions for these tests, but do have the best time, which means that the functions are faster than the others because of loop unrolling. At the same time, they are suboptimal for short keys (words and sonnets tests).

For a word counting program, a compiler, or another application that typically handles short keys, it's often advantageous to use a simple multiplicative function such as x17 or Larson's hash. However, these functions perform badly on long keys.

Novak showed bad results on the large data set. Jesteress has a high number of collisions in numbers test.

Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.

Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.

Download the source code (152 KB, MSVC++)

Variations

XORing high and low part

For table size less than 2^16, we can improve the quality of hash function by XORing high and low words, so that more letters will be taken into account:

   return hash ^ (hash >> 16);

Subtracting a constant

x17 hash function subtracts a space from each letter to cut off the control characters in the range 0x00..0x1F. If the hash keys are long and contain only Latin letters and numbers, the letters will be less frequently shifted out, and the overall number of collisions will be lower. You can even subtract 'A' when you know that the keys will be only English words.

Using larger multipliers for a compiler

Paul Hsieh noted that large multipliers may provide better results for the hash table in a compiler, because a typical source code contains a lot of one-letter variable names (i, j, s, etc.), and they will collide if the multiplier is less than the number of letters in the alphabet.

The test confirms this assumption: the function by Kernighan & Ritchie (M = 33) has lower number of collisions than x17 (M = 17), but the latter is still faster (see Variables column in the table above).

Setting hash table size to a prime number

A test showed that the number of collisions will usually be lower if you use a prime, but the calculations modulo prime take much more time than the calculations for a power of 2, so this method is impractical. Even replacing division with multiplication by reciprocal values do not help here:

	Words	Win32	Numbers	Prefix	Postfix	Variables	Shakespeare
Bernstein % 2^K	145	[261]	880	[889]	426	[8030]	326	[214]	316	[226]	649	[697]	874	[1131]
Bernstein % prime	186	[221]	1049	[995]	445	[5621]	364	[194]	357	[217]	805	[800]	1123	[1051]
Bernstein optimized mod	160	[221]	960	[995]	416	[5621]	341	[194]	334	[217]	722	[800]	969	[1051]
x17 % 2^K	137	[193]	847	[1002]	81	[340]	314	[244]	300	[228]	641	[863]	832	[1012]
x17 % prime	173	[256]	1010	[1026]	104	[324]	356	[246]	339	[216]	760	[760]	1046	[1064]
x17 optimized mod	155	[256]	915	[1026]	96	[324]	330	[246]	315	[216]	691	[760]	930	[1064]

Implementing open addressing vs. separate chaining

With open addressing, most hash functions show awkward clustering behavior in "Numbers" test:

	Bernst.	K&R	x17 unroll	x65599	FNV	Univ	Weinb.	Hsieh	One-at	Lookup3	Partow	CRC
OA	426	81	84	207	88	91	273	110	103	92	1042	79
[8030]	[20810]	[340]	[3158]	[207]	[480]	[4360]	[342]	[267]	[205]	[20860]	[96]
32-bit	179	69	74	114	86	80	125	105	99	92	347	82
[8030]	[20810]	[340]	[3158]	[207]	[480]	[4360]	[342]	[267]	[205]	[20860]	[96]
chain	92	68	73	82	88	84	73	107	99	95	149	84
[500]	[500]	[24]	[258]	[124]	[48]	[100]	[138]	[131]	[108]	[1530]	[64]

You can avoid the worst case by using chaining for collision resolution. However, chaining requires more memory for the next item pointers, so the performance improvement does not come for free. A custom memory allocator should be usually written, because calling malloc() for a large number of small structures is suboptimal.

Some implementations (e.g., hash table in Python interpreter) store a full 32-bit hash with the item to speed up the string comparison, but this is less effective than chaining.

Translation

Hungarian translation by Zsolt Boros

205 comments

Ten recent comments are shown below. Show all comments

m^2, 4 years ago

I run a quick slash_hash test with SMHasher:

https://gist.github.com/anonymous/5926294

Not great.

Ace, 3 years ago

Safe against DoS attacks, by Jean-Philippe Aumasson and Daniel J. Bernstein:

https://131002.net/siphash/

Another mentioned:

http://google-opensource.blogspot.co.at/2011/04/introducing-cityhash.html

http://code.google.com/p/cityhash/

And the set of tests maintained by Murmur authors: http://code.google.com/p/smhasher/

Peter Kankowski, 3 years ago

Thank you, SipHash looks interesting. I will include it in the benchmarks.

ace, 3 years ago

http://blog.booking.com/hardening-perls-hash-function.html

"Analysis done by the Perl5 Security Team suggests that One-At-A-Time-Hash is intrinsically more secure than MurmurHash. However to my knowledge, there is no peer-reviewed cryptanalysis to prove it.

There seems to be very little research into fast, robust, hash algorithms which are suitable for dynamic languages. Siphash is a notable exception and a step forward, but within the Perl community there seems to be consensus that it is currently too slow, at least in the recommended Siphash-2-4 incarnation. It is also problematic that its current implementation only supports 64 bit architectures. (No doubt this will improve over time, or perhaps even already has.)"

JC Meyrignac, 3 years ago

Why didn't you try Pearson hashing ?

http://en.wikipedia.org/wiki/Pearson_hashing

Thank you, it looks interesting. I will definitely try it.

Robert, 2 years ago

Great article Peter, with some useful results about performance and collision probably on real world data.

Just one thing I want to query - I would question your assumption that increased complexity leads to lower collisions? In fact it could be argued that the reverse may be true. For example, FNV specifically advocates using "FNV primes" as the multiplier, which have relatively small avalanche effect, but are designed to systematically *reduce* collisions over short input sequences. For longer input sequences the distribution of a hash function with low avalanche effect can still approach the ideal random distribution.

That is why the Wikipedia article on the avalange effect correctly observes that it "refers to a desirable property of *cryptographic*algorithms" (my emphasis)

In general the avalanche effect is overrated and not really important for non crpyptographic / non hostile environments. It doesn't inherently reduce the probability of collision, but what the it does do is make it much harder to systematically construct a collision. (Making it harder, for example, for an attacker to ruin a hash table by sending inputs that systematically overload some buckets while leaving others empty.)

I think if you repeated your tests for complex/cryptogrpahic functions you would find that, while they obviously have lower performance, they dont have any significant difference in collision rate from any of the good non-cryptographic functions. But it would be great to see the actual results if you ever have time for this?

Peter Kankowski, 2 years ago

Robert, thank you very much for the kind words and suggestions. I agree that the avalanche effect is overrated; a simple multiplicative function is often enough.

There are functions designed to prevent an attacker from overloading a hash table (e.g., https://131002.net/siphash/ ).

I will include complex/cryptographic functions into the comparison, thank you for the idea.

Bulat, 6 months ago

You may find this fork useful: https://github.com/rurban/smhasher

In doc subdir it contains reports for all hashes

Dec	JAN	Feb
	30
2016	2017	2018

Hash functions: An empirical comparison

Multiplicative hash functions

Complex hash functions

Test conditions

Results

Core i5 processor

Pentium-M processor

Core i5 processor

Pentium-M processor

Conclusion

Variations

XORing high and low part

Subtracting a constant

Using larger multipliers for a compiler

Setting hash table size to a prime number

Implementing open addressing vs. separate chaining

Translation

About the author

205 comments

Featured pages

Recent comments