Let's say I have n machines and I need to allocate data across those machines as uniformly as possible. Let's use n = 5 for this example. The data will look like:
id state name date
1 'DE' 'Aaron' 2014-01-01
To shard on the id, I could do a function like:
machine_num = id % n
To shard on a string, I suppose the most basic way would be something like string-to-binary-to-number:
name_as_num = int(''.join(format(ord(i), 'b') for i in name), 2)
machine_num = name_as_num % n
Or even simpler:
machine_num = ord(name[0]) % n
What would be an example of how a date or timestamp could be sharded? And what might be a better function to shard a string (or even numeric) field than the ones I'm using above?
Since hash functions are meant to produce evenly distributed numbers, you can use Python's built-in hash() function for your purpose:
machine_num = hash(name) % n
Works for datetime objects too:
machine_num = hash(datetime(2019, 10, 2, 12, 0, 0)) % n
But as @jasonharper pointed out in the comments, the hash value of a specific object is only guaranteed to be consistent within the same run of a program, so if you require the distribution to be consistent across multiple runs, you would have to write your own hashing function like the ones in your question.
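If you need the same key to land on the same machine across runs, one option is to derive the shard from a cryptographic digest, which is fully deterministic. A minimal sketch (n and the sample values come from the question; the helper name stable_shard is mine):

import hashlib
from datetime import datetime

def stable_shard(value, n):
    # Deterministic across runs and machines: hash the value's string form
    # with SHA-256 and reduce the digest modulo the number of machines.
    digest = hashlib.sha256(str(value).encode('utf-8')).digest()
    return int.from_bytes(digest, 'big') % n

n = 5
print(stable_shard('Aaron', n))
print(stable_shard(datetime(2019, 10, 2, 12, 0, 0), n))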
Without further knowledge about the structure and distribution of the keys used for shard operations, a hash function is a good approach. The Python standard library provides, in the zlib module, the simple functions adler32 and crc32, which take bytes (actually anything with a buffer interface) and return an unsigned 32-bit integer to which modulo can then be applied to get the machine number.
CRC and Adler are fast algorithms, but the documentation says that "Since the algorithm is designed for use as a checksum algorithm, it is not suitable for use as a general hash algorithm." So the distribution may not be optimal (uniform).
Cryptographic hashes (slower, but with better distribution) are available through the hashlib module. They return their digest as a byte sequence, which can be converted to an integer with int.from_bytes.
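A short sketch of both routes, assuming the key has already been serialized to bytes (the variable names are mine):

import zlib
import hashlib

n = 5
key = 'Aaron'.encode('utf-8')

# Fast checksum route: crc32 (or adler32) returns an unsigned 32-bit integer.
machine_num_crc = zlib.crc32(key) % n

# Cryptographic route: slower, but effectively uniform.
digest = hashlib.sha256(key).digest()
machine_num_sha = int.from_bytes(digest, 'big') % n

print(machine_num_crc, machine_num_sha)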
I have recently started playing around with Bloom filters, and I have a use case in which this calculator suggests using 15 different and unrelated hash functions.
Hashing a string 15 times would be quite compute-intensive, so I started looking for a better solution.
Luckily, I encountered the Kirsch-Mitzenmacher optimization, which suggests the following: hash_i = hash1 + i * hash2, with 0 ≤ i ≤ k - 1, where k is the number of required hash functions.
In my specific case, I am trying to create a 0.5 MiB Bloom filter that should eventually be able to efficiently store 200,000 (200k) pieces of information. With 15 hash functions, this would result in p = 0.000042214, or 1 in 23,689 (23k).
I tried implementing the optimization in Python as follows:
import uuid
import hashlib
import mmh3

m = 4194304  # 0.5 MiB in bits
k = 15
bit_vector = [0] * m

data = str(uuid.uuid4()).encode()
hash1 = int(hashlib.sha256(data).hexdigest(), base=16)
hash2 = mmh3.hash(data)

for _ in range(0, k-1):
    hash1 += hash2
    bit = hash1 % m
    bit_vector[bit] = 1
But it seems to get far worse results compared to creating a classic Bloom filter with only two hashes (it performs 10x worse in terms of false positives).
What am I doing wrong here? Am I completely misinterpreting the optimization - hence using the formula in the wrong way - or am I just using the wrong "parts" of the hashes (here I am using the full hashes in the calculation; maybe I should be using the first 32 bits of one and the last 32 bits of the other)? Or maybe the problem is me encoding the data before hashing it?
I also found this Git repository that talks about Bloom filters and the optimization, and their results when using it are quite good both in terms of computing time and in false positive counts.
*In the example I am using Python, as that's the programming language I'm most comfortable with for testing things out, but in my project I am actually going to use JavaScript to perform the whole process. Help in any programming language of your choice is much appreciated regardless.
The Kirsch-Mitzenmacher optimization comes from a proof-of-concept paper, and as such it assumes a number of requirements on the table size and on the hash functions themselves. Using it naively has tripped up people before you too; there is a lot of practical detail to consider. You can check out section "6.5.1 Double hashing" of P. C. Dillinger's thesis, linked in his long comment as well, in which he explains the issues and offers solutions. His solutions to the rocksdb implementation issues can also be interesting.
Maybe start by trying the "enhanced double hashing":
hashes = []
for i in range(k):
    hash2 += i
    hashes.append(hash1)
    hash1 += hash2
(I don't know why your example code only generates one hash as a result; I assume you pasted simplified code. 15 hash functions should set 15 bits.)
If this doesn't work, and you are tired of reading and trying more solutions - say "it's not worth it" at this point - then find a library with a good implementation instead and compare the candidates for speed and accuracy. Unfortunately, I don't know of any to recommend. Maybe the rocksdb implementation, since we know an expert has worked on it?
And even that may not be worth the effort; calling mmh3.hash(data, i) with 15 different seeds may be reasonably fast enough.
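A minimal sketch of that last option, assuming a reasonably recent mmh3 (whose hash() accepts a seed and a signed flag) and reusing m, k and bit_vector from the question; the payload is a placeholder:

import mmh3

m = 4194304
k = 15
bit_vector = [0] * m
data = b"example-key"  # placeholder payload

# One MurmurHash3 call per seed; each seed acts as an independent hash function.
for seed in range(k):
    bit = mmh3.hash(data, seed, signed=False) % m
    bit_vector[bit] = 1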
You are indeed misinterpreting the Kirsch-Mitzenmacher optimization. You are combining the hash1 and hash2 functions, plus all your k functions, into what amounts to a single, more complicated hash function.
The point of the optimization is to use each of these as separate hash functions. You don't merge the k bit results together in any way, you set each of them separately in your bit vector. In other words, you should have something roughly along the lines of:
bit_vector[hash1 % m] = 1
bit_vector[hash2 % m] = 1
for i in range(0, k - 1):
    bit_vector[(hash1 + i * hash2) % m] = 1
See how I'm setting a bit for every hash function? You're just combining them all. That ruins the whole point of using multiple different "independent" hash functions, which is really necessary for good results from a Bloom filter.
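For completeness, the membership test mirrors the insertion: recompute the same k indices and report "possibly present" only if every one of those bits is set. A rough sketch using the hash1 + i * hash2 family from the question (the helper name is mine):

def possibly_contains(hash1, hash2, bit_vector, k, m):
    # One unset bit means the key is definitely not in the filter;
    # all bits set means it is possibly present (false positives allowed).
    return all(bit_vector[(hash1 + i * hash2) % m] for i in range(k))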
I was playing around with Python's built-in hash() function on strings when I noticed something odd. Typically, a normal hash function is supposed to be uncorrelated, in the sense that hash(B) should be completely unrecognizable from hash(A) (for sufficient definitions of uncorrelated/unrecognizable).
However, this quick little script shows otherwise:
In [1]: for i in range(15):
   ...:     print hash('test{0}'.format(i))
   ...:
-5092793511388848639
-5092793511388848640
-5092793511388848637
-5092793511388848638
-5092793511388848635
-5092793511388848636
-5092793511388848633
-5092793511388848634
-5092793511388848631
-5092793511388848632
5207588497627702649
5207588497627702648
5207588497627702651
5207588497627702650
5207588497627702653
I understand Python's hash() function isn't supposed to be cryptographically secure by any stretch, and for that you would use the hashlib library, but why are the values of testX so regularly distributed? This seems to me like it could have poor collision behavior.
The hash is calculated one character after the other. That's why the hashes are so similar.
During the computation, "test0" and "test1" have the exact same hash up to "test". There's only one bit difference, in the last character. In secure hashes, changing one bit anywhere should completely change the whole hash (e.g. thanks to multiple passes).
You can check this behaviour by calculating the hash of "0test" and "1test":
>>> for i in range(15):
...     print hash('{0}test'.format(i))
...
-2218321119694330423
-198347807511608008
-8430555520134600289
1589425791872121742
-6642709920510870371
-4622800608552147860
8038463826323963107
2058173137418684322
-8620450647505857711
-6600477335291135136
8795071937164440413
4111679291630235372
-765820399655801141
2550858955145994266
6363120682850473265
This is the kind of spread-out distribution you were expecting, right? By the way, Python 3 seems to use a different hash computation for strings.
For more information about the Python 2 string hash, take a look at "Python Hash Algorithms":
class string:
    def __hash__(self):
        if not self:
            return 0  # empty
        value = ord(self[0]) << 7
        for char in self:
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value
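c_mul is not defined in the quoted snippet; it stands for C multiplication that wraps on overflow. A hedged sketch of one way to emulate it on a 64-bit build (mask to 64 bits, then map back into the signed range):

def c_mul(a, b):
    # Wrap like a 64-bit C multiply, then reinterpret as a signed value.
    result = (a * b) & 0xFFFFFFFFFFFFFFFF
    return result - (1 << 64) if result >= (1 << 63) else result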
By the way, this problem isn't related to Python. In Java, "Aa" and "BB" share the same hash.
The Python hash function is not a cryptographic hash (i.e., it need not protect against collisions or show an avalanche effect, etc.); it's just an identifier (e.g., to be used as a dictionary key) for objects.
Read more about __hash__ and hash in the documentation.
As stated there:
__hash__() should return an integer. The only required property is that objects which compare equal have the same hash value
And, as Jean-François Fabre pointed out in a comment, Python hashes must be fast (i.e., to build dictionaries quickly). Cryptographic hashes are slow and therefore unusable for that.
By the way, in Python 3 the distribution looks way more random.
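For illustration, the same loop under Python 3 shows no obvious pattern, and the values change between interpreter runs (unless PYTHONHASHSEED is fixed), because string hashing is salted:

# Python 3: string hashes are salted per process (SipHash by default),
# so nearby keys no longer get nearby hash values.
for i in range(15):
    print(hash('test{0}'.format(i)))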
The explanation can be found in the comments for the source code of Python2.7's Objects/dictobject.c:
Major subtleties ahead: Most hash schemes depend on having a "good"
hash function, in the sense of simulating randomness. Python doesn't:
its most important hash functions (for strings and ints) are very
regular in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
This isn't necessarily bad! To the contrary, in a table of size 2**i,
taking the low-order i bits as the initial table index is extremely
fast, and there are no collisions at all for dicts indexed by a
contiguous range of ints. The same is approximately true when keys are
"consecutive" strings. So this gives better-than-random behavior in
common cases, and that's very desirable.
I am trying to delta compress a list of pixels and store them in a binary file. I have managed to do this; however, the method I found takes ~4 minutes per frame.
def getByte_List(self):
    values = BitArray("")
    for I in range(len(self.delta_values)):
        temp = Bits(int=self.delta_values[I], length=self.num_bits_pixel)
        values.append(temp)

    ##start_time = time.time()
    bit_stream = pack("uint:16, uint:5, bits", self.intial_value, self.num_bits_pixel, values)
    ##end_time = time.time()
    ##print(end_time - start_time)

    # Make sure that the list of bits contains a multiple of 8 values
    if (len(bit_stream) % 8):
        bit_stream.append(Bits(uint=0, length=(8 - (len(bit_stream) % 8))))  #####Append? On a pack? (Only work on bitarray? bit_stream = BitArray("")

    # Create a list of unsigned integer values to represent each byte in the stream
    fmt = (len(bit_stream)/8) * ["uint:8"]
    return bit_stream.unpack(fmt)
This is my code. I take the initial value, the number of bits per pixel and the delta values, and convert them into bits. I then byte-align, take the integer representation of the bytes, and use it elsewhere. The problem areas are where I convert each delta value to bits (3 min) and where I pack (1 min). Is it possible to do what I am doing faster, or is there another way to pack them straight into integers representing bytes?
From a quick Google of the classes you're instantiating, it looks like you're using the bitstring module. This is written in pure Python, so it's not a big surprise that it's pretty slow. You might look at one or more of the following:
struct - a module that comes with Python that will let you pack and unpack C structures into constituent values
bytearray - a built-in type that lets you accumulate, well, an array of bytes, and has both list-like and string-like operations
bin(x), int(x, 2) - conversion of numbers to a binary representation as a string, and back - string manipulations can sometimes be a reasonably efficient way to do this
bitarray - native (C) module for bit manipulation; it looks like it has similar functionality to bitstring but should be much faster. Available here in a form suitable for compiling on Linux, or here pre-compiled for Windows.
numpy - fast manipulation of arrays of various types including single bytes. Kind of the go-to module for this sort of thing, frankly. http://www.numpy.org/
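As a rough illustration of the numpy route, here is a hedged sketch that packs fixed-width values into bytes with np.packbits; it assumes the deltas have already been mapped to unsigned integers that fit in num_bits_pixel bits (the question's deltas are signed, so an offset or two's-complement mapping would be needed first):

import numpy as np

num_bits_pixel = 3
deltas = np.array([3, 0, 7, 2, 5], dtype=np.uint8)  # example values

# Expand each value into its bits (most significant first), then flatten
# and pack eight bits per byte; packbits zero-pads to a byte boundary.
shifts = np.arange(num_bits_pixel - 1, -1, -1)
bits = ((deltas[:, None] >> shifts) & 1).astype(np.uint8)
packed = np.packbits(bits.ravel())

print(list(packed))  # unsigned byte values, ready to write to a binary file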
Just for debugging purposes I would like to map a big string (a session_id, which is difficult to visualize) to a, let's say, 6-character "hash". This hash does not need to be secure in any way, just cheap to compute and of fixed, reduced length (an MD5 digest is too long). The input string can have any length.
How would you implement this "cheap_hash" in Python so that it is not expensive to compute? It should generate something like this:
def compute_cheap_hash(txt, length=6):
    # do some computation
    return cheap_hash

print compute_cheap_hash("SDFSGSADSADFSasdfgsadfSDASAFSAGAsaDSFSA2345435adfdasgsaed")
aBxr5u
I can't recall if MD5 is uniformly distributed, but it is designed to change a lot even for the smallest difference in the input.
Don't trust my math, but I guess the collision chance is 1 in 16^6 for the first 6 digits of the MD5 hexdigest, which is about 1 in 17 million.
So you can just use cheap_hash = lambda input: hashlib.md5(input).hexdigest()[:6].
After that you can use hash = cheap_hash(any_input) anywhere.
PS: Any algorithm can be used; MD5 is slightly cheaper to compute but hashlib.sha256 is also a popular choice.
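A minimal sketch of the same idea that also works on Python 3, where hashlib expects bytes rather than str (the function signature follows the question):

import hashlib

def compute_cheap_hash(txt, length=6):
    # Not for security: just a short, stable identifier for debugging output.
    return hashlib.md5(txt.encode('utf-8')).hexdigest()[:length]

print(compute_cheap_hash("SDFSGSADSADFSasdfgsadfSDASAFSAGAsaDSFSA2345435adfdasgsaed"))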
def cheaphash(string, length=6):
    if length < len(hashlib.sha256(string).hexdigest()):
        return hashlib.sha256(string).hexdigest()[:length]
    else:
        raise Exception("Length too long. Length of {y} when hash length is {x}.".format(x=str(len(hashlib.sha256(string).hexdigest())), y=length))
This should do what you need; it simply uses the hashlib module, so make sure to import it before using this function.
I found this similar question: https://stackoverflow.com/a/6048639/647991
So here is the function:
import hashlib

def compute_cheap_hash(txt, length=6):
    # This is just a hash for debugging purposes.
    # It does not need to be unique, just fast and short.
    hash = hashlib.sha1()
    hash.update(txt)
    return hash.hexdigest()[:length]
Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns a uniformly random result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off hash, like a lookup over two or three static strings, look at gperf, which generates a perfect hash function and emits simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap should be the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably to just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't available. That's because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.