I would like to know why the following code snippet is a bad hashing function.
def computeHash(self, s):
    h = 0
    for ch in s:
        h = h + ord(ch)  # ord gives the ASCII value of character ch
    return h % HASH_TABLE_SIZE
If I dramatically increase the size of the hash table will this make up for the inadequacies of the hash function?
It's a bad hashing function because strings are order-sensitive, but the hash is not: "ab" and "ba" hash identically, and for longer strings the collisions only get worse; all of "abc", "acb", "bac", "bca", "cab", "cba" share the same hash.
For an order-insensitive data structure (e.g. frozenset) a strategy like this isn't as bad, but it's still too easy to produce collisions by simply reducing one ordinal by one and increasing another by one, or by putting a NUL character in there; frozenset({'\0', 'a'}) would hash identically to just frozenset({'a'}). Typically this is countered by incorporating the length of the collection into the hash in some manner.
Secure hashes (e.g. Python uses SipHash) are the best solution; a randomized seed combined with an algorithm that conceals the seed while incorporating it into the resulting hash not only makes it harder to accidentally create collisions (which simple improvements like hashing the index as well as the ordinal would help with, making the hash order- and length-sensitive), but also makes it nigh impossible for malicious data sources to intentionally cause collisions (which the simple improvements don't handle at all).
The other problem with this hash is that it doesn't distribute the bits evenly; short strings mean only low bits are set in the hash. This means that increasing the table size is completely pointless when the strings are all relatively short: if all the strings are ASCII and 100 characters or less, the largest possible raw hash value is 12700, so if you need to store a million such strings, you'll average nearly 79 collisions per bucket in the first 12,700 buckets (in practice much more than that for common buckets; there will be humps with many more collisions in the middling values, fewer collisions near the beginning, and almost none at the end, since something like '\x7f' * 100 is the only way to reach that maximum value), and no matter how many more buckets you have, they'll never be used. Technically, an open-addressing hash table might use them, but it would be largely equivalent to separate chaining per bucket, since all indices past 12700 would only be reached by the open-addressing "bounce around" algorithm; if that's badly designed, e.g. linear probing, you might end up linearly scanning the whole table even if no entries actually collide for your particular hash (your bucket was filled by chaining, and it has to linearly scan until it finds an empty slot or the matching element).
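To make the order/length-sensitivity point concrete, here is a minimal sketch of the kind of "simple improvement" mentioned above; the multiplier 31 and the 32-bit mask are my own arbitrary choices, and this still does nothing to resist deliberately crafted collisions:
def compute_hash(s, table_size):
    # Seed with the length, mix in each character's position, and multiply
    # by an odd constant so the high bits of the hash get used as well.
    h = len(s)
    for i, ch in enumerate(s):
        h = (h * 31 + (i + 1) * ord(ch)) & 0xFFFFFFFF
    return h % table_size
With this, "ab" and "ba" no longer collide and the raw hash spreads across the full 32-bit range, so a larger table actually helps; it remains trivial for an adversary to construct collisions, which is the problem SipHash addresses.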
Bad hashing function:
1. 'AC' and 'BB' give the same result, and for longer strings there are many permutations whose ASCII sums are equal.
2. Even strings of different lengths can give the same hash: 'A ' ('A' plus a space, 65 + 32 = 97) collides with 'a' (97).
3. Rearranging the characters of a string gives the same hash (see the quick check below).
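A quick check of those collisions; the table size of 1000 here is an arbitrary stand-in for HASH_TABLE_SIZE:
def compute_hash(s, table_size=1000):
    return sum(ord(ch) for ch in s) % table_size

print(compute_hash("AC"), compute_hash("BB"))    # 132 132 (65+67 == 66+66)
print(compute_hash("A "), compute_hash("a"))     # 97 97   (65+32 == 97)
print(compute_hash("abc"), compute_hash("cba"))  # equal: order is ignored entirely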
This is a bad hashing function. One big problem: re-arranging any or all characters returns the exact same hash.
Increasing the TABLE_SIZE does nothing to adjust for this.
Related
I have a set of ~100 long (between 120 and 150 characters) strings encoded using a 20 letter alphabet (the natural amino acid alphabet). I'm using them in database entries, but they're cumbersome. I'd like to shorten (not compressing, because I don't care about the memory size) them to make them easier to:
Visually compare
Copy/Paste
Manually enter
I was hoping a feasible way to shorten them would be to convert the string to a larger alphabet. Specifically, the set of single digits, as well as the lower and upper case alphabet.
For example:
# given some long string as input
shorten("ACTRYP...TW")
# returns something shorter like "a3A4n"
Possible approaches
From my elementary understanding of compression, this could be accomplished naively by making a lookup dictionary which maps certain repeating sequences to elements of the larger alphabet.
Related Question
This question seemed to point in a similar direction, but was working with the DNA alphabet and seemed to be actually seeking compression.
As suggested by @thethiny, a combination of hashing and base32 encoding can accomplish the desired shortening:
import base64
import hashlib
kinda_long = "ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYYQVLSLNRGVICFLKVSTSVWSYESAAGFTMSGSAQYDYNVSGKANRSDMPTAFDVSGA"
shorter = base64.b32encode(hashlib.sha256(kinda_long.encode()).digest()).decode().strip("=")
My original question mentioned using the ASCII alphabet and digits. That would be a base62 encoding. Various libraries exist for this.
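If you'd rather stay in the standard library, here is a minimal base62 sketch along those lines; shorten, the alphabet ordering, and the 10-character truncation are my own choices, and truncating the digest increases the (still very small) chance of two different sequences sharing a short ID:
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols

def shorten(sequence, length=10):
    # Hash the sequence, then re-encode the digest's integer value in base 62.
    n = int.from_bytes(hashlib.sha256(sequence.encode()).digest(), "big")
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))[:length]

# e.g. shorten("ELYWPSRVESG...") -> a 10-character mixed-case ID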
In Python 3, I have a list of strings and would find it useful to be able to append a sentinel that would compare greater than all elements in the list.
Is there a straightforward way to construct such an object?
I can define a class, possibly subclassing str, but it feels like there ought to be a simpler way.
For this to be useful in simplifying my algorithm, I need to do this ahead of time, before I know what the strings contained in the list are going to be (and so it can't be a function of those strings).
This is kind of a naïve answer, but when you're dealing with numbers and need a sentinel value for comparison purposes, it's not uncommon to use the largest (or smallest) number that a specific number type can hold.
Python strings are compared lexicographically, so to create a "max string", you'd simply need to create a long string of the "max char":
# 1114111 (0x10FFFF, i.e. sys.maxunicode) is the highest value that chr() accepts
MAX_CHAR = chr(1114111)

# One million is entirely arbitrary here.
# It should ideally be 1 + the length of the longest possible string that you'll compare against
MAX_STRING = MAX_CHAR * int(1e6)
Unless there are weird corner cases that I'm not aware of, MAX_STRING should now compare greater than any other string (other than itself), provided that it's long enough.
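A quick sanity check of that claim, assuming none of the real strings is itself a megabyte of U+10FFFF characters:
words = ["zebra", "apple", "\U0010ffff" * 10]
assert all(word < MAX_STRING for word in words)

words.append(MAX_STRING)
assert max(words) == MAX_STRING          # the sentinel sorts last
assert sorted(words)[-1] is MAX_STRING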
So I had a question about a serialization algorithm I just came up with it, wanted to know if it already exists and if there's a better version out there.
So we know normal algorithms use a delimiter to join the words in a list, but then you either have to scan every word for occurrences of the delimiter and escape them, or accept that the serialization isn't robust. I thought a more intuitive approach would be to use a higher-level language like Python, where len() is O(1), and prepend each word's length to it, as in the code attached below.
Wouldn't this be faster, because instead of going through every letter of every word we just go through every word? And during deserialization we don't have to look through every character to find the delimiter; we can skip directly to the end of each word.
The only problem I see is that double digit sizes would cause problems, but I'm sure there's a way around that I haven't found yet.
It was suggested to me that protocol buffers are similar to this idea, but I haven't understood why yet.
def serialize(list_o_words):
    return ''.join(str(len(word)) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index])
        next_index = index + word_length + 1
        deserialized_list.append(serialized_list_o_words[index+1:next_index])
        index = next_index
    return deserialized_list
serialized_list = "some,comma,separated,text".split(",")
print(serialize(serialized_list))
print(deserialize(serialize(serialized_list)) == serialized_list)
Essentially, I want to know how I can handle double digit lengths.
There are many variations on length-prefixed strings, but the key bits come down to how you store the length.
You're deserializing the lengths as a single-character ASCII number, which means you can only handle lengths from 0 to 9. (You don't actually test that on the serialize side, so you can generate garbage, but let's forget that.)
So, the obvious option is to use 2 characters instead of 1. Let's add in a bit of error handling while we're at it; the code is still pretty easy:
def _len(word):
    s = format(len(word), '02')
    if len(s) != 2:
        raise ValueError(f"Can't serialize {word!r}; it's too long")
    return s
def serialize(list_o_words):
    return ''.join(_len(word) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index+1 < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index:index+2])
        next_index = index + word_length + 2
        deserialized_list.append(serialized_list_o_words[index+2:next_index])
        index = next_index
    return deserialized_list
But now you can't handle strings >99 characters.
Of course you can keep adding more digits for longer strings, but if you think "I'm never going to need a 100,000-character string"… you are going to need it, and then you'll have a zillion old files in the 5-digit format that aren't compatible with the new 6-digit format.
Also, this wastes a lot of bytes. If you're using 5-digit lengths, s encodes as 00000s, which is 6x as big as the original value.
You can stretch things a lot farther by using binary lengths instead of ASCII. Now, with two bytes, we can handle lengths up to 65535 instead of just 99. And if you go to four or eight bytes, that might actually be big enough for all your strings ever. Of course this only works if you're storing bytes rather than Unicode strings, but that's fine; you probably needed to encode your strings for persistence anyway. So:
import struct

def _len(word):
    # struct.pack already raises an exception for lengths > 65535
    return struct.pack('>H', len(word))

def serialize(list_o_words):
    utfs8 = (word.encode() for word in list_o_words)
    return b''.join(_len(utf8) + utf8 for utf8 in utfs8)
Of course this isn't very human-readable or -editable; you need to be comfortable in a hex editor to replace a string in a file this way.
Another option is to delimit the lengths. This may sound like a step backward—but it still gives us all the benefits of knowing the length in advance. Sure, you have to "read until comma", but you don't have to worry about escaped or quoted commas the way you do with CSV files, and if you're worried about performance, it's going to be much faster to read a buffer of 8K at a time and chunk through it with some kind of C loop (whether that's slicing, or str.find, barely matters by comparison) than to actually read either until comma or just two bytes.
This also has the benefit of solving the sync problem. With delimited values, if you come in mid-stream, or get out of sync because of an error, it's no big deal; just read until the next unescaped delimiter and worst-case you missed a few values. With length-prefixed values, if you're out of sync, you're reading arbitrary characters and treating them as a length, which just throws you even more out of sync. The netstring format is a minor variation on this idea, with a tiny bit more redundancy to make sync problems easier to detect/recover from.
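Here is a rough sketch of the delimited-length idea, loosely modelled on netstrings (length, a colon, the payload, a trailing comma); it is not the exact netstring spec, and the function names are my own:
def serialize(words):
    # e.g. "4:some,15:comma,separated," -- only the length is delimited;
    # the payload itself is never scanned for commas or colons.
    return ''.join(f"{len(word)}:{word}," for word in words)

def deserialize(data):
    words, index = [], 0
    while index < len(data):
        colon = data.index(":", index)
        length = int(data[index:colon])
        start = colon + 1
        words.append(data[start:start + length])
        index = start + length + 1   # skip the trailing ','
    return words

assert deserialize(serialize(["some", "comma,separated", "text"])) == ["some", "comma,separated", "text"]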
Going back to binary lengths, there are all kinds of clever tricks for encoding variable-length numbers. Here's one idea, in pseudocode:
if the current byte is < hex 0x80 (128):
    that's the length
else:
    add the low 7 bits of the current byte
    plus 128 times (recursively process the next byte)
Now you can handle short strings with just 1 byte of length, but if a 5-billion-character string comes along, you can handle that too.
Of course this is even less human-readable than fixed binary lengths.
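For completeness, here is a small Python sketch of that variable-length scheme (the function names are my own; the layout is essentially the base-128 "varint" used by LEB128 and protocol buffers):
def encode_length(n):
    # 7 payload bits per byte, least-significant group first;
    # the high bit marks "another byte follows".
    out = bytearray()
    while True:
        low, n = n & 0x7F, n >> 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)

def decode_length(data, index=0):
    # Returns (length, index of the first byte after the varint).
    result = shift = 0
    while True:
        byte = data[index]
        index += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, index
        shift += 7

assert decode_length(encode_length(5))[0] == 5
assert decode_length(encode_length(5_000_000_000))[0] == 5_000_000_000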
And finally, if you ever want to be able to store other kinds of values, not just strings, you probably want a format that uses a "type code". For example, use I for 32-bit int, f for 64-bit float, D for datetime.datetime, etc. Then you can use s for strings <256 characters with a 1-byte length, S for strings <65536 characters with a 2-byte length, z for string <4B characters with a 4-byte length, and Z for unlimited strings with a complicated variable-int length (or maybe null-terminated strings, or maybe an 8-byte length is close enough to unlimited—after all, nobody's ever going to want more than 640KB in a computer…).
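A toy sketch of that type-code idea, covering just a few of the codes mentioned; dump_value is my own name, and a real format would of course need a matching reader:
import struct

def dump_value(value):
    if isinstance(value, int):
        return b'I' + struct.pack('>i', value)                # 32-bit signed int
    if isinstance(value, float):
        return b'f' + struct.pack('>d', value)                # 64-bit float
    if isinstance(value, str):
        data = value.encode()
        if len(data) < 256:
            return b's' + bytes([len(data)]) + data           # 1-byte length
        # struct.pack raises for lengths > 65535; a real format would fall
        # back to a wider code here.
        return b'S' + struct.pack('>H', len(data)) + data     # 2-byte length
    raise TypeError(f"no type code for {type(value).__name__}")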
I have two text files that are both about 1M lines.
Let's call them f1 and f2.
For each line in f1, I need to find the index of the line in f2 in which the f1 line appears as a substring. Since I need to do this for all lines of f1, using a nested for loop is too time-costly, and I was wondering if there is a workaround that could significantly reduce the time.
Thank you in advance for the help.
There certainly are better ways than using two for loops :D That would give you an O(n^2) runtime. Something very useful for finding substrings is called a rolling hash. It is a way to use previous information to speed up finding substrings. It goes something like this:
Say I have a string f1 = "cat" and a long string f2 = "There once was a cat named felix". What you do is define a "hash" based on the letters of your f1 string. The specifics can be found online in various sources, but for this example let's simplify things and say that letters are assigned numbers from 0 up to 25, and we'll weight each letter's value by a power of ten to form a decimal number whose number of digits equals the length of the string:
hash("cat") = 10^2 * 2 + 10^1 * 0 + 10^0 * 19
= some value (in python the "hash" values of letters
are not 0 through 25 but given by using the ord cast:
ord("a") will give you 97)
Now this next part is super cool. We designate windows the size of our f1 string, so size 3, and hash the f2 string in the same way we did with f1. You start with the first three letters. The hash doesn't match, so we move on. If the hash matched, we would make sure it's actually the same string (sometimes hashes are equal but the strings are not, because of the way we assign hashes, but that's ok).
The cool part: instead of simply shifting the window and rehashing the 2nd through 4th letters of f2, we "roll" the window and don't recalculate the entire hash (which, if f1 is really long, would be a waste of time), since the only letters changing are the first and last! The trick is to subtract the contribution of the outgoing first letter (in our example that would be ord("T") * 10^2), multiply the remaining value by ten (because we moved everything one place to the left), and add the incoming letter, ord("r") * 10^0. Check for a match again and move on. If we match, return the index.
Why are we doing this: with a long enough f1 string, the expected runtime drops to roughly O(len(f1) + len(f2)), i.e. asymptotically linear, instead of rehashing every window from scratch!
Now, the actual implementation takes time and could get messy but there are plenty of sources online for this kind of answer. My algorithms class has great course notes online which help understand the theory a little better and there are tons of links with python implementations. Hope this helps!
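For reference, here is a minimal Rabin-Karp sketch along those lines; the base of 256, the Mersenne-prime modulus, and the function name are my own choices, and it returns the index of the first match in the haystack or -1:
def find_substring_index(needle, haystack, base=256, mod=(1 << 61) - 1):
    # Rabin-Karp: hash the needle once, then roll a same-sized window over
    # the haystack, updating the window hash in O(1) per step.
    n, m = len(haystack), len(needle)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)           # weight of the window's leading character
    target = window = 0
    for i in range(m):
        target = (target * base + ord(needle[i])) % mod
        window = (window * base + ord(haystack[i])) % mod
    for start in range(n - m + 1):
        if window == target and haystack[start:start + m] == needle:
            return start                   # re-check to rule out a hash collision
        if start + m < n:
            window = ((window - ord(haystack[start]) * high) * base
                      + ord(haystack[start + m])) % mod
    return -1

print(find_substring_index("cat", "There once was a cat named felix"))  # 17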
I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
"from __main__ import COUNT, v1, v2", number=50)
Further note: I've made two extra tests: comparing strings with is instead of == suppresses the problem completely, and the performance is about 210 ms per 1M comparisons.
Since interning has been mentioned, I made sure to add a whitespace after each string, which should prevent interning; that doesn't change anything. Is it something other than interning, then?
Python can 'intern' short strings; it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it'll first test whether they are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:
    case Py_LE:
    case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
Interning normally only takes place for identifiers (function names, arguments, attributes, etc.) however, not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, then equal first characters, then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise reusing the same objects, all other speed characteristics come down to the memcmp() function.
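A small demonstration of the identity-versus-equality distinction; sys.intern is the stdlib way to force interning of runtime-built strings:
import sys

prefix = "runtime"
a = prefix + " value"     # built at runtime, so not automatically interned
b = "runtime value"       # a compile-time constant

print(a == b)             # True  -- equal contents
print(a is b)             # False -- different objects, so memcmp() had to run

c = sys.intern(a)
d = sys.intern(b)
print(c is d)             # True  -- interning maps equal strings to one object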
I am just making wild guesses, but you asked "what might" rather than "what does", so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and be able to use very fast SSE2 vector instructions to match two strings.