Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means the data is unordered, because a good hash function returns a uniformly random-looking result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you're doing a quick one-off lookup over just two or three static keys, look at gperf, which generates a perfect hash function and emits simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap is the equivalent; Google for a hashmap implementation.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is probably to just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient one wasn't available. That's because it's massively quick to code up and the performance is irrelevant.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) into month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.
I am relatively new to Python. However, my needs generally only involve simple string manipulation of rigidly formatted data files. I have a specific situation that I have scoured the web trying to solve and have come up blank.
This is the situation. I have a simple list of two-part entries, formatted like this:
name = ['PAUL;25', 'MARY;60', 'PAUL;40', 'NEIL;50', 'MARY;55', 'HELEN;25', ...]
And, I need to keep only one instance of any repeated name (ignoring the number to the right of the ' ; '), keeping only the entry with the highest number, along with that highest value still attached. So the answer would look like this:
ans = ['MARY;60', 'PAUL;40', 'HELEN;25', 'NEIL;50', ...]
The order of the elements in the list is irrelevant, but the format of the ans list entries must remain the same.
I can probably figure out a way to brute force it. I have looked at 2D lists, sets, tuples, etc. But, I can't seem to find the answer. The name list has about a million entries, so I need something that is efficient. I am sure it will be painfully easy for some of you.
Thanks for any input you can provide.
Cheers.
alkemyst
Probably the best data structure for this would be a dictionary, with the entries split up (and converted to integer) and later re-joined.
Something like this:
max_score = {}
for n in name:
    person, score_str = n.split(';')
    score = int(score_str)
    if person not in max_score or max_score[person] < score:
        max_score[person] = score

ans = [
    '%s;%s' % (person, score)
    for person, score in max_score.items()
]
This is a fairly common structure for many functions and programs: first convert the input to an internal representation (in this case, split and convert to integer), then do the logic or calculation (in this case, uniqueness and maximum), then convert to the required output representation (in this case, string separated with ;).
In terms of efficiency, this code looks at each input item once, then at each output item once; there's unlikely to be any approach that can do better than that (certainly not formally, and likely not in practice). All of the per-item operations are constant-time and fast. It accumulates the intermediate answer in memory (in max_score), but again that is unavoidable; if memory is an issue, the input and output could be changed to iterators/generators, but the whole intermediate answer has to be accumulated in max_score before any items can be output.
I was using a dictionary as a lookup table but I started to wonder if a list would be better for my application; the number of entries in my lookup table wasn't that big. I know lists use C arrays under the hood, which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
I decided to profile the alternatives but the results surprised me. List lookup was only better with a single element! See the following figure (log-log plot):
So here comes the question: Why do list lookups perform so poorly? What am I missing?
On a side question, something else that caught my attention was a little "discontinuity" in the dict lookup time after approximately 1000 entries. I plotted the dict lookup time alone to show it.
p.s.1 I know about O(n) vs O(1) amortized time for arrays and hash tables, but it is usually the case that for a small number of elements iterating over an array is better than using a hash table.
p.s.2 Here is the code I used to compare the dict and list lookup times:
import timeit

lengths = [2 ** i for i in xrange(15)]
list_time = []
dict_time = []
for l in lengths:
    list_time.append(timeit.timeit('%i in d' % (l/2), 'd=range(%i)' % l))
    dict_time.append(timeit.timeit('%i in d' % (l/2),
                                   'd=dict.fromkeys(range(%i))' % l))
    print l, list_time[-1], dict_time[-1]
p.s.3 Using Python 2.7.13
I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.
Your list lookups need to compute == a lot more than your dict lookups do.
Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but for all ints involved here, they just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)
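A couple of quick checks in a CPython interpreter (implementation-defined behaviour, not a language guarantee) illustrate the point:

    print(hash(42) == 42)   # True: an int hashes to itself
    print(hash(-1) == -2)   # True: -1 is remapped because CPython uses -1 internally as an error signal
    print(hash(2**64 + 1))  # a large integer is folded down to a smaller hash value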
Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.
Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
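For illustration, the small-integer cache can be observed directly (again a CPython implementation detail); the values are built at runtime to avoid compile-time constant sharing:

    a = int("256")
    b = int("256")
    print(a is b)    # True on CPython: both names refer to the single cached 256 object

    c = int("257")
    d = int("257")
    print(c is d)    # False: 257 lies outside the cache, so two distinct objects are created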
The short answer is that lists use linear search and dicts use amortized O(1) search.
In addition, dict searches can skip an equality test either when 1) hash values don't match or 2) when there is an identity match. Lists only benefit from the identity-implies equality optimization.
Back in 2008, I gave a talk on this subject where you'll find all the details: https://www.youtube.com/watch?v=hYUsssClE94
Roughly the logic for searching lists is:
for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False
For dicts the logic is roughly:
h = hash(target)
for i in probe_sequence(h, len(table)):
    element = key_table[i]
    if element is UNUSED:
        raise KeyError(target)
    if element is target:
        # fast path for identity implies equality
        return value_table[i]
    if h != h_table[i]:
        # unequal hashes implies unequal keys
        continue
    if element == target:
        # slower check for actual equality
        return value_table[i]
Dictionary hash tables are typically between one-third and two-thirds full, so they tend to have few collisions (few trips around the loop shown above) regardless of size. Also, the hash value check prevents needless slow equality checks (the chance of a wasted equality check is about 1 in 2**64).
If your timing focuses on integers, there are some other effects at play as well. The hash of an int is the int itself, so hashing is very fast. Also, it means that if you're storing consecutive integers, there tend to be no collisions at all.
You say "accessing a few elements in an array is faster than computing a hash".
A simple hashing rule for strings might be just a sum of the character codes (with a modulo at the end). This is a branchless operation that can compare favorably with character-by-character comparison, especially when there is a long match on the prefix.
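A toy sketch of that idea (illustrative only; this is not the hash Python or any real hash table actually uses):

    def sum_hash(s, table_size=1024):
        # Sum the character codes, then reduce modulo the table size.
        # Every character contributes one branch-free addition, whereas an
        # equality comparison may walk a long shared prefix before it can bail out.
        total = 0
        for ch in s:
            total += ord(ch)
        return total % table_size

A hash this simple distributes badly (all anagrams collide), which is why real string hashes mix the bits much more aggressively.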
My goal is to iterate through a set S of elements given a single element and an action G: S -> S that acts transitively on S (i.e., for any elt, elt' in S, there is a map f in G such that f(elt) = elt'). The action is finitely generated, so I can apply each generator to a given element.
The algorithm I use is:
def orbit(act, elt):
    new_elements = [elt]
    seen_elements = set([elt])
    yield elt
    while new_elements:
        elt = new_elements.pop()
        seen_elements.add(elt)
        for f in act.gens():
            elt_new = f(elt)
            if elt_new not in seen_elements:
                new_elements.append(elt_new)
                seen_elements.add(elt_new)
                yield elt_new
This algorithm seems to be well-suited and very generic. BUT it has one major and one minor slowdown in big computations that I would like to get rid of:
The major: seen_elements collects all the elements and is thus too memory-consuming, given that I do not need the actual elements any more.
How can I achieve to not have all the elements stored in memory?
Very likely, this depends on what the elements are. So for me, these are short lists (<10 entries) of ints (each < 10^3). So first, is there a fast way to associate a (with high probability) unique integer to such a list? Does that save much memory? If so, should I put those into a dict to check the containment (in this case, first the hash equality test, and then an int equality test are done, right?), or how should I do that?
The minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
Thanks a lot for your suggestions!
So first, is there a fast way to associate a (with high probability) unique integer to such a list?
If the list entries all are in range(1, 1024), then sum(x << (i * 10) for i, x in enumerate(elt)) yields a unique integer.
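For example (a sketch; `pack` is just a hypothetical helper name):

    def pack(elt):
        # Pack a short list of ints, each in range(1, 1024), into one integer,
        # 10 bits per entry. With every entry >= 1, distinct lists give distinct integers.
        return sum(x << (i * 10) for i, x in enumerate(elt))

    print(pack([3, 17, 900]))   # 3 + (17 << 10) + (900 << 20) = 943735811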
Does that save much memory?
The short answer is yes. The long answer is that it's complicated to determine how much. Python's long integer representation uses (probably) 30-bit digits, so the digits will pack 3 to the 32-bit word instead of 1 (or 0.5 for 64-bit). There's some object overhead (8/16 bytes?), and then there's the question of how many of the list entries require separate objects, which is where the big win may lie.
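A rough way to see the saving (exact sizes vary by platform and Python version, and note that getsizeof on a list does not count the int objects the list points to):

    import sys

    elt = [3, 17, 900, 512, 7]
    packed = sum(x << (i * 10) for i, x in enumerate(elt))   # same packing as above

    print(sys.getsizeof(elt))      # the list object alone; each entry is a separate pointer plus an int object
    print(sys.getsizeof(packed))   # a single integer object holding all five entries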
If you can tolerate errors, then a Bloom filter would be a possibility.
the minor: popping the element takes a lot of time given that I don't quite need that list. Is there a better way of doing that?
I find that claim surprising. Have you measured?
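Putting the two suggestions together, here is a sketch of the generator that remembers packed integers instead of the elements themselves; it assumes the hypothetical `pack` helper above and the `act.gens()` interface from the question:

    from collections import deque

    def orbit(act, elt):
        # Same traversal as in the question, but `seen` holds packed integers
        # rather than the element lists, and the frontier is a deque.
        frontier = deque([elt])
        seen = {pack(elt)}
        yield elt
        while frontier:
            elt = frontier.pop()            # O(1); use popleft() for breadth-first order instead
            for f in act.gens():
                elt_new = f(elt)
                key = pack(elt_new)
                if key not in seen:
                    seen.add(key)
                    frontier.append(elt_new)
                    yield elt_new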
I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about it? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is VERY bad in memory usage; if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory, making the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is to explicitly use a hash function (either the builtin hash function, one you implement, or one from the hashlib module) and, instead of ht[s] += 1, insert:
ht[hash(s)] += 1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, a dict stores its keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5. It is a 16-byte digest, so collisions are unlikely.
Try BerkeleyDB for a disk based approach.
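A sketch of the MD5 idea applied to the counting loop from the question (it assumes the sentences are text; collisions are astronomically unlikely, but not impossible):

    import hashlib
    from collections import defaultdict

    counts = defaultdict(int)
    for s in sentences:                                   # `sentences` as in the question
        digest = hashlib.md5(s.encode("utf-8")).digest()  # 16-byte key instead of the full sentence
        counts[digest] += 1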
Python dicts are indeed monsters in memory. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:
for x in xrange(5000000): # it's 5 millions
    d[x] = random.getrandbits(BITS)
For BITS = 64 it takes 510 MB of my RAM, for BITS = 128 550 MB, for BITS = 256 650 MB, for BITS = 512 830 MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:
for x in xrange(5000000): # it's 5 millions
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1 GB of my memory. Conclusion? If you want to keep two 64-bit integers, use one 128-bit integer, like this:
for x in xrange(5000000): # it's still 5 millions
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It'll reduce memory usage by a factor of two.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when storing just integers. You have a good idea with hashes, but you probably also want to keep a pointer to the sentence, so that in case of a collision you can investigate (compare the sentences character by character and probably print them out). You could create a pointer as an integer, for example by encoding the file number and the offset within it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)   # make it an integer
    hash_value = hash(s)     # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where the key is a pointer to a sentence and the value is an array of pointers to the sentences it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through MD5: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document with many, I would suggest looking into Bloom filters; you can encode them however you find most efficient for your problem.
I have a set of lots of big long strings that I want to do existence lookups for. I don't need the whole string ever to be saved. As far as I can tell, set() actually stores the strings, which is eating up a lot of my memory.
Does such a data structure exist?
done = hash_only_set()
while len(queue) > 0:
    item = queue.pop()
    if item not in done:
        process(item)
        done.add(item)
(My queue is constantly being filled by other threads so I have no way of dedupping it at the start).
It's certainly possible to keep a set of only hashes:
done = set()
while len(queue) > 0:
    item = queue.pop()
    h = hash(item)
    if h not in done:
        process(item)
        done.add(h)
Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.
If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen them before. Alternatively: perhaps the processing itself would be able to tell?
Yet alternatively: if you cannot accept to keep the strings in memory, keep them in a database, or create files in a directory with the same name as the string.
You can use a data structure called Bloom Filter specifically for this purpose. A Python implementation can be found here.
EDIT: Important notes:
False positives are possible in this data structure, i.e. a check for the existence of a string could return a positive result even though it was not stored.
False negatives (getting a negative result for a string that was stored) are not possible.
That said, the chances of this happening can be brought to a minimum if used properly and so I consider this data structure to be very useful.
If you use a secure hash function (like SHA-256, found in the hashlib module) to hash the strings, it's very unlikely that you would find a duplicate (and if you find some, you can probably win a prize, as with most cryptographic hash functions).
The builtin __hash__() method does not guarantee you won't have duplicates (and since it only uses 32 bits, it's very likely you'll find some).
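A sketch of the question's loop using SHA-256 digests as the set members (it assumes the items are text):

    import hashlib

    done = set()
    while len(queue) > 0:
        item = queue.pop()
        key = hashlib.sha256(item.encode("utf-8")).digest()   # 32 bytes per item, regardless of string length
        if key not in done:
            process(item)
            done.add(key)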
You need to know the whole string to have 100% certainty. If you have lots of strings with similar prefixes you could save space by using a trie to store the strings. If your strings are long you could also save space by using a large hash function like SHA-1 to make the possibility of hash collisions so remote as to be irrelevant.
If you can make the process() function idempotent - i.e. having it called twice on an item is only a performance issue, then the problem becomes a lot simpler and you can use lossy datastructures, such as bloom filters.
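Relating to the trie suggestion above, a minimal membership-only trie might look like this (illustrative; real implementations also compress chains of single children to save more space):

    class Trie(object):
        # A set-of-strings trie: strings that share a prefix share the nodes for it.
        def __init__(self):
            self.root = {}

        def add(self, s):
            node = self.root
            for ch in s:
                node = node.setdefault(ch, {})
            node[''] = True            # marks the end of a complete string

        def __contains__(self, s):
            node = self.root
            for ch in s:
                if ch not in node:
                    return False
                node = node[ch]
            return '' in node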
You would have to think about how to do the lookup, since there are two methods that the set needs, __hash__ and __eq__.
The hash is a "loose part" that you can take away, but the __eq__ is not a loose part that you can save; you have to have two strings for the comparison.
If you only need negative confirmation (this item is not part of the set), you could fill a Set collection you implemented yourself with your strings, then "finalize" the set by removing all strings except those with collisions (those are kept around for eq tests), and promise not to add more objects to your Set. Now you have an exclusive test available: you can tell if an object is not in your Set. You can't be certain whether "obj in Set == True" is a false positive or not.
Edit: This is basically a bloom filter that was cleverly linked, but a bloom filter might use more than one hash per element which is really clever.
Edit2: This is my 3-minute bloom filter:
class BloomFilter(object):
    """
    Let's make a bloom filter
    http://en.wikipedia.org/wiki/Bloom_filter

    __contains__ has false positives, but never false negatives
    """
    def __init__(self, hashes=(hash, )):
        self.hashes = hashes
        self.data = set()

    def __contains__(self, obj):
        return all((h(obj) in self.data) for h in self.hashes)

    def add(self, obj):
        self.data.update(h(obj) for h in self.hashes)
As has been hinted already, if the answers offered here (most of which break down in the face of hash collisions) are not acceptable you would need to use a lossless representation of the strings.
Python's zlib module provides built-in string compression capabilities and could be used to pre-process the strings before you put them in your set. Note however that the strings would need to be quite long (which you hint that they are) and have minimal entropy in order to save much memory space. Other compression options might provide better space savings, and some Python-based implementations can be found here.
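For instance, a sketch of the original loop with zlib-compressed keys (lossless, so no false positives; it assumes the items are text and compress well enough to justify the CPU cost):

    import zlib

    done = set()
    while len(queue) > 0:
        item = queue.pop()
        key = zlib.compress(item.encode("utf-8"), 9)   # level 9: smallest output, slowest compression
        if key not in done:
            process(item)
            done.add(key)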