Dictionaries/hashmaps setting in Python

I've been studying Python for a few days now with the famous tutorial Learn Python The Hard Way. At a certain point, while discussing dictionaries in Exercise 39, there are a couple of little functions that read like this:
def hash_key(aMap, key):
    """Given a key this will create a number and then convert it to
    an index for the aMap's buckets."""
    return hash(key) % len(aMap)

def get_bucket(aMap, key):
    """Given a key, find the bucket where it would go."""
    bucket_id = hash_key(aMap, key)
    return aMap[bucket_id]
Now, what seems obscure to me is the way the bucket id is decided in the first function.
Assuming I wanted to find the bucket for the key "myCoolKey", Python would compute hash('myCoolKey') % len(aMap), which, with len(aMap) being 256, would give 139.
So, reading on, if I'm not mistaken, 'myCoolKey' is going to be assigned to aMap slot 139.
Now:
Is there a particular reason, which I can't see, for doing it this way?
What about collisions? Since the map has a limited number of slots, isn't it possible that two keys end up assigned to the same slot while other slots are still unused?

The purpose of a hash table is to provide immediate lookup times. The % (modulo) operation is used to ensure that you always get an index within the bounds of your hash table (so there are no IndexError issues). There is often additional hashing before this (as in your case) to try to ensure that the keys are distributed as evenly as possible, which reduces collisions.
Yes, it's possible for a general hash table. Hash tables can resolve this by 1) re-hashing the value to put it into another slot, 2) just putting it in the next available slot, or 3) putting a list of values into that slot, instead of just a single value. It appears that your code goes with option 3.
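For illustration, a minimal sketch of option 3 (separate chaining) in the same aMap style could look like this; new_map, set_item and get_item are names made up for the sketch (they reuse the question's hash_key and get_bucket), not the exercise's actual API. Each bucket is a list of (key, value) pairs, so two keys that land in the same slot simply share the bucket:
def new_map(num_buckets=256):
    """Create an aMap: a list of num_buckets empty bucket lists."""
    return [[] for _ in range(num_buckets)]

def set_item(aMap, key, value):
    """Put (key, value) into the key's bucket, replacing any existing entry."""
    bucket = get_bucket(aMap, key)
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)   # same key seen before: overwrite in place
            return
    bucket.append((key, value))        # collision or new key: just grow the bucket

def get_item(aMap, key, default=None):
    """Scan the (usually short) bucket; a collision only means a slightly longer scan."""
    for k, v in get_bucket(aMap, key):
        if k == key:
            return v
    return default

# aMap = new_map()
# set_item(aMap, 'myCoolKey', 42)
# get_item(aMap, 'myCoolKey')   # -> 42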

I think this link gives a good walkthrough of how the dictionary or hashmap works for exercise 39 - https://nicolasgkruk.wordpress.com/2014/07/11/understanding-making-your-own-dictionary-module-from-learning-python-the-hard-way-exercise-39/

Ideal data structure with fast lookup, fast update and easy comparison/sorting

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:
Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).
If it is, update the timestamp to "now"
If not, add it to the set with timestamp "now"
Periodically, I wish to remove and return a list of tuples that are older than a specific timestamp (I need to update various other elements when they 'expire'). The timestamp does not have to be anything specific (it can be a unix timestamp, a python datetime object, or some other easy-to-compare hash/string).
I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.
Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).
Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.
This is a pretty well-trodden space. You need two structures. First, something to tell you whether your key (the hash, in your case) is known to the collection; for that, a dict is a very good fit: we'll just map the hash to its timestamp so you can look up each item easily. Second, iterating over the items in order of timestamp is a task particularly suited to heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap as a tuple of (timestamp, hash).
Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by simply ignoring entries in the heap whose timestamp differs from the value in the dict.
So here's a place to start; you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:
import heapq

class ExpiringCache(object):
    def __init__(self):
        self._dict = {}   # key -> latest expiry
        self._heap = []   # (expiry, key) tuples, possibly stale

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            # Only delete if this heap entry is still the current one;
            # stale entries (the key was re-added later) are simply dropped.
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()
create a cache and add some items
>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)
re-add an item with an even later expiry
>>> xc.add('apples', 4)
collect everything "older" than two time units
>>> xc.collect(2)
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False
The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).
By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.
Figuring out the details (and implementing them) is left as an exercise for the reader ...
Caveats:
worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp
A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.
Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.
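A rough sketch of that dict plus doubly-linked-list idea might look like the following (the class and method names are chosen here for illustration; this is essentially the bookkeeping an LRU cache does, keyed by your hash):
import time

class _Node(object):
    __slots__ = ('key', 'timestamp', 'prev', 'next')
    def __init__(self, key, timestamp):
        self.key, self.timestamp = key, timestamp
        self.prev = self.next = None

class ExpiryList(object):
    """dict for O(1) lookup + doubly-linked list kept ordered oldest -> newest."""
    def __init__(self):
        self._nodes = {}                  # hash -> node
        self._head = self._tail = None    # head = oldest, tail = newest

    def _unlink(self, node):
        if node.prev: node.prev.next = node.next
        else:         self._head = node.next
        if node.next: node.next.prev = node.prev
        else:         self._tail = node.prev

    def _append(self, node):
        node.prev, node.next = self._tail, None
        if self._tail: self._tail.next = node
        else:          self._head = node
        self._tail = node

    def touch(self, key, now=None):
        """Insert key or refresh its timestamp; both are O(1)."""
        now = time.time() if now is None else now
        node = self._nodes.get(key)
        if node:
            self._unlink(node)            # splice out of its old position
            node.timestamp = now
        else:
            node = _Node(key, now)
            self._nodes[key] = node
        self._append(node)                # newest entries live at the tail

    def purge(self, cutoff):
        """Remove and return all keys whose timestamp is older than cutoff."""
        expired = []
        while self._head and self._head.timestamp < cutoff:
            node = self._head
            self._unlink(node)
            del self._nodes[node.key]
            expired.append(node.key)
        return expired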
Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.
This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.
Edit:
For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.
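For example, a small sketch of that OrderedDict variant could look like this (the class and method names are made up here); the delete-then-reinsert in touch() is what keeps insertion order equal to timestamp order:
import time
from collections import OrderedDict

class OrderedExpiry(object):
    def __init__(self):
        self._od = OrderedDict()          # key -> timestamp, oldest first

    def touch(self, key, now=None):
        now = time.time() if now is None else now
        if key in self._od:
            del self._od[key]             # remove so the reinsert moves it to the end
        self._od[key] = now

    def purge(self, cutoff):
        expired = []
        while self._od:
            key, ts = next(iter(self._od.items()))   # peek at the oldest entry
            if ts >= cutoff:
                break
            del self._od[key]
            expired.append((key, ts))
        return expired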
If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast)
http://en.wikipedia.org/wiki/Bloom_filter
and a python implementation: https://github.com/axiak/pybloomfiltermmap
EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.
A simple hashtable or dictionary will be O(1) for the check/update/set operations. You could simultaneously store the data in a simple time-ordered list for the purge operations. Keep a head and tail pointer, so that insert is also O(1), and removal is as simple as advancing the head until it reaches the target time and removing all the entries you find from the hash.
The overhead is one extra pointer per stored data item, and the code is dead simple:
insert(key, time, data):
    existing = MyDictionary.find(key)
    if existing:
        existing.mark()                  # the old list node is now stale
    node = MyNodeType(key, data, time)   # simple container holding args + 'next' pointer
    node.next = NULL
    MyDictionary.insert(key, node)
    if Tail is not NULL: Tail.next = node
    Tail = node
    if Head is NULL: Head = node

clean(olderThan):
    while Head is not NULL and Head.time < olderThan:
        n = Head.next
        if not Head.isMarked():
            MyDictionary.remove(Head.key)
        # else it was already overwritten by a newer node
        if Head == Tail: Tail = n
        Head = n
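A hedged Python rendering of the same idea (names chosen for this sketch): instead of marking list nodes, superseded queue entries are simply skipped at purge time, because the dict already holds a newer timestamp:
import time
from collections import deque

class TimedTable(object):
    def __init__(self):
        self._latest = {}        # key -> latest timestamp (the "MyDictionary" part)
        self._queue = deque()    # (timestamp, key) pairs, appended in time order

    def insert(self, key, now=None):
        now = time.time() if now is None else now
        self._latest[key] = now              # overwriting implicitly "marks" older queue entries stale
        self._queue.append((now, key))

    def clean(self, older_than):
        removed = []
        while self._queue and self._queue[0][0] < older_than:
            ts, key = self._queue.popleft()
            if self._latest.get(key) == ts:  # only the newest entry for a key counts
                del self._latest[key]
                removed.append(key)
            # else: the key was refreshed later, so this entry is stale; skip it
        return removed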

How to implement "autoincrement" on Google AppEngine

I have to label something in a "strong monotone increasing" fashion. Be it Invoice Numbers, shipping label numbers or the like.
A number MUST NOT BE used twice
Every number SHOULD BE used when exactly all smaller numbers have been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number space I have available is typically 100,000 numbers, and I need perhaps 1,000 a day.
I know this is a hard problem in distributed systems and that we are often much better off with GUIDs. But in this case, for legal reasons, I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?
If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
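For illustration, a minimal sketch of such a single-entity transactional counter with the old db API might look like this (the model and function names are invented for the sketch, not part of any library):
from google.appengine.ext import db

class InvoiceCounter(db.Model):
    current = db.IntegerProperty(default=0)

def _next_number():
    counter = InvoiceCounter.get_by_key_name('invoice')
    if counter is None:
        counter = InvoiceCounter(key_name='invoice')
    counter.current += 1
    counter.put()
    return counter.current

def next_invoice_number():
    # The transaction serializes all writers on the single counter entity,
    # which is what gives "no gaps, no repeats" - and what limits throughput.
    return db.run_in_transaction(_next_number)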
If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.
The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaeth.sequences import *
>>> init_sequence('invoce_number', start=1, end=0xffffffff)
>>> get_numbers('invoce_number', 2)
[1, 2]
The functionality is basically implemented like this:
def _get_numbers_helper(keys, needed):
    results = []
    for key in keys:
        seq = db.get(key)
        start = seq.current or seq.start
        end = seq.end
        avail = end - start
        consumed = needed
        if avail <= needed:
            seq.active = False
            consumed = avail
        seq.current = start + consumed
        seq.put()
        results += range(start, start + consumed)
        needed -= consumed
        if needed == 0:
            return results
    raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)

def get_numbers(needed):
    query = gaetkSequence.all(keys_only=True).filter('active = ', True)
    return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)
If you aren't too strict on the sequential requirement, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm:
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take its start and increment it.
If the shard's start is equal to (or, worse, greater than) its end, go to the master, take its value and add an amount n to it. Set the shard's start to the retrieved value plus one and its end to the retrieved value plus n.
This can scale quite well; however, the amount you can be off by is the number of shards multiplied by your n value. If you want your records to appear to go up, this will probably work, but if you want them to represent order it won't be accurate. It is also important to note that the latest values may have holes, so if you are scanning by ID for some reason you will have to mind the gaps.
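A rough sketch of that scheme with the old db API could look like the following; the model names, BATCH size and helper functions are assumptions made for illustration (this is not the github code mentioned below):
import random
from google.appengine.ext import db

NUM_SHARDS = 10
BATCH = 100    # how many IDs a shard reserves from the master at a time

class MasterCounter(db.Model):
    next_free = db.IntegerProperty(default=1)

class CounterShard(db.Model):
    start = db.IntegerProperty(default=0)   # next ID this shard will hand out
    end = db.IntegerProperty(default=0)     # first ID beyond this shard's range

def _take_from_shard(shard_name):
    shard = CounterShard.get_by_key_name(shard_name)
    if shard is None or shard.start >= shard.end:
        return None                          # range exhausted, needs a refill
    value = shard.start
    shard.start += 1
    shard.put()
    return value

def _refill_shard(shard_name):
    def reserve():
        # First transaction: take a block of BATCH IDs from the master.
        master = MasterCounter.get_by_key_name('master')
        if master is None:
            master = MasterCounter(key_name='master')
        block_start = master.next_free
        master.next_free += BATCH
        master.put()
        return block_start
    block_start = db.run_in_transaction(reserve)

    def assign():
        # Second transaction: hand the block to the shard (holes appear if this races).
        shard = CounterShard.get_by_key_name(shard_name)
        if shard is None:
            shard = CounterShard(key_name=shard_name)
        shard.start, shard.end = block_start, block_start + BATCH
        shard.put()
    db.run_in_transaction(assign)

def next_id(retries=3):
    shard_name = 'shard-%d' % random.randint(0, NUM_SHARDS - 1)
    for _ in range(retries):
        value = db.run_in_transaction(_take_from_shard, shard_name)
        if value is not None:
            return value
        _refill_shard(shard_name)
    raise RuntimeError('could not allocate an id')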
Edit
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.
Take a look at how the sharded counters are made. It may help you. Also, do you really need them to be numeric? If uniqueness is enough, just use the entity keys.
Alternatively, you could use allocate_ids(), as people have suggested, then create these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
I haven't put that into practice, but it seems like it should work in theory, most likely with the same limitations people have already mentioned.
Remember: sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-increment.
I implemented something very simplistic for my blog, which increments an IntegerProperty, iden rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
    max_entity = Post.gql("order by iden desc").get()
    if max_entity:
        return max_entity.iden
    return 1000  # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.
I'm thinking of using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), and later (using a Cron task) move the records from CloudSQL back to the Datastore.
The entities can also have a UUID, so we can map the Datastore entities to the CloudSQL rows and still have the sequential ID (for legal reasons).

C data structures

Is there a C data structure equivalent to the following Python structure?
data = {'X': 1, 'Y': 2}
Basically I want a structure where I can give it a pre-defined string and have it come out with an integer.
The data-structure you are looking for is called a "hash table" (or "hash map"). You can find the source code for one here.
A hash table is a mutable mapping of an integer (usually derived from a string) to another value, just like the dict from Python, which your sample code instantiates.
It's called a "hash table" because it performs a hash function on the string to return an integer result, and then directly uses that integer to point to the address of your desired data.
This system makes it extremely quick to access and change your information, even if you have tons of it. It also means that the data is unordered, because a hash function returns a uniformly distributed result and scatters your data unpredictably all over the map (in a perfect world).
Also note that if you only need a quick one-off lookup over two or three static strings, look at gperf, which generates a perfect hash function and simple code for that hash.
The above data structure is a dict type.
In C/C++ parlance, a hashmap would be the equivalent; Google for hashmap implementations.
There's nothing built into the language or standard library itself but, depending on your requirements, there are a number of ways to do it.
If the data set will remain relatively small, the easiest solution is to probably just have an array of structures along the lines of:
typedef struct {
    char *key;
    int val;
} tElement;
then use a sequential search to look them up. Have functions which insert keys, delete keys and look up keys so that, if you need to change it in future, the API itself won't change. Pseudo-code:
def init:
    create g.key[100] as string
    create g.val[100] as integer
    set g.size to 0

def add (key, val):
    if lookup(key) != not_found:
        return already_exists
    if g.size == 100:
        return no_space
    g.key[g.size] = key
    g.val[g.size] = val
    g.size = g.size + 1
    return okay

def del (key):
    pos = lookup(key)
    if pos == not_found:
        return no_such_key
    if pos < g.size - 1:
        g.key[pos] = g.key[g.size-1]
        g.val[pos] = g.val[g.size-1]
    g.size = g.size - 1

def lookup (key):
    for pos goes from 0 to g.size-1:
        if g.key[pos] == key:
            return pos
    return not_found
Insertion means ensuring it doesn't already exist then just tacking an element on to the end (you'll maintain a separate size variable for the structure). Deletion means finding the element then simply overwriting it with the last used element and decrementing the size variable.
Now, this isn't the most efficient method in the world, but you need to keep in mind that it usually only makes a difference as your dataset gets much larger. The difference between a binary tree or hash and a sequential search is irrelevant for, say, 20 entries. I've even used bubble sort for small data sets where a more efficient sort wasn't readily available, because it is massively quick to code up and the performance is irrelevant at that size.
Stepping up from there, you can remove the fixed upper size by using a linked list. The search is still relatively inefficient since you're doing it sequentially but the same caveats apply as for the array solution above. The cost of removing the upper bound is a slight penalty for insertion and deletion.
If you want a little more performance and a non-fixed upper limit, you can use a binary tree to store the elements. This gets rid of the sequential search when looking for keys and is suited to somewhat larger data sets.
If you don't know how big your data set will be getting, I would consider this the absolute minimum.
A hash is probably the next step up from there. This performs a function on the string to get a bucket number (usually treated as an array index of some sort). This is O(1) lookup but the aim is to have a hash function that only allocates one item per bucket, so that no further processing is required to get the value.
A degenerate case of "all items in the same bucket" is no different to an array or linked list.
For maximum performance, and assuming the keys are fixed and known in advance, you can actually create your own hashing function based on the keys themselves.
Knowing the keys up front, you have extra information that allows you to fully optimise a hashing function to generate the actual value so you don't even involve buckets - the value generated by the hashing function can be the desired value itself rather than a bucket to get the value from.
I had to put one of these together recently for converting textual months ("January", etc) in to month numbers. You can see the process here.
I mention this possibility because of your "pre-defined string" comment. If your keys are limited to "X" and "Y" (as in your example) and you're using a character set with contiguous {W,X,Y} characters (which even covers EBCDIC as well as ASCII though not necessarily every esoteric character set allowed by ISO), the simplest hashing function would be:
char *s = "X";
int val = *s - 'W';
Note that this doesn't work well if you feed it bad data. These are ideal for when the data is known to be restricted to certain values. The cost of checking data can often swamp the saving given by a pre-optimised hash function like this.
C doesn't have any collection classes. C++ has std::map.
You might try searching for C implementations of maps, e.g. http://elliottback.com/wp/hashmap-implementation-in-c/
A 'trie' or a 'hashmap' should do. The simplest implementation is an array of struct { char *s; int i; } pairs.
Check out 'trie' in 'include/nscript.h' and 'src/trie.c' here: http://github.com/nikki93/nscript . Change the 'trie_info' type to 'int'.
Try a Trie for strings, or a Tree of some sort for integer/pointer types (or anything that can be compared as "less than" or "greater than" another key). Wikipedia has reasonably good articles on both, and they can be implemented in C.

How to rewrite this Dictionary For Loop in Python?

I have a Dictionary of Classes where the classes hold attributes that are lists of strings.
I made this function to find out the max number of items in one of those lists for a particular person.
def find_max_var_amt(some_person):  # pass in a patient id number, get back their max number of variables for a type of variable
    max_vars = 0
    for key, value in patients[some_person].__dict__.items():
        challenger = len(value)
        if max_vars < challenger:
            max_vars = challenger
    return max_vars
What I want to do is rewrite it so that I do not have to use the .iteritems() function. This find_max_var_amt function works fine as is, but I am converting my code from using a dictionary to be a database using the dbm module, so typical dictionary functions will no longer work for me even though the syntax for assigning and accessing the key:value pairs will be the same. Thanks for your help!
Since dbm doesn't let you iterate over the values directly, you can iterate over the keys. To do so, you could modify your for loop to look like
for key in patients[some_person].__dict__:
    value = patients[some_person].__dict__[key]
    # then continue as before
I think a bigger issue, though, will be the fact that dbm only stores strings. So you won't be able to store the list directly in the database; you'll have to store a string representation of it. And that means that when you try to compute the length of the list, it won't be as simple as len(value); you'll have to develop some code to figure out the length of the list based on whatever string representation you use. It could just be as simple as len(the_string.split(',')), just be aware that you have to do it.
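For example, assuming the comma-joined representation suggested above (the file name and key used here are invented for the sketch), the round trip might look like:
import dbm   # on Python 2, the anydbm module provides the same open() interface

store = dbm.open('patients_db', 'c')
store['patient1:diagnoses'] = ','.join(['flu', 'asthma'])   # a list stored as one string

stored = store['patient1:diagnoses']
if isinstance(stored, bytes):                # dbm returns bytes on Python 3
    stored = stored.decode('utf-8')
items = stored.split(',') if stored else []
print(len(items))                            # -> 2
store.close()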
By the way, your existing function could be rewritten using a generator, like so:
def find_max_var_amt(some_person):
    return max(len(value) for value in patients[some_person].__dict__.itervalues())
and if you did it that way, the change to iterating over keys would look like
def find_max_var_amt(some_person):
    dct = patients[some_person].__dict__
    return max(len(dct[key]) for key in dct)
