delta-dictionary/dictionary with revision awareness in python? - python

I am looking to create a dictionary with 'roll-back' capabilities in Python. The dictionary would start with a revision number of 0, and the revision would be bumped only by an explicit method call. I do not need to delete keys, only add and update key/value pairs, and then roll back. I will never need to 'roll forward': when rolling the dictionary back, all the newer revisions can be discarded, and I can start bumping the revision again from there. Thus I want behaviour like:
>>> rr = rev_dictionary()
>>> rr.rev
0
>>> rr["a"] = 17
>>> rr[('b',23)] = 'foo'
>>> rr["a"]
17
>>> rr.rev
0
>>> rr.roll_rev()
>>> rr.rev
1
>>> rr["a"]
17
>>> rr["a"] = 0
>>> rr["a"]
0
>>> rr[('b',23)]
'foo'
>>> rr.roll_to(0)
>>> rr.rev
0
>>> rr["a"]
17
>>> rr.roll_to(1)
Exception ...
Just to be clear, the state associated with a revision is the state of the dictionary just prior to the roll_rev() method call. Thus I can alter the value associated with a key several times 'within' a revision, and only the last one is remembered.
I would like a fairly memory-efficient implementation of this: the memory usage should be proportional to the deltas. Thus simply having a list of copies of the dictionary will not scale for my problem. One should assume the keys are in the tens of thousands, and the revisions are in the hundreds of thousands.
We can assume the values are immutable, but need not be numeric. For the case where the values are e.g. integers, there is a fairly straightforward implementation (have a list of dictionaries of the numerical delta from revision to revision). I am not sure how to turn this into the general form. Maybe bootstrap the integer version and add on an array of values?
All help appreciated.

Have just one dictionary, mapping from the key to a list of (revision_number, actual_value) tuples. Current value is the_dict[akey][-1][1]. Rollback merely involves popping the appropriate entries off the end of each list.
Update: examples of rollback
key1 -> [(10, 'v1-10'), (20, 'v1-20')]
Scenario 1: current revision is 30, rollback to 25: nothing happens
Scenario 2: current 30, back to 15: pop last entry
Scenario 3: current 30, back to 5: pop both entries
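A minimal sketch of this single-dictionary approach, written for Python 3 and using the question's names (rev, roll_rev, roll_to). The class name SimpleRevDict and the choice to raise ValueError for an invalid target are assumptions made for illustration, not part of the original answer. Keys whose history becomes empty after a rollback are dropped so that they read as missing again.

```python
class SimpleRevDict:
    """Sketch: maps key -> list of (revision, value) pairs, newest last."""

    def __init__(self):
        self.rev = 0
        self._history = {}

    def __setitem__(self, key, value):
        entries = self._history.setdefault(key, [])
        if entries and entries[-1][0] == self.rev:
            # Several writes within one revision: keep only the last one.
            entries[-1] = (self.rev, value)
        else:
            entries.append((self.rev, value))

    def __getitem__(self, key):
        entries = self._history.get(key)
        if not entries:
            raise KeyError(key)
        return entries[-1][1]

    def roll_rev(self):
        self.rev += 1

    def roll_to(self, target):
        # Assumption: rolling forward is an error, as in the question.
        if not 0 <= target <= self.rev:
            raise ValueError("cannot roll to revision %r" % (target,))
        for key in list(self._history):
            entries = self._history[key]
            while entries and entries[-1][0] > target:
                entries.pop()
            if not entries:
                # The key did not exist at the target revision.
                del self._history[key]
        self.rev = target
```

This version examines every key on rollback, so rollback time is proportional to the number of keys rather than to the number of changes being undone.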
Update 2: faster rollback (with trade-offs)
I think your concern about popping every list is better expressed as "needs to examine every list to see if it needs popping". With a fancier data structure (more memory, more time to maintain the fancy bits in add and update operations) you can reduce the time to roll back.
Add an array (indexed by revision number) whose values are lists of the dictionary values that were changed in that revision.
# Original rollback code:
for rlist in the_dict.itervalues():
    # Stop once the list is empty so rlist[-1] cannot raise IndexError.
    while rlist and rlist[-1][0] > target_revno:
        rlist.pop()
# New rollback code
for revno in xrange(current_revno, target_revno, -1):
    for rlist in delta_index[revno]:
        assert rlist[-1][0] == revno
        del rlist[-1] # faster than rlist.pop()
del delta_index[target_revno+1:]
Update 3: full code for fancier method
import collections
class RevDict(collections.MutableMapping):
    def __init__(self):
        self.current_revno = 0
        self.dict = {}
        self.delta_index = [[]]
    def __setitem__(self, key, value):
        if key in self.dict:
            rlist = self.dict[key]
            last_revno = rlist[-1][0]
            rtup = (self.current_revno, value)
            if last_revno == self.current_revno:
                rlist[-1] = rtup
                # delta_index already has an entry for this rlist
            else:
                rlist.append(rtup)
                self.delta_index[self.current_revno].append(rlist)
        else:
            rlist = [(self.current_revno, value)]
            self.dict[key] = rlist
            self.delta_index[self.current_revno].append(rlist)
    def __getitem__(self, key):
        if not key in self.dict:
            raise KeyError(key)
        return self.dict[key][-1][1]
    def new_revision(self):
        self.current_revno += 1
        self.delta_index.append([])
    def roll_back(self, target_revno):
        assert 0 <= target_revno < self.current_revno
        for revno in xrange(self.current_revno, target_revno, -1):
            for rlist in self.delta_index[revno]:
                assert rlist[-1][0] == revno
                del rlist[-1]
        del self.delta_index[target_revno+1:]
        self.current_revno = target_revno
    def __delitem__(self, key):
        raise TypeError("RevDict doesn't do del")
    def keys(self):
        return self.dict.keys()
    def __contains__(self, key):
        return key in self.dict
    def iteritems(self):
        for key, rlist in self.dict.iteritems():
            yield key, rlist[-1][1]
    def __len__(self):
        return len(self.dict)
    def __iter__(self):
        return self.dict.iterkeys()

The deluxe solution would be to use B+Trees with copy-on-write. I used a variation on B+Trees to implement my blist data type (which can be used to very efficiently create revisions of lists, exactly analogous to your problem).
The general idea is to store the data in a balanced tree. When you create a new revision, you copy only the root node. If you need to modify a node shared with an older revision, you copy the node and modify the copy instead. That way, the old tree is still completely intact, but you only need memory for the changes (technically, O(k * log n) where k is the number of changes and n is the total number of items).
It's non-trivial to implement, though.

Related

Modifying a nested dictionary element by a reference, generated from a list

The code:
def main():
    nested_dict = {'A': {'A_1': 'value_1', 'B_1': 'value_2'},
                   'B': 'value_3'}
    access_pattern = ['A', 'B_1']
    new_value = 'value_4'
    nested_dict[access_pattern] = new_value
    return nested_dict
Background information:
As can be seen, I have a variable called nested_dict - in reality, it contains hundreds of elements with a different number of sub-elements each (I'm simplifying it for the purpose of the example).
I need to modify the value of some elements inside this dictionary, but it is not predetermined which elements exactly. The specific "path" to the elements that need be modified, will be provided by the access_pattern variable, which will be different every time.
The problem:
I know how to reference the value of the dictionary with this function functools.reduce(dict.get, access_pattern, nested_dict). However, I do not know how to universally modify (regardless of the contained variable type) the value of the access_pattern in the dictionary.
The provided code produces a TypeError that I do not know how to overcome elegantly. I did think of one solution, shown under 'Possible solutions' below.
Possible solutions:
if len(access_pattern) == 1:
    nested_dict[access_pattern[0]] = new_value
elif len(access_pattern) == 2:
    nested_dict[access_pattern[0]][access_pattern[1]] = new_value
...
And so on for every possible len().
This just seems VERY inelegant and painful. Is there a more practical way to achieve this?
Make use of recursion:
def edit_from_access_pattern(access_pattern, nested_dict, new_value):
    if len(access_pattern) == 1:
        nested_dict[access_pattern[0]] = new_value
    else:
        return edit_from_access_pattern(access_pattern[1:], nested_dict[access_pattern[0]], new_value)
You can use recursion:
def set_value(container, key, value):
    if len(key) == 1:
        container[key[0]] = value
    else:
        set_value(container[key[0]], key[1:], value)
but an explicit loop is probably going to be more efficient:
def set_value(container, key, value):
    for i in range(len(key)-1):
        container = container[key[i]]
    container[key[-1]] = value
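Since the question already uses functools.reduce for reading, the same idea can be sketched for writing: reduce over every path element except the last to reach the parent container, then assign into it. The helper name set_by_path is hypothetical, and operator.getitem is used instead of dict.get so that a missing intermediate key raises KeyError rather than silently yielding None.

```python
from functools import reduce
from operator import getitem


def set_by_path(container, path, value):
    """Set container[path[0]][path[1]]...[path[-1]] = value."""
    # Split the path into the keys leading to the parent and the final key.
    # An empty path raises ValueError here.
    *parents, last = path
    parent = reduce(getitem, parents, container)
    parent[last] = value


nested_dict = {'A': {'A_1': 'value_1', 'B_1': 'value_2'},
               'B': 'value_3'}
set_by_path(nested_dict, ['A', 'B_1'], 'value_4')
print(nested_dict)
```

Because getitem works on any subscriptable object, this should also handle paths that pass through lists, provided the path elements are valid indices.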

Data structure with O(1) random removal and adds for shuffling generator order

I need a data structure that lets you add elements and remove them randomly in O(1) time.
The reason for this is that I need to shuffle data from a generator, but I can't load everything into memory at the same time due to size.
This is an example of usage, which automatically shuffles the order of the results generated by a generator expression without loading everything into memory:
def generator_shuffler(generator):
    a = magical_data_structure_described_above
    for i in generator:
        a.add(i)
        if len(a) > 10: yield a.poprandom()
Initially I tried a python set(), however from here: Set.pop() isn't random?, it seems that set() doesn't actually remove the items in an arbitrary order. How would I implement the data structure with the above usage?
If you want to pop randomly, why don't you just use a list and implement pop by swapping the last element with some randomly-selected element and then dropping the new last element? That won't preserve the order of the remaining elements in the data structure, but "pop randomly" and "shuffle" suggest that you don't really care.
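The swap-and-pop idea can be sketched as follows. The names RandomPool and buffer_size are assumptions made for illustration; the add and poprandom methods follow the usage in the question. Draining the remaining items once the generator is exhausted is an addition that the question's snippet does not include.

```python
import random


class RandomPool:
    """List-backed pool with O(1) add and O(1) random removal."""

    def __init__(self):
        self._items = []

    def __len__(self):
        return len(self._items)

    def add(self, item):
        self._items.append(item)

    def poprandom(self):
        # Swap a randomly chosen element into the last slot, then pop it.
        # randrange raises ValueError if the pool is empty.
        i = random.randrange(len(self._items))
        self._items[i], self._items[-1] = self._items[-1], self._items[i]
        return self._items.pop()


def generator_shuffler(generator, buffer_size=10):
    pool = RandomPool()
    for item in generator:
        pool.add(item)
        if len(pool) > buffer_size:
            yield pool.poprandom()
    # Emit whatever is still buffered once the input is exhausted.
    while pool:
        yield pool.poprandom()
```

Note that this only shuffles locally: an item cannot be emitted before it has been read from the generator, so a larger buffer_size gives a more thorough shuffle at the cost of memory.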
Finding and removing a random element in a collection is generally O(k) when using pop. However, you can modify the approach so that the list is shuffled when checking its length; add and pop then remain O(1), although each len() call pays O(n) for the shuffle:
import random
class RandomStack:
    def __init__(self, _d = None):
        self.stack = _d if _d else []
    def __len__(self):
        random.shuffle(self.stack)
        return len(self.stack)
    def add(self, _val):
        self.stack.append(_val)
    def poprandom(self):
        return self.stack.pop()
a = RandomStack()
for i in range(16):
    a.add(i)
    if len(a) > 10:
        val = a.poprandom()
        print(val)
Output:
2
4
9
0
6
12

Test if all values of a dictionary are equal - when value is unknown

I have 2 dictionaries:
the values in each dictionary should all be equal.
BUT I don't know what that number will be...
dict1 = {'xx':A, 'yy':A, 'zz':A}
dict2 = {'xx':B, 'yy':B, 'zz':B}
N.B. A does not equal B
N.B. Both A and B are actually strings of decimal numbers (e.g. '-2.304998') as they have been extracted from a text file
I want to create another dictionary - that effectively summarises this data - but only if all the values in each dictionary are the same.
i.e.
summary = {}
if dict1['xx'] == dict1['yy'] == dict1['zz']:
    summary['s'] = dict1['xx']
if dict2['xx'] == dict2['yy'] == dict2['zz']:
    summary['hf'] = dict2['xx']
Is there a neat way of doing this in one line?
I know it is possible to create a dictionary using comprehensions
summary = {k:v for (k,v) in zip(iterable1, iterable2)}
but am struggling with both the underlying for loop and the if statement...
Some advice would be appreciated.
I have seen this question, but the answers all seem to rely on already knowing the value being tested (i.e. are all the entries in the dictionary equal to a known number) - unless I am missing something.
sets are a solid way to go here, but just for code golf purposes here's a version that can handle non-hashable dict values:
expected_value = next(iter(dict1.values())) # check for an empty dictionary first if that's possible
all_equal = all(value == expected_value for value in dict1.values())
all terminates early on a mismatch, but the set constructor is well enough optimized that I wouldn't say that matters without profiling on real test data. Handling non-hashable values is the main advantage to this version.
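As a sketch of how this check could feed the summary dictionary from the question, the helper below returns the shared value or a default. The name common_value and the use of None as the "not all equal" marker are assumptions; None is safe here only because the question's values are strings. The sample values are made up for illustration.

```python
def common_value(d, default=None):
    """Return the value shared by every key in d, or default otherwise."""
    values = iter(d.values())
    try:
        first = next(values)
    except StopIteration:
        return default  # empty dictionary
    # all() stops at the first mismatch; only equality is required,
    # so unhashable values are fine.
    return first if all(v == first for v in values) else default


# Illustrative data in the shape described in the question.
dict1 = {'xx': '-2.304998', 'yy': '-2.304998', 'zz': '-2.304998'}
dict2 = {'xx': '0.5', 'yy': '0.5', 'zz': '0.7'}

summary = {}
for name, d in (('s', dict1), ('hf', dict2)):
    value = common_value(d)
    if value is not None:
        summary[name] = value

print(summary)
```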
One way to do this would be to leverage set. You know a set of an iterable has a length of 1 if there is only one value in it:
if len(set(dct.values())) == 1:
    summary[k] = next(iter(dct.values()))
This of course, only works if the values of your dictionary are hashable.
While we can use set for this, doing so has a number of inefficiencies when the input is large. It can take memory proportional to the size of the input, and it always scans the whole input, even when two distinct values are found early. Also, the input has to be hashable.
For 3-key dicts, this doesn't matter much, but for bigger ones, instead of using set, we can use itertools.groupby and see if it produces multiple groups:
import itertools
groups = itertools.groupby(dict1.values())
# Consume one group if there is one, then see if there's another.
next(groups, None)
if next(groups, None) is None:
    # All values are equal.
    do_something()
else:
    # Unequal values detected.
    do_something_else()
Except for readability, I don't care for all the answers involving set or .values. All of these are always O(N) in time and memory. In practice it can be faster, although it depends on the distribution of values.
Also, because set employs hashing, you may incur a hefty constant multiplier on your time cost. And your values have to be hashable, when a test for equality is all that's needed.
It is theoretically better to take the first value from the dictionary and search the remaining values for the first one that is not equal to it.
set might still be quicker than the solution below because its workings may reduce to C implementations.
def all_values_equal(d):
    if len(d)<=1: return True # Treat len0 len1 as all equal
    i = d.itervalues()
    firstval = i.next()
    try:
        # Incrementally generate all values not equal to firstval
        # .next raises StopIteration if empty.
        (j for j in i if j!=firstval).next()
        return False
    except StopIteration:
        return True
print all_values_equal({1:0, 2:1, 3:0, 4:0, 5:0}) # False
print all_values_equal({1:0, 2:0, 3:0, 4:0, 5:0}) # True
print all_values_equal({1:"A", 2:"B", 3:"A", 4:"A", 5:"A"}) # False
print all_values_equal({1:"A", 2:"A", 3:"A", 4:"A", 5:"A"}) # True
In the above:
(j for j in i if j!=firstval)
is equivalent to:
def gen_neq(i, val):
    """
    Give me the values of iterator i that are not equal to val
    """
    for j in i:
        if j!=val:
            yield j
I arrived at this solution by combining it with another solution I found:
user_min = {'test':1,'test2':2}
all(value == list(user_min.values())[0] for value in user_min.values())
>>> user_min = {'test':1,'test2':2}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
False
>>> user_min = {'test':2,'test2':2}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
True
>>> user_min = {'test':'A','test2':'B'}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
False
>>> user_min = {'test':'A','test2':'A'}
>>> all(value == list(user_min.values())[0] for value in user_min.values())
True
Good for a small dictionary, but I'm not sure about a large dictionary, since we get all the values to choose the first one

Remove all elements that satisfy a predicate from a set

Given a mutable set of objects,
A = {1, 2, 3, 4, 5, 6}
I can construct a new set containing only those objects that don't satisfy a predicate ...
B = set(x for x in A if not (x % 2 == 0))
... but how do I modify A in place to contain only those objects? If possible, do this in linear time, without constructing O(n)-sized scratch objects, and without removing anything from A, even temporarily, that doesn't satisfy the predicate.
(Integers are used here only to simplify the example. In the actual code they are Future objects and I'm trying to pull out those that have already completed, which is expected to be a small fraction of the total.)
Note that it is not, in general, safe in Python to mutate an object that you are iterating over. I'm not sure of the precise rules for sets (the documentation doesn't make any guarantee either way).
I only need an answer for 3.4+, but will take more general answers.
(Not actually O(1) due to implementation details, but I'm loath to delete it as it's quite clean.)
Use symmetric_difference_update.
>>> A = {1,2,3,4,5,6}
>>> A.symmetric_difference_update(x for x in A if not (x % 2))
>>> A
{1, 3, 5}
With a horrible time complexity (quadratic), but in O(1) space:
>>> A = {1,2,3,4,5,6}
>>> modified = True
>>> while modified:
...     modified = False
...     for x in A:
...         if not x%2:
...             A.remove(x)
...             modified = True
...             break
...
>>> A
{1, 3, 5}
On the very specific use case you showed, there is a way to do this in O(1) space, but it doesn't generalize very well to sets containing anything other than int objects:
A = {1, 2, 3, 4, 5, 6}
for i in range(min(A), max(A) + 1):
    if i % 2 == 0:  # the question's predicate: drop the even numbers
        A.discard(i)
It also wastes time since it will check numbers that aren't even in the set. For anything other than int objects, I can't yet think of a way to do this without creating an intermediate set or container of some sort.
For a more general solution, it would be better to simply initially construct your set using the predicate (if you don't need to use the set for anything else first). Something like this:
def items():
    # maybe this is a file or a stream or something,
    # where ever your initial values are coming from.
    for thing in source:
        yield thing
def predicate(item):
    return bool(item)
A = set(item for item in items() if predicate(item))
To keep memory usage constant, this is the only thing that comes to mind:
def filter_Set(predicate,origen:set) -> set:
    resul = set()
    while origen:
        elem = origen.pop()
        if predicate( elem ):
            resul.add( elem )
    return resul
def filter_Set_inplace(predicate,origen:set):
    resul = set()
    while origen:
        elem = origen.pop()
        if predicate( elem ):
            resul.add( elem )
    while resul:
        origen.add(resul.pop())
With these functions I move elements from one set to the other, keeping only those that satisfy the predicate.
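Another option, sketched here under the question's stated assumption that only a small fraction of the elements match, is to collect the matching elements first and remove them afterwards. This avoids mutating the set while iterating over it and never removes a non-matching element, but it is not strictly O(1) in extra space: the scratch list grows with the number of matches. The function name remove_matching is hypothetical.

```python
def remove_matching(items, predicate):
    """Remove, in place, every element of the set that satisfies predicate.

    Returns the removed elements as a list.
    """
    # First pass: read-only iteration, so the set is not mutated mid-loop.
    matched = [x for x in items if predicate(x)]
    # Second pass: remove only the matching elements.
    items.difference_update(matched)
    return matched


A = {1, 2, 3, 4, 5, 6}
removed = remove_matching(A, lambda x: x % 2 == 0)
print(A, removed)
```

Returning the removed elements fits the stated use case of pulling completed Future objects out of the set for further processing.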

QUICKEST way to find a key in a dictionary

I have a dictionary with over 11 million keys (and each value is a list). Each key is a unique integer.
e.g.
Dict1 = {11:"a",12:"b",22:"c",56:"d"}
Then, separately, I have a list of ranges, e.g.
[10-20,30-40,50-60]
And I want to say, for each range in my list of ranges, go through the dictionary and return the value, if the key is within the range.
So it would return:
10-20: "a","b"
50-60: "d"
The actual code that I used is:
for each_key in sorted(dictionary):
    if each_key in range(start,end):
        print str(dictionary[each_key])
The problem is that this technique is prohibitively slow because it goes through all 11 million keys and checks whether each one is within the range.
Is there a way that I can say "skip through all of the dictionary keys until one is found that is higher than the start number" and then "stop once the key is higher than the end number"? Basically, some way that zooms in on the portion of the dictionary within a certain range very quickly?
Thanks
Just use Python's EAFP principle. It's Easier to Ask Forgiveness than Permission.
Assume that all keys are valid, and catch the error if they're not:
for key in xrange(start, end):
    try:
        print str(dictionary[key])
    except KeyError:
        pass
This will just try to get each number as a key, and if there's a KeyError from a non existent key then it will move on to the next iteration.
Note that if you expect a lot of the keys will be missing, it might be faster to test first:
for key in xrange(start, end):
    if key in dictionary:
        print str(dictionary[key])
Note that xrange is just a slightly different function to range. It will produce the values one by one instead of creating the whole list in advance. It's useful to use in for loops and has no drawbacks in this case.
My thought for this problem is to find the correct keys first. Your solution takes too much time because it uses an O(n) algorithm to find the matching keys. If we implement binary search over the sorted keys, the complexity of each lookup is reduced to O(log(n)), which helps a lot.
The following is my sample code. It works for the example, but I cannot promise it is free of small bugs. Take the idea from it and implement your own.
def binarySearch(alist, target):
    left = 0
    right = len(alist) -1
    if target>alist[-1]:
        return len(alist)
    while left < right:
        m = (left + right) // 2  # integer division
        if alist[m] == target:
            return m
        if alist[m] < target:
            left = m+1
        else:
            right = m
    return left
def work(dictionary, start, end):
    keys = sorted(dictionary.keys())
    start_pos = binarySearch(keys, start)
    end_pos = binarySearch(keys, end)
    print [dictionary[keys[pos]] for pos in range(start_pos,end_pos)]
dictionary = {11:"a",12:"b",22:"c",56:"d"}
work(dictionary, 10, 20)
work(dictionary, 20, 40)
work(dictionary, 10, 60)
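The same binary-search idea can be sketched with the standard library's bisect module in place of a hand-written search. This is a Python 3 sketch; the function name values_in_range is hypothetical, and the range is treated as half-open (start included, end excluded) to match range(start, end) in the question. The keys are sorted once, outside the function, so that each range query costs O(log n) plus the number of matches.

```python
from bisect import bisect_left


def values_in_range(dictionary, sorted_keys, start, end):
    """Return the values whose keys k satisfy start <= k < end."""
    lo = bisect_left(sorted_keys, start)  # first position with key >= start
    hi = bisect_left(sorted_keys, end)    # first position with key >= end
    return [dictionary[k] for k in sorted_keys[lo:hi]]


dictionary = {11: "a", 12: "b", 22: "c", 56: "d"}
sorted_keys = sorted(dictionary)  # sort once, reuse for every range

for start, end in [(10, 20), (30, 40), (50, 60)]:
    print(start, end, values_in_range(dictionary, sorted_keys, start, end))
```

If the end of each range should be inclusive instead, bisect_right can be used for the upper bound.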
This solution ( using OrderedDict and filter ) can help you a bit.
from collections import OrderedDict
d = {2:3, 10:89, 4:5, 23:0}
od = OrderedDict(sorted(d.items()))
lst=["1-10","11-20","21-30"]
lower_lst=map(int,[i.split("-")[0] for i in lst])
upper_lst=map(int,[i.split("-")[1] for i in lst])
for low,up in zip(lower_lst,upper_lst):
    print "In range {0}-{1}".format(low,up),filter(lambda a:low <= a[0] <= up,od.iteritems())
