Python: modifying single dictionary item containing an array view modifies all items - python

I have two dictionaries with same keys. Each item is an ndarray.
from numpy import zeros, random
from collections import namedtuple
PhaseAmplitude = namedtuple('PhaseAmplitude','phase amplitude')
dict_keys = {'K1','K2', 'K3'}
J1 = dict.fromkeys(dict_keys, zeros((2,2,2,2)))
U1 = dict.fromkeys(dict_keys, PhaseAmplitude(phase = zeros((2,2)),
amplitude = zeros((2,2))))
for iFld in dict_keys:
U1[iFld] = U1[iFld]._replace(phase = random.random_sample((2,2)),
amplitude = random.random_sample((2,2)))
I want to modify each item in the the first dictionary using the corresponding item in the second one:
for iFld in dict_keys:
J1[iFld][0,0,:,:] += U1[iFld].phase
J1[iFld][0,1,:,:] += U1[iFld].amplitude
I expect to get that J1[iFld][0,0,:,:] = U1[iFld].phase and J1[iFld][0,1,:,:] = U1[iFld].amplitude but I get J1[iFld] being the same for all iFld and equal to the sum over all iFld keys of U1 (keeping track of the phase and amplitude fields of U1 of course).
To me this looks like a bug but I've been using Python only for a month or so (switching from matlab) so I am not sure.
Question: Is this expected behavior or a bug? What should I change in my code in order to get the behavior I want?
Note: I chose the number of dimensions of dict_keys, J1 and U1 to reflect my particular situation.

This isn't a bug, though it is a pretty common gotcha that shows up in a few different situations. dict.fromkeys creates a new dictionary where all of the values are the same object. This works great for immutable types (e.g. int, str), but for mutable types, you can run into problems.
e.g.:
>>> import numpy as np
>>> d = dict.fromkeys('ab', np.zeros(2))
>>> d
{'a': array([ 0., 0.]), 'b': array([ 0., 0.])}
>>> d['a'][1] = 1
>>> d
{'a': array([ 0., 1.]), 'b': array([ 0., 1.])}
and this is because:
>>> d['a'] is d['b']
True
Use a dict comprehension to build the dictionary in this case:
J1 = {k: zeros((2,2,2,2)) for k in dict_keys}
(or, pre-python2.7):
J1 = dict((k, zeros((2,2,2,2))) for k in dict_keys)

Related

What is the fastest way to return a specific list within a dictionary within a dictionary?

I have a list within a dictionary within a dictionary. The data set is very large. How can I most quickly return the list nested in the two dictionaries if I am given a List that is specific to the key, dict pairs?
{"Dict1":{"Dict2": ['UNIOUE LIST'] }}
Is there an alternate data structure to use for this for efficiency?
I do not believe a more efficient data structure exists in Python. Simply retrieving the list using the regular indexing operator should be a very fast operation, even if both levels of dictionaries are very large.
nestedDict = {"Dict1":{"Dict2": ['UNIOUE LIST'] }}
uniqueList = nestedDict["Dict1"]["Dict2"]
My only thought for improving performance was to try flattening the data structure into a single dictionary with tuples for keys. This would take more memory than the nested approach since the keys in the top-level dictionary will be replicated for every entry in the second-level dictionaries, but it will only compute the hash function once for every lookup. But this approach is actually slower than the nested approach in practice:
nestedDict = {i: {j: ['UNIQUE LIST'] for j in range(1000)} for i in range(1000)}
flatDict = {(i, j): ['UNIQUE LIST'] for i in range(1000) for j in range(1000)}
import random
def accessNested():
i = random.randrange(1000)
j = random.randrange(1000)
return nestedDict[i][j]
def accessFlat():
i = random.randrange(1000)
j = random.randrange(1000)
return nestedDict[(i,j)]
import timeit
print(timeit.timeit(accessNested))
print(timeit.timeit(accessFlat))
Output:
2.0440238649971434
2.302736301004188
The fastest way to access the list within the nested dictionary is,
d = {"Dict1":{"Dict2": ['UNIOUE LIST'] }}
print(d["Dict1"]["Dict2"])
Output :
['UNIOUE LIST']
But if you perform iteration on the list that is in nested dictionary. so you can use the following code as example,
d = {"a":{"b": ['1','2','3','4'] }}
for i in d["a"]["b"]:
print(i)
Output :
1
2
3
4
If I understand correctly, you want to access a nested dictionary structure if...
if I am given a List that is specific to the key
So, here you have a sample dictionary and key that you want to access
d = {'a': {'a': 0, 'b': 1},
'b': {'a': {'a': 2}, 'b': 3}}
key = ('b', 'a', 'a')
The lazy approach
This is fast if you know Python dictionaries already, no need to learn other stuff!
>>> value = d
>>> for level in key:
... value = temp[level]
>>> value
2
NestedDict from the ndicts package
If you pip install ndicts then you get the same "lazy approach" implementation in a nicer interface.
>>> from ndicts import NestedDict
>>> nd = NestedDict(d)
>>> nd[key]
2
>>> nd["b", "a", "a"]
2
This option is fast because you can't really write less code than nd[key] to get what you want.
Pandas dataframes
This is the solution that will give you performance. Lookups in dataframes should be quick, especially if you have a sorted index.
In this case we have hierarchical data with multiple levels, so I will create a MultiIndex first. I will use the NestedDict for ease, but anything else to flatten the dictionary will do.
>>> keys = list(nd.keys())
>>> values = list(nd.values())
>>> from pandas import DataFrame, MultiIndex
>>> index = MultiIndex.from_tuples(keys)
>>> df = DataFrame(values, index=index, columns="Data").sort_index()
>>> df
Data
a a NaN 0
b NaN 1
b a a 2
b NaN 3
Use the loc method to get a row.
>>> nd.loc[key]
Data 2
Name: (b, a, a), dtype: int64

Binning variable length lists in python

I have a dictionary d with 100 keys where the values are variable length lists, e.g.
In[165]: d.values()[0]
Out[165]:
[0.0432,
0.0336,
0.0345,
0.044,
0.0394,
0.0555]
In[166]: d.values()[1]
Out[166]:
[0.0236,
0.0333,
0.0571]
Here's what I'd like to do: for every list in d.values(), I'd like to organize the values into 10 bins (where a value gets tossed into a bin if it satisfies the criteria, e.g. is between 0.03 and 0.04, 0.04 and 0.05, etc.).
What'd I'd like to end up with is something that looks exactly like d, but instead of d.values()[0] being a list of numbers, I'd like it to be a list of lists, like so:
In[167]: d.values()[0]
Out[167]:
[[0.0336,0.0345,0.0394],
[0.0432,0.044],
[0.0555]]
Each key would still be associated with the same values, but they'd be structured into the 10 bins.
I've been going crazy with nested for loops and if/elses, etc. What is the best way to go about this?
EDIT: Hi, all. Just wanted to let you know I resolved my issues. I used a variation of #Brent Washburne's answer. Thanks for the help!
def bin(values):
bins = [[] for _ in range(10)] # create ten bins
for n in values:
b = int(n * 100) # normalize the value to the bin number
bins[b].append(n) # add the number to the bin
return bins
d = [0.0432,
0.0336,
0.0345,
0.044,
0.0394,
0.0555]
print bin(d)
The result is:
[[], [], [], [0.0336, 0.0345, 0.0394], [0.0432, 0.044], [0.0555], [], [], [], []]
You can use itertools.groupby() function by passing a proper key-function in order to categorize your items. And in this case you can use floor(x*100) as your key-function:
>>> from math import floor
>>> from itertools import groupby
>>> lst = [0.0432, 0.0336, 0.0345, 0.044, 0.0394, 0.0555]
>>> [list(g) for _,g in groupby(sorted(lst), key=lambda x: floor(x*100))]
[[0.0336, 0.0345, 0.0394], [0.0432, 0.044], [0.0555]]
And for applying this on your values you can use a dictionary comprehension:
def categorizer(val):
return [list(g) for _,g in groupby(sorted(lst), key=lambda x: floor(x*100))]
new_dict = {k:categorizer(v) for k,v in old_dict.items()}
As another approach which is more optimized in term of execution speed you can use a dictionary for categorizing:
>>> def categorizer(val, d={}):
... for i in val:
... d.setdefault(floor(i*100),[]).append(i)
... return d.values()
Why not make the values a set of dictionaries where the ke is the bin indicator and the values a list of those items that are in that bin?
yoe would define
newd = [{bin1:[], bin2:[], ...binn:[]}, ... ]
newd[0][bin1] = (list of items in d[0] that belong in bin1)
You now have a list of dictionaries each of which has the appropriate bin listings.
newd[0] is now the equivalent of a dictionary built from d[0] each key (which I call bin1, bin2, ... binn) contains a list of the values that are appropriate for that bin. Thus we have `newd[0][bin1], newd[0][bin2, ... new[k][lastbin]
Dictionary creation allows you to create the appropriate key and value list as you go along. If there is not yet a particular bin key, create the empty list and then the append of the value to the list will succeed.
Now when you want to identify elements of a bin, you can loop through the list of newd and extract whichever bin that you want. This allows you to have bins with no entry without having to create empty lists. If a bin key is not in newd, the retrieve is set to return an empty list as a default (to avoid the dictionary invalid key exception).

boolean retrieval, indexing phase

I am doing a boolean retrieval project, the first phase is indexing. I am trying to build an inverted index now. Say I got a sorted list like following: how can I merge the items
list = [('a',1),('a',2),('a',3),('b',1),('b',2),('b',3)...]
such that I can get a dictionary like the following and it remains sorted:
dict = {'a':[1,2,3], 'b':[1,2,3]...}, thx a lot
You can do it like this:
>>> import collections
>>> mylist = [('a',1),('a',2),('a',3),('b',1),('b',2),('b',3)]
>>> result = collections.defaultdict(list)
>>> for item in mylist:
result[item[0]].append(item[1])
>>> dict(result)
{'a': [1, 2, 3], 'b': [1, 2, 3]}
defaultdict(list) creates a dictionary in which keys are initialised upon first access to an object created using the callable passed as the argument (in this case list). It avoids having to check whether the key already exists or not.
The last line converts the defaultdict to a normal dict - it is not strictly necessary as defaultdict behaves like a normal dictionary too.
Values are appended to each key in the same order as the original list. However, the keys themselves will not be ordered (this is a property of dictionaries).
Update: if you need the dictionary keys to remain sorted as well, you can do this:
>>> import collections
>>> mylist = [('a',1),('a',2),('c',1),('c',2),('b',1),('b',2)]
>>> result = collections.OrderedDict()
>>> for item in mylist:
if item[0] not in result:
result[item[0]] = list()
result[item[0]].append(item[1])
>>> result
OrderedDict([('a', [1, 2]), ('c', [1, 2]), ('b', [1, 2])])
>>> result.keys()
['a', 'c', 'b']
Obviously, you cannot use dict(result) in this case as dict does not maintain any specific key order.

Duplicates in a dictionary (Python)

I need to write a function that returns true if the dictionary has duplicates in it. So pretty much if anything appears in the dictionary more than once, it will return true.
Here is what I have but I am very far off and not sure what to do.
d = {"a", "b", "c"}
def has_duplicates(d):
seen = set()
d={}
for x in d:
if x in seen:
return True
seen.add(x)
return False
print has_duplicates(d)
If you are looking to find duplication in values of the dictionary:
def has_duplicates(d):
return len(d) != len(set(d.values()))
print has_duplicates({'a': 1, 'b': 1, 'c': 2})
Outputs:
True
def has_duplicates(d):
return False
Dictionaries do not contain duplicate keys, ever. Your function, btw., is equivalent to this definition, so it's correct (just a tad long).
If you want to find duplicate values, that's
len(set(d.values())) != len(d)
assuming the values are hashable.
In your code, d = {"a", "b", "c"}, d is a set, not a dictionary.
Neither dictionary keys nor sets can contain duplicates. If you're looking for duplicate values, check if the set of the values has the same size as the dictionary itself:
def has_duplicate_values(d):
return len(set(d.values())) != len(d)
Python dictionaries already have unique keys.
Are you possibly interested in unique values?
set(d.values())
If so, you can check the length of that set to see if it is smaller than the number of values. This works because sets eliminate duplicates from the input, so if the result is smaller than the input, it means some duplicates were found and eliminated.
Not only is your general proposition that dictionaries can have duplicate keys false, but also your implementation is gravely flawed: d={} means that you have lost sight of your input d arg and are processing an empty dictionary!
The only thing that a dictionary can have duplicates of, is values. A dictionary is a key, value store where the keys are unique. In Python, you can create a dictionary like so:
d1 = {k1: v1, k2: v2, k3: v1}
d2 = [k1, v1, k2, v2, k3, v1]
d1 was created using the normal dictionary notation. d2 was created from a list with an even number of elements. Note that both versions have a duplicate value.
If you had a function that returned the number of unique values in a dictionary then you could say something like:
len(d1) != func(d1)
Fortunately, Python makes it easy to do this using sets. Simply converting d1 into a set is not sufficient. Lets make our keys and values real so you can run some code.
v1 = 1; v2 = 2
k1 = "a"; k2 = "b"; k3 = "c"
d1 = {k1: v1, k2: v2, k3: v1}
print len(d1)
s = set(d1)
print s
You will notice that s has three members too and looks like set(['c', 'b', 'a']). That's because a simple conversion only uses the keys in the dict. You want to use the values like so:
s = set(d1.values())
print s
As you can see there are only two elements because the value 1 occurs two times. One way of looking at a set is that it is a list with no duplicate elements. That's what print sees when it prints out a set as a bracketed list. Another way to look at it is as a dict with no values. Like many data processing activities you need to start by selecting the data that you are interested in, and then manipulating it. Start by selecting the values from the dict, then create a set, then count and compare.
This is not a dictionary, is a set:
d = {"a", "b", "c"}
I don't know what are you trying to accomplish but you can't have dictionaries with same key. If you have:
>>> d = {'a': 0, 'b':1}
>>> d['a'] = 2
>>> print d
{'a': 2, 'b': 1}

python dict: get vs setdefault

The following two expressions seem equivalent to me. Which one is preferable?
data = [('a', 1), ('b', 1), ('b', 2)]
d1 = {}
d2 = {}
for key, val in data:
# variant 1)
d1[key] = d1.get(key, []) + [val]
# variant 2)
d2.setdefault(key, []).append(val)
The results are the same but which version is better or rather more pythonic?
Personally I find version 2 harder to understand, as to me setdefault is very tricky to grasp. If I understand correctly, it looks for the value of "key" in the dictionary, if not available, enters "[]" into the dict, returns a reference to either the value or "[]" and appends "val" to that reference. While certainly smooth it is not intuitive in the least (at least to me).
To my mind, version 1 is easier to understand (if available, get the value for "key", if not, get "[]", then join with a list made up from [val] and place the result in "key"). But while more intuitive to understand, I fear this version is less performant, with all this list creating. Another disadvantage is that "d1" occurs twice in the expression which is rather error-prone. Probably there is a better implementation using get, but presently it eludes me.
My guess is that version 2, although more difficult to grasp for the inexperienced, is faster and therefore preferable. Opinions?
Your two examples do the same thing, but that doesn't mean get and setdefault do.
The difference between the two is basically manually setting d[key] to point to the list every time, versus setdefault automatically setting d[key] to the list only when it's unset.
Making the two methods as similar as possible, I ran
from timeit import timeit
print timeit("c = d.get(0, []); c.extend([1]); d[0] = c", "d = {1: []}", number = 1000000)
print timeit("c = d.get(1, []); c.extend([1]); d[0] = c", "d = {1: []}", number = 1000000)
print timeit("d.setdefault(0, []).extend([1])", "d = {1: []}", number = 1000000)
print timeit("d.setdefault(1, []).extend([1])", "d = {1: []}", number = 1000000)
and got
0.794723378711
0.811882272256
0.724429205999
0.722129751973
So setdefault is around 10% faster than get for this purpose.
The get method allows you to do less than you can with setdefault. You can use it to avoid getting a KeyError when the key doesn't exist (if that's something that's going to happen frequently) even if you don't want to set the key.
See Use cases for the 'setdefault' dict method and dict.get() method returns a pointer for some more info about the two methods.
The thread about setdefault concludes that most of the time, you want to use a defaultdict. The thread about get concludes that it is slow, and often you're better off (speed wise) doing a double lookup, using a defaultdict, or handling the error (depending on the size of the dictionary and your use case).
The accepted answer from agf isn't comparing like with like. After:
print timeit("d[0] = d.get(0, []) + [1]", "d = {1: []}", number = 10000)
d[0] contains a list with 10,000 items whereas after:
print timeit("d.setdefault(0, []) + [1]", "d = {1: []}", number = 10000)
d[0] is simply []. i.e. the d.setdefault version never modifies the list stored in d. The code should actually be:
print timeit("d.setdefault(0, []).append(1)", "d = {1: []}", number = 10000)
and in fact is faster than the faulty setdefault example.
The difference here really is because of when you append using concatenation the whole list is copied every time (and once you have 10,000 elements that is beginning to become measurable. Using append the list updates are amortised O(1), i.e. effectively constant time.
Finally, there are two other options not considered in the original question: defaultdict or simply testing the dictionary to see whether it already contains the key.
So, assuming d3, d4 = defaultdict(list), {}
# variant 1 (0.39)
d1[key] = d1.get(key, []) + [val]
# variant 2 (0.003)
d2.setdefault(key, []).append(val)
# variant 3 (0.0017)
d3[key].append(val)
# variant 4 (0.002)
if key in d4:
d4[key].append(val)
else:
d4[key] = [val]
variant 1 is by far the slowest because it copies the list every time, variant 2 is the second slowest, variant 3 is the fastest but won't work if you need Python older than 2.5, and variant 4 is just slightly slower than variant 3.
I would say use variant 3 if you can, with variant 4 as an option for those occasional places where defaultdict isn't an exact fit. Avoid both of your original variants.
For those who are still struggling in understanding these two term, let me tell you basic difference between get() and setdefault() method -
Scenario-1
root = {}
root.setdefault('A', [])
print(root)
Scenario-2
root = {}
root.get('A', [])
print(root)
In Scenario-1 output will be {'A': []} while in Scenario-2 {}
So setdefault() sets absent keys in the dict while get() only provides you default value but it does not modify the dictionary.
Now let come where this will be useful-
Suppose you are searching an element in a dict whose value is a list and you want to modify that list if found otherwise create a new key with that list.
using setdefault()
def fn1(dic, key, lst):
dic.setdefault(key, []).extend(lst)
using get()
def fn2(dic, key, lst):
dic[key] = dic.get(key, []) + (lst) #Explicit assigning happening here
Now lets examine timings -
dic = {}
%%timeit -n 10000 -r 4
fn1(dic, 'A', [1,2,3])
Took 288 ns
dic = {}
%%timeit -n 10000 -r 4
fn2(dic, 'A', [1,2,3])
Took 128 s
So there is a very large timing difference between these two approaches.
You might want to look at defaultdict in the collections module. The following is equivalent to your examples.
from collections import defaultdict
data = [('a', 1), ('b', 1), ('b', 2)]
d = defaultdict(list)
for k, v in data:
d[k].append(v)
There's more here.
1. Explained with a good example here:
http://code.activestate.com/recipes/66516-add-an-entry-to-a-dictionary-unless-the-entry-is-a/
dict.setdefault typical usage
somedict.setdefault(somekey,[]).append(somevalue)
dict.get typical usage
theIndex[word] = 1 + theIndex.get(word,0)
2. More explanation : http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html
dict.setdefault() is equivalent to get or set & get. Or set if necessary then get. It's especially efficient if your dictionary key is expensive to compute or long to type.
The only problem with dict.setdefault() is that the default value is always evaluated, whether needed or not. That only matters if the default value is expensive to compute. In that case, use defaultdict.
3. Finally the official docs with difference highlighted http://docs.python.org/2/library/stdtypes.html
get(key[, default])
Return the value for key if key is in the dictionary, else default. If
default is not given, it defaults to None, so that this method never
raises a KeyError.
setdefault(key[, default])
If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.
The logic of dict.get is:
if key in a_dict:
value = a_dict[key]
else:
value = default_value
Take an example:
In [72]: a_dict = {'mapping':['dict', 'OrderedDict'], 'array':['list', 'tuple']}
In [73]: a_dict.get('string', ['str', 'bytes'])
Out[73]: ['str', 'bytes']
In [74]: a_dict.get('array', ['str', 'byets'])
Out[74]: ['list', 'tuple']
The mechamism of setdefault is:
levels = ['master', 'manager', 'salesman', 'accountant', 'assistant']
#group them by the leading letter
group_by_leading_letter = {}
# the logic expressed by obvious if condition
for level in levels:
leading_letter = level[0]
if leading_letter not in group_by_leading_letter:
group_by_leading_letter[leading_letter] = [level]
else:
group_by_leading_letter[leading_letter].append(word)
In [80]: group_by_leading_letter
Out[80]: {'a': ['accountant', 'assistant'], 'm': ['master', 'manager'], 's': ['salesman']}
The setdefault dict method is for precisely this purpose. The preceding for loop can be rewritten as:
In [87]: for level in levels:
...: leading = level[0]
...: group_by_leading_letter.setdefault(leading,[]).append(level)
Out[80]: {'a': ['accountant', 'assistant'], 'm': ['master', 'manager'], 's': ['salesman']}
It's very simple, means that either a non-null list append an element or a null list append an element.
The defaultdict, which makes this even easier. To create one, you pass a type or function for generating the default value for each slot in the dict:
from collections import defualtdict
group_by_leading_letter = defaultdict(list)
for level in levels:
group_by_leading_letter[level[0]].append(level)
There is no strict answer to this question. They both accomplish the same purpose. They can both be used to deal with missing values on keys. The only difference that I have found is that with setdefault(), the key that you invoke (if not previously in the dictionary) gets automatically inserted while it does not happen with get(). Here is an example:
Setdefault()
>>> myDict = {'A': 'GOD', 'B':'Is', 'C':'GOOD'} #(1)
>>> myDict.setdefault('C') #(2)
'GOOD'
>>> myDict.setdefault('C','GREAT') #(3)
'GOOD'
>>> myDict.setdefault('D','AWESOME') #(4)
'AWESOME'
>>> myDict #(5)
{'A': 'GOD', 'B': 'Is', 'C': 'GOOD', 'D': 'AWSOME'}
>>> myDict.setdefault('E')
>>>
Get()
>>> myDict = {'a': 1, 'b': 2, 'c': 3} #(1)
>>> myDict.get('a',0) #(2)
1
>>> myDict.get('d',0) #(3)
0
>>> myDict #(4)
{'a': 1, 'b': 2, 'c': 3}
Here is my conclusion: there is no specific answer to which one is best specifically when it comes to default values imputation. The only difference is that setdefault() automatically adds any new key with a default value in the dictionary while get() does not. For more information, please go here !
In [1]: person_dict = {}
In [2]: person_dict['liqi'] = 'LiQi'
In [3]: person_dict.setdefault('liqi', 'Liqi')
Out[3]: 'LiQi'
In [4]: person_dict.setdefault('Kim', 'kim')
Out[4]: 'kim'
In [5]: person_dict
Out[5]: {'Kim': 'kim', 'liqi': 'LiQi'}
In [8]: person_dict.get('Dim', '')
Out[8]: ''
In [5]: person_dict
Out[5]: {'Kim': 'kim', 'liqi': 'LiQi'}

Categories

Resources