I have a list of values and a dictionary. I want to ensure that each value in the list exists as a key in the dictionary. At the moment I'm using two sets to figure out whether any values don't exist in the dictionary:
unmapped = set(foo) - set(bar.keys())
Is there a more pythonic way to test this though? It feels like a bit of a hack?
Your approach will work; however, there will be overhead from the conversion to set.
Another solution with the same time complexity would be:
all(i in bar for i in foo)
Both of these do O(len(foo)) membership tests, but building the two sets first costs an extra O(len(foo) + len(bar)).
bar = {str(i): i for i in range(100000)}
foo = [str(i) for i in range(1, 10000, 2)]
%timeit all(i in bar for i in foo)
462 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set(foo) - set(bar)
14.6 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The overhead is all the difference here:
foo = set(foo)
bar = set(bar)
%timeit foo - bar
213 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The conversion overhead makes a pretty big difference, so I would choose all() here.
Try this to see if there is any unmapped item:
has_unmapped = any(x not in bar for x in foo)
To see the unmapped items:
unmapped_items = [ x for x in foo if x not in bar ]
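The two checks above can be wrapped into small helpers. This is only a minimal sketch: the helper names `all_mapped` and `find_unmapped` and the sample data are made up for illustration.

```python
def all_mapped(values, mapping):
    """Return True if every value is a key of the mapping."""
    return all(v in mapping for v in values)


def find_unmapped(values, mapping):
    """Return the values that are not keys of the mapping, in order."""
    return [v for v in values if v not in mapping]


# Made-up sample data for illustration.
bar = {"a": 1, "b": 2, "c": 3}
foo = ["a", "c", "x"]

print(all_mapped(foo, bar))      # False
print(find_unmapped(foo, bar))   # ['x']
# Set-based variant: only foo is converted, bar is used as-is.
print(set(foo).difference(bar))  # {'x'}
```

Neither helper builds a set from the dictionary, so the only cost is one hash lookup per value in the list.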
I'm learning Python by myself. I wrote a function for filling a list, but I have two variants, and I want to find out which one is better and why. Or, if they are both awful, I want to know the truth anyway.
def foo(x):
    l = [0] * x
    for i in range(x):
        l[i] = i
    return l

def foo1(x):
    l = []
    for i in range(x):
        l.append(i)
    return l
From a performance perspective, the first version, foo, is better:
%timeit foo(1000000)
# 52.4 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit foo1(1000000)
# 67.2 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But the pythonic way to unpack an iterator into a list is:
list(range(x))
It is also faster:
%timeit list(range(1000000))
# 26.7 ms ± 661 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
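Outside IPython, the same comparison can be reproduced with the standard `timeit` module. The following is a rough sketch; the function names and the chosen sizes are illustrative, and absolute numbers will differ by machine.

```python
import timeit


def fill_by_index(n):
    """Preallocate the list, then assign each slot."""
    result = [0] * n
    for i in range(n):
        result[i] = i
    return result


def fill_by_append(n):
    """Grow the list one element at a time."""
    result = []
    for i in range(n):
        result.append(i)
    return result


def fill_by_list(n):
    """Let list() consume the range directly."""
    return list(range(n))


if __name__ == "__main__":
    n, repeats = 100_000, 20
    for func in (fill_by_index, fill_by_append, fill_by_list):
        seconds = timeit.timeit(lambda: func(n), number=repeats)
        print(f"{func.__name__}: {seconds / repeats * 1e3:.2f} ms per call")
```

All three functions return the same list, so any difference in the printed times comes purely from how the list is built.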
Is there a way to speed up the following two lines of code?
choice = np.argmax(cust_profit, axis=0)
taken = np.array([np.sum(choice == i) for i in range(n_pr)])
%timeit np.argmax(cust_profit, axis=0)
37.6 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([np.sum(choice == i) for i in range(n_pr)])
40.2 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n_pr == 2
cust_profit.shape == (n_pr+1, 2000)
Solutions:
%timeit np.unique(choice, return_counts=True)
53.7 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.histogram(choice, bins=np.arange(n_pr + 2))
70.5 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.bincount(choice)
7.4 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These microseconds worry me because this code sits inside two layers of scipy.optimize.minimize(method='Nelder-Mead'), which in turn sit inside a doubly nested loop, so 40 µs adds up to 4 hours. I am also thinking of wrapping it all in a genetic search.
The first line seems pretty straightforward. Unless you can sort the data or something like that, you are stuck with the linear lookup in np.argmax. The second line can be sped up simply by using numpy instead of vanilla python to implement it:
v, counts = np.unique(choice, return_counts=True)
Alternatively:
counts, _ = np.histogram(choice, bins=np.arange(n_pr + 2))
A version of histogram optimized for integers also exists:
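counts = np.bincount(choice)
The latter two options are better if you want to guarantee that the bins include all possible values of choice, regardless of whether they are actually present in the array or not. For np.bincount, pass minlength (here n_pr + 1) to get that guarantee even when the largest value is absent.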
That being said, you probably shouldn't worry about something that takes microseconds.
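To check that the faster variants give the same counts as the original list comprehension, a small sketch like the following can be used. The random `cust_profit` is stand-in data with the shape from the question, and the loop here runs over `n_pr + 1` values so that every row index is counted (the question's loop uses `range(n_pr)`).

```python
import numpy as np

# Stand-in data with the shape given in the question (assumption).
rng = np.random.default_rng(0)
n_pr = 2
cust_profit = rng.random((n_pr + 1, 2000))

choice = np.argmax(cust_profit, axis=0)

# Original approach: one full pass over `choice` per possible value.
taken_loop = np.array([np.sum(choice == i) for i in range(n_pr + 1)])

# Single pass; minlength keeps a bin for values that never occur.
taken_bincount = np.bincount(choice, minlength=n_pr + 1)

# np.histogram returns (counts, bin_edges), so unpack the tuple.
taken_hist, _ = np.histogram(choice, bins=np.arange(n_pr + 2))

print(taken_loop, taken_bincount, taken_hist)
```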
Using list comprehensions is way faster than a normal for loop. The reason usually given is that list comprehensions need no append call, which is understandable.
But I have read in various places that list comprehensions are faster than apply, and I have experienced that as well. What I cannot understand is what internal mechanism makes them so much faster than apply.
I know this has something to do with vectorization in numpy, which underlies pandas DataFrames. But why list comprehensions beat apply is not clear to me, since in a list comprehension we write a for loop inside the list, whereas with apply we write no for loop at all (and I assume vectorization takes place there too).
Edit:
Adding code.
This works on the Titanic dataset, where the title is extracted from the name:
https://www.kaggle.com/c/titanic/data
%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else\
('Master' if 'Master' in x else 'None'))))
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]
Result:
782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit2:
While creating a simple example to post on SO, I found that, surprisingly, the results reverse for the code below:
import pandas as pd
import timeit
df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range(0, 5000000):
    tlist.append(i)
    tlist2.append(i+5)
df_test['A'] = tlist
df_test['B'] = tlist2
display(df_test.head(5))
%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [ x*2 if x%5==0 else x for x in df_test['B']]
display(df_test.head(5))
1 loop, best of 3: 2.14 s per loop
1 loop, best of 3: 2.24 s per loop
Edit3:
Some suggested that apply is essentially a for loop. That does not seem to be the case: when I run this code with a for loop, it almost never ends. I had to stop it manually after 3-4 minutes, and it had not completed by then:
for row in df_test.itertuples():
    x = row.B
    if x%5==0:
        df_test.at[row.Index,'B'] = x*2
Running the above code takes around 23 seconds, but apply takes only 1.8 seconds. So what is the difference between this explicit loop over itertuples and apply?
There are a few reasons for the performance difference between apply and list comprehension.
First of all, list comprehension in your code doesn't make a function call on each iteration, while apply does. This makes a huge difference:
map_function = lambda x: 'Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
('Master' if 'Master' in x else 'None')))
%timeit train['NameTitle'] = [map_function(x) for x in train['Name']]
# 581 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else \
('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
('Master' if 'Master' in x else 'None'))) for x in train['Name']]
# 482 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
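The function-call overhead can also be seen without pandas. The sketch below uses made-up names in the style of the Titanic `Name` column and the standard `timeit` module; `title_of`, `with_call`, and `inlined` are illustrative helpers, not part of the original code.

```python
import timeit

# Made-up sample strings in the style of the Titanic 'Name' column.
names = [
    "Doe, Mr. John",
    "Doe, Mrs. Jane",
    "Roe, Miss. Ann",
    "Roe, Master. Tim",
    "Poe, Dr. Sam",
] * 200


def title_of(name):
    """Same chained conditional as in the answer, as a named function."""
    return ('Mrs.' if 'Mrs' in name else
            'Mr' if 'Mr' in name else
            'Miss' if 'Miss' in name else
            'Master' if 'Master' in name else 'None')


def with_call(values):
    # One Python-level function call per element.
    return [title_of(x) for x in values]


def inlined(values):
    # Same logic written directly inside the comprehension, no call.
    return ['Mrs.' if 'Mrs' in x else
            'Mr' if 'Mr' in x else
            'Miss' if 'Miss' in x else
            'Master' if 'Master' in x else 'None'
            for x in values]


if __name__ == "__main__":
    for func in (with_call, inlined):
        seconds = timeit.timeit(lambda: func(names), number=200)
        print(f"{func.__name__}: {seconds / 200 * 1e6:.1f} µs per call")
```

Both functions return identical lists; the only difference is whether each element goes through a function call.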
Secondly, apply does much more than list comprehension. For example it tries to find appropriate dtype for the result. By disabling that behaviour you can see what impact it has:
%timeit train['NameTitle'] = train['Name'].apply(map_function)
# 660 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = train['Name'].apply(map_function, convert_dtype=False)
# 626 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
There's also a bunch of other stuff happening within apply, so in this example you would want to use map:
%timeit train['NameTitle'] = train['Name'].map(map_function)
# 545 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This performs better than a list comprehension with a function call in it.
Then why use apply at all, you might ask? I know at least one case where it outperforms everything else: when the operation you want to apply is a vectorized universal function. That's because apply, unlike map and list comprehensions, allows the function to run on the whole Series instead of on the individual objects in it. Let's see an example:
%timeit train['AgeExp'] = train['Age'].apply(lambda x: np.exp(x))
# 1.44 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].apply(np.exp)
# 256 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].map(np.exp)
# 1.01 ms ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = [np.exp(x) for x in train['Age']]
# 1.21 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
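The same effect shows up with plain NumPy, independent of pandas. In this sketch, `ages` is made-up data standing in for `train['Age']`; it contrasts one call on the whole array with one call per element.

```python
import timeit

import numpy as np

# Made-up stand-in for the 'Age' column.
ages = np.linspace(1.0, 80.0, 1000)

# One ufunc call: the loop over elements runs inside NumPy's C code.
vectorized = np.exp(ages)

# One ufunc call per element: the loop runs in the Python interpreter.
elementwise = np.array([np.exp(a) for a in ages])

if __name__ == "__main__":
    t_vec = timeit.timeit(lambda: np.exp(ages), number=200)
    t_elem = timeit.timeit(lambda: [np.exp(a) for a in ages], number=200)
    print(f"whole array: {t_vec / 200 * 1e6:.1f} µs per call")
    print(f"per element: {t_elem / 200 * 1e6:.1f} µs per call")
```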
When we need to copy all the data from a dictionary containing primitive data types (for simplicity, let's ignore types like datetime), the most obvious choice is deepcopy. However, deepcopy is slower than some hackish alternatives that achieve the same result through a serialize/deserialize round trip, such as json.dumps/json.loads or msgpack.packb/msgpack.unpackb. The difference in efficiency can be seen here:
>>> import timeit
>>> setup = '''
... import msgpack
... import json
... from copy import deepcopy
... data = {'name':'John Doe','ranks':{'sports':13,'edu':34,'arts':45},'grade':5}
... '''
>>> print(timeit.timeit('deepcopy(data)', setup=setup))
12.0860249996
>>> print(timeit.timeit('json.loads(json.dumps(data))', setup=setup))
9.07182312012
>>> print(timeit.timeit('msgpack.unpackb(msgpack.packb(data))', setup=setup))
1.42743492126
The json and msgpack (or cPickle) methods are faster than a normal deepcopy, which is expected, as deepcopy does much more work in copying all the attributes of the object too.
Question: Is there a more pythonic/built-in way to get just a data copy of a dictionary or list, without all the overhead that deepcopy has?
It really depends on your needs. deepcopy was built with the intention to do the (most) correct thing. It keeps shared references, it doesn't recurse into infinite recursive structures and so on... It can do that by keeping a memo dictionary in which all encountered "things" are inserted by reference. That's what makes it quite slow for pure-data copies. However I would almost always say that deepcopy is the most pythonic way to copy data even if other approaches could be faster.
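The two guarantees mentioned above can be observed directly. This is a small illustration with made-up data, showing what copy.deepcopy preserves and what a JSON round trip does not.

```python
import copy
import json

# The same list object is referenced twice.
shared = [1, 2]
data = {"a": shared, "b": shared}

clone = copy.deepcopy(data)
# deepcopy copies the list once and keeps the sharing inside the clone.
print(clone["a"] is clone["b"])    # True
print(clone["a"] is shared)        # False

# A JSON round trip produces two unrelated lists instead.
via_json = json.loads(json.dumps(data))
print(via_json["a"] is via_json["b"])  # False

# A self-referencing structure is handled through the memo dictionary.
loop = []
loop.append(loop)
loop_clone = copy.deepcopy(loop)
print(loop_clone[0] is loop_clone)  # True
```

If your data never contains shared or recursive references, this bookkeeping is pure overhead, which is why the faster alternatives can skip it.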
If you have pure data and a limited number of types inside it, you could build your own deepcopy (built roughly after the implementation of deepcopy in CPython):
_dispatcher = {}

def _copy_list(l, dispatch):
    ret = l.copy()
    for idx, item in enumerate(ret):
        cp = dispatch.get(type(item))
        if cp is not None:
            ret[idx] = cp(item, dispatch)
    return ret

def _copy_dict(d, dispatch):
    ret = d.copy()
    for key, value in ret.items():
        cp = dispatch.get(type(value))
        if cp is not None:
            ret[key] = cp(value, dispatch)
    return ret

_dispatcher[list] = _copy_list
_dispatcher[dict] = _copy_dict

def deepcopy(sth):
    cp = _dispatcher.get(type(sth))
    if cp is None:
        return sth
    else:
        return cp(sth, _dispatcher)
This only works correctly for immutable non-container types and for list and dict instances. You could add more dispatchers if you need them.
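As a sketch of what adding more dispatchers could look like, the version below also handles set and tuple. The names `data_deepcopy`, `_copy_item`, `_copy_set`, and `_copy_tuple` are made up for this illustration, and it assumes set elements are immutable (they must be hashable anyway).

```python
def _copy_item(item, dispatch):
    """Copy one item if its exact type has a registered copier."""
    cp = dispatch.get(type(item))
    return item if cp is None else cp(item, dispatch)


def _copy_list(lst, dispatch):
    return [_copy_item(item, dispatch) for item in lst]


def _copy_dict(d, dispatch):
    return {key: _copy_item(value, dispatch) for key, value in d.items()}


def _copy_set(s, dispatch):
    # Assumption: elements are immutable, so a shallow copy is enough.
    return s.copy()


def _copy_tuple(t, dispatch):
    # A tuple is immutable itself but may hold mutable items.
    return tuple(_copy_item(item, dispatch) for item in t)


_dispatcher = {
    list: _copy_list,
    dict: _copy_dict,
    set: _copy_set,
    tuple: _copy_tuple,
}


def data_deepcopy(obj):
    """Deep-copy plain data made of the registered container types."""
    return _copy_item(obj, _dispatcher)


original = {"tags": {"a", "b"}, "pair": ([1, 2], 3), "name": "John Doe"}
clone = data_deepcopy(original)
clone["pair"][0].append(99)
print(original["pair"][0])  # [1, 2]
```

Like the version above, this dispatches on the exact type, so subclasses of the registered containers are returned uncopied, and it does no memo bookkeeping, so shared or recursive references are not preserved.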
# Timings done on Python 3.5.3 - Windows - on a really slow laptop :-/
import copy
import msgpack
import json
import string
data = {'name':'John Doe','ranks':{'sports':13,'edu':34,'arts':45},'grade':5}
%timeit deepcopy(data)
# 11.9 µs ± 280 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit copy.deepcopy(data)
# 64.3 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit json.loads(json.dumps(data))
# 65.9 µs ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit msgpack.unpackb(msgpack.packb(data))
# 56.5 µs ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Let's also see how it performs when copying a big dictionary containing strings and integers:
data = {''.join([a,b,c]): 1 for a in string.ascii_letters for b in string.ascii_letters for c in string.ascii_letters}
%timeit deepcopy(data)
# 194 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit copy.deepcopy(data)
# 1.02 s ± 46.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit json.loads(json.dumps(data))
# 398 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit msgpack.unpackb(msgpack.packb(data))
# 238 ms ± 8.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think you can manually implement what you need by overriding object.__deepcopy__.
A pythonic way to do this is to create a custom dict that extends the built-in dict and implement your own __deepcopy__ on it.
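A minimal sketch of that idea follows. The class name `PlainDataDict` is made up, and the sketch assumes the values are plain data: immutable scalars plus nested dicts and lists. It is not claimed to be faster than the other approaches here; it only shows how the hook is wired up.

```python
import copy


class PlainDataDict(dict):
    """dict subclass with a simplified __deepcopy__ for plain data."""

    def __deepcopy__(self, memo):
        clone = type(self)()
        # Register the clone first so self-references resolve to it.
        memo[id(self)] = clone
        for key, value in self.items():
            if isinstance(value, (dict, list)):
                # Containers still go through the regular machinery.
                clone[key] = copy.deepcopy(value, memo)
            else:
                # Assumption: everything else is immutable, reuse it.
                clone[key] = value
        return clone


data = PlainDataDict(name="John Doe", ranks={"sports": 13}, grade=5)
clone = copy.deepcopy(data)
clone["ranks"]["sports"] = 99
print(data["ranks"]["sports"])  # 13
```

copy.deepcopy looks for a `__deepcopy__` method on objects whose type it has no built-in copier for, and calls it with the memo dictionary, which is what makes this hook work for a dict subclass.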
@MSeifert The suggested answer is not accurate.
So far I have found ujson.loads(ujson.dumps(my_dict)) to be the fastest option, which looks strange (how can translating a dict to a string and then from a string to a new dict be faster than a pure copy?).
Here is an example of the methods I tried and their running times for a small dictionary (the differences are of course clearer with a larger dictionary):
x = {'a':1,'b':2,'c':3,'d':4, 'e':{'a':1,'b':2}}
# This function only handles a dict of dicts; it is very similar to the suggested solution
def fast_copy(d):
    output = d.copy()
    for key, value in output.items():
        output[key] = fast_copy(value) if isinstance(value, dict) else value
    return output
from copy import deepcopy
import ujson
%timeit deepcopy(x)
13.5 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit fast_copy(x)
2.57 µs ± 31.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit ujson.loads(ujson.dumps(x))
1.67 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is there any other C extension that might work better than ujson?
It is very strange that this is the fastest method to copy a large dict.
It's always fastest to write your own copy function specific to your data structure.
Your example
data = {
'name': 'John Doe',
'ranks': {
'sports': 13,
'edu': 34,
'arts': 45
},
'grade': 5
}
is a dict whose values are just strs, ints, or dicts. Hence:
def copy(obj):
    out = obj.copy()  # Shallow copy
    for k, v in obj.items():
        if isinstance(v, dict):
            out[k] = v.copy()  # Copy nested dicts one level down
    return out  # Return the copy, not the original
%timeit deepcopy(data)
5.26 µs ± 88.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit json.loads(json.dumps(data))
5.11 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit msgpack.unpackb(msgpack.packb(data))
2.44 µs ± 76.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit ujson.loads(ujson.dumps(data))
1.63 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit copy(data)
548 ns ± 5.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
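Whatever hand-written copy function you end up with, it is worth checking that the result is really independent of the original. A minimal sketch, using a hypothetical `copy_two_levels` helper that copies the top level and any directly nested dicts:

```python
def copy_two_levels(obj):
    """Copy a dict and any dicts stored directly in it."""
    out = obj.copy()
    for key, value in obj.items():
        if isinstance(value, dict):
            out[key] = value.copy()
    return out


data = {
    'name': 'John Doe',
    'ranks': {'sports': 13, 'edu': 34, 'arts': 45},
    'grade': 5,
}

clone = copy_two_levels(data)

# Mutate the clone at both levels; the original must not change.
clone['grade'] = 6
clone['ranks']['sports'] = 99

print(data['grade'])            # 5
print(data['ranks']['sports'])  # 13
```

A function that accidentally returns its input instead of the copy would pass any timing benchmark but fail this check.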
@MSeifert's answer did not work for me, so I implemented a somewhat different approach.
from copy import copy  # Assumption: copy below refers to copy.copy

def myDictDeepCopy(dictToCopy) -> dict:
    '''
    Parameters
    ----------
    dictToCopy : dict
        dict that you want to copy

    Returns
    -------
    dict
    '''
    # Shallow copy
    temp = dictToCopy.copy()
    dictToReturn = {}
    # Shallow-copy each value, so nesting is handled one level down only
    for key, value in temp.items():
        dictToReturn[key] = copy(value)
    return dictToReturn