Why is list comprehension faster than apply in pandas - python

Using list comprehensions is way faster than a normal for loop. The reason usually given is that there is no need to call append in a list comprehension, which is understandable.
But I have also read in various places that list comprehensions are faster than apply, and I have experienced that myself. What is the internal working that makes them so much faster than apply?
I know this has something to do with vectorization in numpy, which is the underlying implementation of pandas DataFrames. But I don't understand what makes list comprehensions better than apply: in a list comprehension we write a for loop inside the list, whereas with apply we don't write any for loop at all (and I assumed vectorization takes place there too).
Edit:
Adding code:
This works on the Titanic dataset, where the title is extracted from the name:
https://www.kaggle.com/c/titanic/data
%timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
    ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
    ('Master' if 'Master' in x else 'None'))))
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else \
    ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]
Result:
782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit2:
To add code for SO, I was putting together a simple example, and surprisingly, for the code below the results reverse:
import pandas as pd
import timeit

df_test = pd.DataFrame()
tlist = []
tlist2 = []
for i in range(0, 5000000):
    tlist.append(i)
    tlist2.append(i + 5)

df_test['A'] = tlist
df_test['B'] = tlist2
display(df_test.head(5))
%timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
display(df_test.head(5))
%timeit df_test['C'] = [x*2 if x%5==0 else x for x in df_test['B']]
display(df_test.head(5))
1 loop, best of 3: 2.14 s per loop
1 loop, best of 3: 2.24 s per loop
Edit3:
Some suggested that apply is essentially a for loop, which does not seem to be the case: if I run this code with a for loop it takes dramatically longer; at first it seemed to never finish and I had to stop it manually after 3-4 minutes:
for row in df_test.itertuples():
    x = row.B
    if x % 5 == 0:
        df_test.at[row.Index, 'B'] = x*2
Running the above code takes around 23 seconds, but apply takes only 1.8 seconds. So what is the difference between this explicit loop over itertuples and apply?

There are a few reasons for the performance difference between apply and list comprehension.
First of all, the list comprehension in your code doesn't make a function call on each iteration, while apply does. This makes a huge difference:
map_function = lambda x: 'Mrs.' if 'Mrs' in x else \
    ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
    ('Master' if 'Master' in x else 'None')))
%timeit train['NameTitle'] = [map_function(x) for x in train['Name']]
# 581 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else \
    ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
    ('Master' if 'Master' in x else 'None'))) for x in train['Name']]
# 482 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Secondly, apply does much more than a list comprehension. For example, it tries to find an appropriate dtype for the result. By disabling that behaviour you can see what impact it has:
%timeit train['NameTitle'] = train['Name'].apply(map_function)
# 660 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['NameTitle'] = train['Name'].apply(map_function, convert_dtype=False)
# 626 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
There's also a bunch of other stuff happening within apply, so in this example you would want to use map:
%timeit train['NameTitle'] = train['Name'].map(map_function)
# 545 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This performs better than a list comprehension that contains a function call.
Then why use apply at all, you might ask? I know of at least one case where it outperforms everything else: when the operation you want to apply is a vectorized universal function (ufunc). That's because apply, unlike map and a list comprehension, lets the function run on the whole Series instead of on its individual elements. Let's see an example:
%timeit train['AgeExp'] = train['Age'].apply(lambda x: np.exp(x))
# 1.44 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].apply(np.exp)
# 256 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = train['Age'].map(np.exp)
# 1.01 ms ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit train['AgeExp'] = [np.exp(x) for x in train['Age']]
# 1.21 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
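As a side note (my own addition, not part of the timings above): since np.exp is a ufunc, it can also be called on the Series directly, using the same train DataFrame as in the question, which is essentially what apply does internally for a ufunc:
# Same result as train['Age'].apply(np.exp); the ufunc runs on the whole Series at once
train['AgeExp'] = np.exp(train['Age'])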

Related

Performant way to count hashtags inside list (pandas)

I have a dataframe with ~7,000,000 rows and a lot of columns.
Each row is a tweet, and I have a column text with the tweet's content.
I created a new column just for the hashtags inside text:
df['hashtags'] = df.Tweets.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
So I have a column called hashtags, with each row containing a list structure: ['#b747', '#test'].
I would like to count the occurrences of each hashtag, but I have a huge number of rows. What is the most performant way to do it?
Here are some different approaches, along with timing, ordered by speed (fastest first):
# setup
import numpy as np
import pandas as pd
from collections import Counter
from functools import reduce
n = 10_000
df = pd.DataFrame({
    'hashtags': np.random.randint(0, int(np.sqrt(n)), (n, 10)).astype(str).tolist(),
})
# 1. using itertools.chain to build an iterator on the elements of the lists
from itertools import chain
%timeit Counter(chain(*df.hashtags))
# 7.35 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2. as per @Psidom's comment
%timeit df.hashtags.explode().value_counts()
# 8.06 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3. using Counter constructor, but specifying an iterator, not a list
%timeit Counter(h for hl in df.hashtags for h in hl)
# 10.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 4. iterating explicitly and using Counter().update()
def count5(s):
    c = Counter()
    for hl in s:
        c.update(hl)
    return c
%timeit count5(df.hashtags)
# 12.4 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 5. using functools.reduce with Counter().update()
%timeit reduce(lambda x,y: x.update(y) or x, df.hashtags, Counter())
# 13.7 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6. as per @EzerK's answer
%timeit Counter(sum(df['hashtags'].values, []))
# 2.58 s ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: the fastest is #1 (using Counter(chain(*df.hashtags))), but the more intuitive and natural #2 (from @Psidom's comment) is almost as fast; I would probably go with that. #6 (@EzerK's approach) is very slow for a large df because we build a new (long) list before passing it as an argument to Counter().
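If you then want the result back as a pandas Series rather than a Counter (just a usage sketch on top of approach #1, not part of the benchmark above):
from collections import Counter
from itertools import chain

import pandas as pd

# Counter is a dict subclass, so it can be fed straight to the Series constructor
hashtag_counts = pd.Series(Counter(chain(*df.hashtags))).sort_values(ascending=False)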
You can sum all the lists into one big list and then use collections.Counter:
import pandas as pd
from collections import Counter
df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]
Counter(sum(df['hashtags'].values, []))

Python function with list

I'm learning Python by myself. I made a function for filling a list, but I have two variants and I want to find out which one is better and why (or whether they are both awful; either way I want to know the truth).
def foo(x):
    l = [0] * x
    for i in range(x):
        l[i] = i
    return l

def foo1(x):
    l = []
    for i in range(x):
        l.append(i)
    return l
From a performance perspective, the first version, foo, is better:
%timeit foo(1000000)
# 52.4 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit foo1(1000000)
# 67.2 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But the Pythonic way to unpack an iterator into a list is:
list(range(x))
It is also faster:
%timeit list(range(1000000))
# 26.7 ms ± 661 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
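For reference (my own addition, not timed in the original answer), the same list can also be built with a list comprehension, which reads like foo1 but without the explicit append calls:
def foo2(x):
    # build the list in a single expression instead of preallocating or appending
    return [i for i in range(x)]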

How to count choices in (3, 2000) ndarray faster?

Is there a way to speed up the following two lines of code?
choice = np.argmax(cust_profit, axis=0)
taken = np.array([np.sum(choice == i) for i in range(n_pr)])
%timeit np.argmax(cust_profit, axis=0)
37.6 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([np.sum(choice == i) for i in range(n_pr)])
40.2 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n_pr == 2
cust_profit.shape == (n_pr+1, 2000)
Solutions:
%timeit np.unique(choice, return_counts=True)
53.7 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.histogram(choice, bins=np.arange(n_pr + 2))
70.5 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.bincount(choice)
7.4 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These microseconds worry me because this code sits under two layers of scipy.optimize.minimize(method='Nelder-Mead'), which themselves sit inside a doubly nested loop, so 40 µs adds up to about 4 hours. And I'm thinking of wrapping it all in a genetic search.
The first line seems pretty straightforward. Unless you can sort the data or something like that, you are stuck with the linear lookup in np.argmax. The second line can be sped up simply by using numpy instead of vanilla python to implement it:
v, counts = np.unique(choice, return_counts=True)
Alternatively:
counts, _ = np.histogram(choice, bins=np.arange(n_pr + 2))
A version of histogram optimized for integers also exists:
count = np.bincount(choice)
The latter two options are better if you want to guarantee that the bins include all possible values of choice, regardless of whether they are actually present in the array or not.
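If you also want the counts as a fixed-length array covering all n_pr + 1 possible choice values, even ones that never occur, np.bincount accepts a minlength argument; a short sketch using the choice and n_pr from the question:
# counts[i] is how many columns picked option i, zero-padded up to n_pr + 1 entries
counts = np.bincount(choice, minlength=n_pr + 1)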
That being said, you probably shouldn't worry about something that takes microseconds.

Efficiency problem of customizing numpy's vectorized operation

I have a python function given below:
def myfun(x):
    if x > 0:
        return 0
    else:
        return np.exp(x)
where np is the numpy library. I want to make the function vectorized in numpy, so I use:
vec_myfun = np.vectorize(myfun)
I did a test to evaluate the efficiency. First I generate a vector of 100 random numbers:
x = np.random.randn(100)
Then I run the following code to obtain the runtime:
%timeit np.exp(x)
%timeit vec_myfun(x)
The runtime for np.exp(x) is 1.07 µs ± 24.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each).
The runtime for vec_myfun(x) is 71.2 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
My question is: compared to np.exp, vec_myfun has only one extra step (checking the value of x), yet it runs much more slowly than np.exp. Is there an efficient way to vectorize myfun so that it is as efficient as np.exp?
Use np.where:
>>> x = np.random.rand(100,)
>>> %timeit np.exp(x)
1.22 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %timeit np.where(x > 0, 0, np.exp(x))
4.09 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For comparison, your vectorized function runs in about 30 microseconds on my machine.
As to why it runs slower, it's just much more complicated than np.exp. It's doing lots of type deduction, broadcasting, and possibly making many calls to the actual method. Much of this happens in Python itself, while nearly everything in the call to np.exp (and the np.where version here) is in C.
ufuncs like np.exp have a where parameter, which can be used as:
In [288]: x = np.random.randn(10)
In [289]: out=np.zeros_like(x)
In [290]: np.exp(x, out=out, where=(x<=0))
Out[290]:
array([0. , 0. , 0. , 0. , 0.09407685,
0.92458328, 0. , 0. , 0.46618914, 0. ])
In [291]: x
Out[291]:
array([ 0.37513573, 1.75273458, 0.30561659, 0.46554985, -2.3636433 ,
-0.07841215, 2.00878429, 0.58441085, -0.76316384, 0.12431333])
This actually skips the calculation where the where condition is False.
In contrast:
np.where(arr > 0, 0, np.exp(arr))
calculates np.exp(arr) first for all of arr (that's normal Python evaluation order), and then performs the where selection. With exp that isn't a big deal, but with log it could cause problems.
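For example (my own illustration, not from the original answer), the same out/where pattern keeps log from ever being evaluated on non-positive values, whereas np.where(x > 0, np.log(x), 0.0) would still compute log for every element and emit warnings:
import numpy as np

x = np.array([-1.0, 0.0, 0.5, 2.0])
out = np.zeros_like(x)
# log is evaluated only where the condition holds; the other slots keep the 0.0 from `out`
np.log(x, out=out, where=(x > 0))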
Just thinking outside the box, what about implementing a function piecewise_exp() that basically multiplies np.exp(arr) by the boolean mask (arr <= 0)?
import numpy as np

def piecewise_exp(arr):
    # the mask matches myfun: exp(x) for x <= 0, and 0 for x > 0
    return np.exp(arr) * (arr <= 0)
Writing the code proposed so far as functions:
@np.vectorize
def myfun(x):
    if x > 0:
        return 0.0
    else:
        return np.exp(x)

def bnaeker_exp(arr):
    return np.where(arr > 0, 0, np.exp(arr))
And testing that everything is consistent:
np.random.seed(0)
# : test that the functions have the same behavior
num = 10
x = np.random.rand(num) - 0.5
print(x)
print(myfun(x))
print(piecewise_exp(x))
print(bnaeker_exp(x))
Doing some micro-benchmarks for small inputs:
# : micro-benchmarks for small inputs
num = 100
x = np.random.rand(num) - 0.5
%timeit np.exp(x)
# 1.63 µs ± 45.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit myfun(x)
# 54 µs ± 967 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bnaeker_exp(x)
# 4 µs ± 87.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit piecewise_exp(x)
# 3.38 µs ± 59.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
... and for larger inputs:
# : micro-benchmarks for larger inputs
num = 100000
x = np.random.rand(num) - 0.5
%timeit np.exp(x)
# 32.7 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit myfun(x)
# 44.9 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bnaeker_exp(x)
# 481 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit piecewise_exp(x)
# 149 µs ± 2.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This shows that piecewise_exp() is faster than anything else proposed so far and gets reasonably close to np.exp() speed, especially for larger inputs, where np.where() becomes less efficient because it uses integer indexing instead of boolean masks.
EDIT
Also, the performance of the np.where() version (bnaeker_exp()) depends on how many elements of the array actually satisfy the condition. If none of them do (as when you test on x = np.random.rand(100)), it is slightly faster than the boolean-mask multiplication version (piecewise_exp()) (128 µs ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) on my machine for n = 100000).

Check if All Values Exist as Keys in Dictionary

I have a list of values, and a dictionary. I want to ensure that each value in the list exists as a key in the dictionary. At the moment I'm using two sets to figure out if any values don't exist in the dictionary
unmapped = set(foo) - set(bar.keys())
Is there a more Pythonic way to test this, though? It feels like a bit of a hack.
Your approach will work; however, there will be overhead from the conversion to sets.
Another solution with the same time complexity would be:
all(i in bar for i in foo)
Both of these have time complexity O(len(foo))
bar = {str(i): i for i in range(100000)}
foo = [str(i) for i in range(1, 10000, 2)]
%timeit all(i in bar for i in foo)
462 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set(foo) - set(bar)
14.6 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The overhead is all the difference here:
foo = set(foo)
bar = set(bar)
%timeit foo - bar
213 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The overhead here makes a pretty big difference, so I would choose all here.
Try this to see if there is any unmapped item:
has_unmapped = all( (x in bar) for x in foo )
To see the unmapped items:
unmapped_items = [ x for x in foo if x not in bar ]
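As a small aside (my addition, not from the answers above), the membership check can also be spelled as a subset test, which avoids building a second set from bar:
# issubset accepts any iterable; iterating a dict yields its keys
all_mapped = set(foo).issubset(bar)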
