Performance of pandas GroupBy to the Keys - python

I have to group a DataFrame and then do some further calculations using the keys as input parameters. While doing so, I noticed some strange performance behavior.
The time for grouping is fine, and so is the time to get the keys, but executing both steps together takes 24x as long.
Am I using it wrong, or is there another way to get the unique parameter pairs together with all their indices?
Here is a simple example:
import numpy as np
import pandas as pd

def test_1(df):
    grouped = df.groupby(['up', 'down'])
    return grouped

def test_2(grouped):
    keys = grouped.groups.keys()
    return keys

def test_3(df):
    keys = df.groupby(['up', 'down']).groups.keys()
    return keys

def test_4(df):
    grouped = df.groupby(['up', 'down'])
    keys = grouped.groups.keys()
    return keys

n = np.arange(1, 10, 1)
df = pd.DataFrame([], columns=['up', 'down'])
df['up'] = n
df['down'] = n[::-1]
grouped = df.groupby(['up', 'down'])
%timeit test_1(df)
169 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit test_2(grouped)
1.01 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit test_3(df)
4.36 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test_4(df)
4.2 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thanks in advance for comments or ideas.
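Not part of the original question, but a likely explanation: df.groupby() itself is lazy and cheap, while the first access to .groups builds the key-to-index mapping and then caches it, so timing the pre-built grouped object (test_2) only measures the cached lookup. A minimal sketch of where the time goes:
import numpy as np
import pandas as pd

n = np.arange(1, 10, 1)
df = pd.DataFrame({'up': n, 'down': n[::-1]})

grouped = df.groupby(['up', 'down'])   # cheap: lazy, nothing is grouped yet
keys = grouped.groups.keys()           # expensive on first access: builds and caches
                                       # the key -> row-index mapping
keys_again = grouped.groups.keys()     # cheap: served from the cache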

Related

Performant way to count hashtags inside a list (pandas)

I have a dataframe with ~7,000,000 rows and a lot of columns.
Each row is a tweet, and I have a column text with the tweet's content.
I created a new column just for the hashtags inside text:
df['hashtags'] = df.Tweets.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
So I have a column called hashtags with each row containing a list structure: ['#b747', '#test'].
I would like to count the number of occurrences of each hashtag, but I have a very large number of rows. What is the most performant way to do it?
Here are some different approaches, along with timing, ordered by speed (fastest first):
# setup
import numpy as np
import pandas as pd
from collections import Counter
from functools import reduce

n = 10_000
df = pd.DataFrame({
    'hashtags': np.random.randint(0, int(np.sqrt(n)), (n, 10)).astype(str).tolist(),
})
# 1. using itertools.chain to build an iterator on the elements of the lists
from itertools import chain
%timeit Counter(chain(*df.hashtags))
# 7.35 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2. as per @Psidom's comment
%timeit df.hashtags.explode().value_counts()
# 8.06 ms ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3. using Counter constructor, but specifying an iterator, not a list
%timeit Counter(h for hl in df.hashtags for h in hl)
# 10.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 4. iterating explicitly and using Counter().update()
def count5(s):
    c = Counter()
    for hl in s:
        c.update(hl)
    return c
%timeit count5(df.hashtags)
# 12.4 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 5. using functools.reduce with Counter.update()
%timeit reduce(lambda x,y: x.update(y) or x, df.hashtags, Counter())
# 13.7 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6. as per @EzerK
%timeit Counter(sum(df['hashtags'].values, []))
# 2.58 s ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Conclusion: the fastest is #1 (using Counter(chain(*df.hashtags))), but the more intuitive and natural #2 (from @Psidom's comment) is almost as fast. I would probably go with that. #6 (@EzerK's approach) is very slow for a large df because it builds a new (long) list before passing it as an argument to Counter().
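As a side note (my illustration, not from the original answer): sum() re-copies the growing result list on every step, so flattening lists with N total elements that way is O(N**2), whereas chain just iterates once in O(N). A minimal sketch of the two flattening strategies:
from itertools import chain

flat_slow = sum(df.hashtags.tolist(), [])           # quadratic: copies the accumulated list each step
flat_fast = list(chain.from_iterable(df.hashtags))  # linear: iterates over each sublist once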
You can concatenate all the lists into one big list and then use collections.Counter:
import pandas as pd
from collections import Counter
df = pd.DataFrame()
df['hashtags'] = [['#b747', '#test'], ['#b747', '#test']]
Counter(sum(df['hashtags'].values, []))
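For the two-row example frame above, the result looks like this (shown for illustration; see the timings above before using sum()-flattening on a large frame):
Counter(sum(df['hashtags'].values, []))
# Counter({'#b747': 2, '#test': 2})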

Python speed up element in list

I have made a script where I check whether a column's values from dataframe A exist in a column of dataframe B. Here, dataframe A is named whole_data, and dataframe B is named referrals.
users=set(whole_data['user_id'])
referees=set(referrals['referee_id'])
non_referees=set([x for x in users if x not in referees])
As you can see, I want a set of users (named non_referees) containing the users that are not referees; that's why I check, for every user_id from whole_data, whether it exists in the set of referees.
Nonetheless, this is taking a massive amount of time; there are about 100K users and 4K referees. Is there a way to make this faster?
First, pandas can already give you the unique values of a series, which might be faster than building the set from the whole column.
Second, to build the set of non-referees, you can then use set operations:
non_referees = users - referees
EDIT: As an additional note, if you build a set using the generator expression style, you don't need to build an intermediate list:
# slow because it first builds a list and then turns that into a set:
some_set = set([x for x in something])
# faster because it goes right into building the set:
some_other_set = set(x for x in something_else)
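Putting this answer's two points together, a minimal sketch using the asker's whole_data and referrals frames (the next answer shows an equivalent form with .difference()):
users = set(whole_data['user_id'].unique())        # unique user ids straight from pandas
referees = set(referrals['referee_id'].unique())   # unique referee ids
non_referees = users - referees                    # set difference instead of a loop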
You may want to consider:
non_referees = set(whole_data['user_id'].unique()).difference(
    referrals['referee_id'].unique()
)
If you think there are few repeats in referrals['referee_id'], then you'll gain a smidgen of speed by avoiding .unique() for them.
Speed
Here are some experiments with a few closely related forms:
Case with lots of duplicated referee_id values:
n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 23.4 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 23.3 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 23.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 21.4 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 29.6 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Case with few duplicates in referee_id:
n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, 100*n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 30.7 ms ± 61.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 30.7 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 30.6 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 23.7 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 20.9 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to count choices in (3, 2000) ndarray faster?

Is there a way to speed up the following two lines of code?
choice = np.argmax(cust_profit, axis=0)
taken = np.array([np.sum(choice == i) for i in range(n_pr)])
%timeit np.argmax(cust_profit, axis=0)
37.6 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([np.sum(choice == i) for i in range(n_pr)])
40.2 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n_pr == 2
cust_profit.shape == (n_pr+1, 2000)
Solutions:
%timeit np.unique(choice, return_counts=True)
53.7 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.histogram(choice, bins=np.arange(n_pr + 2))
70.5 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.bincount(choice)
7.4 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These microseconds worry me, because this code sits under two layers of scipy.optimize.minimize(method='Nelder-Mead'), which itself sits inside a doubly nested loop, so 40 µs adds up to about 4 hours. And I am thinking about wrapping it all in a genetic search.
The first line seems pretty straightforward. Unless you can sort the data or something like that, you are stuck with the linear lookup in np.argmax. The second line can be sped up simply by using numpy instead of vanilla python to implement it:
v, counts = np.unique(choice, return_counts=True)
Alternatively:
counts = np.histogram(choice, bins=np.arange(n_pr + 2))
A version of histogram optimized for integers also exists:
count = np.bincount(choice)
The latter two options are better if you want to guarantee that the bins include all possible values of choice, regardless of whether they are actually present in the array or not.
That being said, you probably shouldn't worry about something that takes microseconds.
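A drop-in sketch for the original two lines (my combination, not stated in the answer): np.bincount's minlength argument guarantees a slot for every possible index, including values that never appear, and slicing recovers the original range(n_pr) result:
choice = np.argmax(cust_profit, axis=0)           # best row index per column
counts = np.bincount(choice, minlength=n_pr + 1)  # one count per possible row, even if absent
taken = counts[:n_pr]                             # match the original range(n_pr) counts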

Speed up list creation from pandas dataframe

I have a pandas dataframe df from which I need to create a list Row_list.
import pandas as pd

df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983],
                   [3, 572124.999, 166423.478, -11.861, -11.512],
                   [4, 572534.264, 166414.417, -11.123, -11.993]],
                  columns=['PointNo', 'easting', 'northing', 't_20080729', 't_20090808'])
I am able to create the list in the required format with the code below, but my dataframe has up to 8 million rows and the list creation is very slow.
def test_get_value_iterrows(df):
    Row_list = []
    for index, rows in df.iterrows():
        entirerow = df.values[index,].tolist()
        entirerow.append((df.iloc[index, 1], df.iloc[index, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value_iterrows(df)
436 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not using df.iterrows() is a little bit faster:
def test_get_value(df):
    Row_list = []
    for i in df.index:
        entirerow = df.values[i,].tolist()
        entirerow.append((df.iloc[i, 1], df.iloc[i, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value(df)
270 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am wondering if there is a faster solution to this?
Use a list comprehension:
df = pd.concat([df] * 10000, ignore_index=True)
In [123]: %timeit [[*x, (x[1], x[2])] for x in df.values.tolist()]
27.8 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [124]: %timeit [x + [(x[1], x[2])] for x in df.values.tolist()]
26.6 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [125]: %timeit (test_get_value(df))
41.2 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
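For reference, each element of the resulting list looks like this on the sample frame (df.values upcasts the mixed int/float columns to float, so PointNo comes back as 1.0):
rows = [[*x, (x[1], x[2])] for x in df.values.tolist()]
rows[0]
# [1.0, 572548.283, 166424.411, -11.849, -11.512, (572548.283, 166424.411)]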

Efficient way to check dtype of each row in a series

Say I have mixed ts/other data:
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp
I would like to filter for all timestamps. For instance, this gives me what I want:
ser.apply(lambda x: isinstance(x, pd.Timestamp))
0 True
1 True
2 True
3 False
4 True
...
But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:
ser.where(isinstance(ser, pd.Timestamp))
But I get
ValueError: Array conditional must be same shape as self
Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?
It depends on the length of the data, but here, for small data (365 rows), a list comprehension is faster:
In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But when testing a larger Series, to_datetime combined with a check for non-missing values via Series.notna is faster:
ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4
print(len(ser))
14611
In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To address your question of filtering, you can convert to datetime and drop the NaT values.
ser[pd.to_datetime(ser, errors='coerce').notna()]
Or, if you don't mind the result being datetime,
pd.to_datetime(ser, errors='coerce').dropna()
