Groupby in python pandas: Fast Way

I want to improve the time of a groupby in python pandas.
I have this code:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
The objective is to count how many contracts a client has in a month and add this information in a new column (Nbcontrats).
Client: client code
Month: month of data extraction
Contrat: contract number
I want to improve the time. Below I am only working with a subset of my real data:
%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop
df.shape
Out[309]: (7464, 61)
How can I improve the execution time?

Here's one way to proceed:
Slice out the relevant columns (['Client', 'Month']) from the input dataframe into a NumPy array. This is mostly a performance-focused idea, since we will be using NumPy functions later on, which are optimized to work with NumPy arrays.
Convert the data from those two columns into a single 1D array that is its linear-index equivalent, treating elements from the two columns as pairs. Thus, we can assume that the elements from 'Client' represent the row indices, whereas the 'Month' elements are the column indices. This is like going from 2D to 1D. The issue is deciding the shape of the 2D grid to perform such a mapping; to cover all pairs, one safe assumption is a 2D grid whose dimensions are one more than the max along each column, because of 0-based indexing in Python. Thus, we get linear indices.
Next up, we tag each linear index based on its uniqueness among the others; these tags correspond to the keys obtained with groupby. We also need the counts of each group/unique key along the entire length of that 1D array. Finally, indexing into the counts with those tags maps the respective count onto each element.
That's the whole idea! Here's the implementation:
import numpy as np

# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values
# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
Runtime test
1) Define functions:
def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
    unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
    df["Nbcontrats"] = counts[unqtags]
2) Verify results:
In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
...: # Run the function on the inputs
...: original_app(df)
...: vectorized_app(df1)
...:
In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True
3) Finally, time them:
In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
...: arr = np.random.randint(0,100,(10000,3))
...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
...: df1 = df.copy()
...:
In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop
In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop

With the DataFrameGroupBy.size method:
df.set_index(['Client', 'Month'], inplace=True)
df['Nbcontrats'] = df.groupby(level=(0,1)).size()
df.reset_index(inplace=True)
The most work goes into assigning the result back into a column of the source DataFrame.
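For reference, the same count can also be written with transform and the built-in 'size' aggregation, which avoids calling Python's len on every group and keeps the result aligned with the original rows. This is a minimal sketch of a variant not shown in the answers above, not a timed comparison:
df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform('size')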


Pandas - Explanation on apply function being slow

The apply function seems to work very slowly with a large dataframe (about 1-3 million rows).
I have checked related questions here, like Speed up Pandas apply function and Counting within pandas apply() function, and it seems the best way to speed it up is not to use the apply function :)
For my case, I have two kinds of tasks to do with the apply function.
First: apply with a lookup dict query
def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame that works like a lookup table
df = df.apply(f, args=(p_dict,))
Second: apply with groupby
def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].ix[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f_partial, df['WEEK'])
I guess the first case could be done with a dataframe join, but I am not sure about the resource cost of such a join on a large dataset.
My question is:
Is there any way to substitute apply in the two above cases?
Why is apply so slow? For the dict lookup case, I think it should be O(N); it shouldn't cost that much even if N is 1 million.
Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame, you have considerable IO overhead (I cannot guarantee this is the case 100% of the time since Pandas has loads of internal implementation optimization).
For your first method, I assume you are trying to fill a 'value' column in df using the p_dict as a lookup table. It is about 1000x faster to use pd.merge:
import string, sys
import numpy as np
import pandas as pd
##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]
# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop
# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)
>>> 1000 loops, best of 5: 826 µs per loop
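Another way to express the same lookup, assuming the 'ID' values in p_dict are unique, is Series.map against a Series indexed by 'ID'. A small sketch under the same test setup as above; unlike an inner merge, it preserves the row order and length of df:
# Build a Series mapping ID -> value, then map it onto the p_id column
value_by_id = p_dict.set_index('ID')['value']
df['value'] = df['p_id'].map(value_by_id)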
Concerning the second task, we can quickly add a new column to p_dict that calculates a mean where the time window starts at min_week_num and ends at the week for that row in p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.
I am assuming that min_week_num is 0 in the following example. But you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method will run in O(n) since it conducts a fixed number of operations per iteration.
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})
def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking the mean grows each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out
p_dict['Means'] = rolling_growing_mean(p_dict['value'])
df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
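If your pandas version has expanding windows, the same growing mean (with min_week_num equal to 0, as assumed above) can also be written without the explicit loop. A sketch, not the answer's original method:
# Mean of all 'value' rows up to and including the current row of p_dict
p_dict['Means'] = p_dict['value'].expanding().mean()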

Most efficient way to convert values of column in Pandas DataFrame

I have a pd.DataFrame that looks like the DF_test example constructed below.
I want to apply a cutoff to the values to turn them into binary digits; my cutoff in this case is 0.85. I want the resulting dataframe to look like DF_want below:
The script I wrote to do this is easy to understand but for large datasets it is inefficient. I'm sure Pandas has some way of taking care of these types of transformations.
Does anyone know of an efficient way to convert a column of floats to a column of integers using a threshold?
My extremely naive way of doing such a thing:
DF_test = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0.12,0.23,0.93,0.86,0.33]]).T,columns=["c1","c2","value"])
DF_want = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0,0,1,1,0]]).T,columns=["c1","c2","value"])
threshold = 0.85
#Empty dataframe to append rows
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    #Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    #Check if value is greater than threshold
    binary_value = [int(bool(float(DF_test.ix[i][-1]) > threshold))]
    #Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    #Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
#Relabel columns
DF_naive.columns = DF_test.columns
DF_naive.head()
#the sample DF_want
You can use np.where to set your desired value based on a boolean condition:
In [18]:
DF_test['value'] = np.where(DF_test['value'] > threshold, 1,0)
DF_test
Out[18]:
c1 c2 value
0 a p 0
1 b q 0
2 c r 1
3 d s 1
4 e t 0
Note that because your data is a heterogeneous np array, the 'value' column contains strings rather than floats:
In [58]:
DF_test.iloc[0]['value']
Out[58]:
'0.12'
So you'll need to convert the dtype to float first: DF_test['value'] = DF_test['value'].astype(float)
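One way to sidestep the object-dtype issue altogether is to build the test frame from a dict, so each column keeps its own dtype. A sketch of the same example with a float 'value' column from the start:
import numpy as np
import pandas as pd

DF_test = pd.DataFrame({'c1': list('abcde'),
                        'c2': list('pqrst'),
                        'value': [0.12, 0.23, 0.93, 0.86, 0.33]})
threshold = 0.85
# 'value' is already float64 here, so no astype(float) step is needed
DF_test['value'] = np.where(DF_test['value'] > threshold, 1, 0)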
You can compare the timings:
In [16]:
%timeit np.where(DF_test['value'] > threshold, 1,0)
1000 loops, best of 3: 297 µs per loop
In [17]:
%%timeit
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    #Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    #Check if value is greater than threshold
    binary_value = [int(bool(float(DF_test.ix[i][-1]) > threshold))]
    #Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    #Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
10 loops, best of 3: 39.3 ms per loop
The np.where version is over 100x faster; admittedly, your code is doing a lot of unnecessary work, but you get the point.
Since bool is a subclass of int, i.e. True == 1 and False == 0, you can convert a Boolean series to its integer form:
DF_test['value'] = (DF_test['value'] > threshold).astype(int)
Generally, including most uses in computation or indexing, the int conversion is not necessary and you may wish to forego it altogether.
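For example, the Boolean series can often be used as-is. A small sketch, assuming 'value' still holds the original floats (converted with astype(float) as noted above):
above = DF_test['value'] > threshold   # Boolean Series
above.sum()                            # counts rows above the threshold, since True == 1
DF_test[above]                         # or use it directly as a row filter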

Constructing a waterfall algorithm from multiple columns in a Pandas Data Frame

Suppose I have a multi-column data frame and I wish to implement a waterfall-style algorithm: take the value from the first column if it is present, look at the second column if it is not, fall back to the third column if that is also missing, and so on; if the value is missing in the last column, take a default value (say zero). I have a way of doing this by adding up a series of vector operations (see below), but it doesn't seem to scale to more columns very well. And, of course, I could do it with nested loops through the rows (very unpythonic -- right?).
frame = pd.DataFrame(np.arange(15).reshape((5,3)),index=['a','b','c','d','e'],columns=['X','Y', 'Z'])
#Make some missing values
frame['X'].ix[0:2] = None
frame['Y'].ix[1:4] = None
frame['Z'].ix[3:5] = None
#This is my kludgy waterfall for the three column case.
frame['Waterfall'] = frame['X'].fillna(0) + frame['Y'].fillna(0) * frame['X'].isnull() + frame['Z'].fillna(0) * (frame['X'].isnull() & frame['Y'].isnull())
I am hoping for a solution to this problem that scales well to waterfalls of arbitrary length. If it could be more Pythonic, that would be great. Ideally, it would be a function that takes an ordered list of column labels and a dataframe as arguments and returns the desired values.
Thank you for your help.
First of all, don't use None as your missing data value. That forces all your columns to the object dtype, which will be slow. Use nan instead (this makes everything doubles, so just be careful with floating-point arithmetic).
I'd use the bfill method for fillna():
In [26]: frame.fillna(method='bfill', axis=1)['X'].fillna(0)
Out[26]:
a 1
b 5
c 6
d 9
e 12
Name: X, dtype: float64
performance:
In [27]: %timeit frame['X'].fillna(0) + frame['Y'].fillna(0) * frame['X'].isnull() + frame['Z'].fillna(0) * (frame['X'].isnull() & frame['Y'].isnull())
1000 loops, best of 3: 776 µs per loop
In [28]: %timeit frame.fillna(method='bfill', axis=1)['X']
10000 loops, best of 3: 138 µs per loop
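Building on the bfill idea, a general helper that takes an ordered list of column labels plus a default is straightforward. A sketch; the waterfall name is my own, and DataFrame.bfill is the newer spelling of fillna(method='bfill'):
import numpy as np
import pandas as pd

def waterfall(df, cols, default=0):
    # Back-fill across the chosen columns so the first non-missing value in
    # each row lands in the leftmost column, then fill remaining gaps with
    # the default.
    return df[cols].bfill(axis=1).iloc[:, 0].fillna(default)

frame = pd.DataFrame(np.arange(15, dtype=float).reshape((5, 3)),
                     index=list('abcde'), columns=['X', 'Y', 'Z'])
frame.iloc[0:2, 0] = np.nan
frame.iloc[1:4, 1] = np.nan
frame.iloc[3:5, 2] = np.nan
frame['Waterfall'] = waterfall(frame, ['X', 'Y', 'Z'])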

Fast pandas filtering

I want to filter a pandas dataframe to the rows whose name column entry is in a given list.
Here we have a DataFrame
x = DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])
Now let's say we have a list, ['sam', 'ruby'], and we want to find all rows where the name is in the list, then sum the score.
The solution I have is as follows:
total = 0
names = ['sam', 'ruby']
for name in names:
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])
However, when the dataframe gets extremely large and the list of names gets very long, everything is very, very slow.
Is there any faster alternative?
Thanks
Try using isin (thanks to DSM for suggesting loc over ix here):
In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])
In [79]: names = ['sam', 'ruby']
In [80]: x['name'].isin(names)
Out[80]:
0 True
1 True
2 False
Name: name, dtype: bool
In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541
CT Zhu suggests a faster alternative using np.in1d:
In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop
In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
If I need to search on a field, I have noticed that it helps immensely if I change the index of the DataFrame to the search field. For one of my search and lookup requirements I got a performance improvement of around 500%.
So in your case the following could be used to search and filter by name.
df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']
df_searchable = df.set_index('name')
df_searchable[df_searchable.index.isin(names)]
Update Dec-21
Updates are driven by the comments on this answer.
Looking at the details of my use case, it's not magic that is happening here. My use case was running millions of look-ups on a column with around 45k values; from what I remember, it was a lookup on US zip codes. Understandably, once set_index has incurred its one-time optimization cost, subsequent look-ups become much faster. The overall effect is magnified by the large number of look-ups, with the cost of the optimization being amortized across all of them.
The impressive performance-improvement number is essentially due to this highly amortized optimization cost.
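The pattern described above, roughly: pay the set_index (and sort) cost once, then run the many label-based look-ups against the indexed frame. A small self-contained sketch; the data and query list are made up stand-ins:
import pandas as pd

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
lookup = df.set_index('name').sort_index()    # one-time indexing/sorting cost

queries = ['sam', 'ruby', 'sam', 'jon']       # stands in for millions of look-ups
# .sum() handles both a scalar result and a Series when a name repeats
total = sum(lookup.loc[q, 'score'].sum() for q in queries)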
If your data repeats a lot of values, try using the 'categorical' data type for that column and then applying boolean filtering. Much more flexible than using indices and, at least in my case, much faster.
data = pd.read_csv('data.csv', dtype={'name':'category'})
data[(data.name=='sam')&(data.score>1)]
or
names=['sam','ruby']
data[data.name.isin(names)]
For the ~15 million row, ~200k unique terms dataset I'm working with in pandas 1.2, %timeit results are:
boolean filter on object column: 608 ms
.loc filter on same object column as index: 281 ms
boolean filter on same object column as 'categorical' type: 16 ms
From there, add the .sum() or whatever aggregation function you're looking for.
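If the column is already in memory rather than being read from CSV, it can be converted in place before filtering. A small sketch:
import pandas as pd

data = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                    columns=['name', 'score'])
data['name'] = data['name'].astype('category')       # convert the existing column

names = ['sam', 'ruby']
data.loc[data['name'].isin(names), 'score'].sum()    # 3541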
