I want to group a table such that the first two columns remain as they were when grouped, the 3rd is the group mean, and the 4th is the group dispersion, as defined in the code. This is how I currently do it:
x = pd.DataFrame(np.array(((1,1,1,1),(1,1,10,2),(2,2,2,2),(2,2,8,3))))
   0  1   2  3
0  1  1   1  1
1  1  1  10  2
2  2  2   2  2
3  2  2   8  3
g = x.groupby(0)
res = g.mean()
res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
res
     1    2    3
0
1  1.0  5.5  6.0
2  2.0  5.0  5.5
I am looking to speed this up any way possible. In particular, if I could get rid of apply and use g only once, that would be great.
For testing purposes, this runs on data sizes of:
A few to 60 rows
1-5 groups (there could be a single group)
4 columns
Here is a mid-sized sample:
array([[ 0.00000000e+000, 4.70221520e-003, 1.14943038e-003,
3.44829114e-009],
[ 1.81557753e-011, 4.94065646e-324, 4.70221520e-003,
1.14943038e-003],
[ 2.36416931e-008, 1.97231804e-011, 9.88131292e-324,
8.43322640e-003],
[ 1.74911362e-003, 3.43575891e-009, 1.12130677e-010,
1.48219694e-323],
[ 8.43322640e-003, 1.74911362e-003, 3.42014182e-009,
1.11974506e-010],
[ 1.97626258e-323, 4.70221520e-003, 1.14943038e-003,
3.48747627e-009],
[ 1.78945412e-011, 2.47032823e-323, 4.70221520e-003,
1.14943038e-003],
[ 2.32498418e-008, 1.85476266e-010, 2.96439388e-323,
4.70221520e-003],
[ 1.14943038e-003, 3.50053798e-009, 1.85476266e-011,
3.45845952e-323],
[ 4.70221520e-003, 1.14943038e-003, 4.53241298e-008,
3.00419304e-010],
[ 3.95252517e-323, 4.70221520e-003, 1.14943038e-003,
3.55278482e-009],
[ 1.80251583e-011, 4.44659081e-323, 4.70221520e-003,
1.14943038e-003],
[ 1.09587738e-008, 1.68496045e-011, 4.94065646e-323,
4.70221520e-003],
[ 1.14943038e-003, 3.48747627e-009, 1.80251583e-011,
5.43472210e-323],
[ 4.70221520e-003, 1.14943038e-003, 3.90545096e-008,
2.63846519e-010],
[ 5.92878775e-323, 8.43322640e-003, 1.74911362e-003,
3.15465136e-009],
[ 1.04009792e-010, 6.42285340e-323, 8.43322640e-003,
1.74911362e-003],
[ 2.56120209e-010, 4.15414486e-011, 6.91691904e-323,
8.43322640e-003],
[ 1.74911362e-003, 3.43575891e-009, 1.12286848e-010,
7.41098469e-323],
[ 8.43322640e-003, 1.74911362e-003, 5.91887557e-009,
1.45863583e-010],
[ 7.90505033e-323, 8.43322640e-003, 1.74911362e-003,
3.34205639e-009],
[ 1.07133209e-010, 8.39911598e-323, 8.43322640e-003,
1.74911362e-003],
[ 1.21188587e-009, 7.07453993e-011, 8.89318163e-323,
8.43322640e-003],
[ 1.74911362e-003, 3.38890765e-009, 1.12130677e-010,
9.38724727e-323],
[ 8.43322640e-003, 1.74911362e-003, 1.79596488e-009,
8.38637515e-011]])
You can use syntactic sugar - .groupby with a Series:
res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
print (res)
     1    2    3
0
1  1.0  5.5  6.0
2  2.0  5.0  5.5
I get these timings with your array:
In [279]: %%timeit
...: res = x.groupby(0).mean()
...: res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
...:
4.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [280]: %%timeit
...: g = x.groupby(0)
...: res = g.mean()
...: res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
...:
11 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It also helps to turn off sorting by the grouping column, if possible:
In [283]: %%timeit
...: res = x.groupby(0, sort=False).mean()
...: res[3] = ((x[2] + x[3]).groupby(x[0], sort=False).max() - (x[2] - x[3]).groupby(x[0], sort=False).min())*.5
...:
4.1 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
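If you also want to build the groupby object only once, as asked, here is a sketch (my own variation, not benchmarked): precompute the sum and difference helper columns first, then every aggregation can come from the same g:
tmp = x.assign(s=x[2] + x[3], d=x[2] - x[3])
g = tmp.groupby(0, sort=False)
res = g[[1, 2]].mean()
# dispersion from the precomputed helper columns, same formula as above
res[3] = (g['s'].max() - g['d'].min()) * 0.5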
I'd like to create a multi-index DataFrame from a dictionary of DataFrames, where the top-level index is the index of the DataFrames within the dictionary and the second-level index is the keys of the dictionary.
Example
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A':pd.DataFrame([[1,3],[7,4],[5,8]], index = dt_index, columns = column_names),
'B':pd.DataFrame([[12,3],[9,8],[75,0]], index = dt_index, columns = column_names),
'C':pd.DataFrame([[3,12],[5,1],[22,5]], index = dt_index, columns = column_names)}
Expected output:
Y X
2003-05-01 A 1 3
2003-05-01 B 12 3
2003-05-01 C 3 12
2003-05-02 A 7 4
2003-05-02 B 9 8
2003-05-02 C 5 1
2003-05-03 A 5 8
2003-05-03 B 75 0
2003-05-03 C 22 5
I've tried
pd.concat(df_dict, axis=0)
but this gives me the levels of the multi-index in the incorrect order.
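For reference, this is roughly what I mean (the dictionary key ends up as the outer level):
print(pd.concat(df_dict, axis=0))
#               Y  X
# A 2003-05-01  1  3
#   2003-05-02  7  4
#   2003-05-03  5  8
# B 2003-05-01 12  3
# ...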
Edit: Timings
Based on the answers so far, this seems like a slow operation to perform as the Dataframe scales.
Larger dummy data:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
Converting the dictionary to a DataFrame, albeit with the index levels swapped, takes:
%timeit pd.concat(df_dict, axis=0)
63.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Even in the best case, creating a DataFrame with the index levels in the other order takes 8 times longer than the above!
%timeit pd.concat(df_dict, axis=0).swaplevel().sort_index()
528 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat(df_dict, axis=1).stack(0)
1.72 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use DataFrame.swaplevel with DataFrame.sort_index:
df = pd.concat(df_dict, axis=0).swaplevel(0,1).sort_index()
print (df)
Y X
2003-05-01 A 1 3
B 12 3
C 3 12
2003-05-02 A 7 4
B 9 8
C 5 1
2003-05-03 A 5 8
B 75 0
C 22 5
You can reach down into numpy for a speed-up if you can guarantee two things:
Each of your DataFrames in df_dict has the exact same index
Each of your DataFrames is already sorted
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# check if this result is consistent with other answers
assert (pd.concat(df_dict, axis=0).swaplevel().sort_index() == out).all().all()
Timing:
%%timeit
pd.concat(df_dict, axis=0)
# 26.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, 500),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# 31.2 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat(df_dict, axis=0).swaplevel().sort_index()
# 123 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
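One caveat worth adding (my assumption, not part of the original guarantees): the keys of df_dict are used in insertion order, so if they are not already sorted the result will not line up with the sort_index() version. A small sketch that sorts them first:
keys_sorted = sorted(df_dict)
out = pd.DataFrame(
    data=np.column_stack([df_dict[k] for k in keys_sorted]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict[keys_sorted[0]].index, keys_sorted]),
)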
Use concat on axis=1 and stack:
out = pd.concat(df_dict, axis=1).stack(0)
Output:
X Y
2003-05-01 A 3 1
B 3 12
C 12 3
2003-05-02 A 4 7
B 8 9
C 1 5
2003-05-03 A 8 5
B 0 75
C 5 22
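Note that the stacked result sorts the columns (X before Y, as shown). If the original Y/X order matters, a small tweak is to select the columns afterwards, e.g.:
out = pd.concat(df_dict, axis=1).stack(0)[column_names]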
I wanted to take the sigmoid of a column in my data set.
I have defined a function for it:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
but how do I apply it to all the values of the column at once?
You can use apply, but it is not a vectorized solution. The better/faster way is to use np.exp for a vectorized approach.
import math
import numpy as np
import pandas as pd

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

df = pd.DataFrame({'A': [10, 40, -6, 1, 0, -1, -60, 100, 0.2, 0.004, -0.0053]})
df['s1'] = df.A.apply(sigmoid)
df['s2'] = 1 / (1 + np.exp(-df.A))
print (df)
A s1 s2
0 10.0000 9.999546e-01 9.999546e-01
1 40.0000 1.000000e+00 1.000000e+00
2 -6.0000 2.472623e-03 2.472623e-03
3 1.0000 7.310586e-01 7.310586e-01
4 0.0000 5.000000e-01 5.000000e-01
5 -1.0000 2.689414e-01 2.689414e-01
6 -60.0000 8.756511e-27 8.756511e-27
7 100.0000 1.000000e+00 1.000000e+00
8 0.2000 5.498340e-01 5.498340e-01
9 0.0040 5.010000e-01 5.010000e-01
10 -0.0053 4.986750e-01 4.986750e-01
#110 k rows
df = pd.DataFrame({'A': [10, 40, -6, 1, 0, -1, -60, 100, 0.2, 0.004, -0.0053] * 10000})
In [15]: %timeit df['s1'] = df.A.apply(sigmoid)
57.4 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: %timeit df['s2'] = 1 / (1 + np.exp(-df.A))
2.64 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
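If SciPy is available, scipy.special.expit is another vectorized option (a sketch, not included in the timings above):
from scipy.special import expit
df['s3'] = expit(df.A)  # same values as s2, computed by SciPy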
Given a DataFrame df1 that maps names to ids:
            id
names
a       535159
b       248909
c       548731
d       362555
e       398829
f       688939
g       674128
and a second dataframe df2 which contains lists of names:
names foo
0 [a, b, c] 9
1 [d, e] 16
2 [f] 2
3 [g] 3
What would be a vectorized method for retrieving the ids from df1 for each list item in each row, like this?
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
This is a working method to achieve the same result using apply:
import pandas as pd
import numpy as np
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
df2 = df2.apply(with_apply, axis=1)
I think vectorizing this is really hard; one idea to improve performance is to map by a dictionary. This solution uses if y in d so it also works if there is no match in the dictionary:
df1 = df1.set_index('names')
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
If all values match:
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
Test for 4k rows:
np.random.seed(2020)
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [df1.loc[name]['id'] for name in row['names']]
    return row
In [8]: %%timeit
...: df2.apply(with_apply, axis=1)
...:
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %%timeit
...: d = df1['id'].to_dict()
...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
...:
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
...:
...:
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One way using operator.itemgetter:
from operator import itemgetter
def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]

d = df1.set_index("names")["id"]
df2["ids"] = df2["names"].apply(listgetter)
Output:
names foo ids
0 [a, b, c] 9 [535159, 248909, 548731]
1 [d, e] 16 [362555, 398829]
2 [f] 2 [688939]
3 [g] 3 [674128]
Benchmark on 100k rows:
d = df1.set_index("names")["id"]  # Common item
df2 = pd.concat([df2] * 25000, ignore_index=True)
%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2["ids2"] = df2["names"].apply(listgetter)
# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This seems to work:
df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])
Interested to know if this is the best approach.
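Another possibility (a hedged sketch, not benchmarked above): explode the lists, map the names through df1's 'id' column, and collect the matches back into lists. This assumes df1 has been indexed by 'names' as in the setup code.
# explode repeats the row index once per name in each list
s = df2['names'].explode().map(df1['id'])
# regroup by the original row index and rebuild the lists
df2['ids'] = s.groupby(level=0).agg(list)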
I have the following pandas table
df:
EVNT_ID  col1  col2  col3  col4
 123454     1   NaN     4     5
 628392   NaN     3   NaN     7
 293899     2   NaN   NaN     6
 127820     9    11    12    19
Now I am trying to concatenate all the columns except the first one, and I want my DataFrame to look the following way:
new_df:
EVNT_ID  col1  col2  col3  col4     new_col
 123454     1   NaN     4     5       1|4|5
 628392   NaN     3   NaN     7         3|7
 293899     2   NaN   NaN     6         2|6
 127820     9    11    12    19  9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate it if anyone could point out where I am going wrong.
Try the following code (the error comes from ~df.EVNT_ID, which tries to bitwise-invert the column's values rather than exclude the column; selecting the other columns positionally avoids this):
df['new_col'] = df.iloc[:, 1:].apply(
    lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I thought about x.dropna() instead of x if str(el) != 'nan',
but %timeit showed that dropna() works much slower.
You can do this with filter and agg:
df.filter(like='col').agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop('EVNT_ID', axis=1).agg(
    lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
    '|'.join([str(int(x)) for x in r if pd.notna(x)])
    for r in df.iloc[:, 1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here are timings for the two fastest solutions here.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
    '|'.join([str(int(x)) for x in r if pd.notna(x)])
    for r in df.iloc[:, 1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
    '|'.join([k for k in a if k])
    for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although note the answers aren't identical because #RafaelC's code produces floats: ['1.0|2.0|9.0', '3.0|11.0', ...]. If this is fine, then great. Otherwise you'll need to convert to int which adds more overhead.
Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
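For the original EVNT_ID frame, yet another hedged sketch (not timed here): stack() drops NaN by default, so the per-row join can be written as a groupby over the row labels. This assumes the EVNT_ID values are unique.
s = df.set_index('EVNT_ID').stack().astype(int).astype(str)
df['new_col'] = df['EVNT_ID'].map(s.groupby(level=0).agg('|'.join))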
import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    'date': ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
    'A': [1, np.nan, 4, 7],
    'B': [2, np.nan, 5, 8],
    'C': [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')
start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""
I have a pandas DataFrame and I want to find the minimum without zeros and NaNs.
I was trying to combine numpy's nonzero and nanmin, but it does not work.
Does someone have an idea?
If you want the minimum of the whole DataFrame, you can try this:
m = np.nanmin(df.replace(0, np.nan).values)
Use numpy.where with numpy.nanmin:
df = pd.DataFrame({'B': [4, 0, 4, 5, 5, np.nan],
                   'C': [7, 8, 9, np.nan, 2, 3],
                   'D': [1, np.nan, 5, 7, 1, 0],
                   'E': [5, 3, 0, 9, 2, 4]})
print (df)
B C D E
0 4.0 7.0 1.0 5
1 0.0 8.0 NaN 3
2 4.0 9.0 5.0 0
3 5.0 NaN 7.0 9
4 5.0 2.0 1.0 2
5 NaN 3.0 0.0 4
Numpy solution:
arr = df.values
a = np.nanmin(np.where(arr == 0, np.nan, arr))
print (a)
1.0
Pandas solution - NaNs are removed by default:
a = df.mask(df==0).min().min()
print (a)
1.0
Performance - one NaN value is added to each row (on the diagonal):
np.random.seed(123)
df = pd.DataFrame(np.random.rand(1000,1000))
np.fill_diagonal(df.values, np.nan)
print (df)
#joe answer
In [399]: %timeit np.nanmin(df.replace(0, np.nan).values)
15.3 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [400]: %%timeit
...: arr = df.values
...: a = np.nanmin(np.where(arr == 0, np.nan, arr))
...:
6.41 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [401]: %%timeit
...: df.mask(df==0).min().min()
...:
23.9 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
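For completeness, the original idea of combining a non-zero filter with nanmin can also work (a sketch): NaN compares unequal to 0, so the boolean mask keeps the NaNs while dropping the zeros, and nanmin then ignores the remaining NaNs. On the small B/C/D/E frame above this also returns 1.0.
arr = df.values
a = np.nanmin(arr[arr != 0])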