Basically, I am performing simple operation and updating 100 columns of my dataframe of size (550 rows and 2700 columns).
I am updating 100 columns like this:
df["col1"] = df["static"]-df["col1"])/df["col1"]*100
df["col2"] = (df["static"]-df["col2"])/df["col2"]*100
df["col3"] = (df["static"]-df["col3"])/df["col3"]*100
....
....
df["col100"] = (df["static"]-df["col100"])/df["col100"]*100
This operation is taking 170 ms in my original dataframe. I want to speed up the time. I am doing some real-time thing, so time is important.
You can select all columns and subtract with right side by DataFrame.rsub with DataFrame.div only columns vby list cols`:
cols = [f'col{c}' for c in range(1, 101)]
df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
Performance:
np.random.seed(2022)
df=pd.DataFrame(np.random.randint(1001, size=(550,2700))).add_prefix('col')
df = df.rename(columns={'col0':'static'})
In [58]: %%timeit
...: for i in range(1, 101):
...: df[f"col{i}"] = (df["static"]-df[f"col{i}"])/df[f"col{i}"]*100
...:
59.9 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [59]: %%timeit
...: cols = [f'col{c}' for c in range(1, 101)]
...: df[cols] = df[cols].rsub(df['static'], axis=0).div(df[cols], axis=0)
...:
11.9 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Related
I have a large large DataFrame. And I want to add a series to every row of it.
The following is the current way for achieving my goal:
print(df.shape) # (31676, 3562)
diff = 4.1 - df.iloc[0] # I'd like to add diff to every row of df
for i in range(len(df)):
df.iloc[i] = df.iloc[i] + diff
The method takes a lot of time. Are there any other efficient way of doing this?
You can subtract Series for vectorize operation, a bit faster should be subtract by numpy array:
np.random.seed(2022)
df = pd.DataFrame(np.random.rand(1000,1000))
In [51]: %timeit df.add(4.1).sub(df.iloc[0])
7.99 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df + 4.1 - df.iloc[0]
8.46 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %timeit df + 4.1 - df.iloc[0].to_numpy()
7.59 ms ± 59.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [49]: %%timeit
...: diff = 4.1 - df.iloc[0] # I'd like to add diff to every row of df
...: for i in range(len(df)):
...: df.iloc[i] = df.iloc[i] + diff
...:
433 ms ± 50.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've got a data frame that has been binned into age groups ('AgeGroups' column), then filtered to those below the poverty level (100). I'm wondering if there is a simple way to calculate the count of those below poverty divided by the total amount of people, or poverty rate. This works but doesn't seem very pythonic.
The "PWGTP" column is the weight used to sum in this scenario.
pov_rate = df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /df.groupby('AgeGroups').sum()['PWGTP']
Thank you
It's not clear from your description why you need a groupby. The data is already binned. Why not simply create a poverty rate column?
df['pov_rate']=(df['POV']<100)*df['PWGTP']/df['PWGTP'].sum()
Some another solutions:
Filtered only column PWGTP for aggregate sum, very important if more numeric columns:
pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
df.groupby('AgeGroups')['PWGTP'].sum())
print (pov_rate)
Only one groupby with helper column filt:
pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
.groupby('AgeGroups')[['filt','PWGTP']].sum()
.eval('filt / PWGTP'))
print (pov_rate)
Performance depends of number of groups, number of matched rows, number of numeric columns and length of Dataframe, so in real data should be different.
np.random.seed(2020)
N = 1000000
df = pd.DataFrame({'AgeGroups':np.random.randint(10000,size=N),
'POV': np.random.randint(50, 500, size=N),
'PWGTP':np.random.randint(100,size=N),
'a':np.random.randint(100,size=N),
'b':np.random.randint(100,size=N),
'c':np.random.randint(100,size=N)})
# print (df)
In [13]: %%timeit
...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /
...: df.groupby('AgeGroups').sum()['PWGTP'])
...:
209 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %%timeit
...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
...: df.groupby('AgeGroups')['PWGTP'].sum())
...:
85.8 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %%timeit
...: pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
...: .groupby('AgeGroups')[['filt','PWGTP']].sum()
...: .eval('filt / PWGTP'))
...:
122 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Is there any efficient way to concatenate Pandas column name to its value. I will like to prefix all my DataFrame values with their column names.
My current method is very slow on a large dataset:
import pandas as pd
# test data
df = pd.read_csv(pd.compat.StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')
# slow method
dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]
dt = pd.DataFrame(dt, index=df.columns).T
The problem is that list compression and copying of data slows the transformation on a large dataset with lots of columns.
Is there are better way to prefix columns name to values?
here is a way without loops:
pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
Timings (fastest to slowest):
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy #piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit #OP sol
dt = [m[c].apply(lambda x:f' {c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.core.defchararray.add
from numpy.core.defchararray import add
a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)
dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt
date value data
0 date_01/01/2019 value_30 data_data1
1 date_01/01/2019 value_40 data_data2
2 date_02/01/2019 value_20 data_data1
3 date_02/01/2019 value_10 data_data2
This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)
a = df.columns.tolist()
pd.DataFrame(
[[f'{k}_{v}' for k, v in zip(a, t)]
for t in zip(*map(df.get, a))],
df.index, df.columns
)
This solution:
result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})
is as performant as the fastest solution above, and might be more readable, at least to some.
I would like to calculate the mean of age excluding the value 99. In real life the dataframe is much bigger, and I have other possible variables.
Is there a more efficient way (faster or more elegant) to do it? Maybe with a pivot table or group by, or a function?
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
numpy solution is faster - first create array and then filter:
arr = df['age'].values
not99 = arr != 99
mean_for_age = arr[not99].mean()
But if need generally solution for possible select another column use your solution:
not99 = df['age'] != 99
mean_for_age = df.loc[not99, 'age'].mean()
mean_for_age = df.loc[not99, 'another col'].mean()
Timings (depends of data, best test with real data):
data = {'age': [99,45,34,32,34,67,5,6,7,8,3,5]}
df = pd.DataFrame(data, columns = ['age'])
df = pd.concat([df] * 10000, ignore_index=True)
In [14]: %%timeit
...: arr = df['age'].values
...: not99 = arr != 99
...:
...: mean_for_age = arr[not99].mean()
...:
496 µs ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]: %%timeit
...: not99 = df['age'] != 99
...: mean_for_age = df.loc[not99, 'age'].mean()
...:
1.82 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %%timeit
...: df.query("age != 99")['age'].mean()
...:
4.26 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row each dataframe and multiply the vals if a particular condition is met. This code works for what I want
ans = []
for i in range(len(df1)):
for j in range(len(df2)):
if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
ans.append(df1['vals'][i]*df2['vals'][j])
np.sum(ans)
However, clearly this is very inefficient and in reality my DataFrames can have millions of entries making this unusable. I am also not making us of pandas or numpy efficient vector implementations. Does anyone have any ideas how to efficiently vectorize this nested loop?
I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.
You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time
#nb.njit(fastmath=True,parallel=True,error_model='numpy')
def your_function(df1_in,df1_out,df1_vals,df2_in,df2_out,df2_vals):
sum=0.
for i in nb.prange(len(df1_in)):
for j in range(len(df2_in)):
if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
sum+=df1_vals[i]*df2_vals[j]
return sum
Testing
dict1 = {'vals': np.random.randint(1, 100, 1000),
'in': np.random.randint(1, 10, 1000),
'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
'in': 5*np.random.random(1500),
'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)
# First call has some compilation overhead
res=your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
df2['in'].values, df2['out'].values, df2['vals'].values)
t1 = time.time()
for i in range(1000):
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
df2['in'].values, df2['out'].values, df2['vals'].values)
print(time.time() - t1)
Timings
vectorized solution #AGN Gazer: 9.15ms
parallelized Numba Version: 0.7ms
m1 = np.less_equal.outer(df1['in'], df2['out'])
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())
Or, replace first three lines with this long line:
m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()
For very large problems, dask is recommended.
Timing Tests:
Here is a timing comparison when using 1000 and 1500-long arrays:
In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
...: df1 = pd.DataFrame(data=dict1)
...:
...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
...: df2 = pd.DataFrame(data=dict2)
Author's original method (Python loops):
In [167]: def f(df1, df2):
...: ans = []
...: for i in range(len(df1)):
...: for j in range(len(df2)):
...: if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
...: ans.append(df1['vals'][i]*df2['vals'][j])
...: return np.sum(ans)
...:
...:
In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Ben.T method:
In [170]: %timeit df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'],1); df2['a
...: ns'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vectorized solution proposed here:
In [171]: def g(df1, df2):
...: m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
...: return np.outer(df1['vals'], df2['vals'])[m].sum()
...:
...:
In [172]: %timeit g(df1, df2)
7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your answer:
471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 1 (3+ times slower):
df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()
1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2 (2+ times slower):
ans = []
for name, row in df1.iterrows():
_in = row['in']
_out = row['out']
_vals = row['vals']
ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)
1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 3 (3+ times faster):
df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
for i in range(df1_vals.shape[0]):
df2_vals = df2.values
df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]
144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In Method 3 you can view the solution by performing:
ans[ans.nonzero()]
Out[]: array([ 50000., 80000., 160000., 60000.]
I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)
One way to do it is by using apply. Create a column in df2 containing the sum of vals in df1, meeting your criteria on in and out, multiplied by the vals of the row of df2
df2['ans']= df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
(df1['out'] >= row['in'])].sum()*row['vals'],1)
then just sum this column
df2['ans'].sum()