Is it possible to call the apply function on multiple columns in pandas, and if so, how does one do it? For example:
df['Duration'] = df['Hours', 'Mins', 'Secs'].apply(lambda x,y,z: timedelta(hours=x, minutes=y, seconds=z))
This is what the expected output should look like once everything comes together.
Thank you.
You should use:
df['Duration'] = pd.to_timedelta(df.Hours*3600 + df.Mins*60 + df.Secs, unit='s')
When you use apply on a DataFrame with axis=1, it's a row calculation, so typically this syntax makes sense:
df['Duration'] = df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins,
                                                   seconds=row.Secs), axis=1)
Some timings:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Hours': np.tile([1,2,3,4],50),
                   'Mins': np.tile([10,20,30,40],50),
                   'Secs': np.tile([11,21,31,41],50)})
%timeit pd.to_timedelta(df.Hours*3600 + df.Mins*60 + df.Secs, unit='s')
#432 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins, seconds=row.Secs), axis=1)
#12 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As always, apply should be a last resort.
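As a quick sanity check, the two approaches should agree. Here is a minimal sketch using pandas' testing helper, assuming df has the Hours/Mins/Secs columns from the question (dtype checking is relaxed in case apply returns a different dtype):
# Sketch: verify the vectorized and apply-based results agree.
fast = pd.to_timedelta(df.Hours * 3600 + df.Mins * 60 + df.Secs, unit='s')
slow = df.apply(lambda row: pd.Timedelta(hours=row.Hours, minutes=row.Mins,
                                         seconds=row.Secs), axis=1)
pd.testing.assert_series_equal(fast, slow, check_names=False, check_dtype=False)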
Use apply on the DataFrame with axis=1:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
triangles = [{ 'base': 20, 'height': 9 }, { 'base': 10, 'height': 7 }, { 'base': 40, 'height': 4 }]
triangles_df = pd.DataFrame(triangles)
def calculate_area(row):
    return row['base'] * row['height'] * 0.5
triangles_df.apply(calculate_area, axis=1)
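For a simple arithmetic rule like this one, a vectorized sketch (an aside; apply is not actually needed here) produces the same areas, writing the result to a hypothetical 'area' column:
# Sketch: column arithmetic instead of a row-wise apply.
triangles_df['area'] = triangles_df['base'] * triangles_df['height'] * 0.5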
Good luck!
This might help.
import pandas as pd
import datetime as DT
df = pd.DataFrame({"Hours": [1], "Mins": [2], "Secs": [10]})
df = df.astype(int)
df['Duration'] = df[['Hours', 'Mins', 'Secs']].apply(lambda x: DT.timedelta(hours=x['Hours'], minutes=x['Mins'], seconds=x['Secs']), axis=1)
print(df)
print(df["Duration"])
Output:
Hours Mins Secs Duration
0 1 2 10 01:02:10
0 01:02:10
dtype: timedelta64[ns]
I have a pandas DataFrame in which I want to search one column for numbers, extract them, and put them in a new column.
import pandas as pd
import regex as re
import numpy as np
data = {'numbers': ['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate': [434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is huge. Is there a better, faster method in pandas for going through my df?
Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
If you want to use re.findall, that is possible too; it is only about twice as slow:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
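One aside on the pattern itself: the unescaped . matches any character, so a string like '134XABCD' would also match. If a literal dot is intended, escaping it is safer; a sketch:
# Sketch: escape the dot so it matches only a literal '.'
pattern = r'134\.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)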
I have a simple dataframe and would like to apply a function to a particular column based on the status of another column.
myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})
I would like to multiply each observation in the score column by 10, but only if the trial name is 'A'. If it is 'B', I would like the score to remain the same.
I have a working version that does what I need, but I'm wondering if there is a way I can do this without having to create the new dataframe. Not that it makes a huge difference, but eight lines of code seems pretty long and I'm guessing there might be a simpler solution I have overlooked.
newDF = pd.DataFrame(columns = ['trial','score'])
for row in myDF.iterrows():
    if row[1][0] == 'A':
        newScore = {'trial': 'A', 'score': row[1][1]*10}
        newDF = newDF.append(newScore, ignore_index=True)
    else:
        newScore = {'trial': 'B', 'score': row[1][1]}
        newDF = newDF.append(newScore, ignore_index=True)
You can use loc to select the rows and columns to multiply by 10.
myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10
print(myDF)
trial score
0 A 10
1 B 2
2 A 30
3 B 4
4 A 50
5 B 6
It will be much faster than looping.
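Another vectorized option, shown here as a sketch, is numpy.where, which builds the whole column in one pass:
import numpy as np

# Sketch: pick score*10 where trial is 'A', otherwise keep score unchanged.
myDF['score'] = np.where(myDF['trial'].eq('A'), myDF['score'] * 10, myDF['score'])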
Pandas is pretty fast and does the job well! But functions like .iterrows() are dead slow compared to other methods. There are various articles on this topic; one such article is linked here.
Now, you can simply use the .apply() function, which works wonders - and you can even plug in any custom function!
Here is an example for your case:
myDF["score"] = myDF.apply(lambda x: x["score"] * 10 if x["trial"] == "A" else x["score"], axis=1)
You can even apply arbitrary functions using .apply and lambda, as below:
def updateScore(trial, score):
    return score if trial != 'A' else score * 10
myDF["score"] = myDF.apply(lambda x: updateScore(trial=x["trial"], score=x["score"]), axis=1)
For more details, you can check the documentation.
An alternative way to do this is using replace.
myDF = pd.DataFrame({'trial': ['A','B','A','B','A','B'], 'score': [1,2,3,4,5,6]})
myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))
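A closely related variant (a sketch) uses Series.map instead of replace; note that map returns NaN for any label missing from the dictionary, while replace leaves unmatched values as they are:
# Sketch: map each trial label to a multiplier, then multiply.
myDF['score'] = myDF['score'].mul(myDF['trial'].map({'A': 10, 'B': 1}))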
Performance
Below I tested the performance of the different solutions using a dataset of 100000 rows.
N=100000
myDF = pd.DataFrame({'trial': np.random.choice(['A', 'B'], N), 'score': np.random.choice(np.arange(0,10), N)})
Given solutions:
%timeit myDF['score'] = myDF['score'].mul(myDF['trial'].replace({'A':10, 'B':1}))
22.5 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit myDF.loc[myDF['trial'].eq('A'), 'score'] *= 10
6.4 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit myDF["score"] = myDF.apply(lambda x : x[1] * 10 if x[0] == "A" else x[1], axis = 1)
587 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def updateScore(trial, score):
    return score if trial != 'A' else score * 10
%timeit myDF["score"] = myDF.apply(lambda x: updateScore(trial=x["trial"], score=x["score"]), axis=1)
603 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Hi, I have the following DataFrame:
import pandas as pd
d = {'col1': [0.02, 0.12, -0.1, -0.07, 0.01]}
df = pd.DataFrame(data=d)
df['new'] = ''
df['new'].iloc[0] = 100
df
Beginning at row 1, I want to fill column 'new' with the previous value of 'new' divided by (the value of 'col1' + 1).
For example, in row one, column new: 100/(0.12+1) = 89.286
For example, in row two, column new: 89.286/(-0.10+1) = 99.206
and so on.
I already tried to use a lambda function, without success. Thanks for the help.
Try this:
df['new'].iloc[0] = 100
for i in range(1, df.shape[0]):
    prev = df['new'].iloc[i-1]
    df['new'].iloc[i] = prev/(df['col1'].iloc[i]+1)
Output:
col1 new
-------------------
0 0.02 100
1 0.12 89.2857
2 -0.10 99.2063
3 -0.07 106.673
4 0.01 105.617
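As an aside (not from the answers above), this particular recurrence new[i] = new[i-1] / (col1[i] + 1) does admit a vectorized form, because new[i] is just 100 divided by a running product of (col1 + 1). A sketch, assuming the starting value 100 belongs in row 0:
# Sketch: new[i] = 100 / prod(1 + col1[1..i]), expressed with a cumulative product.
factors = (df['col1'] + 1).cumprod()
df['new'] = 100 * (df['col1'].iloc[0] + 1) / factors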
I think numba is the way to handle loops here if performance is important:
from numba import jit
d = {'col1': [0.02, 0.12, -0.1, -0.07, 0.01]}
df = pd.DataFrame(data=d)
df.loc[0, 'new'] = 100
@jit(nopython=True)
def f(a, b):
    for i in range(1, a.shape[0]):
        a[i] = a[i-1] / (b[i] + 1)
    return a
df['new'] = f(df['new'].to_numpy(), df['col1'].to_numpy())
print (df)
col1 new
0 0.02 100.000000
1 0.12 89.285714
2 -0.10 99.206349
3 -0.07 106.673494
4 0.01 105.617321
Performance for 5000 rows:
d = {'col1': [0.02, 0.12, -0.1, -0.07, 0.01]}
df = pd.DataFrame(data=d)
df = pd.concat([df] * 1000, ignore_index=True)
In [168]: %timeit df['new'] = f(df['new'].to_numpy(), df['col1'].to_numpy())
277 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: %%timeit
    ...: for i in range(1, df.shape[0]):
    ...:     prev = df['new'].iloc[i-1]
    ...:     df['new'].iloc[i] = prev/(df['col1'].iloc[i]+1)
    ...:
1.31 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [170]: %%timeit
    ...: for i_row, row in df.iloc[1:, ].iterrows():
    ...:     df.loc[i_row, 'new'] = df.loc[i_row - 1, 'new'] / (row['col1'] + 1)
    ...:
2.08 s ± 93.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I don't see a vectorized solution. Here's a plain loop:
df['new'] = 100
for i_row, row in df.iloc[1:, ].iterrows():
    df.loc[i_row, 'new'] = df.loc[i_row - 1, 'new'] / (row['col1'] + 1)
I have a large dataframe where each row contains a string.
I want to split each string into several columns, and also replace two character types.
The code below does the job, but it is slow on a large dataframe. Is there a faster way than using a for loop?
import re
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = pd.DataFrame({'col1': [0,0], 'col2': [0,0], 'col3': [0,0]})
for i in range(df.shape[0]):
    df_new.iloc[i, :] = re.split(',', df.iloc[i, 0].replace('[', '').replace(']', ''))
You can do it with:
import pandas as pd
df = pd.DataFrame(['[3.4, 3.4, 2.5]', '[3.4, 3.4, 2.5]'])
df_new = df[0].str[1:-1].str.split(",", expand=True)
df_new.columns = ["col1", "col2", "col3"]
The idea is to first get rid of the [ and ] and then split by , and expand the dataframe. The last step would be to rename the columns.
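One caveat: the expanded columns still hold strings. If numeric values are needed downstream, a cast can follow; a sketch:
# Sketch: convert the split string columns to floats.
df_new = df_new.astype(float)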
Your solution can be simplified with Series.str.strip and Series.str.split:
df1 = df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
print(df1)
col0 col1 col2
0 3.4 3.4 2.5
1 3.4 3.4 2.5
If performance is important, use a list comprehension instead of the pandas string functions:
df1 = pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
Timings:
#20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [208]: %timeit df[0].str.strip('[]').str.split(', ', expand=True).add_prefix('col')
61.5 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit pd.DataFrame([x.strip('[]').split(', ') for x in df[0]]).add_prefix('col')
29.8 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
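Since these strings happen to be valid Python list literals, another option (a sketch, likely slower but robust to stray spaces) is to parse them with ast.literal_eval, which yields float columns directly; df_parsed is a hypothetical name:
import ast

# Sketch: parse each list-literal string, then expand into columns.
df_parsed = pd.DataFrame(df[0].map(ast.literal_eval).tolist()).add_prefix('col')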
I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row of each dataframe and multiply the vals if a particular condition is met. This code does what I want:
ans = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])
np.sum(ans)
However, clearly this is very inefficient, and in reality my DataFrames can have millions of entries, making this unusable. I am also not making use of pandas' or numpy's efficient vectorized implementations. Does anyone have ideas on how to efficiently vectorize this nested loop?
I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.
You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time
@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    sum = 0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
                sum += df1_vals[i]*df2_vals[j]
    return sum
Testing
dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)
# First call has some compilation overhead
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                    df2['in'].values, df2['out'].values, df2['vals'].values)
t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                        df2['in'].values, df2['out'].values, df2['vals'].values)
print(time.time() - t1)
Timings:
vectorized solution (@AGN Gazer): 9.15 ms
parallelized Numba version: 0.7 ms
m1 = np.less_equal.outer(df1['in'], df2['out'])
m2 = np.greater_equal.outer(df1['out'], df2['in'])
m = np.logical_and(m1, m2)
v12 = np.outer(df1['vals'], df2['vals'])
print(v12[m].sum())
Or, condense the whole computation into two lines:
m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
s = np.outer(df1['vals'], df2['vals'])[m].sum()
For very large problems, dask is recommended.
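If the full outer products do not fit in memory, the same idea can also run block-wise over df1; a sketch, where the block size is an arbitrary assumption to tune:
# Sketch: accumulate the masked outer product block by block.
total = 0.0
block = 10000  # arbitrary; tune to available memory
in1, out1, v1 = df1['in'].to_numpy(), df1['out'].to_numpy(), df1['vals'].to_numpy()
in2, out2, v2 = df2['in'].to_numpy(), df2['out'].to_numpy(), df2['vals'].to_numpy()
for start in range(0, len(in1), block):
    stop = start + block
    m = (np.less_equal.outer(in1[start:stop], out2)
         & np.greater_equal.outer(out1[start:stop], in2))
    total += np.outer(v1[start:stop], v2)[m].sum()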
Timing Tests:
Here is a timing comparison when using 1000 and 1500-long arrays:
In [166]: dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
...: df1 = pd.DataFrame(data=dict1)
...:
...: dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
...: df2 = pd.DataFrame(data=dict2)
Author's original method (Python loops):
In [167]: def f(df1, df2):
     ...:     ans = []
     ...:     for i in range(len(df1)):
     ...:         for j in range(len(df2)):
     ...:             if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
     ...:                 ans.append(df1['vals'][i]*df2['vals'][j])
     ...:     return np.sum(ans)
     ...:
In [168]: %timeit f(df1, df2)
47.3 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Ben.T's method:
In [170]: %timeit df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) & (df1['out'] >= row['in'])].sum()*row['vals'], 1); df2['ans'].sum()
2.22 s ± 40.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Vectorized solution proposed here:
In [171]: def g(df1, df2):
     ...:     m = np.less_equal.outer(df1['in'], df2['out']) & np.greater_equal.outer(df1['out'], df2['in'])
     ...:     return np.outer(df1['vals'], df2['vals'])[m].sum()
     ...:
In [172]: %timeit g(df1, df2)
7.81 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your answer:
471 µs ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 1 (3+ times slower):
df1.apply(lambda row: list((df2['vals'][(row['in'] <= df2['out']) & (row['out'] >= df2['in'])] * row['vals'])), axis=1).sum()
1.56 ms ± 7.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2 (2+ times slower):
ans = []
for name, row in df1.iterrows():
    _in = row['in']
    _out = row['out']
    _vals = row['vals']
    ans.append(df2['vals'].loc[(df2['in'] <= _out) & (df2['out'] >= _in)].values * _vals)
1.01 ms ± 8.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 3 (3+ times faster):
df1_vals = df1.values
ans = np.zeros(shape=(len(df1_vals), len(df2.values)))
for i in range(df1_vals.shape[0]):
    df2_vals = df2.values
    df2_vals[:, 2][~np.logical_and(df1_vals[i, 1] >= df2_vals[:, 0], df1_vals[i, 0] <= df2_vals[:, 1])] = 0
    ans[i, :] = df2_vals[:, 2] * df1_vals[i, 2]
144 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In Method 3 you can view the solution by performing:
ans[ans.nonzero()]
Out[]: array([ 50000., 80000., 160000., 60000.])
I wasn't able to think of a way to remove the underlying loop :( but I learnt a lot about numpy in the process! (yay for learning)
One way to do it is by using apply: create a column in df2 containing the sum of the vals in df1 that meet your criteria on in and out, multiplied by the vals of the row of df2:
df2['ans'] = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
                                               (df1['out'] >= row['in'])].sum()*row['vals'], 1)
Then just sum this column:
df2['ans'].sum()
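If the intermediate column is not needed, the same computation collapses into one expression; a sketch:
# Sketch: the same apply-based computation without storing the 'ans' column.
total = df2.apply(lambda row: df1['vals'][(df1['in'] <= row['out']) &
                                          (df1['out'] >= row['in'])].sum()*row['vals'], axis=1).sum()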