I am pulling a value out of a table, searching for it based on matches in other columns. Each call of the function takes a few seconds, and with hundreds of thousands of grid cells to go through, it adds up to hours. Is there a faster way to do this?
data_1 = data.loc[(data['test1'] == test1) & (data['test2'] == X) & (data['Column'] == col1) & (data['Row']== row1)].Value
Sample data
Column Row Value test2 test1
2 3 5 X 0TO4
2 6 10 Y 100UP
2 10 5.64 Y 10TO14
5 2 9.4 Y 15TO19
9 2 6 X 20TO24
13 11 7.54 X 25TO29
25 2 6.222 X 30TO34
It may be worth a quick read-through of the pandas enhancing performance docs to see what best fits your needs.
One option is to drop down to numpy using .values and slicing. Without seeing your actual data or use case, I created the following synthetic data:
import numpy as np
import pandas as pd

data = pd.DataFrame({'column': [np.random.randint(30) for i in range(100000)],
                     'row': [np.random.randint(50) for i in range(100000)],
                     'value': [np.random.randint(100)+np.random.rand() for i in range(100000)],
                     'test1': [np.random.choice(['X','Y']) for i in range(100000)],
                     'test2': [np.random.choice(['d','e','f','g','h','i']) for i in range(100000)]})
data.head()
column row value test1 test2
0 4 30 88.367151 X e
1 7 10 92.482926 Y d
2 1 17 11.151060 Y i
3 27 10 78.707897 Y g
4 19 35 95.204207 Y h
Then, using %timeit, I got the following results for .loc indexing, boolean masking, and numpy slicing.
(Note: at this point I realized I had missed one of the lookups, so that may affect the total times, but the ratios should hold true.)
%timeit data_1 = data.loc[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit data_1 = data[(data['test1'] == 'X') & (data['column'] >=12) & (data['row'] > 22)]['value']
13.1 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, this next part includes some overhead for converting the dataframe to a numpy array. If you convert it once and then do multiple lookups against it, this will be faster; if not, a single convert-and-slice will likely end up taking longer.
Without considering conversion time:
d1=data.values
%timeit d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
8.37 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Approximately 30% improvement
With conversion time:
%timeit d1=data.values;d1[(d1[:,3]=='X')&(d1[:,0]>=12)&(d1[:,1]>22)][:,2]
20.6 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Approximately 50% worse
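Note that .values on a mixed-dtype frame produces an object array, which blunts numpy's speed. A minimal sketch of what the convert-once, look-up-many pattern could look like (the lookup helper and variable names here are my own, built on the synthetic frame above), pulling each needed column out as a typed array with .to_numpy():

import numpy as np

# Convert the needed columns once to typed numpy arrays...
cols = data['column'].to_numpy()
rows = data['row'].to_numpy()
vals = data['value'].to_numpy()
t1 = data['test1'].to_numpy()

# ...then reuse them for many lookups without touching the DataFrame again.
def lookup(test1, col_min, row_min):
    mask = (t1 == test1) & (cols >= col_min) & (rows > row_min)
    return vals[mask]

values = lookup('X', 12, 22)   # same filter as the %timeit examples above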
You can index by test1, test2, Column and Row, and then lookup by that index.
Indexing:
data.set_index(["test1", "test2", "Column", "Row"], inplace=True)
and then lookup by doing this:
data_1 = data.loc[(test1, X, col1, row1)].Value
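If you go this route on a large frame, it usually pays to sort the index as well; lookups against an unsorted MultiIndex can be slow and may raise a PerformanceWarning. A small sketch, reusing the variable names from the question (test1, X, col1, row1 assumed to be defined):

# Build and sort the index once, then do many fast lookups against it.
data = data.set_index(["test1", "test2", "Column", "Row"]).sort_index()
data_1 = data.loc[(test1, X, col1, row1), "Value"]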
I have a dataframe
df_in = pd.DataFrame([[1,"A",32,">30"],[2,"B",12,"<10"],[3,"C",45,">=45"]],columns=['id', 'input', 'val', 'cond'])
I want to perform an operation on column "val" based on the condition in the "cond" column and get the True/False result in an "Output" column.
Expected Output:
df_out = pd.DataFrame([[1,"A",32,">30",True],[2,"B",12,"<10",False],[3,"C",45,">=45",True]],columns=['id', 'input', 'val', 'cond',"Output"])
How to do it?
You can try:
df_in['output']=pd.eval(df_in['val'].astype(str)+df_in['cond'])
OR
If you need more performance, use the method below, but also see this thread on the risks of eval; in your case I think it is safe to use:
df_in['output']=list(map(lambda x:eval(x),(df_in['val'].astype(str)+df_in['cond']).tolist()))
OR
Even more efficient and faster still:
from numpy.core import defchararray
df_in['output']=list(map(lambda x:eval(x),defchararray.add(df_in['val'].values.astype(str),df_in['cond'].values)))
output of df_in:
id input val cond output
0 1 A 32 >30 True
1 2 B 12 <10 False
2 3 C 45 >=45 True
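Side note: numpy.core is treated as a private namespace in recent NumPy releases, so the same string concatenation is normally written through the public numpy.char functions. A hedged equivalent sketch (casting both sides to a string dtype to be safe):

import numpy as np

# Same idea via the public numpy.char namespace instead of numpy.core.defchararray.
exprs = np.char.add(df_in['val'].values.astype(str), df_in['cond'].values.astype(str))
df_in['output'] = [eval(x) for x in exprs]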
Using numexpr
import numexpr
df_in['output'] = df_in.apply(lambda x: numexpr.evaluate(f"{x['val']}{x['cond']}"), axis=1 )
id input val cond output
0 1 A 32 >30 True
1 2 B 12 <10 False
2 3 C 45 >=45 True
Time Comparison (using %%timeit -n 1000):
using apply and numexpr:
865 µs ± 140 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
using pd.eval:
2.5 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Assuming I have the following Pandas DataFrame:
U A B
0 2000 10 20
1 3000 40 0
2 2100 20 30
3 2500 0 30
4 2600 30 40
How can I get the index of the first row where both A and B have non-zero values and (A+B)/2 is larger than 15?
In this example, I would like to get 2, since it is the first row with non-zero A and B and an average of 25, which is more than 15.
Note that this DataFrame is huge, so I am looking for the fastest way to get the index value.
Let's try:
df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
I find explicit variables more readable, like:
AB2 = (df['A']+df['B'])/2
filter = (df['A'] != 0) & (df['B'] != 0) & (AB2>15)
your_index = df[filter].index[0]
Performance
For this use case (an admittedly contrived dataset):
%%timeit
df[(df.A.ne(0)&df.B.ne(0))&((df.A+df.B)/2).gt(15)].first_valid_index()
1.21 ms ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
AB2 = (df['A']+df['B'])/2
filter = (df['A'].ne(0)) & (df['B'].ne(0)) & (AB2>15)
df[filter].index[0]
1.08 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]
2.71 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If the dataframe is large, query might be faster:
df.query("A!=0 and B!=0 and (A+B)/2 > 15").index[0]
2
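If the frame is very large and you only need the position of the first match, one further option (not benchmarked above, so treat it as a sketch) is to build the boolean mask as a plain NumPy array and take argmax, which avoids materialising a filtered DataFrame:

import numpy as np

# Rough sketch: compute the mask on the underlying arrays and take the position
# of the first True. argmax returns 0 when there is no True, hence the check.
A = df['A'].to_numpy()
B = df['B'].to_numpy()
mask = (A != 0) & (B != 0) & ((A + B) / 2 > 15)
first_idx = df.index[mask.argmax()] if mask.any() else None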
I am a newbie and I need some insight. Say I have a pandas dataframe as follows:
temp = pd.DataFrame()
temp['A'] = np.random.rand(100)
temp['B'] = np.random.rand(100)
temp['C'] = np.random.rand(100)
I need to write a function where I replace every value in column "C" with 0's if the value of "A" is bigger than 0.5 in the corresponding row. Otherwise I need to multiply A and B in the same row element-wise and write down the output at the corresponding row on column "C".
What I did so far, is:
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
It works just as I want it to; however, I am not sure if there's a faster way to implement this. I am particularly skeptical about the slicing; it feels redundant to use that many slices.
Or, is there a way to slice only the part that is needed, perform the calculations, and then somehow remember the indices so you can put the required values back into the original dataframe on the corresponding rows?
One way using numpy.where:
temp["C"] = np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
Benchmark (about 4x faster on the sample, and the gap keeps growing with size):
# With given sample of 100 rows
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 819 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 174 µs ± 455 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Benchmark on larger data (about 7x faster)
temp = pd.DataFrame()
temp['A'] = np.random.rand(1000000)
temp['B'] = np.random.rand(1000000)
temp['C'] = np.random.rand(1000000)
%%timeit
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
# 35.2 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0)
# 5.16 ms ± 188 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
A=temp.loc[temp['A']<0.5, 'A'].values
B=temp.loc[temp['A']<0.5, 'B'].values
temp['C'] = 0
temp.loc[temp['A']<0.5,'C']=A*B
np.array_equal(temp["C"], np.where(temp["A"]<0.5, temp["A"] * temp["B"], 0))
# True
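For completeness, the same conditional fill can also be written with pandas' own Series.where (not benchmarked here); it keeps everything as a Series and reads close to the problem statement:

# Keep A*B where A < 0.5, otherwise fall back to 0.
temp['C'] = (temp['A'] * temp['B']).where(temp['A'] < 0.5, 0)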
I am extracting maximum rainfall intensity for different durations using data with 5-minute rainfall totals. The code produces a list of the max rainfall intensity for each duration (DURS). The code works but is slow on data sets with 1,000,000+ rows. I am new to Pandas; I understand the apply() method is much faster than a for loop, but I do not know how to rewrite a for loop using the apply() method.
Example of dataframe:
Value[mm] State of value
Date_Time
2020-01-01 00:00:00 1.0 5
2020-01-01 00:05:00 0.5 5
2020-01-01 00:10:00 4.0 5
2020-01-01 00:15:00 2.0 5
2020-01-01 00:20:00 2.0 5
2020-01-01 00:25:00 0.5 5
Example of Code:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import math, numpy, array, glob
import pandas as pd
import numpy as np
pluvi_file = "rain.csv"
DURS = [5,6,10,15,20,25,30,45,60,90,120,180,270,360,540,720,1440,2880,4320]
df = pd.read_csv(pluvi_file, delimiter=',',parse_dates=[['Date','Time']])
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df.index = df['Date_Time']
del df['Date_Time']
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1['Value[mm]'].max()/DUR*60
    print(a)
    lista.append(a)
Output (Max rainfall intensity for each duration in mm/hr):
5 66.0
6 60.0
10 54.0
15 40.0
20 40.5
25 30.0
30 34.0
45 26.666666666666664
60 26.5
90 20.666666666666668
120 23.0
180 12.166666666666666
270 8.11111111111111
360 9.416666666666666
540 6.444444444444445
720 4.708333333333333
1440 3.8958333333333335
2880 2.7708333333333335
4320 2.1597222222222223
How would I re-write this using the apply() method?
Solution
It looks like apply doesn't suit here, since the functions you are applying to the groups are already vectorised methods from Essential Basic Functionality. Removing the for loop also doesn't look like a promising route for performance optimization, since there are not that many durations in your DURS list. The main cost is the grouping operation and the calculations on the groups, and there isn't much room for optimization there, at least in my opinion.
Create artificial data
import pandas as pd

df = pd.DataFrame({'Date_Time': ["2020-01-01 00:00:00",
                                 "2020-01-01 00:05:00",
                                 "2020-01-01 00:10:00",
                                 "2020-01-01 00:15:00",
                                 "2020-01-01 00:20:00",
                                 "2020-01-01 00:25:00"],
                   'Value[mm]': [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
                   'State of value': [5, 5, 5, 5, 5, 5]})

df = df.sample(3900875, replace=True).reset_index(drop=True)
Now, let's set Date_Time as the index and keep just the series we need to calculate our values:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df = df.set_index('Date_Time', drop = True)
df = df['Value[mm]']
Compare the performance of different approaches
Grouping and looping
%%timeit
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1.max()/DUR*60
    lista.append(a)
19.6 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling
The time boost here is probably noise, since it looks like the same thing is happening under the hood.
%%timeit
def get_max_by_dur(DUR):
    return df.resample(str(DUR)+"Min").sum().max()

l_a = [get_max_by_dur(dur)/dur*60 for dur in DURS]
17.2 s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling + Dask
Even though there's no way to properly vectorize this, you can still get some parallelization and optimization with Dask.
!python3 -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd
%%timeit
dd_df = dd.from_pandas(df, npartitions=1)

def get_max_by_dur(DUR):
    return dd_df.resample(str(DUR)+"Min").sum().max()

l_a = [(get_max_by_dur(dur)/dur*60).compute() for dur in DURS]
2.21 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A few words on apply and optimization
Usually, you use apply to apply a function along an axis of a DataFrame. It is a substitute for looping through the rows or columns of the DataFrame yourself, but in reality apply is just a glorified loop with some extra functionality. So, when performance matters, you usually want to optimize your code roughly in this order:
Vectorization or Essential Basic Functionality (as you did)
Cython routines or numba
List comprehension
Apply method
Iteration
Illustration
Let's say you want to get the product of two columns.
1). Vectorization or basic methods.
Basic methods:
df["product"] = df.prod(axis=1)
162 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Vectorization:
import numpy as np

def multiply(Value, State):  # you may use lambda here as well
    return Value*State

%timeit df["new_column"] = np.vectorize(multiply)(df["Value[mm]"], df["State of value"])
853 ms ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2). Cython or numba
It can be very useful when you have already written a loop: you can often just decorate it with @numba.jit and get a significant performance boost. It's also very helpful when you want to compute some iterative value that is difficult to vectorize. Since the function we chose is a plain multiplication, you won't see any benefit compared to the usual apply here (a hedged numba sketch follows the Cython snippet below).
%%cython
# cpdef (rather than cdef) so the compiled function is callable from Python:
cpdef double cython_multiply(double Value, double State):
    return Value * State

%timeit df["new_column"] = df.apply(lambda row: cython_multiply(row["Value[mm]"], row["State of value"]), axis=1)
1min 38s ± 4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
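For comparison, here is a minimal numba sketch of the same multiplication (my own illustration, not part of the original benchmark): jit-compiling an explicit loop over the underlying NumPy arrays avoids the per-row apply overhead entirely.

import numba
import numpy as np

@numba.njit
def multiply_arrays(values, states):
    # plain loop over the raw numpy arrays, compiled by numba
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        out[i] = values[i] * states[i]
    return out

df["new_column"] = multiply_arrays(df["Value[mm]"].to_numpy(),
                                   df["State of value"].to_numpy(dtype=np.float64))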
3). List comprehension.
It's pythonic and also quite similar to a for loop.
%timeit df["new_column"] = [x*y for x, y in zip(df["Value[mm]"], df["State of value"])]
1.56 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4). Apply method
Notice how slow it is.
%timeit df["new_column"] = df.apply(lambda row:row["Value[mm]"]*row["State of value"], axis = 1)
1min 37s ± 4.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
5). Looping through rows
itertuples:
%%timeit
list_a = []
for row in df.itertuples():
    list_a.append(row[2]*row[3])
df['product'] = list_a
9.81 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
iterrows (you probably shouldn't use that):
%%timeit
list_a = []
for row in df.iterrows():
    list_a.append(row[1][1]*row[1][2])
df['product'] = list_a
6min 40s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am new to Python and I am not sure how to solve the following problem.
I have a function:
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
Say I have the dataframe
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
D p
0 10 20
1 20 30
2 30 10
ch=0.2
ck=5
ch and ck are float types. Now I want to apply the formula to every row of the dataframe and return the result as an extra column 'Q'. An example (that does not work) would be:
df['Q']= map(lambda p, D: EOQ(D,p,ck,ch),df['p'], df['D'])
(this returns only a map object)
I will need this type of processing more often in my project, and I hope to find something that works.
The following should work:
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
ch=0.2
ck=5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df
If all you're doing is calculating the square root of some result, then use np.sqrt: it is vectorised and will be significantly faster:
In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
df
Out[80]:
D p Q
0 10 20 5.000000
1 20 30 5.773503
2 30 10 12.247449
Timings
For a 30k row df:
In [92]:
import math
ch=0.2
ck=5
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop
You can see that the np method is ~1900x faster.
There are a few more ways to apply a function to every row of a DataFrame.
(1) You could modify EOQ a bit by letting it accept a row (a Series object) as an argument and accessing the relevant elements by column name inside the function. Moreover, you can pass extra arguments to apply as keywords, e.g. ch or ck:
def EOQ1(row, ck, ch):
    Q = math.sqrt((2*row['D']*ck)/(ch*row['p']))
    return Q
df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
(2) It turns out that apply is often slower than a list comprehension (in the benchmark below, it's 20x slower). To use a list comprehension, you could modify EOQ still further so that it accesses elements by position. Then call the function in a loop over the df rows converted to lists:
def EOQ2(row, ck, ch):
    Q = math.sqrt((2*row[0]*ck)/(ch*row[1]))
    return Q
df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
(3) As it happens, if the goal is to call a function iteratively, map is usually faster than a list comprehension. So you could convert df into a list of lists, map the function over it, and then unpack the result into a list:
df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
(4) As @EdChum notes, it's always better to use vectorized methods if possible, instead of applying a function row by row. Pandas offers vectorized methods that rival numpy's. In the case of EOQ, for example, instead of math.sqrt you could use pandas' pow method (in the benchmark below, using pandas vectorized methods is ~20% faster than using numpy):
df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
Output:
D p Q Q_np Q1 Q2a Q2b Q_pd
0 10 20 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
1 20 30 5.773503 5.773503 5.773503 5.773503 5.773503 5.773503
2 30 10 12.247449 12.247449 12.247449 12.247449 12.247449 12.247449
Timings:
df = pd.DataFrame({"D": [10,20,30], "p": [20, 30, 10]})
df = pd.concat([df]*10000)
>>> %timeit df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
623 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q1'] = df.apply(EOQ1, ck=ck, ch=ch, axis=1)
615 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['Q2a'] = [EOQ2(x, ck, ch) for x in df[['D','p']].to_numpy().tolist()]
31.3 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q2b'] = [*map(EOQ2, df[['D','p']].to_numpy().tolist(), [ck]*len(df), [ch]*len(df))]
26.9 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df['Q_np'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
1.19 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['Q_pd'] = df['D'].mul(2*ck).div(ch*df['p']).pow(0.5)
966 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)