How can I find the difference between two rows and divide the result by the sum of the two rows? - python

How can I find the difference between two rows and divide the result by the sum of the two rows?
Here is how I do it in Excel; this is the formula I want to replicate in Python:
=ABS(((B3-B2)/(B3+B2)/2)/((A3-A2)/(A3+A2)/2))
I know the difference can be calculated with df.diff(), but I can't figure out how to do the sum.
import pandas as pd
data = {'Price':[50,46],'Quantity':[3,6]}
df = pd.DataFrame(data)
print(df)

You can use rolling.sum with a window size of 2:
(df.diff()/df.rolling(2).sum()).eval('abs(Quantity/Price)')
0 NaN
1 8.0
dtype: float64
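As a quick sanity check of the 8.0: the /2 factors in the Excel formula cancel when you take the ratio, so a small worked example with the sample data (not part of the original answer) gives the same number:
# Price goes 50 -> 46, Quantity goes 3 -> 6 in the sample data
price_ratio = (46 - 50) / (46 + 50)       # -4 / 96
quantity_ratio = (6 - 3) / (6 + 3)        #  3 / 9
print(abs(quantity_ratio / price_ratio))  # 8.0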

Basically, once you have the diff you already have the two-row sum:
since diff = x[2] - x[1], the sum is x[2] + x[1] = 2*x[2] - (x[2] - x[1]).
In your case the sum can therefore be calculated by
df*2-df.diff()
Out[714]:
Price Quantity
0 NaN NaN
1 96.0 9.0
So the output is
(df.diff()/(df*2-df.diff())).eval('abs(Quantity/Price)')
Out[718]:
0 NaN
1 8.0
dtype: float64
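A quick check (a sketch, using the df defined in the question) that the identity really matches the rolling sum:
# both print Price 96.0 and Quantity 9.0 in row 1 (row 0 is NaN)
print(df * 2 - df.diff())
print(df.rolling(2).sum())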

For small dataframes the use of .eval() is not efficient.
The following is faster up to roughly 100,000 rows:
df = (df.diff() / df.rolling(2).sum()).div(2)
df['result'] = abs(df.Quantity / df.Price)
32.9 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
vs.
39.6 ms ± 931 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
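For reference, a minimal sketch of how such a timing comparison might be set up; the 100,000-row random frame below is an assumption, not the benchmark data the answer actually used:
import numpy as np
import pandas as pd

# hypothetical benchmark frame (the original benchmark data was not shown)
df = pd.DataFrame({'Price': np.random.randint(1, 100, 100_000),
                   'Quantity': np.random.randint(1, 100, 100_000)})

# eval-based variant -- wrap each variant in %timeit to reproduce the comparison
res_eval = (df.diff() / df.rolling(2).sum()).eval('abs(Quantity/Price)')

# plain-pandas variant
ratios = (df.diff() / df.rolling(2).sum()).div(2)
res_plain = abs(ratios.Quantity / ratios.Price)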

Related

Count elements in defined groups in pandas dataframe

Say I have a dataframe and I want to count how many times each element of a list, e.g. [1,5,2], occurs in a column (or in each column).
I could do something like
elem_list = [1,5,2]
for e in elem_list:
    (df["col1"] == e).sum()
but isn't there a better way like
elem_list = [1,5,2]
df["col1"].count_elements(elem_list)
#1 5 # 1 occurs 5 times
#5 3 # 5 occurs 3 times
#2 0 # 2 occurs 0 times
Note it should count all the elements in the list, and return "0" if an element in the list is not in the column.
You can use value_counts and reindex:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})
elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)
output:
1 5
5 2
2 0
benchmark (100k values):
# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})
df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pass the column to Categorical, which will return 0 for any missing item:
pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]:
1 3
5 0
2 1
dtype: int64
First filter with Series.isin and DataFrame.loc, then use Series.value_counts; finally, if the order is important, add Series.reindex:
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
You could do something like this:
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()
# col1
# 1 17
# 0 10
# dtype: int64
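If you want something close to the count_elements call sketched in the question, a small wrapper around the value_counts/reindex idiom does it; the helper name below is the asker's hypothetical, not an existing pandas method:
import pandas as pd

def count_elements(s, elem_list):
    # count each element of elem_list in s, filling 0 for absent elements
    return s.value_counts().reindex(elem_list, fill_value=0)

df = pd.DataFrame({'col1': [1, 1, 5, 1, 5, 1, 1, 4, 3]})
print(count_elements(df['col1'], [1, 5, 2]))
# 1    5
# 5    2
# 2    0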

Getting NaN for dividing each row value by row sum

I'm trying something very simple. I have a DataFrame made of 1s and 0s, and I want to divide each value in a row by the sum of that row, so that each value becomes a weight and every row sums to 1.
trading_signal sample:

            btc  eth
2021/08/25    1    0
2021/08/26    1    1
2021/08/27    0    0

position expected output:

            btc  eth
2021/08/25    1    0
2021/08/26  0.5  0.5
2021/08/27    0    0
I imagined it would just be
positions = trading_signals / trading_signals.sum(axis=1)
But the positions df is just populated with NaNs.
You need to divide on axis=0, which is not the default with /. Use div instead:
df.div(df.sum(axis=1), axis=0)
NB. division by 0 will give you NaNs, so add .fillna(0) to fill with 0
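Applied to the sample above (a short sketch that reconstructs the trading_signal frame; the date index is kept as plain strings for brevity):
import pandas as pd

trading_signals = pd.DataFrame({'btc': [1, 1, 0], 'eth': [0, 1, 0]},
                               index=['2021/08/25', '2021/08/26', '2021/08/27'])

positions = trading_signals.div(trading_signals.sum(axis=1), axis=0).fillna(0)
print(positions)
#             btc  eth
# 2021/08/25  1.0  0.0
# 2021/08/26  0.5  0.5
# 2021/08/27  0.0  0.0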
Several options + comparison of running times:
apply
df.apply(lambda row: row / row.sum(axis=0), axis=1).fillna(0)
1.18 ms ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
transpose
(df.T / df.T.sum()).T.fillna(0)
1.54 ms ± 843 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
div
df.div(df.sum(axis=1), axis=0).fillna(0)
576 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Option 3 has the lowest runtime, probably because it makes the most of pandas' vectorization capabilities.
All three give the same result (the expected output shown in the question).

Using apply() rather than for loop - Pandas

I am extracting maximum rainfall intensity for different durations using data with 5-minute rainfall totals. The code produces a list of max rainfall intensity for each duration (DURS). The code works but is slow on data sets with 1,000,000+ rows. I am new to Pandas and I understand the apply() method is much faster than a for loop, but I do not know how to rewrite my for loop using apply().
Example of dataframe:
Value[mm] State of value
Date_Time
2020-01-01 00:00:00 1.0 5
2020-01-01 00:05:00 0.5 5
2020-01-01 00:10:00 4.0 5
2020-01-01 00:15:00 2.0 5
2020-01-01 00:20:00 2.0 5
2020-01-01 00:25:00 0.5 5
Example of Code:
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import math, numpy, array, glob
import pandas as pd
import numpy as np
pluvi_file = "rain.csv"
DURS = [5,6,10,15,20,25,30,45,60,90,120,180,270,360,540,720,1440,2880,4320]
df = pd.read_csv(pluvi_file, delimiter=',',parse_dates=[['Date','Time']])
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df.index = df['Date_Time']
del df['Date_Time']
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1['Value[mm]'].max()/DUR*60
    print(a)
    lista.append(a)
Output (Max rainfall intensity for each duration in mm/hr):
5 66.0
6 60.0
10 54.0
15 40.0
20 40.5
25 30.0
30 34.0
45 26.666666666666664
60 26.5
90 20.666666666666668
120 23.0
180 12.166666666666666
270 8.11111111111111
360 9.416666666666666
540 6.444444444444445
720 4.708333333333333
1440 3.8958333333333335
2880 2.7708333333333335
4320 2.1597222222222223
How would I re-write this using the apply() method?
Solution
Apply doesn't really suit here, since the functions you call on the groups are already vectorized methods from pandas' Essential Basic Functionality. Removing the for loop also doesn't look like a promising route for optimization: there aren't many durations in your DURS list, so the main cost is the grouping operation and the calculations on the groups, and there isn't much room for optimization there, at least in my opinion.
Create artificial data
import pandas as pd
df = pd.DataFrame({'Date_Time': ["2020-01-01 00:00:00",
                                 "2020-01-01 00:05:00",
                                 "2020-01-01 00:10:00",
                                 "2020-01-01 00:15:00",
                                 "2020-01-01 00:20:00",
                                 "2020-01-01 00:25:00"],
                   'Value[mm]': [1.0, 0.5, 4.0, 2.0, 2.0, 0.5],
                   'State of value': [5, 5, 5, 5, 5, 5]})
df = df.sample(3900875, replace=True).reset_index(drop=True)
Now, let's set Date_Time as the index and keep just the series we need to calculate our values:
df['Date_Time'] = pd.to_datetime(df['Date_Time'], dayfirst=True)
df = df.set_index('Date_Time', drop = True)
df = df['Value[mm]']
Compare the performance of different approaches
Grouping and looping
%%timeit
lista = []
for DUR in DURS:
    x = str(DUR)+' Min'
    df1 = df.groupby(pd.Grouper(freq=x)).sum()
    a = df1.max()/DUR*60
    lista.append(a)
19.6 s ± 439 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling
The time boost is probably just noise here, since it looks like the same thing is happening under the hood.
%%timeit
def get_max_by_dur(DUR):
    return df.resample(str(DUR)+"Min").sum().max()
l_a = [get_max_by_dur(dur)/dur*60 for dur in DURS]
17.2 s ± 559 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Resampling + Dask
Even though there's no way to properly vectorize this, you can still get some parallelization and optimization with Dask.
!python3 -m pip install "dask[dataframe]" --upgrade
import dask.dataframe as dd
%%timeit
dd_df = dd.from_pandas(df, npartitions=1)
def get_max_by_dur(DUR):
    return dd_df.resample(str(DUR)+"Min").sum().max()
l_a = [(get_max_by_dur(dur)/dur*60).compute() for dur in DURS]
2.21 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A few words on apply and optimization
Usually you use apply to apply a function along an axis of a DataFrame. It's the substitute for looping through the rows or columns of the DataFrame yourself, but in reality apply is just a glorified loop with some extra functionality. So when performance matters, you usually want to prefer your options in roughly this order:
1. Vectorization or Essential Basic Functionality (as you did)
2. Cython routines or numba
3. List comprehension
4. Apply method
5. Iteration
Illustration
Let's say you want to get the product of two columns.
1). Vectorization or basic methods.
Basic methods:
df["product"] = df.prod(axis=1)
162 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Vectorization:
import numpy as np
def multiply(Value, State):  # you may use a lambda here as well
    return Value*State

%timeit df["new_column"] = np.vectorize(multiply)(df["Value[mm]"], df["State of value"])
853 ms ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2). Cython or numba
It can be very useful if you have already written a loop: you can often just decorate it with @numba.jit and get a significant performance boost. It's also very helpful when you want to compute some iterative value that is difficult to vectorize.
Since the function we chose is a plain multiplication, you won't see any benefit compared to the usual apply.
%%cython
cpdef double cython_multiply(double Value, double State):
    return Value * State
%timeit df["new_column"] = df.apply(lambda row: cython_multiply(row["Value[mm]"], row["State of value"]), axis=1)
1min 38s ± 4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
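For completeness, a minimal sketch of what a numba-jitted version of the same multiplication might look like (assuming numba is installed; this is not part of the original benchmark and no timing is claimed):
import numba
import numpy as np

@numba.njit
def multiply_arrays(values, states):
    # a plain Python loop, compiled to machine code by numba
    out = np.empty(values.shape[0])
    for i in range(values.shape[0]):
        out[i] = values[i] * states[i]
    return out

df["new_column"] = multiply_arrays(df["Value[mm]"].to_numpy(dtype=np.float64),
                                   df["State of value"].to_numpy(dtype=np.float64))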
3). List comprehension.
It's pythonic and also quite similar to a for loop.
%timeit df["new_column"] = [x*y for x, y in zip(df["Value[mm]"], df["State of value"])]
1.56 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4). Apply method
Notice how slow it is.
%timeit df["new_column"] = df.apply(lambda row:row["Value[mm]"]*row["State of value"], axis = 1)
1min 37s ± 4.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
5). Looping through rows
itertuples:
%%timeit
list_a = []
for row in df.itertuples():
    list_a.append(row[2]*row[3])
df['product'] = list_a
9.81 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
iterrows (you probably shouldn't use that):
%%timeit
list_a = []
for row in df.iterrows():
    list_a.append(row[1][1]*row[1][2])
df['product'] = list_a
6min 40s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Applying group-specific function that returns a single series

I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll defined below:
import datetime
import pandas as pd
demoA = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['A', 'A', 'A'],
                      'x1': [10, 20, 30],
                      'close': [120, 133, 129]}).set_index('date', drop=True)
demoB = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['B', 'B', 'B'],
                      'x1': [18, 11, 45],
                      'close': [50, 49, 51]}).set_index('date', drop=True)
demoAll = pd.concat([demoA, demoB])
print(demoAll)
The result is:
ticker x1 close
date
2010-01-01 A 10 120
2010-01-02 A 20 133
2010-01-03 A 30 129
2010-01-01 B 18 50
2010-01-02 B 11 49
2010-01-03 B 45 51
I also have a dictionary mapping of tickers to model objects
ticker2model = {'A':model_A, 'B':model_B,...}
where each model has a self.predict(df) method that takes in an entire dataframe and returns a series of the same length.
I now would like to create a new column, demoAll['predictions'], that corresponds to these predictions. What is the cleanest/most-efficient way of doing this? A few things to note:
demoAll was the concatenation of ticker-specific dataframes that were each indexed just by date. Thus the indices of demoAll are not unique. (However, the combination of date/ticker IS unique.)
My thinking has been to do something like the example below, but I'm running into issues with indexing, data-type coercions, and slow run times. The real dataset is quite large (both rows and columns).
demoAll['predictions'] = demoAll.groupby('ticker').apply(
    lambda x: ticker2model[x.name].predict(x)
)
I may have misunderstood what you are passing to the model in order to predict, but if I have understood correctly I would do the following:
pre-allocate the predictions column of demoAll
loop through the unique values of ticker and filter demoAll
filter out the ticker column
predict the result using the filtered df
save the results in the correct locations in demoAll['predictions']
An Example using your code:
# get non 'ticker' columns
non_ticker_cols = [col for col in demoAll.columns if col != 'ticker']
# get unique set of tickers
tickers = demoAll.ticker.unique()
# create and prepopulate the predictions column
demoAll['predictions'] = 0
for ticker in tickers:
    # get boolean Series to filter the DataFrames by
    filter_by_ticker = demoAll.ticker == ticker
    # filter, predict and allocate
    demoAll.loc[filter_by_ticker, 'predictions'] = ticker2model[ticker].predict(
        demoAll.loc[filter_by_ticker, non_ticker_cols]
    )
The output would look like:
ticker x1 close predictions
date
2010-01-01 A 10 120 10.0
2010-01-02 A 20 133 10.0
2010-01-03 A 30 129 10.0
2010-01-01 B 18 50 100.0
2010-01-02 B 11 49 100.0
2010-01-03 B 45 51 100.0
Comparison to using apply
We could use apply per row, but as you mentioned it would be slow. I will compare the two to give an idea of the speedup.
Setup
I will use DummyRegressor from sklearn to allow me to call a predict method and create the dictionary you mention in your question.
from sklearn.dummy import DummyRegressor
import numpy as np

model_a = DummyRegressor(strategy='mean')
model_b = DummyRegressor(strategy='median')
model_a.fit([[10, 14]], y=np.array([10]))
model_b.fit([[200, 200]], [100])
ticker2model = {'A': model_a, 'B': model_b}
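For context, these dummy models simply return the constant they learned at fit time, which is where the 10.0 and 100.0 values in the output table above come from:
print(model_a.predict([[1, 2]]))  # [10.]  (strategy='mean', fitted on y=[10])
print(model_b.predict([[3, 4]]))  # [100.] (strategy='median', fitted on y=[100])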
Defining both as functions
def predict_by_ticker_filter(df, model_dict):
    # get non 'ticker' columns
    non_ticker_cols = [col for col in df.columns if col != 'ticker']
    # get unique set of tickers
    tickers = df.ticker.unique()
    # create and prepopulate the predictions column
    df['predictions'] = 0
    for ticker in tickers:
        # get boolean Series to filter the DataFrames by
        filter_by_ticker = df.ticker == ticker
        # filter, predict and allocate
        df.loc[filter_by_ticker, 'predictions'] = model_dict[ticker].predict(
            df.loc[filter_by_ticker, non_ticker_cols]
        )
    return df
def model_apply_by_row(df_row, model_dict):
    # includes some conversions to list to allow the predict method to run
    return model_dict[df_row['ticker']].predict([df_row[['x1','close']].tolist()])[0]
Performance
I use timeit on the function call to give the following results
On your example demoAll:
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.78 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.24 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (606, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
320 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.1 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (6006, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.15 s ± 371 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
9.1 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
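For reference, a minimal self-contained sketch that reconstructs the dummy frames above and applies the accepted map approach to both columns:
import pandas as pd

df = pd.DataFrame({'idxA': [0, 0, 2, 2],
                   'idxB': [1, 2, 4, 1],
                   'var2': [2.0, 3.0, 2.0, 1.0]})
labels = pd.Series(list('ABCDE'), name='label')

for c in ['idxA', 'idxB']:
    df[c] = df[c].map(labels)

print(df)
#   idxA idxB  var2
# 0    A    B   2.0
# 1    A    C   3.0
# 2    C    E   2.0
# 3    C    B   1.0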
