Fastest way to perform calculation between two dataframe columns?

I have a pandas dataframe with 6 million rows. The columns are:
['x', 'y']
I need to apply a simple calculation between x and y, and append the result to the dataframe.
This is what I've tried:
import math
import numpy as np

def pressure_to_elevation(P, T=None):
    '''
    Calculates the height of a pressure level in feet
    '''
    sea_level_pressure = 1013.25
    if T is not None:
        # https://www.omnicalculator.com/physics/air-pressure-at-altitude
        P0 = sea_level_pressure
        g = 9.80665
        M = 0.0289644
        R0 = 8.31447
        m = (np.log(P/P0)*T) / -(g*M/R0)
        f = 3.28084 * m
        return f
    b = 0.190284
    c = 145366.45
    return (1-math.pow((P/sea_level_pressure), b)) * c

test_df['result'] = test_df.apply(lambda row: pressure_to_elevation(row['x'], row['y']), axis=1)
Unfortunately, this takes a ridiculous amount of time... in fact, I've yet to see it complete.
Is there a faster way to do this?

Try this:
def pressure_to_elevation(P, T):
    sea_level_pressure = 1013.25
    P0 = sea_level_pressure
    g = 9.80665
    M = 0.0289644
    R0 = 8.31447
    b = 0.190284
    c = 145366.45
    # np.where picks the temperature-based formula where T is present,
    # and the pressure-only formula where T is missing
    return np.where(T.notnull(),
                    3.28084 * ((np.log(P/P0)*T) / -(g*M/R0)),
                    (1-np.power((P/sea_level_pressure), b)) * c)
Usage:
test_df['result'] = pressure_to_elevation(test_df['x'], test_df['y'])

I believe if you break this out into separate steps and avoid iterating through the entire dataframe, the speed will increase dramatically. Give the following a shot.
sea_level_pressure = 1013.25
test_df['result_1'] = (test_df['x']/sea_level_pressure)
test_df['result_1'] = test_df['result_1']**0.190284
test_df['result_1'] = (1 - test_df['result_1'])*145366.45
test_df['result_2'] = 3.28084*((np.log(test_df['x']/sea_level_pressure)*test_df['y'])/(-1*(9.80665*0.0289644/8.31447)))
test_df['final_result'] = np.where(pd.isnull(test_df['y']), test_df['result_1'], test_df['result_2'])
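For reference, here is a minimal self-contained sketch of the vectorized approach, run on a tiny made-up frame. The three sample rows and the constant name SEA_LEVEL_PRESSURE are assumptions for illustration only; the formulas are the ones from the question.

```python
import numpy as np
import pandas as pd

SEA_LEVEL_PRESSURE = 1013.25  # hypothetical module-level constant

def pressure_to_elevation(P, T):
    """Height of a pressure level in feet, computed on whole columns at once."""
    # Formula used when a temperature is available
    with_temp = 3.28084 * (np.log(P / SEA_LEVEL_PRESSURE) * T) / -(9.80665 * 0.0289644 / 8.31447)
    # Pressure-only formula used when the temperature is missing (NaN)
    without_temp = (1 - (P / SEA_LEVEL_PRESSURE) ** 0.190284) * 145366.45
    return np.where(T.notna(), with_temp, without_temp)

# Made-up sample data: the second row has no temperature
test_df = pd.DataFrame({'x': [1013.25, 900.0, 850.0],
                        'y': [288.15, np.nan, 280.0]})
test_df['result'] = pressure_to_elevation(test_df['x'], test_df['y'])
print(test_df)
```

Both branches are evaluated for every row and np.where then selects per row, so no Python-level loop runs over the 6 million rows.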

Related

Barrier Option Pricing in Python

We have a European barrier call option with strike price K > 0 and a barrier value
0 < b < S_0,
where S_0 is the starting price. According to the contract, at the times 0 < t_1 < ... < t_k < T the price must satisfy S(t_k) > b for every k.
Assume S(t) follows the binomial option model with u = 1.1, d = 0.9, r = 0.05, T = 10, and that the asset must be checked at t_1 = 2, t_2 = 4 and t_3 = 7. Also take S_0 = 100, K = 125 and the barrier b = 60.
My attempt is the following :
import numpy as np

# Initialise parameters
S0 = 100 # initial stock price
K = 125 # strike price
T = 10 # time to maturity in years
b = 60 # up-and-out barrier price/value
r = 0.05 # annual risk-free rate
N = 4 # number of time steps
u = 1.1 # up-factor in binomial models
d = 0.9 # ensure recombining tree
opttype = 'C' # Option Type 'C' or 'P'

def barrier_binomial(K,T,S0,b,r,N,u,d,opttype='C'):
    #precompute values
    dt = T/N
    q = (1+r - d)/(u-d)
    disc = np.exp(-r*dt)
    # initialise asset prices at maturity
    S = S0 * d**(np.arange(N,-1,-1)) * u**(np.arange(0,N+1,1))
    # option payoff
    if opttype == 'C':
        C = np.maximum( S - K, 0 )
    else:
        C = np.maximum( K - S, 0 )
    # check terminal condition payoff
    C[S >= b] = 0
    # backward recursion through the tree
    for i in np.arange(N-1,-1,-1):
        S = S0 * d**(np.arange(i,-1,-1)) * u**(np.arange(0,i+1,1))
        C[:i+1] = disc * ( q * C[1:i+2] + (1-q) * C[0:i+1] )
        C = C[:-1]
        C[S >= H] = 0
    return C[0]

barrier_binomial(K,T,S0,b,r,N,u,d,opttype='C')
I get no result, so something is wrong, but I don't know what.
Also, does this count as a simulation?
Any help would be appreciated.

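In your loop you are using C[S >= H] = 0, but your barrier parameter is defined as b and H is never defined. Also, you are filling the array C with 0s only: with b = 60, every terminal price is at or above the barrier, so C[S >= b] = 0 wipes out every payoff. Check the payoff condition; since the contract requires S(t_k) > b, the option should be knocked out where S <= b. In general, I find it much easier to loop through matrices when working with tree models.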
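To make the points above concrete, here is a rough sketch of a discretely monitored down-and-out call on a recombining tree. It is not the asker's code: the function name, the choice of one tree step per year (so that t = 2, 4, 7 fall on tree steps), the simple per-step discounting 1/(1+r) to match q = (1+r-d)/(u-d), and the knock-out rule S <= b are all assumptions.

```python
import numpy as np

def down_and_out_call_binomial(S0, K, b, r, u, d, n_steps, monitor_steps):
    """Sketch: down-and-out call, barrier checked only at the steps in monitor_steps.

    Assumes one tree step per period and simple compounding per step.
    """
    q = (1 + r - d) / (u - d)      # risk-neutral probability of an up move
    disc = 1.0 / (1 + r)           # one-step discount factor consistent with q
    j = np.arange(n_steps + 1)     # number of up moves at maturity
    S = S0 * u**j * d**(n_steps - j)
    C = np.maximum(S - K, 0.0)     # call payoff at maturity
    # backward recursion through the tree
    for i in range(n_steps - 1, -1, -1):
        j = np.arange(i + 1)
        S = S0 * u**j * d**(i - j)
        C = disc * (q * C[1:i + 2] + (1 - q) * C[:i + 1])
        if i in monitor_steps:
            C[S <= b] = 0.0        # knocked out where the price is at or below the barrier
    return C[0]

price = down_and_out_call_binomial(S0=100, K=125, b=60, r=0.05, u=1.1, d=0.9,
                                   n_steps=10, monitor_steps={2, 4, 7})
print(price)
```

This is a deterministic backward induction over the tree, not a Monte Carlo simulation.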

JAX: time to jit a function grows superlinearly with memory accessed by function

Here is a simple example, which numerically integrates the product of two Gaussian pdfs. One of the Gaussians is fixed, with mean always at 0. The other Gaussian varies in its mean:
import time
import jax.numpy as np
from jax import jit
from jax.scipy.stats.norm import pdf
# set up evaluation points for numerical integration
integr_resolution = 6400
lower_bound = -100
upper_bound = 100
integr_grid = np.linspace(lower_bound, upper_bound, integr_resolution)
proba = pdf(integr_grid)
integration_weight = (upper_bound - lower_bound) / integr_resolution
# integrate with new mean
def integrate(mu_new):
    x_new = integr_grid - mu_new
    proba_new = pdf(x_new)
    total_proba = sum(proba * proba_new * integration_weight)
    return total_proba
print('starting jit')
start = time.perf_counter()
integrate = jit(integrate)
integrate(1)
stop = time.perf_counter()
print('took: ', stop - start)
The function looks simple, but it doesn't scale at all. The following list contains pairs of (value of integr_resolution, time it took to run the code):
100 | 0.107s
200 | 0.23s
400 | 0.537s
800 | 1.52s
1600 | 5.2s
3200 | 19s
6400 | 134s
For reference, the unjitted function, applied to integr_resolution=6400 takes 0.02s.
I thought that this might be related to the fact that the function is accessing a global variable. But moving the code to set up the integration points inside of the function has no notable influence on the timing. The following code takes 5.36s to run. It corresponds to the table entry with 1600 which previously took 5.2s:
# integrate with new mean
def integrate(mu_new):
    # set up evaluation points for numerical integration
    integr_resolution = 1600
    lower_bound = -100
    upper_bound = 100
    integr_grid = np.linspace(lower_bound, upper_bound, integr_resolution)
    proba = pdf(integr_grid)
    integration_weight = (upper_bound - lower_bound) / integr_resolution
    x_new = integr_grid - mu_new
    proba_new = pdf(x_new)
    total_proba = sum(proba * proba_new * integration_weight)
    return total_proba
What is happening here?

I also answered this at https://github.com/google/jax/issues/1776, but adding the answer here too.
It's because the code uses sum where it should use np.sum.
sum is a Python built-in that extracts each element of a sequence and sums them one by one using the + operator. This has the effect of building a large, unrolled chain of adds which XLA takes a long time to compile.
If you use np.sum, then JAX builds a single XLA reduction operator, which is much faster to compile.
And just to show how I figured this out: I used jax.make_jaxpr, which dumps JAX's internal trace representation of a function. Here, it shows:
In [3]: import jax
In [4]: jax.make_jaxpr(integrate)(1)
Out[4]:
{ lambda b c ; ; a.
let d = convert_element_type[ new_dtype=float32
old_dtype=int32 ] a
e = sub c d
f = sub e 0.0
g = pow f 2.0
h = div g 1.0
i = add 1.8378770351409912 h
j = neg i
k = div j 2.0
l = exp k
m = mul b l
n = mul m 2.0
o = slice[ start_indices=(0,)
limit_indices=(1,)
strides=(1,)
operand_shape=(100,) ] n
p = reshape[ new_sizes=()
dimensions=None
old_sizes=(1,) ] o
q = add p 0.0
r = slice[ start_indices=(1,)
limit_indices=(2,)
strides=(1,)
operand_shape=(100,) ] n
s = reshape[ new_sizes=()
dimensions=None
old_sizes=(1,) ] r
t = add q s
u = slice[ start_indices=(2,)
limit_indices=(3,)
strides=(1,)
operand_shape=(100,) ] n
v = reshape[ new_sizes=()
dimensions=None
old_sizes=(1,) ] u
w = add t v
x = slice[ start_indices=(3,)
limit_indices=(4,)
strides=(1,)
operand_shape=(100,) ] n
y = reshape[ new_sizes=()
dimensions=None
old_sizes=(1,) ] x
z = add w y
... similarly ...
and it's then obvious why this is slow: the program is very big.
Contrast the np.sum version:
In [5]: def integrate(mu_new):
   ...:     x_new = integr_grid - mu_new
   ...:
   ...:     proba_new = pdf(x_new)
   ...:     total_proba = np.sum(proba * proba_new * integration_weight)
   ...:
   ...:     return total_proba
   ...:
In [6]: jax.make_jaxpr(integrate)(1)
Out[6]:
{ lambda b c ; ; a.
let d = convert_element_type[ new_dtype=float32
old_dtype=int32 ] a
e = sub c d
f = sub e 0.0
g = pow f 2.0
h = div g 1.0
i = add 1.8378770351409912 h
j = neg i
k = div j 2.0
l = exp k
m = mul b l
n = mul m 2.0
o = reduce_sum[ axes=(0,)
input_shape=(100,) ] n
in [o] }
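The size difference between the two traces can also be checked programmatically rather than by eye. This is a small sketch that counts the equations in each jaxpr; the two toy functions and the array size of 50 are made up for illustration, and the exact counts may vary between JAX versions.

```python
import jax
import jax.numpy as jnp

x = jnp.linspace(-1.0, 1.0, 50)

def with_builtin_sum(v):
    # Python's built-in sum iterates over the elements while tracing,
    # so every addition becomes its own operation in the trace
    return sum(v * v)

def with_jnp_sum(v):
    # jnp.sum is traced as a single reduction
    return jnp.sum(v * v)

n_builtin = len(jax.make_jaxpr(with_builtin_sum)(x).jaxpr.eqns)
n_jnp = len(jax.make_jaxpr(with_jnp_sum)(x).jaxpr.eqns)
print('equations with built-in sum:', n_builtin)
print('equations with jnp.sum:', n_jnp)
```

The count for the built-in version grows with the array length, while the jnp.sum version stays constant.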
Hope that helps!

Detecting outliers from a list

I want to detect and store outliers from a list and this is what I am doing
Code:
def outliers(y,thresh=3.5):
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh

bids = [5000,5500,4500,1000,15000,5200,4900]
z = outliers(bids)
bidd = np.array(bids)
out_liers = bidd[z]
This gives results as:
out_liers = array([ 1000, 15000])
Is there a better way to do this, where I get the results as a list rather than an array?
Also, could someone please explain why we use
thresh=3.5
modified_z_score = 0.6745 * abs_dev / y_mad

This works:
def outliers_modified_z_score(ys, threshold=3.5):
    ys_arr = np.array(ys)
    median_y = np.median(ys_arr)
    median_absolute_deviation_y = np.median(np.abs(ys_arr - median_y))
    modified_z_scores = 0.6745 * (ys_arr - median_y) / median_absolute_deviation_y
    return (ys_arr[np.abs(modified_z_scores) > threshold]).tolist()

That's because you are using NumPy functions, which return numpy.ndarray by default; the array type is what makes the computation fast. If you just need a list as output, call the tolist() method:
z = outliers(bids)
bidd = np.array(bids)
out_liers = bidd[z].tolist()
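On the two constants asked about: 0.6745 is approximately the 0.75 quantile of the standard normal distribution, so for normally distributed data the median absolute deviation is about 0.6745 times the standard deviation, and multiplying by it puts the score on a z-score-like scale. The cutoff 3.5 is a conventional threshold for this modified z-score, commonly attributed to Iglewicz and Hoaglin. A small sketch that derives the constant and returns a plain list; the helper name modified_z_scores is made up for illustration:

```python
from statistics import NormalDist
import numpy as np

# 0.75 quantile of the standard normal, roughly 0.6745
k = NormalDist().inv_cdf(0.75)

def modified_z_scores(ys):
    """Modified z-scores based on the median and the median absolute deviation."""
    ys = np.asarray(ys, dtype=float)
    med = np.median(ys)
    mad = np.median(np.abs(ys - med))
    return k * (ys - med) / mad

bids = [5000, 5500, 4500, 1000, 15000, 5200, 4900]
scores = modified_z_scores(bids)
# Build a plain Python list of the values whose score exceeds the threshold
outlier_list = [b for b, s in zip(bids, scores) if abs(s) > 3.5]
print(outlier_list)
```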

Want to get maximum values from output and apply them to equation

Input code is:
# Input data:
S = pd.S = 2000 # Saturation flow
L = pd.L = 5 # Lost time
eb = pd.eb = 1000
wb = pd.wb = 600
sb = pd.sb = 400
nb = pd.nb = 500
# a) C_min = Minimum cycle length calculation
Y_eb = pd.Y_eb = eb / S
Y_wb = pd.Y_wb = wb / S
Y_sb = pd.Y_sb = sb / S
Y_nb = pd.Y_nb = nb / S
Y_eb_wb_sb_nb = [Y_eb,Y_wb,Y_sb,Y_nb]
Y_eb_wb_sb_nb
Output:
[0.5, 0.3, 0.2, 0.25]
Then
if Y_eb > Y_wb:
    print(C_min = L / 1 - (Y_eb + Y_wb))
I want to:
Get maximum values from (Y_eb;Y_wb) and (Y_sb;Y_nb) and apply these values to formula:
C_min = L / (1- [max of (Y_eb;Y_wb)] + [max of (Y_sb;Y_nb)])

Use the built-in max function:
C_min = L / (1- max(Y_eb,Y_wb) + max(Y_sb,Y_nb))

Python has a built-in max function that gives the maximum of a list or of several arguments:
max(iterable, *[, key, default])
max(arg1, arg2, *args[, key])
"Return the largest item in an iterable or the largest of two or more
arguments"
https://docs.python.org/3/library/functions.html#max
Answer:
C_min = L / (1- max([Y_eb, Y_wb]) + max([Y_sb, Y_nb]))
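As a quick check with the numbers from the question, here is a minimal sketch. It takes the formula exactly as written in the question; the intermediate names y_ew and y_sn are made up for readability.

```python
S = 2000  # saturation flow
L = 5     # lost time
eb, wb, sb, nb = 1000, 600, 400, 500

# Flow ratios for each approach
Y_eb, Y_wb, Y_sb, Y_nb = eb / S, wb / S, sb / S, nb / S

y_ew = max(Y_eb, Y_wb)  # larger of the east/west ratios
y_sn = max(Y_sb, Y_nb)  # larger of the south/north ratios

# Formula as written in the question
C_min = L / (1 - y_ew + y_sn)
print(y_ew, y_sn, C_min)
```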

Optimising iterative loop

I'm gradually moving from Matlab to Python and would like to get some advice on optimising an iterative loop.
This is how I am currently running the loop, and for info I've included the code that defines the variables.
nh = 2000
h = np.array(range(nh))
nt = 10000
wmin = 1
wmax = 10
hw = np.array(wmin + (wmax-wmin)*invlogit(randn(1,nh)));
sl = np.array(zeros((nh,1))+radians(40))
fa = np.array(zeros((nh,1))+radians(35))
c = np.array(zeros((nh,1))+4.4)
y = np.array(zeros((nh,1))+17.6)
yw = np.array(zeros((nh,1))+9.81)
ir = 0.028
m = np.array(zeros((nh,nt)));
m[:,49] = 0.1
z = np.array(zeros((nh,nt)))
z[:,0] = 0+(3.0773-0)*rand(nh,1).T
reset = np.array(zeros((nh,nt)))
fs = np.array(zeros((nh,nt)))
for t in range(0, nt-1):
    fs[:,t] = (c.T+(y.T-m[:,t]*yw.T)*z[:,t]*(np.cos(sl.T)**2)*np.tan(fa.T))/(y.T*z[:,t]*np.sin(sl.T)*np.cos(sl.T))
    reset[fs[:,t]<=1,t+1] = 1
    z[fs[:,t]<=1,t+1] = 0
    z[fs[:,t]>1,t+1] = z[fs[:,t]>1,t]+(ir/hw[0,fs[:,t]>1]).T
This is how I would optimise the code in Matlab, however it runs fairly slowly in python. I suspect there is a more pythonic way of doing this and would really appreciate a nudge in the right direction.
Many thanks!

Not specifically about the loop, but you're doing a ton of extra work in calls that look like:
np.array(zeros((nh,nt)))
Just use:
np.zeros((nh,nt))
in its place. Additionally, you could replace:
h = np.array(range(nh))
with:
h = np.arange(nh)
Other comments:
You're calling np.sin(sl.T)*np.cos(sl.T) on every iteration, although sl does not appear to change at all. Calculate it once, assign it to a variable, and use that in the loop. The same applies to several of your other trig calls.

The expression
(c.T+(y.T-m[:,t]*yw.T)*z[:,t]*(np.cos(sl.T)**2)*np.tan(fa.T))/(y.T*z[:,t]*np.sin(sl.T)*np.cos(sl.T))
uses c, y, m, yw, sl, fa that do not change inside the loop. You could compute several subexpressions before the loop.
Also, most of those arrays contain one repeated value. You could compute with scalars instead:
sl = np.radians(40)
fa = np.radians(35)
c = 4.4
y = 17.6
yw = 9.81
Then, with precomputed subexpressions:
A = np.cos(sl)**2 * np.tan(fa) * (y - m*yw)
B = y*np.sin(sl)*np.cos(sl)
for t in range(0, nt-1):
    fs[:,t] = (c + A[:,t]*z[:,t]) / (B*z[:,t])
    less = fs[:,t]<=1
    more = np.logical_not(less)
    reset[less,t+1] = 1
    z[less,t+1] = 0
    z[more,t+1] = z[more,t]+(ir/hw[0,more]).T
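Putting the suggestions together, here is a self-contained sketch of the loop. It uses smaller array sizes so it runs instantly, writes out invlogit by hand, uses a 1-D hw instead of the original (1, nh) array, and suppresses the divide-by-zero warning that appears once z has been reset to 0; all of these are choices made for this illustration, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
nh, nt = 50, 200                      # reduced sizes for a quick run (assumption)
wmin, wmax = 1, 10
ir = 0.028

# invlogit(x) = 1 / (1 + exp(-x)), written out explicitly
hw = wmin + (wmax - wmin) / (1 + np.exp(-rng.standard_normal(nh)))

# Scalars instead of constant-valued arrays
sl, fa = np.radians(40), np.radians(35)
c, y, yw = 4.4, 17.6, 9.81

m = np.zeros((nh, nt))
m[:, 49] = 0.1
z = np.zeros((nh, nt))
z[:, 0] = 3.0773 * rng.random(nh)
reset = np.zeros((nh, nt))
fs = np.zeros((nh, nt))

# Loop-invariant subexpressions, computed once
A = np.cos(sl)**2 * np.tan(fa) * (y - m * yw)
B = y * np.sin(sl) * np.cos(sl)

for t in range(nt - 1):
    # z == 0 right after a reset gives fs = inf, which counts as "greater than 1"
    with np.errstate(divide='ignore'):
        fs[:, t] = (c + A[:, t] * z[:, t]) / (B * z[:, t])
    less = fs[:, t] <= 1
    more = ~less
    reset[less, t + 1] = 1
    z[less, t + 1] = 0
    z[more, t + 1] = z[more, t] + ir / hw[more]
```

The loop over t itself has to stay, because each column of z depends on the previous one; the savings come from doing only cheap array arithmetic inside it.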
