Calculating 1 minus spearman correlation between two large datasets - python

I have two very large dataframes, and I'd like to calculate (1 minus the Spearman correlation) between every pair of points across them.
I want to store the result in a new table where the index of the first dataframe becomes the index of the new dataframe and the index of the second dataframe becomes its column names.
I found this post where a similar question was asked and tried its approach, but it has been running for over an hour without finishing.
For example, given the following two dataframes:
import numpy as np
import pandas as pd
X = pd.DataFrame({"A":np.random.random_sample(size = 10000), "B":np.random.random_sample(size = 10000), "C":np.random.random_sample(size = 10000), "D":np.random.random_sample(size = 10000), "E":np.random.random_sample(size = 10000)})
Y = pd.DataFrame({"AA":np.random.random_sample(size = 10000), "BB":np.random.random_sample(size = 10000), "CC":np.random.random_sample(size = 10000), "DD":np.random.random_sample(size = 10000), "EE":np.random.random_sample(size = 10000)})
I'd like to calculate (1 - Spearman correlation) between each pair of points across the two dataframes, where each point is a 1 x 10000 vector, so that the end result looks like this:
     AA                     BB   CC   DD   EE
A    (1 - spearman corr)    x    x
B    x                      x
C    x
D
E
This is the first part of what I've done. It's been running for hours with no result.
test = pd.concat([X,Y], axis=0).corr(method="spearman")
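A possible faster route, sketched under two assumptions: the columns of X and Y are the "points" being compared, and neither frame contains NaN. Spearman correlation is the Pearson correlation of the ranks, so each column can be ranked once and the whole cross-correlation block computed with a single matrix product. The helper name one_minus_spearman is hypothetical, not something from the question.

```python
import numpy as np
import pandas as pd

def one_minus_spearman(X, Y):
    """Return 1 - Spearman correlation for every (X column, Y column) pair.

    Assumes X and Y have the same number of rows and no NaN values.
    """
    # Rank each column (ties get average ranks, as in the usual Spearman definition)
    xr = X.rank().to_numpy()
    yr = Y.rank().to_numpy()
    # Standardise the ranks; Pearson correlation is then a scaled dot product
    xr = (xr - xr.mean(axis=0)) / xr.std(axis=0)
    yr = (yr - yr.mean(axis=0)) / yr.std(axis=0)
    corr = xr.T @ yr / len(X)
    return pd.DataFrame(1.0 - corr, index=X.columns, columns=Y.columns)

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((10000, 5)), columns=list("ABCDE"))
Y = pd.DataFrame(rng.random((10000, 5)), columns=["AA", "BB", "CC", "DD", "EE"])
result = one_minus_spearman(X, Y)
print(result)
```

If the rows are the points instead, the same function can be applied to X.T and Y.T, although the output then has one entry per pair of rows and may not fit in memory for very large frames.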

Related

Dataframe with Monte Carlo Simulation calculation next row Problem

I want to build a DataFrame from scratch in which each value is calculated from the value before it, in order to price a barrier option. I know that I can use a Monte Carlo simulation to solve this, but it just won't work the way I want it to.
The formula is:
Value in row before * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
The first code I wrote only calculates the first column. I know that I need a second loop but can't really manage it.
The result should be that each simulation calculates a new value from the previous one for 500 days (S_1 up to S_500), with a total of 1000 simulations. (I need to generate new columns based on the previous value using the formula.)
Something like this:
500 days for the 1st simulation, 500 days for the 2nd simulation, and so on...
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
simulation = 0
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 500
df = pd.DataFrame()
for i in range(0, TradingDays):
    z = norm.ppf(rd.random())
    simulation = simulation + 1
    S_1 = S_0*np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
    df = df.append({
        'S_1': S_1,
        'S_0': S_0
    }, ignore_index=True)
df = df.round({'Z': 6,
               'S_T': 2
               })
df.index += 1
df.index.name = 'Simulation'
print(df)
I found another possible approach here (Generate a Dataframe that follow a mathematical function for each column / row). It does solve the problem, but only for one row; the next row is just the same calculation.
If I just replace its expression with my formula, I get the same problem.
replacing:
exp(r - q * sqrt(sigma))*T+ (np.random.randn(nrows) * sqrt(deltaT)))
with:
exp((r-sigma**2/2)*T/nrows+sigma*np.sqrt(T/nrows)*z))
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
TradingDays = 50
Simulation = 100
df = pd.DataFrame({'s0': [S_0] * Simulation})
for i in range(1, TradingDays):
    z = norm.ppf(rd.random())
    df[f's{i}'] = df.iloc[:, -1] * np.exp((r-sigma**2/2)*T/TradingDays+sigma*np.sqrt(T/TradingDays)*z)
print(df)
I would prefer to work with the last code and solve the problem with it.
How about just overwriting the value of S_0 by the new value of S_1 while you loop and keeping all simulations in a list?
Like this:
import numpy as np
import pandas as pd
import random
from scipy.stats import norm
S_0 = 42
T = 2
r = 0.02
sigma = 0.20
trading_days = 50
output = []
for i in range(trading_days):
    z = norm.ppf(random.random())
    value = S_0*np.exp((r - sigma**2 / 2) * T / trading_days + sigma * np.sqrt(T/trading_days) * z)
    output.append(value)
    S_0 = value
df = pd.DataFrame({'simulation': output})
Perhaps I'm missing something, but I don't see the need for a second loop.
Also, this eliminates calling df.append() in a loop, which should be avoided: it copies the whole DataFrame on every call, and the method was removed in pandas 2.0. (See here)
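If many simulations are needed, the same recurrence can probably be vectorized. The sketch below assumes one row per simulation and one column per trading day, draws all the normal shocks at once, and uses the fact that multiplying by exp(step) day after day is the same as exponentiating a cumulative sum of steps. The function name simulate_paths and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def simulate_paths(s0, r, sigma, T, trading_days, n_sims, seed=None):
    """Simulate price paths with the question's update rule, one row per simulation."""
    rng = np.random.default_rng(seed)
    dt = T / trading_days
    # One independent standard normal draw per simulation and per day
    z = rng.standard_normal((n_sims, trading_days))
    log_steps = (r - sigma**2 / 2) * dt + sigma * np.sqrt(dt) * z
    # Cumulative sum of log steps == repeated multiplication by exp(step)
    paths = s0 * np.exp(np.cumsum(log_steps, axis=1))
    columns = [f"S_{day}" for day in range(1, trading_days + 1)]
    df = pd.DataFrame(paths, columns=columns)
    df.index += 1
    df.index.name = "Simulation"
    return df

paths = simulate_paths(s0=42, r=0.02, sigma=0.20, T=2,
                       trading_days=500, n_sims=1000, seed=0)
print(paths.iloc[:5, :5])
```

Each row is then one complete 500-day path, so no second loop or chunking step is required.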
Solution based on the answer of bartaelterman, thank you very much!
import numpy as np
import pandas as pd
from scipy.stats import norm
import random as rd
import math
#Dividing the list in chunks to later append it to the dataframe in the right order
def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]
#Black-Scholes price of a call option (reads S_0, K, r, sigma and T from the globals below)
def blackscholes():
    d1 = ((math.log(S_0/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(T)))
    d2 = ((math.log(S_0/K)+(r-sigma**2/2)*T)/(sigma*np.sqrt(T)))
    preis_call_option = S_0*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
    return preis_call_option
K = 40
S_0 = 42
T = 2
r = 0.02
sigma = 0.2
U = 38
simulation = 10000
trading_days = 500
trading_days = trading_days -1
#creating 2 lists for the first and second loop
loop_simulation = []
loop_trading_days = []
#first loop calculates the first column in a list
for j in range(0, simulation):
    print("Progressbar_1_2 {:2.2%}".format(j / simulation), end="\n\r")
    S_Tag_new = 0
    NORM_S_INV = norm.ppf(rd.random())
    S_Tag = S_0*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
    S_Tag_new = S_Tag
    loop_simulation.append(S_Tag)
    #second loop calculates the rows for the columns in a list
    for i in range(0, trading_days):
        NORM_S_INV = norm.ppf(rd.random())
        S_Tag = S_Tag_new*np.exp((r-sigma**2/2)*T/trading_days+sigma*np.sqrt(T/trading_days)*NORM_S_INV)
        loop_trading_days.append(S_Tag)
        S_Tag_new = S_Tag
#values from the second loop will be divided into the number of trading days per simulation
loop_trading_days_chunked = list(chunk_list(loop_trading_days,trading_days))
#First dataframe with just the first results from the first loop for each simulation
df1 = pd.DataFrame({'S_Tag 1': loop_simulation})
#Putting the chunked list from the second loop into a second dataframe
df2 = pd.DataFrame(loop_trading_days_chunked)
#Merging both dataframes into one
df3 = pd.concat([df1, df2], axis=1)

Down sample DF1 according to the coordinates in DF2

I have two DataFrames. Both have X and Y coordinates, but DF1 is much denser than DF2. I want to downsample DF1 according to the X Y coordinates in DF2. Specifically, for each X/Y pair in DF2, I select the DF1 data between X +/- delta and Y +/- delta, and calculate the average value of Z. New_DF1 will have the same X Y coordinates as DF2, but with the average value of Z from the downsampling.
Here are some example data and a function I made for this purpose. My problem is that it is too slow for a large dataset. I would appreciate any ideas for vectorizing the operation instead of crude looping.
Create data examples:
DF1 = pd.DataFrame({'X':[0.6,0.7,0.9,1.1,1.3,1.8,2.1,2.8,2.9,3.0,3.3,3.5],"Y":[0.6,0.7,0.9,1.1,1.3,1.8,2.1,2.8,2.9,3.0,3.3,3.5],'Z':[1,2,3,4,5,6,7,8,9,10,11,12]})
DF2 = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3],'Z':[10,20,30]})
Function:
def DF1_match_DF2_target(half_range, DF2, DF1):
    ### half_range: scalar, half-width of the target area around each DF2 point
    ### DF2: target coordinates (dbf data)
    ### DF1: dense data to downsample (raw pwg pixel map)
    DF2_X = DF2.loc[:, ["X"]]
    DF2_Y = DF2.loc[:, ['Y']]
    results = list()
    for i in DF2.index:
        #Select target XY from DF2
        x = DF2_X.at[i, 'X']
        y = DF2_Y.at[i, 'Y']
        #Select X,Y range for DF1
        upper_lmt_X = x+half_range
        lower_lmt_X = x-half_range
        upper_lmt_Y = y+half_range
        lower_lmt_Y = y-half_range
        #Select data from DF1 according to X,Y range, calculate average Z
        subset_X = DF1.loc[(DF1['X']>lower_lmt_X) & (DF1['X']<upper_lmt_X)]
        subset_XY = subset_X.loc[(subset_X['Y']>lower_lmt_Y) & (subset_X['Y']<upper_lmt_Y)]
        result = subset_XY.mean(axis=0, skipna=True)
        result['X'] = x #set X,Y in new_DF1 the same as the X,Y in DF2
        result['Y'] = y #set X,Y in new_DF1 the same as the X,Y in DF2
        results.append(result)
    results = pd.DataFrame(results)
    return results
Test and Result:
new_DF1 = DF1_match_DF2_target(0.5,DF2,DF1)
new_DF1
How about using the pandas.cut() function to aggregate using the boundary values?
# df1 and df2 are assumed to be DF1 and DF2 from the question, with lowercase column names
half_range = 0.5
# create bins
x_bins = [0] + list(df2.x)
y_bins = [0] + list(df2.y)
tmp = [half_range]*(len(df2)+1)
x_bins = [a + b for a, b in zip(x_bins, tmp)]
y_bins = [a + b for a, b in zip(y_bins, tmp)]
key = pd.cut(df1.x, bins=x_bins, right=False, precision=1)
df3 = df1.groupby(key).mean().reset_index(drop=True)
df2['z'] = df3['z']
df2
   x  y    z
0  1  1  3.0
1  2  2  6.5
2  3  3  9.5
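If the window has to be applied to both X and Y, as in the question's function, one possible alternative is NumPy broadcasting. The sketch below builds a boolean (targets x dense points) mask and averages Z with a matrix product. It assumes Z has no NaN values and that the mask fits in memory; the function name downsample_to_targets is hypothetical.

```python
import numpy as np
import pandas as pd

def downsample_to_targets(dense, targets, half_range):
    """Average dense['Z'] inside a square window around each (X, Y) in targets."""
    # Pairwise coordinate distances, shape (n_targets, n_dense)
    dx = np.abs(dense["X"].to_numpy()[None, :] - targets["X"].to_numpy()[:, None])
    dy = np.abs(dense["Y"].to_numpy()[None, :] - targets["Y"].to_numpy()[:, None])
    # Strict inequalities, matching the question's > and < comparisons
    inside = (dx < half_range) & (dy < half_range)
    counts = inside.sum(axis=1)
    sums = inside.astype(float) @ dense["Z"].to_numpy(dtype=float)
    # Targets with no neighbours get NaN instead of raising a warning
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_z = sums / counts
    return pd.DataFrame({"X": targets["X"].to_numpy(),
                         "Y": targets["Y"].to_numpy(),
                         "Z": mean_z})

DF1 = pd.DataFrame({"X": [0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.1, 2.8, 2.9, 3.0, 3.3, 3.5],
                    "Y": [0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.1, 2.8, 2.9, 3.0, 3.3, 3.5],
                    "Z": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
DF2 = pd.DataFrame({"X": [1, 2, 3], "Y": [1, 2, 3], "Z": [10, 20, 30]})
new_DF1 = downsample_to_targets(DF1, DF2, 0.5)
print(new_DF1)
```

For very large inputs the mask grows as len(DF2) * len(DF1), so processing DF2 in chunks, or using a spatial index, may be a better fit.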

average point on each bin pandas

I have 2 dataframes, temperature (y) and ratio (x). Each dataframe has 60 columns corresponding to 60 different machines that measure both parameters.
For now I have a plot of y vs x for each machine, as follows:
for column in ratio.columns:
    x = ratio[column]
    y = temperature[column]
    if len(x) != len(y):
        x_ind = x.index
        y_ind = y.index
        common_ind = x_ind.intersection(y_ind)
        x = x[common_ind]
        y = y[common_ind]
    plt.scatter(x,y)
    plt.savefig("plot" +column+".png")
    plt.clf()
Because I have a lot of data points, I want to bin the data for each machine and average within each bin, so that I get one average y value per bin.
x is between 0 and 1 and I want a bin every 0.05, which gives 20 bins.
I got a histogram for each machine by doing:
for x in ratio.columns:
    ratio.hist(column = x, bins = 20)
but this only gives the number of events vs ratio.
How can I link this to the temperature dataframe?
I am new to pandas and I can't figure out how to do this.
Create a mask that groups every 20 rows:
mask = my_df.index//20
Then use groupby and agg:
my_df.groupby(mask).agg(['mean'])
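Note that the mask above groups rows by position rather than by ratio value. If the bins should instead be 0.05-wide intervals of the ratio, something along these lines might work: pd.cut assigns each ratio to an interval and the temperature is averaged per interval. This is only a sketch for one machine (one column of each dataframe); the helper name binned_mean is hypothetical.

```python
import numpy as np
import pandas as pd

def binned_mean(x, y, width=0.05):
    """Mean of y within equal-width bins of x over the range [0, 1]."""
    # Keep only the index labels present in both series
    x, y = x.align(y, join="inner")
    edges = np.linspace(0.0, 1.0, int(round(1 / width)) + 1)
    bins = pd.cut(x, bins=edges, include_lowest=True)
    # observed=False keeps empty bins in the result (their mean is NaN)
    return y.groupby(bins, observed=False).mean()

ratio_one_machine = pd.Series([0.01, 0.02, 0.07, 0.99])
temperature_one_machine = pd.Series([10.0, 20.0, 30.0, 40.0])
averages = binned_mean(ratio_one_machine, temperature_one_machine)
print(averages.dropna())
```

Inside the question's loop this would be called as binned_mean(ratio[column], temperature[column]), and the interval midpoints could be plotted against the averages.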

python: increase performance of finding the best timeshift for a correlation between each X column and y

I have a dataframe X with several columns and a dataframe y with only one column (series). The rows in X represent timesteps and I want to find the interval I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and then loops over all timesteps and correlates the X column with y. If the R² is better than before I store the timestep. However, with over 300 columns this routine is really taking some time and I need to increase the performance. Is there a nice way to simplify this code?
(In the example I used the iris data set which is of course not a timeseries...)
from sklearn import datasets
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from copy import deepcopy
def get_best_shift(dfX, dfy, ti=60, maxt=1440):
    """
    determines the best correlation for the last maxt minutes based on a
    timestep of ti minutes. Creates a dataframe with the shifted variables based on the
    best match (strongest correlation).
    """
    df_out = deepcopy(dfX)
    for xcol in dfX:
        bestshift = 0
        Rmax = 0
        for ishift in range(0, int(maxt / ti)):
            xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
            yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
            selector = np.array([str(val)!="nan" for val in (xvals*yvals)],dtype=bool)
            xvals = xvals[selector]
            yvals = yvals[selector]
            R = np.corrcoef(xvals,yvals)[0][1]
            # plt.figure()
            # plt.plot(xvals,yvals,'k.')
            # plt.show()
            if R ** 2 > Rmax:
                Rmax = R ** 2
                # print(Rmax)
                bestshift = ishift
        df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
        df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
    return df_out
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X,y)
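One way the search might be sped up is to drop the loop over columns and compute the correlation of every column with y in one NumPy expression per shift. The sketch below assumes X and y contain no NaN values (the original filters them pairwise, which this version does not do) and only returns the best shift per column. The function name best_shifts is hypothetical.

```python
import numpy as np
import pandas as pd

def best_shifts(X, y, max_shift):
    """Return, for each column of X, the shift in [0, max_shift) with the highest R^2 against y."""
    Xv = X.to_numpy(dtype=float)
    yv = np.asarray(y, dtype=float).ravel()
    n_rows, n_cols = Xv.shape
    best_r2 = np.zeros(n_cols)
    best_shift = np.zeros(n_cols, dtype=int)
    for shift in range(max_shift):
        # Same alignment as the original: X rows [0, n-shift) against y rows [shift, n)
        xs = Xv[: n_rows - shift]
        ys = yv[shift:]
        xc = xs - xs.mean(axis=0)
        yc = ys - ys.mean()
        denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
        with np.errstate(invalid="ignore", divide="ignore"):
            r = (yc @ xc) / denom  # Pearson r for all columns at once
        r2 = np.nan_to_num(r ** 2)  # constant columns give NaN, treated as 0
        improved = r2 > best_r2
        best_r2[improved] = r2[improved]
        best_shift[improved] = shift
    return pd.Series(best_shift, index=X.columns)

rng = np.random.default_rng(0)
signal = rng.standard_normal(500)
X_demo = pd.DataFrame({"lag3": np.roll(signal, -3), "lag0": signal})
shifts = best_shifts(X_demo, signal, max_shift=10)
print(shifts)
```

The shifted output dataframe can then be built once from these shifts, for example with Series.shift, instead of being rewritten inside the search loop.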

Splitting integrated probability density into two spatial regions

I have some probability density function:
T = 10000
tmin = 0
tmax = 10**20
t = np.linspace(tmin, tmax, T)
time = np.asarray(t) #this line may be redundant
for j in range(T):
    timedep_PD[j] = probdensity_func(x, time[j], initial_state)
I want to integrate it over two distinct regions of x. I tried the following to split the timedep_PD array into two spatial regions and then proceeded to integrate:
step = abs(xmin - xmax) / T
l1 = int(np.floor((abs(ab - xmin)* T ) / abs(xmin - xmax)))
l2 = int(np.floor((abs(bd - ab)* T ) / abs(xmin - xmax)))
#For spatial region 1
Pd1 = np.empty([T,l1])
R1 = np.empty([l1])
R1 = x[:l1]
for i in range(T):
    Pd1[i] = Pd[i][:l1]
#For spatial region 2
Pd2 = np.empty([T,l2])
R2 = np.empty([l2])
R2 = x[l1:l1+l2]
for i in range(T):
    Pd2[i] = Pd[i][l1:l1+l2]
#Integrating over each spatial region
for i in range(T):
    P[0][i] = np.trapz(Pd1[i],R1)
    P[1][i] = np.trapz(Pd2[i],R2)
Is there an easier/more clear way to go about splitting up a probability density function into two spatial regions and then integrating within each spatial region at each time-step?
The loops can be eliminated by using vectorized operations instead. It's not clear whether Pd is a 2D NumPy array; if it's something else (e.g., a list of lists), it should be converted to a 2D NumPy array with np.array(...). After that you can do this:
Pd1 = Pd[:, :l1]
Pd2 = Pd[:, l1:l1+l2]
No need to loop over the time index; the slicing happens for all times at once (having : in place of an index means "all valid indices").
Similarly, np.trapz can integrate all time slices at once:
P1 = np.trapz(Pd1, R1, axis=1)
P2 = np.trapz(Pd2, R2, axis=1)
Each P1 and P2 is now a time series of integrals. The axis parameter determines along which axis Pd1 gets integrated - it's the second axis, i.e., space.
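A small self-contained check of the idea, using made-up data: a density that is constant in x and equal to 1, 2, 3 at three time steps, so the expected integrals are easy to verify by hand. The grid, the split points l1 and l2, and the array values are all assumptions for illustration. Newer NumPy releases provide the same function under the name np.trapezoid, so the sketch picks whichever exists.

```python
import numpy as np

# np.trapezoid is the newer name; fall back to np.trapz on older NumPy
trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz

x = np.linspace(0.0, 1.0, 101)            # spatial grid, spacing 0.01
T = 3                                      # number of time steps
# Pd[i, :] is the density at time step i; here constant in x with value i + 1
Pd = np.outer(np.arange(1, T + 1), np.ones_like(x))

l1, l2 = 50, 50                            # number of grid points in each region
R1, R2 = x[:l1], x[l1:l1 + l2]
Pd1 = Pd[:, :l1]                           # all time steps, region 1
Pd2 = Pd[:, l1:l1 + l2]                    # all time steps, region 2

P1 = trapezoid(Pd1, R1, axis=1)            # one integral per time step
P2 = trapezoid(Pd2, R2, axis=1)
print(P1, P2)
```

Each region spans 0.49 in x here (50 points, 49 intervals of 0.01), so both results should be close to 0.49, 0.98 and 1.47.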
