Y intercept of pandas dataframe with multiple series for linear regression - python

count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
I have this dataframe named df2_norm in Python. I compute the slope with the following code:
allowableCorr = self.df2_norm.corr(method='pearson')
# slope[i, j] = corr[i, j] * std[j] / std[i], i.e. the slope of regressing column j on column i
self.slope = allowableCorr * (self.df2_norm.std().values / self.df2_norm.std().values[:, np.newaxis])
Q1) How do I compute the y intercept, using only pandas, numpy and matplotlib, into a matrix laid out like a heat/correlation map?
Q2) Is there a way to compute the scatter plot for each column as the train data and the rest as the test data?
Thank you.
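For Q1, a minimal sketch of one possible approach (my addition, not from the original thread): since slope[i, j] above is the slope of regressing column j on column i, the matching intercept matrix follows from b = mean(y) - slope * mean(x), broadcast the same way as the slope. The intercept attribute name is hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

means = self.df2_norm.mean().values
# intercept[i, j] = mean[j] - slope[i, j] * mean[i]
self.intercept = pd.DataFrame(means - self.slope.to_numpy() * means[:, np.newaxis],
                              index=self.slope.index, columns=self.slope.columns)

plt.imshow(self.intercept, cmap='coolwarm')  # heat-map style view of the intercept matrix
plt.colorbar()
plt.show()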


Linear interpolation between points of a dataframe using nearest points of the dataframe

I need a quick solution to interpolate between the nearest points of a dataframe without adding new points to the dataframe, because there is a lot of data - millions of points (no NaNs). The dataframe is sorted by x values.
E.g. I have a dataframe with the next columns:
x | y
-----
0 | 1
1 | 2
2 | 3
...
I need a function that, for a given input x value, returns the linearly interpolated y value between the nearest points, something like this:
calc_linear(df, input_col='x', input_val=1.5, output_col='y') will output 2.5 - the interpolated y value for the given x.
Maybe there are some pandas functions for that?
Use numpy.interp:
import numpy as np

def calc_linear(df, input_val, input_col='x', output_col='y'):
    return np.interp(input_val, df[input_col], df[output_col])

y = calc_linear(df, 1.5)
print(y)
# Output
2.5
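One caveat worth noting (my addition, not from the original answer): np.interp assumes df[input_col] is sorted in increasing order, which the question guarantees. It also accepts an array of query points, so many interpolations can be done in one call:
ys = calc_linear(df, [0.5, 1.5])
print(ys)
# Output
[1.5 2.5]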

Linear Regression on Multiindex Pandas Dataframe in Python

I'm trying to perform a regression of annual temperatures over time, and obtain a slope/linear trend (number generated by the regression) for each latitude and longitude coordinate (the full dataset has many lat/lon locations). I want to replace the year and temp for each location with this slope value. My end goal is to map these trends with cartopy.
Here is some test data in a pandas MultiIndex dataframe:
tempanomaly
lat lon time_bnds
-89.0 -179.0 1957 0.606364
1958 0.495000
1959 0.134286
This is my goal:
lat lon trend
-89.0 -179.0 -0.23604
This is my regression function:
def regress(y):
    # X is the year or index, y is the temperature
    X = np.array(range(len(y))).reshape(len(y), 1)
    y = y.array
    fit = np.polyfit(X, y, 1)
    return fit[0]
and here is how I'm attempting to call it:
reg = df.groupby(["lat", "lon"]).transform(regress)
The error I'm receiving is TypeError: Transform function invalid for data types.
In the debugging process, I found that the regression was running for each line (3 times, using the test data), as opposed to once for each location (only one location is in the test data). I believe the problem lies in the method I'm using to call the regression, but can't figure out another way to iterate through and perform a regression by lat/lon pairs—I appreciate any help!
I think there is also an error in your regress function, because in your case X should be a 1D vector. Here is the fixed regress function:
def regress(y):
    # X is the year or index, y is the temperature
    X = np.array(range(len(y)))
    y = y.array
    fit = np.polyfit(X, y, 1)
    return fit[0]
Per the pandas documentation, pandas.DataFrame.transform produces a DataFrame with the same axis length as self. Therefore aggregate is a better option for your case:
reg = df.groupby(["lat", "lon"]).aggregate(
    trend=pd.NamedAgg('tempanomaly', regress)
).reset_index()
which produces:
lat lon trend
-89.0 -179.0 -0.236039
with the sample data created as follows:
lat_lon = [(-89.0, -179.0), (-89.0, -179.0), (-89.0, -179.0)]
index = pd.MultiIndex.from_tuples(lat_lon, names=["lat", "lon"])
df = pd.DataFrame({
    'time_bnds': [1957, 1958, 1959],
    'tempanomaly': [0.606364, 0.495000, 0.134286]
}, index=index)
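Since the stated end goal is mapping the trends with cartopy, one possible follow-up (a sketch of my own, not from the original answer; it assumes reg has the lat, lon and trend columns produced above and that the full dataset covers a regular lat/lon grid):
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# pivot the per-location trends into a 2D lat x lon grid
grid = reg.pivot(index='lat', columns='lon', values='trend')

ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()
mesh = ax.pcolormesh(grid.columns, grid.index, grid.values,
                     transform=ccrs.PlateCarree(), cmap='coolwarm')
plt.colorbar(mesh, label='trend')
plt.show()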

Finding the accuracy from the confusion matrix in pd.crosstab

Using pd.crosstab, I can produce a confusion matrix from my predicted data. I used the following line to generate the confusion matrix:
pd.crosstab(test_data['class'], test_data['predicted'], margins = True)
Similarly, in R, I can generate a confusion matrix using the line below:
confusion_matrix <- table(truth = data.test$class, prediction = predict(model, data.test[,-46], type = 'class'))
And in R I can find the accuracy of my model using this line:
sum(diag(confusion_matrix)) / sum(confusion_matrix)
In Python, is there an equivalent of sum(diag(confusion_matrix)) / sum(confusion_matrix) to calculate the accuracy from my confusion matrix?
I would prefer not to use any libraries other than pandas (e.g., no scikit-learn).
You need numpy: use np.diag on the crosstab result to extract the diagonal and sum it, then convert the crosstab to a numpy array before summing for the denominator:
import numpy as np

np.random.seed(123)
test_data = pd.DataFrame({'class': np.random.randint(0, 2, 10),
                          'predicted': np.random.randint(0, 2, 10)})
tab = pd.crosstab(test_data['class'], test_data['predicted'])

predicted  0  1
class
0          4  3
1          0  3

np.diag(tab).sum() / tab.to_numpy().sum()
0.7
Or hard-code it (though there is little reason to):
(tab.iloc[0, 0] + tab.iloc[1, 1]) / tab.to_numpy().sum()
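One caveat (my addition, not from the original answer): the question builds the crosstab with margins=True, which appends an 'All' row and column; those totals must be dropped before the division, or the denominator is inflated:
tab = pd.crosstab(test_data['class'], test_data['predicted'], margins=True)
core = tab.drop(index='All', columns='All')  # strip the margin totals
acc = np.diag(core).sum() / core.to_numpy().sum()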

Regression by group in python pandas

I want to ask a quick question related to regression analysis in python pandas.
So, assume that I have the following datasets:
Group Y X
1 10 6
1 5 4
1 3 1
2 4 6
2 2 4
2 3 9
My aim is to run a regression, where Y is the dependent and X the independent variable. The issue is that I want to run this regression by Group and print the coefficients in a new dataset. So, the results should look like:
Group Coefficient
1 0.25 (let's assume the coefficient is 0.25)
2 0.30
I hope that explains my question.
Many thanks in advance for your help.
I am not sure about the type of regression you need, but this is how you run an OLS (ordinary least squares) regression:
import pandas as pd
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

# This is what you need
df.groupby('Group').apply(regress, 'Y', ['X'])
You can define your regression function and pass parameters to it as shown above.
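To get exactly the Group/Coefficient table from the question, one possible follow-up (a sketch of my own; the column names follow the question):
params = df.groupby('Group').apply(regress, 'Y', ['X'])
# keep only the slope on X and rename it to match the desired output
result = params['X'].rename('Coefficient').reset_index()
print(result)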

Python cross correlation

I have a pair of 1D arrays (of different lengths) like the following:
data1 = [0,0,0,1,1,1,0,1,0,0,1]
data2 = [0,1,1,0,1,0,0,1]
I would like to get the maximum cross-correlation of the two series in Python. In MATLAB, the xcorr() function returns it fine.
I have tried the following 2 methods:
numpy.correlate(data1, data2)
signal.fftconvolve(data2, data1[::-1], mode='full')
Both methods give me the same values, but the values I get from Python differ from what comes out of MATLAB. Python gives me integer values greater than 1, whereas MATLAB gives actual correlation values between 0 and 1.
I have tried normalizing the 2 arrays first ((value - mean) / SD), but the cross-correlation values I get are in the thousands, which doesn't seem correct.
MATLAB will also give you a lag value at which the cross correlation is the greatest. I assume it is easy to do this using indices, but what's the most appropriate way of doing this if my arrays contain tens of thousands of values?
I would like to mimic the xcorr() function that MATLAB has; any thoughts on how I would do that in Python?
numpy.correlate(arr1, arr2, "full")
gave me the same output as
xcorr(arr1, arr2)
gives in MATLAB.
Implementation of MATLAB's xcorr(x, y) and comparison of the result with an example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.signal as signal

def xcorr(x, y):
    """
    Perform cross-correlation on x and y
    x : 1st signal
    y : 2nd signal

    returns
    lags : lags of correlation
    corr : coefficients of correlation
    """
    corr = signal.correlate(x, y, mode="full")
    lags = signal.correlation_lags(len(x), len(y), mode="full")
    return lags, corr

n = np.arange(15)
x = 0.84**n
y = np.roll(x, 5)

lags, c = xcorr(x, y)
plt.figure()
plt.stem(lags, c)
plt.show()
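To mimic MATLAB's xcorr(x, y, 'coeff') scaling (so the peak of an autocorrelation equals 1) and to read off the lag of maximum correlation, as the question asks, one possible extension of the above (a sketch of my own):
# 'coeff'-style normalization: divide by sqrt(Rxx(0) * Ryy(0))
corr_norm = c / np.sqrt(np.dot(x, x) * np.dot(y, y))
best_lag = lags[np.argmax(corr_norm)]
print(best_lag)  # the lag at which the correlation peaks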
This code will help in finding the delay between two channels in an audio file (it reads the WAV with the soundfile package):
import numpy as np
import soundfile as sf

xin, fs = sf.read('recording1.wav')
frame_len = int(fs * 5 * 1e-3)  # 5 ms frame length in samples
dim_x = xin.shape
M = dim_x[0]  # no. of rows (samples)
N = dim_x[1]  # no. of cols (channels)
sample_lim = frame_len * 100
tau = [0]
M_lim = 20000  # for testing, as processing takes time

for i in range(1, N):
    c = np.correlate(xin[0:M_lim, 0], xin[0:M_lim, i], "full")
    maxlags = M_lim - 1
    c = c[M_lim - 1 - maxlags: M_lim + maxlags]
    Rmax_pos = np.argmax(c)
    pos = Rmax_pos - M_lim + 1  # lag of the correlation peak = delay between the channels
    tau.append(pos)
print(tau)
