I want to ask a quick question about regression analysis in Python pandas.
So, assume that I have the following dataset:
Group Y X
1 10 6
1 5 4
1 3 1
2 4 6
2 2 4
2 3 9
My aim is to run a regression; Y is the dependent and X the independent variable. The issue is that I want to run this regression by Group and print the coefficients in a new dataset. So, the results should look like:
Group Coefficient
1 0.25 (let's assume that the coefficient is 0.25)
2 0.30
I hope I have explained my question clearly.
Many thanks in advance for your help.
I am not sure about the type of regression you need, but this is how you do an OLS (ordinary least squares) regression:
import pandas as pd
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy to avoid mutating the group slice
    X['intercept'] = 1.0    # add a constant term
    result = sm.OLS(Y, X).fit()
    return result.params

# This is what you need
df.groupby('Group').apply(regress, 'Y', ['X'])
You can define your own regression function and pass parameters to it as shown above.
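For completeness, a minimal end-to-end run on the sample frame from the question (the fitted numbers are whatever OLS yields for these six rows, not the 0.25/0.30 placeholders):
df = pd.DataFrame({'Group': [1, 1, 1, 2, 2, 2],
                   'Y': [10, 5, 3, 4, 2, 3],
                   'X': [6, 4, 1, 6, 4, 9]})

coefs = df.groupby('Group').apply(regress, 'Y', ['X'])
print(coefs)  # one row per Group, with an 'X' (slope) and an 'intercept' column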
I need to fit a linear regression and sum all the predictions. Maybe this isn't a question for scikit-learn but for NumPy, because I get an array at the end and I am unable to turn it into a float.
df
rank Sales
0 1 18000
1 2 17780
2 3 17870
3 4 17672
4 5 17556
import numpy as np
from sklearn.linear_model import LinearRegression

x = df['rank'].to_numpy()
y = df['Sales'].to_numpy()
X = x.reshape(-1, 1)  # sklearn expects a 2D feature array
regression = LinearRegression().fit(X, y)
I am getting it right up to this point. The next part (which is a while loop to sum all the values) is not working:
number_predictions = 100
x_current_prediction = 1
total_sales = 0
while x_current_prediction <= number_predictions:
    variable_sum = x_current_prediction*regression.coef_
    variable_sum_float = variable_sum.astype(np.float_)
    total_sales = total_sales + variable_sum_float
    x_current_prediction =+1
return total_sales
I think the problem is getting regression.coef_ to be a float, but when I use astype, it does not work.
You don't need to loop like this, and you don't need to use the coefficient to compute the prediction (don't forget there may be an intercept as well).
Instead, make an array of all the values of x you want to predict for, and ask sklearn for the predictions:
X_new = np.arange(1, 101).reshape(-1, 1) # X must be 2D.
y_pred = regression.predict(X_new)
If you want to add all these numbers together, use y_pred.sum() or np.sum(y_pred), or if you want a cumulative sum, np.cumsum(y_pred) will do it.
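Putting it together, a self-contained sketch using the five rows shown above:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'rank': [1, 2, 3, 4, 5],
                   'Sales': [18000, 17780, 17870, 17672, 17556]})

X = df['rank'].to_numpy().reshape(-1, 1)   # X must be 2D for sklearn
y = df['Sales'].to_numpy()
regression = LinearRegression().fit(X, y)

X_new = np.arange(1, 101).reshape(-1, 1)   # ranks 1..100
y_pred = regression.predict(X_new)
total_sales = float(y_pred.sum())          # a plain Python float, no loop needed
print(total_sales)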
count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
I have this dataframe, named df2_norm, in Python. I compute the slope with the following code:
allowableCorr = self.df2_norm.corr(method = 'pearson')
self.slope = allowableCorr * (self.df2_norm.std().values / self.df2_norm.std().values[:, np.newaxis])
Q1) How do I compute the y-intercept, using pandas, numpy and matplotlib only, into a matrix that is like a heat/correlation map?
Q2) Is there a way to compute the scatter plot for each column as the train data and the rest as the test data?
Thank you.
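For Q1, one route: since an OLS fit gives intercept b = ȳ − m·x̄, the slope matrix above extends to an intercept matrix by broadcasting the column means. A minimal sketch, assuming slope holds the matrix from the snippet above (entry [i, j] being the slope of regressing column j on column i):
import numpy as np
import pandas as pd

means = df2_norm.mean()
# intercept[i, j] = mean(col_j) - slope[i, j] * mean(col_i)
intercept = means.values - slope.values * means.values[:, np.newaxis]
intercept = pd.DataFrame(intercept, index=slope.index, columns=slope.columns)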
I've got a 3-class classification problem. Let's define the classes as 0, 1 and 2. In my case, class 0 is not important - that is, whatever gets classified as class 0 is irrelevant. What's relevant, however, is the accuracy, precision, recall, and error rate only for classes 1 and 2. I would like to define an accuracy metric that only looks at the subsection of the data relating to classes 1 and 2 and gives me a measure of that as the model is training. I am not asking for code for accuracy or F1 or precision/recall - those I've found and can implement myself. What I'm asking for is code that can help select a subsection of the categories to perform these metrics on.
Visually, with a confusion matrix:
Given:
     0   1   2
0   10   3   4
1    2   5   1
2    8   5   9
I would like to perform an in-training accuracy measure for the following subset only:
     1   2
1    5   1
2    5   9
Possible idea:
Concatenate a categorized, argmaxed y_pred and argmaxed y_true, drop all instances where 0 appears, re-unravel them back into a one_hot array, and do a simple binary accuracy on what remains?
Edit:
I've tried to exclude the 0-class with the code below, but it doesn't make sense: the 0-category effectively gets wrapped into the 1-category (that is, the true positives of both 0 and 1 end up being labeled as 1). Still looking for help - can anybody help out, please?
# this solution does not work :(
def my_acc(y_true, y_pred):
    # excluding the 0-category
    y_true_cust = y_true[:, np.r_[1:3]]
    y_pred_cust = y_pred[:, np.r_[1:3]]
    # binary accuracy source code, slightly edited
    y_pred_cat = Ker.round(y_pred_cust)
    eql_cust = Ker.equal(y_true_cust, y_pred_cust)
    return Ker.mean(eql_cust, axis=-1)
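For what it's worth, a minimal sketch of the idea from the edit: argmax both tensors, mask out samples whose true class is 0, and compare what remains. It assumes one-hot y_true, softmax y_pred, and a TensorFlow backend:
import tensorflow as tf

def my_acc(y_true, y_pred):
    true_cls = tf.argmax(y_true, axis=-1)
    pred_cls = tf.argmax(y_pred, axis=-1)
    keep = tf.not_equal(true_cls, 0)  # drop samples whose true class is 0
    matches = tf.equal(tf.boolean_mask(true_cls, keep),
                       tf.boolean_mask(pred_cls, keep))
    return tf.reduce_mean(tf.cast(matches, tf.float32))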
# Ashwin Geet D'Sa
correct_guesses_3cat = 10 + 5 + 9  # diagonal of the full matrix
print(correct_guesses_3cat)
# 24
total_guesses_3cat = 10+3+4+2+5+1+8+5+9  # every cell of the full matrix
print(total_guesses_3cat)
# 47
accuracy_3cat = 24/47
print(accuracy_3cat)
# 0.511 -> 51.1 %
correct_guesses_2cat = 5 + 9  # diagonal of the class-1/class-2 block
print(correct_guesses_2cat)
# 14
total_guesses_2cat = 5+1+5+9  # every cell of the block
print(total_guesses_2cat)
# 20
accuracy_2cat = 14/20
print(accuracy_2cat)
# 0.7 -> 70.0 %
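The same arithmetic falls out of slicing the confusion matrix directly, e.g. with NumPy:
import numpy as np

cm = np.array([[10, 3, 4],
               [2, 5, 1],
               [8, 5, 9]])
sub = cm[1:, 1:]                  # keep only the class-1/class-2 block
print(np.trace(sub) / sub.sum())  # 0.7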
I have got the following problem.
Let's assume that we have a data frame with a few variables. Moreover, one variable (var_A) is a probability score - its values range from 0 to 1. I want to sample rows from this data frame in such a way that it is more probable to pick a row with a higher value of var_A - so I guess that I have to draw from the empirical distribution of var_A. I know how to implement the edf function of var_A, as suggested here, but I have no idea how to use this distribution for sampling rows.
Can you please help me with this?
Thanks
You can use numpy.random.choice to sample in this manner:
import numpy as np
num_dists = 4
num_samples = 10
var_A = np.random.uniform(0, 1, num_dists)
# ensure var_A sums to 1
var_A /= np.sum(var_A)
samples = np.random.choice(len(var_A), num_samples, p=var_A)
print('var_A: ', var_A)
print('samples: ', samples)
Sample output:
var_A: [ 0.23262621 0.02990421 0.22357316 0.51389642]
samples: [3 0 0 2 0 0 2 3 3 2]
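If the goal is specifically to sample DataFrame rows, pandas can apply the weighting directly via DataFrame.sample; a minimal sketch, assuming a frame df with a var_A column:
# weights need not sum to 1; pandas normalizes them internally
sampled = df.sample(n=10, replace=True, weights='var_A')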
I want to run a rolling-window OLS regression estimation (with a window of, for example, 3) for a dataset found at this link (https://drive.google.com/drive/folders/0B2Iv8dfU4fTUMVFyYTEtWXlzYkk), in the following format. The third column (Y) in my dataset is my true value - that's what I want to predict (estimate).
time X Y
0.000543 0 10
0.000575 0 10
0.041324 1 10
0.041331 2 10
0.041336 3 10
0.04134 4 10
...
9.987735 55 239
9.987739 56 239
9.987744 57 239
9.987749 58 239
9.987938 59 239
Using a simple OLS regression estimation, I have tried the following script.
# /usr/bin/python -tt
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('estimated_pred.csv')
model = pd.stats.ols.MovingOLS(y=df.Y, x=df[['X']],
                               window_type='rolling', window=3, intercept=True)
df['Y_hat'] = model.y_predict
print(df['Y_hat'])
print (model.summary)
df.plot.scatter(x='X', y='Y', s=0.1)
However, either statsmodels or scikit-learn seems to be a better choice for anything beyond a simple regression. I have tried to make the following script work, but it returns IndexError: index out of bounds for larger subsets of the dataset (for example, for more than 100 rows).
# /usr/bin/python -tt
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv('estimated_pred.csv')
df = df.dropna()  # to drop nans in case there are any
window = 3
#print(df.index)  # to print index
df['a'] = None   # constant
df['b1'] = None  # beta1
df['b2'] = None  # beta2
for i in range(window, len(df)):
    temp = df.iloc[i-window:i, :]
    RollOLS = sm.OLS(temp.loc[:, 'Y'], sm.add_constant(temp.loc[:, ['time', 'X']])).fit()
    df.iloc[i, df.columns.get_loc('a')] = RollOLS.params[0]
    df.iloc[i, df.columns.get_loc('b1')] = RollOLS.params[1]
    df.iloc[i, df.columns.get_loc('b2')] = RollOLS.params[2]

# The following line gives us predicted values in a row, given the PRIOR row's estimated parameters
df['predicted'] = df['a'].shift(1) + df['b1'].shift(1)*df['time'] + df['b2'].shift(1)*df['X']
print(df['predicted'])
#print(df['b2'])
#print(RollOLS.predict(sm.add_constant(predict_x)))
print(temp)
I want to predict the current value of Y according to the previous 3 rolling values of X. Finally, I want to include the mean squared error (MSE) for all the predictions (a summary of the regression analysis). For example, if we look at row 5, the value of X is 2 and the value of Y is 10. Let's say the predicted value of Y at that row is 6; the squared error is then (10-6)^2. How can we do this using either statsmodels or scikit-learn, given that pd.stats.ols.MovingOLS was removed in pandas version 0.20.0 and I can't find any reference?
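One possible route, a minimal sketch assuming statsmodels >= 0.11 (which ships a vectorized replacement for the removed MovingOLS) and the same estimated_pred.csv; it also scores the one-step-ahead predictions with MSE:
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

df = pd.read_csv('estimated_pred.csv').dropna()
X = sm.add_constant(df[['time', 'X']])
params = RollingOLS(df['Y'], X, window=3).fit().params  # one (const, b1, b2) row per window end

# predict each row from the PRIOR row's estimated parameters
df['predicted'] = (params.shift(1) * X).sum(axis=1, skipna=False)

mse = ((df['Y'] - df['predicted']) ** 2).mean()  # NaN rows at the start are skipped
print(mse)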