python sklearn multiple linear regression display r-squared

I calculated my multiple linear regression equation and I want to see the adjusted R-squared. I know that the score function allows me to see r-squared, but it is not adjusted.
import pandas as pd #import the pandas module
import numpy as np
df = pd.read_csv ('/Users/jeangelj/Documents/training/linexdata.csv', sep=',')
df
AverageNumberofTickets NumberofEmployees ValueofContract Industry
0 1 51 25750 Retail
1 9 68 25000 Services
2 20 67 40000 Services
3 1 124 35000 Retail
4 8 124 25000 Manufacturing
5 30 134 50000 Services
6 20 157 48000 Retail
7 8 190 32000 Retail
8 20 205 70000 Retail
9 50 230 75000 Manufacturing
10 35 265 50000 Manufacturing
11 65 296 75000 Services
12 35 336 50000 Manufacturing
13 60 359 75000 Manufacturing
14 85 403 81000 Services
15 40 418 60000 Retail
16 75 437 53000 Services
17 85 451 90000 Services
18 65 465 70000 Retail
19 95 491 100000 Services
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
model.score(X, y)
>>0.87764337132340009
I checked it manually and 0.87764 is R-squared; whereas 0.863248 is the adjusted R-squared.

There are several ways to compute R^2 and the adjusted R^2. Here are a few of them (computed with the data you provided):
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
# SST = SSR + SSE (see the reference definitions of these sums of squares)
# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)
# 0.877643371323 0.863248473832
# compute with sklearn's model.score; I could not find a function in the documentation that returns the adjusted R-squared directly
print(model.score(X, y), 1 - (1-model.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1))
# 0.877643371323 0.863248473832
Another way:
# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
# print(dir(result))
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
Yet another way:
# compute with statsmodels, another way, using formula
import statsmodels.formula.api as smf
result = smf.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
# print(result.summary())
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
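The adjustment used in the snippets above can also be wrapped in a small helper so it is not retyped each time. This is only a sketch: the function name adjusted_r2 and its argument names are made up for illustration, and it assumes the model includes an intercept and that n_samples > n_features + 1.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with an intercept and n_features predictors.

    Hypothetical helper, not part of scikit-learn.
    """
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# R^2 reported above, for 20 rows and 2 predictors.
adj = adjusted_r2(0.877643371323, n_samples=20, n_features=2)
print(round(adj, 6))  # 0.863248
```

With a fitted scikit-learn model this could be called as adjusted_r2(model.score(X, y), *X.shape), since X.shape is (n_samples, n_features).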

regressor = LinearRegression(fit_intercept=False)
regressor.fit(x_train, y_train)
print(f'r_sqr value: {regressor.score(x_train, y_train)}')

Related

Errors attempting to use linearmodels.panel.PanelOLS entity effects (not time effects)

I have a Pandas DataFrame like (abridged):
       age   gender  control  county
11877  67.0  F       0        AL-Calhoun
11552  60.0  F       0        AL-Coosa
11607  60.0  F       0        AL-Talladega
13821  NaN   NaN     1        AL-Mobile
11462  59.0  F       0        AL-Dale
I want to run a linear regression with fixed effects by county entity (not by time) to balance check my control and treatment groups for an experimental design, such that my dependent variable is membership in the treatment group (control = 1) or not (control = 0).
In order to do this, so far as I have seen I need to use linearmodels.panel.PanelOLS and set my entity field (county) as my index.
So far as I'm aware my model should look like this:
# set index on entity effects field:
to_model = to_model.set_index(["county"])
# implement fixed effects linear model
model = PanelOLS.from_formula("control ~ age + gender + EntityEffects", to_model)
When I try to do this, I get the below error:
ValueError: The index on the time dimension must be either numeric or date-like
I have seen a lot of implementations of such models online and they all seem to use a temporal effect, which is not relevant in my case. If I try to encode my county field using numerics, I get a different error.
# create a dict to map county values to numerics
county_map = dict(zip(to_model["county"].unique(), range(len(to_model.county.unique()))))
# create a numeric column as alternative to county
to_model["county_numeric"] = to_model["county"].map(county_map)
# set index on numeric entity effects field
to_model = to_model.set_index(["county_numeric"])
FactorEvaluationError: Unable to evaluate factor `control`. [KeyError: 'control']
How am I able to implement this model using the county as a unit fixed effect?

Assuming you have multiple entries for each county, then you could use the following. The key step is to use a groupby transform to create a distinct numeric index for each county which can be used as a fake time index.
import numpy as np
import pandas as pd
import string
import linearmodels as lm
# Generate a random DataFrame
rs = np.random.default_rng(1213892)
counties = rs.choice([c for c in string.ascii_lowercase], (1000, 3))
counties = np.array([["".join(c)] * 10 for c in counties]).ravel()
age = rs.integers(18, 65, (10 * 1000))
gender = rs.choice(["m", "f"], size=(10 * 1000))
control = rs.integers(0, 2, size=10 * 1000)
df = pd.DataFrame(
{"counties": counties, "age": age, "gender": gender, "control": control}
)
# Construct a dummy numeric index for each county
numeric_index = df.groupby("counties").age.transform(lambda c: np.arange(len(c)))
df["numeric_index"] = numeric_index
df = df.set_index(["counties","numeric_index"])
# Take a look
df.head(15)
age gender control
counties numeric_index
qbt 0 51 m 1
1 36 m 0
2 28 f 1
3 28 m 0
4 47 m 0
5 19 m 1
6 32 m 1
7 54 m 0
8 36 m 1
9 52 m 0
nub 0 19 m 0
1 57 m 0
2 49 f 0
3 53 m 1
4 30 f 0
This just shows that the model can be estimated.
# Fit the model
# Note: Results are meaningless, just shows that this works
mod = lm.PanelOLS.from_formula("control ~ age + gender + EntityEffects", data=df)
mod.fit()
PanelOLS Estimation Summary
================================================================================
Dep. Variable: control R-squared: 0.0003
Estimator: PanelOLS R-squared (Between): 0.0005
No. Observations: 10000 R-squared (Within): 0.0003
Date: Thu, May 12 2022 R-squared (Overall): 0.0003
Time: 11:08:00 Log-likelihood -6768.3
Cov. Estimator: Unadjusted
F-statistic: 1.4248
Entities: 962 P-value 0.2406
Avg Obs: 10.395 Distribution: F(2,9036)
Min Obs: 10.0000
Max Obs: 30.000 F-statistic (robust): 2287.4
P-value 0.0000
Time periods: 30 Distribution: F(2,9036)
Avg Obs: 333.33
Min Obs: 2.0000
Max Obs: 962.00
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
age -0.0002 0.0004 -0.5142 0.6072 -0.0010 0.0006
gender[T.f] 0.5191 0.0176 29.559 0.0000 0.4847 0.5535
gender[T.m] 0.5021 0.0175 28.652 0.0000 0.4678 0.5365
===============================================================================
F-test for Poolability: 0.9633
P-value: 0.7768
Distribution: F(961,9036)
Included effects: Entity
PanelEffectsResults, id: 0x2246f38a9d0

Trouble Trying to get an Anova Test in Python; (AttributeError: 'Summary' object has no attribute 'model' ) Error

The data come from a group of 25 randomly selected patients at a hospital. In addition to satisfaction, data were collected on patient age and an index that measured the severity of illness.
(a) Fit a linear regression model relating satisfaction to patient age. DONE
(b) Test for significance of regression. (Need to get Anova Table)
from pandas import DataFrame
import statsmodels.api as sm
from statsmodels.formula.api import ols
Stock_Market = {'Satisfaction': [68,77,96,80,43,44,26,88,75,57,56,88,88,102,88,70,52,43,46,56,59,26,52,83,75],
'Age': [55,46,30,35,59,61,74,38,27,51,53,41,37,24,42,50,58,60,62,68,70,79,63,39,49],
'Severity': [50,24,46,48,58,60,65,42,42,50,38,30,31,34,30,48,61,71,62,38,41,66,31,42,40],
}
df = DataFrame(Stock_Market,columns=['Satisfaction','Age','Severity'])
X = df[['Age','Severity']]
Y = df['Satisfaction']
X = sm.add_constant(X)
print(X)
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
aov_table = sm.stats.anova_lm(print_model, typ=2)

You need to reshape the DataFrame into a form suitable for the statsmodels package:
In [117]: df_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['Satisfaction', 'Age', 'Severity'])
In [118]: df_melt
Out[118]:
index variable value
0 0 Satisfaction 68
1 1 Satisfaction 77
2 2 Satisfaction 96
3 3 Satisfaction 80
4 4 Satisfaction 43
.. ... ... ...
70 20 Severity 41
71 21 Severity 66
72 22 Severity 31
73 23 Severity 42
74 24 Severity 40
[75 rows x 3 columns]
In [120]: df_melt.columns = ['index', 'categories', 'value']
In [121]: model = ols('value ~ C(categories)', data=df_melt).fit()
In [122]: anova_table = sm.stats.anova_lm(model, typ=2)
In [123]: anova_table
Out[123]:
sum_sq df F PR(>F)
C(categories) 5198.906667 2.0 9.304327 0.000255
Residual 20115.440000 72.0 NaN NaN

Your print_model is the return from summary().
Use your model, i.e. the results instance returned from OLS.fit in anova_lm.
The error message in the title indicates the problem:
AttributeError: 'Summary' object has no attribute 'model'
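As a cross-check on whatever anova_lm returns, the ANOVA quantities for the single-predictor model in part (a) can be computed by hand. The sketch below uses only the standard library and the Satisfaction and Age values from the question; the function name simple_regression_anova and the dictionary keys are made up for illustration. It stops at the F statistic, because a p-value needs the F distribution, which the standard library does not provide.

```python
# Data copied from the question.
satisfaction = [68, 77, 96, 80, 43, 44, 26, 88, 75, 57, 56, 88, 88, 102, 88,
                70, 52, 43, 46, 56, 59, 26, 52, 83, 75]
age = [55, 46, 30, 35, 59, 61, 74, 38, 27, 51, 53, 41, 37, 24, 42,
       50, 58, 60, 62, 68, 70, 79, 63, 39, 49]


def simple_regression_anova(x, y):
    """ANOVA decomposition for y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    ss_regression = slope * s_xy              # variation explained by the line
    ss_residual = ss_total - ss_regression    # variation left unexplained
    df_regression, df_residual = 1, n - 2
    f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
    return {
        "slope": slope,
        "intercept": intercept,
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "ss_total": ss_total,
        "df_regression": df_regression,
        "df_residual": df_residual,
        "f_stat": f_stat,
    }


table = simple_regression_anova(age, satisfaction)
for name, value in table.items():
    print(f"{name}: {value}")
```

The F statistic is then compared against an F(1, 23) distribution to test the significance of the regression.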

Why does Analyse Data in Excel give different result from OLS Stats Model in Python?

I'm trying to predict sales with multiple linear regression with variables X1 = customers and X2 = KiloWattHour(kWh). But when I try in Excel and try in Python, the results are different.
Data in Excel:
Sales (Y) KWH (X1) Customer(X2)
2,72 3,13 174
2,59 3,03 175
2,81 3,28 175
2,66 3,14 117
2,80 3,29 87
2,71 3,13 74
2,93 3,33 68
2,71 3,10 104
Data in CSV imported to Python:
Sales (Y) KWH (X1) Customer(X2)
2.72 3.13 174
2.59 3.03 175
2.81 3.28 175
2.66 3.14 117
2.80 3.29 87
2.71 3.13 74
2.93 3.33 68
2.71 3.10 104
Code for reading the CSV file:
import pandas as pd
import numpy as np
from sklearn import linear_model
import statsmodels.api as sm
data = pd.read_csv('/code/master_data.csv')
print(data)
This is code for prediction using linear regression:
x = data[['kwhpenjualan','totalpelanggan']]
y = data['totalpendapatan']
x_1 = sm.add_constant(x)
model = sm.OLS(y, x_1)
result = model.fit()
result.params
This is the result in Excel:
Intercept -2,345215066
KWH (X1) 1,618236605
Customer (X2) 0,002576039
This is the result in Python:
Intercept 127.619065
KWH -45.949302
Customer. 50.262137
dtype: float64
Can you help me solve this problem?

What kind of model can I use to forecast this data?

This is the dataset that I have of some orders each week. I want to predict the orders for the rest of the year. I've tried building an ARIMA model and it doesn't work.
Is there any other model that I can try for such a small dataset? Maybe a HMM or try fitting a polynomial curve to it or build a time series LSTM?
FW Order
1 6
2 45
3 59
4 60
5 50
6 115
7 23
8 44
9 164
10 8
11 30
12 20
13 0
14 50
15 60
16 0
17 50
18 30
19 115
20 75
21 54
22 29
23 124
24 32
25 28

Here's a plot of your data. Your main problem is that there isn't really enough data for any model to give you meaningful predictions with statistical significance. Your data mostly just looks like white noise around a mean, so you'd represent it with:
x_t = mu + e
where e is an error term representing white noise.
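As a rough sketch of what this baseline implies, the forecast for every future week is just the sample mean, with the sample standard deviation as a measure of the noise. The 27-week horizon (weeks 26 to 52) is an assumption about what "the rest of the year" means, and the variable names are made up for illustration.

```python
import statistics

# Weekly orders copied from the question (weeks 1 to 25).
orders = [6, 45, 59, 60, 50, 115, 23, 44, 164, 8, 30, 20, 0,
          50, 60, 0, 50, 30, 115, 75, 54, 29, 124, 32, 28]

mu = statistics.mean(orders)      # estimate of the constant level
sigma = statistics.stdev(orders)  # sample standard deviation of the noise term

# Assumption: a 52-week year, so weeks 26..52 remain to be forecast.
remaining_weeks = list(range(26, 53))
forecast = [mu] * len(remaining_weeks)

print(f"forecast per week: {mu:.2f} (noise std: {sigma:.2f})")
```

Under this model the point forecast never changes; only the uncertainty band of roughly one or two standard deviations around the mean carries information.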
There is a hint of mean reversion, so you could try an Ornstein Uhlenbeck model:
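dx_t = theta * (mu - x_t) dt + sigma * dW_t
where theta is the speed of mean reversion (called lambda_ in the code below), mu is the long-run mean and sigma is the volatility.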
https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process
Here's it coded up. Orange line is the prediction. Again, the prediction isn't great, but you probably won't find much better without more data.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
def least_squares_naive(s, delta=1.0):
    # Regress the change in the series on its lagged level: dx_t = b + a * x_{t-1}
    y = s.diff().iloc[1:]
    x = s.shift(1).iloc[1:]
    res = sm.OLS(y, sm.add_constant(x)).fit()
    b, a = res.params  # intercept first, then slope
    residuals = y - (a * x + b)
    se = residuals.std(ddof=2)
    # Map the regression coefficients to the Ornstein-Uhlenbeck parameters
    lambda_ = -a / delta
    mu_ = b / (lambda_ * delta)
    sigma_ = se / (delta ** 0.5)
    return mu_, lambda_, sigma_

orders = [6,45,59,60,50,115,23,44,164,8,30,20,0,50,60,0,50,30,115,75,54,29,124,32,28]
s = pd.Series(orders)
mu_, lambda_, sigma_ = least_squares_naive(s)
# One-step-ahead prediction: pull each value back towards the estimated mean
dx = -lambda_ * (s - mu_)
pred = (s + dx).shift()
s.plot()
pred.plot()
plt.show()
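To get a feel for what mean reversion looks like, the Ornstein-Uhlenbeck equation can be simulated with a simple Euler discretisation. This is a sketch only: the function name simulate_ou is made up, and the parameter values are illustrative placeholders, not estimates from the order data.

```python
import random


def simulate_ou(x0, mu, theta, sigma, n_steps, dt=1.0, seed=None):
    """Euler scheme for dx = theta * (mu - x) dt + sigma dW (hypothetical helper)."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        # Brownian increment over a step of length dt has std sqrt(dt)
        noise = sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        path.append(x + theta * (mu - x) * dt + noise)
    return path


# Without noise the path decays geometrically towards mu.
noiseless = simulate_ou(x0=0.0, mu=50.0, theta=0.5, sigma=0.0, n_steps=50)
print(noiseless[:4])  # [0.0, 25.0, 37.5, 43.75]

# With noise the path scatters around mu; the values depend on the seed.
noisy = simulate_ou(x0=0.0, mu=50.0, theta=0.5, sigma=30.0, n_steps=25, seed=0)
```

When sigma is large relative to theta, as in the second call, the simulated path is hard to tell apart from white noise around the mean, which is the caveat raised above about the short order series.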

TypeError: '<' not supported between instances of 'str' and 'int' while doing PCA for k-means clustering

I am trying to apply Kernel Principal Component Analysis to a dataset without a dependent variable, so that I can learn how to run a cluster analysis with k-means. Here is a sample of my dataset (in this scenario, a shopping mall wants to discover its customer segments from the data below):
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
First, I omitted CustomerID column and then encoded the gender column to be able to apply kernel PCA. Here is how I did it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the mall dataset with pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, 1:5].values
df = pd.DataFrame(X)
#df is in order to visualize the "X" on variable explorer
#Encoding independent categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
After executing this code, I could get the array with float64 Type. The sample from the array I created is below:
0 1 19 15 39
0 1 21 15 81
1 0 20 16 6
1 0 23 16 77
1 0 31 17 40
1 0 22 17 76
1 0 35 18 6
1 0 23 18 94
0 1 64 19 3
1 0 30 19 72
0 1 67 19 14
And then, I wanted to apply Kernel PCA to get the principal components which I will use at k-means. However, when I try to execute the code below, I get the error "TypeError: '<' not supported between instances of 'str' and 'int'".
# Applying Kernel PCA
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 'None', kernel = 'rbf')
X = kpca.fit_transform(X)
explained_variance = kpca.explained_variance_ratio_
Since I encoded my categorical data and have no strings left in my dataset, I cannot understand why it gives this error. Can anyone help?
Thank you very much in advance.

n_components = 'None' is the problem: it is a string, and the parameter expects an integer or the None object.
Use:
kpca = KernelPCA(n_components = None, kernel = 'rbf')
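The error message itself can be reproduced without scikit-learn. The sketch below assumes that the estimator compares n_components against an integer at some point; that is a guess about the library's internals, based only on the wording of the error.

```python
n_components = "None"  # the string from the question, not the None object

try:
    n_components < 1  # assumed to mimic an internal comparison with an int
except TypeError as exc:
    message = str(exc)
    print(message)  # '<' not supported between instances of 'str' and 'int'

# The string 'None' and the None object are different things.
print(n_components is None)  # False
```

This would explain why the error mentions a string even though the data array contains only numbers: the string is in the estimator's arguments, not in X.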

I suspect this is what is happening:
The error comes from an included file, or from some code that runs before your own code. The '<' in the TypeError refers to a string such as "<error>", which is what something running before your code returns.
