Many questions on this topic have been answered, but there is one thing I could not figure out.
I have a dataframe and I am running a regression on it; the results are then stored in new columns of the Test dataframe. To compare methods I need to do both linear and polynomial regression.
I have found a neat way to do this with linear regression, which gives me the predicted values in a new column of the Test dataframe. But I cannot make this work within the same loop using polynomial regression: the final Test dataframe contains multiple null values, because at the polynomial_features.fit_transform(X) step the values somehow lose the corresponding Test index.
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
Test = pd.read_csv(r'D:\myfile.csv')
df_coef =[]
values = list(set(Test['Value']))
for value in values:
    df_redux = Test[Test['Value'] == value]
    Y = df_redux['Y']
    X = df_redux[['X1', 'A', 'B', 'G1']]
    X = sm.add_constant(X)
    # linear
    model_1 = sm.OLS(Y, X).fit()
    predictions_1 = model_1.predict(X)
    # polynomial
    polynomial_features = PolynomialFeatures(degree=2)
    xp = polynomial_features.fit_transform(X)
    model_2 = sm.OLS(Y, xp).fit()
    predictions_2 = model_2.predict(xp)
    stats_1 = pd.read_html(model_1.summary().tables[1].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model_2.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions_1 = pd.DataFrame(predictions_1, columns=['lin'])
    predictions_2 = pd.DataFrame(predictions_2, columns=['poly'])
    # ??? how to concat and append both predictions_1 and predictions_2 in the same df_coef = [] list?
    gf = pd.concat([predictions_1, df_redux], axis=1)
    df_coef.append(gf)
all_coef = pd.concat(df_coef)
type(all_coef)
Out[234]: pandas.core.frame.DataFrame
The problem is that the transformed xp is of type <class 'numpy.ndarray'>, whereas X is of type <class 'pandas.core.frame.DataFrame'>. The question is: how can I get the polynomial regression predictions into a new column of Test, next to the linear regression results? This is probably really simple, but I could not figure it out.
print(type(X))
print(type(xp))
print(X.sample(2))
print()
print(xp)
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
X1 A B G1
962 4.334912 1.945910 3.135494 3.258097
1365 4.197888 2.197225 3.135494 3.332205
[[ 1. 4.77041663 1.94591015 ... 35.74106743 34.52550933
33.35129251]
[ 1. 4.43240629 1.94591015 ... 33.28387641 32.03140262
30.82605947]
[ 1. 3.21669428 1.94591015 ... 29.95821572 30.38903979
30.82605947]
The result I get with the polynomial regression predicted values appended to the original Test dataframe:
0 6.178542 3.0 692 ... 2.079442 4.783216 6.146329
1 6.156108 11.0 692 ... 2.197225 4.842126 6.113682
2 6.071453 12.0 692 ... 2.197225 4.814595 6.052089
3 5.842053 NaN NaN ... NaN NaN NaN
4 4.625762 30.0 692 ... 1.945910 5.018201 5.828946
This is the correct result I obtained using only linear regression, without NaN and with a value in each row, as it is supposed to be:
0 6.151675 3 692 5 ... 3.433987 2.079442 4.783216 6.146329
1 6.132077 11 692 5 ... 3.401197 2.197225 4.842126 6.113682
2 6.068450 12 692 5 ... 3.332205 2.197225 4.814595 6.052089
4 5.819535 30 692 5 ... 3.258097 1.945910 5.018201 5.828946
8 4.761362 61 692 5 ... 2.564949 1.945910 3.889585 4.624973
I solved this by adding a line that converts the numpy array to a pandas Series carrying the index of df_redux. The model statistics come from the statsmodels summary:
import pandas as pd
from statsmodels.api import OLS
predictions_2 = model_2.predict(xp)
predictions_2_series = pd.Series(predictions_2, index=df_redux.index.values)
print(OLS(Y, xp).fit().summary())
Related
I am adding more data to my X_train data as well as to my y_train data in order to retrain my model with more data. I do this using pd.concat(). However, when I train my model using the concatenated dataset I get the following error:
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:1692:
FutureWarning: Feature names only support names that are all strings. Got feature
names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-166-a11464987b97> in <module>
----> 1 model1_pool_preds = model1(LinearSVC(class_weight='balanced',
random_state=42), OneVsRestClassifier, X_train_init_new, y_train_init_new,
X_test_init, y_test_init, X_pool)
6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __array__(self,
dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
ValueError: could not convert string to float:
I suppose this is happening because the data I added to the existing dataframe contains some strings instead of float numbers. How can I convert the entire dataset into float? my code is below:
y_train_init_new = pd.concat([y_train_init, X_pool_labeled.iloc[:, -7:]])
X_train_init_new = pd.concat([X_train_init, X_pool_labeled.iloc[:, 0:27446]])
def model1(model, classifier, X, y, X_test, y_test, X_pool):
    m = model
    clf = classifier(m)
    clf.fit(X, y)
    clf_predictions = clf.predict(X_test)
    C_report = classification_report(y_test, clf_predictions, zero_division=0)
    print(C_report)
    clf_roc_auc = roc_auc_score(y_test, clf_predictions, multi_class='ovr')
    print('AUC: ', clf_roc_auc)
    clf_predictions_pool = clf.predict(X_pool)
    return clf_predictions_pool
model1_pool_preds = model1(LinearSVC(class_weight='balanced', random_state=42),
                           OneVsRestClassifier, X_train_init, y_train_init, X_test_init, y_test_init, X_pool)
How can I convert all the data of the concatenated dataset into float data?
Given a data frame that is entirely strings but can be turned without errors into numbers, you can just call df.astype(float) on the whole lot.
>>> df = pd.DataFrame([str(i) for i in range(0, 1000)], columns=['x'])
>>> df
x
0 0
1 1
2 2
3 3
4 4
.. ...
995 995
996 996
997 997
998 998
999 999
[1000 rows x 1 columns]
>>> df.astype(float)
x
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
.. ...
995 995.0
996 996.0
997 997.0
998 998.0
999 999.0
[1000 rows x 1 columns]
This is more difficult if you have mixed non-numeric columns. Given that such columns can't be used anyway, just drop them and call astype(float) on the remainder.
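For the mixed case, one possible sketch is below. Note that it swaps in a slightly different technique from the one described above: instead of dropping columns by hand, it uses pd.to_numeric with errors="coerce" to turn unparseable entries into NaN, then drops columns that ended up entirely NaN. The toy dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: 'a' is clean, 'b' has one bad entry, 'c' is pure text.
df = pd.DataFrame({
    "a": ["1", "2", "3"],
    "b": ["0.5", "x", "2.5"],
    "c": ["u", "v", "w"],
})

# Coerce every column to numbers; values that cannot be parsed become NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# Drop columns with no usable numbers at all, then make everything float.
numeric = converted.dropna(axis=1, how="all").astype(float)

print(numeric.dtypes)
print(numeric)
```

Any NaN left behind (here the "x" in column b) still has to be imputed or dropped before the data go into an estimator that cannot handle missing values.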
Let's say I have three statsmodels OLS objects that I want to compare side by side. I can use summary_col to create a summary table that I can print out as text or export to latex.
How can I export this table as csv?
Here's a replicable example of what I want to do:
# Libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
# Load silly data and add constant
df = sm.datasets.stackloss.load_pandas().data
df['CONSTANT'] = 1
# Train three silly models
m0 = sm.OLS(df['STACKLOSS'], df[['CONSTANT','AIRFLOW']]).fit()
m1 = sm.OLS(df['STACKLOSS'], df[['CONSTANT','AIRFLOW','WATERTEMP']]).fit()
m2 = sm.OLS(df['STACKLOSS'], df[['CONSTANT','AIRFLOW','WATERTEMP','ACIDCONC']]).fit()
# Results table
res = summary_col([m0,m1,m2], regressor_order=m2.params.index.tolist())
print(res)
================================================
STACKLOSS I STACKLOSS II STACKLOSS III
------------------------------------------------
CONSTANT -44.1320 -50.3588 -39.9197
(6.1059) (5.1383) (11.8960)
AIRFLOW 1.0203 0.6712 0.7156
(0.1000) (0.1267) (0.1349)
WATERTEMP 1.2954 1.2953
(0.3675) (0.3680)
ACIDCONC -0.1521
(0.1563)
================================================
Standard errors in parentheses.
Is there a way to export res to csv?
The results are stored as a list of data frames:
res.tables
[ STACKLOSS I STACKLOSS II STACKLOSS III
CONSTANT -44.1320 -50.3588 -39.9197
(6.1059) (5.1383) (11.8960)
AIRFLOW 1.0203 0.6712 0.7156
(0.1000) (0.1267) (0.1349)
WATERTEMP 1.2954 1.2953
(0.3675) (0.3680)
ACIDCONC -0.1521
(0.1563)
R-squared 0.8458 0.9088 0.9136
R-squared Adj. 0.8377 0.8986 0.8983]
This should work:
res.tables[0].to_csv("test.csv")
pd.read_csv("test.csv")
Unnamed: 0 STACKLOSS I STACKLOSS II STACKLOSS III
0 CONSTANT -44.1320 -50.3588 -39.9197
1 NaN (6.1059) (5.1383) (11.8960)
2 AIRFLOW 1.0203 0.6712 0.7156
3 NaN (0.1000) (0.1267) (0.1349)
4 WATERTEMP NaN 1.2954 1.2953
5 NaN NaN (0.3675) (0.3680)
6 ACIDCONC NaN NaN -0.1521
7 NaN NaN NaN (0.1563)
8 R-squared 0.8458 0.9088 0.9136
9 R-squared Adj. 0.8377 0.8986 0.8983
I am not comfortable with Python; I am much less intimidated by, and more at ease with, R. So indulge me with a silly question that has cost me a ton of searches without success.
I want to fit a regression model with sklearn, both with OLS and with lasso. In particular, I like the mtcars dataset that is so easy to call in R, and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
mpg cyl disp hp drat ... qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 ... 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 ... 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 ... 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 ... 19.44 1 0 3 1
In trying to use LinearRegression() the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df as the regressors x, and a column to be the dependent variable y. For example, I'd like to get an x matrix that includes a column of 1's (for the intercept) as well as disp and qsec (numerical variables) and cyl (categorical variable). For the dependent variable, I'd like to use mpg.
If it were possible to word it this way, it would look like
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the names of the columns are gone, and it is hard to tell which variable corresponds to each coefficient. And I still haven't found a simple way to get diagnostic values like p-values, or even the R-squared.
You can maybe try patsy which is used by statsmodels:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mat = dmatrix("disp + qsec + C(cyl)", mtcars)
It looks like this. We can omit the first column (the intercept), since sklearn's LinearRegression fits an intercept by default:
mat
DesignMatrix with shape (32, 5)
Intercept C(cyl)[T.6] C(cyl)[T.8] disp qsec
1 1 0 160.0 16.46
1 1 0 160.0 17.02
1 0 0 108.0 18.61
1 1 0 258.0 19.44
1 0 1 360.0 17.02
X = pd.DataFrame(mat[:,1:],columns = mat.design_info.column_names[1:])
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,mtcars['mpg'])
However, the coefficients in model.coef_ are not labelled. You can put them into a Series to read them:
pd.Series(model.coef_,index = X.columns)
C(cyl)[T.6] -5.087564
C(cyl)[T.8] -5.535554
disp -0.025860
qsec -0.162425
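The same design matrix can be passed to Lasso, which the question also asks about. The sketch below is only an illustration: it uses synthetic data with mtcars-like column names (so it runs without downloading the dataset), and the alpha and max_iter values are arbitrary choices, not recommendations.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix
from sklearn.linear_model import Lasso

# Synthetic stand-in for mtcars (hypothetical values, same column names).
rng = np.random.default_rng(1)
n = 60
data = pd.DataFrame({
    "disp": rng.uniform(70, 470, n),
    "qsec": rng.uniform(14, 23, n),
    "cyl": rng.choice([4, 6, 8], n),
})
data["mpg"] = (35 - 0.03 * data["disp"] - 0.1 * data["qsec"]
               - 0.5 * data["cyl"] + rng.normal(scale=1.0, size=n))

# Build the design matrix from a formula and keep the column labels.
mat = dmatrix("disp + qsec + C(cyl)", data, return_type="dataframe")
X = mat.drop(columns="Intercept")  # Lasso fits its own intercept

lasso = Lasso(alpha=0.001, max_iter=10000).fit(X, data["mpg"])
coefs = pd.Series(lasso.coef_, index=X.columns)
print(coefs)
```

Asking patsy for a dataframe via return_type="dataframe" keeps the labels attached, so there is no need to rebuild the column names by hand.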
As for p-values from an sklearn linear regression, there is no ready-made method for that. You can check out these answers; maybe one of them is what you are looking for.
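If switching libraries is acceptable, one alternative for the OLS case is the statsmodels formula interface, which accepts an R-style formula and reports p-values and R-squared directly. This is a swapped-in approach rather than an sklearn solution, and the data below are synthetic with mtcars-like column names, so the numbers themselves mean nothing.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for mtcars (hypothetical values, same column names).
rng = np.random.default_rng(1)
n = 60
data = pd.DataFrame({
    "disp": rng.uniform(70, 470, n),
    "qsec": rng.uniform(14, 23, n),
    "cyl": rng.choice([4, 6, 8], n),
})
data["mpg"] = (35 - 0.03 * data["disp"] - 0.1 * data["qsec"]
               - 0.5 * data["cyl"] + rng.normal(scale=1.0, size=n))

# R-style formula; C(cyl) treats cyl as categorical, intercept is implicit.
fit = smf.ols("mpg ~ disp + qsec + C(cyl)", data=data).fit()

print(fit.params)    # labelled coefficients
print(fit.pvalues)   # labelled p-values
print(fit.rsquared)  # R-squared
```

With the real dataset, data would simply be replaced by the mtcars dataframe loaded earlier.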
Here are two ways, both unsatisfactory, especially because the variable labels seem to be gone once the regression gets going:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
import numpy as np
from sklearn.linear_model import LinearRegression
Single variable regression mpg (d.v.) ~ hp (i.v.):
lm = LinearRegression()
mat = np.matrix(df)
lmFit = lm.fit(mat[:,3], mat[:,0])
print(lmFit.coef_)
print(lmFit.intercept_)
For multiple regression drat ~ wt + cyl + carb:
lmm = LinearRegression()
wt = np.array(df['wt'])
cyl = np.array(df['cyl'])
carb = np.array(df['carb'])
stack = np.column_stack((cyl,wt,carb))
stackmat = np.matrix(stack)
lmFit2 = lmm.fit(stackmat,mat[:,4])
print(lmFit2.coef_)
print(lmFit2.intercept_)
The data come from a group of 25 randomly selected patients at a hospital. In addition to satisfaction, data were collected on patient age and an index that measured the severity of illness.
(a) Fit a linear regression model relating satisfaction to patient age. DONE
(b) Test for significance of regression. (Need to get Anova Table)
from pandas import DataFrame
import statsmodels.api as sm
from statsmodels.formula.api import ols
Stock_Market = {'Satisfaction': [68,77,96,80,43,44,26,88,75,57,56,88,88,102,88,70,52,43,46,56,59,26,52,83,75],
'Age': [55,46,30,35,59,61,74,38,27,51,53,41,37,24,42,50,58,60,62,68,70,79,63,39,49],
'Severity': [50,24,46,48,58,60,65,42,42,50,38,30,31,34,30,48,61,71,62,38,41,66,31,42,40],
}
df = DataFrame(Stock_Market,columns=['Satisfaction','Age','Severity'])
X = df[['Age','Severity']]
Y = df['Satisfaction']
X = sm.add_constant(X)
print(X)
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
aov_table = sm.stats.anova_lm(print_model, typ=2)
You need to reshape the dataframe into a form suitable for the statsmodels package:
In [117]: df_melt = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['Satisfaction', 'Age', 'Severity'])
In [118]: df_melt
Out[118]:
index variable value
0 0 Satisfaction 68
1 1 Satisfaction 77
2 2 Satisfaction 96
3 3 Satisfaction 80
4 4 Satisfaction 43
.. ... ... ...
70 20 Severity 41
71 21 Severity 66
72 22 Severity 31
73 23 Severity 42
74 24 Severity 40
[75 rows x 3 columns]
In [120]: df_melt.columns = ['index', 'categories', 'value']
In [121]: model = ols('value ~ C(categories)', data=df_melt).fit()
In [122]: anova_table = sm.stats.anova_lm(model, typ=2)
In [123]: anova_table
Out[123]:
sum_sq df F PR(>F)
C(categories) 5198.906667 2.0 9.304327 0.000255
Residual 20115.440000 72.0 NaN NaN
Your print_model is the return from summary().
Use your model, i.e. the results instance returned from OLS.fit in anova_lm.
The error message in the title indicates the problem:
AttributeError: 'Summary' object has no attribute 'model'
I have a pandas dataframe containing 16 columns, of which 14 represent variables on which I perform a looped ANOVA test using statsmodels. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what I'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('features_total.csv')
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting a separate table printed for each variable on which the ANOVA is performed. What I want is to print one single table with the summarized results, something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem because this method always outputs the 'Label' nomenclature before the actual values, and not the name of the variable in question (as shown above, I would like to have the variable name above each 'Residual'). Is this even possible with the statsmodels approach?
I'm fairly new to Python, so excuse me if this has nothing to do with statsmodels; in that case, please do enlighten me on what I should be trying.
You can collect the tables and concatenate them at the end of your loop. This method will create a hierarchical index, but I think that makes it a bit more clear. Something like this:
keys = []
tables = []
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding suggestion, but since you are performing numerous statistical tests, it would make sense to account for the probability that at least one of the tests results in a false positive.
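As a rough sketch of such a correction, statsmodels provides multipletests. The p-values below are the three PR(>F) values from the output shown in the question; the choice of the Holm method and alpha=0.05 is my own assumption, and other methods may suit the data better.

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

# PR(>F) values taken from the ANOVA output shown in the question.
pvals = pd.Series(
    [0.084312, 0.009552, 0.009552],
    index=["Cycle_duration", "Average_support_phase", "Average_swing_phase"],
)

# Holm step-down correction (an illustrative choice, not the only option).
reject, pvals_corrected, _, _ = multipletests(pvals.values, alpha=0.05, method="holm")

result = pd.DataFrame(
    {"p_raw": pvals.values, "p_holm": pvals_corrected, "reject": reject},
    index=pvals.index,
)
print(result)
```

In the real loop, the raw p-values could be collected from each anova_table (the PR(>F) value of the Label row) before being passed to multipletests in one call.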