Let's say I have three statsmodels OLS objects that I want to compare side by side. I can use summary_col to create a summary table that I can print out as text or export to latex.
How can I export this table as csv?
Here's a replicable example of what I want to do:
# Libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
# Load silly data and add constant
df = sm.datasets.stackloss.load_pandas().data
df['CONSTANT'] = 1
# Train three silly models
m0 = sm.OLS(df['STACKLOSS'], df[['CONSTANT','AIRFLOW']]).fit()
m1 = sm.OLS(df['STACKLOSS'], df[['CONSTANT','AIRFLOW','WATERTEMP']]).fit()
m2 = sm.OLS(df['STACKLOSS'], df[['CONSTANT','AIRFLOW','WATERTEMP','ACIDCONC']]).fit()
# Results table
res = summary_col([m0,m1,m2], regressor_order=m2.params.index.tolist())
print(res)
================================================
          STACKLOSS I STACKLOSS II STACKLOSS III
------------------------------------------------
CONSTANT  -44.1320    -50.3588     -39.9197
          (6.1059)    (5.1383)     (11.8960)
AIRFLOW   1.0203      0.6712       0.7156
          (0.1000)    (0.1267)     (0.1349)
WATERTEMP             1.2954       1.2953
                      (0.3675)     (0.3680)
ACIDCONC                           -0.1521
                                   (0.1563)
================================================
Standard errors in parentheses.
Is there a way to export res to csv?
The results are stored as a list of data frames:
res.tables
[               STACKLOSS I STACKLOSS II STACKLOSS III
 CONSTANT          -44.1320     -50.3588      -39.9197
                   (6.1059)     (5.1383)     (11.8960)
 AIRFLOW             1.0203       0.6712        0.7156
                   (0.1000)     (0.1267)      (0.1349)
 WATERTEMP                        1.2954        1.2953
                                (0.3675)      (0.3680)
 ACIDCONC                                      -0.1521
                                              (0.1563)
 R-squared           0.8458       0.9088        0.9136
 R-squared Adj.      0.8377       0.8986        0.8983]
This should work:
res.tables[0].to_csv("test.csv")
pd.read_csv("test.csv")
Unnamed: 0 STACKLOSS I STACKLOSS II STACKLOSS III
0 CONSTANT -44.1320 -50.3588 -39.9197
1 NaN (6.1059) (5.1383) (11.8960)
2 AIRFLOW 1.0203 0.6712 0.7156
3 NaN (0.1000) (0.1267) (0.1349)
4 WATERTEMP NaN 1.2954 1.2953
5 NaN NaN (0.3675) (0.3680)
6 ACIDCONC NaN NaN -0.1521
7 NaN NaN NaN (0.1563)
8 R-squared 0.8458 0.9088 0.9136
9 R-squared Adj. 0.8377 0.8986 0.8983
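The NaN entries and the "Unnamed: 0" column above come from reading the file back with default settings, not from the export itself. Below is a minimal sketch of a cleaner round trip. It assumes the cells and row labels in res.tables[0] are formatted strings, with empty strings for the blanks; the small frame `table` is a hypothetical stand-in so the example runs without statsmodels.

```python
import io

import pandas as pd

# Hypothetical stand-in for res.tables[0]: string cells, blank cells as "",
# and blank row labels for the standard-error rows.
table = pd.DataFrame(
    {
        "STACKLOSS I": ["-44.1320", "(6.1059)", "", ""],
        "STACKLOSS II": ["-50.3588", "(5.1383)", "1.2954", "(0.3675)"],
    },
    index=["CONSTANT", "", "WATERTEMP", ""],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
table.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the row labels, dtype=str keeps "(6.1059)" and
# "-44.1320" as text, and na_filter=False keeps blanks as "" instead of NaN.
restored = pd.read_csv(buffer, index_col=0, dtype=str, na_filter=False)
print(restored)
```

With a real file, the same keyword arguments can be passed to pd.read_csv("test.csv", ...).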
I have a Pandas DataFrame like (abridged):
        age gender  control        county
11877  67.0      F        0    AL-Calhoun
11552  60.0      F        0      AL-Coosa
11607  60.0      F        0  AL-Talladega
13821   NaN    NaN        1     AL-Mobile
11462  59.0      F        0       AL-Dale
I want to run a linear regression with fixed effects by county entity (not by time) to balance check my control and treatment groups for an experimental design, such that my dependent variable is membership in the treatment group (control = 1) or not (control = 0).
In order to do this, so far as I have seen I need to use linearmodels.panel.PanelOLS and set my entity field (county) as my index.
So far as I'm aware my model should look like this:
# set index on entity effects field:
to_model = to_model.set_index(["county"])
# implement fixed effects linear model
model = PanelOLS.from_formula("control ~ age + gender + EntityEffects", to_model)
When I try to do this, I get the below error:
ValueError: The index on the time dimension must be either numeric or date-like
I have seen a lot of implementations of such models online and they all seem to use a temporal effect, which is not relevant in my case. If I try to encode my county field using numerics, I get a different error.
# create a dict to map county values to numerics
county_map = dict(zip(to_model["county"].unique(), range(len(to_model.county.unique()))))
# create a numeric column as alternative to county
to_model["county_numeric"] = to_model["county"].map(county_map)
# set index on numeric entity effects field
to_model = to_model.set_index(["county_numeric"])
FactorEvaluationError: Unable to evaluate factor `control`. [KeyError: 'control']
How am I able to implement this model using the county as a unit fixed effect?
Assuming you have multiple entries for each county, then you could use the following. The key step is to use a groupby transform to create a distinct numeric index for each county which can be used as a fake time index.
import numpy as np
import pandas as pd
import string
import linearmodels as lm
# Generate a random DataFrame
rs = np.random.default_rng(1213892)
counties = rs.choice([c for c in string.ascii_lowercase], (1000, 3))
counties = np.array([["".join(c)] * 10 for c in counties]).ravel()
age = rs.integers(18, 65, (10 * 1000))
gender = rs.choice(["m", "f"], size=(10 * 1000))
control = rs.integers(0, 2, size=10 * 1000)
df = pd.DataFrame(
{"counties": counties, "age": age, "gender": gender, "control": control}
)
# Construct a dummy numeric index for each county
numeric_index = df.groupby("counties").age.transform(lambda c: np.arange(len(c)))
df["numeric_index"] = numeric_index
df = df.set_index(["counties","numeric_index"])
# Take a look
df.head(15)
age gender control
counties numeric_index
qbt 0 51 m 1
1 36 m 0
2 28 f 1
3 28 m 0
4 47 m 0
5 19 m 1
6 32 m 1
7 54 m 0
8 36 m 1
9 52 m 0
nub 0 19 m 0
1 57 m 0
2 49 f 0
3 53 m 1
4 30 f 0
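The dummy index step can also be sketched with groupby(...).cumcount(), which numbers the rows within each group directly. This is an alternative to the transform call above, not part of the original answer; the tiny frame and the column name `obs` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: several observations per county, no time dimension.
df = pd.DataFrame(
    {
        "county": ["AL-Coosa", "AL-Dale", "AL-Coosa", "AL-Dale", "AL-Coosa"],
        "age": [60.0, 59.0, 61.0, 45.0, 30.0],
        "control": [0, 0, 1, 1, 0],
    }
)

# Number the rows within each county: 0, 1, 2, ... per group.
df["obs"] = df.groupby("county").cumcount()

# Entity first, then the fake "time" level, giving the two-level index
# that the panel estimator expects.
panel = df.set_index(["county", "obs"])
print(panel)
```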
This just shows that the model can be estimated.
# Fit the model
# Note: Results are meaningless, just shows that this works
mod = lm.PanelOLS.from_formula("control ~ age + gender + EntityEffects", data=df)
mod.fit()
PanelOLS Estimation Summary
================================================================================
Dep. Variable: control R-squared: 0.0003
Estimator: PanelOLS R-squared (Between): 0.0005
No. Observations: 10000 R-squared (Within): 0.0003
Date: Thu, May 12 2022 R-squared (Overall): 0.0003
Time: 11:08:00 Log-likelihood -6768.3
Cov. Estimator: Unadjusted
F-statistic: 1.4248
Entities: 962 P-value 0.2406
Avg Obs: 10.395 Distribution: F(2,9036)
Min Obs: 10.0000
Max Obs: 30.000 F-statistic (robust): 2287.4
P-value 0.0000
Time periods: 30 Distribution: F(2,9036)
Avg Obs: 333.33
Min Obs: 2.0000
Max Obs: 962.00
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
age -0.0002 0.0004 -0.5142 0.6072 -0.0010 0.0006
gender[T.f] 0.5191 0.0176 29.559 0.0000 0.4847 0.5535
gender[T.m] 0.5021 0.0175 28.652 0.0000 0.4678 0.5365
===============================================================================
F-test for Poolability: 0.9633
P-value: 0.7768
Distribution: F(961,9036)
Included effects: Entity
PanelEffectsResults, id: 0x2246f38a9d0
Many questions on this topic have already been answered, but there is one thing I could not figure out.
I have a dataframe on which I am performing regression; the results are then stored in new columns of the Test dataframe. To compare methods I need to do both linear and polynomial regression.
I have found a neat way to do this with linear regression, which leaves the predicted values in a new column of Test. But I cannot make this work within the same loop using polynomial regression: the final Test dataframe contains multiple null values, because the output of polynomial_features.fit_transform(X) somehow loses the corresponding Test index.
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
Test = pd.read_csv(r'D:\myfile.csv')
df_coef =[]
value = list(set(Test['Value']))
for value in value:
    df_redux = Test[Test['Value'] == value]
    Y = df_redux['Y']
    X = df_redux[['X1', 'A', 'B', 'B']]
    X = sm.add_constant(X)
    # linear
    model_1 = sm.OLS(Y, X).fit()
    predictions_1 = model_1.predict(X)
    # polynomial
    polynomial_features = PolynomialFeatures(degree=2)
    xp = polynomial_features.fit_transform(X)
    model_2 = sm.OLS(Y, xp).fit()
    predictions_2 = model_2.predict(xp)
    stats_1 = pd.read_html(model_1.summary().tables[1].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model_2.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions_1 = pd.DataFrame(predictions_1, columns=['lin'])
    predictions_2 = pd.DataFrame(predictions_2, columns=['poly'])
    # ??? how to concat and append both predictions_1 and predictions_2 to the same df_coef = [] list?
    gf = pd.concat([predictions_1, df_redux], axis=1)
    df_coef.append(gf)
all_coef = pd.concat(df_coef)
type(all_coef)
Out[234]: pandas.core.frame.DataFrame
The problem is that the transformed xp is of type <class 'numpy.ndarray'>, but X is of type <class 'pandas.core.frame.DataFrame'>. The question is how I can get the polynomial regression predicted values into a new column of Test, next to the linear regression results. This is probably really simple, but I could not figure it out.
print(type(X))
print(type(xp))
print(X.sample(2))
print()
print(xp)
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
X1 A B G1
962 4.334912 1.945910 3.135494 3.258097
1365 4.197888 2.197225 3.135494 3.332205
[[ 1. 4.77041663 1.94591015 ... 35.74106743 34.52550933
33.35129251]
[ 1. 4.43240629 1.94591015 ... 33.28387641 32.03140262
30.82605947]
[ 1. 3.21669428 1.94591015 ... 29.95821572 30.38903979
30.82605947]
The result I get with the polynomial regression predicted values appended to the original Test dataframe:
0 6.178542 3.0 692 ... 2.079442 4.783216 6.146329
1 6.156108 11.0 692 ... 2.197225 4.842126 6.113682
2 6.071453 12.0 692 ... 2.197225 4.814595 6.052089
3 5.842053 NaN NaN ... NaN NaN NaN
4 4.625762 30.0 692 ... 1.945910 5.018201 5.828946
This is the correct result I obtained using only linear regression, without NaN and with a value in each row, as it is supposed to be:
0 6.151675 3 692 5 ... 3.433987 2.079442 4.783216 6.146329
1 6.132077 11 692 5 ... 3.401197 2.197225 4.842126 6.113682
2 6.068450 12 692 5 ... 3.332205 2.197225 4.814595 6.052089
4 5.819535 30 692 5 ... 3.258097 1.945910 5.018201 5.828946
8 4.761362 61 692 5 ... 2.564949 1.945910 3.889585 4.624973
Solve this by adding a line that converts the NumPy array of predictions to a pandas Series that reuses the index of df_redux. For the model statistics, use the statsmodels summary:
import pandas as pd
from statsmodels.api import OLS
predictions_2 = model_2.predict(xp)
predictions_2_series = pd.Series(predictions_2, index=df_redux.index.values)
print(OLS(Y, xp).fit().summary())
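Why the index matters can be seen in a small sketch that needs only pandas and NumPy. The array `predictions` is a hypothetical stand-in for model_2.predict(xp), and the three-row frame stands in for a filtered df_redux whose row labels are not contiguous.

```python
import numpy as np
import pandas as pd

# A filtered frame keeps its original, non-contiguous row labels.
df_redux = pd.DataFrame({"X1": [1.0, 2.0, 3.0]}, index=[0, 4, 8])

# Stand-in for model_2.predict(xp): a plain ndarray carries no index.
predictions = np.array([10.0, 20.0, 30.0])

# Wrapping the array without an index gives it labels 0, 1, 2, so
# concat aligns on labels and pads the mismatches with NaN.
misaligned = pd.concat(
    [pd.DataFrame(predictions, columns=["poly"]), df_redux], axis=1
)

# Reusing the source index keeps every prediction on its own row.
aligned = df_redux.assign(poly=pd.Series(predictions, index=df_redux.index))

print(misaligned)
print(aligned)
```

The misaligned frame has five rows with NaN in both columns, which matches the symptom in the question; the aligned frame has three complete rows.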
I want to add an extra column holding the maximum relative difference [-] between the values in each row and the mean of that row.
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to get the maximum values from the tuples and then divide them by the mean in the proper Pythonic way.
I feel like the whole computation can be reduced to:
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
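The shortcut works because the largest absolute deviation from the row mean is always reached at either the row maximum or the row minimum. A minimal sketch comparing it with a direct translation of the formula from the question, using the question's data; np.maximum stands in for the tuple .max() that raised the AttributeError.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

# Keep only the yearly columns so that ID does not leak into the statistics.
years = df.filter(like="y_")
mean = years.mean(axis=1)

# Direct translation of the formula: element-wise maximum of two Series.
by_formula = np.maximum(
    (years.max(axis=1) - mean).abs(),
    (years.min(axis=1) - mean).abs(),
) / mean

# Shortcut: largest absolute deviation from the row mean.
by_shortcut = years.sub(mean, axis=0).abs().max(axis=1) / mean

df["max_rel_dif"] = by_shortcut
print(df)
```

Note that the attempt in the question called df.max(axis=1) on the full frame, which also includes the ID column; restricting to the yearly columns avoids that.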
I have a pandas dataframe containing 16 columns, of which 14 represent variables on which I perform a looped ANOVA test using statsmodels. My dataframe looks something like this (simplified):
ID Cycle_duration Average_support_phase Average_swing_phase Label
1 23.1 34.3 47.2 1
2 27.3 38.4 49.5 1
3 25.8 31.1 45.7 1
4 24.5 35.6 41.9 1
...
So far this is what I'm doing:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('features_total.csv')
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print(anova_table)
Which yields:
sum_sq df F PR(>F)
Label 0.124927 2.0 2.561424 0.084312
Residual 1.731424 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
sum_sq df F PR(>F)
Label 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I'm getting an individual table printed for each variable on which the ANOVA is performed. What I want is to print one single table with the summarized results, something like this:
sum_sq df F PR(>F)
Cycle_duration 0.1249270 2.0 2.561424 0.084312
Residual 1.7314240 71.0 NaN NaN
Average_support_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
Average_swing_phase 62.626057 2.0 4.969491 0.009552
Residual 447.374788 71.0 NaN NaN
I can already see a problem, because this method always outputs the 'Label' row name before the actual values, and not the name of the variable in question (as shown above, I would like to have the variable name above each 'Residual'). Is this even possible with the statsmodels approach?
I'm fairly new to Python, so excuse me if this has nothing to do with statsmodels; in that case, please tell me what I should be trying instead.
You can collect the tables and concatenate them at the end of your loop. This method will create a hierarchical index, but I think that makes it a bit more clear. Something like this:
keys = []
tables = []
for variable in df.columns:
    model = ols('{} ~ Label'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
Somewhat related, I would also suggest correcting for multiple comparisons. This is more a statistical suggestion than a coding suggestion, but considering that you are performing numerous statistical tests, it would make sense to account for the probability that one of the tests results in a false positive.
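As a rough sketch of both points, the snippet below builds the hierarchical table from stand-in ANOVA frames, pulls out the 'Label' rows, and applies a Bonferroni correction by hand. The helper `fake_anova` is hypothetical and only mimics the shape of the anova_lm output, with the F and p-values copied from the question. Bonferroni is one simple choice among several correction methods.

```python
import numpy as np
import pandas as pd


def fake_anova(f_value, p_value):
    """Hypothetical stand-in for the frame returned by sm.stats.anova_lm."""
    return pd.DataFrame(
        {"F": [f_value, np.nan], "PR(>F)": [p_value, np.nan]},
        index=["Label", "Residual"],
    )


keys = ["Cycle_duration", "Average_support_phase", "Average_swing_phase"]
tables = [
    fake_anova(2.561424, 0.084312),
    fake_anova(4.969491, 0.009552),
    fake_anova(4.969491, 0.009552),
]

# Outer index level: variable name; inner level: 'Label' / 'Residual'.
df_anova = pd.concat(tables, keys=keys, axis=0)

# One row per variable: the 'Label' effect from each table.
label_rows = df_anova.xs("Label", level=1)

# Bonferroni: multiply each p-value by the number of tests, cap at 1.
n_tests = len(label_rows)
p_bonferroni = (label_rows["PR(>F)"] * n_tests).clip(upper=1.0)
print(label_rows.assign(p_bonferroni=p_bonferroni))
```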
I've got two dataframes df1 and df2 that look like this:
#df1
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
#df2
counts freqs
categories
Straight Engine 18 0.5625
V engine 14 0.4375
Could anyone explain why pd.concat([df1, df2], axis = 0) will not give me this:
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
Straight Engine 18 0.5625
V engine 14 0.4375
Here is what I've tried:
1 - Using pd.concat()
I'm suspecting that the way I've built these dataframes may be the source of the issue.
And here is how I've ended up with these particular dataframes:
# imports
import pandas as pd
from pydataset import data # pip install pydataset to get datasets from R
# load data
df_mtcars = data('mtcars')
# change dummyvariables to more describing variables:
df_mtcars['am'][df_mtcars['am'] == 0] = 'manual'
df_mtcars['am'][df_mtcars['am'] == 1] = 'automatic'
df_mtcars['vs'][df_mtcars['vs'] == 0] = 'Straight Engine'
df_mtcars['vs'][df_mtcars['vs'] == 1] = 'V engine'
# describe categorical variables
df1 = pd.Categorical(df_mtcars['am']).describe()
df2 = pd.Categorical(df_mtcars['vs']).describe()
I understand that 'categories' is what is causing the problems here since df_con = pd.concat([df1, df2], axis = 0) raises this error:
TypeError: categories must match existing categories when appending
But it confuses me that this is ok:
# code
df_con = pd.concat([df1, df2], axis = 1)
# output:
counts freqs counts freqs
categories
automatic 13.0 0.40625 NaN NaN
manual 19.0 0.59375 NaN NaN
Straight Engine NaN NaN 18.0 0.5625
V engine NaN NaN 14.0 0.4375
2 - Using df.append() raises the same error as pd.concat()
3 - Using pd.merge() sort of works, but I'm losing the indexes:
# Code
df_merge = pd.merge(df1, df2, how = 'outer')
# Output
counts freqs
0 13 0.40625
1 19 0.59375
2 18 0.56250
3 14 0.43750
4 - Using pd.concat() on transposed dataframes
Since pd.concat() worked with axis = 1 I thought I would get there using transposed dataframes.
# df1.T
categories automatic manual
counts 13.00000 19.00000
freqs 0.40625 0.59375
# df2.T
categories Straight Engine V engine
counts 18.0000 14.0000
freqs 0.5625 0.4375
But still no success:
# code
df_con = pd.concat([df1.T, df2.T], axis = 1)
>>> TypeError: categories must match existing categories when appending
By the way, what I was hoping for here is this:
categories automatic manual Straight Engine V engine
counts 13.00000 19.00000 18.0000 14.0000
freqs 0.40625 0.59375 0.5625 0.4375
Still kind of works with axis = 0 though:
# code
df_con = pd.concat([df1.T, df2.T], axis = 0)
# Output
categories automatic manual Straight Engine V engine
counts 13.00000 19.00000 NaN NaN
freqs 0.40625 0.59375 NaN NaN
counts NaN NaN 18.0000 14.0000
freqs NaN NaN 0.5625 0.4375
But that is still far from what I'm trying to accomplish.
Now I'm thinking that it would be possible to strip the 'category' info from df1 and df2, but I haven't been able to find out how to do that yet.
Thank you for any other suggestions!
Try this:
pd.concat([df1.reset_index(),df2.reset_index()],ignore_index=True)
Output:
categories counts freqs
0 automatic 13 0.40625
1 manual 19 0.59375
2 Straight Engine 18 0.56250
3 V engine 14 0.43750
To get the categories back as the index, do this:
pd.concat([df1.reset_index(),df2.reset_index()],ignore_index=True).set_index('categories')
Output:
counts freqs
categories
automatic 13 0.40625
manual 19 0.59375
Straight Engine 18 0.56250
V engine 14 0.43750
For more details, see the pandas documentation on pd.concat.
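A self-contained sketch of the approach, with df1 and df2 rebuilt by hand so it runs without pydataset. It assumes, as in the question, that both frames carry a CategoricalIndex named 'categories'. The second variant, which casts the index to plain objects, is an alternative not taken from the answer; it corresponds to the idea of stripping the category information. Whether a direct pd.concat on the categorical indexes raises depends on the pandas version, so neither variant relies on it.

```python
import pandas as pd

# Rebuild the two describe() results with categorical row labels.
df1 = pd.DataFrame(
    {"counts": [13, 19], "freqs": [0.40625, 0.59375]},
    index=pd.CategoricalIndex(["automatic", "manual"], name="categories"),
)
df2 = pd.DataFrame(
    {"counts": [18, 14], "freqs": [0.5625, 0.4375]},
    index=pd.CategoricalIndex(["Straight Engine", "V engine"], name="categories"),
)

# Variant 1 (from the answer): move the labels into a column, stack the
# rows, then turn the column back into the index.
stacked = pd.concat(
    [df1.reset_index(), df2.reset_index()], ignore_index=True
).set_index("categories")

# Variant 2 (assumption, not from the answer): drop the categorical dtype
# from each index first, then concatenate along the rows.
plain = [d.set_index(d.index.astype(object)) for d in (df1, df2)]
stacked_plain = pd.concat(plain, axis=0)

print(stacked)
```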