I am trying to produce an "OLS Regression Results" table for a college project, and my code is this:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
data=np.loadtxt('file.txt',skiprows=1)
season=data[:nb,0]
tod=data[:nb,1]
obs=data[:nb,2]
pr=data[:nb,3]
data_lm = ols('pr ~ tod + season',data=data).fit()
table = sm.stats.anova_lm(data_lm, typ=2)
data_lm.summary()
print(table)
It gives me this error "PatsyError: Error evaluating factor: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
pr ~ tod) + season"
I think the error is in the format of my data. The text file contains 4 different columns (season, tod, obs and pr).
season:[3., 3., 1., 3., 3., 3., 3., 3., 1., 3., 3., 1., 3., 2., 3., 3., 3.,
1., 1., 1., 1., 3., 1., 2., 1., 3., 1., 1., 2., 1., 3., 3., 1., 1.,
1., 2., 3.]
tod:[2., 4., 1., 2., 2., 2., 4., 1., 3., 3., 1., 3., 3., 2., 2., 4., 3.,
3., 4., 3., 3., 2., 4., 1., 3., 4., 1., 1., 1., 3., 3., 4., 3., 3.,
4., 4., 4.]
obs:[ 1., 1., 1., 3., 3., 3., 3., 3., 4., 4., 4., 5., 5.,
5., 5., 5., 6., 9., 9., 12., 12., 12., 12., 12., 13., 13.,
16., 16., 17., 19., 19., 19., 20., 20., 20., 20., 24.]
pr:[0. , 0. , 0. , 0.1, 0.2, 0.2, 0.4, 0.4, 0.5, 0.5, 0.7, 0.7, 0.7,
0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 1. , 1. , 1.1, 1.1, 1.2, 1.3, 1.4,
1.4, 1.5, 1.6, 1.7, 1.7, 1.8, 1.8, 1.9, 2. , 2. , 2. ]
Can anyone help me?
data is a basic NumPy ndarray object. These accept integers, slices, or other "array like" objects when you index them with []. However, the ols function explicitly says in the documentation:
data must define __getitem__ with the keys in the formula
That means data must be a pandas DataFrame, a dictionary, or a NumPy structured array, with a __getitem__ method that accepts str objects as indices.
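As a minimal sketch of that difference (using a few made-up rows as a stand-in for the contents of file.txt, and assuming the column order season, tod, obs, pr described in the question), the snippet below shows that a plain ndarray rejects a string key, while a dictionary or a structured array accepts one. Either of the last two objects is the kind of thing that could be passed as data= to ols.

```python
import numpy as np

# Stand-in for np.loadtxt('file.txt', skiprows=1); the values are illustrative only.
raw = np.array([
    [3., 2., 1., 0.0],
    [3., 4., 1., 0.0],
    [1., 1., 1., 0.0],
    [3., 2., 3., 0.1],
])
columns = ['season', 'tod', 'obs', 'pr']

# A plain ndarray cannot be indexed with a column name.
try:
    raw['pr']
    plain_array_accepts_names = True
except IndexError:
    plain_array_accepts_names = False

# Option 1: a dictionary mapping each name in the formula to a 1-D column.
as_dict = {name: raw[:, i] for i, name in enumerate(columns)}

# Option 2: a NumPy structured array with named fields.
as_struct = np.zeros(len(raw), dtype=[(name, float) for name in columns])
for i, name in enumerate(columns):
    as_struct[name] = raw[:, i]

print(plain_array_accepts_names)   # string keys are rejected by the plain array
print(as_dict['pr'], as_struct['tod'])
```

A pandas DataFrame built with the same column names would behave like the dictionary here.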
I have 2 numpy arrays:
one of shape (753,8,1) denoting 8 sequential actions of a customer
and other of shape (753,10) denoting 10 features of a training sample.
How can I combine these two such that:
all 10 features are appended to each of the 8 sequential actions of a training sample , that is, the combined final array should have shape of (753,8,11).
Maybe something like this:
import numpy as np
# create dummy arrays
a = np.zeros((753, 8, 1))
b = np.arange(753*10).reshape(753, 10)
# make a new axis for b and repeat the values along axis 1
c = np.repeat(b[:, np.newaxis, :], 8, axis=1)
c.shape
>>> (753, 8, 10)
# now the first two axes of a and c have the same shape
# append the values in c to a along the last axis
result = np.append(a, c, axis=2)
result.shape
>>> (753, 8, 11)
result[0]
>>> array([[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
# values from b (0-9) have been appended to a (0)
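An alternative sketch of the same idea, assuming the same dummy shapes as above: np.broadcast_to gives a read-only (753, 8, 10) view of b without materialising the repeated copies first, and np.concatenate then joins it to a along the last axis. The names b_view and combined are only illustrative.

```python
import numpy as np

a = np.zeros((753, 8, 1))
b = np.arange(753 * 10).reshape(753, 10)

# View b as (753, 8, 10): the new middle axis is broadcast, not copied.
b_view = np.broadcast_to(b[:, np.newaxis, :],
                         (a.shape[0], a.shape[1], b.shape[1]))

# Join along the last axis; the result is a new (753, 8, 11) array.
combined = np.concatenate([a, b_view], axis=-1)
print(combined.shape)
```

The result should match the np.repeat / np.append version; the only difference is that the intermediate repeated array is never allocated.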
This is my X:
X = np.array([[ 5., 8., 3., 4., 0., 5., 4., 0., 2., 5., 11.,
3., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 13.,
4., 19., 2.],
[ 5., 8., 3., 4., 0., 4., 4., 0., 3., 5., 12.,
2., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 4., 5., 12.,
4., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12.,
5., 19., 2.],
[ 5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 13.,
3., 19., 2.],
[ 5., 8., 3., 4., 0., 2., 4., 0., 4., 5., 11.,
4., 19., 2.],
[ 5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 11.,
5., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12.,
5., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12.,
5., 19., 2.]])
and this is my response y
y = np.array([ 70.14963195, 70.20937046, 70.20890363, 70.14310389,
70.18076206, 70.13179977, 70.13536797, 70.10700998,
70.09194074, 70.09958111])
Ridge Regression
from sklearn.linear_model import Ridge, RidgeCV
# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y)
model.score(X,y) # gives 0.36898424479816627
# alpha = 0.01
model1 = Ridge(alpha = 0.01)
model1.fit(X,y)
model1.score(X,y) # gives 0.3690347045143918 > 0.36898424479816627
# alpha = 0.001
model2 = Ridge(alpha = 0.001)
model2.fit(X,y)
model2.score(X,y) #gives 0.36903522192901728 > 0.3690347045143918
# alpha = 0.0001
model3 = Ridge(alpha = 0.0001)
model3.fit(X,y)
model3.score(X,y) # gives 0.36903522711624259 > 0.36903522192901728
Thus from here it should be clear that alpha = 0.0001 is the best option. Indeed, the documentation says that the score is the coefficient of determination, and the model whose coefficient is closest to 1 is the best one. Now let's see what RidgeCV tells us.
RidgeCV regression
modelCV = RidgeCV(alphas = [0.1, 0.01, 0.001,0.0001], store_cv_values = True)
modelCV.fit(X,y)
modelCV.alpha_ #giving 0.1
modelCV.score(X,y) # giving 0.36898424479812919 which is the same score as ridge regression with alpha = 0.1
What is going wrong? Surely we can check manually, as I have done, that all the other alphas are better. So not only is it not choosing the best alpha, it is choosing the worst!
Can someone explain to me what is going wrong?
That's perfectly normal behaviour.
Your manual approach does not do any cross-validation, so the training data and the test data are the same!
# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y) #!!
model.score(X,y) #!!
With some mild assumptions on the estimator (e.g. a convex optimization problem) and the solver (guaranteed epsilon-convergence), this means that you will always get the highest training score, i.e. the lowest training error, for the least regularized model (overfitting!); in your case, alpha = 0.0001. (Have a look at the ridge regression objective.)
With RidgeCV, though, cross-validation is enabled by default, and leave-one-out is used. The scoring process that determines the best parameter does not use the same data for training and testing.
You can print out the mean cv_values_ as you are using store_cv_values = True:
print(np.mean(modelCV.cv_values_, axis=0))
# [ 0.00226582 0.0022879 0.00229021 0.00229044]
# alpha [0.1, 0.01, 0.001,0.0001]
# by default: mean squared errors!
# left / 0.1 best; right / 0.0001 worst
# this is only a demo: not sure how sklearn selects best (mean vs. ?)
This is expected, but it is not a general rule. As you are now scoring on data that was not used for fitting, you are selecting against overfitting, and with high probability some regularization is needed!
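To make the difference between the two kinds of score concrete, here is a rough NumPy-only sketch. It is not scikit-learn's implementation: ridge_fit, r2 and loo_mse are hypothetical helpers, the data is synthetic rather than the X and y from the question, and the intercept is left unpenalized as an assumption. It computes, for each alpha, the R² on the training data and a leave-one-out mean squared error.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_mse(X, y, alpha):
    """Refit with each sample held out and score only on that sample."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, intercept = ridge_fit(X[keep], y[keep], alpha)
        errors.append((y[i] - (X[i] @ coef + intercept)) ** 2)
    return float(np.mean(errors))

# Synthetic data: only the first feature carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 6))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=12)

alphas = [0.01, 1.0, 100.0]
train_r2, loo_errors = [], []
for alpha in alphas:
    coef, intercept = ridge_fit(X_demo, y_demo, alpha)
    train_r2.append(r2(y_demo, X_demo @ coef + intercept))
    loo_errors.append(loo_mse(X_demo, y_demo, alpha))
    print(alpha, train_r2[-1], loo_errors[-1])
```

The training R² can only go down as alpha grows, so it always favours the smallest alpha. The leave-one-out error has no such guarantee, which is why it can point to a different alpha.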
sascha's answer is correct. Here is a check that RidgeCV does pick the right alpha.
I wrote a function that tests whether the index of the minimum mean cross-validated error matches the index of 0.1 in the list of alphas.
def test_RidgeCV(alphas):
    modelCV = RidgeCV(alphas=alphas, store_cv_values=True)
    modelCV.fit(X, y)
    # cv_values_ holds one leave-one-out squared error per sample and per alpha
    CV_values = modelCV.cv_values_
    # average over the samples to get one mean error per alpha
    mean_error = np.mean(CV_values, axis=0)
    return alphas.index(0.1) == np.argmin(mean_error)
Then I run through every permutation of the list of alphas given in the question. No matter where we put 0.1, its index always matches the index of the minimum error.
This is an exhaustive test: we get 24 True values.
alphas=[0.1, 0.01, 0.001,0.0001]
from itertools import permutations
alphas_list = list(permutations(alphas))
for i in range(len(alphas_list)):
    print(test_RidgeCV(alphas=alphas_list[i]))
Out:
True
True
...
True
For a numpy array of dimension n, I'd like to apply np.nanmax() to n-1 dimensions producing a 1 dimensional array of maxima, ignoring all values set to np.nan.
q = np.arange(5*4*3.).reshape(3,4,5) % (42+1)
q[q%5==0] = np.nan
producing:
array([[[ nan, 1., 2., 3., 4.],
[ nan, 6., 7., 8., 9.],
[ nan, 11., 12., 13., 14.],
[ nan, 16., 17., 18., 19.]],
[[ nan, 21., 22., 23., 24.],
[ nan, 26., 27., 28., 29.],
[ nan, 31., 32., 33., 34.],
[ nan, 36., 37., 38., 39.]],
[[ nan, 41., 42., nan, 1.],
[ 2., 3., 4., nan, 6.],
[ 7., 8., 9., nan, 11.],
[ 12., 13., 14., nan, 16.]]])
If I know ahead of time that I want to use the last axis as the remaining dimension, I can use the -1 feature in .reshape() and do this:
np.nanmax(q.reshape(-1, q.shape[-1]), axis=0)
which produces the result I want:
array([ 12., 41., 42., 38., 39.])
However, suppose I don't know ahead of time which axis I want to exclude from the maximum. Suppose I started with n=4 dimensions and wanted the maximum applied to all axes except the mth axis, which could be 0, 1, 2, or 3. Would I actually have to use a conditional if-elif-else?
Is there something that would work like a hypothetical exceptaxis=m?
The axis argument of nanmax can be a tuple of axes over which the maximum is computed. In your case, you want that tuple to contain all the axes except m. Here's one way you could do that:
In [62]: x
Out[62]:
array([[[[ 4., 3., nan, nan],
[ 0., 2., 2., nan],
[ 4., 5., nan, 3.],
[ 2., 0., 3., 1.]],
[[ 2., 0., 0., 1.],
[ nan, 3., 0., nan],
[ 0., 1., nan, 2.],
[ 5., 4., 0., 1.]],
[[ 4., 0., 2., 0.],
[ 4., 0., 4., 5.],
[ 3., 4., 1., 0.],
[ 5., 3., 4., 3.]]],
[[[ 2., nan, 6., 4.],
[ 3., 1., 2., nan],
[ 5., 4., 1., 0.],
[ 2., 6., 0., nan]],
[[ 4., 1., 4., 2.],
[ nan, 1., 5., 5.],
[ 2., 0., 1., 1.],
[ 6., 3., 6., 5.]],
[[ 1., 0., 0., 1.],
[ 1., nan, 2., nan],
[ 3., 4., 0., 5.],
[ 1., 6., 2., 3.]]]])
In [63]: m = 0
In [64]: np.nanmax(x, axis=tuple(i for i in range(x.ndim) if i != m))
Out[64]: array([ 5., 6.])
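The same idea can be wrapped in a small function. This is only a sketch: nanmax_except is a hypothetical helper name, not a NumPy function, and the handling of negative axes is an added convenience. It is applied here to the array q from the question.

```python
import numpy as np

def nanmax_except(arr, keep_axis):
    """NaN-ignoring maximum over every axis except keep_axis."""
    keep_axis = keep_axis % arr.ndim  # allow negative axes such as -1
    reduce_axes = tuple(i for i in range(arr.ndim) if i != keep_axis)
    return np.nanmax(arr, axis=reduce_axes)

# The array from the question.
q = np.arange(5 * 4 * 3.).reshape(3, 4, 5) % (42 + 1)
q[q % 5 == 0] = np.nan

last = nanmax_except(q, -1)   # keep the last axis, as in the reshape(-1, ...) version
first = nanmax_except(q, 0)   # keep the first axis instead
print(last)
print(first)
```

Keeping the last axis should reproduce the result of the reshape-based approach in the question, without any branching on m.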
I have five 100x100 arrays, A, and I want to multiply each matrix by a value from an array of length five, B. I wish to multiply the first matrix in A by the first value in B, the second matrix by the second value in B, and so on. Am I able to do this?
Actually, the answer has already been provided by gboffi in his comment. I would still like to elaborate on it with a concrete example:
import numpy as np
#example data, all arrays of ones 100x100
A1 = A2 = A3 =A4 = A5 = np.ones((100, 100))
#example array containing the factor for each matrix
B = np.array([1, 2, 3, 4, 5])
#create an array containing all matrices
A = np.array([A1, A2, A3, A4, A5])
A*B[:,None,None]
The result then looks like this:
array([[[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
...,
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.],
[ 1., 1., 1., ..., 1., 1., 1.]],
[[ 2., 2., 2., ..., 2., 2., 2.],
[ 2., 2., 2., ..., 2., 2., 2.],
[ 2., 2., 2., ..., 2., 2., 2.],
...,
[ 2., 2., 2., ..., 2., 2., 2.],
[ 2., 2., 2., ..., 2., 2., 2.],
[ 2., 2., 2., ..., 2., 2., 2.]],
[[ 3., 3., 3., ..., 3., 3., 3.],
[ 3., 3., 3., ..., 3., 3., 3.],
[ 3., 3., 3., ..., 3., 3., 3.],
...,
[ 3., 3., 3., ..., 3., 3., 3.],
[ 3., 3., 3., ..., 3., 3., 3.],
[ 3., 3., 3., ..., 3., 3., 3.]],
[[ 4., 4., 4., ..., 4., 4., 4.],
[ 4., 4., 4., ..., 4., 4., 4.],
[ 4., 4., 4., ..., 4., 4., 4.],
...,
[ 4., 4., 4., ..., 4., 4., 4.],
[ 4., 4., 4., ..., 4., 4., 4.],
[ 4., 4., 4., ..., 4., 4., 4.]],
[[ 5., 5., 5., ..., 5., 5., 5.],
[ 5., 5., 5., ..., 5., 5., 5.],
[ 5., 5., 5., ..., 5., 5., 5.],
...,
[ 5., 5., 5., ..., 5., 5., 5.],
[ 5., 5., 5., ..., 5., 5., 5.],
[ 5., 5., 5., ..., 5., 5., 5.]]])
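Why B[:, None, None] works can be seen from the shapes: B becomes (5, 1, 1), which broadcasts against (5, 100, 100) so that each matrix is paired with one factor. A small sketch with 3x3 stand-ins (the sizes and the names scaled and scaled_einsum are chosen only for illustration), with np.einsum as an equivalent spelling:

```python
import numpy as np

# Five small matrices with distinct entries, standing in for the 100x100 ones.
A = np.arange(5 * 3 * 3, dtype=float).reshape(5, 3, 3)
B = np.array([1, 2, 3, 4, 5])

# B[:, None, None] has shape (5, 1, 1) and broadcasts over the last two axes.
scaled = A * B[:, None, None]

# Same operation written with einsum: factor i multiplies matrix i.
scaled_einsum = np.einsum('i,ijk->ijk', B, A)

print(B[:, None, None].shape, scaled.shape)
```

Both versions leave A unchanged and return a new array of the same shape as A.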
I have an array:
a = array([
[ nan, 2., 3., 2., 5., 3.],
[ nan, 4., 3., 2., 5., 4.],
[ nan, 2., 1., 2., 3., 2.]
])
And I make a filled contour with:
plt.contourf(a)
The resulting plot still includes the NaN boundary.
Nothing happens when I call plt.axis('tight'), but I want to hide the boundary NaN values. Is there an easy way to do this?
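You can set the x limits to the first and last column indices that contain data (not to np.nanmin(a) and np.nanmax(a), which return data values rather than positions):
import numpy as np
import matplotlib.pyplot as plt
a = np.array([
[ np.nan, 2., 3., 2., 5., 3.],
[ np.nan, 4., 3., 2., 5., 4.],
[ np.nan, 2., 1., 2., 3., 2.]
])
# indices of the columns that are not entirely NaN
cols = np.flatnonzero(~np.isnan(a).all(axis=0))
plt.contourf(a)
plt.xlim(cols[0], cols[-1])
plt.show()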
If the NaNs fill a whole column, as in your example, you can delete that column before plotting:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([
[ np.nan, 2., 3., 2., 5., 3.],
[ np.nan, 4., 3., 2., 5., 4.],
[ np.nan, 2., 1., 2., 3., 2.]
])
# drop column 0 (axis=1), which holds only NaNs
b = np.delete(a, 0, 1)
plt.contourf(b)
Well, for columns of NaNs at the beginning and end, I tried this and it worked:
x = np.arange(0,a.shape[1])
plt.xlim([x[~np.isnan(a[0,:])][0],x[~np.isnan(a[0,:])][-1]])
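The line above looks only at the first row. A slightly more general sketch, assuming the goal is to trim columns that are NaN in every row, is shown below; finite_extent is a hypothetical helper, and the plotting calls are left as comments so the snippet runs without matplotlib.

```python
import numpy as np

def finite_extent(a):
    """Return (first, last) index of the columns with at least one non-NaN value."""
    cols = np.flatnonzero(~np.isnan(a).all(axis=0))
    if cols.size == 0:
        raise ValueError("array contains only NaN values")
    return int(cols[0]), int(cols[-1])

a = np.array([
    [np.nan, 2., 3., 2., 5., 3.],
    [np.nan, 4., 3., 2., 5., 4.],
    [np.nan, 2., 1., 2., 3., 2.],
])

xmin, xmax = finite_extent(a)
print(xmin, xmax)

# With matplotlib available, the limits could then be applied as:
# plt.contourf(a)
# plt.xlim(xmin, xmax)
```

The same function applied to a.T would give the corresponding limits for the y axis.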