This is my X:
import numpy as np

X = np.array([[ 5.,  8.,  3.,  4.,  0.,  5.,  4.,  0.,  2.,  5., 11.,  3., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  1.,  4.,  0.,  3.,  5., 13.,  4., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  4.,  4.,  0.,  3.,  5., 12.,  2., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  1.,  4.,  0.,  4.,  5., 12.,  4., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  1.,  4.,  0.,  3.,  5., 12.,  5., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  2.,  4.,  0.,  3.,  5., 13.,  3., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  2.,  4.,  0.,  4.,  5., 11.,  4., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  2.,  4.,  0.,  3.,  5., 11.,  5., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  1.,  4.,  0.,  3.,  5., 12.,  5., 19.,  2.],
              [ 5.,  8.,  3.,  4.,  0.,  1.,  4.,  0.,  3.,  5., 12.,  5., 19.,  2.]])
and this is my response y
y = np.array([70.14963195, 70.20937046, 70.20890363, 70.14310389,
              70.18076206, 70.13179977, 70.13536797, 70.10700998,
              70.09194074, 70.09958111])
Ridge Regression
from sklearn.linear_model import Ridge

# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y)
model.score(X,y) # gives 0.36898424479816627
# alpha = 0.01
model1 = Ridge(alpha = 0.01)
model1.fit(X,y)
model1.score(X,y) # gives 0.3690347045143918 > 0.36898424479816627
# alpha = 0.001
model2 = Ridge(alpha = 0.001)
model2.fit(X,y)
model2.score(X,y) #gives 0.36903522192901728 > 0.3690347045143918
# alpha = 0.0001
model3 = Ridge(alpha = 0.0001)
model3.fit(X,y)
model3.score(X,y) # gives 0.36903522711624259 > 0.36903522192901728
From this it seems clear that alpha = 0.0001 is the best option. Indeed, the documentation says that the score is the coefficient of determination, and the coefficient closest to 1 describes the best model. Now let's see what RidgeCV tells us.
RidgeCV regression
from sklearn.linear_model import RidgeCV

modelCV = RidgeCV(alphas = [0.1, 0.01, 0.001, 0.0001], store_cv_values = True)
modelCV.fit(X,y)
modelCV.alpha_      # gives 0.1
modelCV.score(X,y)  # gives 0.36898424479812919, the same score as Ridge with alpha = 0.1
What is going wrong? Surely we can check manually, as I have done, that all the other alphas are better. So not only is it not choosing the best alpha, it is choosing the worst!
Can someone explain to me what is going wrong?
That's perfectly normal behaviour.
Your manual approach does no cross-validation, so the training and test data are the same!
# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y) #!!
model.score(X,y) #!!
Under some mild assumptions on the estimator (e.g. a convex optimization problem) and the solver (guaranteed epsilon-convergence), this means you will always get the best training score for the least regularized model (overfitting!), in your case alpha = 0.0001. (Have a look at ridge regression's formula: it minimizes ||y - Xw||^2 + alpha * ||w||^2, so the smaller alpha is, the less the fit is penalized.)
With RidgeCV, though, cross-validation is activated by default, with leave-one-out selected. The scoring process used to determine the best parameter does not use the same data for training and testing.
You can print out the mean cv_values_ as you are using store_cv_values = True:
print(np.mean(modelCV.cv_values_, axis=0))
# [0.00226582  0.0022879   0.00229021  0.00229044]
# alphas:  [0.1,       0.01,       0.001,      0.0001]
# by default these are mean squared errors
# leftmost (alpha = 0.1) is best; rightmost (alpha = 0.0001) is worst
# this is only a demo: not sure how sklearn selects the best (mean vs. ?)
This is expected here, though it is not a general rule. Since you are now training and scoring on different data, you are optimizing not to overfit, and with high probability some regularization is needed!
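To double-check this by hand, here is a minimal sketch (assuming the X and y from the question) that scores each alpha with explicit leave-one-out cross-validation; the resulting mean squared errors should line up with the cv_values_ means above, up to solver details:
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

for alpha in [0.1, 0.01, 0.001, 0.0001]:
    # mean squared leave-one-out error over the 10 samples; lower is better
    mse = -cross_val_score(Ridge(alpha=alpha), X, y,
                           cv=LeaveOneOut(),
                           scoring='neg_mean_squared_error').mean()
    print(alpha, mse)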
sascha's answer is correct. Here is a proof that RidgeCV does pick the right alpha.
I wrote a function that tests whether the index of the minimum mean cross-validated error matches the index of 0.1 in the list of alphas.
def test_RidgeCV(alphas):
    modelCV = RidgeCV(alphas=alphas, store_cv_values=True)
    modelCV.fit(X, y)
    # mean leave-one-out error for each alpha, averaged over the samples
    mean_error = np.mean(modelCV.cv_values_, axis=0)
    # True if 0.1 sits at the position of the smallest mean error
    return alphas.index(0.1) == np.argmin(mean_error)
Then I run through all permutations of the list of alphas provided in the question. No matter where we put 0.1, its index always matches the index of the minimum error.
This is an exhaustive test: we get 24 Trues.
alphas = [0.1, 0.01, 0.001, 0.0001]
from itertools import permutations
alphas_list = list(permutations(alphas))
for perm in alphas_list:
    print(test_RidgeCV(alphas=perm))
Out:
True
True
...
True
Related
I am trying to produce an "OLS Regression Results" table for a college project, and my code is this:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
data=np.loadtxt('file.txt',skiprows=1)
season=data[:nb,0]
tod=data[:nb,1]
obs=data[:nb,2]
pr=data[:nb,3]
data_lm = ols('pr ~ tod + season',data=data).fit()
table = sm.stats.anova_lm(data_lm, typ=2)
data_lm.summary()
print(table)
It gives me this error:
PatsyError: Error evaluating factor: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
    pr ~ tod + season
I think the error is in the format of my data. The text file contains 4 different columns (season, tod, obs and pr).
season:[3., 3., 1., 3., 3., 3., 3., 3., 1., 3., 3., 1., 3., 2., 3., 3., 3.,
1., 1., 1., 1., 3., 1., 2., 1., 3., 1., 1., 2., 1., 3., 3., 1., 1.,
1., 2., 3.]
tod:[2., 4., 1., 2., 2., 2., 4., 1., 3., 3., 1., 3., 3., 2., 2., 4., 3.,
3., 4., 3., 3., 2., 4., 1., 3., 4., 1., 1., 1., 3., 3., 4., 3., 3.,
4., 4., 4.]
obs:[ 1., 1., 1., 3., 3., 3., 3., 3., 4., 4., 4., 5., 5.,
5., 5., 5., 6., 9., 9., 12., 12., 12., 12., 12., 13., 13.,
16., 16., 17., 19., 19., 19., 20., 20., 20., 20., 24.]
pr:[0. , 0. , 0. , 0.1, 0.2, 0.2, 0.4, 0.4, 0.5, 0.5, 0.7, 0.7, 0.7,
0.8, 0.8, 0.8, 0.8, 0.9, 0.9, 1. , 1. , 1.1, 1.1, 1.2, 1.3, 1.4,
1.4, 1.5, 1.6, 1.7, 1.7, 1.8, 1.8, 1.9, 2. , 2. , 2. ]
Can anyone help me?
data is a basic NumPy ndarray object. These accept integers, slices, or other "array like" objects when you index them with []. However, the ols function explicitly says in the documentation:
data must define __getitem__ with the keys in the formula
That means data must be a pandas DataFrame, a dictionary, or a NumPy structured array, with a __getitem__ method that accepts str objects as indices.
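For example, here is a minimal sketch of the fix, assuming file.txt has one header row and its columns appear in the order season, tod, obs, pr (adjust the names and order to your actual file):
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = np.loadtxt('file.txt', skiprows=1)
# wrap the plain ndarray in a DataFrame so the formula can look up columns by name
df = pd.DataFrame(data, columns=['season', 'tod', 'obs', 'pr'])

data_lm = ols('pr ~ tod + season', data=df).fit()
table = sm.stats.anova_lm(data_lm, typ=2)
print(data_lm.summary())
print(table)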
I have 2 numpy arrays: one of shape (753, 8, 1), denoting 8 sequential actions of a customer, and the other of shape (753, 10), denoting 10 features of a training sample.
How can I combine these two such that all 10 features are appended to each of the 8 sequential actions of a training sample, i.e. the combined array has shape (753, 8, 11)?
Maybe something like this:
import numpy as np
# create dummy arrays
a = np.zeros((753, 8, 1))
b = np.arange(753*10).reshape(753, 10)
# make a new axis for b and repeat the values along axis 1
c = np.repeat(b[:, np.newaxis, :], 8, axis=1)
c.shape
>>> (753, 8, 10)
# now the first two axes of a and c have the same shape
# append the values in c to a along the last axis
result = np.append(a, c, axis=2)
result.shape
>>> (753, 8, 11)
result[0]
>>> array([[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[0., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
# values from b (0-9) have been appended to a (0)
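As a side note, an equivalent one-liner under the same shape assumptions broadcasts b to the middle axis and concatenates, avoiding the explicit repeat:
result_alt = np.concatenate([a, np.broadcast_to(b[:, None, :], (753, 8, 10))], axis=2)
result_alt.shape
>>> (753, 8, 11)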
I was watching Andrew Ng's videos on CNNs and wanted to convolve a 6 x 6 image with a 3 x 3 filter. The way I approached this with numpy is as follows:
image = np.ones((6,6))
filter = np.ones((3,3))
convolved = np.convolve(image, filter)
Running this gives an error saying:
ValueError: object too deep for desired array
I could not work out from the numpy documentation of convolve how to use the method correctly.
Also, is there a way to do strided convolutions with numpy?
The np.convolve function, unfortunately, only works for 1-D convolution. That's why you get an error; you need a function that performs 2-D convolution.
However, even if it did work, you actually have the wrong operation. What is called convolution in machine learning is more properly termed cross-correlation in mathematics. They're actually almost the same; convolution involves flipping the filter matrix followed by performing cross-correlation.
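A quick way to see that relationship (a sketch using scipy.signal's 2-D routines, with an asymmetric kernel so the flip actually matters):
import numpy as np
from scipy.signal import convolve2d, correlate2d

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.arange(9, dtype=float).reshape(3, 3)
# 2-D convolution equals cross-correlation with the kernel flipped along both axes
assert np.allclose(convolve2d(img, k), correlate2d(img, np.flip(k)))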
To solve your problem, you can look at scipy.signal.correlate (also, don't use filter as a variable name, since you'll shadow the built-in function):
from scipy.signal import correlate
image = np.ones((6, 6))
f = np.ones((3, 3))
correlate(image, f)
Output:
array([[1., 2., 3., 3., 3., 3., 2., 1.],
[2., 4., 6., 6., 6., 6., 4., 2.],
[3., 6., 9., 9., 9., 9., 6., 3.],
[3., 6., 9., 9., 9., 9., 6., 3.],
[3., 6., 9., 9., 9., 9., 6., 3.],
[3., 6., 9., 9., 9., 9., 6., 3.],
[2., 4., 6., 6., 6., 6., 4., 2.],
[1., 2., 3., 3., 3., 3., 2., 1.]])
This is the standard setting of full cross-correlation. If you want to remove elements which would rely on the zero-padding, pass mode='valid':
from scipy.signal import correlate
image = np.ones((6, 6))
f = np.ones((3, 3))
correlate(image, f, mode='valid')
Output:
array([[9., 9., 9., 9.],
[9., 9., 9., 9.],
[9., 9., 9., 9.],
[9., 9., 9., 9.]])
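As for strided convolutions: neither np.convolve nor scipy.signal.correlate takes a stride argument, but you can emulate a stride by computing the dense 'valid' result and then subsampling it. A simple (if not maximally efficient) sketch, reusing image and f from above:
stride = 2
# dense 'valid' cross-correlation, then keep every stride-th row and column;
# this matches what a strided convolution layer would compute
strided = correlate(image, f, mode='valid')[::stride, ::stride]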
I want to modify this code, perhaps with some sort of loop over the variable names, so that I don't have to write the 'if data1 is not None' part for each input. I was also wondering whether the number of inputs to the function could somehow be made dynamic: if, say, I want to pass in 100 different data sets, I can't list them all in the function signature, so what should I do instead?
Also, how can I put a title on both of the plots? When I use plt.title(), it only shows the last title.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(4)
randomSet = np.random.randint(0, 2, (10, 20))
np.random.seed(3)
randomSet3 = np.random.randint(0, 2, (10, 20))
np.random.seed(2)
randomSet2 = np.random.randint(0, 2, (10, 20))
np.random.seed(1)
randomSet1 = np.random.randint(0, 2, (10, 20))
def showResult(data, data1=None, data2=None, data3=None, data4=None, data5=None, nscan=1):
    #index = 0
    total = np.zeros(data.shape[0]*data.shape[1])
    dataList = [data.reshape(data.shape[0]*data.shape[1])]
    if data1 is not None:
        dataList.append(data1.reshape(data1.shape[0]*data1.shape[1]))
    if data2 is not None:
        dataList.append(data2.reshape(data2.shape[0]*data2.shape[1]))
    if data3 is not None:
        dataList.append(data3.reshape(data3.shape[0]*data3.shape[1]))
    if data4 is not None:
        dataList.append(data4.reshape(data4.shape[0]*data4.shape[1]))
    if data5 is not None:
        dataList.append(data5.reshape(data5.shape[0]*data5.shape[1]))
    #total = copy.copy(data)
    for i in range(nscan):
        total += dataList[i]
    fig = plt.figure(figsize=(8, 10))
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)
    ax1.imshow(total.reshape(data.shape[0], data.shape[1]), cmap='gray', interpolation='nearest')
    #plt.title('Image')
    ax2.hist(total)
    #plt.title('Histogram')
    plt.show()
    return total
showResult(randomSet, randomSet1, randomSet, randomSet3, randomSet, randomSet2, nscan= 6)
Output should be:
array([ 1., 2., 5., 4., 4., 2., 4., 3., 2., 5., 0., 3., 5.,
6., 2., 5., 5., 5., 0., 0., 0., 2., 2., 1., 2., 0.,
4., 0., 5., 4., 4., 4., 1., 6., 2., 1., 3., 1., 4.,
1., 2., 4., 1., 3., 5., 3., 1., 5., 2., 4., 4., 1.,
1., 3., 1., 6., 3., 5., 5., 1., 3., 5., 4., 1., 4.,
3., 5., 5., 4., 5., 2., 1., 4., 1., 2., 1., 6., 3.,
2., 4., 5., 1., 1., 2., 5., 3., 2., 5., 3., 2., 3.,
3., 4., 1., 4., 2., 5., 2., 4., 5., 5., 5., 1., 4.,
5., 0., 4., 1., 5., 1., 5., 2., 2., 2., 1., 3., 1.,
1., 3., 1., 3., 3., 5., 5., 5., 2., 2., 1., 4., 5.,
2., 5., 2., 3., 2., 0., 0., 5., 5., 5., 2., 2., 1.,
1., 4., 4., 4., 2., 5., 2., 4., 5., 4., 2., 2., 1.,
4., 4., 2., 4., 4., 1., 4., 3., 5., 0., 1., 2., 3.,
0., 5., 3., 2., 2., 2., 4., 4., 2., 4., 0., 5., 5.,
2., 3., 0., 1., 1., 5., 3., 1., 3., 5., 1., 2., 3.,
5., 5., 2., 2., 5.])
Output plots
You don't need to hardcode each dataset individually. You can simply call np.random.randint(low, high, (x, y, n)), with n being the number of scans/trials. Summing along the last axis with np.sum() then trivially gives an array of shape (x, y).
To put a title on each subplot, call set_title on the corresponding Axes object. Overall,
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
sets = 6
data = np.random.randint(0, 2, (10, 20, sets))
def plot_data(data):
    # sum the scans along the last axis
    total = np.sum(data, axis=-1)
    fig = plt.figure(figsize=(8, 10))
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)
    ax1.imshow(total, cmap='gray', interpolation='nearest')
    ax1.set_title('Image')
    # flatten the 2-D array for the histogram
    ax2.hist(total.flatten())
    ax2.set_title('Histogram')
    plt.show()
plot_data(data)
In Python we do something like this, for example:
n = 30
A = numpy.zeros(shape=(n, n))
for i in range(0, n):
    for j in range(0, n):
        A[i, j] = i + j  # i+j is just an example assignment
to manage a 2-dim array: simply use nested loops to walk over the rows and columns.
But my friend asked why it's so complicated, and told me that Mathematica has much easier ways to manage n-dim arrays (I'm not sure; I've never used Mathematica).
Can you give me an alternative way to manage value assignment on an n-dim matrix/array (in NumPy) or list (the ordinary one in Python)?
You are looking for numpy.fromfunction:
>>> numpy.fromfunction(lambda x, y: x + y, (5, 5))
array([[ 0., 1., 2., 3., 4.],
[ 1., 2., 3., 4., 5.],
[ 2., 3., 4., 5., 6.],
[ 3., 4., 5., 6., 7.],
[ 4., 5., 6., 7., 8.]])
You can simplify slightly using operator:
>>> from operator import add
>>> numpy.fromfunction(add, (5, 5))
array([[ 0., 1., 2., 3., 4.],
[ 1., 2., 3., 4., 5.],
[ 2., 3., 4., 5., 6.],
[ 3., 4., 5., 6., 7.],
[ 4., 5., 6., 7., 8.]])
You can use the mathematical rules for matrices and vectors:
n = 30
w = numpy.arange(n).reshape(1, -1)
A = w + w.T
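This works through NumPy broadcasting: a (1, n) row vector plus its (n, 1) transpose broadcasts to an (n, n) matrix whose (i, j) entry is i + j. An equivalent spelling, if it reads more clearly, uses the outer method of the add ufunc:
n = 30
A = numpy.add.outer(numpy.arange(n), numpy.arange(n))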