I'm trying to run a simple Sklearn Ridge regression using an array of sample weights.
X_train is a ~200k by 100 2D NumPy array. I get a MemoryError when I try to use the sample_weight option. It works just fine without that option. For the sake of simplicity I reduced the features to 2, and sklearn still throws a MemoryError.
Any ideas?
from sklearn import linear_model

model = linear_model.Ridge()
model.fit(X_train, y_train, sample_weight=w_tr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 449, in fit
return super(Ridge, self).fit(X, y, sample_weight=sample_weight)
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 338, in fit
solver=self.solver)
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 286, in ridge_regression
K = safe_sparse_dot(X, X.T, dense_output=True)
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot
return np.dot(a, b)
MemoryError
Setting sample weights can change the way the sklearn linear_model Ridge object processes your data, especially when the matrix is tall (n_samples > n_features), as in your case. Without sample weights it exploits the fact that X.T.dot(X) is a relatively small matrix (100x100 in your case) and thus inverts a matrix in feature space. When sample weights are given, the Ridge object decides to stay in sample space (in order to be able to weight the samples individually; see the relevant lines here and here for the branching to _solve_dense_cholesky_kernel, which works in sample space), and thus needs to invert a matrix of the same size as X.dot(X.T) (which in your case is n_samples x n_samples = 200000 x 200000, and will cause a memory error before it is even created). This is really an implementation issue; please see the manual workaround below.
TL;DR: The Ridge object is unable to treat sample weights in feature space, and so it generates an n_samples x n_samples matrix, which causes your memory error.
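For scale, a quick back-of-the-envelope calculation (mine, not from the traceback): a dense 200000 x 200000 float64 matrix takes 200000^2 * 8 bytes ≈ 3.2 * 10^11 bytes, i.e. roughly 320 GB, so the allocation fails long before any inversion is attempted.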
While waiting for a possible remedy within scikit-learn, you could try to solve the problem in feature space explicitly, like so:
import numpy as np

# X, y: your training data (X_train and y_train from the question)
alpha = 1.  # you did not specify this in your Ridge object; 1.0 is the default penalty
sample_weights = w_tr.ravel()  # make sure this is 1D
target = y.ravel()             # make sure this is 1D as well
n_samples, n_features = X.shape

# Closed-form weighted ridge solution in feature space:
#   coef = (X^T W X + alpha * I)^{-1} X^T W y
coef = np.linalg.inv((X.T * sample_weights).dot(X) +
                     alpha * np.eye(n_features)).dot(X.T.dot(sample_weights * target))
For a new sample X_new, your prediction would be
prediction = np.dot(X_new, coef)
In order to confirm the validity of this approach, you can compare this coef to model.coef_ (after you have fit the model) from your code, applying both to a smaller number of samples (e.g. 300) that does not cause the memory error when used with the Ridge object.
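For concreteness, here is a minimal self-contained version of that check on made-up data (my sketch, not from the original answer; fit_intercept=False sidesteps the centering issue discussed just below):

import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
n, p = 300, 5
X_small = rng.randn(n, p)
w_small = rng.rand(n)
t_small = rng.randn(n)

# sklearn's Ridge with sample weights, no intercept
model = linear_model.Ridge(alpha=1.0, fit_intercept=False)
model.fit(X_small, t_small, sample_weight=w_small)

# closed-form feature-space solution from above
coef = np.linalg.inv((X_small.T * w_small).dot(X_small) +
                     1.0 * np.eye(p)).dot(X_small.T.dot(w_small * t_small))

print(np.allclose(model.coef_, coef, atol=1e-6))  # should print True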
IMPORTANT: The workaround above only coincides with the sklearn implementation if your data is already centered, i.e. your data must have mean 0. Implementing a full ridge regression with intercept fitting here would amount to a contribution to scikit-learn, so it would be better to post it there. The way to center your data is as follows:
X_mean = X.mean(axis=0)
target_mean = target.mean() # Assuming target is 1d as forced above
You then use the provided code on
X_centered = X - X_mean
target_centered = target - target_mean
For predictions on new data, you need
prediction = np.dot(X_new - X_mean, coef) + target_mean
EDIT: As of April 15th, 2014, scikit-learn ridge regression can deal with this problem (bleeding edge code). It will be available in the 0.15 release.
What NumPy version do you have installed?
Looks like the ultimate method call that triggers it is numpy.dot(X, X.T), which, even with X.shape = (200000, 2) in your case, would generate a 200000-by-200000 matrix.
Try converting your observations to a sparse matrix type, or reduce the number of observations you use (there may be a variant of ridge regression that uses a few observations one batch at a time? see the sketch below).
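A hedged sketch of that batch idea: SGDRegressor minimizes the same squared error + L2 objective as Ridge, but one mini-batch at a time via partial_fit, and it accepts sample weights, so it never forms an n_samples x n_samples matrix. Note that its alpha is scaled differently from Ridge's (treat it as a tuning parameter), SGD generally wants standardized features, and older scikit-learn versions name the loss 'squared_loss':

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(loss='squared_error', penalty='l2')  # ridge-like objective

batch = 10000
for start in range(0, X_train.shape[0], batch):
    stop = start + batch
    sgd.partial_fit(X_train[start:stop], y_train[start:stop],
                    sample_weight=w_tr[start:stop])
print(sgd.coef_)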
When performing a logistic regression with the two APIs (statsmodels and scikit-learn), they give different coefficients.
Even with this simple example the results do not agree in terms of coefficients. And I followed older advice on the same topic, like setting a large value for the parameter C in sklearn, since it makes the penalization almost vanish (or setting penalty="none").
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
n = 200
x = np.random.randint(0, 2, size=n)
y = (x > (0.5 + np.random.normal(0, 0.5, n))).astype(int)
display(pd.crosstab(y, x))  # display() assumes a Jupyter/IPython session
max_iter = 100
#### Statsmodels
res_sm = sm.Logit(y, x).fit(method="ncg", maxiter=max_iter)
print(res_sm.params)
#### Scikit-Learn
res_sk = LogisticRegression( solver='newton-cg', multi_class='multinomial', max_iter=max_iter, fit_intercept=True, C=1e8 )
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.coef_)
For example, I just ran the above code and got 1.72276655 for statsmodels and 1.86324749 for sklearn. And when run multiple times it always gives different coefficients (sometimes closer than others, but different anyway).
Thus, even with this toy example the two APIs give different coefficients (and hence different odds ratios), and with real data (not shown here) it almost gets "out of control"...
Am I missing something? How can I produce similar coefficients, agreeing for example to at least one or two decimal places?
There are some issues with your code.
To start with, the two models you show here are not equivalent: although you fit your scikit-learn LogisticRegression with fit_intercept=True (which is the default setting), you don't do so with your statsmodels one; from the statsmodels docs:
An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
It seems that this is a frequent point of confusion - see for example scikit-learn & statsmodels - which R-squared is correct? (and my own answer there as well).
The other issue is that, although you are in a binary classification setting, you ask for multi_class='multinomial' in your LogisticRegression, which should not be the case.
The third issue is that, as explained in the relevant Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels:
There is no way to switch off regularization in scikit-learn, but you can make it ineffective by setting the tuning parameter C to a large number.
which makes the two models again non-comparable in principle, but you have successfully addressed it here by setting C=1e8. In fact, since then (2016), scikit-learn has indeed added a way to switch regularization off, by setting penalty='none'; according to the docs:
If ‘none’ (not supported by the liblinear solver), no regularization is applied.
which should now be considered the canonical way to switch off the regularization.
So, incorporating these changes in your code, we have:
np.random.seed(42) # for reproducibility
#### Statsmodels
# first artificially add intercept to x, as advised in the docs:
x_ = sm.add_constant(x)
res_sm = sm.Logit(y, x_).fit(method="ncg", maxiter=max_iter) # x_ here
print(res_sm.params)
Which gives the result:
Optimization terminated successfully.
Current function value: 0.403297
Iterations: 5
Function evaluations: 6
Gradient evaluations: 10
Hessian evaluations: 5
[-1.65822763 3.65065752]
with the first element of the array being the intercept and the second the coefficient of x. For scikit-learn we have:
#### Scikit-Learn
res_sk = LogisticRegression(solver='newton-cg', max_iter=max_iter, fit_intercept=True, penalty='none')
res_sk.fit( x.reshape(n, 1), y )
print(res_sk.intercept_, res_sk.coef_)
with the result being:
[-1.65822806] [[3.65065707]]
These results are practically identical, within the machine's numeric precision.
Repeating the procedure for different values of np.random.seed() does not change the essence of the results shown above.
I have 2 major problems with defining a custom loss function in Keras to compile my CNN network. I am working on 2D image registration (aligning a pair of 2D images to best fit each other) via CNN. The output of the network is a 5-element float array as the prediction of the net (1 rotation, 2 translation and 2 scaling parameters over x and y, matching the warping code below). There are two main loss functions (and also metrics) for the registration problem, called the Dice coefficient and TRE (Target Registration Error, the sum of distances between pairs of landmark points marked by a physician). I need to implement both of these loss functions. For the Dice coefficient:
1- First of all, I need to know which sample is under consideration by the optimizer so that I can read the content of that sample and compute the Dice coefficient, while only y_true and y_pred are defined in a custom loss function according to the Keras documentation.
2- I wrote the following code as my loss function to 1) warp the first image, 2) binarize both images (each sample is composed of 2 images: one moving image and one fixed image), and 3) return the Dice coefficient between the pair of images (warped and fixed).
Since the parameters of a custom loss function are restricted to y_true and y_pred, there is no index for the sample under consideration, and my problem is unsupervised (i.e. there is no need for any label), I used the indices of the samples fed to the CNN as the labels, set the batch size to 1, and tried to use y_true[0] as the index of the training sample currently under consideration by the CNN.
import numpy as np
import scipy.ndimage as ndi
from scipy.spatial import distance as dis

def my_loss_f(y_true, y_pred):
    a = y_true[0]
    nimg1 = warping(Train_DataCT[a], y_pred)  # line 83 in CNN1.py
    return dis.dice(BW(nimg1).flatten(), BW(Train_DataMR[a]).flatten())

def warping(nimg, x):
    # x holds the 5 predicted parameters: rotation, shift (2), zoom (2)
    nimg1 = ndi.rotate(nimg, x[0], reshape=False)
    nimg1 = ndi.shift(nimg1, [x[1], x[2]])
    nimg1 = clipped_zoom(nimg1, [x[3], x[4]])
    return nimg1

def BW(nimg1):
    # binarize the image around the center of mass of its histogram
    hist = ndi.histogram(nimg1, 0, 255, 255)
    som = ndi.center_of_mass(hist)
    bwnimg = np.where(nimg1 > som, 1, 0)
    return bwnimg
But I constantly get different errors, such as the one below. Someone told me to use TensorFlow or the Keras backend to rewrite my loss function, but I need NumPy and SciPy and cannot jump into that kind of low-level programming, as my time to complete the project is very restricted.
The main problem is that y_true is empty (it is just a placeholder, not a real variable with a value) and cannot be used as an index in Train_DataCT[y_true[0]]; the error says the index should be an integer, a slice, a boolean and so on, and a tensor cannot be used as an index. I tried a number of ways, e.g. converting y_true to an ndarray or using y_true.eval() to initialize it, but instead I got the error: Session error, no default session.
Thanks in advance; please, someone help me.
Traceback (most recent call last):
File "D:/Python/Reg/Deep/CNN1.py", line 83, in <module>
model.compile(optimizer='rmsprop',loss=my_loss_f)
File "C:\Users\Hamidreza\Anaconda3\lib\site-packages\keras\engine\training.py", line 342, in compile
sample_weight, mask)
File "C:\Users\Hamidreza\Anaconda3\lib\site-packages\keras\engine\training_utils.py", line 404, in weighted
score_array = fn(y_true, y_pred)
File "D:/Python/Reg/Deep/CNN1.py", line 68, in my_loss_f
nimg1=warping(Train_DataCT[1],y_pred)
File "D:/Python/Reg/Deep/CNN1.py", line 55, in warping
nimg1 = ndi.rotate(nimg, x[0], reshape=False)
File "C:\Users\Hamidreza\Anaconda3\lib\site-packages\scipy\ndimage\interpolation.py", line 703, in rotate
m11 = math.cos(angle)
TypeError: must be real number, not Tensor
Process finished with exit code 1
Your loss function should work on the tensor type of your backend. If you're using Keras with the TensorFlow backend, the following function might help with combining advanced NumPy/SciPy functions and tensors:
https://www.tensorflow.org/api_docs/python/tf/numpy_function?version=stable
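For illustration, a minimal hedged sketch of wrapping a NumPy-based Dice computation this way (numpy_dice is a hypothetical stand-in for your SciPy code; note that no gradient flows through the wrapped call, so this is useful for monitoring, but the optimizer cannot train through it):

import numpy as np
import tensorflow as tf

def numpy_dice(y_true, y_pred):
    # inside this function, y_true and y_pred are plain NumPy arrays
    t = (y_true > 0.5).astype(np.float32)
    p = (y_pred > 0.5).astype(np.float32)
    inter = np.sum(t * p)
    return np.float32(1.0 - 2.0 * inter / (np.sum(t) + np.sum(p) + 1e-7))

def dice_loss(y_true, y_pred):
    # tf.numpy_function calls back into Python at graph-execution time
    return tf.numpy_function(numpy_dice, [y_true, y_pred], tf.float32)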
Also in the following you can find a lot more useful stuff on this:
How to make a custom activation function with only Python in Tensorflow?
Let me refine my question: I need the inputted sample data to calculate the loss function. With or without batching, I should know the index of the sample under consideration by the CNN in order to compute the loss, e.g. the Dice coefficient between a pair of inputted images.
Since my problem is unsupervised learning, as an alternative solution I used y_true as the index of the sample, but when, e.g. after tf.flatten, I use y_true[0] as in Train_DataCT[y_true[0]], I get the error: the index cannot be a tensor!
How could I use .run() or .eval() in a customized loss function so that y_true gets a value, which I can then convert to, e.g., an ndarray?
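One hedged workaround in the spirit of the tf.numpy_function suggestion above: keep feeding the sample index through y_true, and do the array lookup inside the wrapped Python function, where the index is an ordinary NumPy value. This reuses warping, BW, Train_DataCT and Train_DataMR from the question; the same caveat applies, i.e. no gradient flows through the wrapped call, so actual training would require reimplementing the warping with differentiable TF ops.

import numpy as np
import tensorflow as tf
from scipy.spatial import distance as dis

def numpy_loss(idx, params):
    a = int(np.ravel(idx)[0])  # batch size 1: the sample's index, now a real number
    nimg1 = warping(Train_DataCT[a], params[0])
    return np.float32(dis.dice(BW(nimg1).flatten(),
                               BW(Train_DataMR[a]).flatten()))

def my_loss_f(y_true, y_pred):
    return tf.numpy_function(numpy_loss, [y_true, y_pred], tf.float32)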
I'm trying to write my own logistic regressor (using batch/mini-batch gradient descent) for practice purposes.
I generated a random dataset (see below) with randomly generated inputs, and the output is binary (0, 1). I manually chose coefficients for the inputs and was hoping to be able to reproduce them (see below for the code snippet). However, to my surprise, neither my own code nor sklearn's LogisticRegression was able to reproduce the actual numbers (although the sign and order of magnitude are in line). Moreover, the coefficients my algorithm produced are different from the ones produced by sklearn.
Am I misinterpreting what the coefficients for a logistic regression are?
I will appreciate any insight into this discrepancy.
Thank you!
Edit: I tried using statsmodels' Logit and got yet a third set of slightly different values for the coefficients.
Some more info that might be relevant:
I wrote a linear regressor using an almost identical code and it worked perfectly, so I am fairly confident this is not a problem in the code. Also my regressor actually outperformed the sklearn one on the training set, and they have the exact same accuracy on the test set, so I have no reason to believe the regressors are wrong.
Code snippets for the generation of the dataset:
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))  # assumed; not in the original snippet
size = 10000                              # assumed sample count
x = np.ones((size, 3))                    # column 0 = 1 for the intercept (assumed)
o1 = 2
o2 = -3
x[:, 1] = np.random.rand(size) * 2
x[:, 2] = np.random.rand(size) * 3
y = np.vectorize(sigmoid)(x[:, 1] * o1 + x[:, 2] * o2 + np.random.normal(size=size))
So, as can be seen, the input coefficients are +2 and -3 (intercept 0);
sklearn coefficients were ~2.8 and ~-4.8;
my coefficients were ~1.7 and ~-2.6
And of the regressor (the most relevant parts of it):
dc = np.zeros_like(self.coeff)  # gradient accumulator (initialization not shown in the original excerpt)
for j in range(bin_size):
    xs = x[i]                   # i: running sample index, maintained elsewhere in the class
    y_real = y[i]
    z = np.dot(self.coeff, xs)
    h = sigmoid(z)
    dc += (h - y_real) * xs
self.coeff -= dc * (learning_rate / n)  # one update per mini-batch
What was the intercept learned? It really should not be a surprise, as your y is a polynomial of 3rd degree, while your model has only two coefficients, whereas three coefficients plus a y-intercept would be needed to model the response variable from the predictors.
Furthermore, values may differ due to, for example, SGD.
Not really sure, but the coefficients could be different and still return the correct y for a finite set of points. What are the metrics on each model? Do they differ?
I am trying to use KNN to correctly classify .wav files into two groups, group 0 and group 1.
I extracted the data and created the model, but when I try to fit it I get the following error:
Traceback (most recent call last):
File "/..../....../KNN.py", line 20, in <module>
classifier.fit(X_train, y_train)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 761, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 405, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
I have found these two stackoverflow posts which describe similar issues:
sklearn Logistic Regression "ValueError: Found array with dim 3. Estimator expected <= 2."
Error: Found array with dim 3. Estimator expected <= 2
And, correct me if I'm wrong, but it appears that scikit-learn can only accept 2-dimensional data.
My training data has shape (3240, 20, 5255)
Which consists of:
3240 .wav files in this dataset (this is index 0 of the training data)
For each .wav file there is a (20, 5255) numpy array which represents the MFCC coefficients (MFCC coefficients try to represent the sound in a numeric way).
My target data has shape (3240,) # each entry is the category, 0 or 1
What code can I use to manipulate my training and target data to convert it into a form that is usable by scikit-learn? Also, how can I ensure that data is not lost when I go down from 3 dimensions to 2 dimensions?
It is true, sklearn works only with 2D data.
What you can try to do:
Just use np.reshape on the training data to convert it to shape (3240, 20*5255). It will preserve all the original information. But sklearn will not be able to exploit the implicit structure in this data (e.g. that features 1, 21, 41, etc. are different versions of the same variable); see the sketch at the end of this answer.
Build a convolutional neural network on your original data (e.g. with the tensorflow+Keras stack). CNNs were designed specifically to handle such multidimensional data and exploit its structure. But they have lots of hyperparameters to tune.
Use dimensionality reduction (e.g. PCA) on the data reshaped to (3240, 20*5255). It will try to preserve as much information as possible while still keeping the number of features low.
Use manual feature engineering to extract specific information from the data structure (e.g. descriptive statistics along each dimension), and train your model on such features.
If you had more data (e.g. 100K examples), the first approach might work best. In your case (3K examples and over 100K features, since 20*5255 = 105100) you need to regularize your model heavily to avoid overfitting.
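A minimal sketch of the first and third options combined (the names X and y for the MFCC array and the labels are assumptions, and the PCA size is an arbitrary choice):

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# X: (3240, 20, 5255) MFCC arrays, y: (3240,) labels in {0, 1} (assumed names)
X_flat = X.reshape(X.shape[0], -1)  # -> (3240, 105100); no values are lost

X_train, X_test, y_train, y_test = train_test_split(
    X_flat, y, test_size=0.2, random_state=0)

# PCA down to 100 components (arbitrary) before the KNN step
clf = make_pipeline(PCA(n_components=100), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))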
I would like to get these plots:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
from an elastic net I have already trained.
The example does
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target  # missing from the original snippet
eps = 5e-3           # path length, as in the linked example; also missing above

print("Computing regularization path using the elastic net...")
alphas_enet, coefs_enet, _ = enet_path(
    X, y, eps=eps, l1_ratio=0.8, fit_intercept=False)
which basically requires recomputing from X,y the whole model.
Unfortunately, I do not have X,y.
In the training I have used sklearn.linear_model.ElasticNetCV which returns:
coef_ : array, shape (n_features,) | (n_targets, n_features)
    Parameter vector (w in the cost function formula).
mse_path_ : array, shape (n_l1_ratio, n_alpha, n_folds)
    Mean square error for the test set on each fold, varying l1_ratio and alpha.
while I would need parameter vector varying l1_ratio and alpha.
Can this be done without recomputation? It would be a tremendous waste of time, as those coefficient paths have actually been calculated already.
Short answer
Not once it is fit.
Long answer
If you look through the source code for ElasticNetCV, you will see that within the fit method the class calls enet_path, but with alphas set to the candidate values supplied through ElasticNetCV's alphas parameter, each evaluated on its own. So instead of calculating the coefficients for the default 100 values of alpha that allow you to create the path graphs, you only get one coefficient vector per alpha value set in your CV. That being said, you could initialize the alphas in your CV to mimic the 100 defaults in enet_path and then combine the coefficients from each fold, but this would be rather long-running; and since, as you mentioned, you have already fit the CV, it is not an option here.
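For reference, should refitting ever become possible, the direct route is to recompute the path at the l1_ratio the CV selected; a hedged sketch, where cv_model is your fitted ElasticNetCV (name assumed) and X, y are the original training data you said you no longer have:

from sklearn.linear_model import enet_path

# n_alphas=100 mirrors the default grid used for the example plots
alphas_enet, coefs_enet, _ = enet_path(X, y, l1_ratio=cv_model.l1_ratio_,
                                       n_alphas=100)
# coefs_enet has shape (n_features, n_alphas): one column per alpha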