sklearn dimensionality issues "Found array with dim 3. Estimator expected <= 2" - python

I am trying to use KNN to correctly classify .wav files into two groups, group 0 and group 1.
I extracted the data and created the model, but when I fit the model (before I can even call .predict()) I get the following error:
Traceback (most recent call last):
File "/..../....../KNN.py", line 20, in <module>
classifier.fit(X_train, y_train)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 761, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 405, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
I have found these two stackoverflow posts which describe similar issues:
sklearn Logistic Regression "ValueError: Found array with dim 3. Estimator expected <= 2."
Error: Found array with dim 3. Estimator expected <= 2
And, correct me if I'm wrong, but it appears that scikit-learn can only accept 2-dimensional data.
My training data has shape (3240, 20, 5255)
Which consists of:
3240 .wav files in this dataset (this is axis 0 of the training data)
For each .wav file there is a (20, 5255) numpy array which represents the MFCC coefficients (MFCC coefficients represent the sound numerically).
My target data has shape (3240,) # each category is 0 or 1
What code can I use to manipulate my training and target data to convert it into a form that is usable by scikit-learn? Also, how can I ensure that data is not lost when I go down from 3 dimensions to 2 dimensions?

It is true, sklearn works only with 2D data.
What you can try to do:
Just use np.reshape on the training data to convert it to shape (3240, 20*5255); a short sketch follows this list. It will preserve all the original information, but sklearn will not be able to exploit the implicit structure in this data (e.g. that features 1, 21, 41, etc. are different versions of the same variable).
Build a convolutional neural network on your original data (e.g. with the tensorflow+Keras stack). CNNs were designed specifically to handle such multidimensional data and to exploit its structure, but they have lots of hyperparameters to tune.
Use dimensionality reduction (e.g. PCA) on the data reshaped to (3240, 20*5255). It will try to preserve as much information as possible while keeping the number of features low.
Use manual feature engineering to extract specific information from the data structure (e.g. descriptive statistics along each dimension), and train your model on such features.
If you had more data (e.g. 100K examples), the first approach might work best. In your case (3K examples and roughly 100K features after flattening) you need to regularize your model heavily to avoid overfitting.
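For concreteness, here is a minimal sketch combining options 1 and 3, assuming the shapes from the question; the random arrays stand in for the real MFCC data, and the choices of 100 components and 5 neighbors are placeholders, not tuned values:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data at the question's shapes: 3240 files, each a (20, 5255) MFCC array
rng = np.random.default_rng(0)
X_train = rng.standard_normal((3240, 20, 5255), dtype=np.float32)
y_train = rng.integers(0, 2, size=3240)

# Option 1: flatten each (20, 5255) matrix into one long feature vector (lossless)
X_flat = X_train.reshape(X_train.shape[0], -1)   # -> (3240, 105100)

# Option 3: follow the reshape with PCA to keep the feature count manageable
pca = PCA(n_components=100)                      # 100 is an arbitrary choice
X_reduced = pca.fit_transform(X_flat)            # -> (3240, 100)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_reduced, y_train)

The reshape step discards nothing; only the PCA step loses information, and it keeps the directions of highest variance.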

Related

TimeSeries K-means clustering for multi-dimensional data

I'm using Tslearn's TimeSeriesKMeans library to cluster my dataset, which has shape (3000, 300, 8). However, the documentation only covers the case where the dataset has shape (n_samples, timesteps, 1), i.e. a single feature. Can anybody help me understand whether I can perform clustering with a higher dimension?
I'm using "DTW" as my distance metric.
I used TimeSeriesKMeans from the tslearn.clustering library. As you mentioned, the only example available in the tslearn documentation uses a 1-dimensional input. However, it is very common to work with time series data of higher dimensions. For instance, in my case I was clustering human motion, which was 30 frames of 135 joint key points per frame, so my data shape was (number_of_samples, number_of_frames, features).
In order to use tslearn's TimeSeriesKMeans, you need to input an ndarray of shape (n_sample, m_time_step (sequence_length), k_features (k_dimensions)).
If you take a look at the documentation, the fit function's parameters are as follows:
fit(X, y=None)
    Compute k-means clustering.
    Parameters:
        X : array-like of shape=(n_ts, sz, d)
            Time series dataset.
        y : Ignored
The point is, your input data should be an ndarray with shape (n_sample, seq_length, n_features); otherwise it won't work. For example, at first my data was a list of shape (n_samples,) where each element was shaped (seq_length, features). It didn't work until I converted it to an ndarray of shape (n_sample, seq_length, features).
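As a minimal sketch of that conversion, with tiny stand-in data since DTW is expensive (the shapes and cluster count here are illustrative, not the real dataset):

import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

rng = np.random.default_rng(0)
# A list of per-sample 2D arrays, each (seq_length, features), like the broken case above
series_list = [rng.standard_normal((30, 8)) for _ in range(60)]

# Stack the list into the single 3D ndarray tslearn expects
X = to_time_series_dataset(series_list)          # shape (60, 30, 8)

km = TimeSeriesKMeans(n_clusters=4, metric="dtw", random_state=0)
labels = km.fit_predict(X)                       # one cluster label per series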

Error in Keras Custom Loss Function for Compile the Network (CNN)

I have two major problems with defining a custom loss function in Keras to compile my CNN network. I am working on 2D image registration (aligning a pair of 2D images to best fit each other) via a CNN. The output of the network is a 5-dimensional float array as the network's prediction (1 rotation, 2 translations, and 2 scalings over x and y). There are two main loss functions (and also metrics) for the registration problem, called the Dice coefficient and TRE (Target Registration Error, the sum of distances between pairs of landmark points marked by a physician), and I need to implement both. For the Dice coefficient:
1- First of all, I need to know which sample is under consideration by the optimizer, so that I can read the content of that sample and compute Dice, while only y_true and y_pred are defined in custom loss functions according to the Keras documentation.
2- I wrote the following code as my loss function to 1) warp the first image, 2) make both images binary (each sample is composed of 2 images: a moving image and a fixed image), and 3) return the Dice coefficient between the pair (warped and fixed).
Since the parameters of a custom loss function are restricted to y_true and y_pred, there is no index for the sample under consideration, and my problem is unsupervised (i.e. there is no need for any label), I used the indices of the samples fed to the CNN as the labels, tried to use y_true[0] as the index of the training sample under consideration, and set the batch size to 1.
import numpy as np
import scipy.ndimage as ndi
from scipy.spatial import distance as dis

def my_loss_f(y_true, y_pred):
    a = y_true[0]
    nimg1 = warping(Train_DataCT[a], y_pred)  # line 68 in CNN1.py
    return dis.dice(BW(nimg1).flatten(), BW(Train_DataMR[a]).flatten())

def warping(nimg, x):
    nimg1 = ndi.rotate(nimg, x[0], reshape=False)
    nimg1 = ndi.shift(nimg1, [x[1], x[2]])
    nimg1 = clipped_zoom(nimg1, [x[3], x[4]])  # clipped_zoom is a helper defined elsewhere
    return nimg1

def BW(nimg1):
    # binarize around a threshold derived from the histogram's center of mass
    hist = ndi.histogram(nimg1, 0, 255, 255)
    som = ndi.center_of_mass(hist)
    bwnimg = np.where(nimg1 > som, 1, 0)
    return bwnimg
But I constantly get different errors, such as the one below. Someone told me to use TensorFlow or Keras backend functions to rewrite my loss function, but I need NumPy and SciPy and cannot dive into that kind of low-level programming, as my time to complete the project is very limited.
The main problem is that y_true is empty (it is just a placeholder, not a real variable with a value) and cannot be used as the index in Train_DataCT[y_true[0]]; the error says the index should be an integer, slice, Boolean, and so on, and a tensor cannot be used as an index! I tried a number of ways, e.g. converting y_true to an ndarray or using y_true.eval() to initialize it, but instead I got the error: Session error, no default session.
Thanks in advance; please, someone help me.
Traceback (most recent call last):
File "D:/Python/Reg/Deep/CNN1.py", line 83, in <module>
model.compile(optimizer='rmsprop',loss=my_loss_f)
File "C:\Users\Hamidreza\Anaconda3\lib\site-packages\keras\engine\training.py", line 342, in compile
sample_weight, mask)
File "C:\Users\Hamidreza\Anaconda3\lib\site-packages\keras\engine\training_utils.py", line 404, in weighted
score_array = fn(y_true, y_pred)
File "D:/Python/Reg/Deep/CNN1.py", line 68, in my_loss_f
nimg1=warping(Train_DataCT[1],y_pred)
File "D:/Python/Reg/Deep/CNN1.py", line 55, in warping
nimg1 = ndi.rotate(nimg, x[0], reshape=False)
File "C:\Users\Hamidreza\Anaconda3\lib\site-packages\scipy\ndimage\interpolation.py", line 703, in rotate
m11 = math.cos(angle)
TypeError: must be real number, not Tensor
Process finished with exit code 1
Your loss function should work on the tensor type of your backend. If you're using Keras with the TensorFlow backend, the following function might help with combining advanced NumPy/SciPy functions and tensors:
https://www.tensorflow.org/api_docs/python/tf/numpy_function?version=stable
You can also find a lot more useful material on this in the following:
How to make a custom activation function with only Python in Tensorflow?
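As a minimal, hedged sketch of that approach (dice_numpy below is a toy stand-in for the real Dice computation, not the asker's actual function):

import numpy as np
import tensorflow as tf

def dice_numpy(y_true, y_pred):
    # Plain NumPy (or SciPy) is fine in here; this toy version just thresholds
    # both arrays and computes a Dice-style overlap between them
    a = (y_true > 0.5).astype(np.float32)
    b = (y_pred > 0.5).astype(np.float32)
    inter = (a * b).sum()
    return np.float32(1.0 - 2.0 * inter / (a.sum() + b.sum() + 1e-7))

def my_loss_f(y_true, y_pred):
    # tf.numpy_function calls dice_numpy with concrete ndarray values at run time,
    # so y_true and y_pred are no longer empty placeholders inside it
    loss = tf.numpy_function(dice_numpy, [y_true, y_pred], tf.float32)
    loss.set_shape(())  # numpy_function discards static shape information
    return loss

One important caveat: TensorFlow cannot backpropagate through tf.numpy_function, so a loss built this way still needs a custom gradient before it can actually drive training.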
Let me refine my question: I need the inputted sample data to calculate the loss function. With or without batching, I need to know the index of the sample under consideration by the CNN in order to compute the loss, e.g. the Dice coefficient between a pair of inputted images.
Since my problem is unsupervised learning, as an alternative solution I used y_true as the index of the sample, but when, e.g. after tf.flatten, I use y_true[0] as in Train_DataCT[y_true[0]], I get the error: The index cannot be a tensor!
How can I use .run() or .eval() in a custom loss function so that y_true gets a value that I can then convert to, e.g., an ndarray?

Python, ValueError, BroadCast Error with SKLearn Preproccesing

I am trying to run the scikit-learn preprocessing StandardScaler and I receive the following error:
from sklearn import preprocessing as pre
scaler = pre.StandardScaler().fit(t_train)
t_train_scale = scaler.transform(t_train)
t_test_scale = scaler.transform(t_test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-149-c0133b7e399b> in <module>()
4 scaler = pre.StandardScaler().fit(t_train)
5 t_train_scale = scaler.transform(t_train)
----> 6 t_test_scale = scaler.transform(t_test)
C:\Users\****\Anaconda\lib\site-packages\sklearn\preprocessing\data.pyc in transform(self, X, y, copy)
356 else:
357 if self.with_mean:
--> 358 X -= self.mean_
359 if self.with_std:
360 X /= self.std_
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I understand the shapes do not match. The train and test data sets are different lengths, so how would I transform the data?
Please print the output of t_train.shape[1] and t_test.shape[1].
StandardScaler expects any two datasets to have the same number of columns. I suspect earlier pre-processing (dropping columns, adding dummy columns, etc) is the source of your problem. Whatever transformations you make to the t_train also need to be made to t_test.
The error is telling you the information that I'm asking for:
ValueError: operands could not be broadcast together with shapes (40000,59) (119,) (40000,59)
I expect you'll find that t_train.shape[1] is 119 and t_test.shape[1] is 59 (the (119,) operand in the error is the mean_ vector learned from t_train).
So you have 119 columns in your training dataset and 59 in your test dataset.
Did you drop any columns from one of the sets prior to attempting to use StandardScaler?
What do you mean by "train and test data set are different lengths"? How did you obtain your training data?
If your testing data has a different number of features than your training data, then in order to reduce the dimensionality of your testing data efficiently, you should know how your training data was formulated, for example via a dimensionality reduction technique (PCA, SVD, etc.). If that is the case, you have to multiply each testing vector by the same matrix that was used to reduce the dimensionality of your training data.
The time series was in a format with time as the columns and data in the rows. I did the following before the code originally posted:
t_train.transpose()
t_test.transpose()
Just a reminder: I had to run the cell twice before the change 'stuck', for some reason...
t_train's shape is (x, 119), whereas t_test's shape is (40000, 59).
If you want to use the same scaler object for the transformation, then your data must always have the same number of columns.
Since you fit the scaler on t_train, that is why you get the error when you try to transform t_test.
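A small sketch of the invariant all these answers point at, using made-up shapes: the two sets may have different numbers of rows, but both must go through identical preprocessing so their column counts match the fitted scaler:

import numpy as np
from sklearn import preprocessing as pre

rng = np.random.default_rng(0)
t_train = rng.standard_normal((40000, 119))  # hypothetical: 119 features after preprocessing
t_test = rng.standard_normal((10000, 119))   # different row count is fine; columns must match

scaler = pre.StandardScaler().fit(t_train)   # learns a per-column mean_ and scale_
t_train_scale = scaler.transform(t_train)    # OK: 119 columns
t_test_scale = scaler.transform(t_test)      # also OK: same 119 columns, any number of rows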

Scikit-learn RandomForestClassifier output of predict_proba

I have a dataset that I split in two for training and testing a random forest classifier with scikit learn.
I have 87 classes and 344 samples. The output of predict_proba is, most of the time, a 3-dimensional array of shape (87, 344, 2) (it's actually a list of 87 numpy.ndarrays, each with shape (344, 2)).
Sometimes, when I pick a different subset of samples for training and testing, I get only a 2-dimensional array of shape (87, 344) (though I can't work out in which cases).
My two questions are:
what do these dimensions represent? I worked out that to get a ROC AUC score, I have to take one half of the output (that is, (87, 344, 2)[:, :, 1]), transpose it, and then compare it with my ground truth (essentially roc_auc_score(ground_truth, output_of_predict_proba[:, :, 1].T)). But I don't understand what it really means.
why does the output change with different subsets of the data? I can't work out in which cases it returns a 3D array and in which cases a 2D one.
classifier.predict_proba() returns the class probabilities. The n dimension of the array will vary depending on how many classes there are in the subset you train on.
Are you sure the arrays you're using to fit the RF have the right shape? That is, (n_samples, n_features) for the data and (n_samples,) for the target classes.
You should get an array Y_pred of shape (n_samples, n_classes), so (344, 87) in your case, where item i of row r is the predicted probability of class i for sample X[r, :]. Note that sum(Y_pred[r, :]) = 1.
However, if your target array Y has shape (n_samples, n_classes), where each row is all zeros except for a one corresponding to the class of the sample, then sklearn treats it as a multi-output prediction problem (it considers each class independently), and I don't think that's what you want to do. In that case, for each class and each sample, you would predict the probability of belonging to that class or not.
Finally, the output does indeed depend on the training set, because it depends on the number of classes in the training set. You can get that number from the fitted model's n_classes_ attribute, and the classes' values from the classes_ attribute. See the documentation.
Hope it helps!
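A short sketch of both cases, with made-up data shaped like the question's (344 samples, 87 classes); the labels are arranged so every class appears:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((344, 10))
y = np.arange(344) % 87                       # 1D labels: the usual single-output setup

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict_proba(X).shape)             # (344, 87): one 2D array

# One-hot targets make sklearn treat it as 87 independent binary problems
Y_onehot = np.eye(87)[y]
clf_mo = RandomForestClassifier(random_state=0).fit(X, Y_onehot)
proba_mo = clf_mo.predict_proba(X)
print(len(proba_mo), proba_mo[0].shape)       # 87 arrays, each (344, 2)

In the multi-output case, proba_mo[i][:, 1] is the probability that class i is "present" for each sample, which is exactly what the [:, :, 1] indexing in the question extracts.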

sklearn Ridge and sample_weight gives Memory Error

I'm trying to run a simple sklearn Ridge regression using an array of sample weights.
X_train is a ~200k-by-100 2D NumPy array. I get a MemoryError when I try to use the sample_weight option. It works just fine without that option. For the sake of simplicity I reduced the features to 2, and sklearn still throws a MemoryError.
Any ideas?
model = linear_model.Ridge()
model.fit(X_train, y_train, sample_weight=w_tr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 449, in fit
return super(Ridge, self).fit(X, y, sample_weight=sample_weight)
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 338, in fit
solver=self.solver)
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 286, in ridge_regression
K = safe_sparse_dot(X, X.T, dense_output=True)
File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot
return np.dot(a, b)
MemoryError
Setting sample weights can cause big differences in the way the sklearn linear_model Ridge object processes your data, especially when the matrix is tall (n_samples > n_features), as in your case. Without sample weights it exploits the fact that X.T.dot(X) is a relatively small matrix (100x100 in your case) and thus inverts a matrix in feature space. With sample weights, the Ridge object decides to stay in sample space (in order to be able to weight the samples individually; see the relevant lines here and here for the branching to _solve_dense_cholesky_kernel, which works in sample space) and thus needs to invert a matrix of the same size as X.dot(X.T) (which in your case is n_samples x n_samples = 200000 x 200000 and causes a memory error before it is even created). This is actually an implementation issue; please see the manual workaround below.
TL;DR: The Ridge object is unable to treat sample weights in feature space and will generate an n_samples x n_samples matrix, which causes your memory error.
While waiting for a possible remedy within scikit-learn, you could try to solve the problem in feature space explicitly, like so:
import numpy as np

alpha = 1.  # you did not specify this in your Ridge object, but it is the default penalty
sample_weights = w_tr.ravel()  # make sure this is 1D
target = y.ravel()             # make sure this is 1D as well
n_samples, n_features = X.shape
# Weighted ridge normal equations: coef = inv(X^T W X + alpha * I) . (X^T W y)
coef = np.linalg.inv((X.T * sample_weights).dot(X) +
                     alpha * np.eye(n_features)).dot((X.T * sample_weights).dot(target))
For a new sample X_new, your prediction would be
prediction = np.dot(X_new, coef)
In order to confirm the validity of this approach, you can compare these coef to model.coef_ (after you have fit the model) from your code when applying it to a smaller number of samples (e.g. 300) that does not cause the memory error when used with the Ridge object.
IMPORTANT: The code above only coincides with the sklearn implementation if your data is already centered, i.e. your data must have mean 0. Implementing a full ridge regression with intercept fitting here would amount to a contribution to scikit-learn, so it would be better to post it there. The way to center your data is as follows:
X_mean = X.mean(axis=0)
target_mean = target.mean() # Assuming target is 1d as forced above
You then use the provided code on
X_centered = X - X_mean
target_centered = target - target_mean
For predictions on new data, you need
prediction = np.dot(X_new - X_mean, coef) + target_mean
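To make the suggested comparison concrete, here is a small, hedged check on random data (fit_intercept=False is used to sidestep the centering question entirely):

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
X_small = rng.standard_normal((300, 5))
y_small = rng.standard_normal(300)
w_small = rng.random(300)
alpha = 1.0

# Feature-space solution: inv(X^T W X + alpha * I) . (X^T W y)
coef = np.linalg.inv((X_small.T * w_small).dot(X_small) +
                     alpha * np.eye(X_small.shape[1])).dot(
                         (X_small.T * w_small).dot(y_small))

model = linear_model.Ridge(alpha=alpha, fit_intercept=False)
model.fit(X_small, y_small, sample_weight=w_small)
print(np.allclose(coef, model.coef_))  # should print True up to numerical tolerance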
EDIT: As of April 15th, 2014, scikit-learn's ridge regression can deal with this problem (bleeding-edge code). It will be available in the 0.15 release.
What NumPy version do you have installed?
It looks like the method call that ultimately does it is numpy.dot(X, X.T), which in your case (X.shape = (200000, 2)) would generate a 200k-by-200k matrix.
Try converting your observations to a sparse matrix type, or reduce the number of observations you use (there may be a variant of ridge regression that uses a few observations one batch at a time?).
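One such batch-wise variant, sketched under the assumption that an approximate solution is acceptable: SGDRegressor with an l2 penalty is essentially ridge regression fit incrementally, its partial_fit accepts per-batch sample weights, and it never forms an n_samples x n_samples matrix (note that its alpha is not directly comparable to Ridge's):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((200000, 100))  # stand-in for the question's X_train
y = rng.standard_normal(200000)
w = rng.random(200000)

sgd = SGDRegressor(penalty="l2", alpha=1.0)
batch = 10000
for start in range(0, X.shape[0], batch):
    sl = slice(start, start + batch)
    # each call sees only one batch, so memory use stays bounded
    sgd.partial_fit(X[sl], y[sl], sample_weight=w[sl])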
