scikit-learn PCA for image dataset - python

I am trying to perform PCA on an image dataset with 100,000 images, each of size 224x224x3.
I was hoping to project the images into a space of dimension 1000 (or somewhere around that).
I am doing this on my laptop (16gb ram, i7, no GPU) and already set svd_solver='randomized'.
However, fitting takes forever. Are the dataset and the image dimensions just too large, or is there some trick I could be using?
Thanks!
Edit:
This is the code:
pca = PCA(n_components=1000, svd_solver='randomized')
pca.fit(X)
Z = pca.transform(X)
X is a 100000 x 150528 matrix whose rows are the flattened images.

You should really reconsider your choice of dimensionality reduction if you think you need 1000 principal components. With that many components you no longer have interpretability, so you might as well use other, more flexible dimensionality reduction algorithms (e.g. variational autoencoders, t-SNE, kernel PCA). A key benefit of PCA is the interpretability of the principal components.
If you have a video stream of the same place, then you should be fine with fewer than 10 components (though principal component pursuit might be better). Moreover, if your image dataset does not consist of similar-ish images, then PCA is probably not the right choice.
Also, for images, nonnegative matrix factorisation (NMF) might be better suited. For NMF, you can perform stochastic gradient optimisation, subsampling both pixels and images for each gradient step.
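If you want to try that route without writing the stochastic optimiser yourself, here is a minimal sketch using scikit-learn's MiniBatchNMF (available in scikit-learn 1.1+), which subsamples images per update rather than the exact pixel-and-image scheme described above; the data here is a random nonnegative stand-in for your flattened images:
import numpy as np
from sklearn.decomposition import MiniBatchNMF

rng = np.random.default_rng(0)
X = rng.random((2000, 64 * 64))          # stand-in for nonnegative, flattened images

nmf = MiniBatchNMF(n_components=50, batch_size=256, random_state=0)
# Update the factorisation on chunks of images so the full matrix is never decomposed at once.
for start in range(0, X.shape[0], 256):
    nmf.partial_fit(X[start:start + 256])
Z = nmf.transform(X)                     # (n_images, 50) nonnegative codes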
However, if you still insist on performing PCA, then I think the randomised solver provided by Facebook is the best shot you have. Run pip install fbpca and then run the following code:
from fbpca import pca
# load data into X
U, s, Vh = pca(X, 1000)
It's not possible to get faster than that without utilising some matrix structure, e.g. sparsity or block composition (which your dataset is unlikely to have).
Also, if you need help picking the correct number of principal components, I recommend using this code:
import numpy as np
import fbpca
from bisect import bisect_left

def compute_explained_variance(singular_values):
    return np.cumsum(singular_values**2) / np.sum(singular_values**2)

def ideal_number_components(X, wanted_explained_variance):
    singular_values = fbpca.svd(X, compute_uv=False)  # This line is a bottleneck.
    explained_variance = compute_explained_variance(singular_values)
    return bisect_left(explained_variance, wanted_explained_variance)

def auto_pca(X, wanted_explained_variance):
    num_components = ideal_number_components(X, wanted_explained_variance)
    return fbpca.pca(X, num_components)  # This line is a bottleneck if the number of components is high.
Of course, the above code doesn't support cross validation, which you really should use to choose the correct number of components.
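If you do want a cross-validated choice, one option is scikit-learn's PCA itself: its score() method returns the average log-likelihood under a probabilistic PCA model, so GridSearchCV can compare candidate n_components values. A minimal sketch on a small random stand-in (scanning values up to 1000 on a 100,000 x 150,528 matrix this way would be very slow):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_small = rng.normal(size=(500, 256))    # stand-in for (a subsample of) your data

# With no explicit scoring, GridSearchCV falls back to PCA.score (probabilistic PCA log-likelihood).
search = GridSearchCV(PCA(svd_solver='randomized', random_state=0),
                      param_grid={'n_components': [10, 25, 50, 100]},
                      cv=3)
search.fit(X_small)
print(search.best_params_)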

You can try to set
svd_solver="randomized"
The training should be much faster.
You could also try to use:
from sklearn.decomposition import FastICA
which is more scalable.
A last-resort solution could be to convert your images to black & white (grayscale) to reduce the dimensionality by a factor of 3; this might be a good step if your task is not color-sensitive (for instance optical character recognition).
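A minimal sketch of that grayscale step with Pillow and NumPy (the file path is just a placeholder):
import numpy as np
from PIL import Image

def load_grayscale(path):
    # 'L' = 8-bit grayscale, which drops the 3 colour channels.
    img = Image.open(path).convert('L')
    # 224*224 = 50176 values per image instead of 224*224*3 = 150528.
    return np.asarray(img, dtype=np.float32).ravel()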

Try experimenting with the iterated_power parameter of PCA.
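For the randomized solver, iterated_power sets the number of power iterations (the default 'auto' usually picks several); lowering it trades a little accuracy of the projection for speed. A minimal sketch with an illustrative value of 2:
from sklearn.decomposition import PCA

# Fewer power iterations: faster randomized fit, slightly less accurate components.
pca = PCA(n_components=1000, svd_solver='randomized', iterated_power=2, random_state=0)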

Related

Why does 'kd_tree' take more time than 'brute'?

I am benchmarking knn with sklearn. Here is sys info.
sys info
Intel(R) Xeon(R) L5640 (6 cores 12 siblings);
Ubuntu 18.04, Python 3.7.3, numpy 1.16.4, sklearn 0.21.2;
There are no other jobs/tasks occupying the CPU cores.
dataset
The benchmark runs on sklearn's MNIST digits dataset (load_digits), which has 1797 samples, 10 classes, 8*8 = 64 dimensionality, and integer features in the range 0-16 (17 distinct values).
Each image is 8*8 pixels, and each pixel value ranges from 0 to 16.
code
here is the code.
snippet_1:
from sklearn.neighbors import KNeighborsClassifier
# trainData/trainLabels and testData come from a train/test split of the digits data.
n_neighbors = 5; n_jobs = 1; algorithm = 'brute'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm=algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.1s
snippet_2:
n_neighbors=5; n_jobs=1; algorithm = 'kd_tree'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm = algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.2s
I repeated the benchmark multiple times; no matter which one I run first, snippet_1 is always about 2 times faster than snippet_2.
question
Why does 'kd_tree' take more time than 'brute'?
I know "curse of dimensionality", since the doc says it clearly, what I am asking is why is that?
The answer seems to be related to the dimensionality of your data, i.e. the curse of dimensionality. KD-trees scale very poorly above roughly 15-20 dimensions (close to exponentially), whereas brute force follows a more linear-like pattern. When run on GPUs, brute force can indeed be faster for higher dimensions. Another researcher found a similar problem: Comparison search time between K-D tree and Brute-force
In general, KD-Tree will be slower than brute force if N < 2**k, where k is the number of dimensions (in this case 8 * 8 = 64) and N is the number of samples. In this case 2**64 = 1.8E19 >> 1797, so KDTree is far slower.
Basically, a KDTree does binary splits of the data along each dimension as a first step. If it has enough data to do that, it can guess the closest neighbors by the number of splits in common they have. If N < 2**k, it runs out of data before it runs out of dimensions to split. It then has no distance information about the other dimensions. With no good guess, it still has to brute force the rest of the dimensions, making the KDTree unnecessary overhead.
A more in-depth discussion of the issues and possible solutions can be found here. For this application, the third answer suggesting using PCA first to reduce your dimensionality is probably your best bet.
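A minimal sketch of that suggestion on the same digits data (8 components is just an illustrative choice, picked so that N > 2**k holds):
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
trainData, testData, trainLabels, testLabels = train_test_split(X, y, random_state=0)

# Project the 64 pixel dimensions down to 8 before building the KD-tree.
model = make_pipeline(PCA(n_components=8),
                      KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree'))
model.fit(trainData, trainLabels)
print(model.score(testData, testLabels))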

Applying machine learning algorithms on Google's Quickdraw dataset

I'm trying to apply the machine learning algorithms available in Python's scikit-learn package to predict doodle names from a set of doodle images.
Since I'm a complete beginner in machine learning and have no knowledge of how neural networks work yet, I wanted to try scikit-learn's algorithms.
I've downloaded doodles (of cats and guitars) with the help of an API named quickdraw.
Then I load the images with the following code:
import numpy as np
from PIL import Image
import random
#To hold image arrays
images = []
#0-cat, 1-guitar
target = []
#5000 images of cats and guitar each
for i in range(5000):
    # cat images are named like cat0.png, cat1.png ...
    img = Image.open('data/cats/cat'+str(i)+'.png')
    img = np.array(img)
    img = img.flatten()
    images.append(img)
    target.append(0)
    # guitar images are named like guitar0.png, guitar1.png ...
    img = Image.open('data/guitars/guitar'+str(i)+'.png')
    img = np.array(img)
    img = img.flatten()
    images.append(img)
    target.append(1)
random.shuffle(images)
random.shuffle(target)
Then I applied the algorithm:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(images,target,test_size=0.2, random_state=0)
from sklearn.naive_bayes import GaussianNB
GB = GaussianNB()
GB.fit(X_train,y_train)
print(GB.score(X_test,y_test))
Upon running the above code (with other algorithms like SVM and MLP too), my system just freezes. I have to do a force shutdown to get back. I'm not sure why this is happening.
I have tried lowering the number of images to load by changing
for i in range(5000):
to
for i in range(1000):
But then I only get an accuracy of around 50%.
First of all, if I may say so:
"Since I'm a complete beginner in machine learning and have no knowledge of how neural networks work yet, I wanted to try scikit-learn's algorithms."
This is not a good way to approach ML in general; I strongly suggest you at least start studying the basics, otherwise you won't be able to tell what's going on at all (it's not something you can figure out just by trying).
Back to your problem: applying Naive Bayes methods to raw images is not a good strategy. The problem is that each pixel of your image is a feature, and with images you can easily end up with a very high number of dimensions (also, assuming each pixel is independent of its neighbors is not what you want).
NB is commonly used with documents, and looking at this example on Wikipedia might help you understand the algorithm a bit more.
In short, NB boils down to computing joint conditional probabilities, which boils down to counting co-occurrences of features (words in Wikipedia's example, pixels in your case), which in turn boils down to computing a huge matrix of occurrences that you need to formulate your NB model.
Now, if your matrix is made of all the words in a set of documents, this can get pretty expensive in both time and space (roughly n^2/2, with n being the number of features); instead, imagine the matrix being composed of ALL the pixels in your training set, as in your example... this explodes really fast.
That's why cutting your dataset down to 1000 images keeps your PC from running out of memory.
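To put rough numbers on that growth (a back-of-the-envelope sketch; the image sizes are hypothetical, since the size of the downloaded doodles isn't stated here):
# Treat every pixel as a feature and count roughly n^2/2 feature pairs.
for side in (28, 128, 256):
    n_features = side * side
    print(side, n_features, n_features * n_features // 2)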
Hope it helps.

Parameter estimation for linear One Class SVM training via libsvm for n-grams

I know there are multiple questions about this, but not a single one about my particular problem.
I'll simplify my problem in order to make it clearer.
Let's say I have multiple sentences from an English document and I want to classify them using a one-class SVM (in libsvm) in order to be able to spot anomalies (e.g. a German sentence) afterwards.
For training, I have samples of one class only (let's assume other classes do not exist beforehand). I extract all 3-grams (so the feature space includes at most 16,777,216 different features) and save them in libsvm format (label=1, just in case that matters).
Now I want to estimate my parameters. I tried to use grid.py with additional parameters; however, the runtime is too long for RBF kernels. So I am trying linear kernels instead (grid.py may therefore be changed to use only one value of gamma, as it does not matter for linear kernels).
In any case, the smallest c that grid.py tests is shown as the best solution (does -c matter for linear kernels?).
Furthermore, no matter how much I change the -n (nu) value, the same relation between the scores is achieved every time (even though the number of support vectors changes). The scores are gathered using the Python implementation. ("Relation between scores" means that, e.g., at first they are -1 and -2; I change nu and afterwards they are e.g. -0.5 and -1, so if I sort them, the same order always appears, as in this example):
# python2
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from svmutil import *
y,x = svm_read_problem("/tmp/english-3-grams.libsvm") # 5000 sentence samples
ym,xm = svm_read_problem("/tmp/german-3-grams.libsvm") # 50 sentence samples
m = svm_train(y,x,"-s 2 -t 2 -n 0.5");
# do the prediction in one or two steps, here is one step:
p_l, p_a, p_v = svm_predict(y[:100]+ym[:100],x[:100]+xm[:100],m)
# p_v are our scores.
# let's plot a roc curve
roc_ret = roc_curve([1]*100+[-1]*100,p_v)
plt.plot(roc_ret[0],roc_ret[1])
plt.show()
Here, the exact same ROC curve is achieved every time (even though -n is varied). Even if there is only 1 support vector, the same curve is shown.
Hence, my question (let's assume a maximum of 50000 samples per training):
- Why is -n not changing anything for the one-class training process?
- What parameters do I need to change for a one-class SVM?
- Is a linear kernel the best approach (also with regard to runtime)? An RBF kernel parameter grid search takes ages for such big datasets.
- liblinear is not being used because I want to do anomaly detection (one-class SVM).
Best regards,
mutilis
The performance impact is a result of your huge feature space of 16,777,216 elements. This results in very sparse vectors for inputs like German sentences.
A study by Yang & Pedersen, A Comparative Study on Feature Selection in Text Categorization, shows that aggressive feature selection does not necessarily decrease classification accuracy. I achieved similar results while performing text classification on (medical) German text documents.
As stated in the comments, LIBLINEAR is fast because it is built for such sparse data. However, you end up with a linear classifier, with all its pitfalls and benefits.
I would suggest the following strategy (a sketch of steps 1 and 2 follows the list):
Perform aggressive feature selection (e.g. with Information Gain), keeping a remaining feature space of N features.
Increase N stepwise in combination with cross-validation and find the best matching N for your data.
Go for a grid search with the N found in step 2.
Train your classifier with the best matching parameters found in step 3 and the N found in step 2.
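A minimal sketch of steps 1 and 2, using TfidfVectorizer's max_features (plain document frequency) as a crude stand-in for Information Gain, since IG needs labelled negatives that a one-class setup does not have; the corpus below is a placeholder for your English sentences:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM

english_sentences = ["this is training sentence number %d in plain english" % i for i in range(200)]
train_txt, val_txt = train_test_split(english_sentences, test_size=0.2, random_state=0)

for n_features in (1000, 5000, 20000):
    vec = TfidfVectorizer(analyzer='char', ngram_range=(3, 3), max_features=n_features)
    X_train = vec.fit_transform(train_txt)
    X_val = vec.transform(val_txt)
    clf = OneClassSVM(kernel='linear', nu=0.1).fit(X_train)
    # Fraction of held-out in-class sentences accepted as inliers (+1).
    print(n_features, (clf.predict(X_val) == 1).mean())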

Can sklearn SVM be performed using numpy 16bit float in python? If not, is there an alternate package

I am trying to perform image segmentation using machine learning (SVM in particular). I am segmenting MRIs and the original images are 512x512x100. I have created 78 features per image. At that image size and number of features I quickly run out of memory.
To resolve the memory issue I have done a couple of things. 1) I downsampled the images to 256x256x50. 2) I reduced the precision to 16-bit float, since the original images are 16-bit and I didn't believe more precise data than that was necessary. (Maybe I'm wrong here.)
So I was able to reduce my data to an amount (6 GB) that can be held in memory. That lasted until I went to actually use the SVM function in sklearn, and my computer quickly started using swap memory as it had run out of RAM (16 GB). I went searching a bit and found in the sklearn docs (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) that "If X and y are not C-ordered and contiguous arrays of np.float64 and X is not a scipy.sparse.csr_matrix, X and/or y may be copied." This, along with other posts on GitHub, made me realize the data was being cast up to float64 and therefore taking up all my memory, as going from 16 to 64 bits would, from what I gather, increase the RAM needed from 6 to 24 GB... which goes beyond what I have available.
Here is a simple example of the code. features is a numpy array of 39,321,600 (256*256*50*12 [training images]) by 78 (the features), and segmentations is 39,321,600 by 1 with values between 0 and 6 for the various regions of interest.
from sklearn import svm
clf = svm.SVC()
clf.fit(features, segmentations)
Above is the only code that is relevant at this point as I haven't gotten past the training portion.
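As a rough sanity check on that arithmetic (just multiplying the array shape above by bytes per value):
rows, cols = 39321600, 78            # shape of the features array above
print(rows * cols * 2 / 1e9)         # ~6.1 GB if stored as float16 (2 bytes per value)
print(rows * cols * 8 / 1e9)         # ~24.5 GB if copied to float64 (8 bytes per value)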
Any help with either training a dataset of this size using SVM and sklearn, or any other options would be greatly appreciated.
Thanks.
Anthony.
PS. I have performed subsampling of the data as an option, though this is not ideal as I would like to use the whole image. If this is my best bet, I guess I will pursue it.

Improving SVC prediction performance on single samples

I have large-ish SVC models (~50Mb cPickles) for text classification and I am trying out various ways to use them in a production environment. Classifying batches of documents works very well (about 1k documents per minute using both predict and predict_proba).
However, prediction on a single document is another story, as explained in a comment to this question:
Are you doing predictions in batches? The SVC.predict method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – larsmans
I am already serving the SVC models as Flask web applications, so part of the overhead (unpickling) is gone, but the prediction times for single docs are still on the high side (0.25s).
I have looked at the code in the predict methods but cannot figure out if there is a way to "pre-warm" them, reconstructing the LibSVM data structure in advance at server startup... any ideas?
def predict(self, X):
    """Perform classification on samples in X.

    For an one-class model, +1 or -1 is returned.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Returns
    -------
    y_pred : array, shape = [n_samples]
        Class labels for samples in X.
    """
    y = super(BaseSVC, self).predict(X)
    return self.classes_.take(y.astype(np.int))
I can see three possible solutions.
Custom server
It is not a matter of "warming" anything up. Simply put, libSVM is a C library, and you need to pack/unpack data into the correct format. This process is more efficient on whole matrices than on each row separately. The only way to overcome this would be to write a more efficient wrapper between your production environment and libSVM (you could write a libsvm-based server which uses some kind of shared memory with your service). Unfortunately, this is too custom a problem to be solvable by existing implementations.
Batches
A naive approach like buffering the queries is an option (if it is a "high performance" system with thousands of queries, you can simply store them in N-element batches and send them to libSVM in such packs).
Own classification
Lastly, classification using an SVM is a really simple task. You don't need libSVM to perform classification; only training is a complex problem. Once you have all the support vectors (SV_i), the kernel (K), the Lagrange multipliers (alpha_i) and the intercept term (b), you classify using:
cl(x) = sgn( SUM_i y_i alpha_i K(SV_i, x) + b)
You can code this operation directly in your app, without the need to actually pack/unpack/send anything to libsvm. This can speed things up by an order of magnitude. Obviously, probability is more complex to retrieve, as it requires Platt scaling, but it is still possible.
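A minimal sketch of that formula using the attributes a fitted scikit-learn SVC already exposes (dual_coef_ stores y_i * alpha_i, support_vectors_ stores SV_i); shown for a binary RBF model on toy data, not your pickled models:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(kernel='rbf', gamma=0.1).fit(X, y)

def manual_decision(x):
    # K(SV_i, x) for the RBF kernel, then SUM_i y_i alpha_i K(SV_i, x) + b.
    k = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_ @ k + clf.intercept_

x = X[0]
print(manual_decision(x), clf.decision_function([x]))   # should agree up to floating-point error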
You can't construct the LibSVM data structure in advance. When a request to classify a document arrives, you get the text of the document, make a vector out of it and only then convert it to LibSVM format so you can get a decision.
LinearSVC should be considerably faster than an SVC with a linear kernel, as it uses liblinear. You could try using a different classifier if that does not decrease performance too much.
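A minimal sketch of that swap in a text pipeline (toy data; LinearSVC has no predict_proba, so probabilities would need something like CalibratedClassifierCV on top):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting at noon", "win cash cheap", "project status update"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["cheap cash now"]))   # single-document prediction stays fast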
