I am benchmarking k-NN with scikit-learn. Here is the system info.
sys info
Intel(R) Xeon(R) L5640 (6 cores 12 siblings);
Ubuntu 18.04, Python 3.7.3, numpy 1.16.4, sklearn 0.21.2;
No other jobs/tasks are occupying the CPU cores.
dataset
The benchmark runs on scikit-learn's digits dataset (the 8*8 "MNIST-like" set): 1797 samples, 10 classes, 64 dimensions (8*8 pixels), with integer pixel values from 0 to 16.
Each square in this sample image stands for one pixel (8*8 = 64 pixels in total), and each pixel value ranges from 0 to 16.
code
here is the code.
snippet_1:
from sklearn.neighbors import KNeighborsClassifier

n_neighbors = 5; n_jobs = 1; algorithm = 'brute'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm=algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.1s
snippet_2:
n_neighbors = 5; n_jobs = 1; algorithm = 'kd_tree'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm=algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.2s
I repeated the benchmark multiple times; no matter which one I ran first, snippet_1 was always about 2 times faster than snippet_2.
question
Why does 'kd_tree' take more time than 'brute'?
I know "curse of dimensionality", since the doc says it clearly, what I am asking is why is that?
The answer seems to be related to the dimensionality of your data, also known as the curse of dimensionality. A KD-tree scales very poorly once the dimensionality goes above roughly 15-20 (close to exponentially), whereas brute force follows a more linear-like pattern. When run on GPUs, brute force can indeed be faster for higher dimensions. Here another researcher found a similar problem: Comparison search time between K-D tree and Brute-force
In general, KD-Tree will be slower than brute force if N < 2**k, where k is the number of dimensions (in this case 8 * 8 = 64) and N is the number of samples. In this case 2**64 = 1.8E19 >> 1797, so KDTree is far slower.
Basically, a KDTree does binary splits of the data along each dimension as a first step. If it has enough data to do that, it can guess the closest neighbors by the number of splits in common they have. If N < 2**k, it runs out of data before it runs out of dimensions to split. It then has no distance information about the other dimensions. With no good guess, it still has to brute force the rest of the dimensions, making the KDTree unnecessary overhead.
A more in-depth discussion of the issues and possible solutions can be found here. For this application, the third answer, which suggests using PCA first to reduce your dimensionality, is probably your best bet.
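For illustration, here is a minimal sketch of that suggestion on the digits dataset; the number of PCA components and the train/test split are illustrative assumptions, not taken from the original post.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce the 64 dimensions to a handful so the KD-tree has a chance to prune.
pca = PCA(n_components=10).fit(X_train)
X_train_r, X_test_r = pca.transform(X_train), pca.transform(X_test)

model = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
model.fit(X_train_r, y_train)
print(model.score(X_test_r, y_test))
With only 10 dimensions, N = 1797 is comfortably larger than 2**10 = 1024, so the N < 2**k argument no longer applies.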
Related
I am using LinearDiscriminantAnalysis of scikit-learn to perform class prediction on a dataset. The problem seems to be that the performance of LDA is not consistent. In general, when I increase the number of features that I use for training, I expect the performance to increase as well. Same thing with the number of samples that I use for training, the more samples the better the performance of LDA.
However, in my case, there seems to be a sweet spot where LDA performs poorly, depending on the number of samples and features used for training. More precisely, LDA performs poorly when the number of features equals the number of samples. I do not think this is specific to my dataset. I am not sure exactly what the issue is, but I have extensive example code that can recreate these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 X 75 X 400 (trials X time X features); the trials represent the different samples. Each time, I shuffle the trial indices of the dataset, pick trials for the train set and (separately) for the test set, take the mean across time (the second axis), and feed the resulting (trials X features) matrix into the LDA to compute the score on the test set. The test set always contains 50 trials.
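For reference, here is a minimal sketch of the pipeline described above; the variable names are illustrative, and the actual code is in the notebook linked below.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
N_TEST = 50

def lda_score(data, labels, n_train_trials):
    # data: (trials, time, features); labels: (trials,)
    idx = rng.permutation(len(data))
    # disjoint as long as n_train_trials <= len(data) - N_TEST
    train_idx, test_idx = idx[:n_train_trials], idx[-N_TEST:]
    # Average over the time axis -> (trials, features)
    X_train, X_test = data[train_idx].mean(axis=1), data[test_idx].mean(axis=1)
    lda = LinearDiscriminantAnalysis().fit(X_train, labels[train_idx])
    return lda.score(X_test, labels[test_idx])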
A detailed jupyter notebook with comments and the data I use can be found here https://github.com/esigalas/LDA-test. In my environment I use
sklearn: 1.1.1,
numpy: 1.22.4.
I am not sure if there is an issue with LDA itself (that would be worth opening an issue on GitHub) or something wrong with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!
I am trying to perform PCA on an image dataset with 100,000 images, each of size 224x224x3.
I was hoping to project the images into a space of dimension 1000 (or somewhere around that).
I am doing this on my laptop (16 GB RAM, i7, no GPU) and have already set svd_solver='randomized'.
However, fitting takes forever. Are the dataset and the image dimension just too large, or is there some trick I could be using?
Thanks!
Edit:
This is the code:
pca = PCA(n_components=1000, svd_solver='randomized')
pca.fit(X)
Z = pca.transform(X)
X is a 100000 x 150528 matrix whose rows represent a flattened image.
You should really reconsider your choice of dimensionality reduction if you think you need 1000 principal components. With that many components you no longer have interpretability, so you might as well use other, more flexible dimensionality-reduction algorithms (e.g. variational autoencoders, t-SNE, kernel PCA). A key benefit of PCA is the interpretability of the principal components.
If you have a video stream of the same place, then you should be fine with <10 components (though principal component pursuit might be better). Moreover, if your image dataset does not consist of similar-ish images, PCA is probably not the right choice.
Also, for images, nonnegative matrix factorisation (NMF) might be better suited. For NMF, you can perform stochastic gradient optimisation, subsampling both pixels and images for each gradient step.
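Here is a rough sketch of the subsampling idea with scikit-learn's NMF, fitting the factorisation on a random subset of images and then transforming everything; the subset size and component count are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

# X: (n_images, n_pixels) array with nonnegative entries (pixel intensities)
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=5000, replace=False)

nmf = NMF(n_components=50, init='nndsvda', max_iter=200)
nmf.fit(X[subset])      # learn the components on a subsample of images
W = nmf.transform(X)    # project all images onto the learned components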
However, if you still insist on performing PCA, then I think that the randomised solver provided by Facebook is the best shot you have. Run pip install fbpca and run the following code
from fbpca import pca
# load data into X
U, s, Vh = pca(X, 1000)
It's not possible to get faster than that without utilising some matrix structure, e.g. sparsity or block composition (which your dataset is unlikely to have).
Also, if you need help picking the correct number of principal components, I recommend using this code:
import numpy as np
import fbpca
from bisect import bisect_left

def compute_explained_variance(singular_values):
    return np.cumsum(singular_values**2) / np.sum(singular_values**2)

def ideal_number_components(X, wanted_explained_variance):
    singular_values = fbpca.svd(X, compute_uv=False)  # This line is a bottleneck.
    explained_variance = compute_explained_variance(singular_values)
    return bisect_left(explained_variance, wanted_explained_variance)

def auto_pca(X, wanted_explained_variance):
    num_components = ideal_number_components(X, wanted_explained_variance)
    return fbpca.pca(X, num_components)  # This line is a bottleneck if the number of components is high.
Of course, the above code doesn't support cross validation, which you really should use to choose the correct number of components.
You can try to set
svd_solver="svd_solver"
The training should be much faster.
You could also try to use:
from sklearn.decomposition import FastICA
which is more scalable.
A last-resort solution could be to convert your images to grayscale, reducing the dimensionality by a factor of 3; this can be a good step if your task is not color-sensitive (for instance, optical character recognition).
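A minimal sketch of that conversion, assuming X stores flattened images in channel-last (224, 224, 3) order (an assumption about your memory layout):
import numpy as np

# X: (100000, 224*224*3) array of flattened RGB images
X_imgs = X.reshape(-1, 224, 224, 3)
# Average the three color channels -> grayscale, then flatten again
X_gray = X_imgs.mean(axis=-1).reshape(-1, 224 * 224)  # shape (100000, 50176)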
Try experimenting with the iterated_power parameter of PCA.
I want to cluster 3.5M 300-dimensional word2vec vectors from my custom gensim model to determine whether I can use those clusters to find topic-related words. It is not the same as model.most_similar_..., as I hope to attach quite distant, but still related, words.
The overall size of the model (after normalization of vectors, i.e. model.init_sims(replace=True)) in memory is 4GB:
words = sorted(model.wv.vocab.keys())
vectors = np.array([model.wv[w] for w in words])
sys.getsizeof(vectors)
4456416112
I tried both scikit-learn's DBSCAN and some other implementations from GitHub, but they seem to consume more and more RAM during processing and crash with std::bad_alloc after some time. I have 32 GB of RAM and 130 GB of swap.
The metric is Euclidean; since the vectors are unit-normalized, I convert my cosine threshold cos=0.48 into eps=sqrt(2-2*0.48), so all the optimizations should apply.
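For reference, a quick check of that conversion: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # unit-normalize, as init_sims does

cos = a @ b
print(np.isclose(np.linalg.norm(a - b), np.sqrt(2 - 2 * cos)))  # True
print(np.sqrt(2 - 2 * 0.48))  # the eps used above, ~1.02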
The problem is that I don't know the number of clusters and want to determine them by setting a threshold for closely related words (let it be cos<0.48, or d_l2 < sqrt(2-2*0.48)). DBSCAN seems to work on small subsets, but I can't get the computation through on the full data.
Is there any algorithm or workaround in Python which can help with that?
EDIT: With size(float) = 4 bytes, the full distance matrix would take 3.5M * 3.5M * 4 bytes ≈ 44.5 TB, so it is impossible to precompute it.
EDIT 2: I am currently trying ELKI, but cannot get it to cluster the data properly even on a toy subset.
I know there are multiple questions about this, but not a single one about my particular problem.
I'll simplify my problem in order to make it more clear.
Let's say I have multiple sentences from an English document and I want to classify them using a one-class SVM (in libsvm) in order to be able to spot anomalies (e.g. a German sentence) afterwards.
For training, I have samples of one class only (let's assume the other classes do not exist beforehand). I extract all 3-grams (so the feature space includes at most 16777216 different features) and save them in libsvm format (label=1, just in case that matters).
Now I want to estimate my parameters. I tried to use grid.py with additional parameters; however, the runtime is too long for RBF kernels. So I am trying linear kernels (grid.py can be changed to use only one value of gamma, since it does not matter for linear kernels).
Whatever I do, the smallest c that grid.py tests is shown as the best solution (does -c matter for linear kernels?).
Furthermore, no matter how much I change the -n (nu) value, the same relation between the scores is achieved every time (even though the number of support vectors changes). The scores are gathered using the Python implementation. ("Relation between scores" means that, e.g., at first they are -1 and -2; I change nu and afterwards they are e.g. -0.5 and -1, so if I sort them, the same order always appears, as in this example):
# python2
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from svmutil import *
y,x = svm_read_problem("/tmp/english-3-grams.libsvm") # 5000 sentence samples
ym,xm = svm_read_problem("/tmp/german-3-grams.libsvm") # 50 sentence samples
m = svm_train(y, x, "-s 2 -t 2 -n 0.5")  # -s 2: one-class SVM, -t 2: RBF kernel, -n: nu
# do the prediction in one or two steps, here is one step:
p_l, p_a, p_v = svm_predict(y[:100]+ym[:100],x[:100]+xm[:100],m)
# p_v are our scores.
# let's plot a roc curve
roc_ret = roc_curve([1]*100+[-1]*100,p_v)
plt.plot(roc_ret[0],roc_ret[1])
plt.show()
Here, the exact same ROC curve is obtained every time (even though -n is varied). Even if there is only 1 support vector, the same curve is shown.
Hence, my questions (let's assume a maximum of 50000 samples per training):
- Why does -n not change anything in the one-class training process?
- What parameters do I need to change for a one-class SVM?
- Is a linear kernel the best approach (also with regard to runtime)? An RBF-kernel parameter grid search takes ages for such big datasets.
- LIBLINEAR is not being used because I want to do anomaly detection = one-class SVM.
Best regards,
mutilis
The performance impact is a result of your huge feature space of 16777216 elements. This results in very sparse vectors for elements like German sentences.
A study by Yang & Pedersen, "A Comparative Study on Feature Selection in Text Categorization", shows that aggressive feature selection does not necessarily decrease classification accuracy. I achieved similar results while performing text classification on (medical) German text documents.
As stated in the comments, LIBLINEAR is fast because it is built for such sparse data. However, you end up with a linear classifier, with all its pitfalls and benefits.
I would suggest the following strategy (a sketch of steps 1-3 follows below the list):
Perform aggressive feature selection (e.g. with information gain) down to a remaining feature space of size N.
Increase N stepwise, in combination with cross-validation, to find the best matching N for your data.
Go for a grid search with the N found in step 2.
Train your classifier with the best matching parameters found in step 3 and the N found in step 2.
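Here is a rough, hypothetical sketch of steps 1-3. scikit-learn has no InformationGain selector, so plain 3-gram document frequency is used as a label-free stand-in, and a small labelled validation set (X_val, y_val: English = +1, German = -1) is assumed for scoring each (N, nu) combination; the candidate values are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

# X_train: sparse (n_sentences, 16777216) 3-gram count matrix of English sentences
# X_val, y_val: small labelled validation set used only for model selection

def top_n_features(X, n):
    # Label-free stand-in for information gain: keep the n most frequent 3-grams.
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
    return np.argsort(doc_freq)[-n:]

best = None
for n in (1000, 5000, 20000):                 # step 2: increase N stepwise
    cols = top_n_features(X_train, n)
    for nu in (0.01, 0.05, 0.1, 0.5):         # step 3: grid search over nu
        model = OneClassSVM(kernel='linear', nu=nu).fit(X_train[:, cols])
        accuracy = (model.predict(X_val[:, cols]) == y_val).mean()
        if best is None or accuracy > best[0]:
            best = (accuracy, n, nu)

print(best)  # (accuracy, N, nu) to use for the final training in step 4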
I have a billion feature vectors and I would like to put them into approximate clusters. Looking at the methods from http://scikit-learn.org/stable/modules/clustering.html#clustering, for example, it is not at all clear to me how their running times scale with the data size (except for Affinity Propagation, which is clearly too slow).
What methods are suitable for clustering such a large data set? I assume any method will have to run in O(n) time.
The K-means complexity sounds reasonable for your data (only 4 components). The tricky part is the initialization and the choice of the number of clusters. You can try different random initializations, but this can be time-consuming. An alternative is to sub-sample your data and run a more expensive clustering algorithm like Affinity Propagation, then use its solution as the init for k-means and run k-means on all your data (a sketch is below).
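A minimal sketch of that workflow with scikit-learn; the subsample size is an arbitrary assumption, and at this scale MiniBatchKMeans is used instead of plain KMeans for the full pass.
import numpy as np
from sklearn.cluster import AffinityPropagation, MiniBatchKMeans

# X: (n_samples, n_features) array of feature vectors
rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=5000, replace=False)]

# Run the expensive algorithm on a small subsample to get candidate centers.
ap = AffinityPropagation().fit(sample)
centers = ap.cluster_centers_

# Use those centers to initialise mini-batch k-means over the full data.
km = MiniBatchKMeans(n_clusters=len(centers), init=centers, n_init=1)
labels = km.fit_predict(X)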
For a billion feature vectors I'd be wary of using K-means on its own. I'm sure you could do it, but it would take a long time and would thus be difficult to debug. I recommend using canopy clustering first and then applying K-means to reduce the complexity and computation. These sub-clusters could then be reduced further with a MapReduce implementation to go even faster.