I have large-ish SVC models (~50Mb cPickles) for text classification and I am trying out various ways to use them in a production environment. Classifying batches of documents works very well (about 1k documents per minute using both predict and predict_proba).
However, prediction on a single document is another story, as explained in a comment to this question:
Are you doing predictions in batches? The SVC.predict method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – larsmans
I am already serving the SVC models as Flask web-applications, so a part of the overhead is gone (unpickling) but the prediction times for single docs are still on the high side (0.25s).
I have looked at the code in the predict methods but cannot figure out if there is a way to "pre-warm" them, reconstructing the LibSVM data structure in advance at server startup... any ideas?
def predict(self, X):
    """Perform classification on samples in X.

    For an one-class model, +1 or -1 is returned.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Returns
    -------
    y_pred : array, shape = [n_samples]
        Class labels for samples in X.
    """
    y = super(BaseSVC, self).predict(X)
    return self.classes_.take(y.astype(np.int))
I can see three possible solutions.
Custom server
It is not a matter of "warming" anything up. libSVM is a C library, and you need to pack/unpack data into the correct format. This process is more efficient on whole matrices than on each row separately. The only way to overcome this would be to write a more efficient wrapper between your production environment and libSVM (for example, a libSVM-based server that uses some kind of shared memory with your service). Unfortunately, this is too custom a problem to be solved by existing implementations.
Batches
A naive approach like buffering the queries is an option: if it is a "high performance" system with thousands of queries, you can simply collect them into N-element batches and send each batch to libSVM in one call.
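A rough sketch of the buffering idea, assuming a fitted clf and a vectorizer are already loaded; BATCH_SIZE and the buffer handling are placeholders for your own plumbing:

BATCH_SIZE = 64
buffer = []

def enqueue(doc):
    buffer.append(doc)
    if len(buffer) >= BATCH_SIZE:
        return flush()
    return None

def flush():
    X = vectorizer.transform(buffer)   # one pack/unpack for the whole batch
    preds = clf.predict(X)
    buffer.clear()
    return preds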
Own classification
Lastly, classification with an SVM is a really simple task. You don't need libSVM to perform classification; only training is a complex problem. Once you have the support vectors (SV_i), the kernel (K), the Lagrange multipliers (alpha_i) and the intercept term (b), you classify using:
cl(x) = sgn( SUM_i y_i alpha_i K(SV_i, x) + b)
You can code this operation directly in your app, without the need to actually pack/unpack/send anything to libSVM. This can speed things up by an order of magnitude. Obviously, probabilities are more complex to retrieve, as they require Platt scaling, but it is still possible.
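A minimal sketch of that decision rule, here with an RBF kernel as an example; support_vectors, y_alpha (= y_i * alpha_i), b and gamma are assumed to have been exported from the trained model:

import numpy as np

def rbf_kernel(support_vectors, x, gamma):
    # K(SV_i, x) = exp(-gamma * ||SV_i - x||^2), one value per support vector
    return np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))

def classify(x, support_vectors, y_alpha, b, gamma):
    # sgn( SUM_i y_i alpha_i K(SV_i, x) + b )
    decision = np.dot(y_alpha, rbf_kernel(support_vectors, x, gamma)) + b
    return 1 if decision > 0 else -1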
You can't construct the LibSVM data structure in advance. When a request to classify a document arrives, you get the text of the document, make a vector out of it, and only then convert it to LibSVM format so you can get a decision.
LinearSVC should be considerably faster than an SVC with a linear kernel, as it uses liblinear. You could try using a different classifier if that does not decrease performance too much.
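A hedged sketch of the swap, assuming X_train, y_train and a single-row X_single are already vectorized:

from sklearn.svm import LinearSVC

clf = LinearSVC(C=1.0)         # liblinear-based; much cheaper per-sample prediction
clf.fit(X_train, y_train)
print(clf.predict(X_single))   # X_single must be a 2-D array with one row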
I have data to which the fast Fourier transform has been applied (amplitudes at specific frequencies).
There are solutions on the internet where a CNN is applied to a mel spectrogram; however, I have found no solution where a CNN is applied to a Fourier-transformed signal.
Is it possible to apply a CNN to Fourier-transformed signals?
Or is it not possible because a CNN relies on the temporal structure of its input?
Thanks!
I'm assuming each row of your spreadsheet is IID, i.e. it wouldn't change the problem to re-order the rows in that spreadsheet.
In this case you have a pretty typical ML problem. The fact that the FFT has already been applied and specific frequency responses (columns) have been extracted is a process called "feature engineering". Prior to the common use of neural networks, this was a standard step in all machine learning problems and remains common to a great many domains.
With data that has been feature engineered, you should look to traditional ML algorithms. Random Forests, XGBoost, and Linear Regression come to mind. A fully connected neural network is also appropriate, but I would typically expect it to under-perform other ML methods.
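For example, a small sketch under the assumption that each row holds the FFT amplitudes plus a label in the last column (the file name and layout are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = np.loadtxt("fft_features.csv", delimiter=",")   # placeholder path
X, y = data[:, :-1], data[:, -1].astype(int)           # assumed: label in the last column

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())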
The hallmark of a CNN is that it operates on an ordered sequence of data. In your case the raw data, from which your dataset was derived, would be appropriate for a CNN. In a sound file you have a 1D sequence of information. You could not re-order the data in the time dimension without fundamentally changing its meaning.
A 2D CNN operates over an image where the pixel order in X and Y cannot be changed. Again the sequential order of the data matters. The same applies for 3D CNNs.
Be aware that the application of an FFT has fundamentally biased your solution by representing it only in a limited set of frequency responses. All feature engineering fundamentally biases the problem, presumably in a well-thought-out way. However, it's entirely possible that other useful signals exist in the data which aren't expressed by the FFT at 10, 20, 30 Hz, etc. A CNN has the capacity to learn its own version of an FFT as well as other non-cyclic patterns. Typically, the lack of a feature engineering step is the key differentiator between the CNN and traditional ML algorithms.
I am having trouble with the supervised classification method I am using for my data.
Let's say we train our algorithm on a dataset (N=70) after reducing the dimensionality from 100 to 2 using LDA.
Now we would like to predict the class of the 71st sample, whose class is completely unknown to us. However, it still has 100 features, so we have to reduce its dimensionality as well.
That seems easy at first glance: I can reuse the transformation fitted on the training data. For example, in Python:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
flda = lda.fit(X, Y)         # fit the projection on the training data
X_lda = flda.transform(X)    # project the training data to 2 dimensions
clf.fit(X_lda, Y)            # train the classifier on the reduced data
I have stored the fitted transformation from the training data. X_p is my single sample, so when I use flda again for the transformation, the same fitting information is used:
X_p = flda.transform(X_p.reshape(1, -1))
However, it doesn't predict properly! To test this, I took my original N=70 samples, removed one of them (so the training set is now N=69) and used the 70th sample as the test sample. Again, it didn't predict properly.
When I compared the transformed data from the previous run (N=70) with the new one (N=69), I saw that every single number changed! Unless I am missing something (I hope I am, and that you can tell me what), LDA dimensionality reduction is not usable for real machine learning applications, because a single additional sample can change everything.
As a note, the plot of the reduced data doesn't change even though all the numbers change significantly (i.e., the relative locations of the points are preserved).
Do you know how LDA dimensionality reduction is used in real machine learning applications? What must I do to test a single sample in the following order:
Reduce dimensions to 2 for training data
Reduce dimensions to 2 for test data
Predict!
without using the same transformation characteristics?
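For concreteness, a minimal, self-contained version of the workflow I am attempting looks roughly like this (make_classification and KNeighborsClassifier are only stand-ins for my real data and classifier):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Stand-in training data: 70 samples, 100 features, 3 classes
X, Y = make_classification(n_samples=70, n_features=100, n_informative=10,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, Y)              # fit the projection on the training data only
clf = KNeighborsClassifier().fit(X_lda, Y)   # classifier trained in the 2-D space

X_p = np.random.rand(100)                    # stand-in for the unseen 71st sample
X_p_lda = lda.transform(X_p.reshape(1, -1))  # reuse the same fitted projection
print(clf.predict(X_p_lda))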
I'm working on a TREC task involving machine learning techniques, where the dataset consists of more than 5 terabytes of web documents from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of functionalities that seems to fit my needs, but I don't know whether it is going to scale well to handle big data. For example, is HashingVectorizer able to handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?
HashingVectorizer will work if you iteratively chunk your data into batches of, for instance, 10k or 100k documents that fit in memory.
You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting to have seen all the samples.
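A rough sketch of that loop, assuming iter_batches() yields (documents, labels) batches from disk and that a small held-out set val_docs/val_labels is already loaded; these names are placeholders:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 20)   # stateless: no fitting needed
clf = SGDClassifier()                                # supports partial_fit
classes = [0, 1]                                     # all class labels must be known up front

X_val = vectorizer.transform(val_docs)               # held-out validation documents

for docs, labels in iter_batches():                  # e.g. 10k documents per batch
    X = vectorizer.transform(docs)
    clf.partial_fit(X, labels, classes=classes)
    print("validation accuracy:", clf.score(X_val, val_labels))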
You can also do this in parallel on several machines on partitions of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
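A small sketch of the averaging step, assuming clfs is a list of models of the same type trained on different partitions:

import numpy as np

final = clfs[0]                                            # reuse one model as the container
final.coef_ = np.mean([c.coef_ for c in clfs], axis=0)
final.intercept_ = np.mean([c.intercept_ for c in clfs], axis=0)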
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial
I have a csv file of size [66k, 56k] (rows, columns). It's a sparse matrix. I know that numpy can handle a matrix of that size. I would like to know, based on everyone's experience, how many features scikit-learn algorithms can handle comfortably?
Depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests since they won't handle sparse matrices).
I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
Full disclosure: I'm a scikit-learn core developer.
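For illustration, a minimal sketch of fitting a linear model on sparse input; load_svmlight_file is just one convenient way to obtain a CSR matrix, and any scipy.sparse matrix built from your CSV works the same way:

from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC

X, y = load_svmlight_file("data.svmlight")   # placeholder file; X comes back as a CSR matrix
clf = LinearSVC().fit(X, y)                  # sparse input is used directly, no densification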
Some linear model (regression, SGD, naive Bayes) will probably be your best bet if you need to train your model frequently.
Before you go running any models, though, you could try the following:
1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text- or ratings-based, there are lots of well-known options available.
2) Learning curve analysis. Maybe you only need a small subset of your data to train a model, and beyond that you are only overfitting to your data or gaining tiny increases in accuracy (see the sketch after this list).
Both approaches could allow you to greatly reduce the training data required.
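A possible way to run the learning-curve check from point 2, assuming X and y are already loaded (sparse input is fine for the linear models mentioned above):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    SGDClassifier(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
print(sizes)
print(val_scores.mean(axis=1))   # if this plateaus early, a subset of the data is enough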
I have trained a bunch of RBF SVMs using scikits.learn in Python and then Pickled the results. These are for image processing tasks and one thing I want to do for testing is run each classifier on every pixel of some test images. That is, extract the feature vector from a window centered on pixel (i,j), run each classifier on that feature vector, and then move on to the next pixel and repeat. This is far too slow to do with Python.
Clarification: When I say "this is far too slow..." I mean that even the Libsvm under-the-hood code that scikits.learn uses is too slow. I'm actually writing a manual decision function for the GPU so classification at each pixel happens in parallel.
Is it possible for me to load the classifiers with Pickle, and then grab some kind of attribute that describes how the decision is computed from the feature vector, and then pass that info to my own C code? In the case of linear SVMs, I could just extract the weight vector and bias vector and add those as inputs to a C function. But what is the equivalent thing to do for RBF classifiers, and how do I get that info from the scikits.learn object?
Added: First attempts at a solution.
It looks like the classifier object has the attribute support_vectors_ which contains the support vectors as each row of an array. There is also the attribute dual_coef_ which is a 1 by len(support_vectors_) array of coefficients. From the standard tutorials on non-linear SVMs, it appears then that one should do the following:
Compute the feature vector v from your data point under test. This will be a vector that is the same length as the rows of support_vectors_.
For each row i in support_vectors_, compute the squared Euclidean distance d[i] between that support vector and v.
Compute t[i] as exp(-gamma * d[i]), where gamma is the RBF parameter.
Sum up dual_coef_[i] * t[i] over all i. Add the value of the intercept_ attribute of the scikits.learn classifier to this sum.
If the sum is positive, classify as 1. Otherwise, classify as 0.
Added: Page 9 of this documentation link mentions that indeed the intercept_ attribute of the classifier holds the bias term. I have updated the steps above to reflect this.
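A sketch of the steps above in NumPy, pulling the needed attributes from a fitted binary RBF classifier clf (the gamma attribute is assumed to be a number here):

import numpy as np

sv = clf.support_vectors_         # one support vector per row
coef = clf.dual_coef_.ravel()     # y_i * alpha_i for each support vector
b = clf.intercept_[0]
gamma = clf.gamma                 # RBF parameter

def decision(v):
    d = np.sum((sv - v) ** 2, axis=1)   # squared Euclidean distance to each support vector
    t = np.exp(-gamma * d)              # RBF kernel values
    return np.dot(coef, t) + b

label = 1 if decision(v) > 0 else 0     # v is the feature vector under test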
Yes, your solution looks alright. To pass the raw memory of a numpy array directly to a C program you can use the ctypes helpers from numpy, or wrap your C program with Cython and call it directly by passing the numpy array (see the docs at http://cython.org for more details).
However, I am not sure that trying to speed up the prediction on a GPU is the easiest approach: kernel support vector machines are known to be slow at prediction time, since their complexity depends directly on the number of support vectors, which can be high for highly non-linear (multi-modal) problems.
Alternative approaches that are faster at prediction time include neural networks (probably more complicated or slower to train right than SVMs, which only have 2 hyper-parameters, C and gamma) or transforming your data with a non-linear transformation based on distances to prototypes + thresholding + max pooling over image areas (only for image classification).
For the first method you will find good documentation in the deep learning tutorial.
For the second, read the recent papers by Adam Coates and have a look at this page on k-means feature extraction.
Finally, you can also try NuSVC models, whose regularization parameter nu has a direct impact on the number of support vectors in the fitted model: fewer support vectors mean faster prediction times (check the accuracy though; in the end it will be a trade-off between prediction speed and accuracy).
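A quick sketch of that trade-off, assuming X_train, y_train, X_test, y_test and a gamma value are already chosen (very small nu values can be infeasible for some datasets):

from sklearn.svm import NuSVC

for nu in (0.5, 0.2, 0.1):
    clf = NuSVC(nu=nu, gamma=0.1).fit(X_train, y_train)   # gamma=0.1 is a placeholder value
    print(nu, clf.support_vectors_.shape[0], clf.score(X_test, y_test))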