I am using the scikit-learn implementation of Gaussian Process Regression, and I want to fit single points one at a time instead of fitting a whole set of points. But the resulting alpha coefficients should remain the same, e.g.
gpr2 = GaussianProcessRegressor()
for i in range(x.shape[0]):
    gpr2.fit(x[i], y[i])
should be the same as
gpr = GaussianProcessRegressor().fit(x, y)
But when accessing gpr2.alpha_ and gpr.alpha_, they are not the same. Why is that?
Indeed, I am working on a project where new data points arrive over time. I don't want to append to the x, y arrays and refit on the whole dataset every time, as that is very time-intensive. Let x be of size n; then I end up with
n + (n-1) + (n-2) + ... + 1 ∈ O(n^2) fittings.
Considering that the fitting itself is quadratic (correct me if I'm wrong), the overall run-time complexity would be in O(n^3). It would be far cheaper to do a single fit on all n points:
1 + 1 + ... + 1 = n ∈ O(n)
What you are referring to is called online learning or incremental learning; it is a huge sub-field of machine learning in its own right, and it is not available out of the box for all scikit-learn models. Quoting from the relevant documentation:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
Following this excerpt in the linked document, there is a complete list of all scikit-learn models that currently support incremental learning, from which you can see that GaussianProcessRegressor is not one of them.
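For a regression problem like yours, here is a minimal sketch of what the partial_fit API looks like with an estimator that does support it (SGDRegressor, used purely for illustration; the data is synthetic):
import numpy as np
from sklearn.linear_model import SGDRegressor
rng = np.random.default_rng(0)
model = SGDRegressor()
# feed the data in small chunks instead of all at once
for _ in range(100):
    X_batch = rng.normal(size=(32, 5))        # 32 new samples, 5 features
    y_batch = X_batch @ np.arange(1.0, 6.0)   # toy linear target
    model.partial_fit(X_batch, y_batch)       # updates the model in place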
Although sklearn.gaussian_process.GaussianProcessRegressor does not directly implement incremental learning, it is not necessary to fully retrain your model from scratch.
To fully understand how this works, you need some GPR fundamentals. The key point is that training a GPR model mainly consists of optimising the kernel parameters to minimise some objective function (the log-marginal likelihood by default). When using the same kernel on similar data, these parameters can be reused. Since the optimiser has a convergence-based stopping condition, re-optimisation can be sped up by initialising the parameters with pre-trained values (a so-called warm start).
Below is an example based on the one in the sklearn docs.
from time import time
from sklearn.datasets import make_friedman2
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=0)
kernel = DotProduct() + WhiteKernel()
start = time()
gpr = GaussianProcessRegressor(kernel=kernel,
                               random_state=0).fit(X, y)
print(f'Time: {time()-start:.3f}')
# Time: 4.851
print(gpr.score(X, y))
# 0.3530096529277589
# the kernel is copied to the regressor so we need to
# retrieve the trained parameters
kernel.set_params(**(gpr.kernel_.get_params()))
# use slightly different data
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=1)
# note: fitting now starts from the previously trained kernel parameters
start = time()
gpr2 = GaussianProcessRegressor(kernel=kernel,
                                random_state=0).fit(X, y)
print(f'Time: {time()-start:.3f}')
# Time: 1.661
print(gpr2.score(X, y))
# 0.38599549162834046
As you can see, retraining takes significantly less time than training from scratch. Although this is not fully incremental learning, it can help speed up training considerably in a setting with streaming data.
Related
Usually people use scikit-learn to train a model this way:
from sklearn.ensemble import GradientBoostingClassifier as gbc
clf = gbc()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
It works fine as long as the user's memory is large enough to accommodate the entire dataset. The dilemma for me is exactly this: the dataset is too big for my memory. My current workaround is to enlarge the virtual memory of my machine, which has already made the system extremely slow. So I started to wonder whether it is possible to feed the fit() method with samples in batches like this (and the answer is no, please keep reading and stop reminding me that the answer is no):
clf = gbc()
for i in range(X_train.shape[0]):
    clf.fit(X_train[i], y_train[i])
so that I can read the training set from the hard drive only when needed. I read sklearn's manual, and it seems that this is not supported:
Calling fit() more than once will overwrite what was learned by any previous fit()
So, is this possible?
This does not work in scikit-learn, as explained in the comment section as well as in the documentation. However, you can use river (a Python package for online/streaming machine learning). This package should be well suited for your problem.
Below is an example of training a LinearRegression using river.
from river import datasets
from river import linear_model
from river import metrics
from river import preprocessing
dataset = datasets.TrumpApproval()
model = (
    preprocessing.StandardScaler() |
    linear_model.LinearRegression(intercept_lr=.1)
)
metric = metrics.MAE()
for x, y in dataset:
    y_pred = model.predict_one(x)
    # Update the running metric with the prediction and ground truth value
    metric.update(y, y_pred)
    # Train the model with the new sample
    model.learn_one(x, y)
It is not clear from your question which steps of the machine learning pipeline are slow for you. As also noted in the river manual and in this post, sklearn offers the option to do a partial fit. You will be restricted in terms of the models you can use for this kind of incremental learning.
So, using your example, let's say we use a stochastic gradient descent classifier:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
X, y = make_classification(100000)
# note: recent scikit-learn versions use loss='log_loss' instead of 'log'
clf = SGDClassifier(loss='log')
all_classes = list(set(y))
for ix in np.split(np.arange(0, X.shape[0]), 100):
    clf.partial_fit(X[ix, :], y[ix], classes=all_classes)
After reading section 6, Strategies to scale computationally: bigger data, of the official manual mentioned by @StupidWolf in this post, I am aware that there is more to this question than meets the eye.
The real difficulty lies in the design of many of the models.
Take Random Forest as an example: one of the most important techniques used to improve its performance over a single Decision Tree is bagging, which means the algorithm has to draw random samples from the entire dataset to construct several weak learners that form the forest. Feeding the model one sample after another simply does not work with this design (see the short sketch below).
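As a rough illustration (not scikit-learn's actual implementation), the sketch below shows why bagging needs random access to all n rows: every bootstrap replicate draws indices, with replacement, from the entire dataset.
import numpy as np
def bootstrap_sample(X, y, rng):
    # each weak learner is trained on indices drawn (with replacement)
    # from the *entire* dataset, so all rows must be accessible
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
X_boot, y_boot = bootstrap_sample(X, y, rng)  # one bootstrap replicate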
Although scikit-learn could in principle define an interface for end users to implement, so that scikit-learn could request a random sample through it and the user's implementation would decide how to return the needed data by scanning the dataset on the hard drive, this is way more complicated than I initially thought. The performance gain may also not be very significant, given that an IO-heavy "full table scan" (in database terms) would be needed frequently.
I've noticed a rather peculiar but potentially very useful phenomenon when using Scikit-Learn's SVC implementation: using the built-in rbf kernel with SVC is slower by orders of magnitude than passing a custom rbf function to SVC().
From what I can see and understand so far, the only difference between the two versions is that in the built-in rbf case the kernel is computed by libsvm rather than by sklearn. Passing a dedicated kernel function as the kernel hyperparameter to SVC() leads to the kernel being computed inside sklearn, not in libsvm. The results are identical, but the latter takes only a fraction of the computation time.
Example
I've included an example so that you can replicate this behavior.
I've created a toy dataset that mimics the data I am currently working on. By the way, I also work on data with around a thousand samples but high dimensionality (~50000 features). This results in pretty much the same behavior.
import numpy as np
from time import time
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.metrics import accuracy_score
# create toy data
n_features = 1000
n_samples = 10000
n_informative = 10
X, y = make_classification(n_samples, n_features, n_informative=n_informative)
gamma = 1 / n_features
Built-in RBF
First, let's fit an SVC using the built-in 'rbf' kernel. This is probably the way people usually run an SVC.
# fit SVC with built-in rbf kernel
svc_built_in = SVC(kernel='rbf', gamma=gamma)
np.random.seed(13)
t1 = time()
svc_built_in.fit(X, y)
acc = accuracy_score(y, svc_built_in.predict(X))
print("Fitting SVC with built-in kernel took {:.1f} seconds".format(time()-t1))
print("Accuracy: {}".format(acc))
Custom RBF function
Second, let's do the same thing, but pass sklearn's rbf_kernel function, which should compute exactly the same kernel.
# fit SVC with custom rbf kernel
svc_custom = SVC(kernel=rbf_kernel, gamma=gamma)
np.random.seed(13)
t1 = time()
svc_custom.fit(X, y)
acc = accuracy_score(y, svc_custom.predict(X))
print("Fitting SVC with a custom kernel took {:.1f} seconds".format(time()-t1))
print("Accuracy: {}".format(acc))
Results
This will give the following result.
Fitting SVC with built-in kernel took 58.6 seconds
Accuracy: 0.9846
Fitting SVC with a custom kernel took 3.2 seconds
Accuracy: 0.9846
My question
Does anyone have an idea why passing a kernel function is so much faster than using libsvm's kernel computation?
For my specific use case (usually large datasets and long computation time), this actually is super useful as I can run many more hyperparameter settings using the second method since the computation time is so significantly decreased. Any reasons not to do this?
I have received some good answers to this question on the sklearn bug report (https://github.com/scikit-learn/scikit-learn/issues/21410), so I thought I'd share the knowledge here.
Apparently, when the kernel is computed inside sklearn (rather than libsvm), the computation is done with NumPy. NumPy, however, automatically uses all available threads on your machine to speed up the underlying matrix operations. Since I was running this analysis on a machine with 32 threads, I saw a dramatic performance increase. I am not sure whether there are other reasons for NumPy being faster (smarter memory access or something like that), but I can definitely confirm that the parallelization is happening.
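A simple way to check that threading is indeed the difference (a sketch, assuming the threadpoolctl package is installed; the data shapes are arbitrary) is to limit the BLAS thread pool and time the kernel computation again:
from time import time
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from threadpoolctl import threadpool_limits
X = np.random.rand(5000, 1000)
start = time()
K = rbf_kernel(X)                      # uses all available BLAS threads
print(f"all threads: {time() - start:.1f}s")
with threadpool_limits(limits=1):      # restrict BLAS/OpenMP to a single thread
    start = time()
    K = rbf_kernel(X)
    print(f"one thread:  {time() - start:.1f}s")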
So my takeaway is: if you are running SVC on a larger dataset and can make use of multiple threads on your machine, it may be worthwhile to pass the kernel function itself rather than a string specifier to the SVC instance. All standard kernel functions are already implemented in sklearn under sklearn.metrics.pairwise (https://scikit-learn.org/stable/modules/metrics.html).
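One caveat, to the best of my understanding of the SVC API: when kernel is a callable, SVC's gamma parameter is not forwarded to it, so a non-default gamma has to be bound into the callable itself, e.g. with functools.partial. (In the example above the results matched because gamma = 1/n_features happens to be rbf_kernel's default.)
from functools import partial
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC
# bind the desired gamma into the kernel callable;
# SVC's own gamma argument is ignored for callable kernels
custom_rbf = partial(rbf_kernel, gamma=0.01)
svc = SVC(kernel=custom_rbf)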
I am going through the incremental learning algorithms in scikit-learn. SGD in scikit-learn is one such algorithm: it allows incremental learning by passing chunks/batches.
Does scikit-learn keep all the training batches in memory?
Or does it keep chunks/batches in memory only up to a certain size?
Or does it keep only one chunk/batch in memory while training and discard the previously trained chunks/batches? Does that mean it suffers from catastrophic forgetting?
The purpose of incremental learning is to not keep the whole training data in memory. Thus, it is possible to learn on big data sets that would not fit in memory as a whole. Incremental learning is also useful if the training data becomes available piece by piece.
Stochastic Gradient Descent (SGD) keeps no batches in memory except the one it is currently working on. However, that does not mean it immediately forgets past batches. Batches are used to compute the gradient, which is used to update the model coefficients, so the information contained in past batches remains in the model even though the data itself is discarded.
Since the gradient is updated with the most recent batch, newer batches have more influence on the current training state of the model than older batches. You could say that recent batches are more vivid in the model's memory while it gradually forgets older batches.
Here is a toy example to illustrate this issue (code at the bottom):
An SGD classifier was trained incrementally with three classes in the first 100 batches. In batches 100-200 class 3 was not present in the training data. It is very apparent that the classifier "forgets" everything it had learned about this class before. You may label this effect "catastrophic forgetting", or you may see it as desirable "adapting to changes in data"; the interpretation depends on the use case.
So, yes, SGD indeed seems to suffer from catastrophic forgetting. I do not think it's a big deal, though; just something you have to be aware of when designing the training strategy in a particular application.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
np.random.seed(42)
n_features = 150
centers = np.concatenate([np.eye(3)*3, np.zeros((3, n_features-3))], axis=1)
x_test, y_test = make_blobs([100, 100, 100], centers=centers)
cla = SGDClassifier()
performance = []
def train_some_batches(n_samples_per_class):
    for _ in range(100):
        x_batch, y_batch = make_blobs(n_samples_per_class, centers=centers)
        cla.partial_fit(x_batch, y_batch, classes=[0, 1, 2])
        conf = confusion_matrix(y_test, cla.predict(x_test))
        performance.append(np.diag(conf) / np.sum(conf, axis=1))
train_some_batches([50, 50, 50])
train_some_batches([50, 50, 0])
plt.plot(performance)
plt.legend(['class 1', 'class 2', 'class 3'])
plt.xlabel('training batches')
plt.ylabel('accuracy')
plt.show()
Lately I've been trying to fit a regularized logistic regression on vectorized text data. I first tried with sklearn and had no problem, but then I discovered that I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is, when I try to fit the logit, it keeps running forever and uses about 95% of my RAM (I tried both on 8GB and 16GB RAM computers).
My first guess was it had to do with dimensionality, because I was working with a 2960 x 43k matrix. So, to reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix, which, I think, shouldn't be too problematic.
This is a little sample of my code:
for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
    X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
    y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]
cvectorizer = CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1, 1))
vec = cvectorizer.fit(X_train)
X_train_vectorized = vec.transform(X_train)
This gets me a train and a test set, and then vectorizes text from X_train.
Then I try:
import statsmodels.api as sm
logit=sm.Logit(y_train.values,X_train_vectorized.todense())
result=logit.fit_regularized(method='l1')
Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?
Thanks in advance!
Almost all of statsmodels, and all of its inference, is designed for the case where the number of observations is much larger than the number of features.
Logit.fit_regularized uses an interior-point algorithm with scipy optimizers, which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimate, which has shape n_features by n_features (with ~43k features, that covariance alone is roughly 43,000² × 8 bytes ≈ 15 GB in double precision). The use case it was designed for is when the number of features is relatively small compared to the number of observations, and the Hessian fits in memory.
GLM.fit_regularized estimates elastic-net penalized parameters and uses coordinate descent. This can potentially handle a large number of features, but it does not have any inferential results available.
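For reference, a minimal sketch of that route, reusing y_train and X_train_vectorized from the question (the alpha and L1_wt values are arbitrary placeholders, not tuned settings):
import statsmodels.api as sm
# logistic regression as a GLM with a Binomial family;
# L1_wt=1.0 gives a pure lasso penalty, alpha controls its strength
model = sm.GLM(y_train.values, X_train_vectorized.todense(),
               family=sm.families.Binomial())
result = model.fit_regularized(method='elastic_net', alpha=0.1, L1_wt=1.0)
print(result.params)  # penalized coefficients only, no standard errors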
Inference after Lasso and similar variable-selecting penalization has only become available in recent research. See, for example, selective inference in Python (https://github.com/selective-inference/Python-software), for which an R package is also available.
Say I have some sklearn training data:
features, labels = assign_dataSets() #assignment operation
Here features is a 2-D array, whereas labels is a 1-D array consisting of the values [0, 1].
Splitting the features by class (for plotting):
f1x = [features[i][0] for i in range(0, len(features)) if labels[i]==0]
f2x = [features[i][0] for i in range(0, len(features)) if labels[i]==1]
f1y = [features[i][1] for i in range(0, len(features)) if labels[i]==0]
f2y = [features[i][1] for i in range(0, len(features)) if labels[i]==1]
Now I plot the said data:
import matplotlib.pyplot as plt
plt.scatter(f1x,f1y,color='b')
plt.scatter(f2x,f2y,color='y')
plt.show()
Now I want to run the fitting operation with a classifier, for example SVC.
from sklearn.svm import SVC
clf = SVC()
clf.fit(features, labels)
Now my question: since support vector machines are really slow to train, is there a way to monitor the decision boundary of the classifier in real time (I mean while the fitting operation is occurring)? I know that I can plot the decision boundary after fitting has finished, but I want the plot to update in real time, perhaps with threading and running predictions on an array of points generated by a linspace. Does the fit function even allow such operations, or do I need to go for some other library?
Just so you know, I am new to machine learning.
scikit-learn has this feature, but from my understanding it is limited to a few classifiers (e.g. GradientBoostingClassifier, MLPClassifier). To turn on this feature, you need to set verbose=True. For example:
clf = GradientBoostingClassifier(verbose=True)
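A minimal runnable sketch of what this looks like (toy data, purely for illustration):
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
# verbose=True prints the training loss for each boosting iteration
# while fit() is running, so progress can be followed in the console
clf = GradientBoostingClassifier(verbose=True)
clf.fit(X, y)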
I tried it with SVC and it didn't work as expected (probably for the reason sascha mentioned in the comment section). Here is a different variation of your question on StackOverflow.
With regards to your second question, if you switch to Tensorflow (another machine learning library), you can use the tensorboard feature to monitor a few metrics (e.g. error decay) in real time.
However, to the best of my knowledge, the SVM implementation is still experimental in v1.5. Tensorflow is really good when working with neural-network-based models.
If you decide to use a DNN for classification using Tensorflow then here is a discussion about implementation on StackOverflow: No easy way to add Tensorboard output to pre-defined estimator functions DnnClassifier?
Useful References:
Tensorflow SVM (only linear support for now - v1.5): https://www.tensorflow.org/api_docs/python/tf/contrib/learn/SVM
Tensorflow Kernel Methods: https://www.tensorflow.org/versions/master/tutorials/kernel_methods
Tensorflow Tensorboard: https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard
Tensorflow DNNClassifier Estimator: https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier