Lately I've been trying to fit a Regularized Logistic Regression on vectorized text data. I first tried with sklearn, and had no problem, but then I discovered and I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is, when I try to fit the logit it keeps running forever and using about 95% of my RAM (tried both on 8GB and 16GB RAM computers).
My first guess was it had to do with dimensionality, because I was working with a 2960 x 43k matrix. So, to reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix, which, I think, shouldn't be too problematic.
This is a little sample of my code:
for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]
cvectorizer=CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1,1) )
vec=cvectorizer.fit(X_train)
X_train_vectorized = vec.transform(X_train)
This gets me a train and a test set, and then vectorizes text from X_train.
Then I try:
import statsmodels.api as sm
logit=sm.Logit(y_train.values,X_train_vectorized.todense())
result=logit.fit_regularized(method='l1')
Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?
Thanks in advance!
Almost all of statsmodels and all the inference is designed for the case when the number of observations is much larger than the number of features.
Logit.fit_regularized uses an interior point algorithm with scipy optimizers which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimate which has shape n_features by n_features. The use case for which it was designed is when the number of features is relatively small compared to the number of observations, and the Hessian can be used in-memory.
GLM.fit_regularized estimates elastic net penalized parameters and uses coordinate descend. This can possibly handle a large number of features, but it does not have any inferential results available.
Inference after Lasso and similar penalization that select variables has only been available in recent research. See for example selective inference in Python https://github.com/selective-inference/Python-software for which also a R package is available.
Related
I am using LinearDiscriminantAnalysis of scikit-learn to perform class prediction on a dataset. The problem seems to be that the performance of LDA is not consistent. In general, when I increase the number of features that I use for training, I expect the performance to increase as well. Same thing with the number of samples that I use for training, the more samples the better the performance of LDA.
However, in my case, there seems to be a sweet spot where the LDA performs poorly depending on the number of samples and features used for training. More precisely, LDA performs poorly when the number of features equals the number of samples. I think that it does not have to do with my dataset. Not sure exactly what is the issue here but I have an extensive example code that can recreate these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 X 75 X 400 (trials X time X features). Here the trials represent the different samples. Each time I shuffle the trial indices of the dataset. I create the train set by picking the trials for training and similarly for the test set. Finally, I calculate the mean across time (second axis) and insert the final matrix with shape (trials X features) as input in the LDA to compute the score on the test set. The test set is always of size 50 trials.
A detailed jupyter notebook with comments and the data I use can be found here https://github.com/esigalas/LDA-test. In my environment I use
sklearn: 1.1.1,
numpy: 1.22.4.
Not sure if there is an issue with LDA itself (that would be worthy of opening an issue on the github) or something wrong with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!
I am using the scikit-learn implementation of Gaussian Process Regression here and I want to fit single points instead of fitting a whole set of points. But the resulting alpha coefficients should remain the same e.g.
gpr2 = GaussianProcessRegressor()
for i in range(x.shape[0]):
gpr2.fit(x[i], y[i])
should be the same as
gpr = GaussianProcessRegressor().fit(x, y)
But when accessing gpr2.alpha_ and gpr.alpha_, they are not the same. Why is that?
Indeed, I am working on a project where new data points arise. I dont want to append the x, y arrays and fit on the whole dataset again as it is very time intense. Let x be of size n, then I am having:
n+(n-1)+(n-2)+...+1 € O(n^2) fittings
when considering that the fitting itself is quadratic (correct me if I'm wrong), the run time complexity should be in O(n^3). It would be more optimal, if I do a single fitting on n points:
1+1+...+1 = n € O(n)
What you refer to is actually called online learning or incremental learning; it is in itself a huge sub-field in machine learning, and is not available out-of-the-box for all scikit-learn models. Quoting from the relevant documentation:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
Following this excerpt in the linked document above, there is a complete list of all scikit-learn models currently supporting incremental learning, from where you can see that GaussianProcessRegressor is not one of them.
Although sklearn.gaussian_process.GaussianProcessRegressor does not directly implement incremental learning, it is not necessary to fully retrain your model from scratch.
To fully understand how this works, you should understand the GPR fundamentals. The key idea is that training a GPR model mainly consists of optimising the kernel parameters to minimise some objective function (the log-marginal likelihood by default). When using the same kernel on similar data these parameters can be reused. Since the optimiser has a stopping condition based on convergence, reoptimisation can be sped up by initialising the parameters with pre-trained values (a so-called warm-start).
Below is an example based on the one in the sklearn docs.
from time import time
from sklearn.datasets import make_friedman2
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=0)
kernel = DotProduct() + WhiteKernel()
start = time()
gpr = GaussianProcessRegressor(kernel=kernel,
random_state=0).fit(X, y)
print(f'Time: {time()-start:.3f}')
# Time: 4.851
print(gpr.score(X, y))
# 0.3530096529277589
# the kernel is copied to the regressor so we need to
# retieve the trained parameters
kernel.set_params(**(gpr.kernel_.get_params()))
# use slightly different data
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=1)
# note we do not train this model
start = time()
gpr2 = GaussianProcessRegressor(kernel=kernel,
random_state=0).fit(X, y)
print(f'Time: {time()-start:.3f}')
# Time: 1.661
print(gpr2.score(X, y))
# 0.38599549162834046
You can see retraining can be done in significantly less time than training from scratch. Although this might not be fully incremental, it can help speed up training in a setting with streaming data.
I'm creating a model to perform Logistic regression on a dataset using Python. This is my code:
from sklearn import linear_model
my_classifier2=linear_model.LogisticRegression(solver='lbfgs',max_iter=10000)
Now, according to Sklearn doc page, max_iter is maximum number of iterations taken for the solvers to converge. How do I specifically state that I need 'N' number of iterations ?
Any kind of help would be really appreciated.
I’m not sure, but, Do you want to know the optimal number of iterations for your model? If so, you are better off utilizing GridSearchCV that scan tune hyper parameter like max_iter.
Briefly,
Split your data into two groups: train/test data with train_test_split or KFold that can be imported from sklean
Set your parameter, for instance para=[{‘max_iter’:[1,10,100,100]}]
Instance, for example clf=GridSearchCV(LogisticRegression, param_grid=para, cv=5, scoring=‘r2’)
Implement with using train data like this: clf.fit(x_train, y_train)
You can also fetch the best number of iterations with RandomizedSearchCV or BayesianOptimization.
About the GridSearchCV of the max_iter parameter, the fitted LogisticRegression models have and attribute n_iter_ so you can discover the exact max_iter needed for a given sample size and regarding features:
n_iter_: ndarray of shape (n_classes,) or (1, )
Actual number of iterations for all classes. If binary or multinomial, it
returns only 1 element. For liblinear solver, only the maximum number of
iteration across all classes is given.
Scanning very short intervals, like 1 by 1, is a waste of resources that could be used for more important LogisticRegression fit parameters such as the combination of solver itself, its regularization penalty and the inverse of the regularization strength C which contributes for a faster convergence within a given max_iter.
Setting a very high max_iter could be also a waste of resources if you haven't previously did a minimal feature preprocessing, at least, feature scaling or maybe imputation, outlier clipping and a dimensionality reduction (e.g. PCA).
Things can become worse: a tunned max_iter could be ok for a given sample size but not for a bigger sample size, for instance, if you are developing a cross-validated learning curve, which by the way is imperative for optimal machine learning.
It becomes even worse if you increase a sample size in a pipeline that generates feature vectors such as n-grams (NLP): more rows will generate more (sparse) features for the LogisticRegression classification.
I think it's important to observe if different solvers converges or not on given sample size, generated features and max_iter.
Methods that help a faster convergence which eventually won't demand increasing max_iter are:
Feature scaling
Dimensionality Reduction (e.g. PCA) of scaled features
There's a nice sklearn example demonstrating the importance of feature scaling
I'm having a problem where my scikit learn regression forest trains almost instantly (in a few seconds; I know from experience with this machine it should take about a half hour or more on a dataset of the size I am working on) and then predicts the exact same output for every row of input data.
My current theory is that it has something to with the order of magnitude of the target variables - around 10^-11. I tried multiplying them by 100,000 to see what happened, and it starting running forever and not doing anything until I killed the script.
The code is below:
n_estimators = 200
rfr = RandomForestRegressor(n_estimators=n_estimators, verbose = 2, n_jobs = -1)
y_train = df_train[target].values*100000
rfr.fit(X_train, y_train)
rfr.predict(X_train)
You're probably wondering why I ran it back on the training data - I was just trying to test if it was actually doing anything, which it isn't.
Thank you for your help!
Edit:
This is the describe() output for the target data. The training data is mostly similar in magnitude:
count 4.000000e+04
mean -1.062353e-11
std 5.990830e-10
min -1.063333e-08
25% -2.305633e-10
50% -6.325584e-12
75% 2.110687e-10
max 1.564848e-08
I tried standardizing the data and running the forest; it printed no output but the memory usage keeps creeping up so it must be doing something. Do regression forests take high computer power? I'm using a laptop with an i7 processor and an OK graphics card; I can classify fine.
I am using sklearn.svr with the RBF kernel on an 80k-size dataset with 20+ variables. I was wondering how to choose the termination parameter tol. I ask because the regression does not seem to converge for certain combinations of C and gamma (2+ days before I give up). Interestingly, it converges after less than 10 minutes for certain combinations with an average run-time of approximately an hour.
Is there some sort of rule of thumb for setting this parameter? Perhaps a relationship to the standard deviation or expected value of the forecast?
Mike's answer is correct: subsampling for grid searching parameter is probably the best strategy to train SVR on medium-ish dataset sizes. SVR is not scalable so don't waste your time doing a grid search on the full dataset. Try on 1000 random sub samples, then 2000 and then 4000. Each time find the optimal values for C and gamma and try to guess how they evolve whenever you double the size of the dataset.
Also you can approximate the true SVR solution with the Nystroem kernel approximation and a linear regressor model such as SGDRegressor, LinearRegression, LassoCV or ElasticNetCV. RidgeCV is likely not to improve upon LinearRegression in the n_samples >> n_features regime.
Finally, do not forget to scale your input data by putting a MinMaxScaler or a StandardScaler before the SVR model in a Pipeline.
I would also try GradientBoostingRegressor models (although completely unrelated to SVR).
You really shouldn't use SVR on large data sets: its training algorithm takes between quadratic and cubic time. sklearn.linear_model.SGDRegressor can fit a linear regression on such datasets without trouble, so try that instead. If linear regression won't hack it, transform your data with a kernel approximation before feeding it to SGDRegressor to get a linear-time approximation of an RBF-SVM.
You may have seen the scikit learn documentation for the RBF function. Considering what C and gamma actually do and the fact that the SVR training time is at worst quadratic in the number of samples, I would try training first on a small subset of the data. By first getting a result for all parameter settings and then scaling up the amount of training data used, you might find you actually only need a small sample of the data to get results very close to the full set.
This is the advice I was given by my MSc project supervisor recently, as I had the exact same problem. I found that out of a set of 120k examples with 250 features I only needed around 3000 samples to get within 2% of the error of the full set models.
Sorry this isn't answering your question directly, but I thought it might help.