Unexpected behavior with LinearDiscriminantAnalysis of scikit-learn - python

I am using LinearDiscriminantAnalysis of scikit-learn to perform class prediction on a dataset. The problem seems to be that the performance of LDA is not consistent. In general, when I increase the number of features that I use for training, I expect the performance to increase as well. Same thing with the number of samples that I use for training, the more samples the better the performance of LDA.
However, in my case, there seems to be a specific combination of sample count and feature count at which LDA performs poorly. More precisely, LDA performs poorly when the number of features equals the number of samples. I don't think it has to do with my dataset. I am not sure exactly what the issue is, but I have extensive example code that can recreate these results.
Here is an image of the LDA performance results that I am talking about.
The dataset I use has shape 400 X 75 X 400 (trials X time X features). Here the trials represent the different samples. Each time I shuffle the trial indices of the dataset. I create the train set by picking the trials for training and similarly for the test set. Finally, I calculate the mean across time (second axis) and insert the final matrix with shape (trials X features) as input in the LDA to compute the score on the test set. The test set is always of size 50 trials.
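The procedure described above can be sketched roughly as follows. This is only a minimal sketch on synthetic data, not the notebook's code: the function name lda_test_score, the binary labels, and the random array standing in for the recordings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def lda_test_score(data, labels, n_train, n_features, n_test=50, seed=0):
    """Shuffle trials, average over time, fit LDA, score on held-out trials."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])        # shuffle the trial indices
    train_idx = order[:n_train]
    test_idx = order[n_train:n_train + n_test]

    # Mean across the time axis gives a (trials x features) matrix.
    X = data.mean(axis=1)[:, :n_features]

    lda = LinearDiscriminantAnalysis()
    lda.fit(X[train_idx], labels[train_idx])
    return lda.score(X[test_idx], labels[test_idx])


# Synthetic stand-in for the recordings (assumed, smaller than 400 x 75 x 400).
rng = np.random.default_rng(0)
n_trials, n_time, n_feat = 200, 10, 60
labels = rng.integers(0, 2, size=n_trials)
data = rng.normal(size=(n_trials, n_time, n_feat))
data[:, :, :5] += labels[:, None, None] * 1.0     # class signal in 5 features

# Score with fewer, equal, and more training trials than features.
scores = {n: lda_test_score(data, labels, n_train=n, n_features=n_feat)
          for n in (30, 60, 120)}
print(scores)
```

Sweeping n_train and n_features over a grid with a helper like this is one way to check whether the dip at n_train == n_features also shows up on data that is unrelated to the original recordings.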
A detailed jupyter notebook with comments and the data I use can be found here https://github.com/esigalas/LDA-test. In my environment I use
sklearn: 1.1.1,
numpy: 1.22.4.
Not sure if there is an issue with LDA itself (that would be worthy of opening an issue on the github) or something wrong with how I handle the dataset, but this behavior of LDA looks wrong.
Any comment/help is welcome. Thanks in advance!


Statsmodels' Logit.fit_regularized keeps running forever

Lately I've been trying to fit a regularized logistic regression on vectorized text data. I first tried with sklearn and had no problem, but then I discovered that I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is that when I try to fit the logit, it keeps running forever and uses about 95% of my RAM (I tried on both 8GB and 16GB RAM computers).
My first guess was it had to do with dimensionality, because I was working with a 2960 x 43k matrix. So, to reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix, which, I think, shouldn't be too problematic.
This is a little sample of my code:
for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
    X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
    y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]
cvectorizer=CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1,1) )
X_train_vectorized = vec.transform(X_train)
This gets me a train and a test set, and then vectorizes text from X_train.
Then I try:
import statsmodels.api as sm
Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?
Thanks in advance!
Almost all of statsmodels and all the inference is designed for the case when the number of observations is much larger than the number of features.
Logit.fit_regularized uses an interior point algorithm with scipy optimizers which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimate which has shape n_features by n_features. The use case for which it was designed is when the number of features is relatively small compared to the number of observations, and the Hessian can be used in-memory.
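To get a feel for the memory involved, here is a back-of-the-envelope calculation for a single dense n_features by n_features matrix. It assumes 8-byte float64 values and counts only one such matrix (the helper name dense_matrix_gib is made up for this sketch); the optimizer may well hold several arrays of this size at once.

```python
def dense_matrix_gib(n_features, bytes_per_value=8):
    """Memory of one dense n_features x n_features float64 matrix, in GiB."""
    return n_features ** 2 * bytes_per_value / 2 ** 30


# Feature counts taken from the question: 6984 after reduction, ~43k before.
for n_features in (6984, 43000):
    print(f"{n_features} features -> {dense_matrix_gib(n_features):.2f} GiB")
```

Under these assumptions one matrix takes roughly 0.36 GiB for 6984 features and roughly 13.8 GiB for 43,000 features, so the original problem size alone would come close to exhausting a 16GB machine.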
GLM.fit_regularized estimates elastic net penalized parameters and uses coordinate descent. This can possibly handle a large number of features, but it does not have any inferential results available.
Inference after Lasso and similar penalization methods that select variables has only become available in recent research. See for example selective inference in Python https://github.com/selective-inference/Python-software, for which an R package is also available.

xgboost predict method returns the same predicted value for all rows

I've created an xgboost classifier in Python:
train is a pandas dataframe with 100k rows and 50 features as columns.
target is a pandas series
xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)
predictions = xgb_classifier.predict(test)
However, after training, when I use this classifier to predict values the entire results array is the same number. Any idea why this would be happening?
Data clarification:
~50 numerical features with a numerical target
I've also tried RandomForestRegressor from sklearn with the same data and it does give realistic predictions. Perhaps a legitimate bug in the xgboost implementation?
This question has received several responses including on this thread as well as here and here.
I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.
I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7000 columns) because I did not have enough local memory for the algorithm. It turned out that for me, the array of predicted values was just an array of the average values of the target variable. This suggests to me that the model may have been underfitting. One solution to an underfitting model is to train your model on more data, so I tried my analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values. On the other hand, the issue could simply have been that the slice of predicted values I was looking at were predicted from training data with very little information (e.g. 0's and nan's). For training data with very little information, it seems reasonable to predict the average value of the target feature.
None of the other suggested solutions I came across were helpful for me. To summarize some of the suggested solutions included:
1) check if gamma is too high
2) make sure your target labels are not included in your training dataset
3) max_depth may be too small.
One of the reasons for the same is that you're providing a high penalty through parameter gamma. Compare the mean value of your training response variable and check if the prediction is close to this. If yes then the model is restricting too much on the prediction to keep train-rmse and val-rmse as close as possible. Your prediction is the simplest with higher value of gamma. So you'll get the simplest model prediction like mean of training set as prediction or naive prediction.
Isn't max_depth=3 too small? Try increasing it; the default value is 6, if I remember correctly. Also turn on verbose output (silent=0 in older versions) so that you can monitor the error at each boosting round.
You need to post a reproducible example for any real investigation. It's entirely likely that your response target is highly unbalanced and that your training data is not super predictive, thus you always (or almost always) get one class predicted. Have you looked at the predicted probabilities at all to see if there is any variance? Is it just an issue of not using the proper cut-off for classification labels?
Since you said that an RF gave reasonable predictions, it would be useful to see your training parameters for that. At a glance, it's curious that you're using a regression objective function in your xgboost call, though; that could easily be why you are seeing such poor performance. Try changing your objective to 'binary:logistic'.
You should check there are no inf values in your target.
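Several of the checks suggested in these answers (non-finite values in the target, predictions that all collapse to the training mean) are easy to automate. A minimal sketch with plain NumPy follows; the helper name diagnose_constant_predictions and the tolerance are assumptions for illustration, not part of xgboost.

```python
import numpy as np


def diagnose_constant_predictions(y_train, predictions, rtol=1e-3):
    """Report common symptoms behind 'every row gets the same prediction'."""
    y_train = np.asarray(y_train, dtype=float)
    predictions = np.asarray(predictions, dtype=float)

    # Mean of the finite targets only, so NaN/inf do not poison the comparison.
    train_mean = y_train[np.isfinite(y_train)].mean()

    return {
        "target_has_nan": bool(np.isnan(y_train).any()),
        "target_has_inf": bool(np.isinf(y_train).any()),
        "predictions_constant": bool(predictions.max() == predictions.min()),
        "predictions_near_train_mean": bool(
            np.allclose(predictions, train_mean, rtol=rtol)),
    }


# Toy example: the model predicts the training mean for every row.
report = diagnose_constant_predictions(
    y_train=[0.0, 1.0, 2.0, 3.0], predictions=[1.5, 1.5, 1.5])
print(report)
```

If predictions_near_train_mean comes back True, the model is probably not splitting at all, which points towards the regularization and objective settings discussed above rather than towards the data loading.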
Try to increase (significantly) min_child_weight in XGBoost or min_data_in_leaf in LightGBM:
min_data_in_leaf    oof_rmse
20000               0.052998
2000                0.053001
200                 0.053002
20                  0.053015
2                   0.054261
Actually, it may be a case of overfitting masking as underfitting. It happens for instance for zero-inflated targets in case of insurance claims frequency models. One solution would be to increase the representation/coverage of rare target levels (e.g. non-zero insurance claims) in each tree leaf, by increasing the hyperparameter controlling minimum leaf size to some rather large values, such as those specified in the example above.
I just had this problem and managed to fix it. The problem was that I was training with tree_method='gpu_hist', which gave all the same predictions. If I set tree_method='auto' it works properly, but with much longer runtimes. When I then set tree_method='gpu_hist' along with base_score=0, it worked. I think base_score should be about the mean of your predicted variable.
I have tried all solutions on this page, but none worked.
As I was grouping time series, certain frequencies created gaps in data.
I solved this issue by filling all NaN's.
Probably the hyper-parameters you use cause errors. Try using default values. In my case, this problem was solved by removing subsample and min_child_weight hyper-parameters from params.

Multi-label classification for large dataset

I am solving a multilabel classification problem. I have about 6 Million of rows to be processed which are huge chunks of text. They are tagged with multiple tags in a separate column.
Any advice on which scikit libraries can help me scale up my code? I am using One-vs-Rest with an SVM inside it, but they don't scale beyond 90-100k rows.
classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
SVMs scale well as the number of columns increases, but poorly with the number of rows, as they are essentially learning which rows constitute the support vectors. I have seen this as a common complaint about SVMs, but most people don't understand why, as they typically scale well for most reasonable datasets.
1) You will want one-vs-rest, as you are using. One-vs-one will not scale well for this (n(n-1)/2 classifiers, vs n).
2) I would set a minimum df for the terms you consider to at least 5, maybe higher, which will drastically reduce your column size. You will find that a lot of words occur only once or twice, and they add no value to your classification because, at that frequency, an algorithm cannot possibly generalize. Stemming may help there.
3) Also remove stop words (the, a, an, prepositions, etc.; lists are easy to find online). That will further cut down the number of columns.
Once you have reduced your column size as described, I would try to eliminate some rows. If there are documents that are very noisy, or very short after steps 1-3, or maybe very long, I would look to eliminate them. Look at the s.d. and mean doc length, and plot the length of the docs (in terms of word count) against the frequency at that length to decide.
If the dataset is still too large, I would suggest a decision tree, or naive bayes, both are present in sklearn. DT's scale very well. I would set a depth threshold to limit the depth of the tree, as otherwise it will try to grow a humungous tree to memorize that dataset. NB on the other hand is very fast to train and handles large numbers of columns quite well. If the DT works well, you can try RF with a small number of trees, and leverage the ipython parallelization to multi-thread.
Alternatively, segment your data into smaller datasets, train a classifier on each, persist that to disk, and then build an ensemble classifier from those classifiers.
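As a rough illustration of the vocabulary-pruning advice in this answer, here is the question's pipeline with min_df raised to 5 and English stop words removed. This is only a sketch on a made-up toy corpus: the documents and tags are hypothetical, MultiLabelBinarizer is my assumption for encoding the tag column, and stop_words="english" uses scikit-learn's built-in list rather than a custom one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical); each document carries one or more tags.
docs = [
    "the python script raises an import error",
    "a pandas dataframe in the python script",
    "the sql query returns duplicate rows",
    "join a pandas dataframe with the sql query",
] * 5
tags = [["python"], ["python", "pandas"], ["sql"], ["pandas", "sql"]] * 5

# Turn the tag lists into a binary indicator matrix, one column per tag.
Y = MultiLabelBinarizer().fit_transform(tags)

classifier = Pipeline([
    # min_df=5 drops terms seen in fewer than 5 documents;
    # stop_words="english" drops common function words.
    ("vectorizer", CountVectorizer(min_df=5, stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(docs, Y)

vocabulary = classifier.named_steps["vectorizer"].vocabulary_
predicted = classifier.predict(["pandas dataframe from a sql query"])
print(len(vocabulary), predicted)
```

On a real corpus, comparing len(vocabulary) before and after raising min_df shows how many columns the change removes.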
HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory for instance.
You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go to monitor the accuracy of the partially trained model without waiting for having seen all the samples.
You can also do this in parallel on several machines on partitions of the data and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
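The out-of-core loop described above (hash each batch, call partial_fit, score on a held-out set as you go) can be sketched as follows. This is a minimal single-machine sketch on a toy binary task, not the multi-label setup from the question; the documents, labels, batch size, and the iter_batches helper are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches(docs, labels, batch_size):
    """Yield consecutive (docs, labels) chunks that fit in memory."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy data (hypothetical); a real job would stream batches from disk.
docs = ["cheap pills buy now", "project meeting notes attached"] * 50
labels = [1, 0] * 50
val_docs = ["buy cheap pills", "meeting notes for the project"]
val_labels = [1, 0]

vectorizer = HashingVectorizer(n_features=2 ** 18)  # stateless, no fit needed
clf = SGDClassifier(random_state=0)
X_val = vectorizer.transform(val_docs)

val_scores = []
for batch_docs, batch_labels in iter_batches(docs, labels, batch_size=20):
    X_batch = vectorizer.transform(batch_docs)
    # All classes are listed up front because a batch may not contain each one.
    clf.partial_fit(X_batch, batch_labels, classes=[0, 1])
    # Monitor accuracy on the held-out set after every batch.
    val_scores.append(clf.score(X_val, val_labels))

print(val_scores)
```

Because HashingVectorizer keeps no vocabulary, the same transform can run independently on every batch (or every machine), which is what makes this loop work without a global fit step.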
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel: https://github.com/ogrisel/parallel_ml_tutorial

SVM: Choosing Support Vector Machine regression termination criterion tolerance in sklearn

I am using sklearn.svm.SVR with the RBF kernel on an 80k-size dataset with 20+ variables. I was wondering how to choose the termination parameter tol. I ask because the regression does not seem to converge for certain combinations of C and gamma (2+ days before I give up). Interestingly, it converges after less than 10 minutes for certain combinations, with an average run-time of approximately an hour.
Is there some sort of rule of thumb for setting this parameter? Perhaps a relationship to the standard deviation or expected value of the forecast?
Mike's answer is correct: subsampling for grid searching parameter is probably the best strategy to train SVR on medium-ish dataset sizes. SVR is not scalable so don't waste your time doing a grid search on the full dataset. Try on 1000 random sub samples, then 2000 and then 4000. Each time find the optimal values for C and gamma and try to guess how they evolve whenever you double the size of the dataset.
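A minimal sketch of that subsampling strategy is below. It runs on synthetic data, and the subsample sizes (100, 200, 400), the parameter grid, and the scaler in front of the SVR are assumptions chosen so that the example finishes in seconds; on the real data you would use sizes like the 1000, 2000, 4000 suggested above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the real dataset (assumed).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# make_pipeline names the SVR step "svr", hence the "svr__" prefix.
param_grid = {"svr__C": [0.1, 1, 10], "svr__gamma": [0.01, 0.1, 1]}

best_params = {}
for n_sub in (100, 200, 400):                     # double the size each round
    idx = rng.choice(len(X), size=n_sub, replace=False)
    search = GridSearchCV(
        make_pipeline(StandardScaler(), SVR(kernel="rbf")),
        param_grid, cv=3)
    search.fit(X[idx], y[idx])
    best_params[n_sub] = search.best_params_

for n_sub, params in best_params.items():
    print(n_sub, params)
```

Watching how the best C and gamma move as n_sub doubles gives a hint about where to look on the full dataset, without ever running the full grid on all 80k rows.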
Also you can approximate the true SVR solution with the Nystroem kernel approximation and a linear regressor model such as SGDRegressor, LinearRegression, LassoCV or ElasticNetCV. RidgeCV is likely not to improve upon LinearRegression in the n_samples >> n_features regime.
Finally, do not forget to scale your input data by putting a MinMaxScaler or a StandardScaler before the SVR model in a Pipeline.
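Here is a rough sketch of the kernel-approximation route together with the scaling advice: a StandardScaler, then Nystroem with an RBF kernel, then a plain linear regressor. The data is synthetic, and gamma=0.5 and n_components=100 are arbitrary values picked for the illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic nonlinear regression problem (assumed, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                              # scale inputs first
    Nystroem(kernel="rbf", gamma=0.5, n_components=100, random_state=0),
    LinearRegression(),                            # linear model on the map
)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"held-out R^2: {r2:.3f}")
```

The cost of this pipeline grows linearly with the number of samples for a fixed n_components, which is the main reason to prefer it over an exact SVR at this dataset size. Swapping LinearRegression for SGDRegressor keeps the same structure.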
I would also try GradientBoostingRegressor models (although completely unrelated to SVR).
You really shouldn't use SVR on large data sets: its training algorithm takes between quadratic and cubic time. sklearn.linear_model.SGDRegressor can fit a linear regression on such datasets without trouble, so try that instead. If linear regression won't hack it, transform your data with a kernel approximation before feeding it to SGDRegressor to get a linear-time approximation of an RBF-SVM.
You may have seen the scikit learn documentation for the RBF function. Considering what C and gamma actually do and the fact that the SVR training time is at worst quadratic in the number of samples, I would try training first on a small subset of the data. By first getting a result for all parameter settings and then scaling up the amount of training data used, you might find you actually only need a small sample of the data to get results very close to the full set.
This is the advice I was given by my MSc project supervisor recently, as I had the exact same problem. I found that out of a set of 120k examples with 250 features I only needed around 3000 samples to get within 2% of the error of the full set models.
Sorry this isn't answering your question directly, but I thought it might help.

How many features can scikit-learn handle?

I have a csv file of [66k, 56k] size (rows, columns). It's a sparse matrix. I know that numpy can handle a matrix of that size. Based on everyone's experience, how many features can scikit-learn algorithms handle comfortably?
Depends on the estimator. At that size, linear models still perform well, while SVMs will probably take forever to train (and forget about random forests since they won't handle sparse matrices).
I've personally used LinearSVC, LogisticRegression and SGDClassifier with sparse matrices of size roughly 300k × 3.3 million without any trouble. See @amueller's scikit-learn cheat sheet for picking the right estimator for the job at hand.
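For illustration, this is what feeding a SciPy sparse matrix straight into one of those linear models looks like. It is a small sketch only: the matrix is random and far smaller than the sizes mentioned above, and the labels are generated from a made-up linear rule.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Random sparse design matrix in CSR format (assumed toy data).
rng = np.random.default_rng(0)
X = sparse.random(2000, 5000, density=0.01, format="csr", random_state=0)

# Labels from a hidden linear rule so that there is something to learn.
w_true = rng.normal(size=5000)
y = (X @ w_true > 0).astype(int)

clf = SGDClassifier(random_state=0)
clf.fit(X, y)                      # X is never converted to a dense array
train_accuracy = clf.score(X, y)
print(X.shape, X.nnz, train_accuracy)
```

Only the non-zero entries are stored and visited, which is why the number of columns matters much less for these estimators than the number of non-zeros.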
Full disclosure: I'm a scikit-learn core developer.
Some linear model (Regression, SGD, Bayes) will probably be your best bet if you need to train your model frequently.
Before you go running any models, though, you could try the following:
1) Feature reduction. Are there features in your data that could easily be removed? For example, if your data is text or ratings based, there are lots of known options available.
2) Learning curve analysis. Maybe you only need a small subset of your data to train a model, and after that you are only fitting to your data or gaining tiny increases in accuracy.
Both approaches could allow you to greatly reduce the training data required.
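The learning curve analysis in point 2 can be sketched with scikit-learn's learning_curve helper. The dataset here is synthetic, and the estimator and train sizes are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data (assumed, for illustration only).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Cross-validated scores at 10%, 32.5%, 55%, 77.5% and 100% of the train folds.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3)

mean_val = val_scores.mean(axis=1)
for n, score in zip(train_sizes, mean_val):
    print(f"{n:4d} training samples -> mean validation accuracy {score:.3f}")
```

If the validation score flattens out well before the largest size, training on a subset of that size is probably enough.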

