Scikit-learn regression RandomForestRegressor implementation - Python

I'm having a problem where my scikit-learn regression forest trains almost instantly (in a few seconds; I know from experience with this machine that it should take about half an hour or more on a dataset of the size I'm working with) and then predicts the exact same output for every row of input data.
My current theory is that it has something to do with the order of magnitude of the target variables, which is around 10^-11. I tried multiplying them by 100,000 to see what happened, and it started running forever without producing anything until I killed the script.
The code is below:
from sklearn.ensemble import RandomForestRegressor

n_estimators = 200
rfr = RandomForestRegressor(n_estimators=n_estimators, verbose=2, n_jobs=-1)
y_train = df_train[target].values * 100000  # rescale the tiny targets
rfr.fit(X_train, y_train)
rfr.predict(X_train)
You're probably wondering why I predicted back on the training data - I was just trying to check whether it was actually doing anything, and it isn't.
Thank you for your help!
Edit:
This is the describe() output for the target data. The training data is mostly similar in magnitude:
count 4.000000e+04
mean -1.062353e-11
std 5.990830e-10
min -1.063333e-08
25% -2.305633e-10
50% -6.325584e-12
75% 2.110687e-10
max 1.564848e-08
I tried standardizing the data and running the forest; it printed no output, but memory usage kept creeping up, so it must be doing something. Do regression forests require a lot of compute? I'm using a laptop with an i7 processor and an OK graphics card; classification runs fine.
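For reference, here is a minimal sketch of the kind of target standardization mentioned above, assuming the same df_train, target, and X_train as in the earlier snippet (this is illustrative, not the exact script used):

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Standardize the tiny targets to zero mean / unit variance before fitting.
scaler = StandardScaler()
y_train_scaled = scaler.fit_transform(df_train[[target]]).ravel()

rfr = RandomForestRegressor(n_estimators=200, n_jobs=-1, verbose=2)
rfr.fit(X_train, y_train_scaled)

# Map predictions back to the original (1e-11) scale.
preds = scaler.inverse_transform(rfr.predict(X_train).reshape(-1, 1)).ravel()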

Related

SKLearn VotingRegressor - why so slow?

I'm trying to work with SciKit-Learn's VotingRegressor, but I find the experience quite frustrating due to the apparent overhead this class adds.
All it should be doing according to the documentation is
...fits several base regressors, each on the whole dataset. Then it averages the individual predictions to form a final prediction.
But in doing so, it somehow increases the runtime enormously. Why?
For example, if I import 6 different regressors and train them individually, it amounts to around 5 minutes of training on my computer. Based on the description, the only additional step the VotingRegressor takes is it averages each predictor's prediction. However, when I pass the same 6 regressors to a VotingRegressor and start training, the training keeps running well above the 20 minute mark.
For computing an average, I wouldn't expect more than a 5-fold increase in runtime (my current training run has passed the 30-minute mark and still hasn't stopped). What overhead is VotingRegressor adding? Keep in mind this is happening on a dataset of roughly 30,000 x 150.
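A minimal sketch of the comparison described above, with placeholder estimators and synthetic data standing in for the six regressors and the roughly 30,000 x 150 dataset actually used (everything here is an assumption, for illustration only):

import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge

X = np.random.rand(5000, 150)  # smaller stand-in for the ~30,000 x 150 dataset
y = np.random.rand(5000)

estimators = [
    ("ridge", Ridge()),
    ("rf", RandomForestRegressor(n_estimators=100, n_jobs=-1)),
    ("gbr", GradientBoostingRegressor()),
]

# Fit each regressor individually and time it.
start = time.time()
for _, est in estimators:
    est.fit(X, y)
print("individual fits:", time.time() - start)

# VotingRegressor clones and refits every estimator from scratch; with the
# default n_jobs=None the fits run sequentially in a single process.
start = time.time()
VotingRegressor(estimators, n_jobs=-1).fit(X, y)
print("VotingRegressor fit:", time.time() - start)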

TPOT taking too long to train

I've been trying to use TPOT for the first time on a dataset that has approximately 7000 rows. When trying to train TPOT on the training dataset, which is 25% of the dataset as a whole, it takes too long. I've been running the code for approximately 45 minutes on Google Colab and the optimization progress is still at 4%. I've just been trying to use the example at http://epistasislab.github.io/tpot/examples/. Is it typical for TPOT to take this long? So far I don't think it's worth even trying to use it.
TPOT can take quite a long time depending on the dataset you have. You have to consider what TPOT is doing: TPOT is evaluating thousands of analysis pipelines and fitting thousands of ML models on your dataset in the background, and if you have a large dataset, then all that fitting can take a long time--especially if you're running it on a less powerful computer.
If you'd like faster results, you have a few options (see the combined sketch after this list):
Use the "TPOT light" configuration, which uses simpler models and will run faster.
Set the n_jobs parameter to -1 or a number greater than 1, which will allow TPOT to evaluate pipelines in parallel. -1 will use all of the available cores and speed things up significantly if you have a multicore machine.
Subsample the data using the subsample parameter. The default is 1.0, corresponding to using 100% of your training data. You can subsample to lower percentages of the data and TPOT will run faster.
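A minimal sketch combining those options, modeled on the linked example but with illustrative parameter values:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True), test_size=0.25)

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    config_dict="TPOT light",  # simpler models and preprocessors, runs faster
    n_jobs=-1,                 # evaluate pipelines on all available cores
    subsample=0.5,             # fit on 50% of the training data
    verbosity=2,
)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))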

Why does R2-value increase after feature-reduction with RFE?

For an exploratory semester project, I am trying to predict the outcome value of a quality control measurement using various measurements made during production. For the project I was testing different algorithms (LinearRegression, RandomForestRegressor, GradientBoostingRegressor, ...). I generally get rather low r2-values (around 0.3), which is probably due to the scattering of the feature values and not my real problem here.
Initially, I have around 100 features, which I am trying to reduce using RFE with LinearRegression() as the estimator. Cross-validation indicates I should reduce to only 60 features. However, when I do so, the R2-value increases for some models. How is that possible? I was under the impression that adding variables to the model always increases R2, and thus reducing the number of variables should lead to lower R2 values.
Can anyone comment on this or provide an explanation?
Thanks in advance.
It depends on whether you are using the testing or training data to measure R2. R2 is a measure of how much of the variance in the data your model captures. So, if you increase the number of predictors, you are correct that you do a better job of predicting exactly where the training data lie, and thus your training R2 should increase (the converse is true for decreasing the number of predictors).
However, if you increase number of predictors too much you can overfit to the training data. This means the variance of the model is actually artificially high and thus your predictions on the test set will begin to suffer. Therefore, by reducing the number of predictors you actually might do a better job of predicting the test set data and thus your R2 should increase.
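A minimal sketch of how this shows up in practice, using synthetic data and RFE with LinearRegression as in the question (all numbers here are illustrative):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 100 features, only a handful of which are actually informative.
X, y = make_regression(n_samples=500, n_features=100, n_informative=10, noise=50.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full = LinearRegression().fit(X_train, y_train)
print("all features: train R2", full.score(X_train, y_train), "test R2", full.score(X_test, y_test))

rfe = RFE(LinearRegression(), n_features_to_select=60).fit(X_train, y_train)
print("60 features:  train R2", rfe.score(X_train, y_train), "test R2", rfe.score(X_test, y_test))

# Training R2 can only drop when features are removed, but test R2 may rise
# because the smaller model overfits less.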

Statsmodels' Logit.fit_regularized keeps running forever

Lately I've been trying to fit a regularized logistic regression on vectorized text data. I first tried sklearn and had no problem, but then I discovered that I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is, when I try to fit the logit it keeps running forever and uses about 95% of my RAM (tried both on 8GB and 16GB RAM computers).
My first guess was it had to do with dimensionality, because I was working with a 2960 x 43k matrix. So, to reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix, which, I think, shouldn't be too problematic.
This is a little sample of my code:
from sklearn.feature_extraction.text import CountVectorizer

for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
    X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
    y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]

cvectorizer = CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1, 1))
vec = cvectorizer.fit(X_train)
X_train_vectorized = vec.transform(X_train)
This gets me a train and a test set, and then vectorizes text from X_train.
Then I try:
import statsmodels.api as sm
logit = sm.Logit(y_train.values, X_train_vectorized.todense())
result = logit.fit_regularized(method='l1')
Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?
Thanks in advance!
Almost all of statsmodels and all the inference is designed for the case when the number of observations is much larger than the number of features.
Logit.fit_regularized uses an interior point algorithm with scipy optimizers which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimate which has shape n_features by n_features. The use case for which it was designed is when the number of features is relatively small compared to the number of observations, and the Hessian can be used in-memory.
GLM.fit_regularized estimates elastic net penalized parameters and uses coordinate descent. This can possibly handle a large number of features, but it does not have any inferential results available.
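For illustration, a minimal sketch of that elastic-net route via GLM, assuming the same y_train and X_train_vectorized as above (the alpha and L1_wt values are placeholders):

import statsmodels.api as sm

# Elastic-net penalized logistic regression via GLM with a Binomial family.
glm = sm.GLM(y_train.values, X_train_vectorized.todense(), family=sm.families.Binomial())
result = glm.fit_regularized(method='elastic_net', alpha=1.0, L1_wt=0.5)
print(result.params)
# Note: these regularized results come without standard errors or p-values.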
Inference after Lasso and similar penalization that selects variables has only become available in recent research. See for example selective inference in Python, https://github.com/selective-inference/Python-software, for which an R package is also available.

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:
Number of items in training data and training time taken:
10K: 1 sec, 15K: 2 sec, 20K: 8 sec, 25K: 12 sec, 30K: 16 sec, 45K: 44 sec.
Is there anything I can do to reduce the time taken for training, and avoid that this will become too long when training data size increases to a couple of hundred thousand items ?
Well, scikit-learn's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size (the cache_size parameter) based on your available RAM, but the increase does not help much.
You can try changing your kernel, though your model might be incorrect.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
Hope I'm not too late. OCSVM, and SVM in general, is resource hungry, and the relationship between training-set size and training time is quadratic (the numbers you show follow this). If you can, see if Isolation Forest or Local Outlier Factor works for you, but if you're considering applying this to a lengthier dataset I would suggest creating a manual AD model that closely resembles the context of those off-the-shelf solutions. That way you should be able to work either in parallel or with threads.
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
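A minimal sketch of the SGDOneClassSVM route on TF-IDF features, assuming a list of document strings called texts; the Nystroem step is an optional kernel approximation, and all parameter values are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

texts = ["sample log line %d with some words" % i for i in range(200)]  # placeholder data

pipe = make_pipeline(
    TfidfVectorizer(),
    Nystroem(gamma=0.1, n_components=50),  # approximate an RBF kernel
    SGDOneClassSVM(nu=0.05),               # linear in the number of samples
)
pipe.fit(texts)
outlier_flags = pipe.predict(texts)  # +1 for inliers, -1 for outliers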
