Why does testing take longer than training? - python

I'm training sklearn's KNeighborsClassifier on the MNIST digits dataset.
Here is the code:
import time
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
start_time = time.time()
print(start_time)
knn.fit(X_train, y_train)
elapsed_time = time.time() - start_time
print(elapsed_time)
It takes about 40 seconds. However, when I predict on the test data it takes more than a few minutes (still running), even though there is six times less test data than training data.
Here is the code:
from sklearn.metrics import confusion_matrix

y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
Could you explain why it takes so much time (more than training)? Is there something I can do about it?

Think about how the k-NN algorithm works. It is a classic example of lazy learning: at prediction time, the distances from each query point to the original training data have to be calculated (to decide which training points are its closest neighbours).
At training time, it doesn't need to do any expensive distance calculations; .fit() essentially just stores (or indexes) the training data.
So the difference is mostly about what happens in .fit() versus .predict().
If you actually tried to predict the training set itself, it would take even longer.
For more information, see e.g. Wikipedia.
For solutions: think about whether this algorithm is actually ideal for your use case, or whether you could get by with a cruder approximation of the distances.
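As a rough illustration of that last suggestion (not from the original answer, and assuming the X_train/X_test from the question): for high-dimensional data such as MNIST, reducing the dimensionality before the neighbour search and using all CPU cores can make predict() considerably faster.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Project the 784 MNIST pixels down to 50 components before the neighbour
# search, and use all CPU cores for the distance computations at predict time.
knn = make_pipeline(PCA(n_components=50),
                    KNeighborsClassifier(n_neighbors=5, n_jobs=-1))
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)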

Related

Tensorflow slow on first prediction, much faster after

I have trained a model and saved the weights. At a later date, running the Python script from scratch, I load the model and the weights and make a prediction; this takes, for example, 10 seconds, while all predictions afterwards take 0.5 seconds.
I am measuring the time of the prediction only:
from time import perf_counter

t = perf_counter()
a = model.predict(p, verbose=0, workers=8).reshape(1, -1)
print(f'prediction took {perf_counter()-t} seconds')
I was expecting there to be no difference.
I see this post Tensorflow JS first prediction delay but not sure in my case how to "warm up".
I am coding a server, hence the concern: the first time someone issues a request for a prediction (and in my case that's 10 of them), the user needs to wait a long time, which in this use case is not good.
Thanks for the help!
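A minimal warm-up sketch, assuming the model accepts inputs of a placeholder shape (1, 128): run one dummy prediction at server start-up, so the graph tracing/compilation cost is paid before the first real request arrives.
import numpy as np
from time import perf_counter

# Dummy prediction at start-up; (1, 128) is a placeholder for the real input shape
dummy = np.zeros((1, 128), dtype=np.float32)
t = perf_counter()
model.predict(dummy, verbose=0)
print(f'warm-up took {perf_counter() - t} seconds')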

Why does my LSTM model predict wrong values although the loss is decreasing?

I am trying to build a machine learning model which predicts a single number from a series of numbers. I am using an LSTM model with TensorFlow.
You can imagine my dataset looking something like this:
Index | x data                       | y data
0     | np.array of shape (10000, 1) | numpy.float32
1     | np.array of shape (10000, 1) | numpy.float32
2     | np.array of shape (10000, 1) | numpy.float32
...   | ...                          | ...
56    | np.array of shape (10000, 1) | numpy.float32
Simply put, I just want my model to predict a number (y data) from a sequence of numbers (x data).
For example like this:
array([3.59280851, 3.60459062, 3.60459062, ...]) => 2.8989773
array([3.54752101, 3.56740332, 3.56740332, ...]) => 3.0893357
...
x and y data
From my x data I created a numpy array x_train which I want to use to train the network.
Because I am using an LSTM network, x_train should be of shape (samples, time_steps, features).
I reshaped my x_train array to be shaped like this: (57, 10000, 1), because I have 57 samples, which each are of length 10000 and contain a single number.
The y data was created similarly and is of shape (57,1) because, once again, I have 57 samples which each contain a single number as the desired y output.
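For reference, a minimal sketch of how such arrays could be assembled (x_list and y_list here are hypothetical Python lists holding the 57 sequences and targets, not names from the question):
import numpy as np

# x_list: 57 sequences of 10000 values each; y_list: 57 target floats (hypothetical names)
x_train = np.array(x_list, dtype=np.float32).reshape(57, 10000, 1)
y_train = np.array(y_list, dtype=np.float32).reshape(57, 1)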
Current model attempt
My model summary looks like this:
The model was compiled with model.compile(loss="mse", optimizer="adam") so my loss function is simply the mean squared error and as an optimizer I'm using Adam.
Current results
Training of the model works fine and I can see that the loss and validation loss decreases after some epochs.
The actual problem occurs when I want to predict some data y_verify from some data x_verify.
I do this after the training is finished to determine how well the model is trained.
In the following example I simply used the training data to determine how well the model is trained (I know about overfitting and that verifying with the training set is not the right way of doing it, but that is not the problem I want to demonstrate right now).
In the following graph you can see the y data I provided to the model in blue.
The orange line is the result of calling model.predict(x_verify) where x_verify is of the same shape as x_train.
I also calculated the mean absolute percentage error (MAPE) between my prediction and the actual data, and it came out to around 4%, which is not bad given that I only trained for 40 epochs. But this result is still not helpful at all, because as you can see in the graph above, the curves do not match at all.
Question:
What is going on here?
Am I using an incorrect loss function?
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for each sample, as it's supposed to?
Ideally the prediction should be the y data which I provided so the curves should look the same (more or less).
Do you have any ideas?
Thanks! :)
After some back and forth in the comments, I'll give my best assessment of your questions:
What is going on here?
Very complex (too many layers deep) model with very little data, trained for too few epochs on non-normalized data (credit to Muhammad in his answer). The biggest issue, as far as I can tell, is the number of training epochs.
Am I using an incorrect loss function?
MSE is an appropriate loss function for a regression task.
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for all samples like it's supposed to be? Ideally the prediction should be the y data which I provided so the curves should look the same (more or less). Do you have any ideas?
Too few training epochs is the biggest contributor, as far as I can tell.
Based on the Colab notebook that Luca shared:
30 Epochs, no normalization
Way off target, flat predictions (though I can't reproduce predictions as flat as the ones Luca posted).
30 Epochs, with normalization
Worse off.
2000(!) epochs, no normalization
Okay, now the predictions are at least in the ballpark
2000 epochs, with normalization
And now the model seems to be starting to figure things out, as we'd hope it should. Granted, this is training on the 11 samples that were cobbled together in the notebook, so it's naturally going to overfit. We're just happy to see it learn something.
2000 epochs, normalization, different loss
Never be afraid to try out different losses, as some may be better suited than others. Not knowing the domain of this task, I'm just trying out mean_absolute_error instead of mean_squared_error.
Caution! Don't compare loss values between different losses. They're not on the same scale.
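For instance, swapping the loss is a one-line change (a sketch, keeping the same Adam optimizer as above):
model.compile(loss="mean_absolute_error", optimizer="adam")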
2000 epochs, normalization, larger learning rate
Okay, so it's taking a long time to learn. Can I nudge it along a little faster? Sure, up the learning rate of the optimizer, and it'll get you to where you're going faster. Here, we up it by a factor of 5.
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))
You could even employ a learning rate scheduler that starts big and slowly diminishes it over the course of epochs.
def scheduler(epoch, lr):
    # keep the initial learning rate for the first 400 epochs, then decay it
    if epoch < 400:
        return lr
    else:
        return lr * tf.math.exp(-0.01)

lrs = tf.keras.callbacks.LearningRateScheduler(scheduler)
history = model.fit(x=x_train, y=y_train, epochs=1000, callbacks=[lrs])
Hope this all helps!
From the notebook it seems you are not scaling your data. You should normalize or standardize your data before training your model.
https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
You can also add a normalization layer in Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization
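A minimal sketch of that layer, assuming the shapes from the question (57 samples of 10000 time steps with one feature) and an illustrative small LSTM rather than the original architecture:
import tensorflow as tf

# x_train is assumed to have the shape described in the question: (57, 10000, 1)
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(x_train)  # learns mean and variance from the training data

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10000, 1)),
    norm,                      # scales inputs to zero mean / unit variance
    tf.keras.layers.LSTM(32),  # illustrative size, not the original model
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")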
I just wanted to post a quick update.
First of all, this is my current result:
I am absolutely happy, that I was finally able to achieve what I wanted to. At least to some extent.
There were some steps I had to take to achieve this result:
Normalization
Training for 500-1000 epochs
Most importantly: Reducing the amount of time steps to 1000
In the end my thought of "the more data, the better" was a huge misconception. I was not able to achieve such results with 10000 time steps per sample AT ALL. So I'm glad that I just gave 1000 a shot.
Thank you all very much for your answers!
I will try to further improve my model with your suggestions :)
I think it would help to change the loss to Huber loss and possibly change the optimizer to SGD, then first find the best learning rate using a callback (learning rate schedule), given the small dataset. Also normalize or standardize the data before training the model.
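A sketch of that suggestion (all specific values are illustrative, and model/x_train/y_train are assumed from the question):
import tensorflow as tf

# Huber loss with an SGD optimizer, as suggested above
model.compile(loss=tf.keras.losses.Huber(),
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3))

# Ramp the learning rate up over epochs to find a good value for this dataset
lr_finder = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: 1e-5 * 10 ** (epoch / 20))
history = model.fit(x_train, y_train, epochs=100, callbacks=[lr_finder])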

Statsmodels' Logit.fit_regularized keeps running forever

Lately I've been trying to fit a regularized logistic regression on vectorized text data. I first tried sklearn and had no problem, but then I discovered that I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is, when I try to fit the logit, it keeps running forever and uses about 95% of my RAM (I tried on both 8 GB and 16 GB RAM computers).
My first guess was that it had to do with dimensionality, because I was working with a 2960 x 43k matrix. So, to reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix, which, I think, shouldn't be too problematic.
This is a little sample of my code:
for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
    X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
    y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]

cvectorizer = CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1,1))
vec = cvectorizer.fit(X_train)
X_train_vectorized = vec.transform(X_train)
This gets me a train and a test set, and then vectorizes text from X_train.
Then I try:
import statsmodels.api as sm
logit=sm.Logit(y_train.values,X_train_vectorized.todense())
result=logit.fit_regularized(method='l1')
Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?
Thanks in advance!
Almost all of statsmodels, and all of the inference, is designed for the case when the number of observations is much larger than the number of features.
Logit.fit_regularized uses an interior point algorithm with scipy optimizers, which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimates, which has shape n_features by n_features. The use case for which it was designed is when the number of features is relatively small compared to the number of observations and the Hessian can be held in memory.
GLM.fit_regularized estimates elastic net penalized parameters and uses coordinate descent. It can possibly handle a large number of features, but it does not have any inferential results available.
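For illustration, a minimal sketch of the GLM route (the data here is random stand-in data, not the vectorized text from the question, and the penalty settings are arbitrary):
import numpy as np
import statsmodels.api as sm

# Random dense stand-in for the vectorized text matrix
X = np.random.rand(100, 500)
y = np.random.randint(0, 2, size=100)

# Elastic-net penalized logistic regression via GLM; scales better with many
# features than Logit.fit_regularized, but provides no inferential results.
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
result = model.fit_regularized(method="elastic_net", alpha=0.01, L1_wt=1.0)
print(result.params[:5])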
Inference after Lasso and similar penalization that selects variables has only become available in recent research. See for example selective inference in Python, https://github.com/selective-inference/Python-software, for which an R package is also available.

Is RandomForestRegressor predict() fundamentally slow?

I can only make 2-3 predictions per second with this model which is super slow.
When using LinearRegression model I can easily achieve 40x speedup.
I'm using the scikit-learn Python package with a very simple dataset containing 3 columns (day, hour and result), so basically 2 features.
day and hour are categorical variables.
Naturally there are 7 day and 24 hour categories.
The training sample is relatively small (circa 5000 samples).
It takes just a few seconds to train.
But when I go on to predict something, it's very slow.
So my question is: is this a fundamental characteristic of RandomForestRegressor, or can I actually do something about it?
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100,
                              max_features='auto',
                              oob_score=True,
                              n_jobs=-1,
                              random_state=42,
                              min_samples_leaf=2)
Here are some steps to optimize a RandomForest with sklearn
Do batch predictions by passing multiple datapoints to predict(). This reduces Python overhead.
Reduce the depth of the trees. Use something like min_samples_leaf or min_samples_split to avoid having lots of small decision nodes. To require at least 5% of the training set per leaf, use 0.05.
Reduce the number of trees. With somewhat pruned trees, RF can often perform OK with as little as n_estimators=10.
Use an optimized RF inference implementation like emtrees. Last thing to try, also dependent on prior steps to perform well.
The performance of the optimized model must be validated, using cross-validation or similar. Steps 2 and 3 are related, so one can do a grid-search to find the combination that best preserves model performance.
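Putting steps 1-3 together, a minimal sketch (X_train, y_train and X_batch here are hypothetical arrays, and the exact settings would need validating as described above):
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10,        # fewer trees (step 3)
                              min_samples_leaf=0.05,  # shallower trees (step 2)
                              n_jobs=-1,
                              random_state=42)
model.fit(X_train, y_train)

# Predict many rows in one call instead of looping row by row (step 1)
predictions = model.predict(X_batch)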

SciKit One-class SVM classifier training time increases exponentially with size of training data

I am using the Python SciKit OneClass SVM classifier to detect outliers in lines of text. The text is converted to numerical features first using bag of words and TF-IDF.
When I train (fit) the classifier running on my computer, the time seems to increase exponentially with the number of items in the training set:
Items in training data | Training time
10K                    | 1 sec
15K                    | 2 sec
20K                    | 8 sec
25K                    | 12 sec
30K                    | 16 sec
45K                    | 44 sec
Is there anything I can do to reduce the time taken for training, and avoid that this will become too long when training data size increases to a couple of hundred thousand items ?
Well, scikit-learn's SVM is a high-level implementation, so there is only so much you can do. In terms of speed, from their website: "SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation."
You can increase the kernel cache size parameter (cache_size) based on your available RAM, but this doesn't help much.
You can try changing your kernel, though your model might be incorrect.
Here is some advice from http://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use: Scale your data.
Otherwise, don't use scikit and implement it yourself using neural nets.
Hope I'm not too late. OCSVM, like SVM in general, is resource-hungry, and the relationship between training-set size and training time is quadratic (the numbers you show follow this). If you can, see if Isolation Forest or Local Outlier Factor works for you, but if you're considering applying it to a lengthier dataset, I would suggest creating a manual anomaly-detection model that closely resembles the context of these off-the-shelf solutions. That way you should be able to work either in parallel or with threads.
For anyone coming here from Google, sklearn has implemented SGDOneClassSVM, which "has a linear complexity in the number of training samples". It should be faster for large datasets.
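A minimal sketch of that estimator (the feature matrix here is a random stand-in for the TF-IDF features; the Nystroem step approximates an RBF kernel while keeping training linear in the number of samples):
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

# Random stand-in for a 45K x 300 TF-IDF feature matrix
X = np.random.rand(45_000, 300)

clf = make_pipeline(Nystroem(gamma=0.1, n_components=100, random_state=42),
                    SGDOneClassSVM(nu=0.05, random_state=42))
clf.fit(X)
labels = clf.predict(X)  # +1 for inliers, -1 for outliers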
