How are machine learning models updated in web apps?
Take scikit-learn, for example: after training a huge model (let's say on 10 GB of data), how might you update the model based on the current day's new data?
Presumably you wouldn't want to update in real time; something like once at the end of each day seems more reasonable. But I can't find a way to do this in scikit-learn. Do you just have to re-train the entire thing on the entire ever-growing data set every day?
A number of estimators in sklearn implement partial_fit, which allows incremental (online) learning. Check this article.
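For example, here is a minimal sketch of a daily incremental update with SGDClassifier (the model path and the daily_update wrapper are illustrative, not a fixed recipe):

```python
import joblib
from sklearn.linear_model import SGDClassifier

MODEL_PATH = "model.joblib"  # hypothetical location for the persisted model

def daily_update(X_new, y_new, classes):
    """Update the persisted model with one day's worth of new data."""
    try:
        model = joblib.load(MODEL_PATH)
    except FileNotFoundError:
        model = SGDClassifier()  # any estimator that exposes partial_fit works
    # `classes` must list every possible label; it is required on the first call.
    model.partial_fit(X_new, y_new, classes=classes)
    joblib.dump(model, MODEL_PATH)
    return model
```

Each call resumes from the persisted coefficients, so only the new day's batch has to be touched rather than the whole ever-growing data set.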
Related
I want an LSTM to learn from newer data. It needs to update itself depending on the trend in the new data, and I wish to save the result, say, in a file. Then I want to load that pre-trained file and test it against any other fresh X, Y, Z files. In other words, I wish to 're-fit' [update, NOT re-train] the model with new data, so that the model parameters are merely updated and not re-initialized. I understand this is online learning, but how do I implement it through Keras? Can someone please advise how to implement it successfully?
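A minimal sketch of that save-then-re-fit loop in Keras (the file name, dummy data, and tiny LSTM are assumptions for illustration only): saving the model also stores the optimizer state, and calling fit on a reloaded model resumes from the stored weights instead of re-initializing them.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential, load_model

# Stand-in data: sequences of length 10 with 3 features each.
X_old, y_old = np.random.rand(100, 10, 3), np.random.rand(100, 1)
X_new, y_new = np.random.rand(20, 10, 3), np.random.rand(20, 1)

# Initial training run: fit once and persist architecture, weights and optimizer state.
model = Sequential([LSTM(16, input_shape=(10, 3)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X_old, y_old, epochs=2, verbose=0)
model.save("lstm_model.h5")

# Later, when fresh data arrives: reload and keep fitting. The weights are
# restored, not re-initialized, so this updates rather than re-trains.
model = load_model("lstm_model.h5")
model.fit(X_new, y_new, epochs=1, verbose=0)
model.save("lstm_model.h5")  # the updated file can be loaded for testing on other data
```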
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. In each log file we have time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that takes the log files (the time series) as input and predicts whether there will be a failure or not. For this I have some questions:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kind of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series into sub-series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones is negligible compared to the number of zeros, so the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed: look at some window of time going back and calculate moving averages, min/max values, and standard deviation over that window. To tackle this problem you will need a set of healthy features. Take a look at the featuretools package; you can either use it or get ideas for features that can be created from time series data. A rough sketch of that kind of rolling-window feature engineering is below.
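A minimal sketch with pandas (the column names and window size are assumptions and should be adapted to your log files, which need a datetime index for time-based windows):

```python
import pandas as pd

def add_rolling_features(df: pd.DataFrame, window: str = "30min") -> pd.DataFrame:
    """df is one log file: a time-indexed frame with sensor columns and a 0/1 fault column."""
    features = df.copy()
    for col in ["Temperature", "Pressure", "MotorSpeed"]:
        rolled = df[col].rolling(window)
        features[f"{col}_mean"] = rolled.mean()
        features[f"{col}_min"] = rolled.min()
        features[f"{col}_max"] = rolled.max()
        features[f"{col}_std"] = rolled.std()
    # Drop the first rows where the window is not yet fully populated.
    return features.dropna()
```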
Back to your questions. 1) What is the best model capable of doing this? Traditional ML methods, as mentioned above. You could also use deep learning models, but I would start with the simple models first. Also, if you do not have a lot of data, I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (available in the imbalanced-learn package).
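A minimal sketch of that oversampling step with imbalanced-learn, using synthetic stand-in data for illustration:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Stand-in for the flattened window features and 0/1 fault labels,
# with roughly 1% positives to mimic the imbalance described above.
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
print("before:", Counter(y))

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_resampled))
# Train on the resampled data, but evaluate on an untouched test split
# so the metrics are not inflated by the synthetic samples.
```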
Good luck
I recently made a disease prediction API (still not solved), but that's not the issue here.
In the same app, I first deployed it in such a way that it trains and predicts on every request; that worked fine. But when I saved a model and used that saved model to predict, I got a 500 Internal Server Error.
I did that because I believe training on every request would directly hit the response time of the API.
So I was curious which is the more CPU-consuming task: predicting with a saved model, or training and predicting on each request. That way I can plan my API accordingly, since cloud machines have specific CPU performance limits.
Of course, it also depends on the tier we choose, and I am working on the free tier of Heroku.
It would be really nice if you guys could answer this.
Regards,
Roshan
If I understand correctly, you are hitting some API endpoint with your request, and the code that runs when that endpoint is hit trains a model and then returns some prediction.
I can't really imagine how this would work in general. Training is a time-consuming process that can take hours or even months (who knows how long). Also, how are you sending the training data to your backend (assuming this data can be arbitrarily large)?
The general approach is to build/train a model offline and then perform only predictions via your API (unless you are building some very low-level cloud API that is to be consumed by other ML developers).
But to answer your question: no, predicting alone can't take more time than training-and-predicting (assuming you are making the prediction on the same data). You are just adding one more, much more computationally intensive, operation to the equation. And since training and predicting are two separate steps that do not influence each other directly, your prediction time stays the same whether you are just predicting or training-and-predicting.
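For illustration, a minimal sketch of the offline half of that approach (the file name and the stand-in data are assumptions): train and persist the whole pipeline on a schedule, outside the request cycle.

```python
# train_offline.py: run on a schedule (e.g. nightly), never inside a request handler.
import joblib
from sklearn.datasets import make_classification  # stand-in for your real training data
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=1000, n_features=10, random_state=0)

# Persist the whole pipeline (pre-processing + model) as one artifact,
# so the API only ever has to load it and call predict().
pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")
```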
Training + predicting is definitely more intensive than predicting alone.
Typically, we train a model and save it as a binary file. Once saved, we use it for predicting.
Keep in mind that you would need to perform the same pre-processing steps you used during training while predicting.
As for the error, I'd suggest you do the following step by step to pinpoint what is causing it:
1) Try to access the API endpoint by sending back a simple JSON reply.
2) Send the input data to the API endpoint and try to return the input as JSON, just to verify that your server is receiving the data as intended. You can also print it out instead of sending back a JSON response.
3) Once you have the data, perform the same pre-processing steps as in training, make a prediction, and send it back to your frontend.
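A minimal sketch of what steps 2 and 3 could look like with Flask (the route, payload field, and model file are assumptions; the model file is the kind of artifact produced by an offline training script such as the one sketched earlier):

```python
# app.py: load the saved artifact once at startup, not inside the request handler.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
pipeline = joblib.load("model.joblib")  # pipeline includes the pre-processing used in training

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Step 2: while debugging, you can simply `return jsonify(payload)` here
    # to confirm the server receives what you think it sends.
    features = [payload["features"]]  # same feature order as during training
    prediction = pipeline.predict(features)[0]
    return jsonify({"prediction": int(prediction)})
```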
I am able to build an Isolation Forest for anomaly detection. However, due to storage limitations, I cannot store all the data I used to train it. I would also like to input more data later.
I was wondering if it's possible to get the estimator values when I originally train it and save those. Then, a week later, when I want to retrain the model with some newly acquired data, could I first restore my old model from those stored estimator values (so I don't need access to the old data) and then have the model adapt to the newly added values?
The reason I have chosen to resort to this is that I couldn't find any anomaly detection algorithms that learn iteratively (so a free, open-source suggestion in that department would work great too!).
Any help with this would be deeply appreciated!
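One possible way to approximate this with scikit-learn (0.21 or later) is the warm_start option on IsolationForest: persist the fitted forest itself, then later grow additional trees on the newly acquired data only. The sketch below uses random stand-in data and hypothetical file names; note that the existing trees are never revisited, so this is an approximation rather than true online learning.

```python
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_initial = rng.normal(size=(1000, 5))  # stand-in for this week's data

# Week 0: fit and persist the estimator itself, so the raw training
# data no longer needs to be kept around.
forest = IsolationForest(n_estimators=100, warm_start=True, random_state=0)
forest.fit(X_initial)
joblib.dump(forest, "iforest.joblib")

# A week later: restore the saved forest and grow extra trees on the new data only.
X_new = rng.normal(size=(500, 5))  # stand-in for newly acquired data
forest = joblib.load("iforest.joblib")
forest.n_estimators += 20  # request a larger ensemble
forest.fit(X_new)          # with warm_start=True this adds trees and keeps the old ones
joblib.dump(forest, "iforest.joblib")
```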
I have received tens of thousands of user reviews for the app, and I know many of the comments mean the same thing. I cannot read them all, so I would like to use a Python program to analyze every comment and identify the most frequent and most important feedback. How can I do that?
I can download all of the app's comments, and I have a preliminary understanding of the Google Prediction API.
You can use the Google Prediction API to characterize your comments as important or unimportant. What you'd want to do is manually classify a subset of your comments. Then you upload the manually classified data to Google Cloud Storage and, using the Prediction API, train your model. This step is asynchronous and can take some time. Once the trained model is ready, you can use it to programmatically classify the remaining (and any future) comments.
Note that the more comments you classify manually (i.e. the larger your training set), the more accurate your programmatic classifications will be. Also, you can extend this idea as follows: instead of a binary classification (important/unimportant), you could use grades of importance, e.g. on a 1-5 scale. Of course, that entails more manual labor in constructing your model so the best strategy will be a function of your needs and how much time you can spend building the model.
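If you'd rather not depend on the cloud service, the same label-a-subset-then-classify-the-rest workflow can be sketched locally with scikit-learn (the comments, labels, and model choice below are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A manually labelled subset of comments: 1 = important, 0 = unimportant.
labelled_comments = ["app crashes on startup", "nice app", "login button does nothing"]
labels = [1, 0, 1]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(labelled_comments, labels)

# Classify the remaining (and any future) comments programmatically.
unlabelled_comments = ["keeps freezing when I upload a photo", "great!"]
print(clf.predict(unlabelled_comments))
```

The same trade-off applies here as with the cloud approach: the more comments you label by hand, the better the programmatic classifications will be.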