I need to build a time series forecaster. I have more than 3000 devices, each monitored with 3 key performance indicators, and I would like to make predictions for each device. What would be the best approach for this?
I need to build this in Python.
Related
We are exploring various ways to create the training and test sets to evaluate a Learning to Rank (LTR) model.
In a Learning to Rank scenario, for each query, there are a number of associated documents grouped according to their relevance judgments.
Therefore data splitting should be managed differently than other supervised machine learning techniques (where samples are simply split between training and test sets in a certain percentage) since two important elements need to be taken into account: the query Id and the relevance label.
How do I choose how to divide the data? What is the best approach to handling queries when creating the two sets? Do you split within a query, or hold out entire queries for the test set? Which approach has worked best for you?
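One of the two options raised above, holding out entire queries, can be sketched with scikit-learn's GroupShuffleSplit, using the query Id as the group key. This is only an illustration of that option, not a recommendation over splitting within a query; the toy data, array names, and the 33% test fraction are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 6 queries with 4 candidate documents each.
rng = np.random.default_rng(0)
n_queries, docs_per_query = 6, 4
query_ids = np.repeat(np.arange(n_queries), docs_per_query)
X = rng.normal(size=(query_ids.size, 5))              # document features
relevance = rng.integers(0, 3, size=query_ids.size)   # graded labels 0..2

# Split by group so that all documents of a query land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, relevance, groups=query_ids))

train_queries = set(query_ids[train_idx].tolist())
test_queries = set(query_ids[test_idx].tolist())
print("train queries:", sorted(train_queries))
print("test queries:", sorted(test_queries))
```

Because the split is made on query Ids, no query contributes documents to both sets. It does not balance relevance labels between the sets, so the label distribution on each side would still need to be checked separately.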
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. Each log file contains time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable recording the faults that occurred. The aim is to build a model that takes the log files (the time series) as input and predicts whether or not there will be a failure. I have some questions:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? For some kinds of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series to sub time series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones compared to the number of zeros is negligible. As a result, the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed. Look at some window of time going back and calculate moving averages, min/max values, and the standard deviation over that same window. To tackle this problem you will need a solid set of features. Take a look at the featuretools package. You can either use it directly or use it to get ideas about which features can be created from time series data. Back to your questions.
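The window features described above can be sketched with pandas as follows. This is a minimal illustration, not a complete feature pipeline: the function name `rolling_features`, the window of 10 samples, and the synthetic sensor values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def rolling_features(log: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Summarise each sensor over the last `window` samples (current row included)."""
    rolled = log.rolling(window=window, min_periods=window)
    parts = {
        "mean": rolled.mean(),
        "min": rolled.min(),
        "max": rolled.max(),
        "std": rolled.std(),
    }
    features = pd.concat(
        [frame.add_suffix(f"_{name}_{window}") for name, frame in parts.items()],
        axis=1,
    )
    # The first window-1 rows have no full history, so drop them.
    return features.dropna()

# Synthetic stand-in for one log file (values are made up).
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "Temperature": rng.normal(70.0, 2.0, size=100),
    "Pressure": rng.normal(30.0, 1.0, size=100),
    "MotorSpeed": rng.normal(1500.0, 50.0, size=100),
})
features = rolling_features(log, window=10)
print(features.shape)
```

Each row of `features` only looks backwards in time, so it can be paired with a fault label from a later time step and fed to a random forest or gradient boosting model.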
1) What is the best model capable of doing this? Traditional ML methods as mentioned above. You could also use deep learning models, but I would first start with easy models. Also if you do not have a lot of data I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample the minority class or undersample the majority class. For oversampling, look at the SMOTE technique, which is available for example in the imbalanced-learn package.
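As a rough illustration of oversampling, the sketch below uses plain random oversampling with NumPy, which simply duplicates minority-class rows. It is a simpler stand-in for SMOTE, which instead creates synthetic points by interpolating between minority neighbours. The function name `random_oversample` and the toy arrays are assumptions made for the example.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing number of rows with replacement from this class.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    order = rng.permutation(np.concatenate(keep))
    return X[order], y[order]

# Toy data: 18 healthy windows (label 0) and 2 fault windows (label 1).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 18 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

Resampling should be applied to the training split only, so that the test set keeps the real class ratio and the evaluation stays honest.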
Good luck
Do you know if it's possible to use a very small subset of my training data (only 100 or 500 instances, for example) to quickly train very rough CNNs in order to compare different architectures, and then select the best-performing one?
When I say "possible", I mean: is there evidence that this kind of selection strategy works, and that the selected network will consistently outperform the others for this specific task?
Thank you,
For information, the project in question would consist of two-stage CNNs to classify multichannel time series. The first CNN would forecast the input data over the next period of time, then the second CNN would use this forecast and classify the results into two categories.
The procedure you are talking about is actually used in practice. When tuning hyperparameters, a lot of people select a subset of the whole dataset to do this.
Is the best architecture on the subset necessarily the best on the full dataset? NO! However, it's the best guess you have and that's why it's useful.
A couple of things to note on your question:
100-500 instances is extremely low! The CNN still needs to be trained. When we say subset we usually mean tens of thousands of images (out of the millions of the dataset). If your dataset is under 50000 images then why do you need a subset? Train on the whole dataset.
Contrary to what a lot of people believe, the details of the architecture are of little importance to the classification performance. Some of the hyperparameters you mention (e.g. kernel size) are of secondary importance. The key things you should focus on are depth, size of layers, use of pooling/skip connections/batch norm/dropout, etc.
It is a common practice to normalize input values (to a neural network) to speed up the learning process, especially if features have very large scales.
In theory, normalization is easy to understand. But I wonder how this is done if the training data set is very large, say 1 million training examples. If the number of features per training example is large as well (say, 100 features per training example), 2 problems pop up all of a sudden:
- It will take some time to normalize all training samples
- Normalized training examples need to be saved somewhere, so that we need to double the necessary disk space (especially if we do not want to overwrite the original data).
How is input normalization solved in practice, especially if the data set is very large?
One option may be to normalize inputs dynamically in memory, per mini-batch, while training. But the normalization results would then change from one mini-batch to another. Would that be tolerable?
Maybe someone on this platform has hands-on experience with this question. I would really appreciate it if you could share your experiences.
Thank you in advance.
A large number of features makes it easier to parallelize the normalization of the dataset, so this is not really an issue. Normalization on large datasets is easily GPU accelerated and quite fast, even for datasets as large as the one you are describing. One of the frameworks I have written can normalize the entire MNIST dataset in under 10 seconds on a 4-core 4-thread CPU. A GPU could easily do it in under 2 seconds. Computation is not the problem. For smaller datasets you can hold the entire normalized dataset in memory, but for larger datasets, like the one you mentioned, you will need to swap out to disk if you normalize the entire dataset. However, if you are using reasonably large batch sizes, about 128 or higher, your minimums and maximums will not fluctuate that much, depending upon the dataset. This allows you to normalize each mini-batch right before you train the network on it, but again this depends upon the network. I would recommend experimenting with your datasets and choosing the best method.
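A related variant that avoids both the disk copy and the batch-to-batch fluctuation is to make one streaming pass that records only the per-feature minimum and maximum, then scale each mini-batch in memory with those fixed values. The sketch below assumes min-max scaling to [0, 1]; the helper names and the in-memory array standing in for data read from disk are assumptions made for the example.

```python
import numpy as np

def iter_batches(data, batch_size):
    """Yield consecutive mini-batches; stands in for reading chunks from disk."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def streaming_min_max(batches):
    """One pass over the data, keeping only per-feature minima and maxima."""
    lo = hi = None
    for batch in batches:
        b_lo, b_hi = batch.min(axis=0), batch.max(axis=0)
        lo = b_lo if lo is None else np.minimum(lo, b_lo)
        hi = b_hi if hi is None else np.maximum(hi, b_hi)
    return lo, hi

def scale(batch, lo, hi):
    """Min-max scale a batch to [0, 1] using fixed, dataset-wide statistics."""
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (batch - lo) / span

# Synthetic data with 100 features per example (values are made up).
rng = np.random.default_rng(0)
data = rng.uniform(-50.0, 200.0, size=(10_000, 100))

lo, hi = streaming_min_max(iter_batches(data, 128))
for batch in iter_batches(data, 128):
    x = scale(batch, lo, hi)  # normalised in memory, never written to disk
    # ... train on x here
```

Only two vectors of length 100 need to be stored, and the same `lo` and `hi` must be reused at inference time so that new inputs are scaled consistently with the training data.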
I'm working on a TREC task involving use of machine learning techniques, where the dataset consists of more than 5 terabytes of web documents, from which bag-of-words vectors are planned to be extracted. scikit-learn has a nice set of functionalities that seems to fit my need, but I don't know whether it is going to scale well to handle big data. For example, is HashingVectorizer able to handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?
HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory for instance.
You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go to monitor the accuracy of the partially trained model without waiting for having seen all the samples.
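The batching loop described above might look like the following minimal sketch. The tiny in-memory corpus, the batch size of 50, and the helper `iter_batches` are assumptions made for the example; in a real setting each batch would be read from disk instead.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def iter_batches(docs, labels, batch_size):
    """Yield (documents, labels) chunks; stands in for streaming from disk."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]

# Made-up toy corpus with two classes.
train_docs = ["buy cheap pills now", "project meeting agenda attached"] * 100
train_labels = [1, 0] * 100
val_docs = ["cheap pills", "meeting agenda", "buy now", "project attached"]
val_labels = [1, 0, 1, 0]

# HashingVectorizer is stateless, so it needs no fit pass over the corpus.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(random_state=0)
classes = [0, 1]  # partial_fit must be told all classes up front

X_val = vectorizer.transform(val_docs)
scores = []
for batch_docs, batch_labels in iter_batches(train_docs, train_labels, 50):
    X_batch = vectorizer.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=classes)
    # Monitor held-out accuracy without waiting for the full pass.
    scores.append(clf.score(X_val, val_labels))

print(scores)
```

Only one batch of hashed documents is held in memory at a time, and the list of held-out scores shows how the partially trained model improves as more batches are seen.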
You can also do this in parallel on several machines on partitions of the data and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
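The averaging step can be sketched as below. For simplicity the two "machines" are simulated by a plain loop over two made-up partitions, and the final model is built by copying one fitted classifier and overwriting its coefficients; both choices are assumptions made for the example, not the setup used in the talk.

```python
import copy

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)
classes = [0, 1]

# Made-up partitions; in practice each would live on a different machine.
partitions = [
    (["buy cheap pills now", "project meeting agenda attached"] * 50, [1, 0] * 50),
    (["cheap pills discount", "agenda for the next meeting"] * 50, [1, 0] * 50),
]

# Train one independent linear model per partition.
models = []
for docs, labels in partitions:
    clf = SGDClassifier(random_state=0)
    clf.partial_fit(vectorizer.transform(docs), labels, classes=classes)
    models.append(clf)

# Average the learned parameters into a single linear model.
final = copy.deepcopy(models[0])
final.coef_ = np.mean([m.coef_ for m in models], axis=0)
final.intercept_ = np.mean([m.intercept_ for m in models], axis=0)

prediction = final.predict(vectorizer.transform(["cheap pills meeting"]))
print(prediction)
```

Averaging is meaningful here because every model shares the same hashed feature space, so position i of coef_ refers to the same feature on every machine.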
I discuss this in this talk I gave in March 2013 at PyData: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel: https://github.com/ogrisel/parallel_ml_tutorial