I have a dataset with daily observations of sales for 1000 company shops over the last 3 years (apart from the sales figure itself, I also have features like promotions, shop type, assortment type, etc.).
The goal is to build a model to predict future sales. How would you build a model from 1000 time series and generalize it such that it could be used to predict sales for 1 shop with certain features?
The dataset is similar to: https://www.kaggle.com/c/rossmann-store-sales/notebooks.
Based on the Python solutions for this dataset provided on Kaggle, I noticed that nearly everyone uses XGBoost, but I have some doubts about these solutions and would be thankful for some clarification. In particular:
How can people just load data with daily observations for over 1000 shops across 3.5 years per shop into the model, without one-hot encoding the store IDs first? Won't the model fail at some point because it learns that shop #1040 is better than shop #35, just because of the shop ID?
Traditional one-hot encoding would create 1000 new columns, which is unmanageable; is there nevertheless a way of solving this problem with one-hot encoding?
Why do people extract the "date" feature by adding day, week, and month as separate variables? Isn't that misleading to the model? Why don't people assign the "date" as an index instead?
1 - This comes down to data preparation for XGBoost. You could be hurting the model's ability to learn by not one-hot encoding, but it seems like you have enough features that your model can still learn the correct correlations.
2 - You pretty much answer this question yourself. One-hot encoding would explode the number of features, overwhelming the model, slowing the training process, and, worst of all, it might not help that much (one alternative is sketched after point 3).
3 - This is a common feature-engineering approach: it gives the model more information to work with without completely overshadowing the other features. For example, Store A might have its sales peak on Mondays at 5 most of the time.
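To make points 1-3 concrete, here is a minimal pandas sketch (the column names Store, Date, and Sales follow the Rossmann layout; everything else is an assumption). It extracts calendar parts from the date and, as one alternative to 1000 one-hot columns, mean-encodes the store ID on the training data:

```python
import pandas as pd

# Assumed Rossmann-style columns: Store, Date, Sales; adapt to your file.
df = pd.read_csv("train.csv", parse_dates=["Date"])

# Point 3: extract calendar parts as separate features instead of indexing by Date
df["day"] = df["Date"].dt.day
df["week"] = df["Date"].dt.isocalendar().week.astype(int)
df["month"] = df["Date"].dt.month
df["dayofweek"] = df["Date"].dt.dayofweek

# Points 1/2: instead of 1000 one-hot columns, replace the raw store id with the
# store's mean sales over the training period (a simple target encoding)
store_means = df.groupby("Store")["Sales"].mean()
df["store_mean_sales"] = df["Store"].map(store_means)
```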
Basically you have to try it to see if it works.
I'm sure your fellow Kagglers have tried many more combinations of feature engineering, one-hot encoding of categorical data, different models, and more.
I am working with waste water data, collected every 5 minutes. This is the sample data.
The thresholds of the individual parameters are provided. My question is: what kind of model should I go for to classify the water as usable or not usable, and also to output the anomaly that makes it unusable (if possible, since it is a combination of the variables)? The yes/no label column has not been created yet and will be provided to me.
The other question I have is: how do I keep the model running, given that the data is collected every 5 minutes?
Your data and use case seem a good fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited for structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit-learn is very mature and easy to use, so you should be able to get something working without too much trouble.
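A minimal scikit-learn sketch (the file name, feature columns, and label column are placeholders for your parameter columns and the usable/not-usable label you will receive):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("wastewater.csv")          # 5-minute samples (placeholder file)
X = df[["ph", "turbidity", "conductivity"]] # placeholder parameter columns
y = df["usable"]                            # yes/no label, once it is provided

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

print(clf.score(X_test, y_test))
# The printed rules show which parameter thresholds drove each decision,
# which covers the "why was this sample unusable" part of the requirement.
print(export_text(clf, feature_names=list(X.columns)))
```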
As for timing, I'm not sure how you or your employee will be taking samples, so I can't say for certain. If you will be receiving and reading samples at that rate, using your model to label the data should not be a problem, but I'm not sure I understood your situation.
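If "keep it running" just means scoring each new sample as it arrives, a plain polling loop may be enough (a sketch; read_latest_sample is a hypothetical stand-in for however you actually ingest readings):

```python
import time

def read_latest_sample():
    # Hypothetical: replace with however you fetch the newest 5-minute reading
    return [7.1, 3.2, 480.0]

while True:
    sample = read_latest_sample()
    label = clf.predict([sample])[0]  # clf from the training sketch above
    print(f"usable: {label}")
    time.sleep(5 * 60)                # wait for the next reading
```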
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix this?", and not so much at general questions such as this one. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!
So I have a time series that only has traffic volume. I've tried FB Prophet and NeuralProphet; they work okay, but I would like to do something using machine learning. So far my problem is building features. Using the classical dayofyear, month, etc. does not give me good results. I have tried using shift to get the average, minimum, and max of the two previous days. That works, but when I try to predict days in advance the feature breaks down, since I can't compute the average for a day that hasn't happened yet. My main concern is finding a good feature that my future prediction dataframe will also have. A picture of my data is included. Does anyone know how I would do this?
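Roughly, the shifted features I tried look like this (a sketch; my real file and column names differ):

```python
import pandas as pd

df = pd.read_csv("traffic.csv", parse_dates=["date"], index_col="date")

# Stats over the two previous days. These are the features that break down
# when forecasting several days ahead, because the recent values don't exist yet.
prev2 = df["volume"].shift(1).rolling(2)
df["prev2_mean"] = prev2.mean()
df["prev2_min"] = prev2.min()
df["prev2_max"] = prev2.max()
```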
First of all, some definitions need clarifying. FBProphet works on the same mechanism as any machine learning algorithm: fitting the model and then predicting the output. Being an additive regression model with a piecewise linear or logistic growth-curve trend, it can be considered a machine learning method that predicts a continuous outcome variable.
Secondly, I think you missed the most important term your question is about, namely: feature engineering.
Feature engineering includes:
- the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data;
- the process of transforming raw data into features that better represent the underlying problem to the predictive models; etc.
But machine learning itself is very unlikely to do your feature engineering for you; you do feature engineering in order to improve your machine learning model. Many techniques, such as imputation, handling outliers, binning, log transforms, one-hot encoding, grouping operations, feature splitting, and scaling, are hybrid methods using a statistical approach and/or domain knowledge.
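A few of those techniques in pandas, as a hedged sketch (the file name and columns are assumptions; road_type is a hypothetical categorical column):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("traffic.csv", parse_dates=["date"])

df["volume"] = df["volume"].fillna(df["volume"].median())            # imputation
df["volume"] = df["volume"].clip(upper=df["volume"].quantile(0.99))  # outlier capping
df["log_volume"] = np.log1p(df["volume"])                            # log transform
df["volume_bin"] = pd.qcut(df["volume"], q=4, labels=False)          # binning
df = pd.get_dummies(df, columns=["road_type"])                       # one-hot encoding
```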
Regarding your data, bearing in mind that seasonality is already handled by FBProphet, I am not confident that feature-engineering transformations such as adding the day of the week or holiday periods would really improve performance.
To conclude, it is not possible to create new features ex nihilo that would make your model outperform; all you can do is process/transform your data or add an external domain-knowledge dataset.
I want to train an autoencoder on time series from the systems of an aircraft, to be able to recognize when the pattern from a specific system shows an anomaly (i.e., the actual signal deviates "significantly" from the reconstructed signal).
I'm hesitating between two different approaches to pre-process the data and train the model, and I'd like to consult your collective wisdom and experience for guidance.
First, a quick look at how the data is structured:
Each column is a collection of timestamps from different flights for a specific system.
Since System1 from Aircraft A is completely independent of System1 on Aircraft B, I cannot simply feed the preceding dataset the "regular way" by batching together slices of 'sequence_length' and feeding them to the network. This would only work as long as I'm reading data from the same aircraft, but as soon as I switch aircraft there is no continuity.
Approach number 1: Incremental training
I would train the model the "regular way" using only data from one aircraft at a time. At each iteration I'd keep training the same model, but on a different batch from a different aircraft.
Each iteration would (hopefully) improve on the preceding one.
Pros: that's the "easy" way in terms of pre-processing, and I could leverage the memory effect to identify longer patterns (using RNN or LSTM layers)...
Cons: ...but only as long as the system does not try to find patterns across batches from different aircraft; would that be the case?
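If it helps, approach 1 might look like this in Keras (a generic LSTM autoencoder sketch; the window length, layer sizes, and the data_by_aircraft stand-in are all assumptions about the setup, not the actual pipeline):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_features = 100, 1

# Hypothetical stand-in: one long 1-D signal per aircraft
data_by_aircraft = {"A": np.random.rand(2000), "B": np.random.rand(2000)}

model = keras.Sequential([
    layers.LSTM(32, input_shape=(seq_len, n_features)),
    layers.RepeatVector(seq_len),
    layers.LSTM(32, return_sequences=True),
    layers.TimeDistributed(layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mse")

# One fit call per aircraft, so no training window ever spans two aircraft
for aircraft_id, series in data_by_aircraft.items():
    windows = np.stack([series[i:i + seq_len]
                        for i in range(len(series) - seq_len + 1)])[..., np.newaxis]
    model.fit(windows, windows, epochs=5, verbose=0)
```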
Approach number 2: the "Electrocardiogram way"
This one is inspired by the ECG dataset. Instead of looking at the dataset as one big time-series stream, I could think in terms of flights.
In this case I would resample each flight to the same length. The model would then be trained to identify what's wrong during a flight by looking at a collection of independent flights, and the actual chronology (between flights) would no longer be relevant. So I could train in one go over the whole dataset.
Pros: simplifies the training, and 'sequence_length' is equivalent to the flight duration, so it's easier to work with.
Cons: resampling can involve compression and loss of information, and by assuming each flight is independent of the others I prevent the system from finding longer time patterns that occur over several flights.
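For approach 2, resampling each flight to a common length can be as simple as linear interpolation (a sketch; the flights list and the target length are hypothetical stand-ins):

```python
import numpy as np

# Hypothetical stand-in: flights of varying lengths
flights = [np.random.rand(n) for n in (420, 515, 480)]
target_len = 500  # common length chosen for all flights

def resample_flight(signal, n=target_len):
    # Linearly interpolate the variable-length signal onto n evenly spaced points
    old_x = np.linspace(0.0, 1.0, num=len(signal))
    new_x = np.linspace(0.0, 1.0, num=n)
    return np.interp(new_x, old_x, signal)

X = np.stack([resample_flight(f) for f in flights])  # shape: (n_flights, target_len)
```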
Of course an easy answer would be "try both and see for yourself", and I would totally go for it were time and resources not a constraint.
Do you have any experience or recommendations as to what I should try first?
Thank you for your help,
I have 2 years of historical data on customers, items ordered, and the number of orders. Based on this data, I am trying to predict future sales at the customer-item level. I tried an ARIMA model, which didn't give me the expected results. Any suggestions or implementation references? I am interested in trying LSTMs and am looking for good retail references.
ARIMA models can be challenging to work with, as there are a lot of parameters which need to be set in a reasonable manner, especially if you're trying to model seasonality.
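For reference, this is the kind of parameter choice a seasonal ARIMA asks of you; a statsmodels sketch assuming daily data with a weekly cycle (the file and column names are placeholders):

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = pd.read_csv("orders.csv", parse_dates=["date"], index_col="date")["orders"]

# (p, d, q) and the seasonal (P, D, Q, s) all need to be chosen sensibly;
# s=7 assumes a weekly cycle in daily order counts
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7)).fit(disp=False)
print(model.forecast(steps=14))
```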
Here's a decent starter discussion (with code) on using an LSTM for a retail sales problem: https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
Also, consider identifying some explanatory variables which may impact sales, to add as auxiliary inputs. For example, some items could sell more during certain seasons, around national holidays, etc.
Forecasting at this level of detail is always challenging, so the more variables you can find which explain the observed phenomena, the better you may be able to do.
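A minimal Keras LSTM sketch in the spirit of that tutorial, with room for auxiliary inputs such as a holiday flag (all shapes and the random placeholder data are assumptions; substitute your real windows):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder shapes: 28-day lookback window, 2 features per step
# (e.g. past sales plus a holiday flag as one auxiliary input).
lookback, n_features = 28, 2
X_train = np.random.rand(500, lookback, n_features)  # stand-in for real windows
y_train = np.random.rand(500)                        # stand-in for next-day sales

model = keras.Sequential([
    layers.LSTM(64, input_shape=(lookback, n_features)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=20, validation_split=0.1)
```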
I have a dataset, which you can find in the (updated) file here, containing many different characteristics of office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm that can be trained on the dataset above to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the common machine learning algorithms in the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) to predict this continuous variable. Surface area and number of workers had a correlation with the target variable of between 0.3 and 0.4, so I assumed they were good features and included them in the training of the model. However, I got a mean absolute error of about 13,350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of samples is very small, but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community will understand (in this case, English). After a bit of tinkering, I found that the language being used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function, which (after using Google Translate) tells you what the purpose of the building is. Intuitively, this should have a large correlation with power consumption: industrial buildings tend to use more power than normal households. After translation, I found that the main types were Residential, Office, Accommodation, and Meeting. This feature therefore has to be encoded as a nominal variable to train the model.
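Encoding that nominal variable is one line with pandas (the file name is a placeholder; the category values come from the translation above):

```python
import pandas as pd

df = pd.read_csv("buildings.csv")
# building_function has values like Residential, Office, Accommodation, Meeting;
# one-hot encode it, since the categories have no natural order
df = pd.get_dummies(df, columns=["building_function"])
```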
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you with some code to perform this regression task. It is very important in such tasks to understand the data and then perform feature engineering accordingly.
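In the meantime, a hedged baseline for this regression task (column names are placeholders pending the translated headers; it assumes the remaining columns are numeric after encoding):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("buildings.csv")
df = pd.get_dummies(df, columns=["building_function"])  # nominal feature from above

X = df.drop(columns=["kwh"])  # assumes all other columns are numeric by now
y = df["kwh"]

# With only ~200 rows, cross-validation gives a more honest error estimate
# than a single train/test split.
model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```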