I have a dataset, which you can find in the (updated) file here, containing many different characteristics of office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to train an algorithm on this dataset so that it can predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms in the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) to predict this continuous variable. Surface area and number of workers had a correlation with the target variable between 0.3 and 0.4, so I assumed they were good features and included them in the training of the model. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
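Roughly, this is the kind of thing I have been running (a simplified sketch; the column names below are placeholders for my actual headers, only 'kwh' matches the real file):

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("buildings.csv")           # ~200 records
X = df[["surface_area", "n_workers"]]       # placeholder column names
y = df["kwh"]

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         scoring="neg_mean_absolute_error", cv=5)
print("MAE:", -scores.mean())               # around 13350 for me
```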
I would be very grateful if someone could give me some advice, or examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing to do in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of samples is very small, but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community will understand (in this case English). After a bit of tinkering, I found out that the language being used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function which (after using Google Translate) tells you what the purpose of the building is. Intuitively, this should correlate strongly with power consumption: industrial buildings tend to use more power than ordinary households. After translation, I found that the main types are Residential, Office, Accommodation and Meeting. This feature therefore has to be encoded as a nominal variable to train the model.
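For example, a minimal sketch of one-hot encoding it with pandas (the file name is a placeholder, and I am assuming building_function holds those translated category values):

```python
import pandas as pd

df = pd.read_csv("buildings.csv")  # placeholder file name
df = pd.get_dummies(df, columns=["building_function"], prefix="func")
# df now has binary columns such as func_Residential, func_Office, ...,
# which any scikit-learn regressor can consume directly.
```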
Another feature, hoofsbi, also seems to have some variance, but I do not know what it means.
If you could translate the headers in the data and share it, I would be able to provide you some code to perform this regression task. In such tasks it is very important to understand what the data is and then perform feature engineering accordingly.
I am working on wastewater data, collected every 5 minutes. This is the sample data.
Thresholds for the individual parameters are provided. My question is: what kind of model should I use to classify a sample as usable or not usable, and also to output the anomaly that makes it unusable (if possible, since it may be a combination of the variables)? The yes/no label column does not exist yet and will be provided to me.
The other question I have is: how do I keep the model running, given that the data is collected every 5 minutes?
Your data and use case seem like a good fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited to structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit-learn is super mature and easy to use, so you should be able to get something working without too much trouble.
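A minimal sketch of what that could look like, assuming a CSV of sensor readings plus the yes/no label column you will receive (all names here are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("wastewater.csv")          # placeholder file name
X = df.drop(columns=["usable"])             # the sensor parameters
y = df["usable"]                            # the yes/no label you will be given

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = DecisionTreeClassifier(max_depth=4)   # a shallow tree stays interpretable
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))
print(export_text(clf, feature_names=list(X.columns)))  # human-readable rules
```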
As regards timing, I'm not sure how you or your employee will be taking samples, so I can't say. If you will be getting and reading samples at that rate (every 5 minutes), using your model to label them should not be a problem, but I'm not sure I understood your situation.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix it?", and not so much at general questions such as this one. There are other Stack Exchange sites specially dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!
So I have a time series that only contains traffic volume. I've tried FB Prophet and NeuralProphet. They work okay, but I would like to try a machine learning approach. So far my problem is constructing features. Using the classic day-of-year, month, etc. does not give me good results. I have tried using shift to get the average, minimum and maximum of the two previous days. That works, but when I try to predict several days in advance the feature breaks down, since I can't compute the average for a day that hasn't happened yet. My main concern is finding a good feature that my future-prediction dataframe will also have. A picture of my data is included. Does anyone know how I would do this?
First of all, some definitions need clarifying. FBProphet works on the same mechanism as any machine learning algorithm: fit the model, then predict the output. Being an additive regression model with a piecewise linear or logistic growth trend, it can be considered a machine learning method that predicts a continuous outcome variable.
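For reference, the fit/predict interface looks like this (the file name is a placeholder; Prophet expects a DataFrame with 'ds' and 'y' columns):

```python
import pandas as pd
from prophet import Prophet  # older versions: from fbprophet import Prophet

df = pd.read_csv("traffic.csv")               # needs 'ds' (date) and 'y' (volume)
m = Prophet()                                 # additive model, piecewise trend
m.fit(df)                                     # the "training" step
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)                  # the "prediction" step, 30 days ahead
print(forecast[["ds", "yhat"]].tail())
```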
Secondly, I think you missed the most important term your question is about, namely feature engineering.
Feature engineering includes:
- the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data;
- the process of transforming raw data into features that better represent the underlying problem to the predictive models; etc.
But you don't use machine learning to do feature engineering; you do feature engineering to improve your machine learning model. Many techniques, such as imputation, handling outliers, binning, log transforms, one-hot encoding, grouping operations, feature splitting and scaling, are hybrid methods combining a statistical approach and/or domain knowledge.
Regarding your data, bearing in mind that seasonality is already handled by FBProphet, I am not confident that feature engineering transformations such as adding the day of the week, adding holiday periods, etc. would really improve performance...
To conclude, it is not possible to create new features ex nihilo that would make your model outperform: whether you process/transform your data or add an external domain-knowledge dataset, the information has to be there already.
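That said, if you do want lag features that survive multi-step forecasting, one standard trick is to lag by at least the forecast horizon, so every feature is still computable for the days you are predicting. A minimal sketch (the file and column names are assumptions):

```python
import pandas as pd

h = 7  # forecast horizon in days (assumed)
df = pd.read_csv("traffic.csv", parse_dates=["date"], index_col="date")

# Both features look back at least h days, so they can be computed
# for any date up to h days into the future:
df["volume_lag_h"] = df["volume"].shift(h)
df["volume_mean_prev_week"] = df["volume"].shift(h).rolling(7).mean()
df = df.dropna()
```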
I have a general question on machine learning that can apply to any algorithm. Suppose I have a particular problem, say predicting whether a soccer team wins or loses. The features I choose are the amount of sleep each player gets before the game, sentiment analysis of news coverage, etc.
In this scenario, there is a pattern or correlation (something only a machine learning algorithm can pick up on) that only occurs around 5% of the time. But when it occurs, it is very predictive of the upcoming match.
How do you set up a machine learning algorithm to handle such a case, so that it has the ability to discard most samples as noise? For example, consider a binary SVM: if there were a way to discard most of the "noisy" samples, a lot less overfitting would occur, because the hyperplane would not have to account for the error from those samples.
Regularization would help in this case, but given the very low percentage of predictive information, is there a way to code the algorithm to discard these samples in training and refuse to predict on certain test samples?
I have also read into confidence intervals but they seem more of an analytic tool to me than something to use in the algorithm.
I was thinking that using another ML algorithm, with the same features, to decide which testing samples are keepers might be a good idea; something like the sketch below is roughly what I have in mind.
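(Purely a hypothetical sketch on synthetic data, not code from my project: keep a prediction only when the model itself is confident, abstain on the rest.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

confident = proba.max(axis=1) >= 0.9  # confidence threshold (assumed value)
preds = np.where(confident, clf.classes_[proba.argmax(axis=1)], -1)
# -1 marks "refuse to predict"; only the high-confidence samples get a label
print((preds != -1).mean())           # fraction of samples we actually predict on
```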
Any answer using any machine learning algorithm (e.g. SVM, neural net, random forest) as an example would be much appreciated. Any suggestions on where to look would be great as well (Google is usually my friend, but not this time). Please let me know if I can rephrase the question better. Thanks.
I am new to machine learning and have a problem I want to solve; I'd like to see if anyone has ideas about what type of algorithm would be best to use. I am not looking for code, but rather a process.
Problem: I am classifying people into 2 categories: high risk and low risk. (This is a very basic starting point, and I will expand as I learn how to do more detailed classification.)
Each person has 11 variables I am looking at, and each variable has a binary value (0 for no, 1 for yes). The variables are things like is_married, gun_owner, home_owner, etc. So I gather each person can have one of 2^11 = 2048 different combinations of these variables.
I have a dataset that contains this information along with the outcome (whether or not they committed a crime). I figured this data would be used for training, and then the algorithm could make predictions about high-risk individuals.
Does anyone have ideas about which algorithm would be best? Since there are so many variables, I am having trouble figuring out what might work best.
This is a binary classification problem, with each input a binary string of length 11. There are many algorithms for this problem. The simplest is the naive Bayes model (https://en.wikipedia.org/wiki/Naive_Bayes_classifier). You could also try linear classifiers such as logistic regression or an SVM; both work well for linearly separable data and binary classification.
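A minimal sketch with Bernoulli naive Bayes, which is designed for exactly this kind of 0/1 feature vector (the data below is synthetic and the label encoding is an assumption):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 11))  # 500 people, 11 binary variables
y = rng.integers(0, 2, size=500)        # 1 = committed a crime (placeholder label)

clf = BernoulliNB()
clf.fit(X, y)

print(clf.predict(X[:5]))        # predicted class per person
print(clf.predict_proba(X[:5]))  # class probabilities, usable as a risk score
```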
It seems like you want to classify people based on a few features, which looks like a simple binary classification problem. However, it is not clear whether your data is labeled or not.
So the first question is: in your dataset, do you know which people are 'high risk' and which are 'low risk'? If you have that information, you can use a whole range of machine learning models for this classification task.
However, if the labels ('high risk' or 'low risk') are not present, you cannot do that; then you have to think about unsupervised learning methods (clustering), as in the rough sketch below. Hope this answers your question.
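(A rough sketch of the clustering route on synthetic data; k-means is only one choice, and for purely binary features something like k-modes is arguably a better fit.)

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 11))  # 500 people, 11 binary variables

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # cluster assignment per person
print(km.cluster_centers_)   # per-variable rates; inspect these to name the clusters
```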
I am working on an ML language-identification project (Python) that requires a multi-class classification model with high-dimensional feature input.
Currently, all I can do to improve accuracy is trial and error: mindlessly combining the available feature extraction algorithms and ML models and seeing if I get lucky.
I am asking whether there is a commonly accepted workflow that finds an ML solution systematically.
This thought might be naive, but I am wondering if I can somehow visualize this high-dimensional data and my model's decision boundaries, in the hope that the visualization helps me do some tuning. In MATLAB, after training, I can choose any two features and MATLAB will plot a decision boundary over them. Can I do this in Python?
Also, I am looking for some types of graphs that I can use in the presentation to introduce my model and features. What are the most common graphs used in the field?
Thank you
Feature engineering is more of an art than a technique. It may require domain knowledge, or you can try adding, subtracting, dividing and multiplying different columns to create features and check whether they add value to the model. If you are using linear regression, the adjusted R-squared value should increase; in tree models, you can look at the feature importances, etc.
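For example, a quick sketch (on synthetic data) of reading feature importances from a random forest to check whether an engineered feature pulls its weight:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

for name, imp in zip([f"f{i}" for i in range(8)], clf.feature_importances_):
    print(name, round(imp, 3))  # near-zero importance -> the feature adds little
```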