Regression problem optimization using ML or DL


I have some data (from sensors, etc.) from an energy system. Consider the x-axis to be temperature and the y-axis to be energy consumption. Suppose we only have the data and do not have access to the mathematical formulation of the problem:
[Figure: energy consumption vs. temperature curve]
In the figure above, it is obvious that the optimum point is 20. I want to predict the optimum point using ML or DL models. Based on the courses I have taken, I know that this is a supervised regression problem; however, I don't know how to do optimization on this kind of problem.
I don't want you to write code for this problem. I just want some hints and instructions on how to approach this optimization problem.
Also, if you can recommend any references or courses, I would welcome them, so I can learn how to predict the optimum point of a supervised regression problem without knowing its mathematical formulation.

There are many things you can try when it comes to optimizing your model, for example fine-tuning it. Fine-tuning means trying the different options a model exposes and keeping the combination that gives the smallest error (or the highest accuracy) when you compare actual and predicted data.
With a DecisionTreeRegressor model, you can try different split criteria and limit the minimum number of samples per split and the maximum depth to see which settings give you the best scores/errors. For a neural network model using Keras, you can try different optimizers and loss functions, tune your parameters, etc., and try them all out in combination.
As for resources, you can search Google, YouTube, and other platforms with keywords such as "fine tuning DNN model", and a lot of resources will come up for your reference. The bottom line is that you will need to try out different models and fine-tune them until you are satisfied with your results. The results will be based on your judgement and there are no right or wrong answers (i.e., errors are always there); it is completely up to you how you reach your solution with the handful of ML and DL models you have. My advice is to spend more time getting your hands dirty. It will be worth it in the long run. HFGL!

Related

What to do when only a portion of training/testing data generates confident predictions?

I have a general question on machine learning that can be applied to any algorithm. Suppose I have a particular problem, let us say predicting whether a soccer team wins or loses. The features I choose are the amount of sleep each player gets before the game, sentiment analysis on news coverage, etc.
In this scenario, there is a pattern or correlation (something only a machine learning algorithm can pick up on) that only occurs around 5% of the time. But when it occurs, it is very predictive of the upcoming match.
How do you set up a machine learning algorithm to handle such a case, so that it has the ability to discard most samples as noise? For example, consider a binary SVM. If there were a way to discard most of the “noisy” samples, a lot less overfitting would occur because the hyperplane would not have to eliminate error from these samples.
Regularization would help in this case, but due to the very low percentage of predictive information, is there a way we can code the algorithm to discard these samples in training and refuse to predict certain test data samples?
I have also read into confidence intervals but they seem more of an analytic tool to me than something to use in the algorithm.
I was thinking that using another ML algorithm, which uses the same features to decide which test samples are keepers, might be a good idea.
Any answers using any machine learning algorithm (e.g. SVM, neural net, random forest) as an example would be much appreciated. Any suggestions on where to look would be great as well (Google is usually my friend, but not this time). Please let me know if I can rephrase the question better. Thanks.

Which data to plot to know what model suits best for the problem?

I'm sorry, I know this is a very basic question, but I'm still a beginner in machine learning and choosing the model that best suits my problem still confuses me. I recently used a linear regression model, which gave a very low r2_score, and a user mentioned that I could choose a model according to the shape of the curve in a plot of my data. I then saw another coder use a random forest regressor (with an r2_score 30% better than the linear regression model), and I don't know how he/she knew it was the better model, since he/she didn't explain it. On most sites I have read, people simply feed the data to a few models they think would suit the problem (for example, linear regression or a random forest regressor for a regression problem), but other sites and people say we should first plot the data so we can predict which model suits it best. I really don't know which part of the data I should plot. I thought a seaborn pairplot would give me insight into the shape of the curve, but I doubt it is the right way. What should I actually plot: only the label, only the features, or both? And how can I use the shape of the curve to identify the best candidate model?

This question is too general, but I will try to give an overview of how to choose a model. First of all, you should know that there is no general rule for choosing which family of models to use; the choice is made by experimenting with different models and seeing which one gives better results. You should also know that features are generally multi-dimensional, so plotting the data will not give you full insight into how the target depends on your features. However, to check whether a linear model is appropriate, you can start by plotting the target against each dimension of the input and looking for some kind of linear relationship. I would nevertheless recommend fitting a linear model and checking whether it is relevant from a statistical point of view (Student's t-test, Kolmogorov-Smirnov test, checking the residuals...). Note that in real-life applications linear regression is unlikely to be the best model unless you do a lot of feature engineering, so I would recommend using more advanced methods (random forests, XGBoost...).

If you are using off-the-shelf packages like sklearn, then many simple models like SVM, RF, etc, are just one-liners, so in practice, we usually try several such models at the same time.
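A minimal sketch of "trying several models at the same time", assuming scikit-learn and a small synthetic, deliberately non-linear regression dataset (the data, the model list and the choice of 5-fold cross-validated R² are illustrative assumptions, not part of the original answers):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic data with a clearly non-linear target (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, size=300)

# Each candidate really is a one-liner.
models = {
    "linear": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score every model with the same 5-fold cross-validation and compare mean R^2.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

Comparing cross-validated scores rather than a single train/test split makes the ranking less dependent on one lucky or unlucky split.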

Predicting Energy Consumption of different buildings

I have a dataset (you can find the updated file here) containing many different characteristics of different office buildings, including their surface area and the number of people working there. In total there are about 200 records. I want to use an algorithm that can be trained on this dataset to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the machine learning algorithms available in the scikit library in Python (linear regression, Ridge, Lasso, SVC, etc.) in order to predict a continuous variable. Surface_area and number of workers had a correlation with the target variable between 0.3 and 0.4, so I assumed them to be good features and included them in the training of the model. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful, if someone could give me some advice, or if you could examine a little the dataset and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of datasets too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)

The first thing to do in this kind of machine learning problem is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of data samples is very small, but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community would understand (in this case English). After a bit of tinkering, I found out that the language used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function which (after using Google Translate) tells what the purpose of the building is. Intuitively, this should have a large correlation with power consumption: industries tend to use more power than normal households. After translation, I found out that the main types were Residential, Office, Accommodation and Meeting. This feature therefore has to be encoded as a nominal variable to train the model.
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you with some code to perform this regression task. In such tasks it is very important to understand what the data is and to perform feature engineering accordingly.
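For illustration only, encoding building_function as a nominal variable could look like the sketch below. The rows are made up and are not taken from the asker's file; the English column names surface_area and workers are hypothetical translations, and the random forest is just one possible regressor.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up rows with hypothetical English column names (assumption).
buildings = pd.DataFrame(
    {
        "surface_area": [1200, 300, 800, 1500, 950, 400],
        "workers": [80, 0, 25, 40, 60, 0],
        "building_function": [
            "Office", "Residential", "Meeting",
            "Accommodation", "Office", "Residential",
        ],
        "kwh": [54000, 9000, 21000, 47000, 39000, 11000],
    }
)

# One-hot encode the nominal column; numeric columns pass through unchanged.
features = pd.get_dummies(
    buildings.drop(columns="kwh"), columns=["building_function"]
)
target = buildings["kwh"]

# Fit any regressor on the encoded features.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(features, target)

print(list(features.columns))
```

One-hot encoding avoids implying an order between categories, which an integer code such as Office = 1, Residential = 2 would do.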

How to study the effect of each data on a deep neural network model?

I'm working on training a neural network model using Python and the Keras library.
My model's test accuracy is very low (60.0%) and I have tried a lot to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The splitting technique I'm using is a fixed one: 28 participants for training, 2 for validation and 2 for testing.
The model I'm using is as follows:
sequential model
Optimizer = Adam
With L2_regularizer, Gaussian noise, dropout, and Batch normalization
Number of hidden layers = 3
Activation = relu
Compile loss = categorical_crossentropy
initializer = he_normal
Now I'm using a train-test technique (also a fixed one) to split the data and I get better results. However, I figured out that some of the participants affect the training accuracy negatively. Thus, I want to know if there is a way to study the effect of each participant's data on the accuracy (performance) of a model.
Best Regards,

From my Starting deep learning hands-on: image classification on CIFAR-10 tutorial, in which I insist on keeping track of both:
global metrics (log-loss, accuracy),
examples (correctly and incorrectly classified cases).
The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, if that was the issue).
An example of how it works (here with Neptune, though you can do it manually in a Jupyter Notebook, or using the TensorBoard image channel):
And then looking at particular examples, along with the predicted probabilities:
Full disclaimer: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.

This is, perhaps, more broad an answer than you may like, but I hope it'll be useful nevertheless.
Neural networks are great. I like them. But the vast majority of top-performing, hyper-tuned models are ensembles that use a combination of stats-on-crack techniques, neural networks among them. One of the main reasons for this is that some techniques handle some situations better than others. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.
In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principal component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easy to implement (with help from their docs); sklearn tends to readily accept data prepared for Keras.
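One possible way to "chase out" outliers with nearest neighbors is scikit-learn's LocalOutlierFactor, which compares the local density around each sample with that of its neighbors. The sketch below is a hedged example on synthetic 2-D points, not on the DEAP data; the two planted outliers and n_neighbors=20 are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic features: a dense cluster plus two planted, far-away points.
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(200, 2))
planted_outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, planted_outliers])

# fit_predict returns -1 for samples flagged as outliers and 1 for inliers.
detector = LocalOutlierFactor(n_neighbors=20)
labels = detector.fit_predict(X)
outlier_indices = np.where(labels == -1)[0]

print("flagged samples:", outlier_indices)
```

Flagged samples are candidates for inspection rather than automatic removal; with per-participant data, one could check whether the flagged rows cluster in a few participants.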

how to predict binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!

Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, KNN, etc., evaluate each of them with cross-validation, and then compare them.
You can use GridSearch in scikit-learn to try different parameters and optimize the parameters for each algorithm. Also try this project,
which tests a range of parameters with a genetic algorithm.
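A small sketch of that workflow, assuming scikit-learn's GridSearchCV and a synthetic binary classification dataset from make_classification; the two candidate algorithms and their parameter grids are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption for the demo).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# One parameter grid per candidate algorithm.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

# Grid-search each algorithm with 5-fold cross-validation and keep the best result.
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

for name, (score, params) in results.items():
    print(f"{name}: accuracy = {score:.3f} with {params}")
```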

Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
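A minimal sketch of sklearn.preprocessing.OneHotEncoder on a single made-up categorical column (the color values are an assumption used only to show the output layout):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct values (made-up data).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one column per category, one row per sample
```

Each distinct value becomes its own column, which is why this approach is best suited to features with a limited number of possible values.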
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
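One common way to do this reduction is PCA. The sketch below assumes scikit-learn and a synthetic 20-feature dataset; standardizing before PCA is a choice made here, not something the answer prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a dataset with 20 numeric features (assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Put all features on the same scale, then project onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("shape after reduction:", X_2d.shape)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```

The two columns of X_2d can then be drawn as a scatter plot colored by y to see whether the classes separate along a simple boundary. A low explained-variance figure is a hint that the 2-D picture hides much of the structure.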
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found:
(source: scikit-learn.org)
Just try it, compare different results. Without more information it is not possible to give you better advice.
