How does H2O AutoML decide the model order? - python

I have to solve a simple binary classification problem using H2O AutoML. I'd like to know whether the parameters sort_metric and stopping_metric can somehow influence the order of the trained models.
I tried changing these two parameters, using both AUC and AUCPR, but the performances are almost identical.
My main objective is to obtain the best algorithms in terms of AUCPR, so I would like to influence the order of the trained models accordingly.
Does anyone know how I can do this?
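A minimal sketch of how such a run can be configured (the file path and response column name below are placeholders for my actual dataset), showing where sort_metric and stopping_metric are set:

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()

    # "train.csv" and the "target" column are placeholders for the real data
    train = h2o.import_file("train.csv")
    train["target"] = train["target"].asfactor()  # binary classification needs a factor response

    aml = H2OAutoML(
        max_models=20,
        sort_metric="AUCPR",      # metric used to order the leaderboard
        stopping_metric="AUCPR",  # metric used for early stopping during training
        seed=1,
    )
    aml.train(y="target", training_frame=train)
    print(aml.leaderboard)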

Related

Regression problem optimization using ML or DL

I have some data (from sensors, etc.) from an energy system. Consider the x-axis to be temperature and the y-axis to be energy consumption. Suppose we just have the data and no access to the mathematical formulation of the problem:
[Figure: energy consumption vs. temperature curve]
In the figure above it is absolutely obvious that the optimum point is 20. I want to predict the optimum point using ML or DL models. Based on the courses I have taken, I know this is a supervised regression problem; however, I don't know how to perform optimization on this kind of problem.
I don't want you to write code for this problem. I just want some hints and instructions on how to approach this optimization problem.
Also, if you can recommend any references or courses on how to predict the optimum point of a supervised regression problem without knowing its mathematical formulation, I would welcome them.
There are lots of things you can try when it comes to optimizing your model, for example fine-tuning it. Fine-tuning means trying out the different options a model offers and finding the smallest errors or the highest accuracy based on the actual and predicted data.
Using a DecisionTreeRegressor, you can try different split criteria and limit the minimum number of samples per split and the maximum depth to see which settings give the best predicted scores/errors. For a neural network model using Keras, you can try different optimizers, try different loss functions, tune your hyperparameters, etc., and try them out in combination.
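As a rough sketch of that kind of tuning (the data here is a synthetic stand-in for the energy-vs-temperature measurements, and the criterion names are those used by recent scikit-learn versions):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic stand-in: energy consumption vs. temperature with an optimum near 20
    rng = np.random.default_rng(0)
    X = np.linspace(0, 40, 200).reshape(-1, 1)          # temperature
    y = (X.ravel() - 20) ** 2 + rng.normal(0, 10, 200)  # energy consumption

    param_grid = {
        "criterion": ["squared_error", "absolute_error"],  # names used by recent scikit-learn
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 10, 50],
    }
    search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5, scoring="r2")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)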
As for resources, you can search Google, YouTube, and other platforms with keywords such as "fine tuning DNN model" and plenty of references will come up. The bottom line is that you will need to try out different models and fine-tune them until you are satisfied with your results. The results come down to your judgement and there are no right or wrong answers (i.e., errors are always there); it is entirely up to you how you reach your solution with the handful of ML and DL models you have. My advice is to spend more time getting your hands dirty. It will be worth it in the long run. HFGL!

Which data to plot to know what model suits best for the problem?

I'm sorry, I know this is a very basic question, but since I'm still a beginner in machine learning, determining which model suits my problem best is still confusing to me. Recently I used a linear regression model (and the r2_score was very low), and one user mentioned I could choose a model according to the shape of the plot of my data. Then I saw another coder use a random forest regressor (which gave an r2_score 30% better than the linear regression model), and I don't know how he/she knew it was the better model, since he/she didn't explain it.
In most sites I read, people just feed the data into models they think suit the problem (for example, for a regression problem, linear regression or a random forest regressor), but some sites and some people say we first need to plot the data so we can tell which model fits best. I really don't know which part of the data I should plot. I thought a seaborn pairplot would give me insight into the shape of the curve, but I doubt that is the right way. What should I actually plot: only the label, only the features, or both? And how can I get insight into the curve so I can find the best possible model afterwards?
This question is quite general, but I will try to give an overview of how to choose a model. First of all, you should know that there is no general rule for choosing the family of models to use; it is more a matter of experimenting with different models and seeing which one gives better results. You should also know that in general you have multi-dimensional features, so plotting the data will not give you full insight into the dependence between your features and the target. However, to check whether a linear model is worth fitting, you can start by plotting the target against each dimension of the input and looking for some kind of linear relation. I would also recommend fitting a linear model and checking whether it is relevant from a statistical point of view (Student's t-test, Kolmogorov-Smirnov test, checking the residuals, ...). Note that in real-life applications it is unlikely that linear regression will be the best model unless you do a lot of feature engineering, so I would recommend using more advanced methods (random forests, XGBoost, ...).
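A minimal sketch of that kind of inspection, assuming the data sits in a CSV file with a column named "target" (both the file name and the column name are placeholders):

    import matplotlib.pyplot as plt
    import pandas as pd

    # "data.csv" and "target" are placeholders for your own dataset
    df = pd.read_csv("data.csv")
    features = [c for c in df.columns if c != "target"]

    # One scatter plot of the target against each feature, to eyeball linear relations
    fig, _ = plt.subplots(1, len(features), figsize=(4 * len(features), 4))
    for ax, col in zip(fig.axes, features):
        ax.scatter(df[col], df["target"], s=5)
        ax.set_xlabel(col)
        ax.set_ylabel("target")
    plt.tight_layout()
    plt.show()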
If you are using off-the-shelf packages like sklearn, then many simple models like SVM, RF, etc. are just one-liners, so in practice we usually try several such models at the same time.
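For instance, a quick comparison of a few regressors by cross-validated R^2 might look like this (the data below is a synthetic stand-in for your own X and y):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVR

    # Synthetic stand-in for your own features and target
    X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

    models = {
        "linear regression": LinearRegression(),
        "random forest": RandomForestRegressor(random_state=0),
        "SVR": SVR(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean r2 = {scores.mean():.3f}")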

I want to implement a machine learning or deep learning model for text classification (100 classes)

I have a dataset similar to one where we have movie plots and their genres. The number of classes is around 100. Which algorithm should I choose for this 100-class classification? The classification is multi-label, because one movie can have multiple genres.
Please recommend one from the following. You are free to suggest any other model if you want to.
1. Naive Bayes
2. Neural networks
3. SVM
4. Random forest
5. k-nearest neighbours
It would also be useful if you gave the necessary library in Python.
An important step in machine learning engineering is properly inspecting the data. This gives you insight into which algorithm to choose. Sometimes you might try out more than one algorithm and compare the models, in order to be sure that you tried your best on the data.
Since you did not disclose your data, I can only give you the following advice: if your data is "easy", meaning that you need only a few features and a simple combination of them to solve the task, use Naive Bayes or k-nearest neighbours. If your data is of "medium" difficulty, use Random Forest or SVM. If solving the task requires a very complicated decision boundary combining many feature dimensions in a non-linear fashion, choose a neural network architecture.
I suggest you use Python and the scikit-learn package for SVM, Random Forest, or k-NN.
For neural networks, use Keras.
I am sorry that I cannot give you THE recipe you might expect for solving your problem; your question is posed really broadly.
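As a rough starting point for the multi-label setting, a TF-IDF plus one-vs-rest linear SVM pipeline in scikit-learn could look like the sketch below (the plots and genre lists are made-up placeholders for your own data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    # Placeholder data: movie plots and their genre lists
    plots = ["a detective hunts a serial killer", "two friends go on a road trip"]
    genres = [["thriller", "crime"], ["comedy", "adventure"]]

    # Turn the genre lists into a binary indicator matrix (one column per genre)
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(genres)

    # One linear SVM per genre on top of TF-IDF features
    clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
    clf.fit(plots, y)

    pred = clf.predict(["a detective goes on a road trip"])
    print(mlb.inverse_transform(pred))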

How to predict a binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking of a logistic regression model and kernel approximation, based on the cheat sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, KNN, etc., run cross-validation for each of them, and then compare the results.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
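For example, a grid search over logistic regression parameters might look like this sketch (the data is a synthetic stand-in for your own features and labels):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in for your own features and binary labels
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"], "solver": ["liblinear"]}
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="roc_auc")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)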
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
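A minimal sketch of that preprocessing combined with logistic regression, using made-up column names (age and income continuous, city categorical):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Made-up example data; replace with your own columns
    df = pd.DataFrame({
        "age": [25, 40, 31, 58],
        "income": [30000, 80000, 52000, 61000],
        "city": ["london", "paris", "london", "berlin"],
        "outcome": [0, 1, 0, 1],
    })

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])
    clf = make_pipeline(preprocess, LogisticRegression())
    clf.fit(df[["age", "income", "city"]], df["outcome"])
    print(clf.predict(df[["age", "income", "city"]]))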
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
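A tiny Keras sketch for a binary classifier on 20 features might look like this (the data is random stand-in data, and tensorflow is assumed to be installed):

    import numpy as np
    from tensorflow import keras

    # Random stand-in for your 500,000 x 20 feature matrix and binary labels
    X = np.random.rand(1000, 20).astype("float32")
    y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

    model = keras.Sequential([
        keras.layers.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # probability of the positive class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=128, validation_split=0.1)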
You should also know that there are Ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it, compare different results. Without more information it is not possible to give you better advice.

Python multiple curve fitting models

Is there a way to give a function an (x, y) pair dataset and have it return a list of curve-fit models and their coefficients? The program DataFit does this with about 200 different models, from exponential to inverse polynomial, etc., but we are looking for a Pythonic way.
I have seen many posts about manually specifying each model with scipy, but this is not feasible for the number of models we want to test.
The closest I found was pyeq2, but it does not return the list of functions and seems to be a rabbit hole to code for.
If R has this available we could use that, but Python is really the goal.
Below is an example of the data; we want to find the best way to describe this curve.
You can try the splines library in R. I have used it for higher-order curve fitting of some univariate data. You can vary the fit and compare the corresponding R^2 errors.
You can decide to do one of the following:
Choose a model whose parameters you fit. The model should be based on a single independent variable. This can be done with Python's scipy.optimize curve_fit function; you could choose something like a hyperbola (see the sketch after this list).
Choose a model that is complex and likely represents an underlying mechanism at work, like the system of ODEs from an SIR disease model. Fitting the parameters will be no easy task; this is done with Markov Chain Monte Carlo (MCMC) methods and is VERY difficult.
Realise that you have data and can use machine learning via scikit-learn to make predictions from it. This approach does not require a parametric model.
Machine learning and neural networks don't fit a formula and can't really tell you about the underlying mechanism, but they can make predictions just as a best-fit model would... dare I say even better.
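As an illustration of the first option, a curve_fit sketch for a hyperbola-like model y = a/x + b, run here on synthetic stand-in data rather than the real dataset:

    import numpy as np
    from scipy.optimize import curve_fit

    # Synthetic (x, y) stand-in for the real dataset
    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 50)
    y = 3.0 / x + 0.5 + rng.normal(0, 0.05, x.size)

    # Hyperbola-like model: y = a / x + b
    def hyperbola(x, a, b):
        return a / x + b

    params, cov = curve_fit(hyperbola, x, y)
    residuals = y - hyperbola(x, *params)
    print("fitted a, b:", params, "sum of squared residuals:", np.sum(residuals ** 2))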
In the end, we found that Eureqa software was able to achieve this. https://www.nutonian.com/products/eureqa/
