I am working on a classification task with 985 classes.
I have trained my model and predicted the class of X_test data.
I am using logistic regression. When I call clf.predict(X_test[0]), I get the correct class.
But when I look at the probabilities with clf.predict_proba(X_test[0]), the correct class does not have the highest probability; another class has the maximum probability. I don't understand why this is happening. I have checked other inputs, and the same thing happens for them as well.
This is hard to troubleshoot without a reproducible example. However, I suspect there may be an indexing problem. Try restarting the kernel if you're using a notebook, and check how you index into the predicted probabilities.
Also, if you could post more details or examples of this happening, it would help.
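In the meantime, here is a minimal check you could adapt (shown on a small synthetic problem rather than your 985 classes); the key point is that the columns of predict_proba follow clf.classes_, not the raw label values:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in for the original 985-class problem
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

sample = X[:1]                        # keep the 2-D shape (1, n_features)
pred_label = clf.predict(sample)[0]
proba = clf.predict_proba(sample)[0]

# Map the argmax column back through clf.classes_ before comparing;
# comparing against the column index directly can look like a mismatch
argmax_label = clf.classes_[np.argmax(proba)]
print(pred_label, argmax_label)       # these two should always agree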
I have some data (from sensors, etc.) from an energy system. Consider that the x-axis is temperature and the y-axis is energy consumption. Suppose we just have the data and no access to the mathematical formulation of the problem:
[figure: energy consumption vs. temperature curve]
In the above figure, it is obvious that the optimum point is 20. I want to predict the optimum point using ML or DL models. Based on the courses I have taken, I know this is a supervised regression problem; however, I don't know how I can do optimization on this kind of problem.
I don't want you to write code for this problem. I just want some hints and instructions on how to approach this optimization problem.
Also, I would welcome any references or courses on how to predict the optimum point of a supervised regression problem without knowing its mathematical formulation.
There are lots of things you can try when it comes to optimizing your model, for example fine-tuning it. Fine-tuning means trying the different options a model offers and keeping whichever gives the smallest error or the highest accuracy when comparing actual and predicted data.
With a DecisionTreeRegressor model, you can try different split criteria and limit the minimum number of samples per split and the maximum depth to see which combination gives you the best scores/errors. For a neural network model in Keras, you can try different optimizers, try different loss functions, tune your parameters, etc., and try out the combinations.
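For example, here is a small sketch (with synthetic temperature/energy data made up to resemble the curve in your figure; the names and numbers are only placeholders) of tuning a DecisionTreeRegressor and then reading the optimum off the fitted model:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic sensor data: temperature (x) vs. energy consumption (y),
# with a minimum around 20 as in the figure
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, 300).reshape(-1, 1)
energy = (temp.ravel() - 20) ** 2 + rng.normal(0, 5, 300)

# Try different depth / split limits and keep the combination with the lowest error
params = {"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), params,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(temp, energy)

# Predict on a dense temperature grid and read off the temperature
# with the lowest predicted consumption
grid = np.linspace(0, 40, 401).reshape(-1, 1)
pred = search.best_estimator_.predict(grid)
print("estimated optimum temperature:", grid[np.argmin(pred)][0])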
As for resources, you can search Google, YouTube, and other platforms with keywords such as "fine tuning DNN model" and a lot of material will pop up for your reference. The bottom line is that you will need to try out different models and fine-tune them until you are satisfied with your results. The results come down to your own judgement and there are no right or wrong answers (i.e., errors are always there); it is completely up to you how you reach your solution with the handful of ML and DL models you have. My advice is to spend more time getting your hands dirty. It will be worth it in the long run. HFGL!
I have to solve a simple binary classification problem using H2O AutoML. I'd like to know whether the parameters sort_metric and stopping_metric can somehow influence the order of the trained models.
I tried changing these two parameters, using both AUC and AUCPR, but the performances are almost identical.
My main objective is to obtain the best algorithms in terms of AUCPR, so I would like to somehow influence the order of the trained models.
Does someone know how I can do so?
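For context, this is roughly how I am setting things up (the data and column names here are placeholders for my real problem):

import h2o
import pandas as pd
from h2o.automl import H2OAutoML
from sklearn.datasets import make_classification

h2o.init()

# Synthetic binary-classification data standing in for the real problem
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(10)])
df["target"] = y

train = h2o.H2OFrame(df)
train["target"] = train["target"].asfactor()  # mark the label as categorical

# sort_metric should order the leaderboard, stopping_metric drives early
# stopping inside each model; here both are set to AUCPR
aml = H2OAutoML(max_models=10, seed=1,
                sort_metric="AUCPR", stopping_metric="AUCPR")
aml.train(y="target", training_frame=train)
print(aml.leaderboard.head())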
I'm sorry, I know this is a very basic question, but since I'm still a beginner in machine learning, determining which model suits my problem best is still confusing to me. Recently I used a linear regression model (and the r2_score was very low), and a user mentioned I could choose a model according to the curve of the plot of my data. Then I saw another coder use a random forest regressor (and the r2_score was 30% better than with the linear regression model), and I don't know how he/she knew it was the better model, since he/she didn't explain it.
Most sites that I read just feed the data into whichever models they think suit the problem (for example, for a regression problem, linear regression or a random forest regressor), but some sites and some people say we first need to plot the data so we can tell which model fits best. I really don't know which part of the data I should plot. I thought a seaborn pairplot would give me insight into the shape of the curve, but I doubt that is the right way. What should I actually plot: only the label, only the features, or both? And how can I get insight from the curve to identify the likely best model?
This question is quite general, but I will try to give an overview of how to choose a model. First of all, you should know that there is no general rule for choosing the family of models to use; it is more a matter of experimenting with different models and seeing which one gives better results. You should also know that, in general, you have multi-dimensional features, so plotting the data will not give you full insight into the dependence of the target on the features. However, to check whether a linear model is worth fitting, you can start by plotting the target against each dimension of the input and look for some kind of linear relation. I would still recommend fitting a linear model and checking whether it is relevant from a statistical point of view (Student's t-test, Kolmogorov-Smirnov test, checking the residuals, ...). Note that in real-life applications it is unlikely that linear regression will be the best model unless you do a lot of feature engineering, so I would recommend also trying more advanced methods (random forests, XGBoost, ...).
If you are using off-the-shelf packages like sklearn, many simple models like SVM, RF, etc., are just one-liners, so in practice we usually try several such models at the same time.
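For instance, here is a minimal sketch (with a synthetic dataset standing in for your own X and y) of comparing a linear model against a random forest with cross-validated R^2, which is usually quicker than trying to read the answer off a plot:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in dataset; replace with your own features X and label y
X, y = make_regression(n_samples=400, n_features=6, noise=20, random_state=0)

# Try a few candidate models and compare their cross-validated R^2 scores
models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")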
I wanted to check whether a multiple linear regression problem produced the same output when solved using Scikit-Learn and Statsmodels.api. I did it in three sections (in this order): Statsmodels (without intercept), Statsmodels (with intercept), and SKL. As expected, my SKL coefficients and R² were the same as those of Statsmodels (with intercept), but my SKL mean squared error was equivalent to that of Statsmodels (without intercept).
I am going to share my notebook code; it's a fairly basic piece of code, since I have just started with Machine Learning Applications. Please go through it and tell me why it is happening. Also, if you could share your insights on any inefficient piece of code, I would be thankful. Here's the code:
https://github.com/vgoel60/Linear-Regression-using-Sklearn-vs-Statsmodel.api/blob/master/Linear%20Regression%20Boston%20Housing%20Prices%20using%20Scikit-Learn%20and%20Statsmodels.api.ipynb
You made a mistake, which explains the strange results. When you make the predictions from the linear model with scikit-learn, you write:
predictions2 = lm.predict(xtest2)
Notice that you are using the lm model, the one resulting from the first statsmodels regression. Instead, you should have written:
predictions2 = lm2.predict(xtest2)
When you do this, the results are as expected.
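For reference, here is a small sketch (with made-up data standing in for the Boston housing set used in the notebook) of why scikit-learn's output should match the statsmodels fit that includes a constant:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Toy data in place of the notebook's Boston housing features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(0, 0.1, 100)

# statsmodels without an intercept vs. with an explicit constant column
lm = sm.OLS(y, X).fit()                    # no intercept
lm2 = sm.OLS(y, sm.add_constant(X)).fit()  # with intercept

# scikit-learn fits an intercept by default, so it should agree with lm2, not lm
skl = LinearRegression().fit(X, y)
print(lm2.params[1:], skl.coef_)       # slope coefficients match
print(lm2.params[0], skl.intercept_)   # intercepts match

# Predicting with the wrong model object (lm instead of lm2) is what made the
# errors in the notebook look like the no-intercept fit.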
I'm trying to write something similar to Google's wide and deep learning model, after running into difficulties doing multi-class classification (12 classes) with the sklearn API. I've tried to follow the advice in a couple of posts and used tf.group(logistic_regression_optimizer, deep_model_optimizer). It seems to work, but I'm trying to figure out how to get predictions out of this model. I'm hoping that with the tf.group operator the model learns to weight the logistic and deep models differently, but I don't know how to get these weights out so I can compute the right combination of the two models' predictions. Thanks in advance for any help.
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/Cs0R75AGi8A
How to set layer-wise learning rate in Tensorflow?
tf.group() creates a node that forces a list of other nodes to run using control dependencies. It's really just a handy way to package up logic that says "run this set of nodes, and I don't care about their output". In the discussion you point to, it's just a convenient way to create a single train_op from a pair of training operators.
If you're interested in the value of a Tensor (e.g., the weights), you should pass it to session.run() explicitly, either in the same call as the training step or in a separate session.run() invocation. You can pass a list of values to session.run(): for example, your tf.group() expression as well as a Tensor whose value you would like to compute.
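Here is a small sketch of that pattern, assuming the TF 1.x graph API used in the linked discussion; the layer names, sizes, and the Ftrl/Adam pairing are illustrative, not taken from your code:

import numpy as np
import tensorflow as tf  # assumes the TF 1.x graph API

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.int64, [None])

# "Wide" (logistic) part: a single linear layer
wide_logits = tf.layers.dense(x, 12, name="wide")

# "Deep" part: a small MLP
hidden = tf.layers.dense(x, 32, activation=tf.nn.relu, name="deep_hidden")
deep_logits = tf.layers.dense(hidden, 12, name="deep_out")

# Combined prediction and loss
logits = wide_logits + deep_logits
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))

# One optimizer per part, grouped into a single train op
wide_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="wide")
deep_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="deep")
wide_train = tf.train.FtrlOptimizer(0.1).minimize(loss, var_list=wide_vars)
deep_train = tf.train.AdamOptimizer(1e-3).minimize(loss, var_list=deep_vars)
train_op = tf.group(wide_train, deep_train)

predictions = tf.argmax(logits, axis=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    xb = np.random.randn(8, 10).astype(np.float32)
    yb = np.random.randint(0, 12, size=8)
    # Fetch the grouped train op together with any tensors you want to inspect
    _, preds, wide_kernel = sess.run([train_op, predictions, wide_vars[0]],
                                     feed_dict={x: xb, y: yb})
    print(preds, wide_kernel.shape)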
Hope that helps!