questions about random forest probability calibration in h2o

I am reading through the example of calibrating probabilities in the H2O documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/calibrate_model.html
Since the example is poorly explained, my questions are:
Do we have to have weights (weights_column) in the training set?
If so, what do these weights do?
I also tried leaving out the weights. The code still runs, but the results are very different. Any insight would be appreciated.

If you are interested in how the weights column in H2O-3 works you can review the documentation here and code examples here.
I have included a copy of this document for your convenience:
weights_column
Available in: GBM, DRF, Deep Learning, GLM, AutoML, XGBoost, CoxPH
Hyperparameter: no
Description
This option specifies the column in a training frame to be used when determining weights. Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are also supported. During training, rows with higher weights matter more, due to the larger loss function pre-factor.
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing). Note that if you omit the weights, then all observations will have equal weights (set to 1) for the computation of the metrics. For example, a weight of 2 is identical to duplicating a row.
Notes:
Weights can be specified as integers or as non-integers.
The weights column cannot be the same as the fold_column.
If a weights column is specified as both a feature (predictor) and a weight, the column will be used for weights only.
Example unit test scripts are available on GitHub:
https://github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/gbm/pyunit_weights_gbm.py
https://github.com/h2oai/h2o-3/blob/master/h2o-py/tests/testdir_algos/gbm/pyunit_weights_gamma_gbm.py
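The statement in the description that a weight of 2 is identical to duplicating a row can be checked on a metric by hand. The following is a minimal plain-Python sketch of a weighted log loss; the function name and the toy data are made up for illustration, and this is not H2O's own implementation.

```python
import math

def weighted_logloss(y_true, p_pred, weights=None):
    """Weighted binary log loss; every row gets weight 1 if no weights are given."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

# Toy labels and predicted probabilities (hypothetical values).
y = [1, 0, 1]
p = [0.9, 0.2, 0.6]

# Give the third row a weight of 2 ...
with_weights = weighted_logloss(y, p, weights=[1, 1, 2])
# ... and compare with physically duplicating the third row.
duplicated = weighted_logloss(y + [1], p + [0.6])

print(with_weights, duplicated)
```

Both calls should print the same value up to floating-point rounding, which is also why omitting the weights (all rows weighted 1) can change the reported metrics.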
Observation Weights in Deep Learning
The observation weights are handled differently in Deep Learning than in the other supported algorithms. For algorithms other than Deep Learning, the weight goes into the split-finding and leaf-node prediction math in a straightforward way. For Deep Learning, it’s more difficult. Using the weight as a multiplicative factor in the loss will not work in general, and that would not be the same as replicating the same row. Also, applying the same row over and over isn’t a good idea either, so sampling during training should still be active. To address these issues, Deep Learning is implemented with importance sampling using the inverse cumulative distribution function. It also includes a special case that picks a random row from the dataset for every second row it trains, just to keep outliers in the game. Note that observation weights for Deep Learning that are neither 0 nor 1 are difficult to handle properly. In this case, it might be better to explicitly oversample using balance_classes=TRUE.
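As a rough illustration of sampling rows in proportion to their weights through the inverse cumulative distribution function, here is a minimal plain-Python sketch. It only shows the general technique; the function name is hypothetical, it is not H2O's actual code, and it leaves out the special case that picks a uniformly random row for every second row.

```python
import bisect
import itertools
import random

def sample_rows_by_weight(weights, n_samples, seed=0):
    """Draw row indices with probability proportional to their weights."""
    rng = random.Random(seed)
    # Cumulative sums of the weights form an (unnormalized) CDF over the rows.
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]
    samples = []
    for _ in range(n_samples):
        u = rng.random() * total            # uniform draw in [0, total)
        index = bisect.bisect_right(cdf, u) # invert the CDF
        samples.append(min(index, len(weights) - 1))  # guard against rounding at the top end
    return samples

# Row 1 has weight 0 and should never be drawn; row 2 should appear about 3x as often as row 0.
samples = sample_rows_by_weight([1.0, 0.0, 3.0], n_samples=1000)
print(samples.count(0), samples.count(1), samples.count(2))
```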
Related Parameters
balance_classes
offset_column
y

Related

Model doesn't learn from data

We have a dataset with ~40000 data points, each having 160 features. We know nothing about what each feature represents, but they are integers from 0 to 5, most probably some kind of ranking. Our task is to take a subset of those features, let's say (40000,30), and predict the initial (40000,160) data. In other words, we need to create a model that takes 30 features as input and outputs the full set of 160 features.
https://i.stack.imgur.com/Ko6nR.png
the example of the dataset.
So far, we have trained an ANN with the following architecture:
30->200->150->163
We calculate an accuracy score by rounding each prediction (for example, if I predict 3.6 for a true value of 4, then 3.6 rounds to 4 and 4 == 4, so it counts as correct).
We got ~52% accuracy, and nothing makes it go higher.
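For reference, the rounded-accuracy score described above can be sketched as follows. This is a plain-Python illustration with made-up numbers, not the code actually used; note that Python's round() rounds exact halves to the nearest even integer, which may differ from other rounding conventions.

```python
def rounded_accuracy(y_true, y_pred):
    """Fraction of entries where the rounded prediction equals the integer target."""
    hits = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            hits += int(round(p) == t)
            total += 1
    return hits / total

# Two samples with two target features each (hypothetical values).
y_true = [[4, 2], [0, 5]]
y_pred = [[3.6, 2.4], [1.2, 4.7]]

score = rounded_accuracy(y_true, y_pred)
print(score)  # 3 of the 4 rounded predictions match
```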
So, this is a multi-output regression problem. The prediction uses 30 discrete numeric features. We tried normalizing with both Min-Max scaling and standardization (the target is also normalized). In the model, we tried different numbers of layers with different capacities, batch norm, different activations (ReLU is used now; the output layer has no activation), different losses (MSE is the current one), and different optimizers (Adam is the current one). We used both Keras and PyTorch, in case something was wrong with the PyTorch implementation.
Still, the accuracy remains at 50-52%. One thing should be straightforward: when we increase the model capacity (the number of parameters), the model should become more prone to overfitting. Yet even after increasing the capacity by a lot, we couldn't make the model overfit the data. We also tried using the features separately (for example, predicting one feature from another), but got nothing useful. We tried predicting 1 feature from the other 159, but again got ~52% or even less.
What I understand and can conclude from this is that there is no relationship between those ratings and most of them can't predict the others. What do you think about this case?

Are there any rules of thumb for the relation between the number of iterations and the training size in LightGBM?

When I train a classification model using LightGBM, I usually use a validation set and early stopping to determine the number of iterations.
Now I want to combine the training and validation sets to train a model (so that I have more training examples) and use that model to predict the test data. Should I change the number of iterations derived from the validation process?
Thanks!

As you said in your comment, this is not comparable to the number of epochs in deep learning, because deep learning is usually stochastic.
With LGBM, all parameters and features being equal, when you add 10% to 15% more training points we can expect the trees to look alike: with more information your split values will be better, but it is unlikely to change your model drastically (this is less true if you use parameters such as bagging_fraction or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (sorry, I can't find my sources). Intuitively, it makes sense to add some trees, as you are potentially adding information. Experimentally this value worked well, but the optimal value will depend on your model and data.

For a similar problem in deep learning with Keras, I use an early stopper and cross-validation with train and validation data, and let the model optimize itself against the validation data during each training run.
After each training run, I test the model on the test data and examine the mean accuracies. Meanwhile, after each run I save the stopped_epoch from the EarlyStopper. If the CV scores are satisfactory, I take the mean of the stopped epochs, do a full training run (on all the data I have) for that mean number of epochs, and save the model.

I'm not aware of a well-established rule of thumb for such an estimate. As Florian has pointed out, people sometimes rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes a linear dependence between the data size and the optimal number of trees, i.e. for 10-fold CV this would be a rescaling factor of about 1.1. But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or fewer will not have a dramatic effect.
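The linear rescaling mentioned above amounts to very little arithmetic. A minimal sketch, assuming the optimal number of trees grows in proportion to the number of training rows (which, as said, has no solid justification); the function name and the numbers are made up:

```python
def rescale_iterations(best_iteration, n_train, n_total):
    """Scale an early-stopped iteration count, assuming it grows linearly with data size."""
    return round(best_iteration * n_total / n_train)

# 10-fold CV: each fold trains on 90% of the rows, so the factor is 1 / 0.9, about 1.11.
final_iterations = rescale_iterations(500, n_train=9000, n_total=10000)
print(final_iterations)
```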
Two suggestions:
Do k-fold validation instead of a single train-validation split. This lets you evaluate how stable the estimate of the optimal number of trees is. If it fluctuates a lot between folds, do not rely on such an estimate :)
Fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This lets you evaluate how the number of trees depends on the sample size and extrapolate it to the full sample size.
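For the first suggestion, one simple way to quantify how much the early-stopped iteration count fluctuates between folds is its coefficient of variation. This is only a sketch with hypothetical per-fold values; what counts as "a lot" of fluctuation is a judgment call.

```python
from statistics import mean, stdev

def iteration_stability(best_iterations):
    """Return the mean early-stopped iteration and its coefficient of variation."""
    avg = mean(best_iterations)
    return avg, stdev(best_iterations) / avg

# Best iteration found by early stopping in each of 5 folds (hypothetical values).
fold_iterations = [480, 510, 495, 520, 470]
avg_iterations, variation = iteration_stability(fold_iterations)
print(avg_iterations, round(variation, 3))
```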

Why don't the weights of my NN model change much?

I am training a neural network model, and it fits the training data well. The training loss decreases steadily. Everything works fine. However, when I printed the weights of my model, I found that they hadn't changed much since random initialization (I didn't use any pretrained weights; all weights are initialized by PyTorch's defaults). Every dimension of the weights changed by only about 1%, while the accuracy on the training data climbed from 50% to 90%.
What could account for this phenomenon? Is the dimension of the weights too high, so that I need to reduce the size of my model? Or are there other possible explanations?
I understand this is quite a broad question, but I think it's impractical for me to show my model and analyze it mathematically here. So I just want to know what the general / common causes of this could be.
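To make the "about 1%" figure concrete, one common way to measure how far the weights have moved is the relative L2 distance from the initial values. The sketch below works on flat Python lists with made-up numbers; with PyTorch one would first flatten the tensors saved at initialization and after training (for example from state_dict()).

```python
import math

def relative_change(initial, trained):
    """L2 norm of (trained - initial) divided by the L2 norm of the initial weights."""
    diff = math.sqrt(sum((t - i) ** 2 for i, t in zip(initial, trained)))
    norm = math.sqrt(sum(i ** 2 for i in initial))
    return diff / norm

# Flattened weights before and after training (hypothetical values).
initial_weights = [0.50, -0.20, 0.10]
trained_weights = [0.505, -0.198, 0.101]

print(relative_change(initial_weights, trained_weights))  # roughly 0.01, i.e. about 1%
```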

There are almost always many local optima in a problem, so one thing you can't say, especially in high-dimensional feature spaces, is which optimum your model parameters will settle into. One important point here is that for every set of weights that reaches an optimum there are, because the weights are real-valued, infinitely many other sets of weights that reach the same optimum. The proportion of the weights to each other is the only thing that matters, because you are trying to minimize the cost, not to find a unique set of weights with a loss of 0 for every sample. Every time you train, you may get a different result depending on the initial weights. When the weights all change by almost the same ratio, this means your features are highly correlated (i.e. redundant), and since you are getting very high accuracy with just a small change in the weights, the only thing I can think of is that the classes in your data set are far away from each other. Try removing features one at a time, then train and look at the results. If the accuracy is still good, continue removing another one until you hopefully reach a 3- or 2-dimensional space in which you can plot your data and see how the data points are distributed.
EDIT: A better approach is to use PCA for dimensionality reduction instead of removing features one by one.

What is the key feature in the MNIST dataset that is used to classify images?

I was recently learning about neural networks and came across the MNIST data set. I understood that a sigmoid cost function is used to reduce the loss. Also, the weights and biases get adjusted, and the optimal weights and biases are found after training. What I did not understand is on what basis the images are classified. For example, to classify whether a patient has cancer or not, data like age, location, etc. become the features. In the MNIST dataset, I did not find anything like that. Am I missing something here? Please help me with this.

First of all, the network pipeline consists of 3 main parts:
Input manipulation
Parameters that affect the finding of the minimum
Parameters such as the decision function in your interpretation layer (often a fully connected layer)
In contrast to a regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (filters like those in edge detection or Viola-Jones).
When a filter runs across the image and is convolved with the pixels, it produces an output.
This output is then interpreted by a neuron. If the output is above a threshold, it is considered valid (a step function outputs 1 if valid; a sigmoid instead outputs a value on the sigmoid curve).
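To illustrate what "a filter runs across the image" means, here is a minimal plain-Python sketch that slides a small hand-picked vertical-edge filter over a tiny image and then applies a step function. In a trained CNN the filter values are learned rather than chosen by hand; the image, filter and threshold here are made up. Like most CNN libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the map of weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output

def step(feature_map, threshold):
    """Step activation: 1 where the filter response exceeds the threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in feature_map]

# A 3x4 "image" that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A filter that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d_valid(image, vertical_edge)
activated = step(feature_map, threshold=1)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
print(activated)    # [[0, 1, 0], [0, 1, 0]]
```

The filter only fires in the column where the dark-to-bright edge sits, which is the kind of "feature" the network extracts from raw pixels.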
The next steps are the same as before.
This continues until the interpretation layer (often softmax). This layer interprets your computation (if the filters are well adapted to your problem, you will get a good predicted label), which means you have a small difference (y_guess - y_true_label).
Now you can see that, to compute the guess of y, we multiplied the input x by many weights w and also applied functions to it. This can be viewed as a composition of functions, to which the chain rule from analysis applies.
To get better results, the effect of each single weight on the error must be known. Therefore, you use backpropagation, which computes the derivative of the error with respect to every w. The trick is that you can reuse derivatives, which is more or less what backpropagation is, and it becomes easier since you can use matrix-vector notation.
Once you have your gradient, you can use the normal concept of minimization, where you walk along the direction of steepest descent. (There are also many other gradient methods, such as Adagrad or Adam.)
The steps repeat until convergence or until you reach the maximum number of epochs.
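The loop of "compute the gradient of the error, step against it, repeat" can be shown on the smallest possible model, a single weight w in y = w * x trained with a mean squared error. This is only a sketch of plain gradient descent with made-up data; a real network has many weights and obtains the gradient through backpropagation.

```python
def gradient_descent(xs, ys, lr=0.1, epochs=200):
    """Fit y = w * x by repeatedly stepping against the gradient of the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # move along the steepest descent direction
    return w

# Data generated by y = 2 * x, so the learned weight should approach 2.
w = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w)
```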
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)

xgboost predict method returns the same predicted value for all rows

I've created an xgboost classifier in Python:
train is a pandas dataframe with 100k rows and 50 features as columns.
target is a pandas series
import xgboost as xgb
xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)
predictions = xgb_classifier.predict(test)
However, after training, when I use this classifier to predict values the entire results array is the same number. Any idea why this would be happening?
Data clarification:
~50 numerical features with a numerical target
I've also tried RandomForestRegressor from sklearn with the same data and it does give realistic predictions. Perhaps a legitimate bug in the xgboost implementation?

This question has received several responses including on this thread as well as here and here.
I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.
I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7000 columns) because I did not have enough local memory for the algorithm. It turned out that for me, the array of predicted values was just an array of the average values of the target variable. This suggests to me that the model may have been underfitting. One solution to an underfitting model is to train your model on more data, so I tried my analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values. On the other hand, the issue could simply have been that the slice of predicted values I was looking at were predicted from training data with very little information (e.g. 0's and nan's). For training data with very little information, it seems reasonable to predict the average value of the target feature.
None of the other suggested solutions I came across were helpful for me. To summarize, the suggested solutions included:
1) check if gamma is too high
2) make sure your target labels are not included in your training dataset
3) max_depth may be too small.

One possible reason is that you're applying a high penalty through the gamma parameter. Compare the mean value of your training response variable with the predictions and check whether they are close. If so, the model is restricting the prediction too much in order to keep train-rmse and val-rmse as close as possible. The higher the value of gamma, the simpler the prediction, so you end up with the simplest possible model: the mean of the training set, i.e. a naive prediction.
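The check described above (are the predictions all sitting near the training mean?) can be done in a few lines. A minimal sketch with a hypothetical helper name, tolerance and data:

```python
from statistics import mean

def collapsed_to_mean(predictions, train_target, rel_tol=0.05):
    """True if every prediction lies within rel_tol of the training-target mean."""
    target_mean = mean(train_target)
    tolerance = rel_tol * abs(target_mean) if target_mean != 0 else rel_tol
    return all(abs(p - target_mean) <= tolerance for p in predictions)

train_target = [5, 10, 15]  # mean is 10

print(collapsed_to_mean([10.1, 9.9, 10.0], train_target))  # True: model predicts ~mean
print(collapsed_to_mean([5.2, 14.8, 10.1], train_target))  # False: predictions vary
```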

Isn't max_depth=3 too small? Try making it bigger; the default value is 6, if I remember correctly. Also keep silent=0 (setting it to 1 suppresses the messages) so you can monitor the error at each boosting round.

You need to post a reproducible example for any real investigation. It's entirely likely that your response target is highly unbalanced and that your training data is not very predictive, so you always (or almost always) get one class predicted. Have you looked at the predicted probabilities at all to see if there is any variance? Is it just an issue of not using the proper cut-off for classification labels?
Since you said that an RF gave reasonable predictions, it would be useful to see your training parameters for that. At a glance, it's curious that you're using a regression objective function in your xgboost call, though; that could easily be why you are seeing such poor performance. Try changing your objective to 'binary:logistic'.
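To show how the cut-off alone can produce identical labels for every row, here is a small sketch with made-up positive-class probabilities (with a scikit-learn-style classifier these would come from predict_proba). The probabilities vary, but none reaches the default 0.5 cut-off:

```python
def labels_at_cutoff(probabilities, cutoff=0.5):
    """Turn positive-class probabilities into 0/1 labels at the given cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# Hypothetical probabilities for an unbalanced problem.
probs = [0.02, 0.10, 0.31, 0.07]

spread = max(probs) - min(probs)
default_labels = labels_at_cutoff(probs)             # every row gets the same label
lowered_labels = labels_at_cutoff(probs, cutoff=0.2)

print(spread, default_labels, lowered_labels)
```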

You should check there are no inf values in your target.
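A quick way to run this check, sketched in plain Python on a made-up target list (for a pandas Series you would pass its values); NaN is flagged as well, since math.isfinite rejects both:

```python
import math

def non_finite_positions(values):
    """Return the indices of entries that are inf, -inf or NaN."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]

target = [1.0, float("inf"), 2.5, float("nan"), float("-inf")]
bad_rows = non_finite_positions(target)
print(bad_rows)  # [1, 3, 4]
```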

Try to increase (significantly) min_child_weight in XGBoost or min_data_in_leaf in LightGBM:
min_data_in_leaf    oof_rmse
20000               0.052998
2000                0.053001
200                 0.053002
20                  0.053015
2                   0.054261
Actually, it may be a case of overfitting masquerading as underfitting. This happens, for instance, with zero-inflated targets in insurance claims frequency models. One solution would be to increase the representation/coverage of rare target levels (e.g. non-zero insurance claims) in each tree leaf by raising the hyperparameter that controls the minimum leaf size to rather large values, such as those in the example above.

I just had this problem and managed to fix it. The problem was that I was training with tree_method='gpu_hist', which gave all the same predictions. If I set tree_method='auto' it works properly, but with much longer runtimes. Then, when I set tree_method='gpu_hist' along with base_score=0, it worked. I think base_score should be about the mean of your predicted variable.

I have tried all solutions on this page, but none worked.
As I was grouping time series, certain frequencies created gaps in data.
I solved this issue by filling all NaN's.
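The answer does not say how the NaNs were filled. As one possible approach for gaps in a time series, here is a plain-Python forward-fill sketch; the choice of forward fill, the default value and the data are assumptions for illustration only.

```python
import math

def forward_fill(values, default=0.0):
    """Replace NaN/None entries with the last seen value (or `default` at the start)."""
    filled = []
    last = default
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            filled.append(last)
        else:
            filled.append(v)
            last = v
    return filled

series = [float("nan"), 1.0, float("nan"), float("nan"), 3.0]
filled_series = forward_fill(series)
print(filled_series)  # [0.0, 1.0, 1.0, 1.0, 3.0]
```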

Probably the hyper-parameters you use cause errors. Try using default values. In my case, this problem was solved by removing subsample and min_child_weight hyper-parameters from params.
