H2O Python: Combining XGBoost Holdout Predictions

When using:
"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True
in H2O's XGBoost Estimator, I am not able to map these cross-validated probabilities back to the original dataset. There is a documentation example for R (combining holdout predictions) but not for Python.
Any leads on how to do this in Python?

The cross-validated predictions are stored in two different places: once as a list of length k (for k folds), returned by model.cross_validation_predictions(), and again as an H2OFrame with the CV predictions in the same order as the original training rows, returned by model.cross_validation_holdout_predictions(). The latter is usually what people want (we added it later, which is why there are two versions).
Yes, unfortunately the R example that shows how to get this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (there is a ticket to fix that), and the keep_cross_validation_predictions argument documentation only shows one of the two locations.
Here's an updated example using XGBoost and showing both types of CV predictions:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# try using the `keep_cross_validation_predictions` (boolean parameter):
# first initialize your estimator, set nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)
# then train your model
xgb.train(x = x, y = y, training_frame = train)
# print the cross-validation predictions as a list
xgb.cross_validation_predictions()
# print the cross-validation predictions as an H2OFrame
xgb.cross_validation_holdout_predictions()
The frame of CV predictions looks like this:
Out[57]:
predict p0 p1
--------- --------- --------
1 0.396057 0.603943
1 0.149905 0.850095
1 0.0407018 0.959298
1 0.140991 0.859009
0 0.67361 0.32639
0 0.865698 0.134302
1 0.12927 0.87073
1 0.0549603 0.94504
1 0.162544 0.837456
1 0.105603 0.894397
[10000 rows x 3 columns]

For Python there is an example of this for GBM, and it should work exactly the same way for XGBoost. According to that page, you should be able to do something like this:
model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)
model.train(x = predictors, y = response, training_frame = train)
cv_predictions = model.cross_validation_predictions()
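If the goal is to map those CV probabilities back to the original rows, the holdout-prediction frame can simply be column-bound to the training frame, because it preserves the original row order. A minimal sketch, reusing the xgb model trained in the first example above (the p1 column assumes a binomial model, as in that example):
cv_preds = xgb.cross_validation_holdout_predictions()
# cv_preds has one row per training row, in the same order,
# so column-binding attaches each probability to its source row
train_with_preds = train.cbind(cv_preds)
train_with_preds.head()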

Related

Retrieving a negative r2 value

I have the following code applying LightGBM to the dataset (link shared below). I get a negative r2 of -2.0687981990506565. The RMSE I am getting is very low, yet the r2 value is negative. How can the model perform badly while having a very low MSE for both the train and test data?
import pandas as pd
import lightgbm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

weights_data = pd.read_csv("dataset.csv")
columns = weights_data.columns
target = columns[-1:]
features = columns[:-1]

def regressor_model():
    print()
    X = weights_data[features].to_numpy()
    Y = weights_data[target].to_numpy() * 100
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8, random_state=2021)
    regressor = lightgbm.LGBMRegressor()
    regressor.fit(x_train, y_train)
    y_pred = regressor.predict(x_test)
    r2_score_value = r2_score(y_test, y_pred)
    print(r2_score_value)
    print()
    return regressor

regressor_model()
Link for dataset https://drive.google.com/file/d/1W1G67215vNZpsU1BEiz5S4XO0XwZJhwR/view?usp=sharing
If the order of the arguments to r2_score is swapped, for instance as below, an r2 value of 0.0 is returned instead.
r2_score_value=r2_score(y_pred,y_test)
If you are getting a negative r-squared, it means your model fits the data worse than simply predicting the mean of the target, i.e. it is doing no better than a random guess. From the code above, I guess you are using the default parameters of LGBMRegressor(). You need to tune the parameters of your model; tuning them will probably solve your problem.
You can find a similar scenario here.
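As a rough sketch of what tuning could look like, here is a grid search over a few LGBMRegressor parameters, reusing x_train, y_train, x_test and y_test from the snippet above (the grid values are purely illustrative assumptions, not values known to work for this dataset). Note also that the correct argument order is r2_score(y_true, y_pred):
from sklearn.model_selection import GridSearchCV

param_grid = {
    "num_leaves": [15, 31, 63],           # illustrative values only
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300],
    "min_child_samples": [5, 20],
}
search = GridSearchCV(lightgbm.LGBMRegressor(), param_grid, scoring="r2", cv=5)
search.fit(x_train, y_train.ravel())      # flatten y to a 1-D array
print(search.best_params_, search.best_score_)

y_pred = search.best_estimator_.predict(x_test)
print(r2_score(y_test, y_pred))           # r2_score(y_true, y_pred)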

Why do predictions of multiple targets sometimes sum to 1 with sklearn RandomForestRegressor?

With a supervised learning method, we have features (inputs) and targets (outputs). If we have multi-dimensional targets whose rows sum to 1 (e.g. [0.3, 0.4, 0.3]), why does sklearn's RandomForestRegressor seem to normalize all outputs/predictions to sum to 1 when the training data sums to 1?
It seems like somewhere in the source code of sklearn it is normalizing outputs if the training data sums to 1, but I haven't been able to find it. I've gotten to the BaseDecisionTree class, which seems to be used by random forests, but haven't been able to see any normalization going on in there. I created a gist to show how it works. When the target rows don't sum to 1, the outputs of the regressor do not sum to 1. But when the target rows DO sum to 1, it seems to normalize them. Here is the demonstration code from the gist:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# simulate data
# 12 rows train, 6 rows test, 5 features, 3 columns for target
features = np.random.random((12, 5))
targets = np.random.random((12, 3))
test_features = np.random.random((6, 5))
rfr = RandomForestRegressor(random_state=42)
rfr.fit(features, targets)
preds = rfr.predict(features)
print('preds sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
# normalize targets to sum to 1
norm_targets = targets / targets.sum(axis=1, keepdims=1)
rfr.fit(features, norm_targets)
preds = rfr.predict(features)
te_preds = rfr.predict(test_features)
print('predictions all sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
print('test predictions all sum to 1?')
print(np.allclose(te_preds.sum(axis=1), np.ones(6)))
As one last note, I tried running a comparable fit in other random forest implementations (H2O in Python, in R: rpart, Rborist, RandomForest) but didn't find another implementation that allows multiple outputs.
My guess is that there is a bug in the sklearn code which is mixing up classification and regression somehow, and the outputs are being normalized to 1 like a classification problem.
What can be misleading here is that you are only looking at the resulting sum of the output values. The reason all predictions add up to 1 when the model is trained with the normalized labels is that it will only predict among the multi-output arrays it has seen during training. And this is happening because, with so few samples, the model is overfitting and the decision tree is de facto acting like a classifier.
In other words, looking at the example where the output is not normalised (the same applies to a DecisionTree):
from sklearn.tree import DecisionTreeRegressor
features = np.random.random((6, 5))
targets = np.random.random((6, 3))
rfr = DecisionTreeRegressor(random_state=42)
rfr.fit(features, targets)
If we now predict on a new set of random features, we will be getting predictions among the set of outputs the model has been trained on:
features2 = np.random.random((6, 5))
preds = rfr.predict(features2)
print(preds)
array([[0.0017143 , 0.05348525, 0.60877828], #0
[0.05232433, 0.37249988, 0.27844562], #1
[0.08177551, 0.39454957, 0.28182183],
[0.05232433, 0.37249988, 0.27844562],
[0.08177551, 0.39454957, 0.28182183],
[0.80068346, 0.577799 , 0.66706668]])
print(targets)
array([[0.80068346, 0.577799 , 0.66706668],
[0.0017143 , 0.05348525, 0.60877828], #0
[0.08177551, 0.39454957, 0.28182183],
[0.75093787, 0.29467892, 0.11253746],
[0.87035059, 0.32162589, 0.57288903],
[0.05232433, 0.37249988, 0.27844562]]) #1
So logically, if all training outputs add up to 1, the same will apply to the predicted values.
If we take the intersection of the sums along the first axis for both the targets and predicted values, we see that all predicted values' sum exists in targets:
preds_sum = np.unique(preds.sum(1))
targets_sum = np.unique(targets.sum(1))
len(np.intersect1d(targets_sum, preds_sum)) == len(features)
# True
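One way to see why the claim that "if all training outputs add up to 1, the same will apply to the predicted values" holds in general: every regression-tree prediction is an average of the training target rows that fall in a leaf, and an average of vectors that each sum to 1 also sums to 1. A quick sanity check, reusing the norm_targets array defined in the question (the particular row subset is arbitrary):
rows = norm_targets[[0, 2, 4]]      # any subset of normalized target rows
leaf_value = rows.mean(axis=0)      # what a regression-tree leaf would store
print(leaf_value.sum())             # ~1.0, so averaging preserves the property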

How do we predict on new unseen groups in a hierarchical model in PyMC3?

If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven't seen before)?
e.g. using the following logistic regression model:
from pymc3 import Model, sample, Normal, HalfCauchy, Bernoulli
import theano.tensor as tt

with Model() as varying_slope:
    mu_beta = Normal('mu_beta', mu=0., sd=1e5)
    sigma_beta = HalfCauchy('sigma_beta', 5)
    a = Normal('a', mu=0., sd=1e5)
    betas = Normal('b', mu=mu_beta, sd=sigma_beta, shape=(n_features, n_site))
    y_hat = a + tt.dot(X_shared, betas[:, site_shared])
    y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)
After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:
site_to_predict = 1
samples = 500
x = tt.matrix('X',dtype='float64')
new_site = tt.vector('new_site',dtype='int32')
n_samples = tt.iscalar('n_samples')
x.tag.test_value = np.empty(shape=(1,X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1,1))
_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
                                   size=n_samples,
                                   more_replacements={X_shared: x, site_shared: new_site})
sample_proba = theano.function([x, new_site, n_samples], _sample_proba)
pred_test = sample_proba(test_X.reshape(1, -1), np.array(site_to_predict).reshape(-1), samples)
but what is the correct way to sample from the posterior predictive distribution if we have a new, unseen site?
I'm just copying my answer from the PyMC discourse thread in case someone runs into this question, or a similar one, here.
First of all, beware of the centered hierarchical parametrization you are using; it may lead to divergences and difficulties while fitting.
That being said, your model looks more or less like a GLM with shared prior random variates mu_beta and sigma_beta across features and sites. Once you get a posterior distribution over those two, your predictions should look something like
y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)
So, that is what we will aim to get.
The way we usually recommend doing out-of-sample posterior predictive checks is with theano.shared's. Here I'll use a different approach, inspired by the functional API that is the core design idea for pymc4. There are many differences between pymc3 and the skeleton of pymc4 that I won't go into, but one thing I started to use more is factory functions to get Model instances. Instead of trying to define things inside the model with theano.shared's, I just create a new model with the new data and draw posterior predictive samples from it. I recently posted about this here.
The idea is to create the model with the training data and sample from it to get a trace. Then you have to extract from the trace the hierarchical part that is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the new data of the test site and sample from the posterior predictive using a list of dictionaries that hold the mu_beta, sigma_beta and a parts of the training trace. Here's a self-contained example (note that the model factory below uses a non-centered parametrization of the slopes):
import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt
def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
        sigma_beta = pm.HalfCauchy('sigma_beta', 5)
        a = pm.Normal('a', mu=0., sd=1)
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model

# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)

# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
                   y=np.empty(ntrain_obs, dtype=np.int32),
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_y_generator:
    train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=test_site_shared,
                   n_site=ntest_site) as test_y_generator:
    new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]

# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()

# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
                   y=new_site_Y,
                   site_shared=test_site_shared,
                   n_site=ntrain_site) as test_model:
    # We first have to extract the learnt global effect from the train_trace
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    # We have to supply the samples kwarg because it cannot be inferred if the
    # input trace is not a MultiTrace instance
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))

plt.figure()
plt.hist(ppc['y_like'], 30)
plt.axvline(new_site_Y, linestyle='--', color='r')
The posterior predictive I get looks like the histogram produced by the plotting lines above.
Of course, I don’t know what kind of data to concretely put as your X_shared, site_shared or train_y, so I just made up some nonsense toy data at the beginning of the code, you should replace that with your actual data.

Using a greedy feature selection algorithm for linear regression in Python

This is a homework problem for a machine learning course I'm taking. I'll be as descriptive as I can regarding the approaches I took, what worked, and what didn't.
We are given four types of data sets: dev_sample.npy, dev_label.npy, test_sample.npy, and test_label.npy. We first load the data set as follows:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
X_dev = np.load("./dev_sample.npy") # shape (900, 126)
y_dev = np.load("./dev_label.npy") # shape (900,)
X_test = np.load("/test_sample.npy") # shape (100, 126)
y_test = np.load("./test_label.npy") # shape (100,)
The problem we need to solve is to implement a "greedy feature selection" algorithm until the best 100 of the 126 features are selected. Basically we train models with one feature, select the best one and store it, train 125 models with each remaining feature paired with the selected, choose the next best one and store it, and continue until we have reached 100.
Here is the code:
# Define linear regression function
# You may use sklearn.linear_model.LinearRegression
# Your code here
lin_reg = LinearRegression()
# End your code
# Basic settings. DO NOT MODIFY
selected_feature = []
sel_num = 100
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)
selected_train_error = []
selected_valid_error = []
# For greedy selection
for sel in range(sel_num):
    min_train_error = +1000
    min_valid_error = +1000
    min_feature = 0
    for i in range(X_dev.shape[1]):
        train_error_ith = []
        valid_error_ith = []
        # Select feature greedy
        # Hint : There should be no duplicated feature in selected_feature
        # Your code here
        X_dev_fs = X_dev[:, i]
        if (i in selected_feature):
            continue
        else:
            pass
        # End your code
        # For cross validation
        for train_index, test_index in cv.split(X_dev):  # train_index.shape = 720, test_index.shape = 180, 5 iterations
            X_train, X_valid = X_dev_fs[train_index], X_dev_fs[test_index]
            y_train, y_valid = y_dev[train_index], y_dev[test_index]
            # Derive training error, validation error
            # You may use sklearn.metrics.mean_squared_error, model.fit(), model.predict()
            # Your code here
            model_train = lin_reg.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))
            predictions_train = model_train.predict(X_valid.reshape(-1, 1))
            train_error_ith.append(mean_squared_error(y_valid, predictions_train))
            model_valid = lin_reg.fit(X_valid.reshape(-1, 1), y_valid.reshape(-1, 1))
            predictions_valid = model_valid.predict(X_valid.reshape(-1, 1))
            valid_error_ith.append(mean_squared_error(y_valid, predictions_valid))
            # End your code
    # Select best performance feature set on each features
    # You should choose the feature which has minimum mean cross validation error
    # Your code here
    min_train_error = train_error_ith[np.argmin(train_error_ith)]
    min_valid_error = valid_error_ith[np.argmin(valid_error_ith)]
    min_feature = np.argmin(valid_error_ith)
    # End your code
    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("min_train_error: {}".format(min_train_error))
    print("min_valid_error: {}".format(min_valid_error))
    print("Selected feature of this iteration : {}".format(min_feature))
    selected_feature.append(min_feature)
    selected_train_error.append(min_train_error)
    selected_valid_error.append(min_valid_error)
The algorithm that I had in mind when filling in the #Your code sections is that X_dev_fs would hold the feature of the current iteration along with the previously selected features. We would then use cross validation to derive training and CV errors.
The current output that I get after running this program is
==================================================
# of selected feature(s) : 1
min_train_error: 9.756743239446392
min_valid_error: 9.689856536723353
Selected feature of this iteration : 1
==================================================
# of selected feature(s) : 2
min_train_error: 9.70991346883164
min_valid_error: 9.674875050182653
Selected feature of this iteration : 1
==================================================
and so on, with the # of selected feature(s) going on until 100.
The problem is that Selected feature of this iteration : should not output the same number more than once. I'm also having trouble figuring out how to store the best feature and use it with the subsequent iterations.
The questions that I have are:
Why is my selected_feature list containing the same duplicate features, and how do I prevent that?
How do I store the best feature in selected_feature, then use that paired up with each subsequent remaining feature?
Any feedback is appreciated. Thank you.
EDIT
Here are the links to the files that I am loading into the variables, in case anybody needs them.
dev_sample.npy
dev_label.npy
test_sample.npy
test_label.npy
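For reference, a minimal sketch of the greedy loop described above that tracks already-selected columns so no feature is chosen twice. It reuses X_dev, y_dev, lin_reg, cv, sel_num and mean_squared_error from the snippet above and is only an illustration of the idea, not the assignment's reference solution:
selected = []
for _ in range(sel_num):
    best_feature, best_cv_error = None, np.inf
    for i in range(X_dev.shape[1]):
        if i in selected:
            continue                          # never re-pick an already selected column
        cols = selected + [i]                 # candidate feature set for this round
        fold_errors = []
        for train_index, valid_index in cv.split(X_dev):
            lin_reg.fit(X_dev[train_index][:, cols], y_dev[train_index])
            preds = lin_reg.predict(X_dev[valid_index][:, cols])
            fold_errors.append(mean_squared_error(y_dev[valid_index], preds))
        mean_cv_error = np.mean(fold_errors)
        if mean_cv_error < best_cv_error:
            best_cv_error, best_feature = mean_cv_error, i
    selected.append(best_feature)
    print("Selected feature:", best_feature, "mean CV MSE:", best_cv_error)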

Sklearn SVR gives wrong results when the training data obviously shows a pattern

I have the following training data:
x = [
[0.914728682,5.217,5,0.217,3.150362319,33.36,35,-1.64,4.220113852],
[0.885057471,7.793,8,-0.207,3.380911063,46.84,48,-1.16,4.448243115],
[0.871345029,7.152,7,0.152,3.976205037,44.98,47,-2.02,5.421236592],
[0.821428571,8.04,8,0.04,2.909880565,52.02,54.5,-2.48,2.824104235],
[0.931372549,8.01,8,0.01,4.616714697,48.04,48,0.04,9.650462033],
[0.66367713,5.424,5.5,-0.076,1.37804878,32.6,35.5,-2.9,1.189781022],
[0.78,8.66,9,-0.34,2.272965879,48.47,55,-6.53,2.564550265],
[0.227272727,19.55,21,-1.45,1.860133206,128.23,147,-18.77,1.896893491],
[0.47826087,10.09,8,2.09,1.155519927,74.43,64,10.43,1.169547454],
[0.652694611,6.775,4,2.775,1.05529595,43.1,30,13.1,1.062885327],
[0.798561151,3.986,2,1.986,0.656563993,25.38,13,12.38,0.652442159],
[0.666666667,5.419,3,2.419,1.057985162,34.37,16,18.37,0.981719509],
[0.5625,7.719,2,5.719,0.6421797,46.91,12,34.91,0.665673336]
]
and the following labels(scores):
y = [0.237113402,0.168831169,0.104166667,0.086419753,0.063147368,0.016042781,
0.014814815,0,0,-0.0794,-0.14,-0.1832,-0.2385]
It seems clear that the larger the values in column 5 and column 9 are, the higher the scores.
I wrote the following code that makes use of SVR on the training data provided:
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

rb = RobustScaler()
xScaled = rb.fit_transform(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
But no matter which of the following I use for prediction, it is not giving a score that looks right.
1) score = model.predict(rb.fit_transform(testData))
2) score = model.predict(testData)
If I instead do something like the following during training:
xScaled = preprocessing.scale(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
and then:
score = model.predict(testData)
I get back something close to the original y.
But if I pick a row of x, put it into a 2-D array with a single row called testData, and do:
score = model.predict(testData)
I get a wrong score. In fact, no matter which row of x I use to create the one-row 2-D array, I get the same score.
What have I done wrong? I would be extremely grateful if someone can help.
1) score = model.predict(rb.fit_transform(testData))
When you do the above, you are re-fitting the RobustScaler to the new data. That means the test data is scaled to its own statistics rather than those of the training data, so the results will not be good.
2) score = model.predict(testData)
Here the test data is not scaled at all, so it is different from what the SVR has learnt. Hence the results are bad here as well.
What you need to do:
score = model.predict(rb.transform(testData))
Calling transform() scales the supplied data using the statistics learnt from the training data, and hence the SVR can predict the output properly.
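Putting it together, a minimal sketch of the intended flow (fit the scaler once on the training data, then reuse it for any test rows; the test row chosen here is just the first training row, for illustration):
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

rb = RobustScaler()
x_scaled = rb.fit_transform(x)                    # learn the scaling from the training data only
model = SVR(C=1.0, epsilon=0.1)
model.fit(x_scaled, y)

test_data = [x[0]]                                # any 2-D array with the same 9 columns
score = model.predict(rb.transform(test_data))    # reuse the already-fitted scaler
print(score)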
