Retrieving a negative r2 value - python

I have the following code applying lightgbm to the dataset (link shared below). I get a negative r2 of -2.0687981990506565. The RMSE I am getting is very low, yet the r2 value is negative. How can the model perform badly while having a very low MSE for both train and test data?
import pandas as pd
import lightgbm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

weights_data = pd.read_csv("dataset.csv")
columns = weights_data.columns
target = columns[-1:]
features = columns[:-1]

def regressor_model():
    X = weights_data[features].to_numpy()
    Y = weights_data[target].to_numpy() * 100
    x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8, random_state=2021)
    regressor = lightgbm.LGBMRegressor()
    regressor.fit(x_train, y_train)
    y_pred = regressor.predict(x_test)
    r2_score_value = r2_score(y_test, y_pred)
    print(r2_score_value)
    return regressor

regressor_model()
Link for dataset https://drive.google.com/file/d/1W1G67215vNZpsU1BEiz5S4XO0XwZJhwR/view?usp=sharing
If the order of the arguments to r2_score is swapped, for instance like below, an r2 value of 0.0 is returned instead (note that the signature is r2_score(y_true, y_pred), so argument order matters).
r2_score_value=r2_score(y_pred,y_test)

If you are getting a negative r-squared, it means your model fits worse than simply predicting the mean of the target, i.e. it is effectively no better than guessing. From the above code I guess you are using the default parameters of LGBMRegressor(). You need to tune the parameters of your model; tuning them might well solve your problem.
You can find a similar scenario here.
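As a rough illustration of what such tuning could look like (a sketch, not part of the original answer; the parameter grid and CV settings below are placeholders to adapt to the data), one option is a small grid search over LGBMRegressor parameters, scored on r2:
from sklearn.model_selection import GridSearchCV
import lightgbm

# example grid only; the right ranges depend on the dataset
param_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "min_child_samples": [5, 20, 50],
}

search = GridSearchCV(
    lightgbm.LGBMRegressor(random_state=2021),
    param_grid,
    scoring="r2",   # optimize directly for the metric being reported
    cv=5,
)
search.fit(x_train, y_train.ravel())   # x_train / y_train as created in the question's code

print(search.best_params_)
print(search.best_score_)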

Related

Optimize "recall" only when "precision">0.9 using Optuna

I have a lower bound on the precision of my model, say a Logistic Regression, of 0.9, so I want the maximum recall subject to precision > 0.9.
Recall is here defined as the ratio of the dataset I predict on in order to get a precision of at least 0.9, i.e. we remove predictions with a low confidence (predict_proba).
I have used Optuna before to find optimal hyper-parameters, but I cannot figure out how to use it with this condition.
Right now I have the following code
import optuna
from sklearn.linear_model import LogisticRegression

def objective(trial):
    c = trial.suggest_float("C", 0.001, 10)
    model = LogisticRegression(C=c)
    model.fit(X_train, y_train)
    pred_proba = model.predict_proba(X_val)
    pred = model.predict(X_val)
    thrs = trial.suggest_float("threshold", 0.5, 0.9)
    keep_idx = (pred_proba >= thrs).any(axis=1)  # predictions to keep
    recall = keep_idx.mean()  # ratio of predictions we are making
    precision = (pred[keep_idx] == y_val[keep_idx]).mean()
    if precision > 0.9:
        return recall
    else:
        return 0
but I assume we are giving Optuna a hard time here, since it does not get any feedback about the parameter space when precision < 0.9 - it just gets the same value, namely 0.
Is this the correct way of doing so or is there a better way?
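One possible way to give the sampler graded feedback even when the constraint is violated (a rough sketch, not a definitive answer; it assumes the X_train/y_train/X_val/y_val arrays from the question) is to return a penalty proportional to the precision shortfall instead of a flat 0:
import optuna
from sklearn.linear_model import LogisticRegression

def objective(trial):
    c = trial.suggest_float("C", 0.001, 10, log=True)
    thrs = trial.suggest_float("threshold", 0.5, 0.9)

    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train, y_train)

    pred_proba = model.predict_proba(X_val)
    pred = model.predict(X_val)

    keep_idx = (pred_proba >= thrs).any(axis=1)
    if keep_idx.sum() == 0:
        return -1.0  # nothing kept: worst possible value

    recall = keep_idx.mean()
    precision = (pred[keep_idx] == y_val[keep_idx]).mean()

    if precision > 0.9:
        return recall
    # below the bound: return a negative value that shrinks as precision
    # approaches 0.9, so the sampler can still rank infeasible trials
    return precision - 0.9

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)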

Error in cbind2(1, newx) %*% nbeta : Cholmod error- Lasso Regression Error in R

I'm working on predicting with a Lasso regression model.
I have 137 rows of training data and 100,000 rows of test data to predict total revenue.
To build the model I split the training data into train and test (train = rows 1-96, test = rows 97-137).
When I ran the Lasso regression to predict on the testing data I received the following error:
Error in cbind2(1, newx) %*% nbeta :
Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 88
data set sample part 1 here
data set sample part 2 here
My data split is shown here:
#model 4: LASSO
y<-log(training$revenue)
X<- model.matrix(Id~Open.Date*P26*P28+sqrt(Open.Date)*P26*P28+log(Open.Date)*P26*P28 + P1+P2+P3+P4+P5+P6+P7
+P8+P9+P10+P11+P12+P13+P14+P15+P16+P17+P18+P19+P20+
P21+P22+P23+P24+P25+P27+P29+P30+P31+P32+P33+P34+P35+P36+P37, new_Data)[,-1]
X<-cbind(new_Data$ID,X) #bind a new column ID to the data set
# split X into testing, trainig/holdout and prediction as before
X.training<-X[1:96,]
X.testing<-X[97:137,]
X.prediction<-X[138:100137,]
X.training
Running the lasso regression to select lambda on X.training:
#selecting the best penalty lambda (try different values of lambda and get the value of error in Y)
crossval <- cv.glmnet(x = X.training, y = y, alpha = 1) #create cross-validation data
plot(crossval)
penalty.lasso <- crossval$lambda.min #determine optimal penalty parameter, lambda = -4.278
log(penalty.lasso) #see where it was on the graph, and calculates the penalty
plot(crossval,xlim=c(-6,-2),ylim=c(0,0.4)) # zoom-in
lasso.opt.fit <-glmnet(x = training, y = y, alpha = 1, lambda = penalty.lasso) #estimate the model with the optimal penalty
coef(lasso.opt.fit) #resultant model coefficients
Predicting the performance (this is where I got the error):
# predicting the performance on the testing set
lasso.testing <- exp(predict(lasso.opt.fit, s = penalty.lasso, newx= X.testing))
mean(abs(lasso.testing-X.testing$revenue)/X.testing$revenue*100) #calculate and display MAPE
Do you know what could be causing this error and possible ways to fix it?
Thank you

How to balance training set in python?

I'm trying to apply a baseline model to my data set, but the data set is imbalanced and only 11% of the data belongs to the positive category. When I split the data without resampling, the recall for positive records is very low. I want to balance the training data (0.5 negative, 0.5 positive) without balancing the testing data. Does anyone know how to do that?
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd

# splitting train and test data
train, test = train_test_split(coupon, test_size=0.3, random_state=100)

# separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]

# Function attributes
# dataframe      - processed dataframe
# algorithm      - algorithm used
# training_x     - predictor variables dataframe (training)
# testing_x      - predictor variables dataframe (testing)
# training_y     - target variable (training)
# testing_y      - target variable (testing)
# cf             - ["coefficients","features"] (coefficients for logistic
#                  regression, feature importances for tree based models)
# threshold_plot - if True, returns threshold plot for model
def coupon_use_prediction(algorithm, training_x, testing_x,
                          training_y, testing_y, cols, cf, threshold_plot):
    # model
    algorithm.fit(training_x, training_y)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)

    # coeffs
    if cf == "coefficients":
        coefficients = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features":
        coefficients = pd.DataFrame(algorithm.feature_importances_)

    column_df = pd.DataFrame(cols)
    coef_sumry = pd.merge(coefficients, column_df, left_index=True,
                          right_index=True, how="left")
    coef_sumry.columns = ["coefficients", "features"]
    coef_sumry = coef_sumry.sort_values(by="coefficients", ascending=False)

    print(algorithm)
    print("\n Classification report : \n", classification_report(testing_y, predictions))
    print("Accuracy Score : ", accuracy_score(testing_y, predictions))
You have two ways of balancing data: up-sampling or down-sampling.
Up-sampling: duplicate the under-represented data.
Down-sampling: sample from the over-represented data.
Up-sampling is pretty easy to do by hand.
For down-sampling you can use sklearn.utils.resample and provide the number of samples you want to get.
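As a rough sketch of that idea (assuming, as in the question, a train DataFrame and that target_col is the name of the binary label column), down-sampling the majority class could look like this:
from sklearn.utils import resample
import pandas as pd

positive = train[train[target_col] == 1]   # minority class (~11%)
negative = train[train[target_col] == 0]   # majority class

# sample the majority class down to the size of the minority class
negative_downsampled = resample(
    negative,
    replace=False,
    n_samples=len(positive),
    random_state=100,
)

train_balanced = pd.concat([positive, negative_downsampled])
train_X = train_balanced[cols]
train_Y = train_balanced[target_col]
# test_X / test_Y stay untouched, so evaluation reflects the real distribution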
Please note that, as @paritosh-singh mentioned, changing the distribution may not be the only solution. There are machine learning algorithms that can:
- support imbalanced data
- use a built-in weighting option to take the data distribution into account

H2O Python: Combining XGB Holdout Predictions

When using:
"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True
in H2O's XGBoost Estimator, I am not able to map these cross-validated probabilities back to the original dataset. There is a documentation example for R but not for Python (combining holdout predictions).
Any leads on how to do this in Python?
The cross-validated predictions are stored in two different places -- once as a list of length k (for k folds) in model.cross_validation_predictions(), and once as an H2OFrame with the CV predictions in the same order as the original training rows in model.cross_validation_holdout_predictions(). The latter is usually what people want (we added it later, which is why there are two versions).
Yes, unfortunately the R example that shows how to get this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (there is a ticket to fix that). The keep_cross_validation_predictions argument documentation only shows one of the two locations.
Here's an updated example using XGBoost and showing both types of CV predictions:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# try using the `keep_cross_validation_predictions` (boolean parameter):
# first initialize your estimator, set nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)
# then train your model
xgb.train(x = x, y = y, training_frame = train)
# print the cross-validation predictions as a list
xgb.cross_validation_predictions()
# print the cross-validation predictions as an H2OFrame
xgb.cross_validation_holdout_predictions()
The CV pred frame of predictions looks like this:
Out[57]:
predict p0 p1
--------- --------- --------
1 0.396057 0.603943
1 0.149905 0.850095
1 0.0407018 0.959298
1 0.140991 0.859009
0 0.67361 0.32639
0 0.865698 0.134302
1 0.12927 0.87073
1 0.0549603 0.94504
1 0.162544 0.837456
1 0.105603 0.894397
[10000 rows x 3 columns]
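Since the holdout predictions come back in the same row order as the training frame, mapping them onto the original dataset can be done with a column bind - a small sketch using the xgb and train objects from the example above:
# attach the CV holdout predictions to the original training frame
cv_preds = xgb.cross_validation_holdout_predictions()
train_with_cv = train.cbind(cv_preds)   # same number of rows, same order
train_with_cv.head()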
For Python there is an example of this for GBM, and it should be exactly the same for XGBoost. According to that page, you should be able to do something like this:
model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)
model.train(x = predictors, y = response, training_frame = train)
cv_predictions = model.cross_validation_predictions()

Sklearn SVR gives wrong results when the training data obviously shows a pattern

I have the following training data:
x = [
[0.914728682,5.217,5,0.217,3.150362319,33.36,35,-1.64,4.220113852],
[0.885057471,7.793,8,-0.207,3.380911063,46.84,48,-1.16,4.448243115],
[0.871345029,7.152,7,0.152,3.976205037,44.98,47,-2.02,5.421236592],
[0.821428571,8.04,8,0.04,2.909880565,52.02,54.5,-2.48,2.824104235],
[0.931372549,8.01,8,0.01,4.616714697,48.04,48,0.04,9.650462033],
[0.66367713,5.424,5.5,-0.076,1.37804878,32.6,35.5,-2.9,1.189781022],
[0.78,8.66,9,-0.34,2.272965879,48.47,55,-6.53,2.564550265],
[0.227272727,19.55,21,-1.45,1.860133206,128.23,147,-18.77,1.896893491],
[0.47826087,10.09,8,2.09,1.155519927,74.43,64,10.43,1.169547454],
[0.652694611,6.775,4,2.775,1.05529595,43.1,30,13.1,1.062885327],
[0.798561151,3.986,2,1.986,0.656563993,25.38,13,12.38,0.652442159],
[0.666666667,5.419,3,2.419,1.057985162,34.37,16,18.37,0.981719509],
[0.5625,7.719,2,5.719,0.6421797,46.91,12,34.91,0.665673336]
]
and the following labels(scores):
y = [0.237113402,0.168831169,0.104166667,0.086419753,0.063147368,0.016042781,
0.014814815,0,0,-0.0794,-0.14,-0.1832,-0.2385]
It seems clear that the larger the values in column 5 and column 9 are, the higher the scores.
I wrote the following code, which makes use of SVR on the training data provided:
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

rb = RobustScaler()
xScaled = rb.fit_transform(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
But no matter which of the following I use for prediction, it is not giving a score that looks right.
1) score = model.predict(rb.fit_transform(testData))
2) score = model.predict(testData)
If I do something like the following during training:
from sklearn import preprocessing

xScaled = preprocessing.scale(x)
model = SVR(C=1.0, epsilon=0.1)
model.fit(xScaled, y)
then:
score = svmModel.predict(testData)
I get back something close to the original y.
But if I pick a row from x, put it in a 2D array with one row called testData, and do:
score = svmModel.predict(testData)
I get a wrong score. In fact, no matter which row of x I use to create the 2D array with one row, I get the same score.
What have I done wrong? I would be extremely grateful if someone can help.
1) score = model.predict(rb.fit_transform(testData))
When you do the above, you are re-fitting the RobustScaler on the new data. That means the test data is scaled according to its own statistics, which will not match the scales learnt from the training data, so the results will not be good.
2) score = model.predict(testData)
Here you are not scaling the test data at all, so it is different from what the SVR has learnt on. Hence the results will be bad here as well.
What you need to do:
score = model.predict(rb.transform(testData))
Calling transform() scales the supplied data using the scales learnt from the training data, and hence the SVR can predict the output properly.
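Putting it together, here is a minimal sketch (assuming the x, y and a 2D testData array from the question) that keeps the scaler and the SVR in sync by wrapping them in a Pipeline, so fitting and predicting always use the training-data scales:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# the pipeline fits the scaler on the training data only and reuses
# those same scales whenever predict() is called on new data
model = make_pipeline(RobustScaler(), SVR(C=1.0, epsilon=0.1))
model.fit(x, y)

score = model.predict(testData)   # testData: 2D array with the same 9 columns as x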
