I'm trying to build a distinct classifier for each dataset in a loop. Here's the pseudocode of my program:
for dataset in total_data:
    X_train, X_test, y_train, y_test = Somefunction(dataset)
    model = XGBClassifier()
    model.fit(X_train, y_train)
Here, the labels in y_train take values from [0, 1, 2]. However, I just found that in some datasets y_train contains only 1 and 2. And because XGBoost requires labels to start at 0, it raises an error.
Is there any elegant way to tell XGBoost that there are 3 kinds of labels even if only two exist in the train set?
Update: I tried passing num_class=3 as an argument to XGBClassifier, but it raised a different kind of error: Invalid shape of labels.
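One common workaround (my sketch, not from the original post) is to sidestep the constraint by re-encoding each dataset's labels with scikit-learn's LabelEncoder so they are consecutive integers starting at 0, then mapping predictions back with inverse_transform:

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

for dataset in total_data:
    X_train, X_test, y_train, y_test = Somefunction(dataset)

    le = LabelEncoder()
    y_train_enc = le.fit_transform(y_train)  # e.g. labels {1, 2} become {0, 1}

    model = XGBClassifier()
    model.fit(X_train, y_train_enc)

    # map encoded predictions back to the original label values
    y_pred = le.inverse_transform(model.predict(X_test))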
My dataset consists of 10 features (10 columns) as input, and the last 3 columns are 3 different outputs. If I use one column for the output, for example y = newDf.iloc[:, 10].values, it works; but if I use all 3 columns it gives me an error at pipe_lr.fit: y should be a 1d array, got an array of shape (852, 3) instead.
How can I pass y?
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = newDf.iloc[:, 0:10].values
y = newDf.iloc[:, 10:13].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))
pipe_lr.fit(X_train, y_train)
The pipeline itself does not care about the format of y; it just hands it over to each step. In your case it's the LogisticRegression, which indeed is not set up for multi-label classification. You can manage it using the MultiOutputClassifier wrapper:
from sklearn.multioutput import MultiOutputClassifier

pipe_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    MultiOutputClassifier(LogisticRegression(random_state=1, solver='lbfgs'))
)
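After fitting, the wrapped pipeline predicts one column per output; a small usage sketch under the same assumptions as the question's code:

pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)  # expected shape: (n_samples, 3)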
(There is also a MultiOutputRegressor, and more complicated things like ClassifierChain and RegressorChain; see the User Guide. However, there is not, to my knowledge, a built-in way to mix and match regression and classification tasks.)
Simply put, no.
What you want is called multi-label (multi-output) learning, which LogisticRegression does not support directly.
You could train three separate models, one per label column, as sketched below.
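A minimal sketch of that per-label approach (my sketch, assuming the newDf layout from the question: columns 0-9 are features, columns 10-12 are the three labels):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = newDf.iloc[:, 0:10].values
models = []
for col in range(10, 13):  # one model per output column
    y_col = newDf.iloc[:, col].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_col, test_size=0.2, random_state=1)
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=2),
                          LogisticRegression(random_state=1, solver='lbfgs'))
    model.fit(X_train, y_train)
    models.append(model)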
How do I set up my train/test df for scikit-learn's RandomForestClassifier with multiple categories? How do I predict?
My training dataset has a categorical Outcome column with 4 classes, and I want to predict which of those four is most likely for my test data. Looking at other questions, I tried using pandas get_dummies to encode four new columns into the original df in place of the original Outcome column, but I wasn't sure how to indicate to the classifier that those four columns were the categories, so I used y = df_raw['Outcomes'].values instead.
I then split the training set 80/20 and called fit() with the resulting X_train, X_valid and y_train, y_valid:
def split_vals(a, n): return a[:n].copy(), a[n:].copy()

n_valid = 10000
n_trn = len(df_raw_dumtrain) - n_valid
raw_train, raw_valid = split_vals(df_raw_dumtrain, n_trn)
X_train, X_valid = split_vals(df_raw_dumtrain, n_trn)
y_train, y_valid = split_vals(y, n_trn)  # y = df_raw['Outcomes'].values, as described above

random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train, y_train)
Y_prediction = random_forest.predict(X_train)
I then tried running predict() on the test set as:
test_pred = random_forest.predict(df_test)
But I get an error:
ValueError: Number of features of the model must match the input.
Model n_features is 27 and input n_features is 28
How should I be configuring my test set?
You have to remove the target variable from the test data and then pass the remaining columns of the dataframe to the prediction function. That will resolve the feature-count mismatch.
Try this!
random_forest.predict(df_test.drop('Outcomes',axis=1))
Note: you don't have to create dummy variables for the target variable when using random forests or any other decision-tree-based model.
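If the dummy-encoded test columns can drift from the training ones, one extra safeguard (my suggestion, not part of the answer above) is to re-align the test frame to the training columns before predicting:

X_test = df_test.drop('Outcomes', axis=1)
# keep exactly the columns seen during training; fill any missing dummies with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
test_pred = random_forest.predict(X_test)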
I've tried out linear regression using sklearn. I have data along the lines of:

Calories Eaten | Weight
150            | 150
300            | 190
350            | 200
These are made-up numbers, but I've fit the dataset to the linear regression model.
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test) ??
But how would I take just my 10 new Calories Eaten numbers and make them the test set that I want the regressor to predict on?
You are correct, you simply call the predict method of your model and pass in the new unseen data for prediction. Now it also depends on what you mean by new data. Are you referencing data that you do not know the outcome of (i.e. you do not know the weight value), or is this data being used to test the performance of your model?
For new data (to predict on):
Your approach is correct. You can access all predictions by simply printing the y_pred variable.
If you know the respective weight values and want to evaluate the model:
Make sure that you have two separate data sets: x_test (containing the features) and y_test (containing the labels). Generate the predictions as you are doing with the y_pred variable; then you can calculate the model's performance using a number of performance metrics. The most common one is the root mean squared error (RMSE), and you simply pass y_test and y_pred as parameters. Here is a list of all the regression performance metrics supplied by sklearn.
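For example, a short sketch of computing RMSE with sklearn.metrics:

import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))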
If you do not know the weight value of the 10 new data points:
Use train_test_split to split your initial data set into 2 parts: training and testing. You would have 4 datasets: x_train, y_train, x_test, y_test.
from sklearn.model_selection import train_test_split

# random_state can be any fixed number (to ensure the same split); test_size indicates a 25% cut
x_train, x_test, y_train, y_test = train_test_split(calories_eaten, weight, test_size=0.25, random_state=42)
Train the model by fitting it on x_train and y_train. Then evaluate the model's performance by predicting on x_test and comparing those predictions with the actual results from y_test. This way you get an idea of how the model performs. Finally, you can predict the weight values for the 10 new data points accordingly.
It is also worth reading further on the topic as a beginner. This is a simple tutorial to follow.
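A minimal end-to-end sketch of that workflow; the calories_eaten/weight arrays here are made-up stand-ins for the question's data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# toy data shaped like the question's table
calories_eaten = np.array([[150], [300], [350], [400], [500]])
weight = np.array([150, 190, 200, 210, 240])

x_train, x_test, y_train, y_test = train_test_split(
    calories_eaten, weight, test_size=0.25, random_state=42)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

# evaluate on the held-out split
y_pred = regressor.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# predict weight for new, unseen calorie counts (hypothetical values)
new_calories = np.array([[220], [330]])
new_weight_pred = regressor.predict(new_calories)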
You have to split your dataset using train_test_split from sklearn.model_selection, then train the model by fitting it on the training set:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(eaten, weight)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
What I'm confused on is, how would I go about predicting with new data, say I got 10 new numbers of Calories Eaten, and I want it to predict Weight?
Yes: Calories Eaten represents the independent variable, while Weight represents the dependent variable.
After you split the data into a training set and a test set, the next step is to fit the regressor using the X_train and y_train data.
Once the model is trained, you can predict the results for X_test, which gives you y_pred.
Now you can compare y_pred (the predicted data) with y_test (the real data).
You can also use the score method of your linear model to get the performance of your model.
score is calculated using the R^2 (R squared) metric, i.e. the coefficient of determination.
score = regressor.score(x_test, y_test)
For splitting the data you can use train_test_split method.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(eaten, weight, test_size=0.2, random_state=0)
I am using LinearRegression(). Below you can see what I have already done to predict new values:
lm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)  # any fixed seed
lm.fit(X_train, y_train)
lm.predict(X_test)
scr = lm.score(X_test, y_test)
lm.fit(X, y)
pred = lm.predict(X_real)
Do I really need the line lm.fit(X, y), or can I go without it? Also, if I don't need to calculate accuracy, do you think the following approach is better than using training and testing? (In case I don't want to test:)
lm.fit(X, y)
pred = lm.predict(X_real)
Even though I am getting 0.997 accuracy, the predicted values are not close to the real ones, or are shifted. Are there ways to make the prediction more accurate?
You don't need to fit multiple times to predict a value from given features, since your algorithm has already learned from the train set. Check the code below.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)

# Teach your algorithm with the train set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Now it can predict
y_pred = lr.predict(X_test)

# Use the test set to see how accurately it predicts; score expects the features and the true labels
lr_score = lr.score(X_test, y_test)
The reason you are getting an almost 100% accuracy score is data leakage, caused by the following line of code:
lm.fit(X, y)
In the line above you gave your model ALL the data, and then you tested its predictions on a subset of data that the model had already seen.
This produces a very high accuracy score on the already-seen data, but the model will usually perform badly on unseen data.
When do you want / need to fit your model multiple times?
If you are getting new training data and want to improve your model by training it on a new portion of data, then you may want to choose one of the regression algorithms that support incremental learning.
In that case you would use the model.partial_fit() method instead of model.fit(), as sketched below.
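For illustration, a minimal sketch with SGDRegressor, one of the scikit-learn regressors that implements partial_fit (the batch variable names are placeholders):

from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=0)
model.partial_fit(X_train, y_train)          # initial fit on the first batch

# later, when a new batch of training data arrives:
model.partial_fit(X_new_batch, y_new_batch)
pred = model.predict(X_real)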
I'm trying to predict a value. I'm able to predict when I use my real target value (a number of days), but when I try to predict using the log of the value, it gives me an error. I'm using sklearn and random forests.
The code:
X = final_pressure_df.drop(['y', 'log_y', 'patient_id', 'wound_id'], axis=1)
Y = final_pressure_df['log_y']
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(X, Y, test_size=0.4, random_state=5)
forest = RandomForestClassifier(criterion='entropy', n_estimators=200, max_depth=100, random_state=5)
forest.fit(X_train, Y_train)
The error: ValueError: Unknown label type: array([[ 3.91202301]
Can someone help me please?
You need regression, not classification, so use RandomForestRegressor.
Classification will not work when the variable being predicted is real-valued (float). And even in your first case, when you are predicting the number of days, it still makes more sense to use regression, since you are predicting some value, and the number of days is not a class/category. A corrected sketch follows.
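A minimal regression sketch under the question's dataframe layout (note that train_test_split has moved from the old sklearn.cross_validation module to sklearn.model_selection, and RandomForestRegressor does not take the entropy criterion, so it is dropped here):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X = final_pressure_df.drop(['y', 'log_y', 'patient_id', 'wound_id'], axis=1)
Y = final_pressure_df['log_y']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=5)

forest = RandomForestRegressor(n_estimators=200, max_depth=100, random_state=5)
forest.fit(X_train, Y_train)
log_y_pred = forest.predict(X_test)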