I'm trying to use three binary explanatory variables from a banking history (default, housing, and loan) to predict a binary response variable with a logistic regression classifier.
I have the following dataset:
# mapping to convert text no/yes to integer 0/1
convert_to_binary = {'no' : 0, 'yes' : 1}
default = bank['default'].map(convert_to_binary)
housing = bank['housing'].map(convert_to_binary)
loan = bank['loan'].map(convert_to_binary)
response = bank['response'].map(convert_to_binary)
I added my three explanatory variables and the response to an array:
data = np.array([np.array(default), np.array(housing), np.array(loan),np.array(response)]).T
kfold = KFold(n_splits=3)
scores = []
for train_index, test_index in kfold.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = response[train_index], response[test_index]
    model = LogisticRegression().fit(X_train, y_train)
    pred = model.predict(data[test_index])
    results = model.score(X_test, y_test)
    scores.append(results)
print(np.mean(scores))
My accuracy is always 100%, which I know is not correct; it should be somewhere around 50-65%.
Is there something I'm doing wrong?
The feature array is not correct: data includes the response column, so the model is trained on the target itself, which is why the accuracy is always 100%. Build the feature array from the three explanatory variables only and keep the response separate:
data = np.array([np.array(default), np.array(housing), np.array(loan)]).T
Then, inside the loop, the split and scoring become:
X_train, y_train = data[train_index], response[train_index]
X_test, y_test = data[test_index], response[test_index]
model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
acc = sklearn.metrics.accuracy_score(y_test, pred)
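As a cross-check, the same 3-fold evaluation can be done in one call with cross_val_score; a minimal sketch, assuming the same bank DataFrame and mapped columns as above:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# features only; the response stays out of the feature matrix
X = np.array([np.array(default), np.array(housing), np.array(loan)]).T
y = np.array(response)

scores = cross_val_score(LogisticRegression(), X, y, cv=KFold(n_splits=3))
print(scores.mean())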
I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
X = df.drop(columns = 'Product Number', axis = 1)
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
SVC = LinearSVC(max_iter = 1200)
SVC.fit(X_train, y_train)
y_pred = SVC.predict(X_test)
Is there any way for me to recover the product number and its features for an item that has passed or failed? How do I relate the results in y_pred to the product number each prediction corresponds to?
I also plan on using cross-validation, so the data gets shuffled; would there still be a way for me to recover the product number for each test item?
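One simple way to relate predictions back to rows (a minimal sketch, assuming the df, X_test, y_test, and y_pred above): train_test_split keeps the original DataFrame index on X_test, so the product numbers can be looked up by index:
results = df.loc[X_test.index, ['Product Number']].copy()  # recover the identifier by index
results['actual'] = y_test.values
results['predicted'] = y_pred
print(results.head())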
I realised I'm using cross-validation only to evaluate my model's performance, so I decided to just run my code without shuffling the data to see the results for each data point.
Edit: For evaluation without cross-validation, I drop the irrelevant columns only when I pass the data to the classifier, as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join y_val_pred back after prediction to check which data points have been misclassified.
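If cross-validation is used later after all, predictions can still be tied back to each row: cross_val_predict returns one out-of-fold prediction per row, in the original row order. A minimal sketch, assuming the X, y, cols, and knn pipeline above:
from sklearn.model_selection import cross_val_predict

# out-of-fold predictions, aligned with the rows of X
oof_pred = cross_val_predict(knn, X.drop(columns=cols), y, cv=5)
checked = X[['id']].copy()      # keep the identifier column
checked['label'] = y
checked['pred'] = oof_pred
misclassified = checked[checked['label'] != checked['pred']]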
I've done some research on this matter. Most people argue about which one to use with gradient boosting. https://www.kaggle.com/c/home-credit-default-risk/discussion/59873 Most Kagglers there explain that label encoding is used for ordinal values and one-hot encoding for categorical ones.
For example: genius, smart, less-smart should be label encoded as genius=3, smart=2, and less-smart=1.
However, according to the sklearn documentation, label encoding should only be used for target values. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
So I've trained the model twice, converting one of my features with one-hot encoding and with a label encoder, to see which predicts better.
(According to the sklearn docs, LabelEncoder should be used for target values, so is it wrong that I've used it on one of my independent variables (X)?)
First, using the label encoder:
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(data_set["region"])
data_set['region'] = label_encoder.transform(data_set["region"])
X, y = data_set.iloc[:, 1:6], data_set.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0)
gbr_le = GradientBoostingRegressor(
n_estimators = 1000,
learning_rate = 0.1,
random_state = 0
)
model = gbr_le.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'RMSE with label encoding (best_num_trees) = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')
>>> RMSE with label encoding (best_num_trees) = 4.881378370139346
Now, this time using one-hot encoding:
regions = pd.get_dummies(data_set_onehot.region)
data_set_onehot.drop(columns=['region'], inplace=True)
X, y = data_set_onehot.iloc[:, 1:5], data_set_onehot.iloc[:,-1]
X = pd.concat([X, regions], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0)
gbr_onehot = GradientBoostingRegressor(
n_estimators = 1000,
learning_rate = 0.1,
random_state = 0
)
model = gbr_onehot.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'RMSE with one-hot encoding (best_num_trees) = {np.sqrt(metrics.mean_squared_error(y_test, y_pred))}')
>>> RMSE with one-hot encoding (best_num_trees) = 4.86943654679908
This time the model trained with one-hot encoding gave the better RMSE score. However, I don't think I can judge from a single split, since the RMSE changes depending on the random_state value passed to train_test_split or GradientBoostingRegressor.
What would be the best way to compare the accuracy of two models with different feature preprocessing?
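One way to make the comparison less dependent on a single random split is to score both encodings on the same cross-validation folds. A minimal sketch, where X_le and X_oh are hypothetical names for the label-encoded and one-hot feature matrices built above, and y is the same target:
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, features in [('label encoding', X_le), ('one-hot', X_oh)]:
    gbr = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1, random_state=0)
    # identical folds for both feature sets; report RMSE (lower is better)
    scores = cross_val_score(gbr, features, y, cv=cv, scoring='neg_root_mean_squared_error')
    print(name, -scores.mean())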
I am running a random forest model (RandomForestRegressor) with the scikit-learn library, and when assessing its accuracy using mean absolute error I get a ValueError:
"Found input variables with inconsistent numbers of samples".
The pandas DataFrame I use comes from a real-estate CSV file that gets modified after I flag and impute some missing values. Please see the code below.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
X_train_plus = X_train.copy()
X_valid_plus = X_train.copy()
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull()
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull()
my_imputer = SimpleImputer()
imp_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imp_X_valid_plus = pd.DataFrame(my_imputer.fit_transform(X_valid_plus))
imp_X_train_plus.columns = X_train_plus.columns
imp_X_valid_plus.columns = X_valid_plus.columns
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion="mae", random_state=0)
model_4 = RandomForestRegressor(n_estimators=100, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
models = [model_1, model_2, model_3, model_4, model_5]
def score_model(model, imp_X_train_plus, y_train, imp_X_valid_plus, y_valid):
    model.fit(imp_X_train_plus, y_train)
    pred = model.predict(imp_X_valid_plus)
    return mean_absolute_error(y_valid, pred)

for i in range(0, len(models)):
    mae = score_model(models[i], imp_X_train_plus, y_train, imp_X_valid_plus, y_valid)
    print("Model %d with extended imputed has a MAE: %d" % (i+1, mae))
I expect the output to be something similar to this:
"Model 1 with extended imputed has a MAE: 345237"
But instead I get the following ValueError when return calls mean_absolute_error inside the score_model function:
"ValueError: Found input variables with inconsistent numbers of samples: [2716, 10864]"
I think the mistake might be in the imp_X_train_plus and imp_X_valid_plus variables; however, I have run a very similar model that uses these DataFrames and it works fine.
It might be because of your third line: X_valid_plus = X_train.copy().
Did you mean to do X_valid_plus = X_valid.copy()?
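For reference, a minimal sketch of the corrected line, together with the common pattern of fitting the imputer on the training data only and just transforming the validation data:
X_valid_plus = X_valid.copy()   # copy the validation split, not the training split

my_imputer = SimpleImputer()
imp_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imp_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))  # transform only, no refit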
I want to test a specific row from my dataset and see the result, but I don't know how to do it. For example, I want to test row number 100 and then see the accuracy.
feature_cols = [0,1,2,3,4,5]
X = df[feature_cols] # Features
y = df[6] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
random_state=1)
#Create Decision Tree classifer object
clf = DecisionTreeClassifier(max_depth=5)
#Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
I recommend excluding the row you want to test from the dataset.
test_row=100
train_idx=np.arange(X.shape[0])!=test_row
test_idx=np.arange(X.shape[0])==test_row
X_train=X[train_idx]
y_train=y[train_idx]
X_test=X[test_idx]
y_test=y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.
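If an accuracy figure over many single-row tests is wanted, leave-one-out cross-validation repeats this for every row. A minimal sketch, assuming the same X, y, and DecisionTreeClassifier settings as above:
from sklearn.model_selection import LeaveOneOut, cross_val_score

# each fold trains on all rows except one and tests on the held-out row;
# the mean over all folds gives a meaningful accuracy
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=LeaveOneOut())
print("Leave-one-out accuracy:", scores.mean())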
I would like to do K-fold cross-validation. The code before K-fold cross-validation looks like this, and it works perfectly:
df = pd.read_csv('finalupdatedothers-multilabel.csv')
X = df[['sentences']]
dfy = df[['ADR','WD','EF','INF','SSI','DI','others']]
df1 = dfy.stack().reset_index()
df1.columns = ['a','b','c']
y_train_text = df1.groupby('a')['b'].apply(list)
lb = preprocessing.MultiLabelBinarizer()
# Run classifier
stop_words = stopwords.words('english')
classifier=make_pipeline(CountVectorizer(),
TfidfTransformer(),
#SelectKBest(chi2, k=4),
OneVsRestClassifier(SGDClassifier()))
#combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
random_state = np.random.RandomState(0)
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y_train_text, test_size=.2,
                                                    random_state=random_state)
print(y_train)
# # Binarize the output classes
Y = lb.fit_transform(y_train)
Y_test=lb.transform(y_test)
classifier.fit(X_train, Y)
y_score = classifier.fit(X_train, Y).decision_function(X_test)
print ("y_score"+str(y_score))
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
#print accuracy_score
print ("accuracy : "+str(accuracy_score(Y_test, predicted)))
print ("micro f-measure "+str(f1_score(Y_test, predicted, average='weighted')))
print("precision"+str(precision_score(Y_test,predicted,average='weighted')))
print("recall"+str(recall_score(Y_test,predicted,average='weighted')))
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
When I change the code to use k-fold cross-validation instead of train_test_split, I get this error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 6008]
Update (using iloc): my code for k-fold cross-validation looks like this:
kf = KFold(n_splits=10)
kf.get_n_splits(X)
KFold(n_splits=2, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y_train_text.iloc[train_index], y_train_text.iloc[test_index]
Would you please let me know which part I'm doing incorrectly?
My data looks like this:
,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1.0,,,,,,
1,I am detoxing from Lexapro now.,,,,,,,1.0
2,I slowly cut my dosage over several months and took vitamin supplements to help.,,,,,,,1.0