I've built a logistic regression model for car loans, with "is the loan in default, yes or no" as the binary dependent variable, around 20 independent variables, and 3,327 records in the data set.
I split the underlying data into a training set and a test set. However, after I fit the model on the training data and ask it to predict for the test data, I get an output of all "0"s, when there should be some "1"s in there, given that roughly 12% of the training set has a "1" for the binary default/no-default variable.
I've looked at the test and training sets, which all look fine pre- and post-splitting (no missing values, categorical variables are dummies, and the training/test subsets correctly pick records at random, so no breakdown there as far as I can see).
Interestingly, predict_proba shows that the predicted probability of a "0" is high for every output element (0.7-0.9). I'd rather leave the default threshold at 0.5, but I'm not sure how best to clear up this mess.
Is it simply a case of needing more data given the number of independent variables, or am I missing something / did something wrong?
Thanks!
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
import statsmodels.api as sm
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
#open the file
data = pd.read_csv(r"log reg test Lending club 2007-2011 and 2014 car only no dummy trap.csv")
print(data.shape)
##print(list(data.columns))
print(data['Distressed'].value_counts()) ## check number of defaulted car loans is binary
sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary
##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())
##testing for nulls in dataset
print('Number of missing data points per column', data.isnull().sum())
scrub_data=data.drop(['mths_since_last_delinq'], axis=1) ## this variable is not statistically significant
print('Here is the sample showing no missing data')
print(scrub_data.isnull().sum()) ## dropped the column with missing values; sample still sufficiently large
#scrub_data['intercept']=0
print(list(scrub_data.columns))
print(scrub_data.head())
##convert categorical variables to dummies completed in csv file
## Agrade and Own dummies removed to avoid dummy variable trap and are treated as the base case here
X=scrub_data.iloc[:,[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,22]].values ## .iloc replaces the removed .ix indexer
y=scrub_data.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=0)
print('Here are the X components', X)
print('Here are the y components', y)
print('Here are the X values of the training', X_train)
print('Here are the y train values', y_train)
print('Here are the y test values', y_test)
model=LogisticRegression()
model.fit(X_train,y_train) ##Model is learning the relationship between X_train and y_train
y1_pred=model.predict(X_train)
print('y predict of train data', y1_pred)
print('Here is the Model Score', model.score(X_train,y_train)) ##check accuracy of training set
print('What percentage defaulted', y_train.mean()) ##what percentage defaulted
print('What percentage of test set defaulted', y_test.mean()) ##what percentage defaulted
print('X test values', X_test) ## check test subset values
y_pred=model.predict(X_test)
probs=model.predict_proba(X_test)
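For reference, below is a minimal sketch (not part of the original script) of two common ways to surface the minority class, reusing model, X_train, y_train and X_test from above; the 0.2 cutoff is purely illustrative:
# Option 1: reweight classes so the ~12% minority class counts more during fitting
balanced = LogisticRegression(class_weight='balanced')
balanced.fit(X_train, y_train)
print('balanced-model predictions', balanced.predict(X_test))
# Option 2: keep the fitted model but classify with a threshold below the default 0.5
probs_default = model.predict_proba(X_test)[:, 1]  # predicted P(Distressed = 1)
y_pred_lower = (probs_default >= 0.2).astype(int)  # 0.2 is illustrative only
print('custom-threshold predictions', y_pred_lower)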
I get the following error when trying to compute the score or mean squared error of a single sample: "Found input variables with inconsistent numbers of samples". This is for a DecisionTreeRegressor model from sklearn that predicts the chance of having a heart attack based on 13 other parameters. The model seems to work, but the metrics tests always give me this kind of error regardless of how I transform the data. It's because the test sample is 1 row while the training data is 303 rows, but I don't know how to fit them together.
import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
heart = pd.read_csv('heart.csv')
test = pd.read_csv('Heart_test.csv')
X = heart.iloc[:,0:13]
Y = heart.iloc[:,13:14]
test = test.iloc[:,0:13]
#print(X.head(), '\n', test.head(),'\n')
model = tree.DecisionTreeRegressor()
model = model.fit(X,Y)
y_prediction = model.predict(test)
print(mean_squared_error(Y, y_prediction))  # fails: Y has 303 rows, y_prediction has 1
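Since Heart_test.csv apparently carries no label column, one option (a sketch under that assumption, not the only fix) is to score on a held-out slice of heart.csv, so that y_true and y_pred have matching lengths:
from sklearn.model_selection import train_test_split
# hold out 20% of the labelled data for scoring
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
held_out_model = tree.DecisionTreeRegressor()
held_out_model.fit(X_tr, y_tr)
print(mean_squared_error(y_te, held_out_model.predict(X_te)))  # shapes now agree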
I am doing a project and trying to show some BASIC elements of scikit-learn in Python. My goal is to create about three simple examples and show how the library learns and predicts. I am applying a simple sine-wave-type pattern and have been working from a good example online:
https://mclguide.readthedocs.io/en/latest/sklearn/regression.html
My problem is that, since I am new to this library and to ML in general, I don't understand what I have in front of me or how to transform it into the output I am going for. The two problems I am struggling with are a linear regression on a sine wave and a Gaussian regression on a more complicated wave. The output I am getting, per the article, is the accuracy, and that works as intended, but what I am trying to get to is how to plot the predicted output on top of (or as an extension of) the training data, to visually show how it did. I think the data is in here; I am either using the wrong methods to return the appropriate information, or I am not understanding how to extract it from what is already being returned.
Here are some additional questions:
I do not completely understand the "features = x[:, np.newaxis]" line
When plotting, what do '-*' and '-o' do? I looked through the documentation and it appears to be formatting, but I couldn't find these two examples exactly.
What do I need to do to get access to the 20% predicted values so that I can plot it against the original?
Is there a simple way to reuse most of this code for both the simple (sine) and Gaussian examples?
Here is the skeletal code. Most of the scikit from the article is unchanged.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import random
from operator import add
N = 200 # 200 samples
randomlist = []
x = np.linspace(0, 12, N)
sine_wave = np.sin(1*x)
#plot the source data
plt.figure(figsize=(20,5))
plt.plot(x, sine_wave, 'o')  # 'sum_vector' was undefined; plot the sine-wave samples
plt.show()
# convert the 1-D feature array into the 2-D (n_samples, n_features) format
# that sklearn estimators expect
# print('Before: ', x.shape)   # (200,)
features = x[:, np.newaxis]    # np.newaxis adds an axis: shape (200,) -> (200, 1)
# print('After: ', features.shape)  # (200, 1)
# save sine wave in variable 'targets'
targets = sine_wave
# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
features, targets,
train_size=0.8,
test_size=0.2,
# random but same for all run, also accuracy depends on the
# selection of data e.g. if we put 10 then accuracy will be 1.0
# in this example
random_state=23,
# keep same proportion of 'target' in test and target data
# stratify=targets # can not used for single feature
)
# training using 'training data'
regressor = LinearRegression()
regressor.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'training data'
prediction_training_targets = regressor.predict(train_features)
# note that 'score' uses 'feature and target (not predict_target)'
# for scoring in Regression
# whereas 'accuracy_score' uses 'features and predict_targets'
# for scoring in Classification
self_accuracy = regressor.score(train_features, train_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)
# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
# plot the predicted and actual target for test data
plt.figure(figsize=(20,5))
plt.plot(test_targets, color = "red")
plt.show()
plt.plot(prediction_test_targets, '-*', color = "red")
plt.plot(test_targets, '-o' )
plt.show()
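For what it's worth: '-*' and '-o' are matplotlib format strings, a solid line ('-') with star ('*') or circle ('o') markers. And the 20% predicted values are already in prediction_test_targets; here is a minimal sketch of overlaying them on the original data, sorting by the feature value so the random split doesn't scramble the lines:
order = test_features[:, 0].argsort()  # sort test points by x for clean lines
plt.figure(figsize=(20, 5))
plt.plot(x, sine_wave, alpha=0.3, label='full sine wave')
plt.plot(test_features[order, 0], test_targets[order], '-o', label='test targets')
plt.plot(test_features[order, 0], prediction_test_targets[order], '-*',
         label='predicted targets')
plt.legend()
plt.show()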
TL;DR: probably this problem, but how can we do it using sklearn? I'm okay if only the mean over the 25 CVs for each lambda or alpha is shown in the plots.
Hi all. If I understand correctly, we need to cross-validate on the training set to select the alpha (as in sklearn) for ridge regression. In particular, I want to perform a 5-fold CV repeated 5 times (so 25 CVs) on the training set.
What I want to do is, for each alpha in alphas:
from numpy import logspace as logs
alphas = logs(-3, 3, 71) # 71 log-spaced values from 10^-3 to 10^3
I get the MSEs on the 25 (different?) validation sets, and the MSE on the test set after I finish all the CVs for each training set, then take the average of the 25 MSEs for plotting or reporting.
The issue is I'm not sure how to do so. Is this the correct code to retrieve the 25 MSEs from the validation sets which we usually couldn't observe?
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import cross_val_score as CVS
from sklearn.model_selection import RepeatedKFold as RKF  # needed for RKF below
# 5-fold CV, repeated 5 times
cvs = RKF(n_splits=5, n_repeats=5, random_state=42)
# each alpha input as al
# the whole data set is generated with different RNG each time
# if you like you may take any existing data sets to explain whether I did wrong
# for each whole data set, the training set is split using the same random state
CVS(Ridge(alpha=al, random_state=42), X_train, Y_train, scoring="neg_mean_squared_error", cv=cvs)
If no, should I use cross_validate or even RidgeCV to get the MSEs I want? Thanks in advance.
Most likely you need to use GridSearchCV. Using an example with the 71 values of alpha defined above:
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.linear_model import RidgeCV, Ridge
from numpy import logspace as logs
from sklearn import datasets
alphas = logs(-3, 3, 71)
diabetes = datasets.load_diabetes()
X = diabetes.data[:300]
y = diabetes.target[:300]
X_val = diabetes.data[300:]
y_val = diabetes.target[300:]
We define the repeated cross validation, and the alphas to fit over:
cvs = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
parameters = {'alpha':alphas}
clf = GridSearchCV(Ridge(), parameters,cv=cvs)
clf.fit(X, y)
So the means of the scores will be stored under clf.cv_results_['mean_test_score'] and you also have the individual results under the dictionary. To plot, you can simply do:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.bar(np.arange(len(alphas)), height =clf.cv_results_['mean_test_score'],
yerr=clf.cv_results_['std_test_score'], alpha=0.5,
error_kw=dict(ecolor='gray', lw=1, capsize=5, capthick=2))
ax.set_xticks(np.arange(len(alphas)))
ax.set_xticklabels(np.round(alphas,3))
This shows the mean and standard deviation of the score over the 71 values of alpha.
You can see this post on how to get the scores for a pre-defined validation set.
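And if you want the 25 per-split scores themselves (not just the mean), here is a minimal sketch assuming the clf and cvs objects above, and that you pass scoring='neg_mean_squared_error' to GridSearchCV so the scores are (negated) MSEs rather than the default R^2:
import numpy as np
n_split_scores = cvs.get_n_splits()  # 5 folds x 5 repeats = 25
# one key per fold/repeat: 'split0_test_score' ... 'split24_test_score'
per_split = np.vstack([clf.cv_results_['split%d_test_score' % i]
                       for i in range(n_split_scores)])
print(per_split.shape)  # (25, len(alphas)): one row per fold/repeat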
I was training a model with 8 features that allows us to predict the probability of a room being sold.
Region: the region the room belongs to (an integer between 1 and 10)
Date: the date of stay (an integer between 1 and 365; here we consider only one-day requests)
Weekday: day of week (an integer between 1 and 7)
Apartment: whether the room is a whole apartment (1) or just a room (0)
#beds: the number of beds in the room (an integer between 1 and 4)
Review: average review of the seller (a continuous variable between 1 and 5)
Pic Quality: quality of the picture of the room (a continuous variable between 0 and 1)
Price: the historic posted price of the room (a continuous variable)
Accept: whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y". Hence, this is a binary classification.
We plotted the data, and since some of it was skewed we applied a power transform.
We tried a neural network, ExtraTrees, XGBoost, gradient boosting, and random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure where it went wrong, but my thinking was that the reason may be due to the mixing of discrete and continuous data, especially since some variables are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
df = X_train[i]
ax = df.plot.hist()
ax.set_title(i)
plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
test_auc = roc_auc_score(y_test, yhat[:,-1])  # renamed: this is test-set AUC, not training accuracy
print("AUC", test_auc)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
test_auc = roc_auc_score(y_test, yhat[:,-1])  # test-set AUC
print("AUC", test_auc)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
nn.Linear(8,64),
nn.BatchNorm1d(64),
nn.GELU(),
nn.Linear(64,32),
nn.BatchNorm1d(32),
nn.GELU(),
nn.Linear(32,16),
nn.BatchNorm1d(16),
nn.GELU(),
nn.Linear(16,1),
# nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
model,
max_epochs=100,
lr=0.1,
# Shuffle training data on each epoch
optimizer=torch.optim.Adam,
iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
learning_rate=0.01,
min_child_weight=1,
max_depth=6,
objective='binary:logistic',
n_estimators=500,
seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
test_auc = roc_auc_score(y_test, yhat[:,-1])  # test-set AUC
print("AUC", test_auc)
Here is a screenshot attachment of the sample data.
This is the fundamental first step of data analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value ranges, etc.)?
Data preparation - what should you do to update these data fields before passing them to your model? Also, which inputs do you think will be useful for your model, and which will provide little benefit? Are there outliers you need to consider/handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset, there are a couple of things you could try to see how they influence your prediction results:
Unless region order is actually ranked in importance/value, I would change Region to a one-hot encoded feature; you can do this in sklearn (see the sketch after this list). Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower number (say 1).
You could attempt to normalise certain fields if their values are much larger than those of your other data fields: Why Data Normalization is necessary for Machine Learning models
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
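A minimal sketch of the one-hot suggestion above, assuming the pre-scaling DataFrame split from your code and the 'Region' column name from your feature description:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# one-hot encode Region, pass every other column through unchanged
encode_region = ColumnTransformer(
    [('region', OneHotEncoder(handle_unknown='ignore'), ['Region'])],
    remainder='passthrough')
X_train_enc = encode_region.fit_transform(X_train)  # fit on training data only
X_test_enc = encode_region.transform(X_test)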
Without deeply exploring all the data you are using, it is hard to say for certain what is causing the drop in AUC when moving from your training set to the test set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters given the amount of data you have; that is more often a problem with neural networks than with the other methods you mentioned. Or the problem could be the way the data was split into training/testing: if the two distributions differ significantly (in a way that's maybe not obvious), you wouldn't expect the testing performance to be as good.
If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large data set). You may try repeating your experiments with a number of random training/testing splits; search for k-fold cross-validation if you're not familiar with it. A sketch follows below.
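A minimal sketch of that idea, reusing X and y from your code before scaling (roc_auc scoring assumed, to match your metric):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
# 5 random stratified splits; stable scores suggest the single split wasn't the issue
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=123, n_estimators=50),
                         X, y, cv=cv, scoring='roc_auc')
print(scores.mean(), scores.std())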
Your model is overfit. Try a simpler model first and use lower parameter values. For tree-based classification, scaling does not have any impact on the model.
My fellow Team,
Having an issue
----------------------
   Avg. Session Length  Time on App  Time on Website  Length of Membership  Yearly Amount Spent
0 34.497268 12.655651 39.577668 4.082621 587.951054
1 31.926272 11.109461 37.268959 2.664034 392.204933
2 33.000915 11.330278 37.110597 4.104543 487.547505
3 34.305557 13.717514 36.721283 3.120179 581.852344
4 33.330673 12.795189 37.536653 4.446308 599.406092
5 33.871038 12.026925 34.476878 5.493507 637.102448
6 32.021596 11.366348 36.683776 4.685017 521.572175
Want to apply KNN
X = df[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
y = df['Yearly Amount Spent']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
ValueError: Unknown label type: 'continuous'
The values in the Yearly Amount Spent column are real numbers, so they cannot serve as labels for a classification problem (see here):
When doing classification in scikit-learn, y is a vector of integers or strings.
Hence you get the error. If you want to build a classification model, you need to decide how to transform them into a finite set of labels.
Note that if you just want to avoid the error, you could do
import numpy as np
y = np.asarray(df['Yearly Amount Spent'], dtype="|S6")
This will transform the values in y into strings of the required format. Yet every label will appear in only one sample, so you cannot really build a meaningful model with such a set of labels.
I think you are actually trying to do a regression rather than a classification, since your code pretty much looks like you want to predict the yearly amount spent as a number. In this case, use
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=1)
instead. If you really have a classification task, for example if you want to classify into classes like 'yearly amount spent is low', 'yearly amount spent is high', and so on, you should discretize the labels and convert them into strings or integer numbers (as explained by @Miriam Farber), according to thresholds you need to set manually in this case.
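A minimal sketch of that discretization route, reusing df and X from the question (the bin edges here are purely illustrative and would need to be chosen for your data):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# bin the continuous target into three labelled ranges (illustrative thresholds)
y_binned = pd.cut(df['Yearly Amount Spent'],
                  bins=[0, 450, 550, float('inf')],
                  labels=['low', 'medium', 'high'])
X_train, X_test, y_train, y_test = train_test_split(X, y_binned,
                                                    test_size=0.33, random_state=42)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)  # no more "Unknown label type: 'continuous'" error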