I have a CSV file where I'm trying to predict Histology from the data in the other columns. I use the code shown below, but every prediction comes out as 1, even though the accuracy I get after training the model is 86.81%. Why is that?
import numpy as np
import pandas as pd
dataset = pd.read_csv('mutation-train.csv')
dataset = dataset[['CDS_Mutation',
                   'Primary_Tissue',
                   'Genomic',
                   'Gene_ID',
                   'Official_Symbol',
                   'Histology']]
X = dataset.iloc[:,0:5].values
y = dataset.iloc[:,5].values
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2= LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
labelencoder_X_4= LabelEncoder()
X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])
X = X.astype(float)
labelencoder_y= LabelEncoder()
y = labelencoder_y.fit_transform(y)
onehotencoder0 = OneHotEncoder(categorical_features = [0])
X = onehotencoder0.fit_transform(X).toarray()
onehotencoder1 = OneHotEncoder(categorical_features = [1])
X = onehotencoder1.fit_transform(X).toarray()
onehotencoder2 = OneHotEncoder(categorical_features = [2])
X = onehotencoder2.fit_transform(X).toarray()
onehotencoder4 = OneHotEncoder(categorical_features = [4])
X = onehotencoder4.fit_transform(X).toarray()
# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Building and evaluating the ANN
from keras.models import Sequential
from keras.layers import Dense
model=Sequential()
model.add(Dense(32, activation = 'relu', input_shape=(X.shape[1],)))
model.add(Dense(16, activation = 'relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(optimizer = 'rmsprop', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fit the model
model.fit(X_train, y_train, epochs=3, batch_size=1)
# Evaluate the model on the held-out test set
scores = model.evaluate(X_test, y_test)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
# Calculate predictions on the test set
predictions = model.predict(X_test)
pd.DataFrame(predictions, columns=['predictions']).to_csv('prediction.csv')
Thanks.
Since you get 86.81% accuracy while predicting 1 for every sample, it looks like your data is imbalanced: in your training dataset, one class heavily outweighs the other.
So even if you predict 1 for all the test data, you will get a high accuracy.
Refer to the accuracy paradox.
E.g. in your dataset, around 85% of the samples are of class 1 and the rest are of class 0.
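To see the paradox concretely, here is a toy check (synthetic labels, not your actual data):
import numpy as np
# synthetic labels: 85% class 1, 15% class 0
y = np.array([1] * 85 + [0] * 15)
y_pred = np.ones_like(y)      # a "model" that always predicts 1
print((y == y_pred).mean())   # 0.85 -- high accuracy, yet nothing was learned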
How to deal with it
There are plenty of ways to deal with it.
Upsampling: Duplicate samples of class 0 so that class 1 and class 0 end up in the same proportion.
Downsampling: Remove some of the class 1 samples to get the same proportion.
Change the performance metric: Rather than using accuracy as the performance metric, use the F1 score, precision, or recall.
Class weights: Assign different penalties to the classes for making a mistake, giving a high weight to the class with little data (see the sketch after this list).
And there are more ways to deal with it.
Refer to this link for more details.
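For instance, in Keras you can penalize mistakes on the minority class through the class_weight argument of fit, and judge the model with precision/recall/F1 from scikit-learn instead of plain accuracy. A minimal sketch, assuming model, X_train, y_train, X_test and y_test are defined as in your code:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report
# weight each class inversely to its frequency in the training labels
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
# the rare class gets the larger penalty during training
model.fit(X_train, y_train, epochs=10, batch_size=32,
          class_weight={0: weights[0], 1: weights[1]})
# evaluate with precision, recall and F1 rather than accuracy
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred))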
Related
I have a problem fitting my dataset to my model. I do not know what this error represents, and certainly not how to fix it. Thank you!
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense
dataset = pd.read_csv('Churn_Modelling.csv')
dataset
X=dataset.iloc[:,3:13].values
Y=dataset.iloc[:,13].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
lableencoder_X_2 = LabelEncoder()
X[:, 2] = lableencoder_X_2.fit_transform(X[:, 2])
ct = ColumnTransformer([('ohe', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype = str)
X = X[:, 1:]
classifier.add(Dense(units = 6,kernel_initializer = 'uniform',activation ='relu',input_dim = 11))
classifier.add(Dense(units = 6,kernel_initializer = 'uniform',activation ='relu'))
classifier.add(Dense(units= 1, kernel_initializer = 'uniform',activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crssentropy', metrics = ['accuracy'])
# fit our dataset to our module.
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
Error:
Well, the error message is pretty clear: the loss function should be binary_crossentropy, not binary_crssentropy.
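For reference, a corrected version of that part of the snippet; note that classifier = Sequential() also has to be created before the .add calls, which is missing from the code as posted:
from keras.models import Sequential
from keras.layers import Dense
classifier = Sequential()  # missing in the posted snippet
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])  # fixed spelling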
I watched this video on how to build your first neural network, but got stuck at 27:00 with this error: ValueError: cannot reshape array of size 6912 into shape (614,154)
https://www.youtube.com/watch?v=S2sZNlr-4_4
The code is below:
# This algorithm detects if a person has diabetes or not
# Load libraries
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import pandas as pd
# load data
from google.colab import files
uploaded = files.upload()
# Store the dataset
df = pd.read_csv('datasets_4511_6897_diabetes.csv')
# convert the data into an array
dataset = df.values
# get all of the rows from the first eight columns of the dataset
X = dataset[:,0:8]
y = dataset[:,8]
# process the data
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_scale = min_max_scaler.fit_transform(X)
# split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size = 0.2, random_state = 4)
# build the model
model = Sequential([
    Dense(12, activation = 'relu', input_shape = (8,)),
    Dense(15, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])
#Compile the model (sgd = stochastic gradient descent)
model.compile(
    optimizer = 'sgd',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)
# train the model
hist = model.fit(X_train, y_train, batch_size = 57, epochs = 1000, validation_split=0.2)
# evaluate the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = model.predict(X_train)
pred = [1 if p >= 0.5 else 0 for p in pred]
#df = df.values.reshape(614,154)
print('confusion_matrix : \n', confusion_matrix(y_train, pred))
Do you have any idea how to deal with this issue?
Thank you in advance.
I'm attending a deep learning course on Udemy. I've written the code exactly the way the instructor said, but I'm having a problem after classifier.fit(X_train, y_train, batch_size = 10, epochs = 100). The error is as follows:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x_1 = LabelEncoder()
X[: , 2] = label_encoder_x_1.fit_transform(X[:,2])
transformer = ColumnTransformer(
    transformers=[
        ("OneHot",          # just a name
         OneHotEncoder(),   # the transformer class
         [1]                # the column(s) to be applied on
        )
    ],
    remainder='passthrough' # do not apply anything to the remaining columns
)
X = transformer.fit_transform(X.tolist())
X = X.astype('float64')
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#importing keras
import keras
from keras.models import Sequential
from keras.layers import Dense
# Fitting classifier to the Training set
# Create your classifier here
classifier = Sequential()
classifier.add(Dense(output_dim = 6, init = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dense(output_dim = 6, init = 'uniform', activation = 'relu'))
classifier.add(Dense(output_dim = 1, init = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
File "C:\Anaconda3\envs\py37\lib\site-packages\sklearn\metrics_classification.py", line 268, in confusion_matrix
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Anaconda3\envs\py37\lib\site-packages\sklearn\metrics_classification.py", line 90, in _check_targets
"and {1} targets".format(type_true, type_pred))
ValueError: Classification metrics can't handle a mix of binary and continuous targets
How do I solve this?
The problem seems to be with the line y_pred = classifier.predict(X_test). According to the documentation, predict_classes is used for getting class predictions: https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict_classes. predict returns continuous values, which are not class labels. I made minor adjustments to your code:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values
#print(X)
#print(y)
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x_1 = LabelEncoder()
X[: , 2] = label_encoder_x_1.fit_transform(X[:,2])
transformer = ColumnTransformer(
    transformers=[
        ("OneHot",          # just a name
         OneHotEncoder(),   # the transformer class
         [1]                # the column(s) to be applied on
        )
    ],
    remainder='passthrough' # do not apply anything to the remaining columns
)
X = transformer.fit_transform(X.tolist())
X = X.astype('float64')
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#print(sum(y_train))
#print(sum(y_test))
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#importing keras
import keras
from keras.models import Sequential
from keras.layers import Dense
# Fitting classifier to the Training set
# Create your classifier here
classifier = Sequential()
classifier.add(Dense(output_dim = 6, init = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dense(output_dim = 6, init = 'uniform', activation = 'relu'))
classifier.add(Dense(output_dim = 1, init = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 10, epochs = 10)
# Predicting the Test set results
y_pred = classifier.predict_classes(X_test)
#print(classifier.predict(X_test))
#print(y_pred)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#cm = confusion_matrix(y_test, y_pred)
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(y_test, y_pred, target_names=['0', '1']))
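One caveat: predict_classes was deprecated and then removed in newer TensorFlow releases (2.6 and later), so if the call above fails for you, thresholding predict gives the equivalent result for a single sigmoid output unit:
import numpy as np
# equivalent of predict_classes for a one-unit sigmoid output layer
y_pred = (classifier.predict(X_test) > 0.5).astype(np.int32)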
I have built a regression model using an ANN relating 8 input parameters to 1 output parameter.
Code:
X = data.iloc[:,:-1]
y = data.iloc[:,8:9]*100
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)
y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
def base_model():
    # Initialising the ANN
    regressor = Sequential()
    # Adding the input layer and the first hidden layer
    regressor.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 8))
    # Adding the second hidden layer
    regressor.add(Dense(units = 4, kernel_initializer = 'uniform', activation = 'relu'))
    # Adding the output layer
    regressor.add(Dense(units = 1, kernel_initializer = 'uniform'))
    # Compiling the ANN
    regressor.compile(optimizer = 'adam', loss = 'mse', metrics = ['mae'])
    return regressor
# Fitting the ANN to the Training set
regressor = KerasRegressor(build_fn=base_model, epochs=500, batch_size=32)
regressor.fit(X_train,y_train)
# Predicting the Test & Train set with regressor built
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred)
y_test = sc_Y.inverse_transform(y_test)
#calculate r2_score
from sklearn.metrics import r2_score
score_test = r2_score(y_test,y_pred)
I get an r2_score of 98%. The unit of my output variable is currently metres. If I multiply it by 100 to convert to centimetres, train the model, and calculate the r2_score, I get 91%.
Why is my r2_score changing with the unit of the dependent variable? Shouldn't scaling take care of this?
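For what it's worth, r2_score itself is invariant when the same linear rescaling is applied to both the targets and the predictions, so the difference has to come from the network being retrained on the rescaled targets, not from the metric. A quick toy check (synthetic data, not the actual dataset):
import numpy as np
from sklearn.metrics import r2_score
rng = np.random.default_rng(0)
y_true = rng.normal(size=100)                      # targets in metres
y_hat = y_true + rng.normal(scale=0.1, size=100)   # noisy predictions
print(r2_score(y_true, y_hat))                     # identical value...
print(r2_score(y_true * 100, y_hat * 100))         # ...after converting to cm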
Thanks!!
My task is to model the KDD Cup 99 dataset using a neural network. I've tried to model it with this code:
#data preprocessing
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset = pd.read_csv('kddcupdata.gz')
#change Multi-class to binary-class
dataset['normal.'] = dataset['normal.'].replace(['back.', 'buffer_overflow.', 'ftp_write.', 'guess_passwd.', 'imap.', 'ipsweep.', 'land.', 'loadmodule.', 'multihop.', 'neptune.', 'nmap.', 'perl.', 'phf.', 'pod.', 'portsweep.', 'rootkit.', 'satan.', 'smurf.', 'spy.', 'teardrop.', 'warezclient.', 'warezmaster.'], 'attack')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 41].values
#encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x_1 = LabelEncoder()
labelencoder_x_2 = LabelEncoder()
labelencoder_x_3 = LabelEncoder()
x[:, 1] = labelencoder_x_1.fit_transform(x[:, 1])
x[:, 2] = labelencoder_x_2.fit_transform(x[:, 2])
x[:, 3] = labelencoder_x_3.fit_transform(x[:, 3])
onehotencoder_1 = OneHotEncoder(categorical_features = [1])
x = onehotencoder_1.fit_transform(x).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [4])
x = onehotencoder_2.fit_transform(x).toarray()
onehotencoder_3 = OneHotEncoder(categorical_features = [70])
x = onehotencoder_3.fit_transform(x).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
#splitting the dataset into the training set and test set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
#feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense
# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(output_dim = 60, init = 'uniform', activation = 'relu', input_dim = 118))
#Adding a second hidden layer
classifier.add(Dense(output_dim = 60, init = 'uniform', activation = 'relu'))
#Adding a third hidden layer
classifier.add(Dense(output_dim = 60, init = 'uniform', activation = 'relu'))
# Adding the output layer
classifier.add(Dense(output_dim = 1, init = 'uniform', activation = 'sigmoid'))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
classifier.fit(x_train, y_train, batch_size = 10, nb_epoch = 20)
# Predicting the Test set results
y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
#the performance of the classification model
print("the Accuracy is: "+ str((cm[0,0]+cm[1,1])/(cm[0,0]+cm[0,1]+cm[1,0]+cm[1,1])))
recall = cm[1,1]/(cm[0,1]+cm[1,1])
print("Recall is : "+ str(recall))
print("False Positive rate: "+ str(cm[1,0]/(cm[0,0]+cm[1,0])))
precision = cm[1,1]/(cm[1,0]+cm[1,1])
print("Precision is: "+ str(precision))
print("F-measure is: "+ str(2*((precision*recall)/(precision+recall))))
from math import log
print("Entropy is: "+ str(-precision*log(precision)))
But I got an error like the one in the picture below.
I hope someone can help me fix this code, or give me new code, so that I can model the KDD Cup 99 dataset easily. I really need your help to finish my task.
Thank you.