How can I raise Matthews Correlation Coefficient? - python

The accuracy of the following deep learning model is very high, but I expect its Matthews Correlation Coefficient (MCC) to be very low. How can I increase the MCC?
(I don't have the true y values for the test set.)
The task: use the training data (X_train.csv & y_train.csv) to train a model that can predict fraudulent
transactions (isfraud=1).
Verify the performance of your model using the test data (X_test.csv). That is, create "y_test.csv"
that predicts the "isfraud" variable for each "id" in the "x_test.csv" file.
This is my code, but it still needs improvement.
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score, matthews_corrcoef
# mount google drive
from google.colab import drive
drive.mount('./mount')
# Store x_train in DataFrame
df_xtrain = pd.read_csv("mount/My Drive/Colab Notebooks/x_train.csv")
df_xtrain.shape
# Store y_train in DataFrame
df_ytrain = pd.read_csv("mount/My Drive/Colab Notebooks/y_train.csv")
df_ytrain.shape
# Store x_test in DataFrame - NOTE: it has one more column than x_train, which is "id"
df_xtest = pd.read_csv("mount/My Drive/Colab Notebooks/x_test.csv")
df_xtest.shape
# Check for missing values _ df_xtrain
print("Number of missing values: ", df_xtrain.isnull().sum().sum())
# Check for missing values _ df_ytrain
print("Number of missing values: ", df_ytrain.isnull().sum().sum())
# Check for missing values _df_xtest
print("Number of missing values: ", df_xtest.isnull().sum().sum())
# Impute missing values with the mean of each column
df_xtrain.fillna(df_xtrain.mean(), inplace=True)
scaler = MinMaxScaler()
xtrain = scaler.fit_transform(df_xtrain)
xtest = scaler.transform(df_xtest.iloc[:,1:])
# convert y_train to numpy array
ytrain = df_ytrain.to_numpy()
# number of input rows
nrows = xtrain.shape[0]
# number of columns
ncols = xtrain.shape[1]
# Define model with three hidden layers
model = Sequential()
model.add(Dense(200, input_dim=ncols, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain, ytrain, epochs=50, batch_size=10000)
# make class predictions with the model
ytest_prediction = model.predict(xtest)
# round the predictions to make it either 0 or 1
ytest = np.round(ytest_prediction)
# Extract the first 10000 samples from ytrain to use as the ground truth for ytest
ytrain_for_testing = ytrain[:10000, :]
# calculate the Matthews Correlation Coefficient
mcc = matthews_corrcoef(ytrain_for_testing.flatten(), ytest.flatten())
print("Matthews Correlation Coefficient:", mcc)
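One way to get a meaningful MCC without true labels for x_test is to hold out part of the labelled training data and score the model there. The first 10000 rows of ytrain are not labels for the test rows, so the comparison above does not measure model quality. The sketch below is a NumPy-only illustration under assumptions: `mcc` and `best_threshold` are hypothetical helper names, the probabilities are synthetic stand-ins for `model.predict` on a held-out split, and scanning the decision threshold is one common option for imbalanced data rather than a guaranteed fix. `sklearn.metrics.matthews_corrcoef`, already imported in the question, computes the same quantity as `mcc`.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_pred = np.asarray(y_pred).ravel().astype(bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_threshold(y_true, probs, thresholds=None):
    """Return the (threshold, mcc) pair that maximises MCC on a held-out set."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    probs = np.asarray(probs).ravel()
    scores = [mcc(y_true, probs >= t) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])

# Synthetic stand-in for a held-out validation split (about 2% positives).
rng = np.random.default_rng(0)
y_val = (rng.random(5000) < 0.02).astype(int)
# Fake "model outputs": positives tend to score higher but rarely exceed 0.5.
p_val = np.clip(0.05 + 0.3 * y_val + 0.1 * rng.standard_normal(5000), 0.0, 1.0)

print("MCC at threshold 0.5:", mcc(y_val, p_val >= 0.5))
print("best (threshold, MCC):", best_threshold(y_val, p_val))
```

With the question's variables, the same helpers could be applied to predictions on a slice of xtrain/ytrain that was kept out of `model.fit`; the chosen threshold would then replace `np.round` when writing y_test.csv.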

Related

tokenizer output unable to be converted to numpy array

I've been following this tutorial to try to understand LSTMs and TensorFlow a bit better. When I run it, the training of the model goes smoothly, but when I apply the trained tokenizer to the test data and then convert the result to a NumPy array, it fails, and I'm not sure what the problem is. The relevant portion that goes wrong is below:
# test model
x_test = np.array(tokenizer.texts_to_sequences([str(txt) for txt in df_test['text'].values]))
The error it presents is as below:
Traceback (most recent call last):
File "/Users/pranavnair/Documents/Code/wpd/wpd.py", line 85, in <module>
x_test = np.array(x_test_data)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10824,) + inhomogeneous part.
I've tried using np.hstack instead of np.array, and that doesn't fix it. Would appreciate any help at all, thanks in advance.
Full code below for reference
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras import utils
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.layers import Embedding
from keras.optimizers import Adam
# set random seed for reproducibility
RANDOM_SEED = 4
np.random.seed(RANDOM_SEED)
# import datasets
df_neut = pd.read_csv("./input/good.csv")
df_prom = pd.read_csv("./input/promotional.csv")
# clean up data to only include text
df_prom = df_prom.drop(df_prom.columns[1:], axis=1)
df_neut = df_neut.drop(df_neut.columns[1:], axis=1)
# combine datasets
df_neut.insert(1, 'label', 0) # neutral labels
df_prom.insert(1, 'label', 1) # promotional labels
# merge dataframes
df = pd.concat((df_neut, df_prom), ignore_index=True, axis=0)
# randomize order of dataframes
df = df.reindex(np.random.permutation(df.index))
# split into training and testing datasets
df_train, df_test = train_test_split(df, test_size=0.2, random_state=RANDOM_SEED)
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 50000
# perform data preprocessing using keras tokenizer
text_data = [str(txt) for txt in df_train['text'].values] # convert text data to strings
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~', lower=True) # create tokenizer
tokenizer.fit_on_texts(text_data) # make dictionary
# vectorize dataset
x_train = tokenizer.texts_to_sequences(text_data)
# Max number of words in each sequence
MAX_SEQUENCE_LENGTH = 400
# pad sequence lengths
x_train = utils.pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH)
# get training labels
y_train = df_train['label'].values
# create sequential model
model = Sequential()
# create embedding layer
EMBEDDING_DIM = 100
model.add(Embedding(MAX_NB_WORDS+1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
# add LSTM layer to model
model.add(LSTM(80))
# setup model layers
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
# setup binary classification via binary cross entropy loss
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
# train for four epochs
EPOCHS = 4
BATCH_SIZE = 64
history = model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.15)
# test model
x_test = np.array(tokenizer.texts_to_sequences([str(txt) for txt in df_test['text'].values]))
x_test = utils.pad_sequences(x_test, maxlen=MAX_SEQUENCE_LENGTH)
y_test = np.array(df_test['label'].values)
# evaluate model
scores = model.evaluate(x_test, y_test, batch_size=128)
print("The model has a test loss of %.2f and a test accuracy of %.1f%%" % (scores[0], scores[1]*100))
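The error comes from calling np.array on sequences of different lengths; padding has to happen first so that every row has the same length. In the code above that probably means passing the list returned by texts_to_sequences straight to utils.pad_sequences, which accepts a list of lists, and dropping the np.array call. The NumPy-only sketch below illustrates the idea. `pad_left` is a hypothetical stand-in that mimics the default left-padding and left-truncating behaviour of Keras `pad_sequences`; it is not the library function itself.

```python
import numpy as np

def pad_left(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for i, seq in enumerate(sequences):
        seq = list(seq)
        kept = seq[max(0, len(seq) - maxlen):]  # keep the last maxlen tokens
        if kept:
            out[i, maxlen - len(kept):] = kept  # right-align, pad on the left
    return out

# Tokenised texts have different lengths, so they do not form a rectangle.
ragged = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]

try:
    np.array(ragged)  # recent NumPy versions raise ValueError here
except ValueError as err:
    print("np.array on ragged input failed:", err)

# After padding, every row has the same length and the array is 2-D.
x_test = pad_left(ragged, maxlen=4)
print(x_test.shape)
```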

Resnet for Text data

Hi, I want to use ResNet for text data. I looked at many code examples written for other kinds of data and ended up with the following code, but I'm not sure whether it is a correct way to build a ResNet.
NOTE: this part is optional. If I receive an opinion on it, that would be great, but I'm going to try it once the code above is corrected. If the approach is correct, I want to implement it this way: ResNet should contain 18 layers in total, divided into four stages, and each stage should consist of two convolutional blocks. Each convolutional block should contain two convolutional layers with batch normalization and ReLU non-linearity in between. Then, ResNet should pass the output from the convolutional layers to two fully-connected layers that will use the reduced data to classify the initial data to a given website class. Last but not least, you should use the Adam optimizer and categorical cross-entropy (typically used for multi-class classification problems). Make sure that you identify and use the optimal hyper-parameters for your ResNet.
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
class ResNet_class():
    def __init__(self):
        # Cross-Validate
        self.no_of_folds = int(input('enter no of K_fold: '))
        self.kf = KFold(self.no_of_folds, shuffle=True, random_state=42) # Use for KFold classification
        self.EPOCHS = int(input('enter no of epochs: '))
    def check_test(self):
        df = pd.read_csv(
            "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
            na_values=['NA','?'])
        df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
        df.drop('job', axis=1, inplace=True)
        df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
        df.drop('area', axis=1, inplace=True)
        df = pd.concat([df,pd.get_dummies(df['product'],prefix="product")],axis=1)
        df.drop('product', axis=1, inplace=True)
        med = df['income'].median()
        df['income'] = df['income'].fillna(med)
        df['income'] = zscore(df['income'])
        df['aspect'] = zscore(df['aspect'])
        df['save_rate'] = zscore(df['save_rate'])
        df['subscriptions'] = zscore(df['subscriptions'])
        x_columns = df.columns.drop('age').drop('id')
        x = df[x_columns].values
        y = df['age'].values
        oos_y = []
        oos_pred = []
        fold = 0
        for train, test in self.kf.split(x):
            fold += 1
            print(f"Fold #{fold}")
            x_train = x[train]
            y_train = y[train]
            x_test = x[test]
            y_test = y[test]
            model = Sequential()
            model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
            model.add(Dense(10, activation='relu'))
            model.add(Dense(1))
            model.compile(loss='mean_squared_error', optimizer='adam')
            model.fit(x_train, y_train, validation_data=(x_test, y_test), verbose=0,
                      epochs=self.EPOCHS)
            pred = model.predict(x_test)
            oos_y.append(y_test)
            oos_pred.append(pred)
            score = np.sqrt(metrics.mean_squared_error(pred, y_test))
            print(f"Fold score (RMSE): {score}")
        oos_y = np.concatenate(oos_y)
        oos_pred = np.concatenate(oos_pred)
        score = np.sqrt(metrics.mean_squared_error(oos_pred, oos_y))
        print(f"Final, out of sample score (RMSE): {score}")
        oos_y = pd.DataFrame(oos_y)
        oos_pred = pd.DataFrame(oos_pred)
        oosDF = pd.concat([df, oos_y, oos_pred], axis=1)
resnet = ResNet_class()
resnet.check_test()
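For comparison with the code above, which stacks plain Dense layers and has no skip connections: the defining feature of a ResNet is that each block adds its input back to its output. The NumPy sketch below shows only that forward-pass idea on dense layers. `residual_block` and the weight shapes are illustrative assumptions, and the sketch leaves out the convolutions and batch normalization described in the note.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two-layer block with an identity skip connection: relu(x + F(x))."""
    hidden = relu(x @ w1)  # first transformation plus non-linearity
    fx = hidden @ w2       # second transformation, back to the input width
    return relu(x + fx)    # add the input back before the final ReLU

rng = np.random.default_rng(42)
x = rng.standard_normal((5, 8))          # 5 samples, 8 features
w1 = 0.1 * rng.standard_normal((8, 16))  # widen to 16 units
w2 = 0.1 * rng.standard_normal((16, 8))  # project back to 8 so shapes match

out = residual_block(x, w1, w2)
print(out.shape)
```

Because of the skip connection, a block whose weights are all zero simply passes relu(x) through, which is part of why deep stacks of such blocks remain trainable.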

ValueError: Error when checking target: expected dense_4 to have shape (1,) but got array with shape (6,)

I am doing a prediction model using a chronic kidney disease dataset.
However the shape of my X_train value doesn't seem to be valid.
I have tried to change it but got a tuple error
# import libraries
import glob
from keras.models import Sequential, load_model
import numpy as np
import pandas as pd
from keras.layers import Dense
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
import matplotlib.pyplot as plt
import keras as k
from sklearn.model_selection import train_test_split
# load the data
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('kidney_disease.csv')
#print the first 5 rows of data
df.head(5)
# create a list of column names to keep
columns_to_retain = ['sg', 'al', 'sc', 'hemo', 'pcv', 'wbcc', 'htn', 'classification']
# drop the unneccessary columns
df = df.drop( [col for col in df.columns if not col in columns_to_retain], axis=1)
#drop the rows with na or missing values
df = df.dropna(axis=0)
# transform the non-numeric data in the columns
for column in df.columns:
    if df[column].dtype == np.number:
        continue
    df[column] = LabelEncoder().fit_transform(df[column])
# split the data into independent (X) dataset and dependent (y) dataset
X = df.drop(['classification'], axis=1)
y = df['classification']
# feature scaling
#min-max scaler method scales the dataset in order that all features lies between 0 and 1
X_scaler = MinMaxScaler()
X_scaler.fit(X)
column_names = X.columns
X[column_names] = X_scaler.transform(X)
# split the data into 80% training & 20% testing
X_train, y_train, X_test, y_test = train_test_split(X,y, test_size = 0.2, shuffle=True)
# build the model
model = Sequential()
model.add( Dense(256, input_dim= len(X.columns), kernel_initializer=k.initializers.random_normal(seed=13), activation ='relu') )
model.add( Dense(1, activation = 'hard_sigmoid') )
# compiling the model (loss function mesures how well the model does in training
# & tries to improve on it using the optimizer )
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# train the model
history = model.fit(X_train, y_train, epochs = 2000, batch_size= X_train.shape[0])
#print(X_train[0:1].shape)
Does anyone have an idea of what is going wrong, and could you explain the root of this problem?
Thank you in advance!
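A likely cause, judging from the code above: train_test_split returns its outputs in the order X_train, X_test, y_train, y_test, so the unpacking in the question assigns a slice of the feature matrix to y_train. The model then receives a multi-column array as its target, which would be consistent with an error about the target shape. The sketch below uses made-up data (the array sizes and random_state are assumptions for illustration) to show the expected unpacking order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 6 features, binary target.
X = np.arange(60, dtype=float).reshape(10, 6)
y = np.array([0, 1] * 5)

# Order of the returned values: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Features stay 2-D, targets stay 1-D.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```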

I just trained my first ML model on the Titanic dataset from Kaggle. I am getting an RMSE value of ~0.4; is that good?

Please note: I trained my model only on the numerical columns, not the string columns.
Please also suggest some resources for going further into machine learning, as I really like this subject.
Thank you.
Here is the code, which gives the following output:
train rmse: 0.42
test rmse: 0.43
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd
import matplotlib.pyplot as plt
dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dftest = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
dftrain.loc[dftrain['fare'] == 0, 'fare'] = 34.85
plt.plot(list(dftrain.age), list(dftrain.fare), '.',markersize = 1)
dftrain = dftrain.drop(['sex', 'class', 'deck','embark_town', 'alone'], axis =1 )
X = dftrain.loc[:, dftrain.columns != 'survived']
y = dftrain.loc[:, 'survived']
model = Sequential()
model.add(Dense(128, activation = 'relu', input_dim = 4))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer = 'adam' , loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(X, y , epochs = 200)
dftest = dftest.drop(['sex', 'class', 'deck','embark_town', 'alone'], axis =1 )
A = dftest.loc[:, dftest.columns != 'survived']
b = dftest.loc[:, 'survived']
from sklearn.metrics import mean_squared_error
import numpy as np
train_pred = model.predict(X)
train_rmse = np.sqrt(mean_squared_error(y, train_pred))
test_pred = model.predict(A)
test_rmse = np.sqrt(mean_squared_error(b, test_pred))
print("train rmse: {:0.2f}".format(train_rmse))
print("test rmse: {:0.2f}".format(test_rmse))
First, root mean square error is usually not a good score to look at in classification problems. For reasons why, refer to either this post or this stats stack exchange post.
Second, you're training a somewhat large neural network (with many parameters) relative to the available amount of training data (there were only 2224 passengers and crew members). When the number of parameters in your model is comparable to the number of training examples, you run a risk of overfitting. Refer to this tutorial to learn what you can infer about your model from training/validation loss curves and how you can combat over- and underfitting. You can experiment with different learning rates, numbers of epochs, batch sizes, normalization methods, etc.
You might also want to look at other metrics, such as accuracy, precision, and recall.
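As a rough illustration of the metrics mentioned above, the sketch below thresholds sigmoid outputs and computes accuracy, precision, and recall with NumPy. The function name `classification_scores`, the 0.5 threshold, and the sample arrays are assumptions for illustration, not part of the original code.

```python
import numpy as np

def classification_scores(y_true, probs, threshold=0.5):
    """Accuracy, precision and recall for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true).ravel().astype(bool)
    y_pred = np.asarray(probs).ravel() >= threshold  # class 1 if prob >= threshold
    tp = np.sum(y_true & y_pred)    # predicted 1, actually 1
    fp = np.sum(~y_true & y_pred)   # predicted 1, actually 0
    fn = np.sum(y_true & ~y_pred)   # predicted 0, actually 1
    accuracy = float(np.mean(y_true == y_pred))
    precision = float(tp / (tp + fp)) if (tp + fp) else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Small made-up example: three survivors, three non-survivors.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
scores = classification_scores(y_true, probs)
print(scores)
```

With the variables from the question, the call would look like `classification_scores(b, test_pred)`; the `ravel()` inside the function flattens the (n, 1) output of `model.predict`.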

Keras: Re-use trained weights in a new experiment

I am quite new to Keras so apologies in advance for any stupid mistakes. I am currently attempting to try out some good old cross-domain transfer learning between two datasets. I have a model here that is trained and executed on a voice recognition dataset that I have generated (code is at the bottom of this question because it's quite long)
If I were to train a new model, say model_2 on a different dataset, then I'd get a baseline from the initial random distribution of weights.
Is it possible to train model_1 and model_2 and then (this is the bit I don't know how to do) take the 256 and 128 dense layers from model_1, with their trained weights, and use them as the starting point for a model_3, that is, dataset 2 with the initial weight distribution from model_1?
So, in the end, I have the following:
Model_1 which starts from a random distribution and trains on dataset 1
Model_2 which starts from a random distribution and trains on dataset 2
Model_3 which starts from the distribution trained in Model_1 and trains on dataset 2.
My question is, how would I go about doing step 3 in the above? I don't want to freeze the weights, I just want an initial distribution for training from a past experiment
Any help would be greatly appreciated. Thank you! Apologies if I didn't make it quite clear enough what I'm going for
My code to train Model_1 is as follows:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from keras.utils import np_utils
from keras.layers.normalization import BatchNormalization
import time
start = time.clock()
# fix random seed for reproducibility
seed = 1
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("voice.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
numVars = len(dataframe.columns) - 1
numClasses = dataframe[numVars].nunique()
X = dataset[:,0:numVars].astype(float)
Y = dataset[:,numVars]
print("THERE ARE " + str(numVars) + " ATTRIBUTES")
print("THERE ARE " + str(numClasses) + " UNIQUE CLASSES")
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
calls = [EarlyStopping(monitor='acc', min_delta=0.0001, patience=100, verbose=2, mode='max', restore_best_weights=True)]
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(BatchNormalization())
    model.add(Dense(256, input_dim=numVars, activation='sigmoid'))
    model.add(Dense(128, activation='sigmoid'))
    model.add(Dense(numClasses, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=2000, batch_size=1000, verbose=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold, fit_params={'callbacks':calls})
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
#your code here
print (time.clock() - start)
PS: Input attributes and outputs will all be the same between the two datasets, all that will change are attribute values. I am curious, can this be done if the two datasets have different numbers of output classes?
In short, to fine-tune Model_3 from Model_1, just call model.load_weights('/path/to/model_1.h5', by_name=True) after model.compile(...). Of course, you must have saved the trained Model_1 first.
If I understood correctly, you have the same number of features and classes in the two datasets, so you do not even need to redesign your model. If the datasets had different sets of classes, you would have to give different names to the last layers of Model_1 and Model_3:
model.add(Dense(numClasses, activation='softmax', name='some_unique_name'))
