ValueError: Data cardinality is ambiguous (Python)

I'm trying to learn the basics of TensorFlow and wrote some code to check students' performance scores with this CSV from Kaggle.
But I get this error:

ValueError: Data cardinality is ambiguous:
  x sizes: 1000
  y sizes: 3
Make sure all arrays contain the same number of samples.

File "C:\Users\w1234\algorithm.py\tensor\tensorflow\students_performance.py", line 30
    model.fit(np.array(x_data), np.array(y_data), epochs=100)

Could you help me? How can I make the sample sizes match?
The code:
from sklearn import metrics
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
import os
import numpy as np
import pandas as pd

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

data = pd.read_csv("C:/Users/w1234/algorithm.py/tensor/tensorflow/students_performance.csv")
data = data.dropna()

x_data = []
y_data = [data['math score'].values,
          data['reading score'].values,
          data['writing score']]

for i, row in data.iterrows():
    x_data.append([row['gender'],
                   row['parental level of education'],
                   row['lunch'],
                   row['test preparation course']])

model = Sequential([Dense(64, activation='relu'),
                    Dense(32, activation='relu'),
                    Dense(1, activation='sigmoid', name='output')])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics='accuracy')

model.fit(np.array(x_data), np.array(y_data), epochs=100)

(First, on the error itself: y_data is built as a Python list of three arrays, so np.array(y_data) has shape (3, 1000) and Keras sees only 3 samples in y versus 1000 in x.)
Machine learning algorithms typically work with numeric matrices or tensors, so most feature engineering boils down to converting raw data into numeric representations that these algorithms can consume.
From your code it looks like you are trying to predict race/ethnicity, which is the output variable.
gender, parental level of education, lunch, and test preparation course are all categorical columns with dtype object; we must convert these columns to numerical columns, so I have used one-hot encoding.
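As a quick toy illustration (the values below are made up; only the column name matches the dataset), this is what one-hot encoding does to a single categorical column:

import pandas as pd

# a toy frame with one object-dtype column
toy = pd.DataFrame({'gender': ['female', 'male', 'female']})

# get_dummies replaces the column with one indicator column per category
# (0/1 integers or booleans, depending on the pandas version)
print(pd.get_dummies(toy, columns=['gender']))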
Please find the working code below:
from sklearn import metrics
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
import os
import numpy as np
import pandas as pd

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

data = pd.read_csv("/content/StudentsPerformance.csv")
data = data.dropna()

# y_data is the output variable
y_data = data.pop("race/ethnicity")
# x_data holds the input variables, i.e. the features y_data depends on
x_data = data

categorical_cols = ['gender', 'parental level of education', 'lunch', 'test preparation course']
# One-hot encoding of the categorical feature columns
x_data = pd.get_dummies(x_data, columns=categorical_cols)
x_data = x_data.astype('float')
# One-hot encode the 5 target classes as well
y_data = pd.get_dummies(y_data)

model = Sequential([Dense(64, activation='relu'),
                    Dense(32, activation='relu'),
                    Dense(5, activation='sigmoid', name='output')])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(x_data, y_data, epochs=100)
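One design note as an editorial aside: since y_data is one-hot encoded over five mutually exclusive race/ethnicity groups, the more conventional setup is a softmax output with categorical_crossentropy rather than sigmoid with binary_crossentropy; only the last layer and the loss change:

model = Sequential([Dense(64, activation='relu'),
                    Dense(32, activation='relu'),
                    Dense(5, activation='softmax', name='output')])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])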
Let us know if the issue still persists. Thanks!

Related

How can I raise Matthews Correlation Coefficient?

The accuracy of the following deep learning model is very high, but the Matthews Correlation Coefficient turns out to be very low. How can I increase the Matthews Correlation Coefficient?
(I don't have the real y values for the test set.)
The task: use the training data (X_train.csv & y_train.csv) to train a model that can predict fraudulent transactions (isfraud=1), then verify the performance of the model using the test data (X_test.csv). That is, create "y_test.csv", which predicts the "isfraud" variable for each "id" in the "x_test.csv" file.
This is my code, but it still needs improvement.
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score, matthews_corrcoef
# mount google drive
from google.colab import drive
drive.mount('./mount')
# Store x_train in DataFrame
df_xtrain = pd.read_csv("mount/My Drive/Colab Notebooks/x_train.csv")
df_xtrain.shape
# Store y_train in DataFrame
df_ytrain = pd.read_csv("mount/My Drive/Colab Notebooks/y_train.csv")
df_ytrain.shape
# Store x_test in DataFrame - NOTE: it has one more column than x_train, which is "id"
df_xtest = pd.read_csv("mount/My Drive/Colab Notebooks/x_test.csv")
df_xtest.shape
# Check for missing values _ df_xtrain
print("Number of missing values: ", df_xtrain.isnull().sum().sum())
# Check for missing values _ df_ytrain
print("Number of missing values: ", df_ytrain.isnull().sum().sum())
# Check for missing values _df_xtest
print("Number of missing values: ", df_xtest.isnull().sum().sum())
# Impute missing values with the mean of each column
df_xtrain.fillna(df_xtrain.mean(), inplace=True)
scaler = MinMaxScaler()
xtrain = scaler.fit_transform(df_xtrain)
xtest = scaler.transform(df_xtest.iloc[:,1:])
# convert y_train to numpy array
ytrain = df_ytrain.to_numpy()
# number of input rows
nrows = xtrain.shape[0]
# number of columns
ncols = xtrain.shape[1]
# Define model with three hidden layers
model = Sequential()
model.add(Dense(200, input_dim=ncols, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xtrain, ytrain, epochs=50, batch_size=10000)
# make class predictions with the model
ytest_prediction = model.predict(xtest)
# round the predictions to make it either 0 or 1
ytest = np.round(ytest_prediction)
# Extract the first 10000 samples from ytrain to use as the ground truth for ytest
ytrain_for_testing = ytrain[:10000, :]
# calculate the Matthews Correlation Coefficient
mcc = matthews_corrcoef(ytrain_for_testing.flatten(), ytest.flatten())
print("Matthews Correlation Coefficient:", mcc)

tokenizer output unable to be converted to numpy array

I've been following a tutorial to try and understand LSTMs and TensorFlow a bit more. Training the model goes smoothly when I run it, but when I try to use the fitted tokenizer on the test data and then convert the result to a NumPy array, it doesn't work, and I'm not really sure what the problem is. The relevant portion that goes wrong is below:
# test model
x_test = np.array(tokenizer.texts_to_sequences([str(txt) for txt in df_test['text'].values]))
The error it presents is as below:
Traceback (most recent call last):
  File "/Users/pranavnair/Documents/Code/wpd/wpd.py", line 85, in <module>
    x_test = np.array(x_test_data)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10824,) + inhomogeneous part.
I've tried using np.hstack instead of np.array, and that doesn't fix it. Would appreciate any help at all, thanks in advance.
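An editorial aside on the error itself: texts_to_sequences returns a plain Python list of sequences with differing lengths, and NumPy cannot stack such ragged lists into one rectangular array; pad_sequences, however, accepts the list of lists directly. A minimal sketch of that idea, using the names from the code below:

# texts_to_sequences yields ragged lists (one integer sequence per document)
sequences = tokenizer.texts_to_sequences([str(txt) for txt in df_test['text'].values])

# pad_sequences takes the plain list of lists and returns a rectangular array
x_test = utils.pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)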
Full code below for reference
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras import utils
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.layers import Embedding
from keras.optimizers import Adam
# set random seed for reproducibility
RANDOM_SEED = 4
np.random.seed(RANDOM_SEED)
# import datasets
df_neut = pd.read_csv("./input/good.csv")
df_prom = pd.read_csv("./input/promotional.csv")
# clean up data to only include text
df_prom = df_prom.drop(df_prom.columns[1:], axis=1)
df_neut = df_neut.drop(df_neut.columns[1:], axis=1)
# combine datasets
df_neut.insert(1, 'label', 0) # neutral labels
df_prom.insert(1, 'label', 1) # promotional labels
# merge dataframes
df = pd.concat((df_neut, df_prom), ignore_index=True, axis=0)
# randomize order of dataframes
df = df.reindex(np.random.permutation(df.index))
# split into training and testing datasets
df_train, df_test = train_test_split(df, test_size=0.2, random_state=RANDOM_SEED)
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 50000
# perform data preprocessing using keras tokenizer
text_data = [str(txt) for txt in df_train['text'].values] # convert text data to strings
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~', lower=True) # create tokenizer
tokenizer.fit_on_texts(text_data) # make dictionary
# vectorize dataset
x_train = tokenizer.texts_to_sequences(text_data)
# Max number of words in each sequence
MAX_SEQUENCE_LENGTH = 400
# pad sequence lengths
x_train = utils.pad_sequences(x_train, maxlen=MAX_SEQUENCE_LENGTH)
# get training labels
y_train = df_train['label'].values
# create sequential model
model = Sequential()
# create embedding layer
EMBEDDING_DIM = 100
model.add(Embedding(MAX_NB_WORDS+1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
# add LSTM layer to model
model.add(LSTM(80))
# setup model layers
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
# setup binary classification via binary cross entropy loss
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
# train for four epochs
EPOCHS = 4
BATCH_SIZE = 64
history = model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=0.15)
# test model
x_test = np.array(tokenizer.texts_to_sequences([str(txt) for txt in df_test['text'].values]))
x_test = utils.pad_sequences(x_test, maxlen=MAX_SEQUENCE_LENGTH)
y_test = np.array(df_test['label'].values)
# evaluate model
scores = model.evaluate(x_test, y_test, batch_size=128)
print("The model has a test loss of %.2f and a test accuracy of %.1f%%" % (scores[0], scores[1]*100))

How to optimize accuracy of ANN

I have 300 samples spread across 50 target classes, and my data has 98 features.
This is my sample code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
dataset = pd.read_csv(root_path + 'pima-indians-diabetes.data.csv', header=None)
X= dataset.iloc[:,0:8]
y= dataset.iloc[:,8]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3)
from keras import Sequential
from keras.layers import Dense
classifier = Sequential()
#First Hidden Layer
classifier.add(Dense(units = 10, activation='relu',kernel_initializer='random_normal', input_dim=8))
#Second Hidden Layer
classifier.add(Dense(units = 10, activation='relu',kernel_initializer='random_normal'))
#Output Layer
classifier.add(Dense(units = 1, activation='sigmoid',kernel_initializer='random_normal'))
#Compiling the neural network
classifier.compile(optimizer ='adam',loss='binary_crossentropy', metrics =['accuracy'])
#Fitting the data to the training dataset
classifier.fit(X_train,y_train, batch_size=2, epochs=10)
I get 19% accuracy here, and I don't know how to optimize my prediction result.
I assume that you have performed a dimensionality reduction technique on your original data with 98 features, which is why you are using an 8-dimensional input in your model.
I have a few observations on your implementation:
[As a Classification Problem]
As you have mentioned that your samples belong to 50 different classes, the problem is certainly a multiclass classification problem. So, you first need to one-hot encode your labels:
from keras.utils import to_categorical
y = to_categorical(y, num_classes=50, dtype='float32')
In this case, you need to change the number of output nodes (one per class) and the activation function in the final layer as follows:
classifier.add(Dense(units = 50, activation='softmax'))
Furthermore, you have to use categorical_crossentropy as the loss function when compiling your model:
classifier.compile(optimizer ='adam',loss='categorical_crossentropy', metrics =['accuracy'])
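Putting the classification pieces together, a minimal sketch of the adjusted model (the hidden layers are unchanged from the question; only the label encoding, the output layer, and the loss differ):

from keras.utils import to_categorical
from keras import Sequential
from keras.layers import Dense

# one-hot encode the 50 class labels
y_encoded = to_categorical(y, num_classes=50, dtype='float32')

classifier = Sequential()
classifier.add(Dense(units=10, activation='relu', kernel_initializer='random_normal', input_dim=8))
classifier.add(Dense(units=10, activation='relu', kernel_initializer='random_normal'))
classifier.add(Dense(units=50, activation='softmax'))  # one output node per class
classifier.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])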
[As a Regression Problem]
You can also treat this as a regression problem, since the output lies in the range 0 to 50 (treated as continuous), and keep a single output node in the final layer as you did. But in that case you should use a linear activation function instead of sigmoid.
So, the final layer should be like:
classifier.add(Dense(units = 1)) # default activation is linear
Additionally, for a regression problem mean_squared_error is the most relevant loss function (assuming there are not many outliers in your dataset), and accuracy is not a meaningful metric here (you may instead track mean_absolute_error, which is analogous to the loss). Hence, the second modification is:
classifier.compile(optimizer ='adam',loss='mean_squared_error')

How can I adjust the train, validation, and test sets for an RNN in Keras?

I'm playing with a time-series dataset and, with the help of reshape, I feed it to an RNN.
The dataset contains 40 time steps; each time, one row of 1440 columns is fed to the RNN for training, so in the end 40 samples of shape (1, 1440) are fed to and learned by the RNN.
X_train size: (40, 1, 1440)
X_test size: (40, 1, 1440)
The problem is that I want to split the dataset into a training set, a validation set, and a test set, so that I can predict after time step t and plot the output.
I have already read @YSelf's answer and see why it is helpful, but unfortunately I haven't managed to set it up yet because of the structure of my data.
What I want is to set things up so that I can adjust the split in the script; say I'm interested in predicting the time steps after the 35th row/time step until the end. How can I achieve this, and how can I plot the result?
My code is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from keras.layers import Dense , Activation , BatchNormalization
from keras.layers import LSTM,SimpleRNN
from keras.models import Sequential
from keras.optimizers import Adam, RMSprop
data_train = pd.read_csv("D:\Train.csv", header=None)
#select interested columns to predict 980 out of 1440 for prediction
j = 0
index = []
for i in range(1439):
    if j == 2:
        j = 0
        continue
    else:
        index.append(i)
        j += 1

Y_train = data_train[index]
Y_test = data_test[index]

data_train = data_train.values
data_test = data_test.values

X_train = data_train.reshape((data_train.shape[0], 1, data_train.shape[1]))
X_test = data_test.reshape((data_test.shape[0], 1, data_test.shape[1]))
# create and fit the SimpleRNN model
model_RNN = Sequential()
model_RNN.add(SimpleRNN(units=1440, input_shape=(X_train.shape[1], X_train.shape[2]))) #in real data units=1440
model_RNN.add(Dense(960)) # in real data Dense(960)
model_RNN.add(BatchNormalization())
model_RNN.add(Activation('tanh'))
model_RNN.compile(loss='mean_squared_error', optimizer='adam')
hist_RNN=model_RNN.fit(X_train[:30, :], Y_train[30:, :], epochs =50, batch_size =20,validation_data=(X_train[:30, :], Y_train[30:, :]),verbose=1)
I would really appreciate it if someone could explain how to do this. Thanks in advance.
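As an editorial aside (not from the original post), one way to express the chronological split described above, training on the first 30 time steps, validating on steps 30 to 34, and predicting steps 35 onward, could look like the following sketch; the exact boundaries are assumptions for illustration, and the variable names come from the code above:

# chronological split: no shuffling for time-series data
train_end, val_end = 30, 35   # assumed boundaries: 30 train, 5 validation, 5 test

X_tr,  Y_tr  = X_train[:train_end],        Y_train.values[:train_end]
X_val, Y_val = X_train[train_end:val_end], Y_train.values[train_end:val_end]
X_te,  Y_te  = X_train[val_end:],          Y_train.values[val_end:]

hist_RNN = model_RNN.fit(X_tr, Y_tr, epochs=50, batch_size=20,
                         validation_data=(X_val, Y_val), verbose=1)

# predict the held-out time steps (rows 35 to 39) and plot one of them
Y_pred = model_RNN.predict(X_te)
plt.plot(Y_te[0], label='actual (time step 35)')
plt.plot(Y_pred[0], label='predicted (time step 35)')
plt.legend()
plt.show()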

Keras: Re-use trained weights in a new experiment

I am quite new to Keras, so apologies in advance for any stupid mistakes. I am currently attempting to try out some good old cross-domain transfer learning between two datasets. I have a model here that is trained and executed on a voice recognition dataset I have generated (the code is at the bottom of this question because it's quite long).
If I were to train a new model, say model_2 on a different dataset, then I'd get a baseline from the initial random distribution of weights.
What I wonder is this: after training model_1 and model_2 (and this is the bit I don't know how to do), can I take the two dense layers of 256 and 128 units from model_1, with their trained weights, and use them as the starting point for a model_3, which trains on dataset 2 but starts from the weight distribution learned by model_1?
So, in the end, I have the following:
Model_1 which starts from a random distribution and trains on dataset 1
Model_2 which starts from a random distribution and trains on dataset 2
Model_3 which starts from the distribution trained in Model_1 and trains on dataset 2.
My question is: how would I go about doing step 3 above? I don't want to freeze the weights; I just want an initial weight distribution from a past experiment to start training from.
Any help would be greatly appreciated. Thank you! Apologies if I didn't make it quite clear enough what I'm going for.
My code to train Model_1 is as follows:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from keras.utils import np_utils
from keras.layers.normalization import BatchNormalization
import time
start = time.clock()
# fix random seed for reproducibility
seed = 1
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("voice.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
numVars = len(dataframe.columns) - 1
numClasses = dataframe[numVars].nunique()
X = dataset[:,0:numVars].astype(float)
Y = dataset[:,numVars]
print("THERE ARE " + str(numVars) + " ATTRIBUTES")
print("THERE ARE " + str(numClasses) + " UNIQUE CLASSES")
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
calls = [EarlyStopping(monitor='acc', min_delta=0.0001, patience=100, verbose=2, mode='max', restore_best_weights=True)]
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(BatchNormalization())
    model.add(Dense(256, input_dim=numVars, activation='sigmoid'))
    model.add(Dense(128, activation='sigmoid'))
    model.add(Dense(numClasses, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=2000, batch_size=1000, verbose=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold, fit_params={'callbacks':calls})
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
#your code here
print (time.clock() - start)
PS: The input attributes and outputs will all be the same between the two datasets; all that will change are the attribute values. Out of curiosity, can this be done if the two datasets have different numbers of output classes?
In short, to fine-tune Model_3 from Model_1, just call model.load_weights('/path/to/model_1.h5', by_name=True) after model.compile(...). Of course, you must have saved the trained Model_1 first.
If I understood correctly, you have the same number of features and classes in the two datasets, so you do not even need to redesign your model. If you had a different set of classes, you would have to give different names to the last layers of Model_1 and Model_3:
model.add(Dense(numClasses, activation='softmax', name='some_unique_name'))
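A minimal sketch of how that could look end to end (an illustration, not code from the original post): the helper below mirrors baseline_model() from the question but names its layers explicitly and declares the input shape up front so that load_weights(..., by_name=True) can match layers; model_1.h5, X2, dummy_y2, and numClasses2 are placeholder names for the saved weights file and the second dataset:

def named_model(num_classes, out_name):
    # same layers as baseline_model(), but built immediately (input_shape on
    # the first layer) and with explicit names for by_name weight matching
    model = Sequential()
    model.add(BatchNormalization(input_shape=(numVars,), name='bn'))
    model.add(Dense(256, activation='sigmoid', name='dense_a'))
    model.add(Dense(128, activation='sigmoid', name='dense_b'))
    model.add(Dense(num_classes, activation='softmax', name=out_name))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Model_1: train on dataset 1, then save the learned weights
model_1 = named_model(numClasses, out_name='output_1')
model_1.fit(X, dummy_y, epochs=2000, batch_size=1000, callbacks=calls)
model_1.save_weights('model_1.h5')

# Model_3: same hidden layers, initialised from Model_1, then trained on
# dataset 2. Nothing is frozen, so every layer keeps training. Giving the
# output layer a different name means its weights start from scratch, which
# also covers the case of a different number of classes.
model_3 = named_model(numClasses2, out_name='output_2')
model_3.load_weights('model_1.h5', by_name=True)
model_3.fit(X2, dummy_y2, epochs=2000, batch_size=1000, callbacks=calls)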
