Keras custom loss: a * MAE + (1-a) * constant

I am trying to implement a fairly simple custom loss function in Keras.
I am trying to make the network flag bad input cases (i.e. inputs on which it has no chance of predicting the correct output) in addition to predicting the output itself. To do this, I used a loss function that lets the network 'choose' a constant loss (8) instead of its current loss (determined by MAE).
loss = quality * MAE + (1 - quality) * 8
Where quality is the output of a sigmoid, so in [0, 1].
How would I design such a loss function properly in Keras?
Specifically, in the basic case, the network gets several predictions of the output, along with metrics known or thought to correlate with prediction quality. The role of the (small) network is to use these metrics to determine the weights to give when averaging these different predictions. This works well enough.
However, in some fraction of cases (say 5-10%) the input data is so bad that all predictors will be wrong. In that case, I want to output '?' to the user instead of a wrong answer.
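The weighting-and-rejecting scheme described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name combine_predictions, the 0.5 rejection threshold, and the toy numbers are hypothetical and not taken from the original network.

```python
import numpy as np

def combine_predictions(preds, weight_logits, quality, threshold=0.5):
    """Softmax-weighted average of several predictions, or '?' when quality is low."""
    # Softmax over the per-predictor scores (shifted for numerical stability).
    shifted = weight_logits - np.max(weight_logits)
    weights = np.exp(shifted) / np.sum(np.exp(shifted))
    if quality < threshold:  # hypothetical cut-off, not from the original post
        return "?"
    return float(np.dot(weights, preds))

preds = np.array([60.0, 62.0, 90.0, 61.0, 59.0])  # five candidate predictions
logits = np.array([2.0, 2.0, -3.0, 2.0, 2.0])     # the outlier gets a low score
good = combine_predictions(preds, logits, quality=0.9)
bad = combine_predictions(preds, logits, quality=0.1)
```

Here the outlier prediction of 90 barely influences the average, and the low-quality call returns '?' instead of a number.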
My code complained about 1 array vs 2 arrays (presumably, identical number of y_true and y_pred are expected, but I don't have these).
model = Model(inputs=[ain, in1, in2, in3, in4, in5, x], outputs=[pred,qual])
model.compile(loss=quality_loss, optimizer='adam', metrics=['mae'])
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs], ref, epochs=50, batch_size=5000, verbose=2, shuffle=True)
It seems having two outputs is causing the loss function to be called independently for each output.
ValueError: Error when checking model target: the list of Numpy arrays that you
are passing to your model is not the size the model expected. Expected to see 2
array(s), but instead got the following list of 1 arrays:
This was solved by passing a concatenated array instead.
def quality_loss(y_true, y_pred):
    qual = y_pred[:,0]
    hr = y_pred[:,1]
    const = 8
    return qual * mean_absolute_error(y_true,hr) + (1 - qual) * const
def my_mae(y_true,y_pred):
    return mean_absolute_error(y_true,y_pred[:,1])
model = Model(inputs=[xin, in1, in2, in3, in4, in5, hr], outputs=concatenate([qual, pred_hr]))
model.compile(loss=quality_loss, optimizer='adam', metrics=[my_mae])
Network code:
xin = Input(shape=(1,))
in1 = Input(shape=(4,))
net1 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in1) )
in2 = Input(shape=(4,))
net2 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in2) )
in3 = Input(shape=(4,))
net3 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in3) )
in4 = Input(shape=(4,))
net4 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in4) )
in5 = Input(shape=(4,))
net5 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in5) )
smweights = Dense(5, activation='softmax')( concatenate([xin, net1, net2, net3, net4, net5]) )
qual = Dense(1, activation='sigmoid')( Dense(3, activation='tanh')( concatenate([xin, net1, net2, net3, net4, net5]) ) )
x = Input(shape=(5,))
pred = dot([x, smweights], axes=1)
This runs, but converges to loss = const and mae > 25 (whereas a simple mae loss here achieves 3-4 quite easily). Something is still not quite right with the loss function. Since shape on y_true/y_pred in the loss function gives (?) it's hard to track what is being passed exactly.

This issue is not caused by your custom loss function. The problem arises from how you call the fit function.
When you define the model, you give it 7 inputs and 2 outputs:
model = Model(inputs=[ain, in1, in2, in3, in4, in5, x], outputs=[pred,qual])
When you eventually call the fit function, you give it a list of 7 arrays as the input of the network but only 1 target output array, called ref:
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs], ref, ...)
This will not work. You have to supply the fit function with the same number of inputs and outputs as declared in the model's definition.
Edit: I think there is a conceptual problem with your approach: how are you actually planning to define the quality of your prediction? Why do you think that adding a branch to your network, which is supposed to judge the quality of the network's prediction, will actually help to train it? The network will converge to a local minimum of the loss function. The fancier your loss function is, the more likely it is that it will not converge to the state you want, but to some other local, non-global minimum. You could experiment with different optimizers and learning rates; maybe this helps your training.
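For reference, the per-sample loss the question describes can be written out in NumPy. This is a minimal sketch, assuming y_true has shape (batch, 1) and y_pred stacks [quality, prediction] along the last axis as in the concatenated model. The name quality_loss_np is hypothetical, and flattening y_true is an assumption about the intended shapes rather than a confirmed fix.

```python
import numpy as np

def quality_loss_np(y_true, y_pred, const=8.0):
    """Per-sample loss: quality * |error| + (1 - quality) * const."""
    qual = y_pred[:, 0]                        # shape (batch,)
    hr = y_pred[:, 1]                          # shape (batch,)
    abs_err = np.abs(y_true.reshape(-1) - hr)  # flatten targets to (batch,)
    return qual * abs_err + (1.0 - qual) * const

y_true = np.array([[60.0], [70.0]])
y_pred = np.array([[1.0, 62.0],    # confident and close: pays its error (2)
                   [0.0, 100.0]])  # not confident: pays the constant (8)
losses = quality_loss_np(y_true, y_pred)
```

Because the loss is linear in quality, each sample is best served by quality 1 when its absolute error is below the constant and by quality 0 otherwise.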

Related

Model not improving with GradientTape but with model.fit()

I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)
    x2 = Input((62, 3))
    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])
    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Activation('softmax')(x)
    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)
    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer, however, I found the abs to be more robust. Changing it does not change the outcome. The input x1 of the model is a sequence, x2 are some additional features, that are later concatenated to the embedded x1 sequence. For my approach, I'm not using the MSE, but it works either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs are shown here:
[Figure: custom training]
[Figure: keras training]
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using the MSE, compared to my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization won't come close to the performance of keras fit.

I have found the solution: it was a simple shape mismatch, which was not picked up by any error check and ran without complaint both with my custom loss function and with MSE. Using x = Reshape(())(x) as the final layer did the trick.
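To see how such a mismatch can go unnoticed, here is a small NumPy sketch (NumPy follows the same broadcasting rules as TensorFlow). It assumes, as the fix suggests but the post does not state explicitly, that the targets had shape (batch,) while the model output had shape (batch, 1). The arrays are toy values.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])        # shape (3,)
y_pred = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), a perfect prediction

# (3,) minus (3, 1) broadcasts to (3, 3): every target is compared
# with every prediction instead of only its own.
mismatched = np.square(y_true - y_pred)

# After flattening the prediction, the shapes line up element by element.
matched = np.square(y_true - y_pred.reshape(-1))

mismatched_mse = mismatched.mean()
matched_mse = matched.mean()
```

Both versions run without an error, but the mismatched one reports a non-zero loss for a perfect prediction, which would be consistent with the stalled training described above.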

How to make predictions on new dataset with tensorflow's gradient tape

While I'm able to understand how to use model.fit(x_train, y_train), I can't figure out how to make predictions on new data using tensorflow's gradient tape. My github repository with runnable code (up to an error) can be found here. What is currently working is that I get the trained model "network_output", however it appears that with gradient tape, argmax is being used on the model itself, where I'm used to model.fit() taking the test data as an input:
network_output = trained_network(input_images,input_number)
preds = np.argmax(network_output, axis=1)
Where "input_images" is an ndarray: (20,3,3,1) and "input_number" is an ndarray: (20,5).
Now I'm taking network_output as the trained model and would like to use it to predict similarly typed data of test_images, and test_number respectively.
The error 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'predict' is raised here:
predicted_number = network_output.predict(test_images)
Which is because I don't know how to use the tape to make predictions. However once the prediction works I would guess I can compare the resulting "predicted_number" against the "test_number" as would usually be done using the model.fit method.
acc = 0
for i in range(len(test_images)):
    if (predicted_number[i] == test_number[i]):
        acc += 1
# divide by the number of test samples that were actually compared
print("Accuracy: ", acc / len(test_images) * 100, "%")

In order to obtain predictions, I usually iterate through batches manually like this:
predictions = []
for batch in range(num_batch):
    logits = trained_network(x_test[batch * batch_size: (batch + 1) * batch_size], training=False)
    # first obtain probabilities
    # (if the last layer of the network has no activation, otherwise skip the softmax here)
    prob = tf.nn.softmax(logits)
    # putting back together predictions for all batches
    predictions.extend(tf.argmax(input=prob, axis=1))
If you don't have a lot of data, you can skip the loop. This is faster than using predict because you directly invoke the __call__ method of the model:
logits = trained_network(x_test, training=False)
prob = tf.nn.softmax(logits)
predictions = tf.argmax(input=prob, axis=1)
Finally, you could also use predict. In this case the batches are handled automatically. It is easier to use when you have lots of data since you don't have to create a loop to iterate through batches. The result is a NumPy array of predictions. It can be used like this:
predictions = trained_network.predict(x_test) # you can set a batch_size if you want
What you're doing wrong is this part:
network_output = trained_network(input_images,input_number)
predicted_number = network_output.predict(test_images)
You have to call predict directly on your model trained_network.
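The manual batching approach can be sketched without TensorFlow to show the bookkeeping. In this sketch toy_model is a hypothetical stand-in for trained_network (a plain matrix product), and predict_in_batches is an illustrative helper name. Stepping through the data with range(0, len(x), batch_size) also covers a final, smaller batch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def predict_in_batches(model_fn, x, batch_size):
    """Call model_fn on consecutive slices of x and collect the argmax classes."""
    predictions = []
    for start in range(0, len(x), batch_size):
        logits = model_fn(x[start:start + batch_size])
        predictions.extend(np.argmax(softmax(logits), axis=1))
    return np.array(predictions)

# Hypothetical stand-in for a trained model: 4 features in, 3 class scores out.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))

def toy_model(batch):
    return batch @ weights

x_test = rng.normal(size=(10, 4))
batched = predict_in_batches(toy_model, x_test, batch_size=4)  # batches of 4, 4, 2
full = np.argmax(softmax(toy_model(x_test)), axis=1)           # single call
```

Both routes give one class index per test sample, so either result can be compared directly against the test labels.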

Siamese Network for binary classification with pre-encoded inputs

I want to train a Siamese Network to compare vectors for similarity.
My dataset consists of pairs of vectors and a target column with "1" if they are the same and "0" otherwise (binary classification):
import pandas as pd
# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())
y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())
# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val
# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]
assert X_left_train.shape == X_right_train.shape
# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")
print(y_test.value_counts())
X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]
returns
v0 v1 v2 ... v397 v398 v399 class
0 0.003615 0.013794 0.030388 ... -0.093931 0.106202 0.034870 0.0
1 0.018988 0.056302 0.002915 ... -0.007905 0.100859 -0.043529 0.0
2 0.072516 0.125697 0.111230 ... -0.010007 0.064125 -0.085632 0.0
3 0.051016 0.066028 0.082519 ... 0.012677 0.043831 -0.073935 1.0
4 0.020367 0.026446 0.015681 ... 0.062367 -0.022781 -0.032091 0.0
1.0 1060
0.0 923
Name: class, dtype: int64
1.0 354
0.0 308
Name: class, dtype: int64
The rest of my script is as follows:
import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model
def euclidean_distance(vectors):
    """
    Find the Euclidean distance between two vectors.
    """
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Epsilon is small value that makes very little difference to the value of the denominator, but ensures that it isn't equal to exactly zero.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))
def contrastive_loss(y_true, y_pred):
    """
    Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together.
    See:
    * https://gombru.github.io/2019/04/03/ranking_loss/
    """
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))
def accuracy(y_true, y_pred):
    """
    Compute classification accuracy with a fixed threshold on distances.
    """
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))
def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    input1 = Input(input_dim, name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)
def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)
    left_input = Input(input_dim)
    right_input = Input(input_dim)
    # Since this is a siamese nn, both sides share the same network.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)
    # The euclidean distance layer outputs close to 0 value when two inputs are similar and 1 otherwise.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    return siamese_net
model = build_siamese_model(X_left_train.shape[1])
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    epochs=100,
    callbacks=[es_callback],
    verbose=1,
)
I have plotted the contrastive loss vs epoch and model accuracy vs epoch:
The validation line is almost flat, which seems odd to me (overfitted?).
After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:
Somehow it looks better, but yields bad predictions as well.
My questions are:
Most examples of Siamese Networks I've seen so far involve embedding layers (text pairs) and/or convolution layers (image pairs). My input pairs are the actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?
The output layer of my Siamese Network is as follows:
distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
but someone over the internet suggested this instead:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="sigmoid")(distance) # returns the class probability
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
I'm not sure which one to trust nor the difference between them (except that the former returns the distance and the latter returns the class probability). In my experiments, I get poor results with binary_crossentropy.
EDIT:
After following @PlzBePython's suggestions, I came up with the following base network:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
Thank you for your help!

This is less of an answer and more writing my thoughts down and hoping they can help find an answer.
In general, everything you do seems pretty reasonable to me.
Regarding your Questions:
1:
Embedding or feature extraction layers are never a must, but they almost always make it easier to learn the intended behavior. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes your model independent of the location of a word. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
2:
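In my opinion, those two loss functions are not too different. With label y, predicted probability p and predicted distance d, Binary Crossentropy is defined as:
$$L_{BCE} = -\left( y \log(p) + (1 - y) \log(1 - p) \right)$$
While Contrastive Loss (margin = 1) is:
$$L_{C} = y \, d^2 + (1 - y) \, \max(1 - d, 0)^2$$
So you basically swap a log function for a hinge function.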
The only real difference comes from the distance calculation. You probably got suggested using some kind of L1 distance, since L2 distance is supposed to perform worse with higher dimensions (see for example here) and your dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.
What I would try is:
increase the margin parameter. "1" always results in a pretty low loss in the false positive case. This could slow down training in general
try out embedding into the [-inf, inf] space (change last layer embedding activation to "linear")
change "binary_crossentropy" loss into "keras.losses.BinaryCrossentropy(from_logits=True)" and last activation from "sigmoid" to "linear". This should actually not make a difference, but I've had some weird experiences with the keras binary crossentropy function and from_logits seems to help sometimes
increase parameters
Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that when the validation accuracy is calculated in the first epoch, the model has already done about 30 weight updates (991 training rows, batch_size = 32). That means, especially in the first epoch, a validation accuracy that is higher than the training accuracy (which is calculated during training) is kind of to be expected. Also, this can sometimes cause the misbelief that training loss is increasing faster than validation loss.
EDIT
I recommended "linear" in the last layer, because tensorflow recommends it ("from_logits"=True which requires value in [-inf, inf]) for Binary Crossentropy. In my experience, it converges better.
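To make the role of the margin concrete, here is a NumPy sketch of the two losses discussed above. It mirrors the contrastive_loss from the question. The helper names and toy values are illustrative assumptions, not part of the original code.

```python
import numpy as np

def contrastive_loss_np(y_true, dist, margin=1.0):
    """y_true = 1 pulls pairs together; y_true = 0 pushes them beyond the margin."""
    similar_term = y_true * np.square(dist)
    dissimilar_term = (1.0 - y_true) * np.square(np.maximum(margin - dist, 0.0))
    return np.mean(similar_term + dissimilar_term)

def binary_crossentropy_np(y_true, prob, eps=1e-7):
    """Binary cross-entropy on probabilities, clipped to avoid log(0)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(prob) + (1.0 - y_true) * np.log(1.0 - prob)))

# A dissimilar pair (label 0) that the model places too close together.
y_dissimilar = np.array([0.0])
dist = np.array([0.5])
loss_margin_1 = contrastive_loss_np(y_dissimilar, dist, margin=1.0)
loss_margin_2 = contrastive_loss_np(y_dissimilar, dist, margin=2.0)

# The same kind of mistake under cross-entropy: label 0, predicted probability 0.99.
bce_confident_wrong = binary_crossentropy_np(y_dissimilar, np.array([0.99]))
```

With margin = 1 a dissimilar pair at distance 0.5 contributes only 0.25 to the loss, whereas margin = 2 raises that to 2.25, which is the effect the margin suggestion above is after.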

How to iterate through tensors in custom loss function?

I'm using keras with tensorflow backend. My goal is to query the batch size of the current batch in a custom loss function. This is needed to compute values of the custom loss function which depend on the index of particular observations. I'd like to make this clearer with the minimal reproducible examples below.
(BTW: Of course I could use the batch size defined for the training procedure and plug in its value when defining the custom loss function, but there are reasons why it can vary. In particular, if epochsize % batchsize (epochsize modulo batchsize) is not zero, the last batch of an epoch has a different size.) I didn't find a suitable approach on Stack Overflow, e.g. in
Tensor indexing in custom loss function, Tensorflow custom loss function in Keras - loop over tensor, and Looping over a tensor, because the shape of a tensor can't be inferred when building the graph, which is the case for a loss function. Shape inference is only possible when evaluating with data, which in turn requires the graph. Hence I need to tell the custom loss function to do something with particular elements along a certain dimension without knowing the length of that dimension.
(this is the same in all examples)
from keras.models import Sequential
from keras.layers import Dense, Activation
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
example 1: nothing special without issue, no custom loss
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
(Output omitted, this runs perfectly fine)
example 2: nothing special, with a fairly simple custom loss
def custom_loss(yTrue, yPred):
    loss = np.abs(yTrue-yPred)
    return loss
model.compile(optimizer='rmsprop',
loss=custom_loss,
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
(Output omitted, this runs perfectly fine)
example 3: the issue
def custom_loss(yTrue, yPred):
    print(yPred) # Output: Tensor("dense_2/Sigmoid:0", shape=(?, 1), dtype=float32)
    n = yPred.shape[0]
    for i in range(n): # TypeError: __index__ returned non-int (type NoneType)
        loss = np.abs(yTrue[i]-yPred[int(i/2)])
    return loss
model.compile(optimizer='rmsprop',
loss=custom_loss,
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
Of course, the tensor has no shape info yet; it can't be inferred when building the graph, only at training time. Hence for i in range(n) raises an error. Is there any way to do this?
The traceback of the output:
-------
BTW here's my true custom loss function in case of any questions. I skipped it above for clarity and simplicity.
def neg_log_likelihood(yTrue,yPred):
    yStatus = yTrue[:,0]
    yTime = yTrue[:,1]
    n = yTrue.shape[0]
    for i in range(n):
        s1 = K.greater_equal(yTime, yTime[i])
        s2 = K.exp(yPred[s1])
        s3 = K.sum(s2)
        logsum = K.log(s3)
        loss = K.sum(yStatus[i] * yPred[i] - logsum)
    return loss
Here's an image of the partial negative log-likelihood of the Cox proportional hazards model.
This is to clarify a question in the comments to avoid confusion. I don't think it is necessary to understand this in detail to answer the question.

As usual, don't loop. Loops have severe performance drawbacks and are a source of bugs. Use only backend functions unless a loop is totally unavoidable (it usually is avoidable).
Solution for example 3:
So, there is a very weird thing there...
Do you really want to simply ignore half of your model's predictions? (Example 3)
Assuming this is true, just duplicate your tensor in the last dimension, flatten and discard half of it. You have the exact effect you want.
def custom_loss(true, pred):
    n = K.shape(pred)[0:1]
    pred = K.concatenate([pred]*2, axis=-1) #duplicate in the last axis
    pred = K.flatten(pred) #flatten
    pred = K.slice(pred, #take only half (= n samples)
                   K.constant([0], dtype="int32"),
                   n)
    true = K.flatten(true) #match pred's flat shape so (n, 1) - (n,) does not broadcast to (n, n)
    return K.abs(true - pred)
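The duplicate, flatten and slice steps can be checked against the original loop in plain NumPy. This is just a sketch with toy values, where NumPy arrays stand in for the backend tensors.

```python
import numpy as np

pred = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])  # shape (n, 1)
n = pred.shape[0]

# Vectorised version: duplicate along the last axis, flatten, keep the first n.
duplicated = np.concatenate([pred, pred], axis=-1)  # shape (n, 2)
flat = duplicated.reshape(-1)                       # [p0, p0, p1, p1, ...]
half = flat[:n]

# Loop version from the question: element i uses pred[int(i / 2)].
looped = np.array([pred[int(i / 2), 0] for i in range(n)])
```

Both arrays hold the same values, so the vectorised form reproduces the indexing of the loop without needing the batch size when the graph is built.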
Solution for your loss function:
If you have sorted times from greater to lower, just do a cumulative sum.
Warning: If you have one time per sample, you cannot train with mini-batches!!!
batch_size = len(labels)
It makes sense to have time in an additional dimension (many times per sample), as is done in recurrent and 1D conv networks. Anyway, considering your example as expressed, that is shape (samples_equal_times,) for yTime:
def neg_log_likelihood(yTrue,yPred):
    yStatus = yTrue[:,0]
    yTime = yTrue[:,1]
    n = K.shape(yTrue)[0]
    #sort the times and everything else from greater to lower:
    #obs, you can have the data sorted already and avoid doing it here for performance
    #important, yTime will be sorted in the last dimension, make sure its (None,) in this case
    # or that it's (None, time_length) in the case of many times per sample
    sortedTime, sortedIndices = tf.math.top_k(yTime, n, True)
    sortedStatus = K.gather(yStatus, sortedIndices)
    sortedPreds = K.gather(yPred, sortedIndices)
    #do the calculations
    exp = K.exp(sortedPreds)
    sums = K.cumsum(exp) #this will have the sum for j >= i in the loop
    logsums = K.log(sums)
    return K.sum(sortedStatus * sortedPreds - logsums)
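The claim that a cumulative sum over descending times replaces the loop can be verified in NumPy. This sketch rests on assumptions that go beyond the code above: it uses the textbook Cox partial likelihood in which the event indicator multiplies the whole term, it negates the sum so that smaller is better, it takes predictions as a flat (n,) array, and it assumes no tied times. The function names are hypothetical.

```python
import numpy as np

def neg_log_partial_likelihood_loop(status, time, pred):
    """Reference version: explicit loop over samples and their risk sets."""
    total = 0.0
    for i in range(len(time)):
        at_risk = time >= time[i]  # samples still at risk at time[i]
        log_sum = np.log(np.sum(np.exp(pred[at_risk])))
        total += status[i] * (pred[i] - log_sum)
    return -total

def neg_log_partial_likelihood_cumsum(status, time, pred):
    """Vectorised version: sort by descending time, then cumulative sum."""
    order = np.argsort(-time)
    sorted_status = status[order]
    sorted_pred = pred[order]
    # After sorting, position i sums exp(pred) over all samples with time >= time_i.
    log_sums = np.log(np.cumsum(np.exp(sorted_pred)))
    return -np.sum(sorted_status * (sorted_pred - log_sums))

status = np.array([1.0, 0.0, 1.0, 1.0])  # 1 = event observed, 0 = censored
time = np.array([5.0, 3.0, 8.0, 1.0])    # distinct times (no ties)
pred = np.array([0.2, -0.1, 0.4, 0.0])   # model outputs (log-risk scores)

loop_value = neg_log_partial_likelihood_loop(status, time, pred)
cumsum_value = neg_log_partial_likelihood_cumsum(status, time, pred)
```

The two values agree because, once the samples are ordered from largest to smallest time, the risk set of sample i is exactly the prefix ending at i.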

low training accuracy of a neural network with adult income dataset

I built a neural network with tensorflow. It is a simple 3 layer neural network with the last layer being softmax.
I tried it on standard adult income dataset (e.g. https://archive.ics.uci.edu/ml/datasets/adult) since it is publicly available, has a good amount of data (roughly 50k examples) and also provides separate test data.
As there are some categorical attributes, I converted them into one hot encodings. For neural network I used Xavier initialization and Adam Optimizer. As there are only two output classes (>50k and <=50k) the last softmax layer had only two neurons. After one hot encoding expansion, the 14 attributes / columns expanded into 108 columns.
I experimented with different number of neurons in the first two hidden layers (from 5 to 25). I also experimented with number of iterations (from 1000 to 20000).
The training accuracy wasn't affected much by the number of neurons. It went up a little with more iterations. However, I could not do any better than 82% :(
Am I missing something basic in my approach? Has anyone tried this (neural network with this dataset)? If so, what are the expected results? Could the low accuracy be due to missing values? (I am planning to try filtering out all the missing values if there aren't many in the dataset).
Any other ideas? Here is my tensorflow neural network code in case there are any bugs in it etc.
def create_placeholders(n_x, n_y):
    X = tf.placeholder(tf.float32, [n_x, None], name = "X")
    Y = tf.placeholder(tf.float32, [n_y, None], name = "Y")
    return X, Y
def initialize_parameters(num_features):
    tf.set_random_seed(1) # so that your "random" numbers match ours
    layer_one_neurons = 5
    layer_two_neurons = 5
    layer_three_neurons = 2
    W1 = tf.get_variable("W1", [layer_one_neurons,num_features], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b1 = tf.get_variable("b1", [layer_one_neurons,1], initializer = tf.zeros_initializer())
    W2 = tf.get_variable("W2", [layer_two_neurons,layer_one_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b2 = tf.get_variable("b2", [layer_two_neurons,1], initializer = tf.zeros_initializer())
    W3 = tf.get_variable("W3", [layer_three_neurons,layer_two_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
    b3 = tf.get_variable("b3", [layer_three_neurons,1], initializer = tf.zeros_initializer())
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    return parameters
def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    the shapes are given in initialize_parameters
    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    # Retrieve the parameters from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']
    Z1 = tf.add(tf.matmul(W1, X), b1)
    A1 = tf.nn.relu(Z1)
    Z2 = tf.add(tf.matmul(W2, A1), b2)
    A2 = tf.nn.relu(Z2)
    Z3 = tf.add(tf.matmul(W3, A2), b3)
    return Z3
def compute_cost(Z3, Y):
    """
    Computes the cost
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    Returns:
    cost - Tensor of the cost function
    """
    # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))
    return cost
def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001, num_epochs = 1000, print_cost = True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
    Arguments:
    X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
    Y_train -- test set, of shape (output size = 6, number of training examples = 1080)
    X_test -- training set, of shape (input size = 12288, number of training examples = 120)
    Y_test -- test set, of shape (output size = 6, number of test examples = 120)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    print_cost -- True to print the cost every 100 epochs
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1) # to keep consistent results
    seed = 3 # to keep consistent results
    (n_x, m) = X_train.shape # (n_x: input size, m : number of examples in the train set)
    n_y = Y_train.shape[0] # n_y : output size
    costs = [] # To keep track of the cost
    # Create Placeholders of shape (n_x, n_y)
    X, Y = create_placeholders(n_x, n_y)
    # Initialize parameters
    parameters = initialize_parameters(X_train.shape[0])
    # Forward propagation: Build the forward propagation in the tensorflow graph
    Z3 = forward_propagation(X, parameters)
    # Cost function: Add cost function to tensorflow graph
    cost = compute_cost(Z3, Y)
    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
    # Initialize all the variables
    init = tf.global_variables_initializer()
    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:
        # Run the initialization
        sess.run(init)
        # Do the training loop
        for epoch in range(num_epochs):
            _ , epoch_cost = sess.run([optimizer, cost], feed_dict={X: X_train, Y: Y_train})
            # Print the cost every epoch
            if print_cost == True and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            if print_cost == True and epoch % 5 == 0:
                costs.append(epoch_cost)
        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('iterations (per tens)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()
        # lets save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")
        # Calculate the correct predictions
        correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))
        # Calculate accuracy on the test set
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        #print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
        return parameters
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.framework import ops
import pandas as pd
%matplotlib inline
np.random.seed(1)
df = pd.read_csv('adult.data', header = None)
X_train_orig = df.drop(df.columns[[14]], axis=1, inplace=False)
Y_train_orig = df[[14]]
X_train = pd.get_dummies(X_train_orig) # get one hot encoding
Y_train = pd.get_dummies(Y_train_orig) # get one hot encoding
parameters = model(X_train.T, Y_train.T, None, None, num_epochs = 10000)
Any suggestions for other publicly available datasets to try this out on?
I tried standard scikit-learn algorithms on this dataset with default parameters and got the following accuracies (%):
Random Forest: 86
SVM: 96
kNN: 83
MLP: 79
I have uploaded my iPython notebook for this at: https://github.com/sameermahajan/ClassifiersWithIncomeData/blob/master/Scikit%2BLearn%2BClassifiers.ipynb
The best accuracy is with SVM, which is to be expected given the explanation at: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html Interestingly, SVM also took a lot of time to run, far more than any other method.
Judging by the MLPClassifier accuracy above, this may not be a good problem to solve with a neural network. My neural network wasn't that bad after all! Thanks for all the responses and your interest in this.

I haven't experimented with this dataset, but after looking at some papers and doing some research, it looks like your network is doing OK.
First, is your accuracy calculated on the training set or the test set? Having both will give you a good hint of how your network is performing.
I'm still fairly new to machine learning, but maybe I can help:
Looking at the dataset documentation here: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
and this paper: https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a120.pdf
85% accuracy on both the training and test sets looks like a good score, so you are not too far off.
Do you use some kind of cross-validation to check whether your network is overfitting?
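For reference, here is a minimal, library-free sketch of how k-fold cross-validation splits could be generated. The helper name `k_fold_indices`, the fold count and the seed are illustrative assumptions, not something taken from the question. The idea is to train on each training split and compare the mean training accuracy with the mean validation accuracy; a large gap between the two would suggest overfitting.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration: shuffles the sample indices once,
    then uses each of the k folds in turn as the validation set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(10, k=5):
    # Here one would train on train_idx and report accuracy on both index sets.
    print(len(train_idx), len(val_idx))
```

With the layout used in the question (one example per column), the index lists would select columns, e.g. `X_train[:, train_idx]` if `X_train` is a NumPy array.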
I don't have your code, so I can't tell whether this is a bug or a programming-related issue; sharing your code might be a good idea.
I think you would gain more accuracy by pre-processing your data a bit:
There are a lot of unknowns in your data, and neural networks are very sensitive to mislabeling and bad data.
You should try to find and replace or remove the unknowns.
You could also try to identify the most useful features and drop the ones that are nearly useless.
Feature scaling / data normalization can also be quite important for neural networks. I didn't look much into the data, but you could try scaling your features to [0, 1] if that's not done already.
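As a rough sketch of two of these pre-processing steps, the snippet below replaces unknowns with the most frequent known value and rescales a numeric column to [0, 1]. It assumes unknowns are marked with "?" and uses made-up toy columns; the helper names `fill_unknowns` and `min_max_scale` are hypothetical, and replacing with the mode is only one of several possible strategies.

```python
from collections import Counter

def fill_unknowns(values, unknown="?"):
    """Replace unknown markers with the most frequent known value (mode imputation)."""
    known = [v for v in values if v != unknown]
    if not known:
        # Nothing to impute from; leave the column untouched.
        return list(values)
    most_common = Counter(known).most_common(1)[0][0]
    return [most_common if v == unknown else v for v in values]

def min_max_scale(values):
    """Linearly rescale numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy columns, made up for illustration (not taken from adult.data).
workclass = ["Private", "?", "Private", "State-gov"]
age = [20, 30, 40, 60]

print(fill_unknowns(workclass))  # ['Private', 'Private', 'Private', 'State-gov']
print(min_max_scale(age))        # [0.0, 0.25, 0.5, 1.0]
```

When a separate test set is used, the minimum and maximum (and the imputed value) should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.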
The paper I linked seems to show performance improving as layers are added, up to 5 layers. Did you try adding more layers?
You can also add dropout if your network overfits, if you haven't already.
I would maybe also try other models that are generally good at these tasks, like SVM (Support Vector Machine), Logistic Regression or even Random Forest, though judging by the results I'm not sure they will perform better than the artificial neural network.
I would also take a look at these links: https://www.kaggle.com/wenruliu/adult-income-dataset/feed
https://www.kaggle.com/wenruliu/income-prediction
There, people try out algorithms and give tips on processing the data and tackling this problem.
Hope this helps.
Good luck,
Marc.

I think you are focusing too much on your network structure and forgetting that your results also depend largely on data quality. I tried a quick off-the-shelf random forest and it gave me results similar to yours (acc = 0.8275238).
I suggest you do some feature engineering (the Kaggle link provided by @Marc has some nice examples). Decide on a strategy for your NAs (look here), group values when categorical variables have many factor levels (e.g. countries grouped into continents), or discretise continuous variables (e.g. the age variable into levels such as old, mid_aged, young).
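A small sketch of the last two ideas, grouping factor levels and discretising a continuous variable. The country-to-continent mapping, the age cut points and the helper names are all illustrative assumptions; a real mapping would have to cover every level in the dataset.

```python
import bisect

# Hypothetical mapping; a real one would cover every country in the dataset.
COUNTRY_TO_CONTINENT = {
    "United-States": "North-America",
    "Mexico": "North-America",
    "Germany": "Europe",
    "India": "Asia",
}

# Illustrative cut points: age < 30 -> young, 30 <= age < 60 -> mid_aged, else old.
AGE_EDGES = [30, 60]
AGE_LABELS = ["young", "mid_aged", "old"]

def group_country(country):
    """Map a country to its continent, falling back to 'Other' for unmapped values."""
    return COUNTRY_TO_CONTINENT.get(country, "Other")

def age_bucket(age):
    """Return the label of the age interval that contains `age`."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

print(group_country("Germany"), group_country("Peru"))  # Europe Other
print([age_bucket(a) for a in (22, 30, 59, 60)])  # ['young', 'mid_aged', 'mid_aged', 'old']
```

Grouping this way reduces the number of one-hot columns produced later, which can help when many levels have only a handful of examples.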
Play with your data, study your dataset and try to apply domain expertise to remove redundant or overly narrow information. Once this is done, start tweaking your model. Additionally, you can consider doing as I did: use ensemble models (which are usually fast and pretty accurate with the default values) like RF or XGB to check whether the results are consistent across all your models. Once you are sure you are on the right track, you can start tweaking structure, layers, etc. and see if you can push your results even further.
Hope this helps.
Good luck!
