How do you train a linear model in tensorflow? - python

I have generated some input data in a CSV where 'awesomeness' is 'age * 10'. It looks like this:
age, awesomeness
67, 670
38, 380
32, 320
69, 690
40, 400
It should be trivial to write a tensorflow model that can predict 'awesomeness' from 'age', but I can't make it work.
When I run training, the output I get is:
accuracy: 0.0 <----------------------------------- What!??
accuracy/baseline_label_mean: 443.8
accuracy/threshold_0.500000_mean: 0.0
auc: 0.0
global_step: 6000
labels/actual_label_mean: 443.8
labels/prediction_mean: 1.0
loss: -2.88475e+09
precision/positive_threshold_0.500000_mean: 1.0
recall/positive_threshold_0.500000_mean: 1.0
Please note that this is obviously a completely contrived example, but that is because I was getting the same result with a more complex meaningful model with a much larger data set; 0% accuracy.
This is my attempt at the most minimal possible reproducible test case that I can make which exhibits the same behaviour.
Here's what I'm doing, based on the census example for the DNNClassifier from tflearn:
COLUMNS = ["age", "awesomeness"]
CONTINUOUS_COLUMNS = ["age"]
OUTPUT_COLUMN = "awesomeness"
def build_estimator(model_dir):
"""Build an estimator."""
age = tf.contrib.layers.real_valued_column("age")
deep_columns = [age]
m = tf.contrib.learn.DNNClassifier(model_dir=model_dir,
feature_columns=deep_columns,
hidden_units=[50, 10])
return m
def input_fn(df):
"""Input builder function."""
feature_cols = {k: tf.constant(df[k].values, shape=[df[k].size, 1]) for k in CONTINUOUS_COLUMNS}
output = tf.constant(df[OUTPUT_COLUMN].values, shape=[df[OUTPUT_COLUMN].size, 1])
return feature_cols, output
def train_and_eval(model_dir, train_steps):
"""Train and evaluate the model."""
train_file_name, test_file_name = training_data()
df_train = pd.read_csv(...) # ommitted for clarity
df_test = pd.read_csv(...)
m = build_estimator(model_dir)
m.fit(input_fn=lambda: input_fn(df_train), steps=train_steps)
results = m.evaluate(input_fn=lambda: input_fn(df_test), steps=1)
for key in sorted(results):
print("%s: %s" % (key, results[key]))
def training_data():
"""Return path to the training and test data"""
training_datafile = path.join(path.dirname(__file__), 'data', 'data.training')
test_datafile = path.join(path.dirname(__file__), 'data', 'data.test')
return training_datafile, test_datafile
model_folder = 'scripts/model' # Where to store the model
train_steps = 2000 # How many iterations to run while training
train_and_eval(model_folder, train_steps)
A couple of notes:
The original example tutorial this is based on is here https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py
Notice I am using the DNNClassifier, not the LinearClassifier as I want specifically to deal with continuous input variables.
A lot of examples just use 'premade' data sets which are known to work with examples; my data set has been manually generated and is absolutely not random.
I have verified the csv loader is loading the data correctly as int64 values.
Training and test data are generated identically, but have different values in them; however, using data.training as the test data still returns a 0% accuracy, so there's no question that something isn't working, this isn't just over-fitting.

First of all, you are describing a regression task, not a classification task. Therefore, both, DNNClassifier and LinearClassifier would be the wrong thing to use. That also makes accuracy the wrong quantity to use to tell if your model works or not. I suggest you read up on these two different context e.g. in the book "The Elements of Statistical Learning"
But here is a short answer to your problem. Say you have a linear model
awesomeness_predicted = slope * age
where slope is the parameter you want to learn from data. Lets say you have data age[0], ..., age[N] and the corresponding awesomeness values a_data[0],...,a_data[N]. In order to specify if your model works well, we are going to use mean squared error, that is
error = sum((a_data[i] - a_predicted[i])**2 for i in range(N))
What you want to do now is start with a random guess for slope and gradually improving using gradient descent. Here is a full working example in pure tensorflow
import tensorflow as tf
import numpy as np
DTYPE = tf.float32
## Generate Data
age = np.array([67, 38, 32, 69, 40])
awesomeness = 10 * age
## Generate model
# define the parameter of the model
slope = tf.Variable(initial_value=tf.random_normal(shape=(1,), dtype=DTYPE))
# define the data inputs to the model as variable size tensors
x = tf.placeholder(DTYPE, shape=(None,))
y_data = tf.placeholder(DTYPE, shape=(None,))
# specify the model
y_pred = slope * x
# use mean squared error as loss function
loss = tf.reduce_mean(tf.square(y_data - y_pred))
target = tf.train.AdamOptimizer().minimize(loss)
## Train Model
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for epoch in range(100000):
_, training_loss = sess.run([target, loss],
feed_dict={x: age, y_data: awesomeness})
print("Training loss: ", training_loss)
print("Found slope=", sess.run(slope))

There are a few things I would like to say.
Assuming you load the data correctly:
-This looks like a regression task and you are using a classifier. I'm not saying it doesn't work at all, but like this your are giving a label to each entry of age and training on the whole batch each epoch is very unstable.
-You are getting a huge value for the loss, your gradients are exploding. Having this toy dataset you probably need to tune hyperparameters like hidden neurons, learning rate and number of epochs. Try to log the loss value for each epoch and see if that may be the problem.
-Last suggestion, make your data work with a simpler model, possibly suited for your task, like a regression model and then scale up

See also https://github.com/tflearn/tflearn/blob/master/examples/basics/multiple_regression.py for using tflearn to solve this.
""" Multiple Regression/Multi target Regression Example
The input features have 10 dimensions, and target features are 2 dimension.
"""
from __future__ import absolute_import, division, print_function
import tflearn
import numpy as np
# Regression data- 10 training instances
#10 input features per instance.
X=np.random.rand(10,10).tolist()
#2 output features per instance
Y=np.random.rand(10,2).tolist()
# Multiple Regression graph, 10-d input layer
input_ = tflearn.input_data(shape=[None,10])
#10-d fully connected layer
r1 = tflearn.fully_connected(input_,10)
#2-d fully connected layer for output
r1 = tflearn.fully_connected(r1,2)
r1 = tflearn.regression(r1, optimizer='sgd', loss='mean_square',
metric='R2', learning_rate=0.01)
m = tflearn.DNN(r1)
m.fit(X,Y, n_epoch=100, show_metric=True, snapshot_epoch=False)
#Predict for 1 instance
testinstance=np.random.rand(1,10).tolist()
print("\nInput features: ",testinstance)
print("\n Predicted output: ")
print(m.predict(testinstance))

Related

How to make predictions on new dataset with tensorflow's gradient tape

While I'm able to understand how to use model.fit(x_train, y_train), I can't figure out how to make predictions on new data using tensorflow's gradient tape. My github repository with runnable code (up to an error) can be found here. What is currently working is that I get the trained model "network_output", however it appears that with gradient tape, argmax is being used on the model itself, where I'm used to model.fit() taking the test data as an input:
network_output = trained_network(input_images,input_number)
preds = np.argmax(network_output, axis=1)
Where "input_images" is an ndarray: (20,3,3,1) and "input_number" is an ndarray: (20,5).
Now I'm taking network_output as the trained model and would like to use it to predict similarly typed data of test_images, and test_number respectively.
The error 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'predict' here:
predicted_number = network_output.predict(test_images)
Which is because I don't know how to use the tape to make predictions. However once the prediction works I would guess I can compare the resulting "predicted_number" against the "test_number" as would usually be done using the model.fit method.
acc = 0
for i in range(len(test_images)):
if (predicted_number[i] == test_number[i]):
acc += 1
print("Accuracy: ", acc / len(input_images) * 100, "%")
In order to obtain prediction I usually iterate through batches manually like this:
predictions = []
for batch in range(num_batch):
logits = trained_network(x_test[batch * batch_size: (batch + 1) * batch_size], training=False)
# first obtain probabilities
# (if the last layer of the network has no activation, otherwise skip the softmax here)
prob = tf.nn.softmax(logits)
# putting back together predictions for all batches
predictions.extend(tf.argmax(input=prob, axis=1))
If you don't have a lot of data you can skip the loop, this is faster than using predict because you directly invoke the __call__ method of the model:
logits = trained_network(x_test, training=False)
prob = tf.nn.softmax(logits)
predictions = tf.argmax(input=prob, axis=1)
Finally you could also use predict. In this case the batches are handled automatically. It is easier to use when you have lots of data since you don't have to create a loop to interate through batches. The result is a numpy array of predictions. In can be used like this:
predictions = trained_network.predict(x_test) # you can set a batch_size if you want
What you're doing wrong is this part:
network_output = trained_network(input_images,input_number)
predicted_number = network_output.predict(test_images)
You have to call predict directly on your model trained_network.

Neural network versus random forest performance discrepancy

I want to run some experiments with neural networks using PyTorch, so I tried a simple one as a warm-up exercise, and I cannot quite make sense of the results.
The exercise attempts to predict the rating of 1000 TPTP problems from various statistics about the problems such as number of variables, maximum clause length etc. Data file https://github.com/russellw/ml/blob/master/test.csv is quite straightforward, 1000 rows, the final column is the rating, started off with some tens of input columns, with all the numbers scaled to the range 0-1, I progressively deleted features to see if the result still held, and it does, all the way down to one input column; the others are in previous versions in Git history.
I started off using separate training and test sets, but have set aside the test set for the moment, because the question about whether training performance generalizes to testing, doesn't arise until training performance has been obtained in the first place.
Simple linear regression on this data set has a mean squared error of about 0.14.
I implemented a simple feedforward neural network, code in https://github.com/russellw/ml/blob/master/test_nn.py and copied below, that after a couple hundred training epochs, also has an mean squared error of 0.14.
So I tried changing the number of hidden layers from 1 to 2 to 3, using a few different optimizers, tweaking the learning rate, switching the activation functions from relu to tanh to a mixture of both, increasing the number of epochs to 5000, increasing the number of hidden units to 1000. At this point, it should easily have had the ability to just memorize the entire data set. (At this point I'm not concerned about overfitting. I'm just trying to get the mean squared error on training data to be something other than 0.14.) Nothing made any difference. Still 0.14. I would say it must be stuck in a local optimum, but that's not supposed to happen when you've got a couple million weights; it's supposed to be practically impossible to be in a local optimum for all parameters simultaneously. And I do get slightly different sequences of numbers on each run. But it always converges to 0.14.
Now the obvious conclusion would be that 0.14 is as good as it gets for this problem, except that it stays the same even when the network has enough memory to just memorize all the data. But the clincher is that I also tried a random forest, https://github.com/russellw/ml/blob/master/test_rf.py
... and the random forest has a mean squared error of 0.01 on the original data set, degrading gracefully as features are deleted, still 0.05 on the data with just one feature.
Nowhere in the lore of machine learning is it said 'random forests vastly outperform neural nets', so I'm presumably doing something wrong, but I can't see what it is. Maybe it's something as simple as just missing a flag or something you need to set in PyTorch. I would appreciate it if someone could take a look.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
# data
df = pd.read_csv("test.csv")
print(df)
print()
# separate the output column
y_name = df.columns[-1]
y_df = df[y_name]
X_df = df.drop(y_name, axis=1)
# numpy arrays
X_ar = np.array(X_df, dtype=np.float32)
y_ar = np.array(y_df, dtype=np.float32)
# torch tensors
X_tensor = torch.from_numpy(X_ar)
y_tensor = torch.from_numpy(y_ar)
# hyperparameters
in_features = X_ar.shape[1]
hidden_size = 100
out_features = 1
epochs = 500
# model
class Net(nn.Module):
def __init__(self, hidden_size):
super(Net, self).__init__()
self.L0 = nn.Linear(in_features, hidden_size)
self.N0 = nn.ReLU()
self.L1 = nn.Linear(hidden_size, hidden_size)
self.N1 = nn.Tanh()
self.L2 = nn.Linear(hidden_size, hidden_size)
self.N2 = nn.ReLU()
self.L3 = nn.Linear(hidden_size, 1)
def forward(self, x):
x = self.L0(x)
x = self.N0(x)
x = self.L1(x)
x = self.N1(x)
x = self.L2(x)
x = self.N2(x)
x = self.L3(x)
return x
model = Net(hidden_size)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
# train
print("training")
for epoch in range(1, epochs + 1):
# forward
output = model(X_tensor)
cost = criterion(output, y_tensor)
# backward
optimizer.zero_grad()
cost.backward()
optimizer.step()
# print progress
if epoch % (epochs // 10) == 0:
print(f"{epoch:6d} {cost.item():10f}")
print()
output = model(X_tensor)
cost = criterion(output, y_tensor)
print("mean squared error:", cost.item())
can you please print the shape of your input ?
I would say check those things first:
that your target y have the shape (-1, 1) I don't know if pytorch throws an Error in this case. you can use y.reshape(-1, 1) if it isn't 2 dim
your learning rate is high. usually when using Adam the default value is good enough or try simply to lower your learning rate. 0.1 is a high value for a learning rate to start with
place the optimizer.zero_grad at the first line inside the for loop
normalize/standardize your data ( this is usually good for NNs )
remove outliers in your data (my opinion: I think this can't affect Random forest so much but it can affect NNs badly)
use cross validation (maybe skorch can help you here. It's a scikit learn wrapper for pytorch and easy to use if you know keras)
Notice that Random forest regressor or any other regressor can outperform neural nets in some cases. There is some fields where neural nets are the heros like Image Classification or NLP but you need to be aware that a simple regression algorithm can outperform them. Usually when your data is not big enough.

For a classification model in tensorflow, is there a way to impose an asymmetric cost function during the training?

I am trying to build a Neural Network in tensorflow where the cost of a Type I error (false-positive) is more costly than a Type II error (false-negative). Is there a way to impose this during the training process (i.e. inputting a cost matrix)? This is possible with simple models like Logistic Regression in scikit learn by specifying the class_weight parameter.
cw = {0: 3,1:1}
clf = LogisticRegression(class_weight = cw )
In this case, incorrectly predicting a 0 is 3x more costly than incorrectly predicting a 1. However, this cannot be performed with a Neural Network, so I want to see if it is possible in tensorflow.
Thanks
You could use tf.nn.weighted_cross_entropy_with_logits and it's pos_weight argument.
This argument weights positive class, as described by documentation (in TF2.0 at least):
A value pos_weights > 1 decreases the false negative count, hence increasing the recall.
Conversely setting pos_weights < 1 decreases the false positive count and increases the precision.
In your case, you could create custom loss function like this:
import tensorflow as tf
# Output logits from your network, not the values after sigmoid activation
class WeightedBinaryCrossEntropy:
def __init__(self, positive_weight: float):
self.positive_weight = positive_weight
def __call__(self, targets, logits, sample_weight=None):
return tf.nn.weighted_cross_entropy_with_logits(
targets, logits, pos_weight=self.positive_weight
)
And create a custom neural network with it, for example using tf.keras (samples are weighted as they were in your question:
import numpy as np
model = tf.keras.models.Sequential(
[
tf.keras.layers.Dense(32, input_shape=(10,)),
tf.keras.layers.Activation("relu"),
tf.keras.layers.Dense(10),
tf.keras.layers.Activation("relu"),
# Output one logit for binary classification
tf.keras.layers.Dense(1),
]
)
# Example random data
data = np.random.random((32, 10))
targets = np.random.randint(2, size=32)
# 3 times as costly to make type I error
model.compile(optimizer="rmsprop", loss=WeightedBinaryCrossEntropy(positive_weight=3))
model.fit(data, targets, batch_size=32)
You can use a logarithmic scale. For a 0 incorrectly predicted as 1, y - ŷ = -1, log goes to 1.71. For a 1 predicted as 0, y - ŷ = 1 log equals 0.63. For y == ŷ log equals 0. Almost the three times more costly, for a 0 incorrectly predicted as 1.
import numpy as np
from math import exp
loss=abs(1-exp(-np.log(exp(y-ŷ))))
#abs(1-exp(-np.log(exp(0))))
#Out[53]: 0.0
#abs(1-exp(-np.log(exp(-1))))
#Out[54]: 1.718281828459045
#abs(1-exp(-np.log(exp(1))))
#Out[55]: 0.6321205588285577
Then you will have a convex optimization. Implementing:
import keras.backend as K
def custom_loss(y_true,y_pred):
return K.mean(abs(1-exp(-np.log(exp(y_true-y_pred)))))
Then:
model.compile(loss=custom_loss, optimizer=sgd,metrics = ['accuracy'])

Keras variable length input for regression

Everyone!
I am trying to develop a neural network using Keras and TensorFlow, which should be able to take variable length arrays as input and give either some single value (see the toy example below) or classify them (that a problem for later and will not be touched in this question).
The idea is fairly simple.
We have variable length arrays. I am currently using very simple toy data, which is generated by the following code:
import numpy as np
import pandas as pd
from keras import models as kem
from keras import activations as kea
from keras import layers as kel
from keras import regularizers as ker
from keras import optimizers as keo
from keras import losses as kelo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
n = 100
x = pd.DataFrame(columns=['data','res'])
mms = MinMaxScaler(feature_range=(-1,1))
for i in range(n):
k = np.random.randint(20,100)
ss = np.random.randint(0,100,size=k)
idres = np.sum(ss[np.arange(0,k,2)])-np.sum(ss[np.arange(1,k,2)])
x.loc[i,'data'] = ss
x.loc[i,'res'] = idres
x.res = mms.fit_transform(x.res)
x_train,x_test,y_train, y_test = train_test_split(x.data,x.res,test_size=0.2)
x_train = sliding_window(x_train.as_matrix(),2,2)
x_test = sliding_window(x_test.as_matrix(),2,2)
To put it simple, I generate arrays with random length and the result (output) for each array is sum of even elements - sum of odd elements. Obviously, it can be negative and positive. The output then scaled to the range [-1,1] to fit with tanh activation function.
The Sequential model is generated as following:
model = kem.Sequential()
model.add(kel.LSTM(20,return_sequences=False,input_shape=(None,2),recurrent_activation='tanh'))
model.add(kel.Dense(20,activation='tanh'))
model.add(kel.Dense(10,activation='tanh'))
model.add(kel.Dense(5,activation='tanh'))
model.add(kel.Dense(1,activation='tanh'))
sgd = keo.SGD(lr=0.1)
mseloss = kelo.mean_squared_error
model.compile(optimizer=sgd,loss=mseloss,metrics=['accuracy'])
And the training of the model is doing in the following way:
def calcMSE(model,x_test,y_test):
nTest = len(x_test)
sum = 0
for i in range(nTest):
restest = model.predict(np.reshape(x_test[i],(1,-1,2)))
sum+=(restest-y_test[0,i])**2
return sum/nTest
i = 1
mse = calcMSE(model,x_test,np.reshape(y_test.values,(1,-1)))
lrPar = 0
lrSteps = 30
while mse>0.04:
print("Epoch %i" % (i))
print(mse)
for j in range(len(x_train)):
ntrain=j
model.train_on_batch(np.reshape(x_train[ntrain],(1,-1,2)),np.reshape(y_train.values[ntrain],(-1,1)))
i+=1
mse = calcMSE(model,x_test,np.reshape(y_test.values,(1,-1)))
The problem is that optimiser gets stuck usually around MSE=0.05 (on test set). Last time I tested, it actually stuck around MSE=0.12 (on test data).
Moreover, if you will look at what the model gives on test data (left column) in comparison with the correct output (right column):
[[-0.11888303]] 0.574923547401
[[-0.17038491]] -0.452599388379
[[-0.20098214]] 0.065749235474
[[-0.22307695]] -0.437308868502
[[-0.2218809]] 0.371559633028
[[-0.2218741]] 0.039755351682
[[-0.22247596]] -0.434250764526
[[-0.17094387]] -0.151376146789
[[-0.17089397]] -0.175840978593
[[-0.16988073]] 0.025993883792
[[-0.16984619]] -0.117737003058
[[-0.17087571]] -0.515290519878
[[-0.21933308]] -0.366972477064
[[-0.09379648]] -0.178899082569
[[-0.17016701]] -0.333333333333
[[-0.17022927]] -0.195718654434
[[-0.11681376]] 0.452599388379
[[-0.21438009]] 0.224770642202
[[-0.12475857]] 0.151376146789
[[-0.2225963]] -0.380733944954
And on training set the same is:
[[-0.22209576]] -0.00764525993884
[[-0.17096499]] -0.247706422018
[[-0.22228305]] 0.276758409786
[[-0.16986915]] 0.340978593272
[[-0.16994311]] -0.233944954128
[[-0.22131597]] -0.345565749235
[[-0.17088912]] -0.145259938838
[[-0.22250554]] -0.792048929664
[[-0.17097935]] 0.119266055046
[[-0.17087702]] -0.2874617737
[[-0.1167363]] -0.0045871559633
[[-0.08695849]] 0.159021406728
[[-0.17082921]] 0.374617737003
[[-0.15422876]] -0.110091743119
[[-0.22185338]] -0.7125382263
[[-0.17069265]] -0.678899082569
[[-0.16963181]] -0.00611620795107
[[-0.17089556]] -0.249235474006
[[-0.17073657]] -0.414373088685
[[-0.17089497]] -0.351681957187
[[-0.17138508]] -0.0917431192661
[[-0.22351067]] 0.11620795107
[[-0.17079701]] -0.0795107033639
[[-0.22246087]] 0.22629969419
[[-0.17044055]] 1.0
[[-0.17090379]] -0.0902140672783
[[-0.23420531]] -0.0366972477064
[[-0.2155242]] 0.0366972477064
[[-0.22192241]] -0.675840978593
[[-0.22220723]] -0.354740061162
[[-0.1671907]] -0.10244648318
[[-0.22705412]] 0.0443425076453
[[-0.22943887]] -0.249235474006
[[-0.21681401]] 0.065749235474
[[-0.12495813]] 0.466360856269
[[-0.17085686]] 0.316513761468
[[-0.17092516]] 0.0275229357798
[[-0.17277785]] -0.325688073394
[[-0.22193027]] 0.139143730887
[[-0.17088208]] 0.422018348624
[[-0.17093034]] -0.0886850152905
[[-0.17091317]] -0.464831804281
[[-0.22241674]] -0.707951070336
[[-0.1735626]] -0.337920489297
[[-0.16984227]] 0.00764525993884
[[-0.16756304]] 0.515290519878
[[-0.22193302]] -0.414373088685
[[-0.22419722]] -0.351681957187
[[-0.11561158]] 0.17125382263
[[-0.16640976]] -0.321100917431
[[-0.21557514]] -0.313455657492
[[-0.22241823]] -0.117737003058
[[-0.22165506]] -0.646788990826
[[-0.22238114]] -0.261467889908
[[-0.1709189]] 0.0902140672783
[[-0.17698884]] -0.626911314985
[[-0.16984172]] 0.587155963303
[[-0.22226149]] -0.590214067278
[[-0.16950315]] -0.469418960245
[[-0.22180589]] -0.133027522936
[[-0.2224243]] -1.0
[[-0.22236891]] 0.152905198777
[[-0.17089345]] 0.435779816514
[[-0.17422611]] -0.233944954128
[[-0.17177556]] -0.324159021407
[[-0.21572633]] -0.347094801223
[[-0.21509495]] -0.646788990826
[[-0.17086846]] -0.34250764526
[[-0.17595944]] -0.496941896024
[[-0.16803505]] -0.382262996942
[[-0.16983894]] -0.348623853211
[[-0.17078683]] 0.363914373089
[[-0.21560851]] -0.186544342508
[[-0.22416025]] -0.374617737003
[[-0.1723443]] -0.186544342508
[[-0.16319042]] -0.0122324159021
[[-0.18837349]] -0.181957186544
[[-0.17371364]] -0.539755351682
[[-0.22232121]] -0.529051987768
[[-0.22187822]] -0.149847094801
As you can see, model output is actually all quite close to each other unlike the training set, where variability is much bigger (although, I should admit, that negative values are dominants in both training and test set.
What am I doing wrong here? Why training gets stuck or is it normal process and I should leave it for much longer (I was doing several hudreds epochs couple of times and still stay stuck). I also tried to use variable learning rate (used, for example, cosine annealing with restarts (as in I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with restarts.
arXiv preprint arXiv:1608.03983, 2016.)
I would appreciate any suggestions both from network structure and training approach and from coding/detailed sides.
Thank you very much in advance for help.

low training accuracy of a neural network with adult income dataset

I built a neural network with tensorflow. It is a simple 3 layer neural network with the last layer being softmax.
I tried it on standard adult income dataset (e.g. https://archive.ics.uci.edu/ml/datasets/adult) since it is publicly available, has a good amount of data (roughly 50k examples) and also provides separate test data.
As there are some categorical attributes, I converted them into one hot encodings. For neural network I used Xavier initialization and Adam Optimizer. As there are only two output classes (>50k and <=50k) the last softmax layer had only two neurons. After one hot encoding expansion, the 14 attributes / columns expanded into 108 columns.
I experimented with different number of neurons in the first two hidden layers (from 5 to 25). I also experimented with number of iterations (from 1000 to 20000).
The training accuracy wasn't affected much by the number of neurons. It went up a little with more number of iterations. However I could not do any better than 82% :(
Am I missing something basic in my approach? Has anyone tried this (neural network with this dataset)? If so what are the expected results? Could the low accuracy be due to missing values? (I am planning to try filtering out all the missing values if there aren't much in the dataset).
Any other ideas? Here is my tensorflow neural network code in case there are any bugs in it etc.
def create_placeholders(n_x, n_y):
X = tf.placeholder(tf.float32, [n_x, None], name = "X")
Y = tf.placeholder(tf.float32, [n_y, None], name = "Y")
return X, Y
def initialize_parameters(num_features):
tf.set_random_seed(1) # so that your "random" numbers match ours
layer_one_neurons = 5
layer_two_neurons = 5
layer_three_neurons = 2
W1 = tf.get_variable("W1", [layer_one_neurons,num_features], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b1 = tf.get_variable("b1", [layer_one_neurons,1], initializer = tf.zeros_initializer())
W2 = tf.get_variable("W2", [layer_two_neurons,layer_one_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b2 = tf.get_variable("b2", [layer_two_neurons,1], initializer = tf.zeros_initializer())
W3 = tf.get_variable("W3", [layer_three_neurons,layer_two_neurons], initializer = tf.contrib.layers.xavier_initializer(seed = 1))
b3 = tf.get_variable("b3", [layer_three_neurons,1], initializer = tf.zeros_initializer())
parameters = {"W1": W1,
"b1": b1,
"W2": W2,
"b2": b2,
"W3": W3,
"b3": b3}
return parameters
def forward_propagation(X, parameters):
"""
Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
Arguments:
X -- input dataset placeholder, of shape (input size, number of examples)
parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
the shapes are given in initialize_parameters
Returns:
Z3 -- the output of the last LINEAR unit
"""
# Retrieve the parameters from the dictionary "parameters"
W1 = parameters['W1']
b1 = parameters['b1']
W2 = parameters['W2']
b2 = parameters['b2']
W3 = parameters['W3']
b3 = parameters['b3']
Z1 = tf.add(tf.matmul(W1, X), b1)
A1 = tf.nn.relu(Z1)
Z2 = tf.add(tf.matmul(W2, A1), b2)
A2 = tf.nn.relu(Z2)
Z3 = tf.add(tf.matmul(W3, A2), b3)
return Z3
def compute_cost(Z3, Y):
"""
Computes the cost
Arguments:
Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
Y -- "true" labels vector placeholder, same shape as Z3
Returns:
cost - Tensor of the cost function
"""
# to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
logits = tf.transpose(Z3)
labels = tf.transpose(Y)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = labels))
return cost
def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001, num_epochs = 1000, print_cost = True):
"""
Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
Arguments:
X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
Y_train -- test set, of shape (output size = 6, number of training examples = 1080)
X_test -- training set, of shape (input size = 12288, number of training examples = 120)
Y_test -- test set, of shape (output size = 6, number of test examples = 120)
learning_rate -- learning rate of the optimization
num_epochs -- number of epochs of the optimization loop
print_cost -- True to print the cost every 100 epochs
Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""
ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables
tf.set_random_seed(1) # to keep consistent results
seed = 3 # to keep consistent results
(n_x, m) = X_train.shape # (n_x: input size, m : number of examples in the train set)
n_y = Y_train.shape[0] # n_y : output size
costs = [] # To keep track of the cost
# Create Placeholders of shape (n_x, n_y)
X, Y = create_placeholders(n_x, n_y)
# Initialize parameters
parameters = initialize_parameters(X_train.shape[0])
# Forward propagation: Build the forward propagation in the tensorflow graph
Z3 = forward_propagation(X, parameters)
# Cost function: Add cost function to tensorflow graph
cost = compute_cost(Z3, Y)
# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
# Initialize all the variables
init = tf.global_variables_initializer()
# Start the session to compute the tensorflow graph
with tf.Session() as sess:
# Run the initialization
sess.run(init)
# Do the training loop
for epoch in range(num_epochs):
_ , epoch_cost = sess.run([optimizer, cost], feed_dict={X: X_train, Y: Y_train})
# Print the cost every epoch
if print_cost == True and epoch % 100 == 0:
print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
if print_cost == True and epoch % 5 == 0:
costs.append(epoch_cost)
# plot the cost
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('iterations (per tens)')
plt.title("Learning rate =" + str(learning_rate))
plt.show()
# lets save the parameters in a variable
parameters = sess.run(parameters)
print ("Parameters have been trained!")
# Calculate the correct predictions
correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))
# Calculate accuracy on the test set
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
#print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
return parameters
import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.framework import ops
import pandas as pd
%matplotlib inline
np.random.seed(1)
df = pd.read_csv('adult.data', header = None)
X_train_orig = df.drop(df.columns[[14]], axis=1, inplace=False)
Y_train_orig = df[[14]]
X_train = pd.get_dummies(X_train_orig) # get one hot encoding
Y_train = pd.get_dummies(Y_train_orig) # get one hot encoding
parameters = model(X_train.T, Y_train.T, None, None, num_epochs = 10000)
Any suggestions for other publicly available dataset for trying this out?
I tried standard algorithms on this dataset from scikit learn with default parameters and I got following accuracies:
Random Forest: 86
SVM: 96
kNN: 83
MLP: 79
I have uploaded my iPython notebook for this at: https://github.com/sameermahajan/ClassifiersWithIncomeData/blob/master/Scikit%2BLearn%2BClassifiers.ipynb
The best accuracy is with SVM which can be expected from some explanation that can be seen from: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html Interestingly SVM also took a lot of time to run, way more than any other method.
This may not be a good problem to be solved by neural network looking at MLPClassifier accuracy above. My neural network wasn't that bad after all! Thanks for all the responses and your interest in this.
I didn't experiment on this dataset but after looking at some papers and doing some researches, it looks like your network is doing ok.
First is your accuracy calculed from the training set or the test set ? Having both will give you a good hint of how your network is performing.
I'm still a bit new to machine learning but I can maybe give some help :
By looking at the data documentation link here : https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
And this paper : https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a120.pdf
From those links 85% accuracy on training and test set looks like a good score, you are not too far.
Do you have some kind of cross-validation to look for overfitting of your network ?
I don't have your code so can't help you if this is a bug or a programming related issue, maybe sharing your code might be a good idea.
I think you would gain more accuracy by pre-processing your data a bit :
There are a lot of unknowns inside your data and neural networks are very sensitive to mislabeling and bad data.
You should try to find and replace or remove the unknowns.
You could also try to identify the most useful features and drop the ones that are near useless.
Feature scaling / data normalization can also be quite important for neural networks, i didn't look much into the data but maybe you can try to figure out how to scale your data between [0, 1] if its not done already.
The document I linked you seems to see an upgrade in performance by adding layers up to 5 layers, did you try adding more layers ?
You can also add dropout if you network overfits, if you didn't already.
I would maybe try other networks that are generally good for those tasks like SVM (Support Vector Machine) or Logistic Regression or even Random Forest but not sure by looking at the result that those will perform better than the artificial neural network.
I would also take a look at those links : https://www.kaggle.com/wenruliu/adult-income-dataset/feed
https://www.kaggle.com/wenruliu/income-prediction
In this link there are some people trying algorithms and giving tips to process the data and tackle this subject.
Hope it helped
Good luck,
Marc.
I think you are focusing too much in your network structure and you are forgetting that your results also depend largely on the data quality. I have tried a quick out-of-the-shelf random forest and it gave me similar results as you got (acc = 0.8275238).
I suggest you do some feature engineering (the kaggle link provided by #Marc has some nice examples). Decide an strategy for your NA's (look here), group values when you have many factor levels in categorical variables (e.g. countries grouped into continents) or discretise continuous variables (age variable into levels as in old, mid_aged, young).
Play with your data, study your dataset and try to apply expertise to remove redundant or too narrow info. Once this is done, then start tweaking your model. Additionally, you can consider doing as I did: use ensemble models (which are usually fast and pretty accurate with the default values) like RF or XGB to check if the results are consistent between all your models. Once you are sure you are in the right track, you can start tweaking structure, layers, etc. and see if you can push your results ever further.
Hope this helps.
Good luck!

Categories

Resources