I have a dataset with 40 feature values for each item.
When I try to build a neural network using TensorFlow (I am new to TensorFlow), this line of code raises an error:
for _ in range(n_batches):
    batches = tf.train.batch(input_list, batch_size=batch_size, enqueue_many=True, capacity=3)
Error:
ValueError: Dimensions 1 and 40 are not compatible
Edit:
input_list is built by reading the CSV file, which contains 40 feature values per item:
with open('0.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for row in spamreader:
        input_list.append(row)
        label.append([0, 1])
As a sample, the csv file looks like this:
-303.49402956575165,122.79163957162679,-11.468865795473208,3.9811327671171512,15.337415109639783,-14.108867251441396,-6.2515380667284548,-2.5776250066394879,-11.151238822575237,0.80064220417869125,0.27982062264574636,-7.4540404067320498,1.1294160621043878,-0.19718258671031155,-4.8998361975682672,1.4096255573297396,-1.7156108756597495,-4.8841327368201828,1.1763404024624535,-1.4828051938843485,-4.4185680752011773,-0.23408318097388262,-1.4313336679963722,-2.350002729608391,-0.012519210688780043,-1.1211450861767087,-0.28546877503470003,1.0960108052925126,0.017672759764874924,2.2723680886768522,2.9337076178657373,0.80627947017775015,2.7373411634266391,2.6117883927402459,-0.17306332146016015,1.3495579191555498,1.2670127235105628,-1.2019882636572772,-0.19807357792275704,-0.11667725657652298
-324.95895982872742,129.16902362793437,-8.5782578426411202,4.8909691557978645,9.679460715739177,-21.516263281123845,-8.0416220828154454,-3.8078900557142812,-13.927945903788769,0.43952636061160394,-0.69085228125901854,-9.051802560349115,3.2384649283450346,0.51938767915475448,-7.6485369361057103,2.1827631457774346,-0.2975737631792475,-8.3214260824025619,-0.52396570369004059,1.1065859071072002,-2.3500738274356539,4.2447931265345034,7.879566058882304,6.0660506076869103,6.012983546020755,4.4649038449901903,2.1070954443797651,-0.26862183717859767,-1.495853591930427,0.52413729310912371,-0.85169785225746941,-3.675692955742599,-1.2819459279342635,-1.3243977571633176,-3.4214154132554886,-1.025319003875736,-1.5668864629221912,-4.3026170282653107,-1.9139061426068631,-0.64944140674848683
You should convert the input list to a single array first. You can supply a list of tensors/arrays to tf.train.batch, but then every tensor will be split into batches of size 40. Currently you are supplying a list of tensors that have batch size 1, and you are asking to create batches of size 40 from each of these tensors. As you cannot create 40 examples from 1 example, you get a dimension mismatch. So instead do something like:
import numpy as np
input_list = np.array(input_list)
labels = np.array(labels)
batch = tf.train.batch([input_list, labels], batch_size=batch_size, enqueue_many=True, capacity=3)
inputs = batch[0]
labels = batch[1]
You can then use inputs and labels to define your network and loss function. For instance (just an illustration, code not tested):
hidden = tf.layers.dense(inputs, 2)
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=hidden)
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss)
Also note that you need to call tf.train.batch only once, not in a loop. Every time an operation that requires inputs or labels is run, inputs and labels will evaluate to a different batch.
with tf.Session() as sess:
    with tf.contrib.slim.queues.QueueRunners(sess):
        for i in range(n_batches):
            _, loss_value = sess.run([train_op, loss])
            print("Loss for batch {}: {}".format(i, loss_value))
You can also pass input_list as a single element of the tensor list given to tf.train.batch:
for _ in range(n_batches):
    batches = tf.train.batch([input_list], batch_size=batch_size,
                             enqueue_many=True, capacity=3)
Related
I am building a multi-input network using the Keras functional API, but I struggle to find and understand the right format for my input data through the network.
I have two main inputs:
One is an image, which goes through a fine-tuned ResNet50 CNN.
The second is a simple numpy array (X_train) containing metadata about the image (position and size of the image). This one goes through a simple dense network.
I load the images from a dataframe containing the metadata and the filepath to the corresponding image.
I use ImageDataGenerator and the flow_from_dataframe method to load my images:
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_flow = datagen.flow_from_dataframe(
    dataframe=df_train,
    x_col="cropped_img_filepath",
    y_col="category",
    batch_size=batch_size,
    shuffle=False,
    class_mode="categorical",
    target_size=(224, 224)
)
I can train the two networks separately using their own data; no problems so far.
The outputs of the two distinct networks are then combined by a dense network that outputs a 10-digit probability vector:
# Create the input for the final dense network using the output of both the dense MLP and CNN
combinedInput = concatenate([cnn.output, mlp.output])
x = Dense(512, activation="relu")(combinedInput)
x = Dense(256, activation="relu")(x)
x = Dense(128, activation="relu")(x)
x = Dense(32, activation="relu")(x)
x = Dense(10, activation="softmax")(x)
model = Model(inputs=[cnn.input, mlp.input], outputs=x)
# Compile the model
opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss="categorical_crossentropy",
              metrics=['accuracy'],
              optimizer=opt)
# Train the model
model_history = model.fit(x=(train_flow, X_train),
                          y=y_train,
                          epochs=1,
                          batch_size=batch_size)
However, I cannot train the overall network; I get the following error:
ValueError: Failed to find data adapter that can handle input: (<class 'tuple'> containing values of types {"<class 'keras_preprocessing.image.dataframe_iterator.DataFrameIterator'>", "<class 'numpy.ndarray'>"}), <class 'pandas.core.series.Series'>
I understand I am not using the correct input format for my input data.
I can train my CNN with the train_flow, and my dense network with X_train, so I was hoping this would work.
Do you have any idea how to combine image data and a numpy array into a multi-input array?
Thank you for all the information you can give me!
I finally found out how to do it, taking inspiration from the post @Nima Aghli proposed.
Here is how I did it:
First, instantiate the preprocessing function (for me, the one used for ResNet50):
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
def preprocess_function(x):
    if x.ndim == 3:
        x = x[np.newaxis, :, :, :]
    return preprocess_input(x)

# Initializing the datagen, using the above function:
datagen = ImageDataGenerator(preprocessing_function=preprocess_function)
Then define the custom data generator that will yield randomly sampled arrays coupling image & metadata, while making sure never to run out of data (so that you can run for whatever number of epochs):
def createGenerator(dff, verif=False, batch_size=BATCH_SIZE):
    # Shuffles the dataframe, and so the batches as well
    dff = dff.sample(frac=1)
    # shuffle=False is EXTREMELY important to keep order of image and coord
    flow = datagen.flow_from_dataframe(
        dataframe=dff,
        directory=None,
        x_col="cropped_img_filepath",
        y_col="category",
        batch_size=batch_size,
        shuffle=False,
        class_mode="categorical",
        target_size=(224, 224),
        seed=42
    )
    idx = 0
    n = len(dff) - batch_size
    batch = 0
    while True:
        # Get next batch of images
        X1 = flow.next()
        # idx to reach
        end = idx + X1[0].shape[0]
        # Get next batch of rows from the dataframe
        X2 = dff[["x", "y", "w", "h"]][idx:end].to_numpy()
        dff_verif = dff[idx:end]
        # Update idx for the next batch
        idx = end
        # print("batch nb : ", batch, ", batch_size : ", X1[0].shape[0])
        batch += 1
        # Check if we are at the end of the dataframe
        if idx == len(dff):
            # print("END OF THE DATAFRAME\n")
            idx = 0
        # Yield the image, metadata & target batches
        if verif == True:
            yield [X1[0], X2], X1[1], dff_verif
        else:
            yield [X1[0], X2], X1[1]  # Yield both images, metadata and their mutual label
I deliberately kept the comments, as they help grasp all the operations that are performed.
The main point/problem is to get images from the whole dataframe, without ever running short of images, and with batches of constant size.
Also, we have to be careful with the order of the images/metadata, so that the right info is connected to the right image in the returned arrays.
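For completeness, here is roughly how the generator plugs into training (a sketch only, not tested; recent TF/Keras versions accept a Python generator directly in model.fit, older ones need model.fit_generator, and steps_per_epoch is assumed to be len(df_train) // BATCH_SIZE so that roughly one pass over df_train is done per epoch):
# Sketch: feed the combined [image, metadata] batches to the two-input model defined above
train_gen = createGenerator(df_train, verif=False, batch_size=BATCH_SIZE)

model_history = model.fit(
    train_gen,
    steps_per_epoch=len(df_train) // BATCH_SIZE,  # roughly one pass over df_train per epoch
    epochs=10
)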
I am trying to feed my model with batches of data incrementally, as the dataset is very large. Building on the tutorial here, I have written a data generator as below:
split_index = round(len(train) * 0.9)
shuffled_train = train.sample(frac=1)
df_train = shuffled_train.iloc[:split_index]
df_val = shuffled_train.iloc[split_index:]
# Convert validation set to fixed array
x_val = df_to_data(df_val)
y_val = df_val[classes].values
def data_generator(df, batch_size, gensim_embedding_model):
    """
    Given a raw dataframe, generates infinite batches of FastText vectors.
    """
    batch_i = 0      # Counter inside the current batch vector
    batch_x = None   # The current batch's x data
    batch_y = None   # The current batch's y data

    while True:  # Loop forever
        df = df.sample(frac=1)  # Shuffle df each epoch
        for i, row in df.iterrows():
            comment = row['comment_text']
            if batch_x is None:
                batch_x = np.zeros((batch_size, window_length, n_features), dtype='float32')
                batch_y = np.zeros((batch_size, len(classes)), dtype='float32')
            batch_x[batch_i] = text_to_vector(comment, gensim_embedding_model)  # LINE A
            batch_y[batch_i] = row[classes].values
            batch_i += 1
            if batch_i == batch_size:
                # Ready to yield the batch
                yield batch_x, batch_y
                batch_x = None
                batch_y = None
                batch_i = 0
Where 'LINE A' calls a function that looks up words in 'comment' in a pre-trained embedding model (gensim_embedding_model) and populates the vector.
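For context, text_to_vector looks roughly like this (a simplified sketch, not the exact implementation; it assumes gensim_embedding_model supports dictionary-style word lookup, e.g. a KeyedVectors instance):
def text_to_vector(text, gensim_embedding_model):
    # Build a (window_length, n_features) matrix of word vectors,
    # zero-padded/truncated to window_length tokens
    vec = np.zeros((window_length, n_features), dtype='float32')
    for i, word in enumerate(text.split()[:window_length]):
        if word in gensim_embedding_model:
            vec[i] = gensim_embedding_model[word]
    return vec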
I understand that this creates an embedding representation of the batch, which is fed into the model to train incrementally, and that it replaces the Keras Embedding layer, which tries to fit the embedding representation of the entire dataset into memory.
However, the Keras Embedding layer has a parameter 'trainable', which controls whether the parameters (i.e., the word embedding weights in this case) should be treated 'as-is' or only as initial values to be tuned. Using a data generator, I do not see how it is possible to set this parameter (or an equivalent).
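For reference, the kind of layer I mean looks roughly like this (a sketch only; vocab_size, embedding_dim and embedding_matrix are placeholder names for the vocabulary size, vector dimension and pre-trained weight matrix):
from tensorflow.keras.layers import Embedding

# Embedding layer initialized from pre-trained weights.
# trainable=False: weights are used as-is; trainable=True: they are
# only the starting point and get tuned during training.
embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights=[embedding_matrix],
                            trainable=False)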
Am I right that if a data generator is used in this way, then the embedding weights are only used as-is and cannot be further tuned? I.e., this is equivalent to 'trainable=False', and with this implementation it is not possible to have 'trainable=True'?
Thanks
Summary: according to the documentation, Keras model.fit() should accept a tf.data.Dataset as input (I am using TF version 1.12.0). I can train my model if I do the training steps manually, but using model.fit() on the same model I get an error I cannot resolve.
Here is a sketch of what I did: my dataset, which is too big to fit in memory, consists of many files, each with a different number of rows of (100 features, label). I'd like to use tf.data to build my data pipeline:
def data_loader(filename):
    '''load a single data file with many rows'''
    features, labels = load_hdf5(filename)
    ...
    return features, labels

def make_dataset(filenames, batch_size):
    '''read files one by one, pick individual rows, batch them and repeat'''
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    dataset = dataset.map(  # Problem here! See edit for solution
        lambda filename: tuple(tf.py_func(data_loader, [filename], [tf.float32, tf.float32])))
    dataset = dataset.flat_map(
        lambda features, labels: tf.data.Dataset.from_tensor_slices((features, labels)))
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(1000)
    return dataset
_BATCH_SIZE = 128
training_set = make_dataset(training_files, batch_size=_BATCH_SIZE)
I'd like to try a very basic logistic regression model:
inputs = tf.keras.layers.Input(shape=(100,))
outputs = tf.keras.layers.Dense(1, activation='softmax')(inputs)
model = tf.keras.Model(inputs, outputs)
If I train it manually everything works fine, e.g.:
labels = tf.placeholder(tf.float32)
loss = tf.reduce_mean(tf.keras.backend.categorical_crossentropy(labels, outputs))
train_step = tf.train.GradientDescentOptimizer(.05).minimize(loss)
iterator = training_set.make_one_shot_iterator()
next_element = iterator.get_next()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(training_size // _BATCH_SIZE):
        x, y = sess.run(next_element)
        train_step.run(feed_dict={inputs: x, labels: y})
However, if I instead try to use model.fit like this:
model.compile('adam', 'categorical_crossentropy', metrics=['acc'])
model.fit(training_set.make_one_shot_iterator(),
          steps_per_epoch=training_size // _BATCH_SIZE,
          epochs=1,
          verbose=1)
I get the error ValueError: Cannot take the length of Shape with unknown rank. inside Keras's _standardize_user_data function.
I have tried quite a few things but could not resolve the issue. Any ideas?
Edit: based on @kvish's answer, the solution was to change the map from a lambda to a function that specifies the correct tensor dimensions, e.g.:
def data_loader(filename):
    def loader_impl(filename):
        features, labels, _ = load_hdf5(filename)
        ...
        return features, labels

    features, labels = tf.py_func(loader_impl, [filename], [tf.float32, tf.float32])
    features.set_shape((None, 100))
    labels.set_shape((None, 1))
    return features, labels
and now, all that is needed is to call this function from map:
dataset = dataset.map(data_loader)
tf.py_func probably produces an unknown shape which Keras cannot infer. We can set the shape of the tensor it returns using the set_shape(your_shape) method, and that helps Keras infer the shape of the result.
I'm trying to build a fully connected layer using the CIFAR-100 dataset with softmax, and to print the accuracy, the learning curve, and some of the end results: the pictures with their true and predicted labels.
I have the following code for the MNIST dataset. The problem I'm facing is how to apply the same thing to my dataset; I'll try to explain my problem below:
# initialization
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
init = tf.global_variables_initializer()

# model
Y = tf.nn.softmax(tf.matmul(tf.reshape(X, [-1, 784]), w) + b)

# placeholder for correct answers
Y_ = tf.placeholder(tf.float32, [None, 10])

# loss function
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))

# % of correct answers found in batch
is_correct = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

# training step
optimizer = tf.train.GradientDescentOptimizer(0.003)
train_step = optimizer.minimize(cross_entropy)

sess = tf.Session()
sess.run(init)

for i in range(10000):
    # load batch of images and correct answers
    batch_x, batch_y = mnist.train.next_batch(100)
    train_data = {X: batch_x, Y_: batch_y}

    # train
    sess.run(train_step, feed_dict=train_data)
    a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)

    # success on test data
    test_data = {X: mnist.test.images, Y_: mnist.test.labels}
    a, c = sess.run([accuracy, cross_entropy], feed_dict=test_data)
I have downloaded CIFAR-100 dataset. The CIFAR-100 dataset consists of 60000 32x32 color images. It has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
I only used 2 superclasses, "aquatic mammals" and "flowers", each with 5 subcategories.
here is some of the code:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
# loading train data
data = unpickle('train')
train_data, label_train_data = filter_train(data, 5000)
label_train_data = relabel(label_train_data)
# loading test data
data2 = unpickle('test')
test_data, label_test_data = filter_train(data2, 1000)
label_test_data = relabel(label_test_data)
filter_train is just a function I used to filter the 2 superclasses "aquatic mammals" and "flowers".
I know that mnist.train.next_batch(batch_size=100) means it randomly picks 100 examples from the MNIST dataset.
So my question is: how can I replace
batch_x, batch_y = mnist.train.next_batch(100)
and:
test_data = {X: mnist.test.images, Y_: mnist.test.labels}
so that I can access the train data and test data of my CIFAR dataset?
I have been trying to replace those lines with train_data, label_train_data and test_data, label_test_data, but it won't work, and I can't find any other way to get to those sets.
Any help would be appreciated.
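One way to do it (a rough sketch, untested; next_batch is a helper name used here for illustration, and it assumes label_train_data / label_test_data are already one-hot encoded to match the Y_ placeholder) is to sample mini-batches directly from the numpy arrays and feed the test arrays whole. The placeholder shapes from the MNIST snippet would also need to match the CIFAR dimensions (32x32x3 images, 10 classes after filtering):
import numpy as np

def next_batch(data, labels, batch_size):
    # Randomly pick batch_size rows (mimics mnist.train.next_batch)
    idx = np.random.choice(len(data), batch_size, replace=False)
    return data[idx], labels[idx]

# Training: one random mini-batch per iteration
batch_x, batch_y = next_batch(train_data, label_train_data, 100)
train_dict = {X: batch_x, Y_: batch_y}

# Testing: feed the whole filtered test split at once
test_dict = {X: test_data, Y_: label_test_data}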
I am trying to properly read my own binary data into TensorFlow, based on the Fixed length records section of this tutorial and by looking at the read_cifar10 function here. Mind you, I am new to TensorFlow, so my understanding may be off.
My Data
My files are binary with float32 type. The first 32-bit sample is the label, and the remaining 256 samples are the data. I want to reshape the data at the end into a [2, 128] matrix.
My Code So far:
import tensorflow as tf
import os
def read_data(filename_queue):
    item_type = tf.float32
    label_items = 1
    data_items = 256
    label_bytes = label_items * item_type.size
    data_bytes = data_items * item_type.size
    record_bytes = label_bytes + data_bytes

    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    key, value = reader.read(filename_queue)
    record_data = tf.decode_raw(value, item_type)

    # labels = tf.cast(tf.strided_slice(record_data, [0], [label_items]), tf.int32)
    label = tf.strided_slice(record_data, [0], [label_items])
    data0 = tf.strided_slice(record_data, [label_items], [label_items + data_items])
    data = tf.reshape(data0, [2, data_items // 2])  # integer division so the shape is an int
    return data, label
if __name__ == '__main__':
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set GPU device

    datafiles = ['train_0000.dat', 'train_0001.dat']
    num_epochs = 2
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)

    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        (x, y) = read_data(filename_queue)
        print(y.eval())
This code hangs at print(y.eval()), but I fear I have much bigger issues than that.
Question:
When I execute this, I get a data tensor and a label tensor returned. The problem is that I don't quite understand how to actually read the data from the tensors. For example, I understand the autoencoder example here; however, it has a mnist.train.next_batch(batch_size) function that is called to read the next batch. Do I need to write that for my function, or is it handled by something internal to my read_data() function? If I need to write that function, what does it look like?
Are there any other obvious things I'm missing? My goal in using this method is to reduce I/O overhead and to avoid storing all of the data in memory, since my files are quite large.
Thanks in advance.
Yes. You are pretty much done. At this point you need to:
1) Write your neural network model, which is supposed to take your data and return a label.
2) Write your cost function C, which takes the network prediction and the true label and gives you a cost.
3) Choose an optimizer.
4) Put everything together:
opt = tf.train.AdamOptimizer(learning_rate=0.001)

datafiles = ['train_0000.dat', 'train_0001.dat']
num_epochs = 2

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    example_batch, label_batch = tf.train.shuffle_batch(
        [data, label], batch_size=128,
        capacity=1000, min_after_dequeue=500)  # capacity values here are illustrative
    y_pred = model(example_batch)
    loss = C(label_batch, y_pred)
After which you create the training op and run it repeatedly to minimize the loss:
train_op = opt.minimize(loss)
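A rough sketch of that loop, continuing inside the same with tf.Session() block (untested; the queue runners must be started or the read ops will block, and the initializers are re-run so the optimizer's variables and the local variables used by num_epochs are initialized):
    # ...still inside the `with tf.Session() as sess:` block from above
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for step in range(1000):  # the number of steps here is arbitrary
            _, loss_value = sess.run([train_op, loss])
            if step % 100 == 0:
                print("step {}: loss = {}".format(step, loss_value))
    finally:
        coord.request_stop()
        coord.join(threads)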
See also tf.train.string_input_producer behavior in a loop for related information.