TensorFlow Data API: repeat() - python

The following code is a snippet from "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow".
I understand everything in the code except the chained .repeat(repeat) call on the second line.
I know that repeat() repeats the dataset elements (in this case, the file paths), and that if the argument is None or omitted the repetition continues indefinitely, leaving it to the function that consumes the dataset to decide when to stop.
As you can see in the code below, the author sets the repeat() argument to None:
1 - Basically, I want to know why the author decided to do this.
2 - Or is it because the code is trying to simulate a situation in which the dataset does not fit in memory? If that is the case, should we avoid repeat() in a real situation? Am I correct?
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths, seed=42).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.InputLayer(input_shape=X_train.shape[-1:]),
    keras.layers.Dense(30, activation='relu'),
    keras.layers.Dense(1)
])

m_loss = keras.losses.mean_squared_error
m_optimizer = keras.optimizers.SGD(lr=1e-3)
batch_size = 32

model.compile(loss=m_loss, optimizer=m_optimizer, metrics=['accuracy'])
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10,
          validation_data=valid_set)

For your questions, I think:
The tf.data API won't easily lead to out-of-memory problems, because it loads the data lazily from the file paths or TFRecords (possibly compressed). So repeat() has nothing to do with memory here; it is used for data transformation.
You have to use repeat(#) when setting steps_per_epoch to #. Say your batch_size = 32 and steps_per_epoch = 100 // 32 = 3, which requires 3 * 32 = 96 samples per epoch, but your data has only 80 samples. Then you have to use data.repeat(2) to have 160 samples in total, so that the 80 samples of the first repetition and the first 16 samples of the second repetition are used within one epoch. This prevents the "input ran out of data" error.
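A minimal sketch of that relationship, using hypothetical numbers (80 samples, batch size 32) rather than the book's own pipeline:

import tensorflow as tf

# 80 samples, batch_size 32 -> only 3 batches exist without repeat();
# a second epoch would hit "Your input ran out of data".
finite_set = tf.data.Dataset.range(80).batch(32)

# Repeating indefinitely and letting steps_per_epoch bound each epoch
# is the usual pattern when passing the dataset to model.fit:
train_set = tf.data.Dataset.range(80).repeat().batch(32)

# model.fit(train_set, steps_per_epoch=80 // 32, epochs=10)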

I posted another copy of the same question on the book author's GitHub repo.
The issue is clarified; it was due to a bug in Keras 2.0.
Read more at: https://github.com/ageron/handson-ml2/issues/407

Related

Custom layer with different activation function for each output

I have a simple neural network with two outputs, and for each of them I need to use a different activation function. I basically do what is written in this article - here, but it looks like my layer with different activation functions is not working:
See my code below:
X = filled_df.loc[:, "SOUTEZ_MEAN_HOME":"TOTAL_POINTS_AWAY"].values
y = filled_df.loc[:, "HOME_YELLOW_CARDS"].values
X = X.astype("float32")
y = y.astype("float32")

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

def negative_binomial_layer(x):
    # Get the number of dimensions of the input
    num_dims = len(x.get_shape())
    # Separate the parameters
    n, p = tf.unstack(x, num=2, axis=-1)
    # Add one dimension to make the right shape
    n = tf.expand_dims(n, -1)
    p = tf.expand_dims(p, -1)
    # Apply a softplus to make positive
    n = tf.cast(n, tf.float32)
    p = tf.cast(p, tf.float32)
    n = tf.keras.activations.softplus(n)
    # Apply a sigmoid activation to bound between 0 and 1
    p = tf.keras.activations.sigmoid(p)
    # Join back together again
    out_tensor = tf.concat((n, p), axis=num_dims - 1)
    return out_tensor

input_shape = (212, )
# Define inputs with predefined shape
inputs = Input(shape=input_shape)
# Build network with some predefined architecture
Layer1 = Dense(16)
Layer2 = Dense(8)
output1 = Layer1(inputs)
output2 = Layer2(output1)
# Predict the parameters of a negative binomial distribution
outputs = Dense(2)(output2)
#outputs = tf.cast(outputs, tf.float32)
distribution_outputs = Lambda(negative_binomial_layer)(outputs)
# Construct model
model = Model(inputs=inputs, outputs=outputs)

num_epochs = 10
opt = Adam()
model.compile(loss=negative_binomial_loss, optimizer=opt)
history = model.fit(X_train, y_train, epochs=num_epochs,
                    validation_data=(X_test, y_test))
These are my predicted values if I print y_pred in my custom loss function:
Epoch 1/10
y_pred = [[2.19472528 3.14479065]
[-1.16056371 1.69369149]
[-1.12327099 2.06830978]
...
[-1.23587477 4.82307]
[0.235431105 3.86740351]
[-2.75554061 1.10352468]] [[[2.19472528 3.14479065]
[-1.16056371 1.69369149]
[-1.12327099 2.06830978]
...
[-1.23587477 4.82307]
[0.235431105 3.86740351]
[-2.75554061 1.10352468]]]
The second predicted value, p, should be between 0 and 1, and since it is out of this range I am getting nan when computing the loss.
Any suggestions? Thanks
I can't give an exact programming explanation, but I can give a theoretical answer to this question which you should be able to use to build it.
From what I am assuming, you are asking how to use a different activation function for each output node in the output layer. I do not know much about any of the libraries or extensions you are using, but libraries like these usually include some way to create a customised network. From the code you have posted I can see that you are using a pre-defined structure for the network, which means you may not be able to customise the output layer directly, and you may have to create a custom network instead. I am assuming you are using TensorFlow because of some of the methods in the code you posted.
There is also something else to consider: you usually have activation functions on the hidden-layer neurons too, which you might have to take into consideration as well.
I am sorry I was not able to give a practical answer, but I hope this helps you see what you can do to get it to work - have a nice day!
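For what it's worth, here is a minimal sketch (not part of the answer above) of the usual pattern for applying a different activation to each output, reusing the question's own Lambda layer. Note that the question builds the Model from outputs rather than distribution_outputs, so the per-output activations are never applied to the model's output:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model

def negative_binomial_layer(x):
    n, p = tf.unstack(x, num=2, axis=-1)
    n = tf.keras.activations.softplus(tf.expand_dims(n, -1))  # n > 0
    p = tf.keras.activations.sigmoid(tf.expand_dims(p, -1))   # 0 < p < 1
    return tf.concat([n, p], axis=-1)

inputs = Input(shape=(212,))
h = Dense(16)(inputs)
h = Dense(8)(h)
raw = Dense(2)(h)                                   # unconstrained (n, p)
constrained = Lambda(negative_binomial_layer)(raw)  # per-output activations
model = Model(inputs=inputs, outputs=constrained)   # build from the Lambda output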

What is steps_per_epoch in model.fit_generator actually doing?

After reading the Keras documentation on the steps_per_epoch required argument of the model.fit_generator method, my understanding of it is:
If a dataset contains N samples and the generator function (passed to Keras) returns B = batch_size samples for every call (here, I think of a call as a single yield from the generator function), then since steps_per_epoch = ceil(N/B), the generator is called steps_per_epoch times, so that the full dataset is passed through the model after one epoch, and this same process is repeated for every epoch until training is completed.
So as to test whether my understanding was correct, I implemented the following
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

index = 0

def get_values(inputs, targets):
    i = 0
    while True:
        yield inputs[i], targets[i]
        i += 1
        if i >= len(inputs):
            i = 0

def get_batch(inputs, targets, batch_size=2):
    global index
    batch_X = []
    batch_Y = []
    for inp, targ in get_values(inputs, targets):
        batch_X.append(inp)
        batch_Y.append(targ)
        if len(batch_X) >= batch_size:
            yield np.array(batch_X), np.array(batch_Y)
            index += 1
            batch_X = []
            batch_Y = []

data = list(range(10))
labels = [2 * val for val in range(10)]

model = Sequential([
    Dense(16, activation='relu', input_shape=(1, )),
    Dense(1)
])
model.compile(optimizer='rmsprop', loss='mean_squared_error')
model.fit_generator(get_batch(data, labels, batch_size=2), steps_per_epoch=5,
                    epochs=1, verbose=False)

print(index)  # Should print 5, but it prints 15
The program isn't tough to understand...
But as per my interpretation, it should print 5, yet it prints 15. Am I wrong in my interpretation of steps_per_epoch?
If so, please give me the correct interpretation of steps_per_epoch.
PS: I'm new to TensorFlow and Keras; thanks in advance.
I did not go through your code, but your original interpretation is correct. Actually, per the documentation located here, you can omit steps_per_epoch and model.fit will divide the length of your dataset (N) by the batch size to determine the steps. I did copy and run your code, and guess what: it printed the index as 5. The only thing I can think of that might be different are the imports.
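For reference, a minimal sketch of the relationship described above, using the question's own toy numbers:

import math

N = 10           # samples in the toy dataset
B = 2            # batch_size yielded by get_batch
steps_per_epoch = math.ceil(N / B)   # 5 generator calls cover the data once per epoch

# model.fit_generator(get_batch(data, labels, batch_size=B),
#                     steps_per_epoch=steps_per_epoch, epochs=1, verbose=False)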

Keras Multi Input Network, using Images and structured data : How do I build the correct input data?

I am building a multi-input network using the Keras functional API, but I struggle to find and understand the right format for my input data through the network.
I have two main inputs:
One is an image, which goes through a fine-tuned ResNet50 CNN.
The second is a simple numpy array (X_train) containing metadata about the image (its position and size). This one goes through a simple dense network.
I load the images from a dataframe containing the metadata and the filepath to the corresponding image.
I use ImageDataGenerator and the flow_from_dataframe method to load my images:
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

train_flow = datagen.flow_from_dataframe(
    dataframe=df_train,
    x_col="cropped_img_filepath",
    y_col="category",
    batch_size=batch_size,
    shuffle=False,
    class_mode="categorical",
    target_size=(224, 224)
)
I can train the two networks separately using their own data; no problems up to here.
The outputs of the two distinct networks are then combined through a dense network to output a 10-digit probability vector:
# Create the input for the final dense network using the output of both the dense MLP and CNN
combinedInput = concatenate([cnn.output, mlp.output])
x = Dense(512, activation="relu")(combinedInput)
x = Dense(256, activation="relu")(x)
x = Dense(128, activation="relu")(x)
x = Dense(32, activation="relu")(x)
x = Dense(10, activation="softmax")(x)

model = Model(inputs=[cnn.input, mlp.input], outputs=x)

# Compile the model
opt = Adam(lr=1e-3, decay=1e-3 / 200)
model.compile(loss="categorical_crossentropy",
              metrics=['accuracy'],
              optimizer=opt)

# Train the model
model_history = model.fit(x=(train_flow, X_train),
                          y=y_train,
                          epochs=1,
                          batch_size=batch_size)
However, I cannot train the overall network; I get the following error:
ValueError: Failed to find data adapter that can handle input: (<class 'tuple'> containing values of types {"<class 'keras_preprocessing.image.dataframe_iterator.DataFrameIterator'>", "<class 'numpy.ndarray'>"}), <class 'pandas.core.series.Series'>
I understand I am not using the correct input format for my input data.
I can train my CNN with train_flow, and my dense network with X_train, so I was hoping this would work.
Do you have any idea how to combine image data and a numpy array into a multi-input array?
Thank you for all the information you can give me!
I finally found how to do it, drawing inspiration from the approach Nima Aghli proposed.
Here is how I did it:
First, instantiate the preprocessing function (for me, the one used for ResNet50):
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

def preprocess_function(x):
    if x.ndim == 3:
        x = x[np.newaxis, :, :, :]
    return preprocess_input(x)

# Initializing the datagen, using the above function:
datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
Then define the custom data generator that will yield randomly sampled arrays coupling image and metadata, while making sure never to run out of data (so that you can run for whichever number of epochs you want):
def createGenerator(dff, verif=False, batch_size=BATCH_SIZE):
    # Shuffles the dataframe, and so the batches as well
    dff = dff.sample(frac=1)

    # Shuffle=False is EXTREMELY important to keep the order of images and coords
    flow = datagen.flow_from_dataframe(
        dataframe=dff,
        directory=None,
        x_col="cropped_img_filepath",
        y_col="category",
        batch_size=batch_size,
        shuffle=False,
        class_mode="categorical",
        target_size=(224, 224),
        seed=42
    )
    idx = 0
    n = len(dff) - batch_size
    batch = 0
    while True:
        # Get next batch of images
        X1 = flow.next()
        # idx to reach
        end = idx + X1[0].shape[0]
        # get next batch of lines from the df
        X2 = dff[["x", "y", "w", "h"]][idx:end].to_numpy()
        dff_verif = dff[idx:end]
        # Updates the idx for the next batch
        idx = end
        # print("batch nb : ", batch, ", batch_size : ", X1[0].shape[0])
        batch += 1
        # Checks if we are at the end of the dataframe
        if idx == len(dff):
            # print("END OF THE DATAFRAME\n")
            idx = 0
        # Yields the image, metadata & target batches
        if verif == True:
            yield [X1[0], X2], X1[1], dff_verif
        else:
            yield [X1[0], X2], X1[1]  # Yield both images, metadata and their mutual label
I voluntarily kept the comments, as they help grasp all the operations that are performed.
The main point/problem is to get images from the whole dataframe, without ever running short of images, and having batches of the same size.
Also, we have to be careful with the order of the images/metadata, so that the right info is connected to the right image in the returned arrays.
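A hypothetical usage sketch of the generator above (df_train, BATCH_SIZE and the combined model come from the question and are assumed to be defined; df_val is a hypothetical validation dataframe):

train_gen = createGenerator(df_train, batch_size=BATCH_SIZE)
val_gen = createGenerator(df_val, batch_size=BATCH_SIZE)

model.fit(train_gen,
          steps_per_epoch=len(df_train) // BATCH_SIZE,
          validation_data=val_gen,
          validation_steps=len(df_val) // BATCH_SIZE,
          epochs=10)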

Preprocess huge data with a custom data generator function for keras

I'm building a Keras model and I have a dataset in msgpack format with over 10 million instances and 40 features, all categorical. For the moment I'm using just a sample of it, since reading the whole dataset and encoding it does not fit into memory. Here is part of the code I'm using:
import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler

def model():
    model = Sequential()
    model.add(Dense(120, input_dim=233, kernel_initializer='uniform', activation='selu'))
    model.add(Dense(12, kernel_initializer='uniform', activation='sigmoid'))
    model.compile(SGD(lr=0.008), loss='mean_squared_error', metrics=['accuracy'])
    return model

def addrDataLoading():
    data = pd.read_msgpack('datum.msg')
    data = data.dropna(subset=['s_address', 'd_address'])
    data = data.sample(300000)  # taking a sample of the dataset to make the encoding possible
    y = data[['s_address', 'd_address']]
    x = data.drop(['s_address', 'd_address'], 1)
    encX = be().fit(x, y)
    numeric_X = encX.transform(x)
    encY = be().fit(y, y)
    numeric_Y = encY.transform(y)
    scaler = StandardScaler()
    X_all = scaler.fit_transform(numeric_X)
    x_train = X_all[0:250000, :]
    y_train = numeric_Y.iloc[0:250000, :]
    x_val = X_all[250000:, :]
    y_val = numeric_Y.iloc[250000:, :]
    return x_train, y_train, x_val, y_val

x_train, y_train, x_val, y_val = addrDataLoading()

model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=20, batch_size=200)
So my question is: how do I use a custom data generator function to read and process all the data I have, not just a sample, and then use the fit_generator() function to train my model?
EDIT
This is a sample of the data (screenshot "netData" in the original post).
I think that taking different samples from the data results in different encoding dimensions.
For this sample there are 16 different categories: 4 addresses (3 bits), 4 hostnames (3 bits), 1 subnet mask (1 bit), 5 infrastructures (3 bits), 1 access zone (1 bit), so the binary encoding gives us 11 bits, and the new dimension of the data is 11 where previously it was 5. Now say that for another sample the address column has 8 different categories; this gives 4 bits in binary, and if the other columns keep the same number of categories, the overall encoding results in 12 dimensions. I believe that is what's causing the problem.
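A small illustration of this point (hypothetical column and values, assuming category_encoders' BinaryEncoder): the encoded width depends on how many distinct categories the fitted sample happens to contain.

import pandas as pd
from category_encoders import BinaryEncoder

small = pd.DataFrame({"addr": ["a", "b", "c", "d"]})          # 4 categories
large = pd.DataFrame({"addr": [f"a{i}" for i in range(8)]})   # 8 categories

print(BinaryEncoder().fit_transform(small).shape[1])  # fewer binary columns
print(BinaryEncoder().fit_transform(large).shape[1])  # more binary columns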
Slightly slow solution (repeating the same actions)
Edit - fit the BinaryEncoder before creating the generators
Drop NAs first and work with the clean data from then on, to avoid reassignments of the data frame.
data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')
In this solution the data generator can process the same data multiple times. If that's not critical, you can use this solution.
Define a function which reads the data and splits the index into train and test parts. It won't consume a lot of memory.
import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

def model():
    # some code defining the model
    ...

def train_test_index_split():
    # if there's enough memory to add one more column
    data = pd.read_msgpack('datum_clean.msg')  # the cleaned file saved above
    train_idx, test_idx = train_test_split(data.index)
    return data, train_idx, test_idx

data, train_idx, test_idx = train_test_index_split()
Define and initialize the data generators, both for train and validation:
def data_generator(data, encX, encY, batch_size, n_steps, index):
    # EDIT: As the data was cleaned, you don't need dropna
    # data = data.dropna(subset=['s_address','d_address'])
    for i in range(n_steps):
        batch_idx = np.random.choice(index, batch_size)
        sample = data.loc[batch_idx]
        y = sample[['s_address', 'd_address']]
        x = sample.drop(['s_address', 'd_address'], 1)
        numeric_X = encX.transform(x)
        numeric_Y = encY.transform(y)
        scaler = StandardScaler()
        X_all = scaler.fit_transform(numeric_X)
        yield X_all, numeric_Y
Edited part: now train the binary encoders. You should sub-sample your data to create a representative training set for the encoders. I guess the error about the shape of the data was caused by an incorrectly trained BinaryEncoder (Error when checking input: expected dense_9_input to have shape (233,) but got array with shape (234,)):
def get_minimal_unique_frame(df):
    return (pd.Series([df[column].unique() for column in df], index=df.columns)
            .apply(pd.Series)        # transform the lists of unique values into a frame
            .T                       # transpose frame: columns are columns again
            .fillna(method='ffill')) # fill NaNs with the previous value

x = get_minimal_unique_frame(data.drop(['s_address', 'd_address'], 1))
y = get_minimal_unique_frame(data[['s_address', 'd_address']])
NB: I have never used category_encoders and have an incompatible system configuration, so I can't install and check it. The code above may therefore cause problems. In that case, I guess, you should compare the lengths of the x and y data frames and make them the same, and probably change the index of the data frames.
encX = be().fit(x, y)
encY = be().fit(y, y)

batch_size = 200
train_steps = 100000
val_steps = 5000

train_gen = data_generator(data, encX, encY, batch_size, train_steps, train_idx)
test_gen = data_generator(data, encX, encY, batch_size, val_steps, test_idx)
Edit: please provide an example of x_sample. Run train_gen, save the output, and post x_samples and y_samples:
x_samples = []
y_samples = []
for i in range(10):
    x_sample, y_sample = next(train_gen)
    x_samples.append(x_sample)
    y_samples.append(y_sample)
Note: the data generator won't stop itself, but it will be stopped after train_steps by the fit_generator method.
Fit model with generators:
model.fit_generator(generator=train_gen, steps_per_epoch=train_steps, epochs=1,
                    validation_data=test_gen, validation_steps=val_steps)
As far as I know, Python does not copy pandas dataframes unless you do it explicitly with copy() or similar. Because of that, both generators use the same object. But if you use a Jupyter Notebook, data leaks/uncollected garbage may occur, and memory troubles come with them.
More efficient solution - sketch
Clean your data
data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')
Create a train/test split, preprocess it, and store it as numpy arrays, if you have enough disk space.
data, train_idx, test_idx = train_test_index_split()

def data_preprocessor(data, path, index):
    # data = data.dropna(subset=['s_address','d_address'])
    sample = data.loc[index]
    y = sample[['s_address', 'd_address']]
    x = sample.drop(['s_address', 'd_address'], 1)
    encX = be().fit(x, y)
    numeric_X = encX.transform(x)
    encY = be().fit(y, y)
    numeric_Y = encY.transform(y)
    scaler = StandardScaler()
    X_all = scaler.fit_transform(numeric_X)
    np.save(path + '_X', X_all)
    np.save(path + '_y', numeric_Y)

data_preprocessor(data, 'train', train_idx)
data_preprocessor(data, 'test', test_idx)
Delete unnecessary data:
del data
Load your files and use the following generator:
train_X = np.load('train_X.npy')
train_y = np.load('train_y.npy')
test_X = np.load('test_X.npy')
test_y = np.load('test_y.npy')

def data_generator(X, y, batch_size, n_steps):
    idxs = np.arange(len(X))
    np.random.shuffle(idxs)
    ptr = 0
    for _ in range(n_steps):
        batch_idx = idxs[ptr:ptr + batch_size]
        x_sample = X[batch_idx]
        y_sample = y[batch_idx]
        ptr += batch_size
        if ptr > len(X):
            ptr = 0
        yield x_sample, y_sample
Prepare generators:
train_gen = data_generator(train_X, train_y, batch_size, train_steps)
test_gen = data_generator(test_X, test_y, batch_size, val_steps)
And finally fit the model. I hope one of these solutions will help. At least if Python passes arrays and data frames by reference, not by value. See this Stack Overflow answer about it.

Tensorflow and reading binary data properly

I am trying to properly read my own binary data into TensorFlow based on the fixed-length records section of this tutorial, and by looking at the read_cifar10 function here. Mind you, I am new to TensorFlow, so my understanding may be off.
My Data
My files are binary with float32 values. The first 32-bit value is the label, and the remaining 256 values are the data. I want to reshape the data at the end into a [2, 128] matrix.
My Code So far:
import tensorflow as tf
import os

def read_data(filename_queue):
    item_type = tf.float32
    label_items = 1
    data_items = 256
    label_bytes = label_items * item_type.size
    data_bytes = data_items * item_type.size
    record_bytes = label_bytes + data_bytes
    reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
    key, value = reader.read(filename_queue)
    record_data = tf.decode_raw(value, item_type)
    # labels = tf.cast(tf.strided_slice(record_data, [0], [label_items]), tf.int32)
    label = tf.strided_slice(record_data, [0], [label_items])
    data0 = tf.strided_slice(record_data, [label_items], [label_items + data_items])
    data = tf.reshape(data0, [2, data_items // 2])
    return data, label

if __name__ == '__main__':
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Set GPU device
    datafiles = ['train_0000.dat', 'train_0001.dat']
    num_epochs = 2
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        (x, y) = read_data(filename_queue)
        print(y.eval())
This code hangs at the print(y.eval()), but I fear I have much bigger issues than that.
Question:
When I execute this, I get a data and a label tensor returned. The problem is that I don't quite understand how to actually read the data from the tensor. For example, I understand the autoencoder example here; however, that one has a mnist.train.next_batch(batch_size) function that is called to read the next batch. Do I need to write something like that for my function, or is it handled by something internal to my read_data() function? If I need to write that function, what does it look like?
Are there any other obvious things I'm missing? My goal in using this method is to reduce I/O overhead and avoid storing all of the data in memory, since my files are quite large.
Thanks in advance.
Yes, you are pretty much done. At this point you need to:
1) Write your neural network model, model, which is supposed to take your data and return a label.
2) Write your cost function C, which takes the network prediction and the true label and gives you a cost.
3) Choose an optimizer.
4) Put everything together:
opt = tf.train.AdamOptimizer(learning_rate=0.001)

datafiles = ['train_0000.dat', 'train_0001.dat']
num_epochs = 2

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    filename_queue = tf.train.string_input_producer(datafiles, num_epochs=num_epochs, shuffle=True)
    data, label = read_data(filename_queue)
    example_batch, label_batch = tf.train.shuffle_batch(
        [data, label], batch_size=128,
        capacity=1000, min_after_dequeue=500)  # capacity values here are illustrative

    y_pred = model(data)
    loss = C(label, y_pred)
After which you iterate and minimize the loss with:
opt.minimize(loss)
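One caveat about the hang at print(y.eval()): with the TF1 queue-based pipeline, nothing flows until the queue runners are started (and, since num_epochs is set, local variables must be initialized too). A minimal sketch of that boilerplate inside the session:

sess.run([tf.global_variables_initializer(),
          tf.local_variables_initializer()])  # num_epochs uses a local variable
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
    x_batch, y_batch = sess.run([example_batch, label_batch])
finally:
    coord.request_stop()
    coord.join(threads)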
See also tf.train.string_input_producer behavior in a loop for related information.
