I have original data (X_train, y_train), where the originals are just images with labels, and I am converting them into pairs of images for a Siamese network. The number of pairs is very large (around 30 GB in memory), so I can't run the pair-creation function on the whole original dataset. I therefore used Keras's fit_generator, thinking it would load only the current batch.
I ran both model.fit and model.fit_generator on a sample of pairs and observed that both use the same amount of memory, so I suspect there is a problem with how my code uses fit_generator. The relevant code is below. Can you please help me with this?
Code Below:
import random
import numpy as np

def create_pairs(X_train, y_train):
    tr_pairs = []
    tr_y = []
    y_train = np.array(y_train)
    digit_indices = [np.where(y_train == i)[0] for i in list(set(y_train))]
    for i in range(len(digit_indices)):
        n = len(digit_indices[i])
        for j in range(n):
            random_index = digit_indices[i][j]
            anchor_image = X_train[random_index]
            anchor_label = y_train[random_index]
            anchor_indices = [idx for idx, lbl in enumerate(y_train) if lbl == anchor_label]
            negate_indices = list(set(range(len(X_train))) - set(anchor_indices))
            for k in range(j + 1, n):
                support_index = digit_indices[i][k]
                support_image = X_train[support_index]
                tr_pairs += [[anchor_image, support_image]]
                negate_index = random.choice(negate_indices)
                negate_image = X_train[negate_index]
                tr_pairs += [[anchor_image, negate_image]]
                tr_y += [1, 0]
    return np.array(tr_pairs), np.array(tr_y)
def myGenerator():
    tr_pairs, tr_y = create_pairs(X_train, y_train)
    while 1:
        for i in range(110):  # 110 batches of 32 pairs
            if i % 125 == 0:
                print("i = " + str(i))
            yield [tr_pairs[i*32:(i+1)*32][:, 0], tr_pairs[i*32:(i+1)*32][:, 1]], tr_y[i*32:(i+1)*32]
model.fit_generator(myGenerator(), steps_per_epoch=110, epochs=2,
                    verbose=1, callbacks=None,
                    validation_data=([te_pairs[:, 0], te_pairs[:, 1]], te_y),
                    validation_steps=None, class_weight=None,
                    max_queue_size=10, workers=1, use_multiprocessing=False,
                    shuffle=True, initial_epoch=0)
myGenerator does return a generator.
However, notice that create_pairs loads the full dataset into memory: when you call tr_pairs, tr_y = create_pairs(X_train, y_train), the whole array of pairs is built, and that is where the memory goes.
myGenerator then simply traverses a structure that is already in memory.
The solution would be to make create_pairs a generator itself, so that each batch of pairs is built on the fly.
If the data is a numpy array, I can suggest using HDF5 files (h5py) to read chunks of data from disk.
http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
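For instance, here is a rough sketch of such a generator, which draws a batch of positive and negative pairs on the fly instead of materializing all pairs. The sampling strategy is simplified and the names are illustrative, so it is not a drop-in replacement for create_pairs:

import random
import numpy as np

def pair_batch_generator(X_train, y_train, batch_size=32):
    y_train = np.array(y_train)
    classes = list(set(y_train))
    class_indices = {c: np.where(y_train == c)[0] for c in classes}
    while True:
        pairs, targets = [], []
        for _ in range(batch_size // 2):
            c = random.choice(classes)
            # positive pair (assumes every class has at least two samples)
            a, p = np.random.choice(class_indices[c], 2, replace=False)
            pairs.append([X_train[a], X_train[p]])
            targets.append(1)
            # negative pair: anchor from class c, partner from a different class
            neg_c = random.choice([x for x in classes if x != c])
            n = np.random.choice(class_indices[neg_c])
            pairs.append([X_train[a], X_train[n]])
            targets.append(0)
        pairs = np.array(pairs)
        yield [pairs[:, 0], pairs[:, 1]], np.array(targets)

The generator can then be passed to fit_generator, with steps_per_epoch set to however many batches you want per epoch.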
I have training data consisting of two multidimensional arrays, [prev_sentences, current_sentences]. When I use the simple model.fit method it gives me a memory error. I want to use fit_generator now, but I don't know how to split the training data into batches to feed to model.fit_generator. The shapes of the training data are (111356, 126, 1024) and (111356, 126, 1024), and the shape of y_train is (111356, 19). Here is the line of code for the simple fit method:
history=model.fit([previous_sentences, current_sentences], y_train,
epochs=15,batch_size=256,
shuffle = False, verbose = 1,
validation_split=0.2,
class_weight=custom_weight_dict,
callbacks=[early_stopping_cb])
I have never used fit_generator or a data generator, so I have no idea exactly how to split the training data for fit_generator. Can anyone help me create batches for fit_generator?
You just need to call:
model.fit_generator(generator, steps_per_epoch)
where steps_per_epoch is typically ceil(num_samples / batch_size) (as per the docs) and generator is a Python generator which iterates over the data and yields it batch-wise: each call to the generator should yield batch_size elements. An example of a generator (source):
import os
import random

import cv2
import numpy as np

# INPUT_SHAPE, e.g. (224, 224), is assumed to be defined elsewhere
def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)
        yield np.array(image_batch)
Since this is absolutely problem-specific, you'll have to write your own generator, though it should be simple to do from this template.
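For training, the generator also has to yield the targets alongside the images; assuming you extend the template that way, wiring it up would look roughly like this (the directory name and epoch count are placeholders):

import math
import os

batch_size = 32
directory = "train_images"  # placeholder path
steps = math.ceil(len(os.listdir(directory)) / batch_size)

model.fit_generator(generate_data(directory, batch_size),
                    steps_per_epoch=steps, epochs=10)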
This is the data generator to split the training data into mini-batches:
def generate_data(X1, X2, Y, batch_size):
    p_input = []
    c_input = []
    target = []
    batch_count = 0
    for i in range(len(X1)):
        p_input.append(X1[i])
        c_input.append(X2[i])
        target.append(Y[i])
        batch_count += 1
        if batch_count >= batch_size:   # yield once batch_size samples are collected
            prev_X = np.array(p_input, dtype=np.int64)
            cur_X = np.array(c_input, dtype=np.int64)
            cur_y = np.array(target, dtype=np.int32)
            print(len(prev_X), len(cur_X))
            yield ([prev_X, cur_X], cur_y)
            p_input = []
            c_input = []
            target = []
            batch_count = 0
    return
And here is the fit_generator call, used instead of the model.fit method:
batch_size=256
epoch_steps=math.ceil(len(previous_sentences)/ batch_size)
hist = model.fit_generator(generate_data(previous_sentences,current_sentences, y_train, batch_size),
steps_per_epoch=epoch_steps,
callbacks = [early_stopping_cb],
validation_data=generate_data(val_prev, val_curr,y_val,batch_size),
validation_steps=val_steps, class_weight=custom_weight_dict,
verbose=1)
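One thing to keep in mind: generate_data above runs through the arrays once and then returns, which is fine for a single epoch, but for several epochs (or for reusing the validation generator) you would normally wrap it in an endless loop. A minimal sketch with the same arrays:

def generate_data_forever(X1, X2, Y, batch_size):
    while True:
        for start in range(0, len(X1), batch_size):
            end = start + batch_size
            yield [np.asarray(X1[start:end]), np.asarray(X2[start:end])], np.asarray(Y[start:end])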
I have a multi-label classification problem. I wrote this custom generator; it reads images and output labels from disk and returns them in batches of size 32.
import os
import math
import numpy as np
import pandas as pd
from keras.preprocessing import image

def get_input(img_name):
    path = os.path.join("images", img_name)
    img = image.load_img(path, target_size=(224, 224))
    return img

def get_output(img_name, file_path):
    # loads the label file from disk on every call
    data = pd.read_csv(file_path, delim_whitespace=True, header=None)
    img_id = img_name.split(".")[0]
    img_id = img_id.lstrip("0")
    img_id = int(img_id)
    labels = data.loc[img_id - 1].values
    labels = labels[1:]
    labels = list(labels)
    label_arrays = []
    for i in range(20):
        val = np.zeros((1))
        val[0] = labels[i]
        label_arrays.append(val)
    return label_arrays

def preprocess_input(img_name):
    img = get_input(img_name)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    return x
def train_generator(batch_size):
    file_path = "train.txt"
    data = pd.read_csv(file_path, delim_whitespace=True, header=None)
    while True:
        for i in range(math.floor(8000 / batch_size)):
            x_batch = np.zeros(shape=(batch_size, 224, 224, 3))
            y_batch = np.zeros(shape=(batch_size, 20))
            for j in range(batch_size):
                img_name = data.loc[i * batch_size + j].values
                img_name = img_name[0]
                x = preprocess_input(img_name)
                y = get_output(img_name, file_path)
                x_batch[j, :, :, :] = x
                y_batch[j] = y
            ys = []
            for k in range(20):
                ys.append(y_batch[:, k])
            yield (x_batch, ys)
I had a small problem with the labels returned to the model, and got it solved in this question:
training a multi-output keras model
I tested this generator on a single-output problem. This custom generator is very slow: the ETA for a single epoch with it is around 27 hours, while the built-in generator (using flow_from_directory) takes 25 minutes per epoch. What am I doing wrong?
The training process for both tests is identical except for the generator used; the validation generator is similar to the training generator. I know I will not reach the efficiency of Keras's built-in generator, but this difference in speed is too much.
EDIT
Some guides I read for creating custom generators.
Writing Custom Keras Generators
custom generator for fit_generator() that yields multiple inputs with different shapes
Maybe the built-in generator processes the data on your GPU while your custom generator runs on the CPU, making it significantly slower.
Another guess is that Keras is using tf.data.Dataset in the background. Your implementation probably goes through feed_dict, which is the slowest possible way to pass information to TensorFlow. The best way to feed data into a model is to use an input pipeline, to ensure that the GPU never has to wait for new data to come in.
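As a rough illustration of such an input pipeline with tf.data (a sketch that assumes the image paths and integer label rows are already available as Python lists; the parsing and label handling will differ for your data):

import tensorflow as tf

def make_dataset(image_paths, labels, batch_size=32):
    def _parse(path, label):
        img = tf.image.decode_jpeg(tf.read_file(path), channels=3)
        img = tf.image.resize_images(img, (224, 224))
        img = (tf.cast(img, tf.float32) - 128.0) / 128.0
        return img, label

    ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    ds = ds.shuffle(1000)
    ds = ds.map(_parse, num_parallel_calls=4)
    ds = ds.batch(batch_size).prefetch(1)
    return ds

Recent versions of tf.keras can typically consume such a dataset directly in model.fit (with steps_per_epoch set), so the Python-level generator disappears entirely.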
I need to save the indices of the data that are used in every mini-batch.
For example if my data is:
x = np.array([[1.1], [2.2], [3.3], [4.4]])
and the first mini-batch is [1.1] and [3.3], then I want to store 0 and 2 (since [1.1] is the 0th observation and [3.3] is the 2nd observation).
I am using tensorflow in eager execution with the keras.sequential APIs.
As far as I can tell from reading the source code, this information is not stored anywhere so I was unable to do this with a callback.
I am currently solving my problem by creating an object that stores the indices.
class IndexIterator(object):
    def __init__(self, n, n_epochs, batch_size, shuffle=True):
        data_ix = np.arange(n)
        if shuffle:
            np.random.shuffle(data_ix)
        self.ix_batches = np.array_split(data_ix, np.ceil(n / batch_size))
        self.batch_indices = []

    def generate_arrays(self, x, y):
        batch_ixs = np.arange(len(self.ix_batches))
        while 1:
            np.random.shuffle(batch_ixs)
            for batch in batch_ixs:
                self.batch_indices.append(self.ix_batches[batch])
                yield (x[self.ix_batches[batch], :], y[self.ix_batches[batch], :])

data_gen = IndexIterator(n=32, n_epochs=100, batch_size=16)

dnn.fit_generator(data_gen.generate_arrays(x, y),
                  steps_per_epoch=2,
                  epochs=100)

# This is what I am looking for
print(data_gen.batch_indices)
Is there no way to do this using a tensorflow callback?
Not sure if this will be more efficient than your solution, but it is certainly more general.
If you have training data with n indices, you can create a secondary Dataset that contains only those indices and zip it with the "real" dataset.
I.e.:
real_data = tf.data.Dataset ...
indices = tf.data.Dataset.from_tensor_slices(tf.range(data_set_length))
total_dataset = tf.data.Dataset.zip((real_data, indices))

# Perform optional pre-processing ops.

iterator = total_dataset.make_one_shot_iterator()

# Next line yields `(original_data_element, index)`
item_and_index_tuple = iterator.get_next()
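With eager execution you can then iterate the zipped dataset directly and collect the indices batch by batch, roughly like this (assuming real_data yields (features, label) pairs; the batch size and the training step are placeholders):

batch_indices = []
for (x_batch, y_batch), idx_batch in total_dataset.batch(16):
    batch_indices.append(idx_batch.numpy())
    # run your training step on (x_batch, y_batch) here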
I'm building a Keras model and I have a dataset in msgpack format with over 10 million instances and 40 features, all categorical. For the moment I'm using just a sample of it, since reading the full dataset and encoding it does not fit into memory. Here is part of the code I'm using:
import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def model():
    model = Sequential()
    model.add(Dense(120, input_dim=233, kernel_initializer='uniform', activation='selu'))
    model.add(Dense(12, kernel_initializer='uniform', activation='sigmoid'))
    model.compile(SGD(lr=0.008), loss='mean_squared_error', metrics=['accuracy'])
    return model

def addrDataLoading():
    data = pd.read_msgpack('datum.msg')
    data = data.dropna(subset=['s_address', 'd_address'])
    data = data.sample(300000)  # taking a sample of the dataset to make the encoding possible
    y = data[['s_address', 'd_address']]
    x = data.drop(['s_address', 'd_address'], 1)
    encX = be().fit(x, y)
    numeric_X = encX.transform(x)
    encY = be().fit(y, y)
    numeric_Y = encY.transform(y)
    scaler = StandardScaler()
    X_all = scaler.fit_transform(numeric_X)
    x_train = X_all[0:250000, :]
    y_train = numeric_Y.iloc[0:250000, :]
    x_val = X_all[250000:, :]
    y_val = numeric_Y.iloc[250000:, :]
    return x_train, y_train, x_val, y_val

x_train, y_train, x_val, y_val = addrDataLoading()

model.fit(x_train, y_train, validation_data=(x_val, y_val), nb_epoch=20, batch_size=200)
So my question is how to use a custom data generator function to read and process all the data I have (not just a sample), and then use fit_generator() to train my model.
EDIT
This is a sample of the data:
netData
I think that taking different samples from the data results in different encoding dimensions.
For this sample there are 16 different categories: 4 addresses (3 bits), 4 hostnames (3 bits), 1 subnet mask (1 bit), 5 infrastructures (3 bits), 1 access zone (1 bit), so the binary encoding gives us 11 bits and the new dimension of the data is 11 (previously 5). Now say that for another sample the address column has 8 different categories: this gives 4 bits in binary, and with the same number of categories in the other columns the overall encoding results in 12 dimensions. I believe that is what's causing the problem.
Slightly slower solution (it repeats the same actions)
Edit - fit the BinaryEncoder before creating the generators
Drop NAs first and work with the clean data from then on, to avoid reassigning the data frame:
data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')
In this solution data_generator can process the same data multiple times. If that's not critical, you can use this solution.
Define a function which reads the data and splits the index into train and test. It won't consume a lot of memory.
import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
def model():
    # some code defining the model
    ...

def train_test_index_split():
    # if there's enough memory to add one more column
    data = pd.read_msgpack('datum_clean.msg')
    train_idx, test_idx = train_test_split(data.index)
    return data, train_idx, test_idx

data, train_idx, test_idx = train_test_index_split()
Define the data generator, used for both training and validation:
def data_generator(data, encX, encY, batch_size, n_steps, index):
    # As the data was cleaned, you don't need dropna
    # data = data.dropna(subset=['s_address','d_address'])
    for i in range(n_steps):
        batch_idx = np.random.choice(index, batch_size)
        sample = data.loc[batch_idx]
        y = sample[['s_address', 'd_address']]
        x = sample.drop(['s_address', 'd_address'], 1)
        numeric_X = encX.transform(x)
        numeric_Y = encY.transform(y)
        scaler = StandardScaler()
        X_all = scaler.fit_transform(numeric_X)
        yield X_all, numeric_Y
The edited part now trains the binary encoders. You should sub-sample your data to create a representative training set for the encoders. I guess the error with the shape of the data (Error when checking input: expected dense_9_input to have shape (233,) but got array with shape (234,)) was caused by an incorrectly trained BinaryEncoder:
def get_minimal_unique_frame(df):
    return (pd.Series([df[column].unique() for column in df], index=df.columns)
            .apply(pd.Series)         # transform the lists of unique values into a frame
            .T                        # transpose: columns are columns again
            .fillna(method='ffill'))  # fill NaNs with the last value

x = get_minimal_unique_frame(data.drop(['s_address', 'd_address'], 1))
y = get_minimal_unique_frame(data[['s_address', 'd_address']])
NB: I have never used category_encoders and have an incompatible system configuration, so I can't install and check it. The code above may therefore cause problems. In that case, I guess, you should compare the lengths of the x and y data frames, make them the same, and probably change the index of the data frames.
encX = be().fit(x, y)
encY = be().fit(y, y)

batch_size = 200
train_steps = 100000
val_steps = 5000

train_gen = data_generator(data, encX, encY, batch_size, train_steps, train_idx)
test_gen = data_generator(data, encX, encY, batch_size, val_steps, test_idx)
Edit: please provide an example of x_sample. Run train_gen, save the output, and post x_samples and y_samples:

x_samples = []
y_samples = []

for i in range(10):
    x_sample, y_sample = next(train_gen)
    x_samples.append(x_sample)
    y_samples.append(y_sample)
Note: the data generator won't stop by itself, but it will be stopped after train_steps by the fit_generator method.
Fit model with generators:
model.fit_generator(generator=train_gen, steps_per_epoch=train_steps, epochs=1,
validation_data=test_gen, validation_steps=val_steps)
As far as I know, Python does not copy pandas data frames unless you do so explicitly with copy() or similar. Because of that, both generators use the same object. But if you use a Jupyter Notebook, data leaks / uncollected garbage may occur, and memory trouble comes with them.
More efficient solution - sketch
Clean your data
data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')
Create a train/test split, preprocess it, and store it as numpy arrays, if you have enough disk space.
data, train_idx, test_idx = train_test_index_split()
def data_preprocessor(data, path, index):
    # data = data.dropna(subset=['s_address','d_address'])
    sample = data.loc[index]
    y = sample[['s_address', 'd_address']]
    x = sample.drop(['s_address', 'd_address'], 1)
    encX = be().fit(x, y)
    numeric_X = encX.transform(x)
    encY = be().fit(y, y)
    numeric_Y = encY.transform(y)
    scaler = StandardScaler()
    X_all = scaler.fit_transform(numeric_X)
    np.save(path + '_X', X_all)
    np.save(path + '_y', numeric_Y)

data_preprocessor(data, 'train', train_idx)
data_preprocessor(data, 'test', test_idx)
Delete unnecessary data:
del data
Load your files and use the following generator:
train_X = np.load('train_X.npy')
train_y = np.load('train_y.npy')
test_X = np.load('test_X.npy')
test_y = np.load('test_y.npy')
def data_generator(X, y, batch_size, n_steps):
    idxs = np.arange(len(X))
    np.random.shuffle(idxs)
    ptr = 0
    for _ in range(n_steps):
        batch_idx = idxs[ptr:ptr + batch_size]
        x_sample = X[batch_idx]
        y_sample = y[batch_idx]
        ptr += batch_size
        if ptr >= len(X):   # wrap around once the data is exhausted
            ptr = 0
            np.random.shuffle(idxs)
        yield x_sample, y_sample
Prepare generators:
train_gen = data_generator(train_X, train_y, batch_size, train_steps)
test_gen = data_generator(test_X, test_y, batch_size, val_steps)
And finally fit the model. I hope one of these solutions helps, at least if Python does pass arrays and data frames by reference rather than by value. Stack Overflow answer about it.
I am trying to train a CNN-LSTM that reads a sequence of 6 frames at a time into the CNN (VGG16 without the top layer) and passes the extracted features to an LSTM in Keras.
The issue is that, since I need to send 6 frames at a time, I need to reshape every 6 frames and add a dimension. Also, since the labels are per frame, I need another variable that takes the label of the first frame of each sequence, puts it in a new array, and then feeds both to the model (code below).
The problem is that the data gets far too big to use model.fit(), and even when trying it on a small part of the data I get weird, horrible results, so I am trying to use model.fit_generator to iterate the input to the model. But since I cannot feed the data I load from the dataset directly (I need to reshape it and do what I explained in the first paragraph), I am trying to write my own generator. However, things are not going well and I keep getting errors saying 'tuple' is not an iterator. Does anyone know how I can fix the code to make it work?
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(224, 224),
classes=['Bark', 'Bitting', 'Engage', 'Hidden', 'Jump',
'Stand', 'Walk'], batch_size=18156, shuffle=False)
valid_batches = ImageDataGenerator().flow_from_directory(valid_path, target_size=(224, 224),
classes=['Bark', 'Bitting', 'Engage', 'Hidden', 'Jump',
'Stand', 'Walk'], batch_size=6, shuffle=False)
test_batches = ImageDataGenerator().flow_from_directory(test_path, target_size=(224, 224),
classes=['Bark', 'Bitting', 'Engage', 'Hidden', 'Jump',
'Stand','Walk'], batch_size=6, shuffle=False)
def train_gen():
    n_frames = 6
    n_samples = 6  # to decide
    H = W = 224
    C = 3
    imgs, labels = next(train_batches)
    y = np.empty((n_samples, 7))
    j = 0
    for i in range(n_samples):
        y[i] = labels[j]
        j += 6
    frame_sequence = imgs.reshape(n_samples, n_frames, H, W, C)
    return frame_sequence, y

def valid_gen():
    v_frames = 6
    v_samples = 1
    H = W = 224
    C = 3
    vimgs, vlabels = next(valid_batches)
    y2 = np.empty((v_samples, 7))
    k = 0
    for l in range(v_samples):
        y2[l] = vlabels[k]
        k += 6
    valid_sequence = vimgs.reshape(v_samples, v_frames, H, W, C)
    return valid_sequence, y2
def main():
    cnn = VGG16(weights='imagenet',
                include_top='False', pooling='avg')
    cnn.layers.pop()
    print(cnn.summary())
    cnn.trainable = False

    video_input = Input(shape=(None, 224, 224, 3), name='video_input')
    print(video_input.shape)
    encoded_frame_sequence = TimeDistributed(cnn)(video_input)  # the output will be a sequence of vectors
    encoded_video = LSTM(256)(encoded_frame_sequence)           # the output will be a vector
    output = Dense(7, activation='relu')(encoded_video)
    video_model = Model(inputs=[video_input], outputs=output)

    tr_data = train_gen()
    vd_data = valid_gen()

    print(video_model.summary())

    imgs, labels = next(train_batches)
    vimgs, vlabels = next(valid_batches)

    print("Training ...")
    video_model.compile(Adam(lr=.001), loss='categorical_crossentropy', metrics=['accuracy'])
    video_model.fit_generator(tr_data,
                              steps_per_epoch=1513,
                              validation_data=vd_data,
                              validation_steps=431,
                              epochs=1,
                              verbose=2)
Is there a mistake in the way I define the generator?
It seems the way I defined the generators was not correct. As a Keras contributor explained to me, the definition has two issues:
Instead of return we need to use yield.
We need a while True loop to make sure the generator keeps reading.
Note that there are a few other errors in the rest of the code that I dealt with, but since this question is about the generator, I am only posting the answer for that part (the two generators are similar, differing only in their input):
def train_gen():
    n_frames = 6
    n_samples = 5  # to decide
    H = W = 224
    C = 3
    while True:
        imgs, labels = next(train_batches)
        y = np.empty((n_samples, 7))
        j = 0
        if len(labels) == n_frames * n_samples:
            frame_sequence = imgs.reshape(n_samples, n_frames, H, W, C)
            for i in range(n_samples):
                y[i] = labels[j]
                j += 6
            yield frame_sequence, y
I think you should implement a class for the data generator. I found this link, it might help you: A detailed example of how to use data generators with Keras
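For reference, here is a minimal sketch of such a class based on keras.utils.Sequence, assuming train_batches is the flow_from_directory iterator above (shuffle=False and a batch size that is a multiple of 6); treat it as a starting point rather than a drop-in replacement:

from keras.utils import Sequence

class FrameSequence(Sequence):
    def __init__(self, batches, n_frames=6):
        self.batches = batches      # e.g. the ImageDataGenerator directory iterator
        self.n_frames = n_frames

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        imgs, labels = self.batches[idx]
        n_samples = len(imgs) // self.n_frames
        frames = imgs.reshape(n_samples, self.n_frames, 224, 224, 3)
        y = labels[::self.n_frames][:n_samples]   # label of the first frame of each sequence
        return frames, y

# usage: video_model.fit_generator(FrameSequence(train_batches), epochs=1)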