I'm working with a large dataset on a machine with low memory, and I was introduced to Dask DataFrame. What I understood from the docs is that Dask does not load the whole dataset into memory; instead it creates multiple threads that fetch records from disk on demand. So I assumed that a Keras model with batch_size = 500 would only keep 500 records in memory during training, but when I start training it takes forever. Maybe I am doing something wrong; please suggest.
shape of training data: 1000000 * 1290
import glob
import dask.dataframe as dd
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

num_classes = 2
batch_size = 500
epochs = 5

paths_train = glob.glob(r'x_train_d_final*.csv')
X_train_d = dd.read_csv('.../x_train_d_final0.csv')
Y_train1 = keras.utils.to_categorical(Y_train.iloc[:, 1], num_classes)

model = Sequential()
model.add(Dense(645, activation='sigmoid', input_shape=(1290,),
                kernel_initializer='glorot_normal'))
#model.add(Dense(20, activation='sigmoid', kernel_initializer='glorot_normal'))
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=Adam(decay=0),
              metrics=['accuracy'])

history = model.fit(X_train_d.to_records(), Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    class_weight={0: 1, 1: 6.5},
                    shuffle=False)
You should use fit_generator() on your Sequential model, with either a generator or a Sequence instance. Both provide a proper way to load only a portion of the data at a time.
Keras docs provide an excellent example:
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            yield (x, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
                    steps_per_epoch=1000, epochs=10)
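For the Sequence route, a minimal sketch could look like this. The chunked CSV reading with pandas, the file name, and the column layout are assumptions for illustration, not part of the original code:

import math
import numpy as np
import pandas as pd
from keras.utils import Sequence

class CSVSequence(Sequence):
    """Reads exactly one batch of rows from a CSV file per __getitem__ call."""
    def __init__(self, path, n_rows, batch_size):
        self.path = path
        self.n_rows = n_rows
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(self.n_rows / self.batch_size)

    def __getitem__(self, idx):
        # read only the rows belonging to batch `idx` from disk (skip the header + earlier rows)
        chunk = pd.read_csv(self.path,
                            skiprows=1 + idx * self.batch_size,
                            nrows=self.batch_size, header=None)
        # assumed layout: all columns except the last are features, the last is the label
        x = chunk.iloc[:, :-1].values.astype(np.float32)
        y = chunk.iloc[:, -1].values.astype(np.float32)
        return x, y

# model.fit_generator(CSVSequence('x_train.csv', n_rows=1000000, batch_size=500), epochs=5)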
Today Keras does not know about Dask dataframes or arrays. I suspect that it is just converting the dask object into the equivalent Pandas or Numpy object instead.
If your Keras model can be trained incrementally then you could solve this problem using dask.delayed and some for loops.
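A rough sketch of that idea, assuming the model supports incremental updates via train_on_batch; the partition loop and the label-lookup helper below are illustrative assumptions, not code from the question:

import dask.dataframe as dd

# Lazily point at the CSVs; nothing is read into memory yet
X_train_d = dd.read_csv('x_train_d_final*.csv')

for part in X_train_d.to_delayed():      # one dask.delayed object per partition
    pdf = part.compute()                 # materialize just this partition as pandas
    x = pdf.values                       # features for this chunk
    y = labels_for(pdf.index)            # hypothetical helper returning the matching labels
    model.train_on_batch(x, y)           # incremental weight update on this chunk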
Eventually it would be nice to see the Keras and Dask projects learn more about each other to facilitate these workloads without excess work.
Related
I have a relatively large dataset, roughly 1.8 TB in size. Aiming to optimize the data feeding process while training the model, I first preprocessed all the data and stored it in 1200 separate TensorFlow datasets with 1024 samples in each. The sample shape is (512, 768) and the data type is tf.float32.
To load the data I use a Python loop with the Dataset.load() and Dataset.concatenate() functions:
from tensorflow.data import Dataset
from tqdm import trange   # range with a progress bar

# I put datasets 0000..0959 into the training dataset
train_data = Dataset.load('tf/0000')
for i in range(1, 960):
    train_data = train_data.concatenate(Dataset.load(f'tf/{str(i).zfill(4)}'))

# And the datasets 0960..1199 into the validation dataset
valid_data = Dataset.load('tf/0960')
for i in trange(961, 1200):
    valid_data = valid_data.concatenate(Dataset.load(f'tf/{str(i).zfill(4)}'))
The labels are stored separately as a Numpy file. There is only one label for each sample, so to merge the features and labels, I put labels into the memory and zip them with the features:
labels = np.load('tf/y.npy').astype(np.float32)
train_labels = labels[0:960*1024]
valid_labels = labels[960*1024:1200*1024]
train_data = Dataset.zip((train_data, Dataset.from_tensor_slices(train_labels)))
valid_data = Dataset.zip((valid_data, Dataset.from_tensor_slices(valid_labels)))
I tried using different batch sizes and prefetch parameters, but it didn't impact the performance. Here is the current version:
train_data = train_data.batch(256)
train_data = train_data.prefetch(4)
valid_data = valid_data.batch(64)
valid_data = valid_data.prefetch(4)
Everything about the model in this example is as simple as possible:
from tensorflow.keras.layers import Input, Flatten, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def create_model(input_shape):
    inp = Input(input_shape)
    f1 = Flatten()(inp)
    do1 = Dropout(0.2)(f1)
    d1 = Dense(1024)(do1)
    do2 = Dropout(0.2)(d1)
    out = Dense(1, activation='linear')(do2)
    model = Model(inputs=inp, outputs=out)
    return model

model = create_model((512, 768))
model.compile(optimizer=Adam(), loss='mse')
model.fit(train_data, validation_data=valid_data, epochs=10)
The progress bar indicates that one epoch is going to take 2-3 hours, while neither the CPU nor the GPU is fully utilized.
I use a virtual machine with 8 vCPUs, 30 GB RAM, and a T4 GPU, running Python 3.7.12 and TensorFlow 2.10.0.
Utilization metrics are:
CPU: stable ~6% (1 vCPU, I guess)
RAM: stable ~38% used; ~60% cached; ~2% free
GPU: hops from 0 to 15-20%
So I want to find the bottleneck and fix the code to make it use all resources.
I've tried to change the batch size and prefetch parameter but it had zero effect.
I've also tried to set tf.data.ThreadingOptions() after prefetch():
options = tf.data.Options()
options.threading.private_threadpool_size = 10
train_data = train_data.with_options(options)
But it didn't help either.
I've looked into this question - Looping over tf.data.Dataset very slow - and tried to wrap the model fitting into @tf.function:
@tf.function
def train(model, train_data, valid_data):
    model.fit(train_data, validation_data=valid_data, epochs=10)

train(model, train_data, valid_data)
But it raised an error:
RuntimeError: Detected a call to `Model.fit` inside a `tf.function`. `Model.fit is a high-level endpoint that manages its own `tf.function`. Please move the call to `Model.fit` outside of all enclosing `tf.function`s. Note that you can call a `Model` directly on `Tensor`s inside a `tf.function` like: `model(x)`.
Finally, I've experimented with workers and use_multiprocessing parameters of the model.fit() method, but it changed nothing.
I have also read these questions:
Why is TensorFlow's `tf.data` package slowing down my code?
Tensorflow Dataset extremely slow compared to queues
But I didn't find them relevant.
I have a 27 GB dataset to analyse, and because of the size of my RAM I can't feed all my data into my neural network at once. I have to import a part of it, learn on it, then import the next part, and so on. The process would look something like this:
import 10% of data
learn
save model
delete the data on RAM
import the next 10% and so on
To see how this would affect a known dataset, I tested it on MNIST. The following is the procedure:
for 35 times:
import 1/5 of the data
learn
delete
import the next 1/5
learn
delete
...
This is the code to import the dataset from tensorflow:
from tensorflow.keras.datasets import mnist
(sep, label), (sep_t, label_t) = mnist.load_data()
Then, the network:
import tensorflow as tf

Dense = tf.keras.layers.Dense

fc_model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        Dense(128, activation='relu'),
        Dense(32, activation='relu'),
        Dense(10, activation='softmax')
    ])

fc_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
Below is the code for partially importing and learning the MNIST data set:
from tensorflow.keras.models import load_model

valacc = []      # per-chunk validation accuracy
valacc_epc = []  # per-epoch validation accuracy
i = 0
for k in range(35):          # 35 passes over the data
    for j in range(5):       # 5 chunks of 12000 samples each
        if i == 0:
            # very first chunk: train the freshly built model
            history = fc_model.fit(sep[j*12000:(j+1)*12000], label[j*12000:(j+1)*12000],
                                   batch_size=128, validation_data=(sep_t, label_t), epochs=1)
            fc_model.save('Mytf.h5')
            i = i + 1
        else:
            # reload the weights saved after the previous chunk, then train on the next chunk
            fc_model = load_model('Mytf.h5')
            history = fc_model.fit(sep[j*12000:(j+1)*12000], label[j*12000:(j+1)*12000],
                                   batch_size=128, validation_data=(sep_t, label_t), epochs=1)
            fc_model.save('Mytf.h5')
        valacc.append(history.history['val_accuracy'])
    valacc_epc.append(history.history['val_accuracy'])
The following is the code to learn the data in one whole dataset:
history_new = fc_model.fit(sep, label, batch_size=128, validation_data=(sep_t, label_t) ,epochs=35)
The graph below compares the two methods in terms of validation accuracy.
Even though the difference is only about 1% (96% average vs. 95% average), would this mean that, when testing on a different dataset using the same save-and-reload methodology, the accuracy would also be reduced? Is it better to invest in a cloud computing platform and train on the whole dataset at once?
For both approaches, the batches are organized differently, so there are bound to be some deviations (similar to shuffling the data vs. feeding it in a fixed order). But we can assume that these differences would not be consistent unless we observed them over a large number of trials.
In any case, this is the common approach to loading large datasets bit by bit: tf.keras allows you to pass a Python generator as x to model.fit() (in case you want to research tutorials or work with an older API: until recently this was a separate method called model.fit_generator, see the API docs).
All the data generator needs to do is yield a batch of training data and labels (x, y) each time it is called. The API takes care of calling it for you, as long as you pass it to fit. The result is that everything is read into RAM batch by batch. A very basic template for a generator looks like this (source):
import numpy as np

def generator(features, labels, batch_size):
    # Create empty arrays to contain a batch of features and labels
    batch_features = np.zeros((batch_size, 64, 64, 3))
    batch_labels = np.zeros((batch_size, 1))
    while True:
        for i in range(batch_size):
            # choose a random index into the features
            index = np.random.randint(len(features))
            batch_features[i] = some_processing(features[index])  # placeholder preprocessing from the original template
            batch_labels[i] = labels[index]
        yield batch_features, batch_labels
I have very big training files (30 GB).
Since all the data does not fit in my available RAM, I want to read the data in batches.
I saw that the Tensorflow-io package implements a way to read HDF5 into TensorFlow, thanks to the function tfio.IODataset.from_hdf5().
Then, since tf.keras.Model.fit() takes a tf.data.Dataset as input containing both samples and targets, I need to zip my X and Y together and then use .batch() and .prefetch() to load only the necessary data into memory. For testing, I tried to apply this method to smaller samples: training (9 GB), validation (2.5 GB) and testing (1.2 GB), which I know work well because they fit into memory and I get good results (70% accuracy and <1 loss).
The training files are stored in HDF5 files split into samples (X) and labels (Y) files like so:
X_learn.hdf5
X_val.hdf5
X_test.hdf5
Y_test.hdf5
Y_learn.hdf5
Y_val.hdf5
Here is my code:
import tensorflow as tf
import tensorflow_io as tfio

BATCH_SIZE = 2048
EPOCHS = 100

# Create an IODataset from each HDF5 file's dataset object
x_val = tfio.IODataset.from_hdf5(path_hdf5_x_val, dataset='/X_val')
y_val = tfio.IODataset.from_hdf5(path_hdf5_y_val, dataset='/Y_val')
x_test = tfio.IODataset.from_hdf5(path_hdf5_x_test, dataset='/X_test')
y_test = tfio.IODataset.from_hdf5(path_hdf5_y_test, dataset='/Y_test')
x_train = tfio.IODataset.from_hdf5(path_hdf5_x_train, dataset='/X_learn')
y_train = tfio.IODataset.from_hdf5(path_hdf5_y_train, dataset='/Y_learn')
# Zip together samples and corresponding labels
train = tf.data.Dataset.zip((x_train,y_train)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
test = tf.data.Dataset.zip((x_test,y_test)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
val = tf.data.Dataset.zip((x_val,y_val)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
# Build the model
model = build_model()
# Compile the model with a custom learning rate schedule for the Adam optimizer
model.compile(loss='categorical_crossentropy',
optimizer=Adam(lr=lr_schedule(0)),
metrics=['accuracy'])
# Fit model with class_weights calculated before
model.fit(train,
epochs=EPOCHS,
class_weight=class_weights_train,
validation_data=val,
shuffle=True,
callbacks=callbacks)
This code runs, but the loss goes very high (300+) and the accuracy drops to 0 (0.30 -> 4e-5) right from the beginning... I don't understand what I am doing wrong. Am I missing something?
Providing the solution here (Answer Section), even though it is present in the Comments Section, for the benefit of the community.
There was no issue with the code; the problem was actually with the data (it was not preprocessed properly), so the model was not able to learn well, which led to the strange loss and accuracy.
I am absolutely new to TensorFlow and Keras, and I am trying to find my way around by trying out some code I found online.
In particular I am using Fashion-MNIST, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each of them is a 28x28 grayscale image.
I am following this tutorial: https://towardsdatascience.com/building-your-first-neural-network-in-tensorflow-2-tensorflow-for-hackers-part-i-e1e2f1dfe7a0, and I have no problem until the definition of
history = model.fit(
    train_dataset.repeat(),
    epochs=10,
    steps_per_epoch=500,
    validation_data=val_dataset.repeat(),
    validation_steps=2)
As far as I understand, I need to use train_dataset.repeat() as the input dataset because otherwise I won't have enough training examples for those hyperparameter values (epochs, steps_per_epoch).
My question is: how can I avoid to have to use .repeat()?
How do I need to change the hyperparameters?
I am copying the code here, for simplicity:
def preprocess(x, y):
    x = tf.cast(x, tf.float32) / 255.0
    y = tf.cast(y, tf.float32)
    return x, y

def create_dataset(xs, ys, n_classes=10):
    ys = tf.one_hot(ys, depth=n_classes)
    return tf.data.Dataset.from_tensor_slices((xs, ys)).map(preprocess).shuffle(len(ys)).batch(128)
model.compile(optimizer='adam', loss=tf.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

history1 = model.fit(train_dataset.repeat(),
                     epochs=10,
                     steps_per_epoch=500,
                     validation_data=val_dataset.repeat(),
                     validation_steps=2)
Thanks!
If you don't want to use .repeat(), your model needs to pass through your entire dataset exactly once per epoch.
In order to do that, you need to calculate how many steps it takes for your model to go through the entire dataset; the calculation is easy:
steps_per_epoch = len(train_dataset) // batch_size
So with a train_dataset of 60,000 samples and a batch_size of 128, you need 468 steps per epoch.
By setting the parameter this way, you make sure that you do not exceed the size of your dataset.
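A minimal sketch of that setup, keeping the question's datasets and hyperparameters (the exact variable names are assumed):

batch_size = 128
steps_per_epoch = 60000 // batch_size   # 468 full batches cover the training set once

history = model.fit(train_dataset,      # no .repeat() needed
                    epochs=10,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=val_dataset)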
I encountered the same problem and here is what I found.
Documentation of tf.keras.Model.fit: "If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted."
In other words, we don't need to specify steps_per_epoch if we use a tf.data dataset as the training data; TF will figure out how many steps there are. Meanwhile, TF automatically restarts the dataset when the next epoch begins, so you can specify any number of epochs.
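A minimal sketch of that variant, assuming a finite (non-repeated) dataset like the one built in the question:

# Each epoch runs until the finite dataset is exhausted; steps_per_epoch is omitted
history = model.fit(train_dataset,
                    epochs=10,
                    validation_data=val_dataset)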
When passing an infinitely repeating dataset (e.g. dataset.repeat()), you must specify the steps_per_epoch argument.
I'm trying to perform a sentiment analysis in Python using Keras. To do so, I need to do a word embedding of my texts. The problem appears when I try to fit the data to my model:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model_1 = Sequential()
model_1.add(Embedding(1000, 32, input_length=X_train.shape[0]))
model_1.add(Flatten())
model_1.add(Dense(250, activation='relu'))
model_1.add(Dense(1, activation='sigmoid'))
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The shape of my train data is
(4834,)
and it is a Pandas Series object. When I try to fit my model and validate it with some other data, I get this error:
model_1.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=64, verbose=2)
ValueError: Error when checking model input: expected
embedding_1_input to have shape (None, 4834) but got array with shape
(4834, 1)
How can I reshape my data to make it suitable for Keras? I've been trying np.reshape, but I cannot place None elements with that function.
Thanks in advance
None is the number of rows (samples) expected at training time, so you can't define it yourself. Also, Keras needs a NumPy array as input, not a pandas DataFrame. First convert the DataFrame to a NumPy array with df.values and then reshape it with np.reshape(arr, (-1, 4834)). Note that you should use np.float32; this is important if you train on a GPU.
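A minimal sketch of that conversion, assuming X_train is the pandas object from the question:

import numpy as np

# pandas -> NumPy, cast to float32 for GPU training
x_train_arr = X_train.values.astype(np.float32)
# reshape so each row has 4834 features, as the model's input expects
x_train_arr = np.reshape(x_train_arr, (-1, 4834))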
https://pypi.org/project/keras-pandas/
The easiest way is to use the keras_pandas package to fit a pandas DataFrame to Keras. The code shown below is a general example from the package docs.
from keras import Model
from keras.layers import Dense
from keras_pandas.Automater import Automater
from keras_pandas.lib import load_titanic
observations = load_titanic()
# Transform the data set, using keras_pandas
categorical_vars = ['pclass', 'sex', 'survived']
numerical_vars = ['age', 'siblings_spouses_aboard', 'parents_children_aboard', 'fare']
text_vars = ['name']
auto = Automater(categorical_vars=categorical_vars, numerical_vars=numerical_vars,
                 text_vars=text_vars, response_var='survived')
X, y = auto.fit_transform(observations)
# Start model with provided input nub
x = auto.input_nub
# Fill in your own hidden layers
x = Dense(32)(x)
x = Dense(32, activation='relu')(x)
x = Dense(32)(x)
# End model with provided output nub
x = auto.output_nub(x)
model = Model(inputs=auto.input_layers, outputs=x)
model.compile(optimizer='Adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train model
model.fit(X, y, epochs=4, validation_split=.2)
You need a specific version of Pandas for this to work. If you use the current version (as of 20 Aug 2018) this will fail.
Roll back your Pandas and Keras (pip uninstall ....) and then install a specific version like this:
python -m pip install pandas==0.19.2
Use tf.data.Dataset.from_tensor_slices to read the values from a pandas DataFrame.
See https://www.tensorflow.org/tutorials/load_data/pandas_dataframe for a reference on how to do this properly in TF 2.x.
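A minimal sketch of that approach, loosely following the linked tutorial; the DataFrame df, its 'target' column, and the batch size are assumptions:

import tensorflow as tf

# df is an assumed DataFrame with numeric feature columns and a 'target' label column
target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
dataset = dataset.shuffle(len(df)).batch(32)

# the dataset already yields (features, labels) batches, so it can be passed to fit directly
model.fit(dataset, epochs=5)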