I have two questions concerning the ImageDataGenerator:
1) Are the same augmentations used on the whole batch, or does each image get its own random transformation?
e.g. for rotation, does the module rotate all the images in the batch by the same angle, or does each image get a random rotation angle?
2) The data in ImageDataGenerator.flow is looped over (in batches) indefinitely. Is there a way to stop this infinite loop, i.e. to perform the augmentation only n times? Because I need to modify the batch_size in each step (not each epoch).
Thanks
Answer from Francois Chollet:
1) Are the same augmentations used on the whole batch, or does each image get its own random transformation? e.g. for rotation, does the module rotate all the images in the batch by the same angle, or does each image get a random rotation angle?
Every single sample has a different unique transformation (e.g. a random rotation within a certain range).
2) The data in ImageDataGenerator.flow is looped over (in batches) indefinitely. Is there a way to stop this infinite loop, i.e. to perform the augmentation only n times? Because I need to modify the batch_size in each step (not each epoch). Thanks
Unclear what is meant here. But if you are using model.fit_generator(ImageDataGenerator.flow()), then you can specify samples_per_epoch=... to only yield a specific number of samples from the generator. If you want batch-level granularity, you could do:
for x, y in ImageDataGenerator().flow(X_train, y_train):
    model.train_on_batch(x, y)
In that case you can just break out of the loop (it's a loop) after any number of batches that you want.
#Neal: Thank you for the prompt answer! You were right, I probably need to explain my task better. My work is somewhat similar to classifying video sequences, but my data is saved in a database. I want my code to follow these steps for one epoch:
For i in range(number_of_sequences):
    Get N, the number of frames in sequence i (I think that's equivalent to batch_size; the value of N for each sequence is already saved in a list)
    Fetch N successive frames from my database and their labels: X_train, y_train
    For j in range(number_of_rotation):
        Perform (the same) data augmentation on all frames of the sequence (probably using datagen = ImageDataGenerator() and datagen.flow())
        Train the network on X, y
My first thought was to use model.fit_generator(generator=ImageDataGenerator().flow()), but this way I cannot modify my batch_size, and honestly I did not understand your solution.
Sorry for the long post; I'm still a novice in both Python and NNs, but I'm really a big fan of Keras ;)
Thnx!
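For reference, a minimal sketch of the loop described above, assuming a newer Keras where ImageDataGenerator exposes get_random_transform and apply_transform (fetch_sequence is a hypothetical stand-in for the database fetch; model, number_of_sequences and number_of_rotation come from the post):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=30)

for i in range(number_of_sequences):
    X_train, y_train = fetch_sequence(i)  # X_train: (N, height, width, channels)
    for j in range(number_of_rotation):
        # Draw ONE random transform and apply it to every frame, so all
        # frames of the sequence share the same augmentation.
        params = datagen.get_random_transform(X_train.shape[1:])
        X_aug = np.stack([datagen.apply_transform(frame, params)
                          for frame in X_train])
        model.train_on_batch(X_aug, y_train)  # batch size is N, varying per sequence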
Related
I have a number of multivariate time series that are produced by the same kind of process but:
are of significantly different lengths;
each time series is an independent instance, and the measurements are taken at different, quite random timestamps;
each time series is related at every timestamp to two targets.
In other words:
each time series has a shape of (n_timestamps, n_features)
each target series has a shape of (n_timestamps, 2).
To give an example, these could be treated as stocks of different companies, each described by a few features, where the targets at a given timestamp are the probabilities that the final price at the end of the year will be higher than x, except that we learn them directly from magically given ground-truth probabilities (instead of observed 0/1 responses).
I want to be able to predict the target at each time point and I wanted to give RNNs a try. However, I'm having trouble figuring out how I should arrange the data before passing it to Keras LSTM layers. The main things I'm wondering about are:
I want my RNN to use data starting from the beginning of the series to make a prediction at time t, not only the last k timestamps. I can't really use the whole history directly without exploding the gradient (it's too long), therefore I need a way to "remember" previously learned weights even though in reality my RNN will loop over the last k timestamps.
Each time series has a different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in length of examples can be as significant as 1000 vs 3000, this will result in many training examples that consist only of the padding value.
Since measurements are taken at different timestamps, I believe it may affect my network in the sense that it can't really learn that e.g. the last 10 timestamps are the most important. Or even if it can, those last 10 timestamps will span different real durations for each input time series... How big a problem is this? Should I start by resampling all examples to the same time points (e.g. by interpolating)?
My current thinking is that:
I can pad each of my example sequences to the same length (max(n_timestamps))
Create batches of short sequences of length k, where k represents the length of the loop of the RNN layer. Consequently, assuming I have 200 example sequences, with the longest one having 3000 timestamps, and my selected k is 50, it would result in 3000/50 = 60 batches of shape (200, 50). Or should I make 3000-1 batches, where one batch differs from the next only by one timestamp (i.e. while the first batch has timestamps 1 to 50, the next batch has timestamps 2 to 51, etc.)?
Since padding was used, I would need to use a Masking layer. Some (quite many) of the rows in the prepared batches would consist of inputs that should be ignored completely (as they would only have the padding value for all 50 elements).
Is this the correct way to prepare the data for my problem? Can it be done better, so as not to introduce bottlenecks such as learning from examples consisting only of the padding value (which should be ignored by the masking layer)? And how can I prepare the data to address points 1, 2 and 3 described above?
each time series has a shape of (n_timestamps, n_features)
each target series has a shape of (n_timestamps, 2).
Okay, this is pretty standard so far.
I want my RNN to use data starting from the beginning of the series to make a prediction at time t, not only the last k timestamps. I can't really use the whole history directly without exploding the gradient (it's too long), therefore I need a way to "remember" previously learned weights even though in reality my RNN will loop over the last k timestamps.
Check and make sure you actually need this. An RNN (or a Transformer) can use any or all of the history that you give it. But that's assuming that the history is useful for the predictions you're making.
I'd try training on standard-sized random clips of the data (like in this tutorial). I'd retrain it a few times with longer and longer clips and see if the model's performance plateaus before I run out of memory.
But in Keras it is relatively simple to do exactly the thing you're asking.
Keras RNNs (LSTM, GRU) have a return_state argument. It allows you to run the model over part of a sequence, pause, execute a training step, and then continue running exactly where you left off.
(The stateful argument is another mechanism that provides the same effect.)
The code ends up looking something like this:
import tensorflow as tf
from tensorflow import keras

class MyModel(keras.Model):
    ...
    def train_step(self, args):
        inputs, labels = args
        state = self.get_initial_state()
        # Consume the sequence in chunks of 100 steps, carrying the RNN state
        # across chunks and applying one gradient update per chunk.
        while tf.shape(inputs)[1] != 0:
            in_slice, inputs = inputs[:, :100], inputs[:, 100:]
            label_slice, labels = labels[:, :100], labels[:, 100:]
            with tf.GradientTape() as tape:
                result, state = self(in_slice, state)
                loss = self.loss(label_slice, result)
            vars = self.trainable_variables
            grads = tape.gradient(loss, vars)
            self.optimizer.apply_gradients(zip(grads, vars))
It may also be possible to use ForwardAccumulator to collect the gradients. In that case you don't need to cut the sequences into chunks, because the memory used by the forward accumulator doesn't grow with sequence length. I've never tried it, so I don't have example code.
Each time series has a different length, so I'm unsure how to make things compatible with each other. I'm aware of padding as an option, but since the difference in length of examples can be as significant as 1000 vs 3000, this will result in many training examples that consist only of the padding value.
That might be okay, just inefficient. You can make batches of similar sequence lengths using: Dataset.bucket_by_sequence_length
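For instance, a minimal sketch with tf.data (TF 2.6+; ds is assumed to be a Dataset yielding (series, targets) pairs, and the boundaries and batch sizes are illustrative):

import tensorflow as tf

batched = ds.bucket_by_sequence_length(
    element_length_func=lambda series, targets: tf.shape(series)[0],
    bucket_boundaries=[500, 1000, 2000],  # group similar lengths together
    bucket_batch_sizes=[64, 32, 16, 8],   # one more entry than boundaries
)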
Since measurements are taken at different timestamps, I believe it may affect my network in the sense that it can't really learn that e.g. the last 10 timestamps are the most important. Or even if it can, those last 10 timestamps will span different real durations for each input time series... How big a problem is this? Should I start by resampling all examples to the same time points (e.g. by interpolating)?
Interpolating to a fixed rate might be a reasonable thing to try if it doesn't make your data too much longer. Just think carefully about making predictions on interpolated values: there's some data leaking back in time from a future measurement.
Another approach would be to make the size of the time step a feature. If each input is tagged with how long it's been since the previous input, the model can learn how to handle small or large steps.
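As a sketch of that tagging (times and values are assumed arrays of shapes (n_timestamps,) and (n_timestamps, n_features)):

import numpy as np

dt = np.diff(times, prepend=times[0])                   # step size; 0 for the first row
tagged = np.concatenate([values, dt[:, None]], axis=1)  # time delta becomes an extra feature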
I can pad each of my example sequences to the same length (max(n_timestamps))
Yes. Pad, or make clips of a fixed size.
Create batches of short sequences of length k, where k represents the length of the loop of the RNN layer. Consequently, assuming I have 200 example sequences, with the longest one having 3000 timestamps, and my selected k is 50, it would result in 3000/50 = 60 batches of shape (200, 50).
That would line up with the code example I gave.
Or should I make 3000-1 batches, where one batch differs from the next only by one timestamp
Either way is fine. But if you want to carry the state over from batch to batch (I'm skeptical that you actually need the carry-over), then you need to do it chunk by chunk, not by single-stepping your window.
Since padding was used, I would need to use a Masking layer. Some (quite many) of the rows in the prepared batches would consist of inputs that should be ignored completely (as they would only have the padding value for all 50 elements).
Yeah, that'll be wasted computation, but it won't hurt anything.
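For completeness, a minimal sketch of the pad-then-mask setup (sequences is your list of (n_timestamps_i, n_features) arrays; the mask value is illustrative and must never occur in real data):

from tensorflow import keras

padded = keras.preprocessing.sequence.pad_sequences(
    sequences, padding='post', dtype='float32', value=-999.0)

model = keras.Sequential([
    keras.layers.Masking(mask_value=-999.0, input_shape=(None, n_features)),
    keras.layers.LSTM(64, return_sequences=True),  # an output at every timestep
    keras.layers.Dense(2),                         # two targets per timestep
])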
I am training a deep learning model using Mask RCNN from the following git repository: matterport/Mask_RCNN. I rely on heavy augmentation of my dataset (original dataset: 59 images of 1988x1355x3, each with > 80 annotations), which I store locally (necessary to evaluate the type/degree of augmentation against validation metrics). The augmented dataset contains 6000 images. These images vary in their x and y dimensions because of resolution reduction and affine transformations; I assume the differing x,y dimensions will not affect the final tests.
However, my Python kernel crashes whenever I load more than 'X' images to train the model.
Hence, I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for the next round. But I am not sure whether the results will be the same (read: the same, taking the stochastic nature of 'stochastic gradient descent' into account)?
I wonder whether the results would be the same if I don't iterate through the sub-datasets within each epoch, but instead train Y epochs on each (e.g. 20 for 'heads' only, 10 for 'all layers')?
Yet I am sure this is not the most efficient way of solving this issue. Ideas for improvement are welcome.
Note, I am not using keras.preprocessing.image.ImageDataGenerator(); as I have understood it, it randomly generates data and feeds it to the model, replacing the input for the epoch, whereas I would like to feed the whole dataset to the model.
I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the 'last' trained weights as the starting point for the next round. But I am not sure if the results will be the same?
You are doing the same thing ImageDataGenerator does (creating your own mini-batches), just less optimally. The result will be the same with respect to what?
If you mean with respect to a model that was trained with all the data in a single batch: most probably not, as a smaller batch means slower convergence. But this can be solved by training for more epochs.
Another issue is reproducibility. If you want to reproduce your model with the same results each time, just use seeds:
import random
import numpy as np
import tensorflow as tf

random.seed(1)         # Python's built-in RNG
np.random.seed(1)      # NumPy's RNG
tf.random.set_seed(1)  # TensorFlow's RNG
Another concept is gradient accumulation. It lets you train with a large effective batch size without keeping too many images in memory at a time:
https://github.com/CyberZHG/keras-gradient-accumulation
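Independent of that library, the idea itself is simple; a minimal sketch with a plain GradientTape loop (all names here are illustrative):

import tensorflow as tf

def train_with_accumulation(model, optimizer, loss_fn, dataset, accum_steps=4):
    # Sum gradients over accum_steps small batches, then apply them once,
    # emulating a batch accum_steps times larger without the memory cost.
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(dataset, start=1):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True)) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
        if step % accum_steps == 0:
            optimizer.apply_gradients(zip(accum, model.trainable_variables))
            accum = [tf.zeros_like(v) for v in model.trainable_variables]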
Finally, keras.preprocessing.image.ImageDataGenerator() in fact trains on the whole dataset; it just chooses a random sample at each step (you're doing the same thing with your so-called sub-datasets).
You can seed the ImageDataGenerator so it is reproducible and not entirely random.
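For example, flow() accepts a seed argument (a sketch; the data and augmentation settings are illustrative):

datagen = ImageDataGenerator(rotation_range=30)
batches = datagen.flow(x_train, y_train, batch_size=32, seed=1)  # fixed seed => reproducible augmentation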
I have multiple datasets, each with a different number of images (and different image dimensions) in it. In the training loop I want to load a batch of images randomly from among all the datasets but so that each batch only contains images from a single dataset. For example, I have datasets A, B, C, D and each has images 01.jpg, 02.jpg, … n.jpg (where n depends on the dataset), and let’s say the batch size is 3. In the first loaded batch, for example, I may get images [B/02.jpg, B/06.jpg, B/12.jpg], in the next batch [D/01.jpg, D/05.jpg, D/12.jpg], etc.
So far I have considered the following:
Use a different DataLoader for each dataset, e.g. dataloaderA, dataloaderB, etc., and then in each training loop randomly select one of the dataloaders and get a batch from it. However, this will require a for loop, and for a large number of datasets it would be very slow, since it can't be split among workers to run in parallel.
Use a single DataLoader with all of the images from all datasets together but with a custom collate_fn which will create a batch using only images from the same dataset. (I’m not sure how exactly to go about this.)
I have looked at the ConcatDataset class, but from its source code it looks like if I use it and try getting a new batch, the images in it will be mixed up from different datasets, which I don't want.
What would be the best way to do this? Thanks!
You can use ConcatDataset, and provide a batch_sampler to DataLoader.
concat_dataset = ConcatDataset((dataset1, dataset2))
ConcatDataset.cumulative_sizes will give you the boundaries between the datasets you have:
ds_indices = concat_dataset.cumulative_sizes
Now you can use ds_indices to create a batch sampler. See the source for BatchSampler for reference. Your batch sampler just has to return a list of N random indices that respect the ds_indices boundaries. This guarantees that your batches will have elements from the same dataset.
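A minimal sketch of such a sampler (SingleSourceBatchSampler and num_batches are illustrative names, not part of torch):

import random
from torch.utils.data import ConcatDataset, DataLoader, Sampler

class SingleSourceBatchSampler(Sampler):
    """Yields batches whose indices all come from one constituent dataset."""
    def __init__(self, cumulative_sizes, batch_size, num_batches):
        self.batch_size = batch_size
        self.num_batches = num_batches
        # Index range owned by each constituent dataset.
        self.ranges, start = [], 0
        for end in cumulative_sizes:
            self.ranges.append(range(start, end))
            start = end

    def __iter__(self):
        for _ in range(self.num_batches):
            source = random.choice(self.ranges)           # pick one dataset
            yield random.sample(source, self.batch_size)  # random indices within it

    def __len__(self):
        return self.num_batches

concat_dataset = ConcatDataset((dataset1, dataset2))
sampler = SingleSourceBatchSampler(concat_dataset.cumulative_sizes,
                                   batch_size=3, num_batches=100)
loader = DataLoader(concat_dataset, batch_sampler=sampler)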
I don't understand how to run the result of tf.train.batch for multiple epochs. It runs out after one pass, of course, and I don't know how to restart it.
Maybe I can repeat it using tile, which is complicated but described in full here.
If I could redraw a batch each time, that would be fine -- I would need batch_size random integers between 0 and num_examples. (My examples all sit in local RAM.) I haven't found an easy way to get these random draws at once.
Ideally there would be a reshuffle when the batches are repeated, too, but it makes more sense to me to run an epoch and then reshuffle, etc., instead of joining the training set to itself num_epochs times and then shuffling.
I think this is confusing because I'm not really building an input pipeline, since my input fits in memory, yet I still need batching, shuffling and multiple epochs, which possibly requires more knowledge of input pipelines.
tf.train.batch simply groups upstream samples into batches, and nothing more. It is meant to be used at the end of an input pipeline. Data and epochs are dealt with upstream.
For example, if your training data fits into a tensor, you could use tf.train.slice_input_producer to produce samples. This function has arguments for shuffling and epochs.
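A minimal sketch of that pipeline (TF 1.x queue API; features and labels are assumed in-memory arrays):

import tensorflow as tf

x, y = tf.train.slice_input_producer([features, labels],
                                     num_epochs=10, shuffle=True)
x_batch, y_batch = tf.train.batch([x, y], batch_size=32)

with tf.Session() as sess:
    # num_epochs is tracked with a local variable, so initialize those too.
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            xb, yb = sess.run([x_batch, y_batch])  # train on xb, yb here
    except tf.errors.OutOfRangeError:
        pass  # all epochs consumed
    finally:
        coord.request_stop()
        coord.join(threads)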
I'm working on a machine learning problem and one of the first steps in my pipeline is to convert the raw data into features. Since I'm working with very large datasets I constantly run into memory issues. These are the steps I follow - I'd like to know if there are some things that are fundamentally wrong with the approach. For context, I'm working with 10,000s of images on a Google Cloud machine with 64GB ram.
1 - Create an array to store features
Create a numpy array to store the features. The example below is for a feature array that will hold 14,000 image features, each with height/width of 288/512 and 3 color channels:
x = np.zeros((14000, 288, 512, 3))
2 - Read in raw images sequentially, process them, and put them into x
for idx, name in enumerate(raw_data_paths):
    image = functions.read_png(name)
    features = get_feature(image)
    x[idx] = features
3 - train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_fraction, random_state=42)
Questions
Am I approaching this completely incorrectly by using numpy arrays when there are more efficient storage mechanisms? I need to use the data later in a Keras neural net, so working with numpy arrays has been convenient.
I tend to get issues with steps (1) and (3) above. For step (1), I sometimes cannot execute that line because I run out of memory. Interestingly, I have no issues on my slow local computer (which I'm guessing is using virtual memory), but I do get issues on my Linux Google Compute instance, which has 64GB of memory. How can I fix this issue?
For step (3) I sometimes run out of memory, and I imagine it's because when that line is executed the memory needs double (x_train, y_train, x_test and y_test together would require as much memory as x and y). Is there a way to do this step without doubling the memory requirements?
1 - In Keras, you can use either a Python generator or a Keras Sequence for training. You then define the size of the batches.
You will train your model using fit_generator, passing the generator or the sequence. Set the parameter max_queue_size to at most 1 (batches are preloaded into the queue in parallel while the model works on the current batch, so a smaller queue keeps less data in memory).
2 - Do you really need to work with all 14000 images at once? Can't you make smaller batches?
You may use np.empty instead of np.zeros.
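Also note that the dtype of the big array matters: np.zeros defaults to float64, which for the shape in the question is roughly 49.5 GB on its own, enough to explain the crashes on a 64GB machine. A sketch:

import numpy as np

# 14000 * 288 * 512 * 3 elements:
#   float64 (default) ~49.5 GB, float32 ~24.8 GB, uint8 (raw pixels) ~6.2 GB
x = np.zeros((14000, 288, 512, 3), dtype=np.float32)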
3 - Splitting train and test data is just as easy as:
trainData = originalData[:someSize]
testData = originalData[someSize:]
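Basic slices like these are views, not copies, so the split itself allocates no extra memory. If you also need shuffling, one memory-light sketch (assuming x and y are NumPy arrays of the same length; reusing the seed replays the same permutation on both):

import numpy as np

rs = np.random.RandomState(42)
rs.shuffle(x)                   # in-place shuffle along the first axis, no copy
rs = np.random.RandomState(42)  # same seed => same permutation
rs.shuffle(y)

split = int(len(x) * 0.8)
x_train, x_test = x[:split], x[split:]  # views, no extra memory
y_train, y_test = y[:split], y[split:]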
Using generators or sequences
These are options for you to load your data in parts, and you can define those parts any way you want.
You can indeed save your data in smaller files to load each file per step.
Or you can also do the entire image preprocessing inside the generator, in small batches.
See this answer for a simple example of a generator: Training a Keras model on multiple feature files that are read in sequentially to save memory
You can create a generator from a list of image files, divide the list into batches of files, and at each step do the preprocessing:
def loadInBatches(batchSize, dataPaths):
    while True:  # Keras generators are expected to loop indefinitely
        for step in range(0, len(dataPaths), batchSize):
            x = np.empty((batchSize, 288, 512, 3))
            y = np.empty(???)  # fill in the shape of your labels
            for idx, name in enumerate(dataPaths[step:step + batchSize]):
                image = functions.read_png(name)
                features = get_feature(image)
                x[idx] = features
                y[idx] = ???  # fill in the label for this image
            yield (x, y)
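A hedged usage sketch (model, raw_data_paths and the numbers are assumed from the question):

gen = loadInBatches(batchSize=32, dataPaths=raw_data_paths)
model.fit_generator(gen,
                    steps_per_epoch=len(raw_data_paths) // 32,
                    epochs=10,
                    max_queue_size=1)  # keep at most one preloaded batch in memory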
I think a good solution, which can solve all 3 questions (more or less), is to use TensorFlow. It gives you the possibility of creating a queue of inputs. You can find more information in Threading and Queues. This is an easy-to-use way to scale your training.
Since you want to use a neural net later on, I suggest you spend some time learning TF and its queues, since they are a very powerful tool.