I am using the VGG19 architecture to extract features from my images. Here is my code to do so:
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing import image
import numpy as np
import glob

model = VGG19(include_top=False)
image_paths = glob.glob('train/*/*')

def extract_features(model, path):
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    features = model.predict(x)
    return features

for path in image_paths:
    features = extract_features(model, path)
I want to save each feature in a format I can later use for deep learning, in PyTorch or TensorFlow. Normally, I would just append each feature to a list and save that list as a CSV, but I've run into issues feeding lists into deep learning models. How do I save this data in a proper format?
I have two suggestions:
Save the features per file, e.g. for cat.png save them as cat.npy; and as you go over your list of files (cat.png, dog.png, snake.png), first check whether the feature file was already created and, if so, load the .npy file directly.
The second approach is using a dictionary data structure, where you use the index of the sample as the key, and the extracted feature as the value. Example:
index = 123
feature = extract_feature(...)
dictionary[index] = feature
You can save this dictionary with pickle, load it the next time, and pull the features for an index directly from the dictionary.
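A minimal sketch of both approaches (the features/ cache directory, the features.pkl file name, and reuse of the extract_features helper from the question are assumptions for illustration):

import os
import pickle
import numpy as np

# Approach 1: one .npy file per image, skipping features that were already extracted
os.makedirs('features', exist_ok=True)
for path in image_paths:
    feature_path = os.path.join('features', os.path.basename(path) + '.npy')
    if os.path.exists(feature_path):
        features = np.load(feature_path)           # reuse the cached feature
    else:
        features = extract_features(model, path)   # assumes extract_features returns the array
        np.save(feature_path, features)

# Approach 2: a single dictionary of index -> feature, saved with pickle
feature_dict = {i: extract_features(model, p) for i, p in enumerate(image_paths)}
with open('features.pkl', 'wb') as f:
    pickle.dump(feature_dict, f)

# Later: load the dictionary back and pull features by index
with open('features.pkl', 'rb') as f:
    feature_dict = pickle.load(f)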
I have a dataset of medical images (.dcm) which I can read into TensorFlow as a batch. However, the problem I am facing is that the labels of these images are in a .csv file. The .csv file contains two columns - image_path (location of the image) and image_labels (0 for no; 1 for yes). I would like to know how I can read the labels into a TensorFlow dataset batch-wise. I am using the following code to load the images batch-wise:
import tensorflow as tf
import tensorflow_io as tfio

def process_image(filename):
    image_bytes = tf.io.read_file(filename)
    image = tf.squeeze(
        tfio.image.decode_dicom_image(image_bytes, on_error='strict', dtype=tf.uint16),
        axis=0
    )
    x = tfio.image.decode_dicom_data(image_bytes, tfio.image.dicom_tags.PhotometricInterpretation)
    image = (image - tf.reduce_min(image)) / (tf.reduce_max(image) - tf.reduce_min(image))
    if x == "MONOCHROME1":
        image = 1 - image
    image = image * 255
    image = tf.cast(tf.image.resize(image, (512, 512)), tf.uint8)
    return image

# train_images is a list containing the locations of .dcm images
dataset = tf.data.Dataset.from_tensor_slices(train_images)
dataset = dataset.map(process_image, num_parallel_calls=4).batch(50)
Hence, I can load the images into a TensorFlow dataset, but I would like to know how I can load the image labels batch-wise as well.
Something like this instead of the last two lines should work:
# train_labels is a list of labels for each image, in the same order as train_images
dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
dataset = dataset.map(lambda x, y: (process_image(x), y), num_parallel_calls=4).batch(50)
Now the dataset can be passed to your network's .fit(), .predict(), and other methods:
model.fit(dataset, epochs=epochs, callbacks=callbacks)
Alternatively, you can create a second dataset containing the labels and then combine the two datasets with tf.data.Dataset.zip(). It works similarly to Python's native zip.
I prefer the first method since it feels a bit cleaner to me, and I can, for example, shuffle the filenames/labels first and only then parse the files, instead of doing it the other way around.
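For completeness, a minimal sketch of the zip() alternative (assuming the same train_images, train_labels, and process_image as above):

# Build one dataset of decoded images and one of labels, then pair them element-wise
image_ds = tf.data.Dataset.from_tensor_slices(train_images).map(process_image, num_parallel_calls=4)
label_ds = tf.data.Dataset.from_tensor_slices(train_labels)
dataset = tf.data.Dataset.zip((image_ds, label_ds)).batch(50)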
First of all, I would like to say that this is my first question on Stack Overflow, so I hope that the question as a whole respects the rules. I realize that the question is a bit long, but I would like to provide as much background and detail as possible.
I am currently developing a real-time binary image classification system based on TensorFlow 2.8.0 and I am quite new at it. Here are some of the peculiarities of the data that I have for the mentioned project:
Too big to fit in memory: I have more than 200 GB of data. Keep in mind that I have labeled only a small portion of it, but I want to write code that could manage the whole dataset in the future.
Some files are not directly compatible with TensorFlow: I have .FITS and .FIT files that cannot be opened directly with TensorFlow. Due to this issue, I use a library called Astropy to open these files.
The classes are very unbalanced.
After reading the official documentation and tutorials, I thought that, in order to load, preprocess and feed data to my CNN, the best option was to build an input pipeline using the tf.data.Dataset class due to the ease of opening FITS files. My general procedure follows this idea:
Get a list of file paths and split it into train, val and test partitions if desired.
Create a tf.data.Dataset with the from_tensor_slices() method
Shuffle the data (before the heavier reading and image processing operations)
Read and process every path with map()
Batch and prefetch
Here are some code fragments in case they help to understand my goal:
(...)
import config as cfg  # Custom .py file
import tensorflow as tf

# x_train, x_val and x_test are previously split file paths lists
train_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_train])
val_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_val])
test_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_test])

train_ds = configure_tf_ds(train_ds)
val_ds = configure_tf_ds(val_ds)
test_ds = configure_tf_ds(test_ds)

def configure_tf_ds(self, tf_ds, buf_size):
    # reshuffle_each_iteration=True ensures that data is shuffled each time it is iterated
    tf_ds = tf_ds.shuffle(buffer_size=cfg.SHUFFLE_BUF_SIZE, seed=cfg.seed, reshuffle_each_iteration=True)
    tf_ds = tf_ds.map(lambda x: tf.py_function(self.process_path, [x], [self.img_dtpye, self.label_dtype]))
    tf_ds = tf_ds.batch(self.batch_size)
    tf_ds = tf_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return tf_ds

def process_path(self, file_path):
    # Labels are extracted from the file path, not relevant for my problem
    label = get_label(file_path)
    path = bytes.decode(file_path.numpy()).lower()
    img = None
    # Open and process images depending on their file paths' extension: FITS, FIT, JPG
    if "fit" in path:
        img = tf.py_function(func=self.decode_fits, inp=[file_path], Tout=self.img_dtpye)
    else:
        img = tf.py_function(func=self.decode_img, inp=[file_path], Tout=self.img_dtpye)
    return img, label

model.fit(train_ds, epochs=50, validation_data=val_ds)
# Then, I would like to obtain predictions, plot results, and so on, but knowing which file paths I am working with
(...)
Following the previous idea, I have successfully created and tested different types of pipelines for different types of partitions of my dataset: unlabeled (remember that only a portion of the data is labeled), labeled and weighted labeled (I wanted to see if my models improve by specifying class weights when training).
However, in order to monitor results and make proper adjustments to my model, I would like to retrieve the usual predictions, real labels and images next to the file paths, while preserving the ability to shuffle the data after every epoch.
I have managed to solve this if I do not shuffle the data with .shuffle(reshuffle_each_iteration=True), but according to several sources, model performance is supposed to improve if the data is shuffled after each epoch.
I have read different posts in stackOverflow related to my question. I will list those posts next to the problems that I have found for my particular use case:
Solution 1: My dataset cannot be fed to the model as X, y because it is a tf.data.Dataset
Solution 2: I want to obtain the image and the label too.
Solution 3: This works, but it would not respect the expected tf.data.Dataset format in the future .fit() call as stated here:
A tf.data dataset. Should return a tuple of either (inputs, targets)
or (inputs, targets, sample_weights)
I have also tried to keep a separate tf.data.Dataset with only the file paths, but if I call the shuffle method with the reshuffle_each_iteration=True option on both tf.data.Dataset instances, the order of their elements does not match, even if I set the same seed.
In short, is it possible to achieve what I want? If so, how should I proceed?
Thank you very much in advance.
Preprocess your data into three TFRecord files, one each for training, testing, and validation. Then you can shuffle and never cross records between the sets. This also speeds up data loading and can be done once and reused many times while playing with hyperparameters.
Here is an example of how you can preprocess and split your data. Your actual data will have a different structure; this example has "encdata", a 2048-wide vector of vggface2 face-encoding data. It assumes you have a single directory of data, with subdirectories named for each class and containing all the files for that class.
import tensorflow as tf
import numpy as np
import pickle
import sys
import os

# 80% to training, 10% to testing, 10% to validation
validation_portion = 1
testing_portion = 1
training_portion = 8
file_cycle_total = validation_portion + testing_portion + training_portion

# Where to store the TFRecord files
training_tfrecord_path = '/var/tmp/xtraining_tfrecords.tfr'
testing_tfrecord_path = '/var/tmp/xtesting_tfrecords.tfr'
validation_tfrecord_path = '/var/tmp/xvalidation_tfrecords.tfr'

# Where we keep the encodings
FACELIB_DIR = '/aimiassd/Datasets/LabeledAstroFaces'

# Get list of all classes from all facelib dirs
classNames = sorted([x for x in os.listdir(FACELIB_DIR) if os.path.isdir(os.path.join(FACELIB_DIR, x)) and not x.startswith('.')])
classStrToInt = dict([(x, i) for i, x in enumerate(classNames)])
print('Found %d different classNames for labels\n' % len(classNames))

# Create our record writers
train_file_writer = tf.io.TFRecordWriter(training_tfrecord_path)
test_file_writer = tf.io.TFRecordWriter(testing_tfrecord_path)
val_file_writer = tf.io.TFRecordWriter(validation_tfrecord_path)

# Go over every enc2048 file in the facelibraries
cnt_records_written = [0, 0, 0]
for CN in classNames:
    class_int = classStrToInt[CN]
    # Get a list of all the encoding files
    encfiles = sorted(filter((lambda x: x.endswith('.enc2048')), os.listdir(os.path.join(FACELIB_DIR, CN))))
    # For each encoding file, read the encoding data and write it to the various tfrecords
    for i, F in enumerate(encfiles):
        file_path = os.path.join(FACELIB_DIR, CN, F)
        with open(file_path, 'rb') as fin:
            encdata, _ = pickle.loads(fin.read())  # encodings, source_image_name
        # Turn encdata into a tf.train.Example and serialize it for writing
        record_bytes = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(float_list=tf.train.FloatList(value=encdata)),
            "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_int])),
        })).SerializeToString()
        # Write it out with the appropriate record writer
        remainder = i % file_cycle_total
        if remainder < validation_portion:
            val_file_writer.write(record_bytes)
            cnt_records_written[2] += 1
        elif remainder < validation_portion + testing_portion:
            test_file_writer.write(record_bytes)
            cnt_records_written[1] += 1
        else:
            train_file_writer.write(record_bytes)
            cnt_records_written[0] += 1

print('Writing records done.')
print('Wrote %d training, %d testing, %d validation records' %
      (cnt_records_written[0], cnt_records_written[1], cnt_records_written[2]))

train_file_writer.close()
test_file_writer.close()
val_file_writer.close()

print('Reading data back out...')

# Function to turn a serialized TFRecord back into a tf.train.Example
def decode_fn(record_bytes):
    return tf.io.parse_single_example(
        # Data
        record_bytes,
        # Schema
        {"x": tf.io.FixedLenFeature([2048], dtype=tf.float32),
         "y": tf.io.FixedLenFeature([], dtype=tf.int64)}
    )

# Read and deserialize the datasets
train_ds = tf.data.TFRecordDataset([training_tfrecord_path]).map(decode_fn)
test_ds = tf.data.TFRecordDataset([testing_tfrecord_path]).map(decode_fn)
validation_ds = tf.data.TFRecordDataset([validation_tfrecord_path]).map(decode_fn)

# Use a dataset
count = 0
for batch in tf.data.TFRecordDataset([training_tfrecord_path]).map(decode_fn):
    print(batch)
    count += 1
    if count > 4:
        sys.exit(0)

print('Done.')
Note how, as the data is being processed into TFRecords, it is alternately written into the three datasets. Validation and testing entries are written first, to ensure that classes with very few samples still get something into the validation and testing datasets. This is controlled by the variables at the top, validation_portion, testing_portion, and training_portion; adjust them to your preferences.
Finally, at the end, the TFRecords are re-read and used to build three new tf.data.Dataset objects, which can be fed to model.fit() and friends. The example code just prints a few records to show the data has the correct, original shape.
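As a minimal sketch of that last step (the batch size, shuffle buffer, epoch count, and the model itself are assumptions), the decoded datasets can be shuffled each epoch and fed to fit() like this:

def to_tuple(d):
    # decode_fn yields dicts {"x": ..., "y": ...}; fit() expects (inputs, targets)
    return d["x"], d["y"]

BATCH_SIZE = 32  # assumption for illustration

train_ds = (train_ds.map(to_tuple)
            .shuffle(buffer_size=10000, seed=42, reshuffle_each_iteration=True)
            .batch(BATCH_SIZE)
            .prefetch(tf.data.AUTOTUNE))
validation_ds = validation_ds.map(to_tuple).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

model.fit(train_ds, validation_data=validation_ds, epochs=50)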
I'm trying to make an LSTM model. The data comes from a csv file that contains values for multiple stocks.
I can't use all the rows as they appear in the file to make sequences because each sequence is only relevant in the context of its own stock, so I need to select the rows for each stock and make the sequences based on that.
I have something like this:
for stock in stocks:
    stock_df = df.loc[df['symbol'] == stock].copy()
    target = stock_df.pop('price')
    x = np.array(stock_df.values)
    y = np.array(target.values)
    sequence = TimeseriesGenerator(x, y, length=4, sampling_rate=1, batch_size=1)
That works fine, but then I want to merge each of those sequences into a bigger one that I will use for training and that contains the data for all the stocks.
It is not possible to use append or merge because the function returns a generator object, not a NumPy array.
EDIT: New answer:
So what I've ended up doing is all the preprocessing manually, saving an .npy file for each stock containing the preprocessed sequences, and then making batches with a manually written generator like this:
class seq_generator():

    def __init__(self, list_of_filepaths):
        self.usedDict = dict()
        for path in list_of_filepaths:
            self.usedDict[path] = []

    def generate(self):
        while True:
            path = np.random.choice(list(self.usedDict.keys()))
            stock_array = np.load(path)
            random_sequence = np.random.randint(stock_array.shape[0])
            if random_sequence not in self.usedDict[path]:
                self.usedDict[path].append(random_sequence)
                yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=(tf.float32, tf.float32),
                                               output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to preprocessed .npy data.
This will:
Load a random stock's preprocessed .npy data
Pick a sequence at random
Check if the index of the sequence has already been used in usedDict
If not:
Append the index of that sequence to usedDict, to keep track so that the same data is not fed to the model twice
Yield the sequence
This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
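Since the generator loops forever, the resulting dataset is infinite, so a usage sketch would pass steps_per_epoch to fit() (the values and the model itself are assumptions):

steps_per_epoch = 500  # assumption: how many batches to draw per epoch
model.fit(train_dataset, steps_per_epoch=steps_per_epoch, epochs=10)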
Original answer:
I think the answer from @TF_Support slightly misses the point. If I understand your question, it's not that you want to train one model per stock; you want one model trained on the entire dataset.
If you have enough memory you could manually create the sequences and hold the entire dataset in memory. The issue I'm facing is similar; I simply can't hold everything in memory: Creating a TimeseriesGenerator with multiple inputs.
Instead I'm exploring the possibility of preprocessing all data for each stock separately, saving them as .npy files, and then using a generator to load a random sample of those .npy files to batch data to the model; I'm not entirely sure how to approach this yet, though.
For the scenario where you want to merge each of those sequences into a bigger one that contains the data for all the stocks and will be used for training:
You can append the created TimeseriesGenerators to a Python list.
stock_timegenerators = []
for stock in stocks:
    stock_df = stock.copy()
    features = stock_df.pop('symbol')
    target = stock_df.pop('price')
    x = np.array(stock_df.values)
    y = np.array(target.values)
    # sequence = TimeseriesGenerator(x, y, length = 4, sampling_rate = 1, batch_size = 1)
    stock_timegenerators.append(TimeseriesGenerator(x, y, length=4, sampling_rate=1, batch_size=1))
The result is a list of TimeseriesGenerator objects that you can use by iterating over the list or referencing by index.
[<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator at 0x7eff62c699b0>,
<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator at 0x7eff62c6eba8>,
<tensorflow.python.keras.preprocessing.sequence.TimeseriesGenerator at 0x7eff62c782e8>]
Also, having multiple Keras TimeseriesGenerators means that you're training a separate LSTM model for each stock.
You can use this approach to deal with multiple models efficiently as well.
lstm_models = []
for time_series_gen in stock_timegenerators:
    # lstm_models.append(create_model()) : you could create everything using functions
    # or in the loop like this:
    model = Sequential()
    model.add(LSTM(32, input_shape=(n_input, n_features)))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    model.fit(time_series_gen, steps_per_epoch=1, epochs=5)
    lstm_models.append(model)
This outputs a list of models that can easily be referenced by index.
[<tensorflow.python.keras.engine.sequential.Sequential at 0x7eff62c7b748>,
<tensorflow.python.keras.engine.sequential.Sequential at 0x7eff6100e160>,
<tensorflow.python.keras.engine.sequential.Sequential at 0x7eff63dc94a8>]
This way you can create multiple LSTM models with different TimeseriesGenerators for different stocks.
Hope this helps you.
I'm trying to read the CIFAR-10 dataset from 6 .bin files and then create an initializable_iterator. This is the site I downloaded the data from, and it also contains a description of the structure of the binary files. Each file contains 2500 images. The resulting iterator, however, only generates one tensor for each file, a tensor of size (2500, 3073). Here is my code:
import tensorflow as tf

filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))

iter_ = image_dataset.make_initializable_iterator()
next_file_data = iter_.get_next()
next_file_data = tf.reshape(next_file_data, [-1, 3073])
next_file_img_data, next_file_labels = next_file_data[:, :-1], next_file_data[:, -1]
next_file_img_data = tf.reshape(next_file_img_data, [-1, 32, 32, 3])

init_op = iter_.initializer

with tf.Session() as sess:
    sess.run(init_op)
    print(next_file_img_data.eval().shape)
_______________________________________________________________________
>> (2500,32,32,3)
The first two lines are based on this answer. I would like to be able to specify the number of images returned by get_next() using batch(), rather than it being the number of images in each .bin file, which here is 2500.
There has already been a question about flattening a dataset here, but the answer is not clear to me. In particular, the question seems to contain a code snippet from a class function which is defined elsewhere, and I am not sure how to implement it.
I have also tried creating the dataset with tf.data.Dataset.from_tensor_slices(), replacing the first line above with
import os

filenames = [os.path.join('cifar-10-batches-bin', f) for f in os.listdir("cifar-10-batches-bin") if f.endswith('.bin')]
filename_dataset = tf.data.Dataset.from_tensor_slices(filenames)
but this didn't solve the problem.
Any help would be very much appreciated. Thanks.
I am not sure how your .bin files are structured. I am assuming 32*32*3 = 3072 values per image, so the data in each file is a multiple of 3072. However, for any other structure the operations would be similar, so this can still serve as a guide.
You could do a series of mapping operations:
import tensorflow as tf

filename_dataset = tf.data.Dataset.list_files("cifar-10-batches-bin/*.bin")
image_dataset = filename_dataset.map(lambda x: tf.decode_raw(tf.read_file(x), tf.float32))
image_dataset = image_dataset.map(lambda x: tf.reshape(x, [-1, 32, 32, 3]))  # Reshape each file's data to (2500, 32, 32, 3)
image_dataset = image_dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))  # Yields individual tensors of shape (32, 32, 3)
image_dataset = image_dataset.batch(batch_size)  # Now you can define your batch size
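A minimal sketch of consuming this dataset in the same TF1-style session as in the question (it assumes batch_size was defined above, e.g. batch_size = 64):

iter_ = image_dataset.make_initializable_iterator()
next_batch = iter_.get_next()  # tensor of shape (batch_size, 32, 32, 3)

with tf.Session() as sess:
    sess.run(iter_.initializer)
    print(sess.run(next_batch).shape)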
I am trying a YOLO model in Python.
To process the data and annotations, I'm taking the data in batches.
batchsize = 50
#boxList = []
#boxArr = np.empty(shape=(0, 26, 5))
for i in range(0, len(box_list), batchsize):
    boxList = box_list[i:i+batchsize]
    imagesList = image_list[i:i+batchsize]
    # convert the annotations from VOC format
    convertedBox = np.array([np.array(get_boxes_for_id(box_l)) for box_l in boxList])
    # pre-process the images and annotations
    image_data, boxes = process_input_data(imagesList, max_boxes, convertedBox)
    boxes = np.array(list(itertools.chain.from_iterable(boxes)))
    detectors_mask, matching_true_boxes = get_detector_mask(boxes, anchors)
After this, I want to pass my data to the model for training.
When I append to a list, I get a memory error because of the array size, and when I append to an array, I get a dimensionality error because of the shape.
How can I train on this data, and should I use model.fit() or model.train_on_batch()?
If you are using Keras to train your model on a bunch of images, you can use a train generator and a validation generator; all you have to do is put your images in their respective class folders. Take a look at the sample sketch below, and also at this link, it may help you: https://keras.io/preprocessing/image/. I hope I have answered your question, unless I did not understand it.
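A minimal sketch of that generator approach (the directory names, image size, batch size, and class_mode are assumptions; it presumes images are arranged as train/<class>/... and val/<class>/...):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    'train',                  # assumed directory with one subfolder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary')

validation_generator = val_datagen.flow_from_directory(
    'val',                    # assumed validation directory
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary')

model.fit(train_generator, validation_data=validation_generator, epochs=10)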