I'm trying to load files into a TensorFlow Dataset where some files might be missing (in which case I want to replace them with zeroes).
The structure of directories that I'm trying to read data from is as follows:
|-data
|---sensor_A
|-----1.dat
|-----2.dat
|-----3.dat
|---sensor_B
|-----1.dat
|-----2.dat
|-----3.dat
.dat files are .csv files with a space as the separator. The content of every file is a single, multi-row observation where the number of columns is constant (say 4) and the number of rows is unknown (timeseries data).
I've successfully managed to read every sensor data to a separate TensorFlow Dataset with the following code:
import os
import tensorflow as tf
tf.enable_eager_execution()
data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]
for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]

    dataset = tf.data.Dataset.from_tensor_slices((filenames,))

    def _parse_function_internal(filename):
        number_of_columns = 4
        single_observation = tf.read_file(filename)
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
which successfully prints out the content of all three files for every sensor.
My problem is that some timestamps in the dataset might be missing. For instance, if the file 1.dat in the sensor_A directory is missing, I get this error:
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A\1.dat : The system cannot find the file specified.
; No such file or directory
[[{{node ReadFile}}]] [Op:IteratorGetNextSync]
which is thrown in this line:
for el in dataset:
What I've tried to do is to surround the call to the tf.read_file() function with a try block, but obviously it doesn't work, because the error is not thrown when tf.read_file() is called but when the value is fetched from the dataset. Later I want to pass this dataset to a Keras model, so I can't just surround the iteration with a try block. Is there any workaround? Is that even supported?
Thanks!
I managed to solve the problem; sharing the solution in case someone else is struggling with it as well. I had to use an additional list of booleans specifying whether each file actually exists and pass it into the mapper. Then, using the tf.cond() function, we decide whether to read the file or mock the data with zeroes (or any other logic).
import os
import tensorflow as tf
tf.enable_eager_execution()
data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]
for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    files_exist = [os.path.isfile(filename) for filename in filenames]

    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))

    def _parse_function_internal(filename, file_exist):
        number_of_columns = 4
        single_observation = tf.cond(file_exist,
                                     lambda: tf.read_file(filename),
                                     lambda: ' '.join(['0.0'] * number_of_columns))
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
This was working perfectly fine for me earlier today, but it suddenly started behaving very strangely when I restarted my notebook.
I have a tf dataset that takes in numpy files and their corresponding labels as input, like so: tf.data.Dataset.from_tensor_slices((specgram_files, labels)).
When I take 1 item using for item in ds.take(1): print(item) I get the expected output, which is a tuple of tensors, where the first tensor contains the name of the numpy file as a bytes string and the second tensor contains the encoded label.
I then have a function that reads the file using np.load() and produces a numpy array, which is then returned. This function is passed into the map() method, and it looks like this:
ds = ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)
where read_npy_file looks like this:
def read_npy_file(data):
    # 'data' stores the file name of the numpy binary file storing the features of a particular sound file
    # as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
    data = np.load(data.decode())
    return data.astype(np.float32)
As you can see, the mapping should create another tuple of tensors, where the first tensor is the numpy array and the second tensor is the label, untouched. This worked perfectly earlier, but now it gives the most bizarre behaviour. I placed print statements in the read_npy_file() function to see if the correct data was being passed in. I expected it to pass a single bytes string, but it instead produces this output when I call print(data) in the read_npy_file() function and take 1 item from the dataset to trigger one mapping using ds.take(1):
b'./challengeA_data/log_spectrogram/2603ebb3-3cd3-43cc-98ef-0c128c515863.npy'b'./challengeA_data/log_spectrogram/fab6a266-e97a-4935-a0c3-444fc4426fc5.npy'b'./challengeA_data/log_spectrogram/93014682-60a2-45bd-9c9e-7f3c97b83be9.npy'b'./challengeA_data/log_spectrogram/710f2430-5da3-4822-a252-6ad3601b92d9.npy'b'./challengeA_data/log_spectrogram/e757058c-91de-4381-8184-65f001c95647.npy'
b'./challengeA_data/log_spectrogram/38b12689-04ba-422b-a972-5856b05ca868.npy'
b'./challengeA_data/log_spectrogram/7c9ccc04-a2d2-4eec-bafd-0c97b3658c26.npy'b'./challengeA_data/log_spectrogram/c7cc3520-7218-4d07-9f0a-6bd7bb90a551.npy'
b'./challengeA_data/log_spectrogram/21f6060a-9766-4810-bd7c-0437f47ccb98.npy'
I didn't modify any formatting of the output.
I'd greatly appreciate any help. TFDS has been an absolute nightmare to work with haha.
Here's the full code
def read_npy_file(data):
    # 'data' stores the file name of the numpy binary file storing the features of a particular sound file
    # as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
    print(data)
    data = np.load(data.decode())
    return data.astype(np.float32)

specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))

specgram_ds = specgram_ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)

num_files = len(train_df)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)

specgram_ds = specgram_ds.shuffle(buffer_size=1000)
specgram_train_ds = specgram_ds.take(num_train)
specgram_test_ds = specgram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)

# iterating over one item to trigger the mapping function
for item in specgram_ds.take(1):
    pass
Thanks!
Your logic seems to be fine. You are actually just observing the behavior of tf.data.AUTOTUNE in combination with print(*). According to the docs:
If the value tf.data.AUTOTUNE is used, then the number of parallel calls is set dynamically based on available CPU.
You can run the following code a few times to observe the changes:
import tensorflow as tf
import numpy as np

def read_npy_file(data):
    # 'data' stores the file name of the numpy binary file storing the features of a particular sound file
    # as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
    print(data)
    data = np.load(data.decode())
    return data.astype(np.float32)

# Create dummy data
for i in range(4):
    np.save('{}-array'.format(i), np.random.random((5,5)))

specgram_files = ['/content/0-array.npy', '/content/1-array.npy', '/content/2-array.npy', '/content/3-array.npy']
labels = [1, 0, 0, 1]

specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))
specgram_ds = specgram_ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)

num_files = len(specgram_files)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)

specgram_ds = specgram_ds.shuffle(buffer_size=1000)
specgram_train_ds = specgram_ds.take(num_train)
specgram_test_ds = specgram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)

for item in specgram_ds.take(1):
    pass
Also see this. Finally, note that using tf.print instead of print should get rid of any side effects.
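For example, a hedged sketch of the same mapping function with tf.print swapped in (everything else unchanged from the snippet above):
import numpy as np
import tensorflow as tf

def read_npy_file(data):
    # 'data' arrives here as a numpy bytes object via tf.numpy_function;
    # tf.print emits one line per call, which should avoid the interleaved
    # output produced by the built-in print when map() runs calls in parallel.
    tf.print(data)
    data = np.load(data.decode())
    return data.astype(np.float32)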
I was working on a project that requires me to add a CSV file in two places in the code. I have seen a kinda similar problem here at Stack Overflow, but their problem was due to the old Python version 2.5. My Python version is 3.8.
import csv
from tensorflow.keras.datasets import mnist
import numpy as np
def load_az_dataset("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
    # initialize the list of data and labels
    data = []
    labels = []
    # loop over the rows of the A-Z handwritten digit dataset
    for row in open("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
        # parse the label and image from the row
        row = row.split(",")
        label = int(row[0])
        image = np.array([int(x) for x in row[1:]], dtype="uint8")
        # images are represented as single channel (grayscale) images
        # that are 28x28=784 pixels -- we need to take this flattened
        # 784-d list of numbers and reshape them into a 28x28 matrix
        image = image.reshape((28, 28))
        # update the list of data and labels
        data.append(image)
        labels.append(label)
    # convert the data and labels to NumPy arrays
    data = np.array(data, dtype="float32")
    labels = np.array(labels, dtype="int")
    # return a 2-tuple of the A-Z data and labels
    return (data, labels)
It's showing this syntax error
The syntax error is caused by the fact that the file path is in the parameter list in the function definition. This is the culprit:
def load_az_dataset("C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
You have no parameters listed in the function definition. You just have a literal string.
Furthermore, you should also either be using raw strings: r"..." or escaping your backslashes, as others have mentioned.
Finally, you should be using the with open(file_path) as f: pattern to open your file.
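Putting those three fixes together, a hedged sketch (the parameter name and default path are just illustrative):
import numpy as np

def load_az_dataset(dataset_path=r"C:\A_Z_Handwritten_Data\A_Z_Handwritten_Data.csv"):
    # initialize the list of data and labels
    data = []
    labels = []
    # the with-pattern guarantees the file is closed, even if parsing fails
    with open(dataset_path) as f:
        for row in f:
            row = row.split(",")
            label = int(row[0])
            image = np.array([int(x) for x in row[1:]], dtype="uint8").reshape((28, 28))
            data.append(image)
            labels.append(label)
    # return a 2-tuple of the A-Z data and labels
    return (np.array(data, dtype="float32"), np.array(labels, dtype="int"))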
The syntax error is caused because you are passing a literal string in the method declaration of load_az_dataset.
You need to define the parameter to the function as:
def load_az_dataset(fileName):
Further, if you want to add that file as the default value for the parameter then use:
def load_az_dataset(fileName="C:\\A_Z_Handwritten_Data\\A_Z_Handwritten_Data.csv"):
Also, unrelated to the problem, you need to escape the \ with another \.
Try:
open("C:\\A_Z_Handwritten_Data\\A_Z_Handwritten_Data.csv")
When fitting a convolutional neural network for an image classification problem, in order to use functions like
flow_from_directory()
image_dataset_from_directory()
Keras expects the train data to be stored in this way:
\data:
\training
\class_1
"imag1.jpg"
"imag2.jpg"
...
\class_2
"imag1.jpg"
"imag2.jpg"
...
....
Instead, I have a dataset with all the images stored in a single folder and a .json file which contains a map from the file names to the labels. Something like
{"18985.jpg": 0, "43358.jpg": 0, ... "13163.jpg": 1 ....}
Is there an efficient way to use this dataset anyway?
The solution I advise would be to write a script to build the folders for you (a sketch of such a script follows the steps below):
step 1 : open the json, and get a list of unique categories
step 2 : iterate over the list of unique categories and create a folder under training
step 3 : iterate over the json, and copy the file to the right folder (that you created already)
step 4 : load everything using image_dataset_from_directory
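A minimal sketch of such a script, assuming all images sit in a single images/ folder next to a data.json like the one shown in the question (paths are illustrative):
import json
import os
import shutil
import tensorflow as tf

with open("data.json") as f:
    mapping = json.load(f)  # e.g. {"18985.jpg": 0, "43358.jpg": 0, ...}

# step 1 + 2: one sub-folder per unique category under "training"
for category in set(mapping.values()):
    os.makedirs(os.path.join("training", str(category)), exist_ok=True)

# step 3: copy every file into the folder of its label
for filename, category in mapping.items():
    shutil.copy(os.path.join("images", filename),
                os.path.join("training", str(category), filename))

# step 4: load everything with image_dataset_from_directory
ds = tf.keras.preprocessing.image_dataset_from_directory("training")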
Another one would be to use from_generator
import json
import tensorflow as tf

# Opening JSON file
f = open('data.json',)

# returns JSON object as
# a dictionary
data = json.load(f)

def gen():
    for (image_path, label) in data.items():
        image = tf.keras.preprocessing.image.load_img(image_path)
        input_arr = tf.keras.preprocessing.image.img_to_array(image)
        yield (input_arr, label)

dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.float32, tf.float32),
    # shapes must match what gen yields: a single (H, W, C) image and a scalar label
    output_shapes=([None, None, 3], []))
Personally I'll go with the first one ^^
I am currently trying to train a neural network model on MRI scan images. The images are in a NIfTI (.nii) file format which I don't believe tensorflow or keras has the inherent ability to read. I have a python package that allows me to read these files in python, however I am having trouble figuring out how to interface this package with tensorflow. I first create a tf.data.Dataset object containing the paths to each of my MRI scans, and then I try to use the Dataset.map() function to read each of the files and create a dataset of image, label pairs. My problem is that the tf.data.Dataset object seems to store each filename in a Tensor rather than a string, but the function that can read the .nii filetype cannot read a Tensor. Is there a way to convert the filepath string tensors into readable strings to allow me to open the files? If not, is there a better way of creating the dataset?
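For reference, one hedged sketch of the kind of bridge being asked about here: tf.py_function hands the mapper eager tensors, so .numpy().decode() recovers a plain path string that nibabel can open (the file names, labels and shapes below are illustrative, not from the original post):
import nibabel as nib
import numpy as np
import tensorflow as tf

def _load_nii(path_tensor):
    # Inside tf.py_function the argument is an eager tensor, so .numpy() gives
    # the underlying bytes and decode() turns them into a regular path string.
    path = path_tensor.numpy().decode()
    return nib.load(path).get_fdata().astype(np.float32)

def _parse(path, label):
    volume = tf.py_function(_load_nii, [path], tf.float32)
    return volume, label

# illustrative placeholders for the real scan paths and labels
dataset = tf.data.Dataset.from_tensor_slices((["scan_001.nii", "scan_002.nii"], [0, 1]))
dataset = dataset.map(_parse)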
For the benefit of the community, here is the code from the link shared by "agrits" in the comments section.
# Creates a .tfrecord file from a directory of nifti images.
# This assumes your niftis are sorted into subdirs by directory, and a regex
# can be written to match volume-filenames and label-filenames
#
# USAGE
# python ./genTFrecord.py <data-dir> <input-vol-regex> <label-vol-regex>
# EXAMPLE:
# python ./genTFrecord.py ./buckner40 'norm' 'aseg' buckner40.tfrecords
#
# Based off of this:
# http://warmspringwinds.github.io/tensorflow/tf-slim/2016/12/21/tfrecords-guide/
# imports
import numpy as np
import tensorflow as tf
import nibabel as nib
import os, sys, re
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def select_hipp(x):
    x[x != 17] = 0
    x[x == 17] = 1
    return x

def crop_brain(x):
    x = x[90:130,90:130,90:130] #should take volume zoomed in on hippocampus area
    return x

def preproc_brain(x):
    x = select_hipp(x)
    x = crop_brain(x)
    return x

def listfiles(folder):
    for root, folders, files in os.walk(folder):
        for filename in folders + files:
            yield os.path.join(root, filename)

def gen_filename_pairs(data_dir, v_re, l_re):
    unfiltered_filelist = list(listfiles(data_dir))
    input_list = [item for item in unfiltered_filelist if re.search(v_re, item)]
    label_list = [item for item in unfiltered_filelist if re.search(l_re, item)]
    print("input_list size: ", len(input_list))
    print("label_list size: ", len(label_list))
    if len(input_list) != len(label_list):
        print("input_list size and label_list size don't match")
        raise Exception
    return zip(input_list, label_list)
# parse args
data_dir = sys.argv[1]
v_regex = sys.argv[2]
l_regex = sys.argv[3]
outfile = sys.argv[4]
print("data_dir: ", data_dir)
print("v_regex: ", v_regex )
print("l_regex: ", l_regex )
print("outfile: ", outfile )
# Generate a list of (volume_filename, label_filename) tuples
filename_pairs = gen_filename_pairs(data_dir, v_regex, l_regex)
# To compare original to reconstructed images
original_images = []
writer = tf.python_io.TFRecordWriter(outfile)
for v_filename, l_filename in filename_pairs:

    print("Processing:")
    print(" volume: ", v_filename)
    print(" label: ", l_filename)

    # The volume, in nifti format
    v_nii = nib.load(v_filename)
    # The volume, in numpy format
    v_np = v_nii.get_data().astype('int16')
    # Crop the volume
    v_np = crop_brain(v_np)
    # The volume, in raw string format
    v_raw = v_np.tostring()

    # The label, in nifti format
    l_nii = nib.load(l_filename)
    # The label, in numpy format
    l_np = l_nii.get_data().astype('int16')
    # Preprocess the volume
    l_np = preproc_brain(l_np)
    # The label, in raw string format
    l_raw = l_np.tostring()

    # Dimensions
    x_dim = v_np.shape[0]
    y_dim = v_np.shape[1]
    z_dim = v_np.shape[2]
    print("DIMS: " + str(x_dim) + str(y_dim) + str(z_dim))

    # Put in the original images into array for future check for correctness
    # Uncomment to test (this is a memory hog)
    ########################################
    # original_images.append((v_np, l_np))

    data_point = tf.train.Example(features=tf.train.Features(feature={
        'image_raw': _bytes_feature(v_raw),
        'label_raw': _bytes_feature(l_raw)}))

    writer.write(data_point.SerializeToString())

writer.close()
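As a hedged follow-up (not part of the linked script), the written records can be read back into a tf.data pipeline with the matching TF 1.x parsing ops; the reshape below assumes the 40x40x40 crop produced by crop_brain(), and the file name comes from the usage example in the header:
import tensorflow as tf

def _parse_record(serialized):
    features = tf.parse_single_example(serialized, features={
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label_raw': tf.FixedLenFeature([], tf.string)})
    # the volumes were written as raw int16 bytes, so decode and reshape them
    volume = tf.reshape(tf.decode_raw(features['image_raw'], tf.int16), (40, 40, 40))
    label = tf.reshape(tf.decode_raw(features['label_raw'], tf.int16), (40, 40, 40))
    return volume, label

dataset = tf.data.TFRecordDataset('buckner40.tfrecords').map(_parse_record)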
I am starting to play with TensorFlow and I am facing the following problem. I am trying to run an example to do image recognition based on the Stanford Dogs Dataset. I am stuck in the step of converting the images and labels into TFRecords files.
In the image dataset folder there are 120 sub-folders, one for each breed (label).
If I run the code below with just one sub-folder it runs fine (actually I didn't try to read the TFRecord file), but if I include a second sub-folder the process kills the Python kernel.
Here is the code I am running
import glob
import tensorflow as tf
from itertools import groupby
from collections import defaultdict
image_filenames = glob.glob(r'C:\Users\Administrator\Documents\Tensorflow\images\n02*\*.jpg')
training_dataset = defaultdict(list)
testing_dataset = defaultdict(list)
# Split up the filename into its breed and corresponding filename. The breed is found by taking the directory name.
image_filename_with_breed = map(lambda filename: (filename.split("\\")[6], filename), image_filenames)

# Group each image by the breed which is the 0th element in the tuple returned above
for dog_breed, breed_images in groupby(image_filename_with_breed, lambda x: x[0]):
    # Enumerate each breed's image and send ~20% of the images to a testing set
    for i, breed_image in enumerate(breed_images):
        if i % 5 == 0:
            testing_dataset[dog_breed].append(breed_image[1])
        else:
            training_dataset[dog_breed].append(breed_image[1])

    # Check that each breed includes at least 18% of the images for testing
    breed_training_count = len(training_dataset[dog_breed])
    breed_testing_count = len(testing_dataset[dog_breed])
    assert round(breed_testing_count / (breed_training_count + breed_testing_count), 2) > 0.18, 'Not enough testing data'
sess = tf.Session()
def write_records_file(dataset, record_location):
    """
    Fill a TFRecords file with the images found in `dataset` and include their category.

    Parameters
    ----------
    dataset : dict(list)
        Dictionary with each key being a label for the list of image filenames of its value.
    record_location : str
        Location to store the TFRecord output.
    """
    writer = None

    # Enumerating the dataset because the current index is used to breakup the files if they get over 100
    # images to avoid a slowdown in writing.
    current_index = 0
    for breed, images_filenames in dataset.items():
        for image_filename in images_filenames:
            print(image_filename)
            if current_index % 100 == 0:
                if writer:
                    writer.close()
                record_filename = "{record_location}-{current_index}.tfrecords".format(
                    record_location=record_location,
                    current_index=current_index)
                print(record_filename)
                writer = tf.python_io.TFRecordWriter(record_filename)
            current_index += 1

            image_file = tf.read_file(image_filename)

            # In ImageNet dogs, there are a few images which TensorFlow doesn't recognize as JPEGs. This
            # try/catch will ignore those images.
            try:
                image = tf.image.decode_jpeg(image_file)
            except:
                print(image_filename)
                continue

            # Converting to grayscale saves processing and memory but isn't required.
            grayscale_image = tf.image.rgb_to_grayscale(image)
            resized_image = tf.image.resize_images(grayscale_image, (250, 151))

            # tf.cast is used here because the resized images are floats but haven't been converted into
            # image floats where an RGB value is between [0,1).
            image_bytes = sess.run(tf.cast(resized_image, tf.uint8)).tobytes()

            # Instead of using the label as a string, it'd be more efficient to turn it into either an
            # integer index or a one-hot encoded rank one tensor.
            # https://en.wikipedia.org/wiki/One-hot
            image_label = breed.encode("utf-8")

            example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_label])),
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes]))
            }))

            writer.write(example.SerializeToString())
            writer.close()
write_records_file(testing_dataset, r'C:\Users\Administrator\Documents\Tensorflow\TRF\testing_images')
write_records_file(training_dataset, r'C:\Users\Administrator\Documents\Tensorflow\TRF\training_images')
I monitored the memory usage, and running the script does not seem to consume too much memory. I tried this in two virtual machines, one with Ubuntu and the other with Windows 2000.
Does anyone have an idea?
Thanks!
I found the problem. It was the writer.close() statement that was incorrectly indented. It should have been indented at the level of the first for loop, but I had indented it inside the second loop.