Tensorflow Dataset open several files and keep them separated

Tensorflow Dataset open several files and keep them separated - python

I am trying to have a 3d dataset, each data point is in a separate csv file composed of lines and my features are in the columns
I have tried several ways of adding a list of files to a dataset
files = os.listdir("path")
dataset = tf.data.Dataset.from_tensor_slices(files)
or
dataset = tf.data.Dataset.list_files("path/*.csv")
and both seem to work, but then to open the files I cannot rely on Dataset.TextLineDataset because it would make all of my data into one big 2d dataset
I have tried using
dataset = dataset.map(parse_file)
and
def parse_file(filename):
data = np.genfromtxt(str(filename), delimiter=',')
return data
to get it as an array but I get the error
OSError: Tensor("args_0:0", shape=(), dtype=string) not found.
What am I doing wrong ?
EDIT : the data looks like this, it's several files that are all in this form (no headers) :
1498561981000,51.89105,12.41285,0
1498562341000,51.891052,12.412848,0
1498562566000,51.891045,12.412857,0
...
In the end I want a 3d representation where the first dimension is the file, the second is the line and the third is the column
like
[
[[1498561981000,51.89105,12.41285,0],[1498562341000,51.891052,12.412848,0],[1498562566000,51.891045,12.412857,0]],
[[1498561981000,51.89105,12.41285,0],[1498562341000,51.891052,12.412848,0],[1498562566000,51.891045,12.412857,0]]
...
]

The error you run into is because you are trying to use a python/numpy function in the map call. For performance reasons, tf.data runs its op in graph mode, which means that every function passed to map should either be native to tensorflow, or wrapped in a tf.python_func/tf.numpy_func. It's really tricky when it comes to I/O operation like reading a file, where it's almost mandatory to use native tensorflow functions.
Here is a way to read your csv and put them in a dataset. Each element of the dataset in one csv file.
import tensorflow as tf
def read_csv(filepath):
content = tf.io.read_file(filepath)
# taking care of trailing whitespace
content_no_trailing = tf.strings.strip(content)
lines = tf.strings.split(content_no_trailing, sep="\n")
values = tf.map_fn(lambda x: tf.strings.split(x, sep=","), lines)
# we have to nest two calls to map_fn, one for each line, then for each columns
float_values = tf.map_fn(
lambda x: tf.map_fn(tf.strings.to_number, x, fn_output_signature=tf.float32),
values,
fn_output_signature=tf.float32,
)
return float_values
files = ["test1.csv", "test2.csv"] # or any way to get a list of file names
list_ds = tf.data.Dataset.from_tensor_slices(files)
ds = list_ds.map(read_csv)
Writing to files `test1.csv" and test2.csv" with identical content and then looping over that dataset we see:
>>> for elem in ds:print(elem)
tf.Tensor(
[[1.4985620e+12 5.1891048e+01 1.2412850e+01 0.0000000e+00]
[1.4985623e+12 5.1891052e+01 1.2412848e+01 0.0000000e+00]
[1.4985626e+12 5.1891045e+01 1.2412857e+01 0.0000000e+00]], shape=(3, 4), dtype=float32)
tf.Tensor(
[[1.4985620e+12 5.1891048e+01 1.2412850e+01 0.0000000e+00]
[1.4985623e+12 5.1891052e+01 1.2412848e+01 0.0000000e+00]
[1.4985626e+12 5.1891045e+01 1.2412857e+01 0.0000000e+00]], shape=(3, 4), dtype=float32)

Related

CNN: What to do when labels are given by a map

When fitting a convolutional neural network for an image classification problem, in order to use functions like
flow_from_directory()
image_dataset_from_directory()
Keras expects the train data to be stored in this way:
\data:
\training
\class_1
"imag1.jpg"
"imag2.jpg"
...
\class_2
"imag1.jpg"
"imag2.jpg"
...
....
Instead, I have a dataset with all the images stored in a single folder and a .json file which contains a map from the file names to the labels. Something like
{"18985.jpg": 0, "43358.jpg": 0, ... "13163.jpg": 1 ....}
Is there an efficient way to use this dataset anyway?

The solution I advise would be to write a script to build the folders for you
step 1 : open the json, and get a list of unique catégories
step 2 : iterate over the list of unique categories et create a folder under training
step 3 : iterate over the json, and copy the file to the right folder (that you created already)
step 4 : load everything using image_dataset_from_directory
Another one would be to use from_generator
import json
# Opening JSON file
f = open('data.json',)
# returns JSON object as
# a dictionary
data = json.load(f)
def gen():
for (image_path, label) in data.items():
image = tf.keras.preprocessing.image.load_img(image_path)
input_arr = keras.preprocessing.image.img_to_array(image)
yield (input_arr, label)
dataset = tf.data.Dataset.from_generator(
gen,
(tf.float32, tf.float32),
output_shapes=([32,256,256,3], [32,5]) # 5 is your number of categories
Personnally I'll go with the first one ^^

"... has insufficient rank for batching." What is the problem with this 3 line code?

this is my first question here.
I've been wanting to create a dataset with the popular IMDb dataset for learning purpose. The directories are as follows: .../train/pos/ and .../train/neg/ . I created a function which will merge text files with its labels and getting a error. I need your help to debug!
def datasetcreate(filepath, label):
filepaths = tf.data.Dataset.list_files(filepath)
return tf.stack([tf.data.Dataset.from_tensor_slices((_, tf.constant(label, dtype='int32'))) for _ in tf.data.TextLineDataset(filepaths)])
datasetcreate(['aclImdb/train/pos/*.txt'],1)
And this is the error I'm getting:
ValueError: Value tf.Tensor(b'An American in Paris was, in many ways, the ultimate.....dancers of all time.', shape=(), dtype=string) has insufficient rank for batching.
Why does this happen and what can I do to get rid of this? Thanks.

Your code has two problems:
First, the way you load your TextLineDatasets, your loaded tensors contain string objects, which have an empty shape associated, i.e. a rank of zero. The rank of a tensor is the length of the shape property.
Secondly, you are trying to stack two tensors with different rank, which is would throw another error because, a sentence (a sequence of tokens) has a rank of 1 and the label as scalar has a rank of 0.
If you just need the dataset, I recommend to use the Tensorflow Dataset package, which has many ready-to-use datasets available.
If want to solve your particular problem, one way to fix your data pipeline is by using Datasest.interleave and the Dataset.zip functions.
# load positive sentences
filepaths = list(tf.data.Dataset.list_files('aclImdb/train/pos/*.txt'))
sentences_ds = tf.data.Dataset.from_tensor_slices(filepaths)
sentences_ds = sentences_ds.interleave(lambda text_file: tf.data.TextLineDataset(text_file) )
sentences_ds = sentences_ds.map( lambda text: tf.strings.split(text) )
# dataset for labels, create 1 label per file
labels = tf.constant(1, dtype="int32", shape=(len(filepaths)))
label_ds = tf.data.Dataset.from_tensor_slices(labels)
# combine text with label datasets
dataset = tf.data.Dataset.zip( (sentences_ds, label_ds) )
print( list(dataset.as_numpy_iterator() ))
First, you use the interleave function to combine multiple text datasets to one dataset. Next, you use tf.strings.split to split each text to its tokens. Then, you create a dataset for your positive labels. Finally, you combine the two datasets using zip.
IMPORTANT: To train/run any DL models on your dataset, you will likely need further pre-processing for your sentences, e.g. build a vocabulary and train word-embeddings.

Reading Dataset from files where some might be missing

I'm trying to load files to TensorFlow Dataset where some files might be missing (in which case I want to replace these with zeroes).
The structure of directories that I'm trying to read data from is as follows:
|-data
|---sensor_A
|-----1.dat
|-----2.dat
|-----3.dat
|---sensor_B
|-----1.dat
|-----2.dat
|-----3.dat
.dat files are .csv files with spacebar as a separator. The content of every file is a single, multi-row observation where the number of columns is constant (say 4) and the number of rows is unknown (timeseries data).
I've successfully managed to read every sensor data to a separate TensorFlow Dataset with the following code:
import os
import tensorflow as tf
tf.enable_eager_execution()
data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]
for mod_idx, modality in enumerate(modalities_to_use):
# Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
dataset = tf.data.Dataset.from_tensor_slices((filenames,))
def _parse_function_internal(filename):
number_of_columns = 4
single_observation = tf.read_file(filename)
# Tokenise every value so we can cast these to floats later.
single_observation = tf.string_split([single_observation], sep='\r\n ').values
single_observation = tf.reshape(single_observation, (-1, number_of_columns))
single_observation = tf.strings.to_number(single_observation, tf.float32)
return filename, single_observation
dataset = dataset.map(_parse_function_internal)
print('Result:')
for el in dataset:
try:
# Filename
print(el[0])
# Parsed file content
print(el[1])
except tf.errors.OutOfRangeError:
break
which successfully prints out content of all three files for every sensor.
My problem is that some timestamps in the dataset might be missing. For instance if file 1.dat in sensor_A directory will be missing I'm getting this error:
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A\1.dat : The system cannot find the file specified.
; No such file or directory
[[{{node ReadFile}}]] [Op:IteratorGetNextSync]
which is thrown in this line:
for el in dataset:
What I've tried to do is to surround the call to tf.read_file() function with try block but obviously it doesn't work as the error is not thrown when tf.read_file() is called, but when the value is fetched from the dataset. Later I want to pass this dataset to a Keras model so I can't just surround it with try block. Is there any workaround? Is that even supported?
Thanks!

I managed to solve the problem, sharing the solution just in case someone else will be struggling with it as well. I had to use additional list of booleans specifying whether the file actually exist and pass it into the mapper. Then using tf.cond() function we decide whether to read the file or mock the data with zeroes (or any other logic).
import os
import tensorflow as tf
tf.enable_eager_execution()
data_root_dir = "data"
modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]
for mod_idx, modality in enumerate(modalities_to_use):
# Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
files_exist = [os.path.isfile(filename) for filename in filenames]
dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))
def _parse_function_internal(filename, file_exist):
number_of_columns = 4
single_observation = tf.cond(file_exist, lambda: tf.read_file(filename), lambda: ' '.join(['0.0'] * number_of_columns))
# Tokenise every value so we can cast these to floats later.
single_observation = tf.string_split([single_observation], sep='\r\n ').values
single_observation = tf.reshape(single_observation, (-1, number_of_columns))
single_observation = tf.strings.to_number(single_observation, tf.float32)
return filename, single_observation
dataset = dataset.map(_parse_function_internal)
print('Result:')
for el in dataset:
try:
# Filename
print(el[0])
# Parsed file content
print(el[1])
except tf.errors.OutOfRangeError:
break

How to create my own dataset for keras model.fit() using Tensorflow(python)?

I want to train a simple classification neural network which can classify the data into 2 types, i.e. true or false.
I have 29 data along with respective labels available with me. I want to parse this data to form a dataset which can be fed into model.fit() to train the neural network.
Please suggest me how can I arrange the data with their respective labels. What to use, whether lists, dictionary, array?
There are values of 2 fingerprints separated by '$' sign and whether they match or not (i.e. true or false) is separated by another '$' sign.
A Fingerprint has 63 features separated by ','(comma) sign.
So, Each line has the data of 2 fingerprints and true/false data.
I have below data with me in following format:
File Name : thumb_and_index.txt
239,1,255,255,255,255,2,0,130,3,1,105,24,152,0,192,126,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,192,0,192,0,0,0,0,0,0,0,147,18,19,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,101,22,154,0,240,30,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,71,150,212,$true
239,1,255,255,255,255,2,0,130,3,1,82,23,146,0,128,126,0,14,0,6,0,6,0,2,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,6,128,6,192,14,224,30,255,254,0,0,0,0,0,0,207,91,180,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,28,138,0,241,254,128,6,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,128,2,128,2,192,6,224,6,224,62,0,0,0,0,0,0,0,0,0,0,0,0,13,62,$true
239,1,255,255,255,255,2,0,130,3,1,92,29,147,0,224,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,0,192,0,224,0,224,2,240,2,248,6,255,14,76,16,0,0,0,0,19,235,73,181,0,0,0,0,$239,192,255,255,255,255,2,0,130,3,1,0,0,0,0,248,30,240,14,224,0,224,0,128,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,128,14,192,14,252,30,0,0,0,0,0,0,0,0,0,0,0,0,158,46,$false
239,1,255,255,255,255,2,0,130,3,1,0,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,128,0,0,0,0,0,0,0,217,85,88,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,27,135,0,252,254,224,126,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,190,148,$false
239,1,255,255,255,255,2,0,130,3,1,89,22,129,0,129,254,128,254,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,128,14,192,14,224,14,0,0,0,0,0,0,20,20,43,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,91,17,134,0,0,126,0,30,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,6,0,6,0,30,192,62,224,126,224,254,0,0,0,0,0,0,0,0,0,0,0,0,138,217,$true
239,1,255,255,255,255,2,0,130,3,1,71,36,143,0,128,254,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,6,80,18,0,0,0,0,153,213,11,95,83,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,94,30,140,0,129,254,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,54,13,$true
239,1,255,255,255,255,2,0,130,3,1,66,42,135,0,255,254,1,254,0,14,0,6,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,0,0,0,0,0,0,225,165,64,152,172,88,0,0,$239,1,255,255,255,255,2,0,130,3,1,62,29,137,0,255,254,249,254,240,6,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,240,0,240,0,240,0,240,0,240,0,240,2,0,0,0,0,0,0,0,0,0,0,0,0,0,98,$true
239,1,255,255,255,255,2,0,130,3,1,83,31,142,0,255,254,128,254,0,30,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,128,2,192,2,192,2,192,2,192,6,0,0,0,0,0,0,146,89,117,12,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,84,14,154,0,0,2,0,2,0,2,0,2,0,2,0,6,0,14,128,30,192,62,255,254,255,254,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,31,$false
239,1,255,255,255,255,2,0,130,3,1,66,41,135,0,255,254,248,62,128,30,0,14,0,14,0,14,0,14,0,14,0,14,0,6,0,6,0,6,0,14,0,14,0,14,192,14,224,14,0,0,0,0,0,0,105,213,155,107,95,23,0,0,$239,1,255,255,255,255,2,0,130,3,1,61,33,133,0,255,254,255,254,224,62,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$false
239,1,255,255,255,255,2,0,130,3,1,88,31,119,0,0,14,0,14,0,6,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,100,133,59,150,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,97,21,137,0,128,14,0,6,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,0,0,0,0,0,0,0,0,0,0,80,147,210,$true
239,1,255,255,255,255,2,0,130,3,1,85,21,137,0,224,14,192,6,192,6,128,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,192,30,224,126,224,254,0,0,0,0,0,0,79,158,178,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,25,134,0,240,6,128,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,128,2,128,2,192,2,192,6,224,6,240,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,72,31,$true
239,1,255,255,255,255,2,0,130,3,1,90,25,128,0,241,254,0,30,0,6,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,6,0,6,192,14,0,0,0,0,0,0,225,153,189,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,96,12,153,0,192,14,128,6,128,6,128,6,0,6,128,2,128,2,128,2,128,6,128,6,192,14,240,30,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,18,$false
239,1,255,255,255,255,2,0,130,3,1,96,22,142,0,255,254,254,14,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,192,2,0,0,0,0,0,0,18,25,100,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,24,145,0,224,2,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,224,2,240,126,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,145,$false
239,1,255,255,255,255,2,0,130,3,1,71,33,117,0,129,254,0,30,0,14,0,14,0,6,0,6,0,2,0,2,0,6,0,6,0,6,0,6,0,6,128,14,192,14,240,30,240,254,0,0,0,0,0,0,235,85,221,57,17,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,31,112,0,255,254,0,62,0,62,0,62,0,14,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,0,6,128,14,224,62,0,0,0,0,0,0,0,0,0,0,0,0,30,170,$true
239,1,255,255,255,255,2,0,130,3,1,64,29,117,0,128,30,0,30,0,30,0,14,0,6,0,6,0,6,0,6,0,6,0,14,0,14,0,14,128,30,192,30,224,62,240,254,255,254,0,0,0,0,0,0,99,80,119,149,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,72,18,132,0,128,2,0,0,0,0,128,0,128,0,128,0,128,0,192,2,224,2,240,14,252,14,255,254,255,254,255,254,255,254,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,14,$false
239,1,255,255,255,255,2,0,130,3,1,82,16,132,0,255,254,255,254,255,254,240,30,224,14,224,14,192,6,192,6,192,2,192,2,192,2,192,2,192,2,192,2,192,1,224,2,240,6,0,0,0,0,0,0,215,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,85,23,130,0,240,30,192,14,128,14,128,6,128,2,128,2,128,2,128,2,128,2,128,0,192,0,192,2,192,2,224,2,224,6,240,6,248,30,0,0,0,0,0,0,0,0,0,0,0,0,0,62,$true
239,1,255,255,255,255,2,0,130,3,1,100,28,141,0,255,254,255,254,224,14,192,14,192,6,192,2,128,2,128,2,128,2,0,2,0,2,0,2,0,2,0,6,0,6,0,6,192,14,0,0,0,0,0,0,42,88,87,169,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,31,134,0,255,254,240,254,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,128,0,192,2,192,6,0,0,0,0,0,0,0,0,0,0,0,0,0,182,$true
239,1,255,255,255,255,2,0,130,3,1,88,35,121,0,255,14,240,6,224,7,192,2,192,2,192,2,192,2,192,2,192,2,192,2,192,2,224,2,224,2,224,2,224,2,224,2,224,6,0,0,0,0,0,0,36,81,48,225,153,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,81,43,112,0,252,62,248,14,224,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,192,0,224,0,224,2,224,2,224,2,224,6,0,0,0,0,0,0,0,0,0,0,0,0,0,76,$true
239,1,255,255,255,255,2,0,130,3,1,103,24,144,0,255,254,192,14,192,6,128,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,6,128,6,128,6,192,30,224,254,0,0,0,0,0,0,19,82,111,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,98,11,149,0,255,2,255,0,252,0,240,0,240,0,240,0,248,0,248,0,248,0,252,0,254,0,254,2,254,30,254,30,254,30,254,30,254,30,0,0,0,0,0,0,0,0,0,0,0,0,0,114,$false
239,1,255,255,255,255,2,0,130,3,1,92,23,123,0,255,254,255,30,252,6,240,2,224,0,192,0,192,0,192,0,224,0,224,0,224,0,224,2,224,2,224,2,224,2,224,6,224,6,0,0,0,0,0,0,35,161,251,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,52,37,125,0,255,254,255,254,224,254,192,30,192,14,128,14,128,14,128,14,128,14,128,14,128,14,128,14,128,6,0,2,0,2,0,2,192,2,0,0,0,0,0,0,0,0,0,0,0,0,0,110,$false
239,1,255,255,255,255,2,0,130,3,1,103,19,143,0,255,254,254,254,0,126,0,126,0,126,0,62,0,62,0,126,0,126,0,126,0,126,0,126,0,126,0,126,0,254,0,254,0,254,0,0,0,0,0,0,38,168,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,90,30,141,0,255,254,193,254,128,62,0,6,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,6,0,254,0,0,0,0,0,0,0,0,0,0,0,0,53,211,$true
239,1,255,255,255,255,2,0,130,3,1,93,34,137,0,255,254,225,254,192,14,192,2,192,2,192,2,192,2,192,0,192,0,192,0,192,0,192,0,192,0,224,2,224,2,240,6,240,14,0,0,0,0,0,0,101,4,252,164,28,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,88,31,140,0,255,254,192,62,192,14,192,14,0,6,0,6,0,6,0,6,0,2,0,2,0,2,0,2,128,2,128,6,192,6,224,14,240,30,0,0,0,0,0,0,0,0,0,0,0,0,10,97,$true
239,1,255,255,255,255,2,0,130,3,1,57,50,107,0,248,2,248,0,248,0,224,0,224,0,192,0,192,0,192,0,128,0,128,0,128,0,128,0,192,0,192,0,192,0,192,2,224,2,0,0,0,0,0,0,34,10,146,27,176,73,73,82,$239,1,255,255,255,255,2,0,130,3,1,54,42,111,0,255,254,255,254,254,126,252,6,240,2,224,2,224,2,224,0,224,0,224,0,224,0,224,0,224,0,224,0,224,0,192,0,192,0,0,0,0,0,0,0,0,0,0,0,0,0,0,225,$true
239,1,255,255,255,255,2,0,130,3,1,103,18,142,0,241,254,224,254,128,126,128,126,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,209,21,0,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,103,10,139,0,255,254,255,254,255,254,225,254,192,254,192,254,192,126,128,62,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,0,0,0,0,0,0,0,163,$true
239,1,255,255,255,255,2,0,130,3,1,85,21,132,0,248,2,248,2,248,0,240,0,240,0,240,0,240,0,240,0,240,0,240,0,248,0,248,0,252,0,252,0,252,0,254,2,255,6,0,0,0,0,0,0,94,23,110,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,76,26,133,0,129,254,128,62,0,62,0,62,0,62,0,62,0,30,0,30,0,30,0,30,0,30,0,30,0,30,0,30,128,30,192,14,224,14,0,0,0,0,0,0,0,0,0,0,0,0,222,36,$true
239,1,255,255,255,255,2,0,130,3,1,87,28,141,0,255,254,255,254,224,254,224,126,224,126,0,14,0,2,0,2,0,2,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,2,0,0,0,0,0,0,143,231,78,148,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,89,30,139,0,255,254,248,254,240,30,224,14,224,14,192,6,192,2,128,0,0,0,0,0,0,0,0,0,0,0,0,2,0,2,0,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0,26,213,$true
239,1,255,255,255,255,2,0,130,3,1,93,25,136,0,255,254,193,254,0,254,0,62,0,30,0,30,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,0,0,0,0,0,148,210,91,0,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,95,23,145,0,254,254,252,30,240,2,224,0,224,0,224,0,192,0,192,0,192,0,192,6,192,6,192,6,192,6,192,6,192,6,224,6,224,14,0,0,0,0,0,0,0,0,0,0,0,0,0,30,$false
239,1,255,255,255,255,2,0,130,3,1,85,27,138,0,255,254,240,126,224,30,192,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,14,0,30,0,30,0,30,192,62,224,62,0,0,0,0,0,0,85,17,74,101,0,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,105,19,144,0,192,254,128,126,0,62,0,30,128,30,128,30,128,14,192,14,192,14,192,14,224,14,224,14,240,14,240,14,248,14,254,30,255,30,0,0,0,0,0,0,0,0,0,0,0,0,0,254,$false
239,1,255,255,255,255,2,0,130,3,1,86,37,116,0,255,254,254,14,252,6,248,2,240,0,240,0,224,0,192,0,192,0,128,0,0,0,0,2,0,2,0,2,0,2,0,6,0,6,0,0,0,0,0,0,94,157,90,28,219,0,0,0,$239,1,255,255,255,255,2,0,130,3,1,99,26,130,0,255,254,248,14,240,2,224,0,192,0,192,0,192,0,128,0,192,0,192,0,192,0,192,0,224,0,240,2,248,6,255,254,255,254,0,0,0,0,0,0,0,0,0,0,0,0,0,213,$true
I have used this code trying to parse the data:
import tensorflow as tf
import os
import array as arr
import numpy as np
import json
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2"
f= open("thumb_and_index.txt","r")
dataset = []
if f.mode == 'r':
contents =f.read()
#list of lines
lines = contents.splitlines()
print("No. of lines : "+str(len(lines)))
for line in lines:
words = line.split(',')
mainlist = []
list = []
flag = 0
for word in words:
print("word : " + word)
if '$' in word:
if flag == 1:
mainlist.append(list)
mainlist.append(word[1:])
dataset.append(mainlist)
else:
mainlist.append(list)
del list[0:len(list)]
list.append(int(word[1:]))
flag = flag + 1
else:
list.append(int(word))
print(json.dumps(dataset, indent = 4))
I want to feed the parsed data into model.fit() using keras in tensorflow(python).
Also I want to ask about the neural network. How many layers and nodes should I keep in my neural network? Suggest a starting point.

there's a plenty ways to do that (formating the data), you can create 2D matrix for the data that has 62 columns for the data and another array that handles the results for this data (X_data,Y_data).
also you can use pandas to create dataframes for the data (same as arrays, bu it's better to show and visualize the data).
example to read the textfile into pandas dataframe
import pandas
df = pandas.read_table('./input/dists.txt', delim_whitespace=True, names=('A', 'B', 'C'))
split the data into x&y then fit it in your model
for the size of the hidden layers in your neural, it's well known that the more layers you add the more accurate results you get (without considering overfitting) , so that depends on your data.
I suggest you to start with a sequential layers as follows (62->2048->1024->512->128->64->sigmoid)

The best approach, especially assuming that dataset is large, is to use the tf.data dataset. There's a CSV reader built right in. The dataset api provides all the functionality you need to preprocess the dataset, it provides built-in multi-core processing, and quite a bit more.
Once you have the dataset built Keras will accept it as an input directly, so fit(my_dataset, inputs=... outputs=...).
The structure of the dataset api takes a little learning, but it's well worth it. Here's the primary guide with lots of examples:
https://www.tensorflow.org/guide/datasets
Scroll down to the section on 'Import CSV data' for poignant examples.
Here's a nice example of using the dataset API with keras: How to Properly Combine TensorFlow's Dataset API and Keras?

Can't access returned h5py object instance

I have a very weird issue here. I have 2 functions: one which reads an HDF5 file created using h5py and one which creates a new HDF5 file which concatenates the content returned by the former function.
def read_file(filename):
with h5py.File(filename+".hdf5",'r') as hf:
group1 = hf.get('group1')
group1 = hf.get('group2')
dataset1 = hf.get('dataset1')
dataset2 = hf.get('dataset2')
print group1.attrs['w'] # Works here
return dataset1, dataset2, group1, group1
And the create file function
def create_chunk(start_index, end_index):
for i in range(start_index, end_index):
if i == start_index:
mergedhf = h5py.File("output.hdf5",'w')
mergedhf.create_dataset("dataset1",dtype='float64')
mergedhf.create_dataset("dataset2",dtype='float64')
g1 = mergedhf.create_group('group1')
g2 = mergedhf.create_group('group2')
rd1,rd2,rg1,rg2 = read_file(filename)
print rg1.attrs['w'] #gives me <Closed HDF5 group> message
g1.attrs['w'] = "content"
g1.attrs['x'] = "content"
g2.attrs['y'] = "content"
g2.attrs['z'] = "content"
print g1.attrs['w'] # Works Here
return mergedhf.get('dataset1'), mergedhf.get('dataset2'), g1, g2
def calling_function():
wd1, wd2, wg1, wg2 = create_chunk(start_index, end_index)
print wg1.attrs['w'] #Works here as well
Now the problem is, the dataset and the properties from the new file created and represented by wd1, wd2, wg1 and wg2 can be accessed by me and I can access the attribute data but i cant do the same for which I have read and returned the value for.
Can anyone help me fetch the values of the dataset and group when I have returned the reference to the calling function?

The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
There are (at least) two ways to fix this. Firstly, you can open the file like you do in create_chunk:
hf = h5py.File(filename+".hdf5", 'r')
and keep the reference to hf around as long as you need it, before closing it:
hf.close()
The other way is to copy the data from the datasets in read_file and return those instead:
dataset1 = hf.get('dataset1')[:]
dataset2 = hf.get('dataset2')[:]
Note that you can't do this with the groups. The file needs to be open for as long as you need to do things with the groups.

Adding to #Yossarian's answer
The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
For those who come across this and are reading a scalar dataset make sure to index using [()]:
scalar_dataset1 = hf['scalar_dataset1'][()]
Preface
I had a similar issue as OP resulting in a return value of <closed hdf5 dataset>. However, I would get a ValueError when attempting to slice my scalar dataset with [:].
"ValueError: Illegal slicing argument for scalar dataspace"
Indexing with [()] along with #Yossarian's answer helped solve my problem.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Tensorflow Dataset open several files and keep them separated - python

Related

CNN: What to do when labels are given by a map

"... has insufficient rank for batching." What is the problem with this 3 line code?

Reading Dataset from files where some might be missing

How to create my own dataset for keras model.fit() using Tensorflow(python)?

Can't access returned h5py object instance

Categories

Resources