tensorflow pipeline for pickled pandas data input - python

I would like to input compressed pd.read_pickle(filename, compression='xz') pandas dataframes as a pipeline to tensorflow. I want to use the high level API tf.estimatorclassifier which requires an input function.
My data files are large matrices ~(1400X16) of floats, and each matrix corresponds to a particular type (label). Each type (label) is contained in a different directory, so I know the matrix label from its directory. At the low level, I know I can populate data using a feed_dict={X:batch_X:Y_:batch_Y}to feed the data pipeline, but tf.estimatorrequires an input function. For example, assuming I have two labels, my function would probably be something like this
def my_input_fn(directory,file_name):
data=pd.read_pickle(directory+file_name,compression='xz')
#sometimes I need to operate columns
data=data['col1']*data['col2']
if directory=='./dir1':
label=[1,0]
elif directory=='./dir2':
label=[0,1]
return data,label
But I am having a lot of trouble understanding how to map my input to a tensorflow graph dict and how the tf.estimator accepts my function returns. What is the correct way to return my data and labels, so that they enter a pipeline?

Just in case someone else has a similar problem, I will write a small version of a solution to this question.
I defined a function called "extract" and mapped this extracted numpy array as a dataset using tf.py_func. Then another dataset is initiated from "labels" (a one-hot numpy array), and is initiated in correspondance to the filenames of the extracted pandas dataframes. The extracted dataframes and one-hot arrays are zipped with tf.data.Dataset.zip into a final dataset. The final dataset can then be prefetched, iterated, etc.
def extract(file_name):
df=pd.read_pickle(file_name,compression='xz')
df=df.astype(dtype=float)
... #extra manipulations
return df.values.astype('float32', copy=False)
dataset1 = tf.data.Dataset.list_files(file_names)
dataset1 = dataset1.map(lambda filename: tf.py_func(extract,filename],tf.float32),num_parallel_calls=10)
dataset2 = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((dataset1,dataset2))
iter = dataset.make_one_shot_iterator()
get_batch = iter.get_next()
X,Y_= get_batch
with tf.Session() as sess:
sess.run(init)
xx,yy=sess.run([X,Y_])

Related

How do I correctly apply data augmentation to a TFRecord Dataset?

I am attempting to apply data augmentation to a TFRecord dataset after it has been parsed. However, when I check the size of the dataset before and after mapping the augmentation function, the sizes are the same. I know the parse function is working and the datasets are correct as I have already used them to train a model. So I have only included code to map the function and count the examples afterward.
Here is the code I am using:
num_ex = 0
def flip_example(image, label):
flipped_image = flip(image)
return flipped_image, label
dataset = tf.data.TFRecordDataset('train.TFRecord').map(parse_function)
for x in dataset:
num_ex += 1
num_ex = 0
dataset = dataset.map(flip_example)
#Size of dataset
for x in dataset:
num_ex += 1
In both cases, num_ex = 324 instead of the expected 324 for non-augmented and 648 for augmented. I have also successfully tested the flip function so it seems the issue is with how the function interacts with the dataset. How do I correctly implement this augmentation?
When you apply data augmentation with the tf.data API, it is done on-the-fly, meaning that every example is transformed as implemented in your method. Augmenting data this way does not mean that the number of examples in your pipeline changes.
If you want to use every example n times, simply add dataset = dataset.repeat(count=n). You might want to update your code to use tf.image.random_flip_left_right, otherwise the flip is done the same way each time.
In your example the second time you check num_ex, dataset only contains the flipped images so 324.
Furthermore if you have a large dataset, larger than 324, you might want to look into online data augmentation. In this case, during training the dataset is augmented differently every epoch, and you only train on the augmented data not on the original dataset. This helps the trained model generalise better. (https://www.tensorflow.org/tutorials/images/data_augmentation)

Creating a TimeseriesGenerator with multiple inputs

I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks, due to memory limits I cannot hold everything in memory after converting to sequences for the model.
This leads me to using a generator instead like the TimeseriesGenerator from Keras / Tensorflow. Problem is that if I try using the generator on all of my data stacked it would create sequences of mixed stocks, see the example below with a sequence of 5, here Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2"
Instead what I would want is similar to this:
Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one
I explored the option of combining the generators like this SO suggests: How do I combine two keras generator functions, however this is not idea in the case of ~4000 generators.
I hope my question makes sense.
So what I've ended up doing is to do all the preprocessing manually and save an .npy file for each stock containing the preprocessed sequences, then using a manually created generator I make batches like this:
class seq_generator():
def __init__(self, list_of_filepaths):
self.usedDict = dict()
for path in list_of_filepaths:
self.usedDict[path] = []
def generate(self):
while True:
path = np.random.choice(list(self.usedDict.keys()))
stock_array = np.load(path)
random_sequence = np.random.randint(stock_array.shape[0])
if random_sequence not in self.usedDict[path]:
self.usedDict[path].append(random_sequence)
yield stock_array[random_sequence, :, :]
train_generator = seq_generator(list_of_filepaths)
train_dataset = tf.data.Dataset.from_generator(seq_generator.generate(),
output_types=(tf.float32, tf.float32),
output_shapes=(n_timesteps, n_features))
train_dataset = train_dataset.batch(batch_size)
Where list_of_filepaths is simply a list of paths to preprocessed .npy data.
This will:
Load a random stock's preprocessed .npy data
Pick a sequence at random
Check if the index of the sequence has already been used in usedDict
If not:
Append the index of that sequence to usedDict to keep track as to not feed the same data twice to the model
Yield the sequence
This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods from Tensorflows Dataset type.

Proper way to iterate tf.data.Dataset in session for 2.0

I have downloaded some *.tfrecord data from the youtube-8m project. You can download a 'small' portion of the data with this command:
curl data.yt8m.org/download.py | shard=1,100 partition=2/video/train mirror=us python
I am trying to get an idea of how to use the new tf.data API. I would like to become familiar with the typical ways people iterate through datasets. I have been using the guide on TF website and this slide: Derek Murray's Slides
Here is how I define the dataset:
# Use interleave() and prefetch() to read many files concurrently.
files = tf.data.Dataset.list_files("./youtube_vids/*.tfrecord")
dataset = files.interleave(lambda x: tf.data.TFRecordDataset(x).prefetch(100),
cycle_length=8)
# Use num_parallel_calls to parallelize map().
dataset = dataset.map(lambda record: tf.parse_single_example(record, feature_map),
num_parallel_calls=2) #
# put in x,y output form
dataset = dataset.map(lambda x: (x['mean_rgb'], x['id']))
# shuffle
dataset = dataset.shuffle(10000)
#one epoch
dataset = dataset.repeat(1)
dataset = dataset.batch(200)
#Use prefetch() to overlap the producer and consumer.
dataset = dataset.prefetch(10)
Now, I know in eager execution mode I can just
for x,y in dataset:
x,y
However, when I attempt to create an iterator as follows:
# A one-shot iterator automatically initializes itself on first use.
iterator = dset.make_one_shot_iterator()
# The return value of get_next() matches the dataset element type.
images, labels = iterator.get_next()
And run with session
with tf.Session() as sess:
# Loop until all elements have been consumed.
try:
while True:
r = sess.run(images)
except tf.errors.OutOfRangeError:
pass
I get the warning
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
So, here is my question:
What is the proper way to iterate through a dataset within a session? Is it just a matter of v1 and v2 differences?
Also, the advice to pass the dataset directly to an estimator implies that the input function also has an iterator defined as in Derek Murray's slides above, correct?
As for Estimator API, no you don't have to specify iterator, just pass dataset object as input function.
def input_fn(filename):
dataset = tf.data.TFRecordDataset(filename)
dataset = dataset.shuffle().repeat()
dataset = dataset.map(parse_func)
dataset = dataset.batch()
return dataset
estimator.train(input_fn=lambda: input_fn())
In TF 2.0 dataset became iterable, so, just as warning message says, you can use
for x,y in dataset:
x,y

writing numpy to tfrecord is slow when looping over features using tfrecordwriter

I have a numpy array that i want to write to a tfrecord file. The dimensions of the array for both input X and label y are [200,46,72,72]for the training of my model i want to read the tfrecord file to get slices of [72,72] both for input and label data.
I tried to apply the following stackoverflow answer
The problem is that this method is really slow probably due to the amount of elements looped over 200*46.
When I write the entire numpy array as feature of bytes instead of floatlist I don't have this problem, but than I don't understand how to get [72,72] slices for each batch.
def npy_to_tfrecords(X,y):
# write records to a tfrecords file
output_file = 'E:\\Documents\\Datasets\\tfrecordtest\\test.tfrecord'
writer = tf.python_io.TFRecordWriter(output_file)
# Loop through all the features you want to write
for i in range(X.shape[0]) :
for j in range(X.shape[1]) :
#let say X is of np.array([[...][...]])
#let say y is of np.array[[0/1]]
print(f"{i},{j}")
# Feature contains a map of string to feature proto objects
feature = {}
feature['X'] = tf.train.Feature(float_list=tf.train.FloatList(value=X[i,j:,:].flatten()))
feature['y'] = tf.train.Feature(float_list=tf.train.FloatList(value=y[i,j:,:].flatten()))
# Construct the Example proto object
example = tf.train.Example(features=tf.train.Features(feature= feature) )
# Serialize the example to a string
serialized = example.SerializeToString()
# write the serialized objec to the disk
writer.write(serialized)
writer.close()
for reading i use the following code roughly
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function, num_parallel_calls=6)
dataset.apply(tf.contrib.data.shuffle_and_repeat(SHUFFLE_BUFFER))
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
input_data, label_data = iterator.get_next()
when i save the numpy arrays as bytes the parse_function returns
the whole array and I can not figure out how to write a
parse_function that returns slices.
Summary:
save 2 numpy arrays to tfrecord
read tfrecord file and obtain slices of the saved numpy arrays in the batches used for the model

Streaming large training and test files into Tensorflow's DNNClassifier

I have a huge training CSV file (709M) and a large testing CSV file (125M) that I want to send into a DNNClassifier in the context of using the high-level Tensorflow API.
It appears that the input_fn param accepted by fit and evaluate must hold all feature and label data in memory, but I currently would like to run this on my local machine, and thus expect it to run out of memory rather quickly if I read these files into memory and then process them.
I skimmed the doc on streamed-reading of data, but the sample code for reading CSVs appears to be for the low-level Tensorflow API.
And - if you'll forgive a bit of whining - it seems overly-complex for the trivial use case of sending well-prepared files of training and test data into an Estimator ... although, perhaps that level of complexity is actually required for training and testing large volumes of data in Tensorflow?
In any case, I'd really appreciate an example of using that approach with the high-level API, if it's even possible, which I'm beginning to doubt.
After poking around, I did manage to find DNNClassifier#partial_fit, and will attempt to use it for training.
Examples of how to use this method would save me some time, though hopefully I'll stumble into the correct usage in the next few hours.
However, there doesn't seem to be a corresponding DNNClassifier#partial_evaluate ... though I suspect that I could break-up the testing data into smaller pieces and run DNNClassifier#evaluate successively on each batch, which might actually be a great way to do it since I could segment the testing data into cohorts, and thereby obtain per-cohort accuracy.
==== Update ====
Short version:
DomJack's recommendation should be the accepted answer.
However, my Mac's 16GB of RAM enough for it to hold the entire 709Mb training data set in memory without crashing. So, while I will use the DataSets feature when I eventually deploy the app, I'm not using it yet for local dev work.
Longer version:
I started by using the partial_fit API as described above, but upon every use it emitted a warning.
So, I went to look at the source for the method here, and discovered that its complete implementation looks like this:
logging.warning('The current implementation of partial_fit is not optimized'
' for use in a loop. Consider using fit() instead.')
return self.fit(x=x, y=y, input_fn=input_fn, steps=steps,
batch_size=batch_size, monitors=monitors)
... which reminds me of this scene from Hitchhiker's Guide:
Arthur Dent: What happens if I press this button?
Ford Prefect: I wouldn't-
Arthur Dent: Oh.
Ford Prefect: What happened?
Arthur Dent: A sign lit up, saying 'Please do not press this button again'.
Which is to say: partial_fit seems to exist for the sole purpose of telling you not to use it.
Furthermore, the model generated by using partial_fit iteratively on training file chunks was much smaller than the one generated by using fit on the whole training file, which strongly suggests that only the last partial_fit training chunk actually "took".
Check out the tf.data.Dataset API. There are a number of ways to create a dataset. I'll outline four - but you'll only have to implement one.
I assume each row of your csv files is n_features float values followed by a single int value.
Creating a tf.data.Dataset
Wrap a python generator with Dataset.from_generator
The easiest way to get started is to wrap a native python generator. This can have performance issues, but may be fine for your purposes.
def read_csv(filename):
with open(filename, 'r') as f:
for line in f.readlines():
record = line.rstrip().split(',')
features = [float(n) for n in record[:-1]]
label = int(record[-1])
yield features, label
def get_dataset():
filename = 'my_train_dataset.csv'
generator = lambda: read_csv(filename)
return tf.data.Dataset.from_generator(
generator, (tf.float32, tf.int32), ((n_features,), ()))
This approach is highly versatile and allows you to test your generator function (read_csv) independently of TensorFlow.
Use Tensorflow Datasets API
Supporting tensorflow versions 1.12+, tensorflow datasets is my new favourite way of creating datasets. It automatically serializes your data, collects statistics and makes other meta-data available to you via info and builder objects. It can also handle automatic downloading and extracting making collaboration simple.
import tensorflow_datasets as tfds
class MyCsvDatasetBuilder(tfds.core.GeneratorBasedBuilder):
VERSION = tfds.core.Version("0.0.1")
def _info(self):
return tfds.core.DatasetInfo(
builder=self,
description=(
"My dataset"),
features=tfds.features.FeaturesDict({
"features": tfds.features.Tensor(
shape=(FEATURE_SIZE,), dtype=tf.float32),
"label": tfds.features.ClassLabel(
names=CLASS_NAMES),
"index": tfds.features.Tensor(shape=(), dtype=tf.float32)
}),
supervised_keys=("features", "label"),
)
def _split_generators(self, dl_manager):
paths = dict(
train='/path/to/train.csv',
test='/path/to/test.csv',
)
# better yet, if the csv files were originally downloaded, use
# urls = dict(train=train_url, test=test_url)
# paths = dl_manager.download(urls)
return [
tfds.core.SplitGenerator(
name=tfds.Split.TRAIN,
num_shards=10,
gen_kwargs=dict(path=paths['train'])),
tfds.core.SplitGenerator(
name=tfds.Split.TEST,
num_shards=2,
gen_kwargs=dict(cvs_path=paths['test']))
]
def _generate_examples(self, csv_path):
with open(csv_path, 'r') as f:
for i, line in enumerate(f.readlines()):
record = line.rstrip().split(',')
features = [float(n) for n in record[:-1]]
label = int(record[-1])
yield dict(features=features, label=label, index=i)
Usage:
builder = MyCsvDatasetBuilder()
builder.download_and_prepare() # will only take time to run first time
# as_supervised makes output (features, label) - good for model.fit
datasets = builder.as_dataset(as_supervised=True)
train_ds = datasets['train']
test_ds = datasets['test']
Wrap an index-based python function
One of the downsides of the above is shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).
def get_record(i):
# load the ith record using standard python, return numpy arrays
return features, labels
def get_inputs(batch_size, is_training):
def tf_map_fn(index):
features, labels = tf.py_func(
get_record, (index,), (tf.float32, tf.int32), stateful=False)
features.set_shape((n_features,))
labels.set_shape(())
# do data augmentation here
return features, labels
epoch_size = get_epoch_size()
dataset = tf.data.Dataset.from_tensor_slices((tf.range(epoch_size,))
if is_training:
dataset = dataset.repeat().shuffle(epoch_size)
dataset = dataset.map(tf_map_fn, (tf.float32, tf.int32), num_parallel_calls=8)
dataset = dataset.batch(batch_size)
# prefetch data to CPU while GPU processes previous batch
dataset = dataset.prefetch(1)
# Also possible
# dataset = dataset.apply(
# tf.contrib.data.prefetch_to_device('/gpu:0'))
features, labels = dataset.make_one_shot_iterator().get_next()
return features, labels
In short, we create a dataset just of the record indices (or any small record ID which we can load entirely into memory). We then do shuffling/repeating operations on this minimal dataset, then map the index to the actual data via tf.data.Dataset.map and tf.py_func. See the Using with Estimators and Testing in isolation sections below for usage. Note this requires your data to be accessible by row, so you may need to convert from csv to some other format.
TextLineDataset
You can also read the csv file directly using a tf.data.TextLineDataset.
def get_record_defaults():
zf = tf.zeros(shape=(1,), dtype=tf.float32)
zi = tf.ones(shape=(1,), dtype=tf.int32)
return [zf]*n_features + [zi]
def parse_row(tf_string):
data = tf.decode_csv(
tf.expand_dims(tf_string, axis=0), get_record_defaults())
features = data[:-1]
features = tf.stack(features, axis=-1)
label = data[-1]
features = tf.squeeze(features, axis=0)
label = tf.squeeze(label, axis=0)
return features, label
def get_dataset():
dataset = tf.data.TextLineDataset(['data.csv'])
return dataset.map(parse_row, num_parallel_calls=8)
The parse_row function is a little convoluted since tf.decode_csv expects a batch. You can make it slightly simpler if you batch the dataset before parsing.
def parse_batch(tf_string):
data = tf.decode_csv(tf_string, get_record_defaults())
features = data[:-1]
labels = data[-1]
features = tf.stack(features, axis=-1)
return features, labels
def get_batched_dataset(batch_size):
dataset = tf.data.TextLineDataset(['data.csv'])
dataset = dataset.batch(batch_size)
dataset = dataset.map(parse_batch)
return dataset
TFRecordDataset
Alternatively you can convert the csv files to TFRecord files and use a TFRecordDataset. There's a thorough tutorial here.
Step 1: Convert the csv data to TFRecords data. Example code below (see read_csv from from_generator example above).
with tf.python_io.TFRecordWriter("my_train_dataset.tfrecords") as writer:
for features, labels in read_csv('my_train_dataset.csv'):
example = tf.train.Example()
example.features.feature[
"features"].float_list.value.extend(features)
example.features.feature[
"label"].int64_list.value.append(label)
writer.write(example.SerializeToString())
This only needs to be run once.
Step 2: Write a dataset that decodes these record files.
def parse_function(example_proto):
features = {
'features': tf.FixedLenFeature((n_features,), tf.float32),
'label': tf.FixedLenFeature((), tf.int64)
}
parsed_features = tf.parse_single_example(example_proto, features)
return parsed_features['features'], parsed_features['label']
def get_dataset():
dataset = tf.data.TFRecordDataset(['data.tfrecords'])
dataset = dataset.map(parse_function)
return dataset
Using the dataset with estimators
def get_inputs(batch_size, shuffle_size):
dataset = get_dataset() # one of the above implementations
dataset = dataset.shuffle(shuffle_size)
dataset = dataset.repeat() # repeat indefinitely
dataset = dataset.batch(batch_size)
# prefetch data to CPU while GPU processes previous batch
dataset = dataset.prefetch(1)
# Also possible
# dataset = dataset.apply(
# tf.contrib.data.prefetch_to_device('/gpu:0'))
features, label = dataset.make_one_shot_iterator().get_next()
estimator.train(lambda: get_inputs(32, 1000), max_steps=1e7)
Testing the dataset in isolation
I'd strongly encourage you to test your dataset independently of your estimator. Using the above get_inputs, it should be as simple as
batch_size = 4
shuffle_size = 100
features, labels = get_inputs(batch_size, shuffle_size)
with tf.Session() as sess:
f_data, l_data = sess.run([features, labels])
print(f_data, l_data) # or some better visualization function
Performance
Assuming your using a GPU to run your network, unless each row of your csv file is enormous and your network is tiny you probably won't notice a difference in performance. This is because the Estimator implementation forces data loading/preprocessing to be performed on the CPU, and prefetch means the next batch can be prepared on the CPU as the current batch is training on the GPU. The only exception to this is if you have a massive shuffle size on a dataset with a large amount of data per record, which will take some time to load in a number of examples initially before running anything through the GPU.
I agree with DomJack about using the Dataset API, except the need to read the whole csv file and then convert to TfRecord. I am hereby proposing to emply TextLineDataset - a sub-class of the Dataset API to directly load data into a TensorFlow program. An intuitive tutorial can be found here.
The code below is used for the MNIST classification problem for illustration and hopefully, answer the question of the OP. The csv file has 784 columns, and the number of classes is 10. The classifier I used in this example is a 1-hidden-layer neural network with 16 relu units.
Firstly, load libraries and define some constants:
# load libraries
import tensorflow as tf
import os
# some constants
n_x = 784
n_h = 16
n_y = 10
# path to the folder containing the train and test csv files
# You only need to change PATH, rest is platform independent
PATH = os.getcwd() + '/'
# create a list of feature names
feature_names = ['pixel' + str(i) for i in range(n_x)]
Secondly, we create an input function reading a file using the Dataset API, then provide the results to the Estimator API. The return value must be a two-element tuple organized as follows: the first element must be a dict in which each input feature is a key, and then a list of values for the training batch, and the second element is a list of labels for the training batch.
def my_input_fn(file_path, batch_size=32, buffer_size=256,\
perform_shuffle=False, repeat_count=1):
'''
Args:
- file_path: the path of the input file
- perform_shuffle: whether the data is shuffled or not
- repeat_count: The number of times to iterate over the records in the dataset.
For example, if we specify 1, then each record is read once.
If we specify None, iteration will continue forever.
Output is two-element tuple organized as follows:
- The first element must be a dict in which each input feature is a key,
and then a list of values for the training batch.
- The second element is a list of labels for the training batch.
'''
def decode_csv(line):
record_defaults = [[0.]]*n_x # n_x features
record_defaults.insert(0, [0]) # the first element is the label (int)
parsed_line = tf.decode_csv(records=line,\
record_defaults=record_defaults)
label = parsed_line[0] # First element is the label
del parsed_line[0] # Delete first element
features = parsed_line # Everything but first elements are the features
d = dict(zip(feature_names, features)), label
return d
dataset = (tf.data.TextLineDataset(file_path) # Read text file
.skip(1) # Skip header row
.map(decode_csv)) # Transform each elem by applying decode_csv fn
if perform_shuffle:
# Randomizes input using a window of 256 elements (read into memory)
dataset = dataset.shuffle(buffer_size=buffer_size)
dataset = dataset.repeat(repeat_count) # Repeats dataset this # times
dataset = dataset.batch(batch_size) # Batch size to use
iterator = dataset.make_one_shot_iterator()
batch_features, batch_labels = iterator.get_next()
return batch_features, batch_labels
Then, the mini-batch can be computed as
next_batch = my_input_fn(file_path=PATH+'train1.csv',\
batch_size=batch_size,\
perform_shuffle=True) # return 512 random elements
Next, we define the feature columns are numeric
feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]
Thirdly, we create an estimator DNNClassifier:
classifier = tf.estimator.DNNClassifier(
feature_columns=feature_columns, # The input features to our model
hidden_units=[n_h], # One layer
n_classes=n_y,
model_dir=None)
Finally, the DNN is trained using the test csv file, while the evaluation is performed on the test file. Please change the repeat_count and steps to ensure that the training meets the required number of epochs in your code.
# train the DNN
classifier.train(
input_fn=lambda: my_input_fn(file_path=PATH+'train1.csv',\
perform_shuffle=True,\
repeat_count=1),\
steps=None)
# evaluate using the test csv file
evaluate_result = classifier.evaluate(
input_fn=lambda: my_input_fn(file_path=PATH+'test1.csv',\
perform_shuffle=False))
print("Evaluation results")
for key in evaluate_result:
print(" {}, was: {}".format(key, evaluate_result[key]))

Categories

Resources