In tensorflow2.0, I want to train a skip-gram model with nce loss. tf.data.Dataset.from_tensor_slices() is not suitable because the input file is really huge. So I write a dataset generator class like this:
class DataSet:
""""""
def __init__(self, args, vocab):
self.args = args
self.vocab = vocab
def generator(self):
"""a generator function, it will return skip-gram sample or cbow sample"""
with open(self.args.input) as f_input:
for line in tqdm.tqdm(f_input.readlines()):
tokens = line.strip().split()
tokens_indices = self.vocab.indices(tokens)
for index, target_word in enumerate(tokens_indices):
context_words = list()
begin = index - self.args.window_size if index - self.args.window_size > 0 else 0
end = index + 1 + self.args.window_size if index + self.args.window_size + 1 < len(tokens_indices) else len(
tokens_indices)
context_words.extend(tokens_indices[begin:index])
context_words.extend(tokens_indices[index + 1:end])
if self.args.cbow > 0:
yield context_words, target_word
else:
for i in range(len(context_words)):
yield target_word, context_words[i]
def dataset(self):
"""Using tf.data.Dataset.from_generator() to return sample"""
if self.args.cbow:
dataset = tf.data.Dataset.from_generator(
self.generator,
(tf.int32, tf.int32),
(tf.TensorShape([None]), tf.TensorShape([]))
)
else:
dataset = tf.data.Dataset.from_generator(
self.generator,
(tf.int32, tf.int32),
(tf.TensorShape([]), tf.TensorShape([]))
)
return dataset
Then I test my code with follow:
dataset = DataSet(args, vocab).dataset()
iterator = dataset.make_one_shot_iterator()
for batch, (x,y) in enumerate(dataset.batch(128)):
pass
print(batch, x.shape, y.shape)
But it cost a lot of time to iterate all lines(about 10 minutes / 15000 lines in MacBook pro 2012). Does there any methods can speed up the code?
If you are working with large datasets then TFRecord is suitable option. It uses a binary file format for storage of your data and can have a significant impact on the performance of your import pipeline and as a consequence on the training time of your model. Binary data takes up less space on disk, takes less time to copy and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to the much lower read/write performance in comparison with SSDs.
However, pure performance isn’t the only advantage of the TFRecord file format. It is optimized for use with Tensorflow in multiple ways. To start with, it makes it easy to combine multiple datasets and integrates seamlessly with the data import and preprocessing functionality provided by the library. Especially for datasets that are too large to be stored fully in memory this is an advantage as only the data that is required at the time (e.g. a batch) is loaded from disk and then processed. Another major advantage of TFRecords is that it is possible to store sequence data — for instance, a time series or word encodings — in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data.
Would recommend to go through this official link for glimpse on TFRecord. Also you can go through this link on how to build TFRecord pipeline.
Here is a simple example of writing the serialized record using TFRecordWriter and then loading it in TFRecordDatset
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)
def write_date_tfrecord():
#writes 10 dummy values to replicate the issue
Output = [20191221 + x for x in range(0,10)]
print("Writing Output - ", Output)
example = tf.train.Example(
features = tf.train.Features(
feature = {
'Output':tf.train.Feature(float_list=tf.train.FloatList(value=Output))
}
))
writer = tf.io.TFRecordWriter("Output.tf_record")
writer.write(example.SerializeToString())
def parse_function(serialized_example):
features = {
'Output': tf.io.FixedLenSequenceFeature([], tf.float32,allow_missing=True)
}
features = tf.io.parse_single_example(serialized=serialized_example, features=features)
Output = features['Output']
return Output
def dataset_generator():
trRecordDataset = tf.data.TFRecordDataset("Output.tf_record")
trRecordDataset = trRecordDataset.map(parse_function, num_parallel_calls = tf.data.experimental.AUTOTUNE)
return trRecordDataset
if __name__ == '__main__':
write_date_tfrecord()
generator = dataset_generator()
for Output in generator:
print(Output)
Output -
2.2.0
Writing Output - [20191221, 20191222, 20191223, 20191224, 20191225, 20191226, 20191227, 20191228, 20191229, 20191230]
tf.Tensor(
[20191220. 20191222. 20191224. 20191224. 20191224. 20191226. 20191228.
20191228. 20191228. 20191230.], shape=(10,), dtype=float32)
Hope this answers your question. Happy Learning.
Related
First of all, I would like to say that this is my first question in stackOverflow, so I hope that the question as a whole respects the rules. I realize that the question is a bit long, but I would like to provide as much background and detail as possible .
I am currently developing a real-time image binary classification system based on Tensorflow 2.8.0 and I am quite new at it. Here are some of the peculiarities of the data that I have for the mentioned project:
Too big to fit in memory: I have more than 200 GB of data. Keep in mind that I have labeled only a small portion of it, but I want to write code that could manage the whole dataset in the future.
Some files are not directly compatible with Tensorflow: I have .FITS and .FIT files that cannot be opened directly with Tensorflow. Due to this issue, I use a library called Astropy to open these files.
The classes are very unbalanced.
After reading the official documentation and tutorials, I thought that, in order to load, preprocess and feed data to my CNN, the best option was to build an input pipeline using the tf.data.Dataset class due to the ease of opening FITS files. My general procedure follows this idea:
Get a list of file paths and split it into train, val and test partitions if desired.
Create a tf.data.Dataset with the from_tensor_slices() method
Shuffle the data (before the heavier reading and image processing operations)
Read and process every path with map()
Batch and prefetch
Here are some code fragments in case they help to understand my goal:
(...)
import config as cfg # Custom .py file
import tensorflow as tf
# x_train, x_val and x_test are previously split file paths lists
train_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_train])
val_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_val])
test_ds = tf.data.Dataset.from_tensor_slices([str(p) for p in x_test])
train_ds = configure_tf_ds(train_ds)
val_ds = configure_tf_ds(val_ds)
test_ds = configure_tf_ds(test_ds)
def configure_tf_ds(self, tf_ds, buf_size):
# reshuffle_each_iteration=True ensures that data is shuffled each time it is iterated
tf_ds = tf_ds.shuffle(buffer_size=cfg.SHUFFLE_BUF_SIZE, seed=cfg.seed, reshuffle_each_iteration=True)
tf_ds = tf_ds.map(lambda x: tf.py_function(self.process_path, [x], [self.img_dtpye, self.label_dtype]))
tf_ds = tf_ds.batch(self.batch_size)
tf_ds = tf_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
return tf_ds
def process_path(self, file_path):
# Labels are extracted from the file path, not relevant for my problem
label = get_label(file_path)
path = bytes.decode(file_path.numpy()).lower()
img = None
# Open and process images depending on their file paths' extension: FITS, FIT, JPG
if "fit" in path:
img = tf.py_function(func=self.decode_fits, inp=[file_path], Tout=self.img_dtpye)
else:
img = tf.py_function(func=self.decode_img, inp=[file_path], Tout=self.img_dtpye)
return img, label
model.fit(train_ds, epochs=50, validation_data=val_ds)
# Then, I would like to obtain predictions, plot results, and so on but knowing which file paths I am working with
(...)
Following the previous idea, I have successfully created and tested different types of pipelines for different types of partitions of my dataset: unlabeled (remember that only a portion of the data is labeled), labeled and weighted labeled (I wanted to see if my models improve by specifying class weights when training).
However, in order to monitor results and make proper adjustments to my model, I would like to retrieve the usual predictions, real labels and images next to the file paths preserving the ability to shuffle the data after every epoch.
I have managed to solve my question if I do not shuffle data with .shuffle(reshuffle_each_iteration=True), but models' performance is supposed to increase if data is shuffled after each epoch, according to several sources.
I have read different posts in stackOverflow related to my question. I will list those posts next to the problems that I have found for my particular use case:
Solution 1: My dataset cannot be fed to the model as X, y because it is a tf.data.Dataset
Solution 2: I want to obtain the image and the label too.
Solution 3: This works, but it would not respect the expected tf.data.Dataset format in the future .fit() call as stated here:
A tf.data dataset. Should return a tuple of either (inputs, targets)
or (inputs, targets, sample_weights)
I have also tried to keep a separate tf.data.Dataset with only the file paths but if I call the shuffle method with the reshuffle_each_iteration=True option in both tf.data.Dataset instances, the order of their elements does not match even if I set the same seed.
In short, is it possible to achieve what I want? If so, how should I proceed?
Thank you very much in advance.
Preprocess your data into three TFRecord files, one each for training, testing, and validation. Then you can shuffle and never cross records between the sets. This also speeds up data loading and can be done once and reused many times while playing with hyperparameters.
Here is an example of how you can preprocess and split your data. Your actual dataset data will have a different structure, this example has "encdata", a 2048-wide vector of vggface2 face encoding data. This assumes you have a single directory of data, with subdirectories named for a class and containing all the files for that class.
import tensorflow as tf
import numpy as np
import pickle
import sys
import os
# 80% to training, 10% to testing, 10% to validation
validation_portion = 1
testing_portion = 1
training_portion = 8
file_cycle_total = validation_portion + testing_portion + training_portion
# Where to store the TFRecord files
training_tfrecord_path = '/var/tmp/xtraining_tfrecords.tfr'
testing_tfrecord_path = '/var/tmp/xtesting_tfrecords.tfr'
validation_tfrecord_path = '/var/tmp/xvalidation_tfrecords.tfr'
# Where we keep the encodings
FACELIB_DIR='/aimiassd/Datasets/LabeledAstroFaces'
# Get list of all classes from all facelib dirs
classNames = sorted([x for x in os.listdir(FACELIB_DIR) if os.path.isdir(os.path.join(FACELIB_DIR,x)) and not x.startswith('.')])
classStrToInt = dict([(x,i) for i,x in enumerate(classNames)])
print('Found %d different classNames for labels\n' % len(classNames))
# Create our record writers
train_file_writer = tf.io.TFRecordWriter(training_tfrecord_path)
test_file_writer = tf.io.TFRecordWriter(testing_tfrecord_path)
val_file_writer = tf.io.TFRecordWriter(validation_tfrecord_path)
# Create a dataset of filenames of every enc2048 file in the facelibraries
cnt_records_written = [0,0,0]
for CN in classNames:
class_int = classStrToInt[CN]
# Get a list of all the encoding files
encfiles = sorted(filter((lambda x: x.endswith('.enc2048')), os.listdir(os.path.join(FACELIB_DIR, CN))))
# For each encoding file, read the encoding data and write it to the various tfrecords
for i, F in enumerate(encfiles):
file_path = os.path.join(FACELIB_DIR,CN,F)
with open(file_path,'rb') as fin:
encdata,_ = pickle.loads(fin.read()) # encodings, source_image_name
# Turn encdata into a tf.train.Example and serialize it for writing
record_bytes = tf.train.Example(features=tf.train.Features(feature={
"x": tf.train.Feature(float_list=tf.train.FloatList(value=encdata)),
"y": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_int])),
})).SerializeToString()
# Write it out with the appropriate record writer
remainder = i % file_cycle_total
if remainder < validation_portion:
val_file_writer.write(record_bytes)
cnt_records_written[2] += 1
elif remainder < validation_portion + testing_portion:
test_file_writer.write(record_bytes)
cnt_records_written[1] += 1
else:
train_file_writer.write(record_bytes)
cnt_records_written[0] += 1
print('Writing records done.')
print('Wrote %d training, %d testing, %d validation records' %
(cnt_records_written[0], cnt_records_written[1], cnt_records_written[2]) )
train_file_writer.close()
test_file_writer.close()
val_file_writer.close()
print('Reading data back out...')
# Function to turn a serialized TFRecord back into a tf.train.Example
def decode_fn(record_bytes):
return tf.io.parse_single_example(
# Data
record_bytes,
# Schema
{"x": tf.io.FixedLenFeature([2048], dtype=tf.float32),
"y": tf.io.FixedLenFeature([], dtype=tf.int64)}
)
# Read and deserialize the datasets
train_ds = tf.data.TFRecordDataset([training_tfrecord_path]).map(decode_fn)
test_ds = tf.data.TFRecordDataset([ testing_tfrecord_path]).map(decode_fn)
validation_ds = tf.data.TFRecordDataset([validation_tfrecord_path]).map(decode_fn)
# Use a dataset
count = 0
for batch in tf.data.TFRecordDataset([training_tfrecord_path]).map(decode_fn):
print(batch)
count +=1
if count > 4:
sys.exit(0)
print('Done.')
Note how as the data is being process into TFRecords, it is alternately being written into the three datasets. Verify and Testing entries are written first, to ensure classes with very small amounts of samples still get something into the verify and testing datasets. This is controlled by the variables at the top, validation_portion, testing_portion, and training_portion, adjust per your preferences.
Finally, at the end, the TFRecords are re-read and used to build three new tf.data.Dataset, which can be fed to model.fit() and friends. The example code just prints four records to show the data is of the correct, original shape.
I am trying to build an input pipeline for an active learning application.
The script I wrote is loading in the data, augmenting and then caching it.
I cache the whole dataset once and then split it up in an unlabeled and a labeled dataset, from where on I transfer samples from one dataset to the other in every iteration.
After a few iteration the training gets extremely slow. The training time per sample is quadrupling after about 20 iterations, which is unacceptable for my model. I can see via tensorboard and the used computer ressources that lots of the time the training is idling and GPU is not even used but the CPU and Sysmem usage are really high.
Has anybody had the same problem and can tell me how to fix it? Or any detailled information about how the tensorflow caching is working?
Thanks!
Edit: I split up the dataset using .take() and .skip() and transfer samples using .concatenate()
Edit: these are the three methods I use, they are simplified though. indexlist is the list of samples that are to be transfered from unlabeled to labeled dataset
def input_pipeline()
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(__convert_tf_records, num_parallel_calls=1)
dataset_bounding_boxes = dataset.map(get_lung_bb,num_parallel_calls=1)
if not is_training:
datasets['bounding_box'] = dataset_bounding_boxes
if params.repeat > 1 and is_training:
dataset = dataset.repeat(count=params.repeat)
dataset_bounding_boxes = dataset_bounding_boxes.repeat(count=params.repeat)
if is_training:
dataset = dataset.map(data_aug_rot, num_parallel_calls=1)
dataset = dataset.map(data_aug_zoom, num_parallel_calls=1)
dataset = dataset.map(data_aug_elastic_deform, num_parallel_calls=1)
dataset = tf.data.Dataset.zip((dataset, dataset_bounding_boxes))
dataset = dataset.map(get_random_patches_fn, num_parallel_calls=1)
dataset = dataset.unbatch()
if params.b_shuffle and is_training:
dataset = dataset.shuffle(buffer_size=params.batch_size*5)
dataset = dataset.batch(params.batch_size)
if params.b_cache and is_training:
cache_path = os.path.join(experiment_dir, "cache")
if not os.path.exists(cache_path):
os.makedirs(cache_path)
dataset = dataset.cache(cache_path + '/')
dataset = dataset.prefetch(buffer_size=params.batch_size)
datasets['patches'] = dataset
return datasets
def get_trainingdatasets():
initial_dataset = input_pipeline()
dataset_all=initial_dataset['patches']
dataset_all=dataset_all.unbatch()
for element in dataset_all:
dataset_size+=1
dataset_unlabeled=dataset_all.skip( int(dataset_size*args.ds_labeled_perc) ).batch(params.batch_size)
dataset_labeled=dataset_all.take( int(dataset_size* args.ds_labeled_perc) )
dataset_labeled= dataset_labeled.shuffle(buffer_size=500, reshuffle_each_iteration=True).batch(params.batch_size)
return dataset_labeled, dataset_unlabeled
def transfer_samples():
dataset_unlabeled = dataset_unlabeled.unbatch()
dataset_labeled = dataset_labeled.unbatch()
for idx in index_list:
element_add=dataset_unlabeled.skip(idx).take(1)
dataset_add_to_labeled = dataset_add_to_labeled.concatenate(element_add)
dataset_labeled = dataset_labeled.concatenate(dataset_add_to_labeled)
dataset_labeled= dataset_labeled.shuffle(buffer_size=params.batch_size*5)
dataset_labeled = dataset_labeled.batch(params.batch_size).prefetch(buffer_size=params.batch_size)
dataset_unlabeled = dataset_unlabeled.batch(params.batch_size).prefetch(buffer_size=params.batch_size)
return dataset_labeled, dataset_unlabeled
I'm trying to create a Dataset object in tensorflow 1.14 (I have some legacy code that i can't change for this specific project) starting from numpy arrays, but everytime i try i get everything copied on my graph and for this reason when i create an event log file it is huge (719 MB in this case).
Originally i tried using this function "tf.data.Dataset.from_tensor_slices()", but it didn't work, then i read it is a common problem and someone suggested me to try with generators, thus i tried with the following code, but again i got a huge event file (719 MB again)
def fetch_batch(x, y, batch):
i = 0
while i < batch:
yield (x[i,:,:,:], y[i])
i +=1
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255
training_dataset = tf.data.Dataset.from_generator(fetch_batch,
args=[images, np.int32(labels), batch_size], output_types=(tf.float32, tf.int32),
output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape)))
file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
I know in this case I could use tensorflow_datasets API and it would be easier, but this is a more general question, and it involves how to create datasets in general, not only using the mnist one.
Could you explain to me what am i doing wrong? Thank you
I guess it's because you are using args in from_generator. This will surely put the provided args in the graph.
What you could do is define a function that will return a generator that will iterate through your set, something like (haven't tested):
def data_generator(images, labels):
def fetch_examples():
i = 0
while True:
example = (images[i], labels[i])
i += 1
i %= len(labels)
yield example
return fetch_examples
This would give in your example:
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255
training_dataset = tf.data.Dataset.from_generator(data_generator(images, labels), output_types=(tf.float32, tf.int32),
output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape))).batch(batch_size)
file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())
Note that I changed fetch_batch to fetch_examples since you probably want to batch using the dataset utilities (.batch).
I'm trying to implement a custom data generator that reads data from csv file(s) in chunks using pandas.read_csv. I tested it with model.predict_generator but the number of predictions returned is less than expected (in my case, 248192 out of 253457).
Custom generator
class TestDataGenerator:
def __init__(self, directory, batch_size=1024):
self.directory = directory
self.batch_size = batch_size
self.chunk_size=10000
self.samples = 0
def _to_movie_id(self, ids):
ids = ast.literal_eval(ids)
if ids == []:
return [EMB_MATRIX_SIZE-1]
else:
return [movie2idx[str(movie_id)] for movie_id in ids]
def generate(self):
csv_files = glob.glob(self.directory + '/*.csv')
while True:
for file in csv_files:
df = pd.read_csv(file, chunksize=self.chunk_size)
for df_chunk in df:
chunk_steps = math.ceil(len(df_chunk) / self.batch_size)
for i in range(chunk_steps):
batch = df_chunk[i * self.batch_size:(i + 1) * self.batch_size]
X_batch, y_batch = self.preprocess(batch)
self.samples += len(batch)
yield X_batch, y_batch
def preprocess(self, df):
X_user = df['user'].apply(lambda x: user2idx[str(x)]).values
X_watched = df['watched'].apply(self._to_movie_id).values
X_watched_padded = pad_sequences(X_watched, maxlen=SEQ_LENGTH, value=0)
ohe = df['movie'].apply(lambda x: to_categorical(movie2idx[x], num_classes=len(movie2idx)))
X = [X_user, X_watched_padded]
y = np.array([o.tolist() for o in ohe])
return X, y
Run model.predict_generator
batch_size=1024
n_samples_test = 253457
test_dir = 'folder/'
test_gen = TestDataGenerator(test_dir, batch_size=batch_size)
next_test_gen = test_gen.generate()
preds = model.predict_generator(next_test_gen, steps=math.ceil(n_samples_test/batch_size))
After running model.predict_generator, the number of rows for preds is 248192 which is less than the actual 253457. It looks like it's missing a few number of epochs. I also tested generate individually without interacting with Keras and it behaved as expected returning the correct number of samples in csv file. Also, before the generate yields a value, I keep track of the number of samples processed with samples. Surprisingly, the value for samples is 250000. So, I'm pretty sure I might have done something with Keras.
Note that I also tried setting max_queue_size=1, and making generate thread-safe but got no luck. I placed only 1 csv file under test_dir for simplicity. I'm using Keras 2.1.2-tf embedded in Tensorflow 1.5.0.
I did some research on how this can be done but haven't come across a useful example yet. What is wrong with this implementation?
Thanks
Peeranat F.
Well, this is tricky. So let's dive into the problem:
How fit_generator works when batch provided is less than batch_size: As you may see - many batches you provide to fit_generator are of the size less than batch_size. This happens every time when you take the last batch from every file. Usually - a number of texts are not divisible by batch size so there are not enough texts to fill the batch. This ends up in feeding less examples to a model.
And here is a tricky part - keras ignores less size, treats this as valid generator step and returns values for an incomplete batch.
So why there are texts missing: let me show you by example. Let's assume that you have 2 files with 5 texts each and your batch_size is 4. This is how your batches would look like:
[1t1, 1t2, 1t3, 1t4], [1t5,], [2t1, 2t2, 2t3, 2t4], [2t5].
As you may see - the actual number of steps needed is equal to 4 which is not equal to 3 which is obtained by taking: math.ceil(10 / 4). This way is appropriate for these batches:
[1t1, 1t2, 1t3, 1t4], [1t5, 2t1, 2t2, 2t3], [2t4, 2t5]
But batches returned from your generator are not like these.
How to solve the problem? - you need to make your generator to compute the actual number of steps needed:
def steps_needed(self):
steps = 0
csv_files = glob.glob(self.directory + '/*.csv')
for file in csv_files:
df = pd.read_csv(file, chunksize=self.chunk_size)
for df_chunk in df:
chunk_steps = math.ceil(len(df_chunk) / self.batch_size)
steps += chunk_steps
return steps
This function computes exactly how many batches your generator will return.
Cheers :)
I have a huge training CSV file (709M) and a large testing CSV file (125M) that I want to send into a DNNClassifier in the context of using the high-level Tensorflow API.
It appears that the input_fn param accepted by fit and evaluate must hold all feature and label data in memory, but I currently would like to run this on my local machine, and thus expect it to run out of memory rather quickly if I read these files into memory and then process them.
I skimmed the doc on streamed-reading of data, but the sample code for reading CSVs appears to be for the low-level Tensorflow API.
And - if you'll forgive a bit of whining - it seems overly-complex for the trivial use case of sending well-prepared files of training and test data into an Estimator ... although, perhaps that level of complexity is actually required for training and testing large volumes of data in Tensorflow?
In any case, I'd really appreciate an example of using that approach with the high-level API, if it's even possible, which I'm beginning to doubt.
After poking around, I did manage to find DNNClassifier#partial_fit, and will attempt to use it for training.
Examples of how to use this method would save me some time, though hopefully I'll stumble into the correct usage in the next few hours.
However, there doesn't seem to be a corresponding DNNClassifier#partial_evaluate ... though I suspect that I could break-up the testing data into smaller pieces and run DNNClassifier#evaluate successively on each batch, which might actually be a great way to do it since I could segment the testing data into cohorts, and thereby obtain per-cohort accuracy.
==== Update ====
Short version:
DomJack's recommendation should be the accepted answer.
However, my Mac's 16GB of RAM enough for it to hold the entire 709Mb training data set in memory without crashing. So, while I will use the DataSets feature when I eventually deploy the app, I'm not using it yet for local dev work.
Longer version:
I started by using the partial_fit API as described above, but upon every use it emitted a warning.
So, I went to look at the source for the method here, and discovered that its complete implementation looks like this:
logging.warning('The current implementation of partial_fit is not optimized'
' for use in a loop. Consider using fit() instead.')
return self.fit(x=x, y=y, input_fn=input_fn, steps=steps,
batch_size=batch_size, monitors=monitors)
... which reminds me of this scene from Hitchhiker's Guide:
Arthur Dent: What happens if I press this button?
Ford Prefect: I wouldn't-
Arthur Dent: Oh.
Ford Prefect: What happened?
Arthur Dent: A sign lit up, saying 'Please do not press this button again'.
Which is to say: partial_fit seems to exist for the sole purpose of telling you not to use it.
Furthermore, the model generated by using partial_fit iteratively on training file chunks was much smaller than the one generated by using fit on the whole training file, which strongly suggests that only the last partial_fit training chunk actually "took".
Check out the tf.data.Dataset API. There are a number of ways to create a dataset. I'll outline four - but you'll only have to implement one.
I assume each row of your csv files is n_features float values followed by a single int value.
Creating a tf.data.Dataset
Wrap a python generator with Dataset.from_generator
The easiest way to get started is to wrap a native python generator. This can have performance issues, but may be fine for your purposes.
def read_csv(filename):
with open(filename, 'r') as f:
for line in f.readlines():
record = line.rstrip().split(',')
features = [float(n) for n in record[:-1]]
label = int(record[-1])
yield features, label
def get_dataset():
filename = 'my_train_dataset.csv'
generator = lambda: read_csv(filename)
return tf.data.Dataset.from_generator(
generator, (tf.float32, tf.int32), ((n_features,), ()))
This approach is highly versatile and allows you to test your generator function (read_csv) independently of TensorFlow.
Use Tensorflow Datasets API
Supporting tensorflow versions 1.12+, tensorflow datasets is my new favourite way of creating datasets. It automatically serializes your data, collects statistics and makes other meta-data available to you via info and builder objects. It can also handle automatic downloading and extracting making collaboration simple.
import tensorflow_datasets as tfds
class MyCsvDatasetBuilder(tfds.core.GeneratorBasedBuilder):
VERSION = tfds.core.Version("0.0.1")
def _info(self):
return tfds.core.DatasetInfo(
builder=self,
description=(
"My dataset"),
features=tfds.features.FeaturesDict({
"features": tfds.features.Tensor(
shape=(FEATURE_SIZE,), dtype=tf.float32),
"label": tfds.features.ClassLabel(
names=CLASS_NAMES),
"index": tfds.features.Tensor(shape=(), dtype=tf.float32)
}),
supervised_keys=("features", "label"),
)
def _split_generators(self, dl_manager):
paths = dict(
train='/path/to/train.csv',
test='/path/to/test.csv',
)
# better yet, if the csv files were originally downloaded, use
# urls = dict(train=train_url, test=test_url)
# paths = dl_manager.download(urls)
return [
tfds.core.SplitGenerator(
name=tfds.Split.TRAIN,
num_shards=10,
gen_kwargs=dict(path=paths['train'])),
tfds.core.SplitGenerator(
name=tfds.Split.TEST,
num_shards=2,
gen_kwargs=dict(cvs_path=paths['test']))
]
def _generate_examples(self, csv_path):
with open(csv_path, 'r') as f:
for i, line in enumerate(f.readlines()):
record = line.rstrip().split(',')
features = [float(n) for n in record[:-1]]
label = int(record[-1])
yield dict(features=features, label=label, index=i)
Usage:
builder = MyCsvDatasetBuilder()
builder.download_and_prepare() # will only take time to run first time
# as_supervised makes output (features, label) - good for model.fit
datasets = builder.as_dataset(as_supervised=True)
train_ds = datasets['train']
test_ds = datasets['test']
Wrap an index-based python function
One of the downsides of the above is shuffling the resulting dataset with a shuffle buffer of size n requires n examples to be loaded. This will either create periodic pauses in your pipeline (large n) or result in potentially poor shuffling (small n).
def get_record(i):
# load the ith record using standard python, return numpy arrays
return features, labels
def get_inputs(batch_size, is_training):
def tf_map_fn(index):
features, labels = tf.py_func(
get_record, (index,), (tf.float32, tf.int32), stateful=False)
features.set_shape((n_features,))
labels.set_shape(())
# do data augmentation here
return features, labels
epoch_size = get_epoch_size()
dataset = tf.data.Dataset.from_tensor_slices((tf.range(epoch_size,))
if is_training:
dataset = dataset.repeat().shuffle(epoch_size)
dataset = dataset.map(tf_map_fn, (tf.float32, tf.int32), num_parallel_calls=8)
dataset = dataset.batch(batch_size)
# prefetch data to CPU while GPU processes previous batch
dataset = dataset.prefetch(1)
# Also possible
# dataset = dataset.apply(
# tf.contrib.data.prefetch_to_device('/gpu:0'))
features, labels = dataset.make_one_shot_iterator().get_next()
return features, labels
In short, we create a dataset just of the record indices (or any small record ID which we can load entirely into memory). We then do shuffling/repeating operations on this minimal dataset, then map the index to the actual data via tf.data.Dataset.map and tf.py_func. See the Using with Estimators and Testing in isolation sections below for usage. Note this requires your data to be accessible by row, so you may need to convert from csv to some other format.
TextLineDataset
You can also read the csv file directly using a tf.data.TextLineDataset.
def get_record_defaults():
zf = tf.zeros(shape=(1,), dtype=tf.float32)
zi = tf.ones(shape=(1,), dtype=tf.int32)
return [zf]*n_features + [zi]
def parse_row(tf_string):
data = tf.decode_csv(
tf.expand_dims(tf_string, axis=0), get_record_defaults())
features = data[:-1]
features = tf.stack(features, axis=-1)
label = data[-1]
features = tf.squeeze(features, axis=0)
label = tf.squeeze(label, axis=0)
return features, label
def get_dataset():
dataset = tf.data.TextLineDataset(['data.csv'])
return dataset.map(parse_row, num_parallel_calls=8)
The parse_row function is a little convoluted since tf.decode_csv expects a batch. You can make it slightly simpler if you batch the dataset before parsing.
def parse_batch(tf_string):
data = tf.decode_csv(tf_string, get_record_defaults())
features = data[:-1]
labels = data[-1]
features = tf.stack(features, axis=-1)
return features, labels
def get_batched_dataset(batch_size):
dataset = tf.data.TextLineDataset(['data.csv'])
dataset = dataset.batch(batch_size)
dataset = dataset.map(parse_batch)
return dataset
TFRecordDataset
Alternatively you can convert the csv files to TFRecord files and use a TFRecordDataset. There's a thorough tutorial here.
Step 1: Convert the csv data to TFRecords data. Example code below (see read_csv from from_generator example above).
with tf.python_io.TFRecordWriter("my_train_dataset.tfrecords") as writer:
for features, labels in read_csv('my_train_dataset.csv'):
example = tf.train.Example()
example.features.feature[
"features"].float_list.value.extend(features)
example.features.feature[
"label"].int64_list.value.append(label)
writer.write(example.SerializeToString())
This only needs to be run once.
Step 2: Write a dataset that decodes these record files.
def parse_function(example_proto):
features = {
'features': tf.FixedLenFeature((n_features,), tf.float32),
'label': tf.FixedLenFeature((), tf.int64)
}
parsed_features = tf.parse_single_example(example_proto, features)
return parsed_features['features'], parsed_features['label']
def get_dataset():
dataset = tf.data.TFRecordDataset(['data.tfrecords'])
dataset = dataset.map(parse_function)
return dataset
Using the dataset with estimators
def get_inputs(batch_size, shuffle_size):
dataset = get_dataset() # one of the above implementations
dataset = dataset.shuffle(shuffle_size)
dataset = dataset.repeat() # repeat indefinitely
dataset = dataset.batch(batch_size)
# prefetch data to CPU while GPU processes previous batch
dataset = dataset.prefetch(1)
# Also possible
# dataset = dataset.apply(
# tf.contrib.data.prefetch_to_device('/gpu:0'))
features, label = dataset.make_one_shot_iterator().get_next()
estimator.train(lambda: get_inputs(32, 1000), max_steps=1e7)
Testing the dataset in isolation
I'd strongly encourage you to test your dataset independently of your estimator. Using the above get_inputs, it should be as simple as
batch_size = 4
shuffle_size = 100
features, labels = get_inputs(batch_size, shuffle_size)
with tf.Session() as sess:
f_data, l_data = sess.run([features, labels])
print(f_data, l_data) # or some better visualization function
Performance
Assuming your using a GPU to run your network, unless each row of your csv file is enormous and your network is tiny you probably won't notice a difference in performance. This is because the Estimator implementation forces data loading/preprocessing to be performed on the CPU, and prefetch means the next batch can be prepared on the CPU as the current batch is training on the GPU. The only exception to this is if you have a massive shuffle size on a dataset with a large amount of data per record, which will take some time to load in a number of examples initially before running anything through the GPU.
I agree with DomJack about using the Dataset API, except the need to read the whole csv file and then convert to TfRecord. I am hereby proposing to emply TextLineDataset - a sub-class of the Dataset API to directly load data into a TensorFlow program. An intuitive tutorial can be found here.
The code below is used for the MNIST classification problem for illustration and hopefully, answer the question of the OP. The csv file has 784 columns, and the number of classes is 10. The classifier I used in this example is a 1-hidden-layer neural network with 16 relu units.
Firstly, load libraries and define some constants:
# load libraries
import tensorflow as tf
import os
# some constants
n_x = 784
n_h = 16
n_y = 10
# path to the folder containing the train and test csv files
# You only need to change PATH, rest is platform independent
PATH = os.getcwd() + '/'
# create a list of feature names
feature_names = ['pixel' + str(i) for i in range(n_x)]
Secondly, we create an input function reading a file using the Dataset API, then provide the results to the Estimator API. The return value must be a two-element tuple organized as follows: the first element must be a dict in which each input feature is a key, and then a list of values for the training batch, and the second element is a list of labels for the training batch.
def my_input_fn(file_path, batch_size=32, buffer_size=256,\
perform_shuffle=False, repeat_count=1):
'''
Args:
- file_path: the path of the input file
- perform_shuffle: whether the data is shuffled or not
- repeat_count: The number of times to iterate over the records in the dataset.
For example, if we specify 1, then each record is read once.
If we specify None, iteration will continue forever.
Output is two-element tuple organized as follows:
- The first element must be a dict in which each input feature is a key,
and then a list of values for the training batch.
- The second element is a list of labels for the training batch.
'''
def decode_csv(line):
record_defaults = [[0.]]*n_x # n_x features
record_defaults.insert(0, [0]) # the first element is the label (int)
parsed_line = tf.decode_csv(records=line,\
record_defaults=record_defaults)
label = parsed_line[0] # First element is the label
del parsed_line[0] # Delete first element
features = parsed_line # Everything but first elements are the features
d = dict(zip(feature_names, features)), label
return d
dataset = (tf.data.TextLineDataset(file_path) # Read text file
.skip(1) # Skip header row
.map(decode_csv)) # Transform each elem by applying decode_csv fn
if perform_shuffle:
# Randomizes input using a window of 256 elements (read into memory)
dataset = dataset.shuffle(buffer_size=buffer_size)
dataset = dataset.repeat(repeat_count) # Repeats dataset this # times
dataset = dataset.batch(batch_size) # Batch size to use
iterator = dataset.make_one_shot_iterator()
batch_features, batch_labels = iterator.get_next()
return batch_features, batch_labels
Then, the mini-batch can be computed as
next_batch = my_input_fn(file_path=PATH+'train1.csv',\
batch_size=batch_size,\
perform_shuffle=True) # return 512 random elements
Next, we define the feature columns are numeric
feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]
Thirdly, we create an estimator DNNClassifier:
classifier = tf.estimator.DNNClassifier(
feature_columns=feature_columns, # The input features to our model
hidden_units=[n_h], # One layer
n_classes=n_y,
model_dir=None)
Finally, the DNN is trained using the test csv file, while the evaluation is performed on the test file. Please change the repeat_count and steps to ensure that the training meets the required number of epochs in your code.
# train the DNN
classifier.train(
input_fn=lambda: my_input_fn(file_path=PATH+'train1.csv',\
perform_shuffle=True,\
repeat_count=1),\
steps=None)
# evaluate using the test csv file
evaluate_result = classifier.evaluate(
input_fn=lambda: my_input_fn(file_path=PATH+'test1.csv',\
perform_shuffle=False))
print("Evaluation results")
for key in evaluate_result:
print(" {}, was: {}".format(key, evaluate_result[key]))