I have got training datasets, which are xtrain, ytrain, xtest and ytest. They are all numpy arrays. I want to save them together into a file, so that I can load them into workspace as done in keras for mnist.load_data:
(xtrain, ytrain), (xtest, ytest) = mnist.load_data(filepath)
In python, is there any way to save my training datasets into such a single file? Or is there any other appreciate methods to save them?
You have a number of options:
npz
hdf5
pickle
Keras provides an option to save models to hdf5. Also, note that of the three, hdf5 is the only interoperable format.
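For the mnist-style load in the question, a minimal npz sketch could look like this (the file name and key names are my own choice, and xtrain, ytrain, xtest, ytest are the arrays from the question):
import numpy as np

# save all four arrays into one file
np.savez('dataset.npz', xtrain=xtrain, ytrain=ytrain, xtest=xtest, ytest=ytest)

# load them back in a later session
with np.load('dataset.npz') as data:
    xtrain, ytrain = data['xtrain'], data['ytrain']
    xtest, ytest = data['xtest'], data['ytest']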
Pickle is a good way to go:
import pickle as pkl
# to save it
with open("train.pkl", "wb") as f:
    pkl.dump([train_x, train_y], f)
# to load it
with open("train.pkl", "rb") as f:
    train_x, train_y = pkl.load(f)
If your dataset is huge, I would recommend checking out hdf5, as @Lukasz Tracewski mentioned.
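If you do go the hdf5 route, a minimal sketch with h5py might look like this (the file and dataset names are my own choice, and the four arrays are assumed to be those from the question):
import h5py

# save all four arrays into one hdf5 file
with h5py.File('dataset.h5', 'w') as f:
    f.create_dataset('xtrain', data=xtrain)
    f.create_dataset('ytrain', data=ytrain)
    f.create_dataset('xtest', data=xtest)
    f.create_dataset('ytest', data=ytest)

# load them back as numpy arrays
with h5py.File('dataset.h5', 'r') as f:
    xtrain, ytrain = f['xtrain'][:], f['ytrain'][:]
    xtest, ytest = f['xtest'][:], f['ytest'][:]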
I find hickle is a very nice way to save them all together into a dict:
import hickle as hkl
data = {'xtrain': xtrain, 'xtest': xtest, 'ytrain': ytrain, 'ytest': ytest}
hkl.dump(data, 'data.hkl')
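Loading it back is symmetric; assuming the file written above:
data = hkl.load('data.hkl')
xtrain, xtest = data['xtrain'], data['xtest']
ytrain, ytest = data['ytrain'], data['ytest']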
You could simply use numpy.save
np.save('xtrain.npy', xtrain)
or in a human readable format
np.savetxt('xtrain.txt', xtrain)
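Loading is the mirror image, one array per file as in the lines above:
import numpy as np

xtrain = np.load('xtrain.npy')      # for the binary .npy file
xtrain = np.loadtxt('xtrain.txt')   # for the human-readable text file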
Related
I have a standard directory structure of train, validation, test, and each contain class subdirectories.
...
|train
|class A
|1
|1_1.raw
|1_2.raw
...
|2
...
|class B
...
|test
...
I want to use the flow_from_directory API, but all I can find is an ImageDataGenerator, and the files I have are raw numpy arrays (generated with arr.tofile(...)).
Is there an easy way to use ImageDataGenerator with a custom file loader?
I'm aware of flow_from_dataframe, but that doesn't seem to accomplish what I want either; it's for reading images with more custom organization. I want a simple way to load raw binary files instead of having to re-encode 100,000s of files into jpgs with some precision loss along the way (and wasted time, etc.).
Tensorflow is an entire ecosystem with IO capabilities, and ImageDataGenerator is one of the least flexible approaches. Read here on How to Load Numpy Data in Tensorflow.
import tensorflow as tf
import numpy as np
DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'
path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
    train_examples = data['x_train']
    train_labels = data['y_train']
    test_examples = data['x_test']
    test_labels = data['y_test']
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))
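For the raw files in the original question, a hedged sketch could wire np.fromfile into tf.data via tf.numpy_function. The dtype, shape, glob pattern, and helper names below are assumptions; adjust them to match how arr.tofile(...) was called and how the directories are laid out:
import glob
import numpy as np
import tensorflow as tf

RAW_DTYPE = np.float32    # assumption: dtype used when writing with arr.tofile(...)
RAW_SHAPE = (64, 64, 1)   # assumption: shape of each stored array

def load_raw(path):
    # path arrives as bytes inside tf.numpy_function, so decode it for numpy
    arr = np.fromfile(path.decode(), dtype=RAW_DTYPE).reshape(RAW_SHAPE)
    return arr.astype(np.float32)

def make_example(path, label):
    image = tf.numpy_function(load_raw, [path], tf.float32)
    image.set_shape(RAW_SHAPE)
    return image, label

# derive integer labels from the class subdirectory names
paths = sorted(glob.glob('train/*/*/*.raw'))
class_names = sorted({p.split('/')[1] for p in paths})
labels = [class_names.index(p.split('/')[1]) for p in paths]

train_ds = (tf.data.Dataset.from_tensor_slices((paths, labels))
            .shuffle(len(paths))
            .map(make_example, num_parallel_calls=4)
            .batch(32))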
Is there a way to save the execution state of a particular block of code in a notebook so that I don't have to run it again, and can continue with the rest of the code after reloading?
For example,
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
(train_images1, train_labels), (test_images1, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images1 / 255.0, test_images1 / 255.0
# My CNN model, up to the training
# Save up to here.
Can I save the execution up to this point for later use, including the downloaded files and the trained model?
Save Model:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")
print("Saved model .......")
Load saved Model:
from tensorflow.keras.models import model_from_json

json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
loaded_model.load_weights("model.h5")
print("Loaded model...........")
For more details, you can find my implementation here; it saves both the dataset and the trained model.
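As a hedged alternative to the JSON-plus-weights split above, Keras can also write the architecture, weights, and optimizer state into a single file, which is convenient for resuming a notebook session (the file name here is my own choice):
# save everything (architecture, weights, optimizer state) in one file
model.save('full_model.h5')

# later, in a fresh session
import tensorflow as tf
loaded_model = tf.keras.models.load_model('full_model.h5')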
There is! You can save your NumPy data using numpy.save("train_images.npy", train_images) and load it using train_images = numpy.load("train_images.npy"). When working with notebooks, just put the save and load calls in two different cells and run whichever cell you need.
Documentation:
save
load
There are many variations like savez to save multiple arrays in an uncompressed file or savez_compressed for compressed files.
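For the CIFAR-10 arrays in the question, for instance, a compressed bundle could look like this (the file and key names are my own choice):
import numpy as np

# save all four arrays into one compressed file
np.savez_compressed('cifar10_normalized.npz',
                    train_images=train_images, train_labels=train_labels,
                    test_images=test_images, test_labels=test_labels)

# reload in a later session without re-downloading or re-normalizing
with np.load('cifar10_normalized.npz') as data:
    train_images, train_labels = data['train_images'], data['train_labels']
    test_images, test_labels = data['test_images'], data['test_labels']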
I have a large dataset that I would like to use for training in Tensorflow.
The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.
Currently I use a Keras Sequence based generator object to train, but I'd like to move entirely to Tensorflow without Keras.
I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.
My first idea was this
import glob
import tensorflow as tf
import numpy as np
def get_data_from_filename(filename):
    npdata = np.load(open(filename))
    return npdata['features'], npdata['labels']
# get files
filelist = glob.glob('*.npz')
# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds.flat_map(get_data_from_filename)
However, this passes a TF Tensor placeholder to a real numpy function and numpy is expecting a standard string. This results in the error:
File "test.py", line 6, in get_data_from_filename
npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found
The other option I'm considering (though it seems messy) is to create a Dataset object built on TF placeholders, which I then fill during my epoch-batch loop from my numpy files.
Any suggestions?
You can define a wrapper and use tf.py_func like this:
def get_data_from_filename(filename):
    npdata = np.load(filename)
    return npdata['features'], npdata['labels']

def get_data_wrapper(filename):
    # Assuming here that both your data and labels are float type.
    features, labels = tf.py_func(
        get_data_from_filename, [filename], (tf.float32, tf.float32))
    return tf.data.Dataset.from_tensor_slices((features, labels))
# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_wrapper)
If your dataset is very large and you have memory issues, you can consider using a combination of interleave or parallel_interleave and from_generator instead. The from_generator method uses py_func internally, so you can read your npz files directly and define your generator in Python.
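A hedged from_generator sketch along those lines (the output dtypes and shapes below are assumptions; set them to whatever your features and labels actually are):
import glob
import numpy as np
import tensorflow as tf

def npz_generator():
    # stream one compressed file at a time instead of loading everything at once
    for filename in glob.glob('*.npz'):
        with np.load(filename) as npdata:
            for feature, label in zip(npdata['features'], npdata['labels']):
                yield feature, label

ds = tf.data.Dataset.from_generator(
    npz_generator,
    output_types=(tf.float32, tf.float32),
    output_shapes=(tf.TensorShape([None, None, 3]), tf.TensorShape([])))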
With the old input-pipeline API I can do:
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
and then pass the filenames to other queue, for example:
reader = tf.TFRecordReader()
_, serialized_example = reader.read_up_to(filename_queue, n)
How can I achieve similar behaviour with the Dataset -API?
The tf.data.TFRecordDataset() expects a tensor of file names in a fixed order.
Start reading them in order, shuffle right after:
BUFFER_SIZE = 1000 # arbitrary number
# define filenames somewhere, e.g. via glob
dataset = tf.data.TFRecordDataset(filenames).shuffle(BUFFER_SIZE)
EDIT:
The input pipeline of this question gave me an idea on how to implement filename shuffling with the Dataset API:
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.shuffle(BUFFER_SIZE) # doesn't need to be big
dataset = dataset.flat_map(tf.data.TFRecordDataset)
dataset = dataset.map(decode_example, num_parallel_calls=5) # add your decoding logic here
# further processing of the dataset
This will put all the data of one file before the data of the next, and so on. The files are shuffled, but the data inside each file is produced in its original order.
You can alternatively replace dataset.flat_map with interleave to process multiple files at the same time and return samples from each:
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=4)
Note: interleave does not actually run in multiple threads; it's a round-robin operation. For true parallel processing, see parallel_interleave.
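A sketch of that, assuming a TF 1.x release where parallel_interleave still lives under tf.contrib.data (it later moved to tf.data.experimental):
dataset = dataset.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,   # number of files read in parallel
        sloppy=True))     # allow non-deterministic ordering for extra throughput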
The current Tensorflow version (v1.5 in 02/2018) does not seem to support filename shuffling natively in the Dataset API. Here is a simple workaround using numpy:
import numpy as np
import tensorflow as tf
myShuffledFileList = np.random.choice(myInputFileList, size=len(myInputFileList), replace=False).tolist()
dataset = tf.data.TFRecordDataset(myShuffledFileList)
I'm slightly confused about how to save a trained classifier. Re-training a classifier each time I want to use it is obviously really bad and slow, so how do I save it and then load it again when I need it? Code is below; thanks in advance for your help. I'm using Python with the NLTK Naive Bayes classifier.
classifier = nltk.NaiveBayesClassifier.train(training_set)
# look inside the classifier train method in the source code of the NLTK library
def train(labeled_featuresets, estimator=nltk.probability.ELEProbDist):
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    return NaiveBayesClassifier(label_probdist, feature_probdist)
To save:
import pickle
f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
To load later:
import pickle
f = open('my_classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
I went through the same problem, and you cannot save the object since it is an ELEFreqDistr NLTK class. Anyhow, NLTK is terribly slow. Training took 45 minutes on a decent set, so I decided to implement my own version of the algorithm (run it with pypy or rename it .pyx and install cython). It takes about 3 minutes with the same set and it can simply save data as JSON (I'll implement pickle, which is faster/better).
I started a simple github project, check out the code here
To retrain the pickled classifier:
import pickle
import nltk

f = open('originalnaivebayes5k.pickle', 'rb')
classifier = pickle.load(f)
classifier.train(training_set)
print('Accuracy:', nltk.classify.accuracy(classifier, testing_set)*100)
f.close()