I have over 100,000 files, each containing more than 20 examples; the number of samples per file differs. How can I create an iterator with a batch size of ~10 in Chainer without having to pre-load all the files into memory?
I think you can use the DatasetMixin class to define your own dataset.
You can override the get_example(i) method to extract the i-th example, which lets you load a file only when its data is actually requested inside get_example(i).
However, you still need some "pre-indexing", meaning that you need to define which i-th example corresponds to which file.
Below are references on how to define your own DatasetMixin class.
Reference:
- Chainer v3 tutorial for beginner (Japanese)
- Create dataset class from your own data with DatasetMixin
See the official example, which uses DatasetMixin to load images on demand:
https://github.com/chainer/chainer/blob/master/examples/imagenet/train_imagenet.py#L39
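Putting that together, here is a minimal sketch of such a DatasetMixin subclass. The file pattern, the file format, and the count_examples helper are assumptions for illustration; only the pre-built index lives in memory, while each file is loaded on demand.
import glob
import os

import chainer
import numpy as np

class LazyFileDataset(chainer.dataset.DatasetMixin):
    def __init__(self, root_dir):
        # "Pre-indexing": build a flat list of (file_path, row_within_file)
        # pairs so that the i-th example can be located later.
        # count_examples is a hypothetical helper that reports how many
        # examples a file holds without loading everything into memory.
        self._index = []
        for path in sorted(glob.glob(os.path.join(root_dir, '*.npy'))):
            self._index.extend((path, j) for j in range(count_examples(path)))

    def __len__(self):
        return len(self._index)

    def get_example(self, i):
        # The file is only read when its data is actually requested.
        path, j = self._index[i]
        return np.load(path)[j]

# dataset = LazyFileDataset('./data')
# it = chainer.iterators.SerialIterator(dataset, batch_size=10)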
I have eight folders with 1300 CSV files (3*50) in each folder; each folder represents a label, but I have no idea how to feed my data into a training model.
I am still a beginner with CNNs.
A part of my CSV file can be accessed using this link.
When using Keras, you can use the tf.data.Dataset API, which helps you do what you want to achieve.
Example
Here is some example code I took from one of my projects:
import tensorflow as tf

# matching a glob pattern; note that list_files already shuffles the file order
# (`name` comes from the surrounding project code)
dataset_pro_raw = tf.data.Dataset.list_files([f"./aclImdb/{name}/pos/*.txt"], shuffle=True)
dataset_pro_i = dataset_pro_raw.interleave(
    lambda file: tf.data.TextLineDataset(file),
    # how many files should be processed concurrently
    cycle_length=20,
    # number of parallel threads to increase performance
    num_parallel_calls=10
)
First, we create a file list with tf.data.Dataset.list_files(); note that the order of the files is already shuffled there. Then, via dataset_pro_raw.interleave(), we iterate through the file set and read the content of the files with tf.data.TextLineDataset().
That way you can load data from multiple .txt files, or any other data source, very efficiently. It is a bit clumsy to use at the beginning, but it has real advantages. Currently I only use tf.data.Dataset for training-data generation.
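If you then want mini-batches out of that pipeline, you can chain the usual transformations onto the interleaved dataset. A minimal sketch; the buffer and batch sizes are just examples, and tf.data.AUTOTUNE assumes a recent TensorFlow 2 release:
dataset = (
    dataset_pro_i
    .shuffle(buffer_size=1000)       # shuffle lines across the interleaved files
    .batch(32)                       # group lines into batches of 32
    .prefetch(tf.data.AUTOTUNE)      # overlap data loading with training
)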
For more information on tf.data.Dataset you might want to check out this link.
Disclaimer: In the past, I've predominantly used PyTorch, hence my reasoning is in accordance with how things are done in PyTorch as well.
I have a large database (MySQL) which I want to load as a dataset. It is not feasible to keep this dataset in memory at all times, hence it needs to be loaded lazily/on demand. My plan is to instantiate a Dataset object from a range of row IDs and then retrieve the corresponding rows. This is much like how you would use file names/paths for large files such as images, which you would then load that way. The issue with this method is that I can only retrieve one row per worker thread, meaning that I have to issue a SELECT query for each. I found that storing a batch in a table and issuing a JOIN as if it were a foreign key is orders of magnitude faster.
My first thought was to apply a map operation over each batch, which would require me to call a function of that kind after I obtain the batch from the dataset. In PyTorch, I would be able to define all this behaviour in a class that inherits from its Dataset class, which I think is a cleaner way to do it and encapsulates this behaviour. Is there any way to (neatly) do this within TensorFlow?
Bonus points if someone can conjure up a method that is perfectly encapsulated from the user (the user does not know how the dataset is internally stored and tracked), yet conforms to the TensorFlow API (i.e. a callable class to be used as a generator for tf.data.Dataset.from_generator()).
Edit: In PyTorch, a common implementation is as follows (which I consider to be "neat" and encapsulated).
import torch.utils.data

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, row_ids):
        # Store row ids, do any pre-processing if necessary.
        self.row_ids = row_ids

    def __getitem__(self, item):
        # From the item (may be several), join all corresponding
        # database rows and apply post-processing.
        ...
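For reference, here is one (not perfectly encapsulated) shape such a class could take on the TensorFlow side, wrapping the batched-JOIN idea behind a callable for tf.data.Dataset.from_generator(). The class name, the fetch_rows helper, and the output signature are hypothetical.
import tensorflow as tf

class RowDataset:
    """Callable wrapper that yields whole batches, one query per batch."""

    def __init__(self, row_ids, batch_size=32):
        self.row_ids = list(row_ids)
        self.batch_size = batch_size

    def __call__(self):
        # fetch_rows is a hypothetical helper that issues a single
        # JOIN-style query for the whole batch of row ids.
        for start in range(0, len(self.row_ids), self.batch_size):
            yield fetch_rows(self.row_ids[start:start + self.batch_size])

# ds = tf.data.Dataset.from_generator(
#     RowDataset(row_ids),
#     output_signature=tf.TensorSpec(shape=(None, None), dtype=tf.float32),
# )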
I am dealing with a relatively large dataset (>400 GB) for analytics purposes but have somewhat limited memory (256 GB). I am using Python. So far I have been using pandas on a subset of the data, but it is becoming obvious that I need a solution that allows me to access data from the entire dataset.
A little bit about the data: right now it is spread over a set of flat files that are pandas dataframes. The columns are organized around two keys. The primary key, let's call it "record", is what I want to use to access the data; the secondary key is basically the row number within the primary key. As in, I want to access row 2 in record "A".
The dataset is used for training a NN (Keras/TF). The task is to partition the entire set into train/dev/test by record, and then pass the data to training/prediction generators (I implement keras.utils.Sequence(), which I have to do because the data consists of variable-length sequences that need to be padded for batch learning).
Given my desire to pass examples to the NN as fast as possible and my inability to store all of the examples in memory, should I use a database (MongoDB, SQLite, or something else?) and query examples as needed, or should I continue to store things in flat files and load/delete them (and hope that the Python garbage collector works)?
Another complication is that there are about 3 million "records". Right now the pandas dataframes store them in batches of ~10k, but it would benefit me to split training/test/validation randomly, which means I really need to be able to access some, but not all, of the records in a particular batch. In pandas this seems hard (as far as I know, I need to read an entire flat file to access a particular record, since I don't know in which chunk of the file the data is located); on the other hand, I don't think generating 3 million individual files is smart either.
A further complication is that the model is relatively simple, and due to various bottlenecks I am unable to saturate my compute power during training, so if I could stream the training data to several different models, that would help with hyperparameter search; otherwise I am wasting cycles.
What do you think is the correct (fast, simple) back-end to handle my data needs?
Best,
Ilya
This is a good use case for writing a custom generator, then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator.
Lots of the reshaping and lookup/loading is all custom to my problem, but you see the pattern.
import numpy as np

# `merged`, `npf`, and `labels` come from the surrounding project code.
msk = np.random.rand(len(merged)) < 0.8
train = merged[msk]
valid = merged[~msk]

def train_generator(batch_size):
    while True:
        # Draw a fresh random batch on every iteration.
        sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = [x.reshape(x.shape[0], x.shape[1]) for x in sample_data]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)
It essentially yields batch_size samples every time it is iterated. Keras requires your generator to yield a tuple of length 2, where the first element is your data in the expected shape (whatever your neural network input shape is) and the second element is the labels, also in the expected shape (N_classes, or whatever).
Here's another relatively useful link regarding generators, which may help you determine when you've truly exhausted all examples. My generator just samples randomly, but the dataset is sufficiently large that I don't care.
https://github.com/keras-team/keras/issues/7729#issuecomment-324627132
Don't forget to write a validation_generator as well, which reads from some set of files or dataframes that you have randomly set aside for validation purposes.
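Such a validation generator is not shown in the original answer; here is a minimal sketch, assuming it mirrors train_generator but samples from the valid split defined above:
def valid_generator(batch_size):
    while True:
        # Same lookup/loading as train_generator, but drawing from `valid`.
        sample_rows = valid[valid['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)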
Lastly, here's calling the generator:
model.fit_generator(train_generator(32),
                    samples_per_epoch=10000, nb_epoch=20,
                    validation_data=valid_generator(32), validation_steps=500)
Depending on the Keras version, you may find that the argument names have changed slightly, but a few searches should get you fixed up quickly.
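For example, in Keras 2 the equivalent call would look roughly like this; steps_per_epoch counts batches rather than samples, so the value is divided by the batch size (an illustrative sketch, not a drop-in replacement):
model.fit_generator(train_generator(32),
                    steps_per_epoch=10000 // 32, epochs=20,
                    validation_data=valid_generator(32), validation_steps=500)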
When I try to create a word2vec model (skip-gram with negative sampling), I receive 3 files as output, as follows:
word2vec (file)
word2vec.syn1neg.npy (NPY file)
word2vec.wv.syn0.npy (NPY file)
I am just worried why this happens, as in my previous word2vec test examples I only received one model file (no .npy files).
Please help me.
Models with larger internal vector-arrays can't be saved via Python 'pickle' to a single file, so beyond a certain threshold, the gensim save() method will store subsidiary arrays in separate files, using the more-efficient raw format of numpy arrays (.npy format).
You still load() the model by just specifying the root model filename; when the subsidiary arrays are needed, the loading code will find the side files – as long as they're kept beside the root file. So when moving a model elsewhere, be sure to keep all files with the same root filename together.
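A minimal sketch of the save/load round trip, assuming gensim's Word2Vec class and a corpus called sentences:
from gensim.models import Word2Vec

model = Word2Vec(sentences, sg=1, negative=5)  # skip-gram with negative sampling
model.save("word2vec")                         # may also write word2vec.*.npy side files

# Load by the root filename only; the .npy side files must stay in the
# same directory, under their original names, for this to work.
model = Word2Vec.load("word2vec")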
I am new to Python. I need to handle the following task: I have multiple test images for each unit, distributed across different subfolders.
For instance, I have these images:
/folder/subfolder1/../i1.png
/folder/subfolder2/../i2.png
/folder/subfolder3/../i3.png
....
....
/folder/subfolder100/../i100.png
I can read all the image files and create a list object. The next step is to render all of them in a 10x10 matrix layout, with each matrix element showing the particular ix.png. Preferably, below each image there would be a caption with its own name, ix.
How can I do that?
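One common way to do this is with matplotlib; here is a minimal sketch, assuming the images can be collected with a recursive glob (the pattern, figure size, and caption logic are illustrative):
import glob
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Collect up to 100 images from the subfolders (hypothetical glob pattern).
paths = sorted(glob.glob('/folder/**/*.png', recursive=True))[:100]

fig, axes = plt.subplots(10, 10, figsize=(20, 20))
for ax, path in zip(axes.flat, paths):
    ax.imshow(mpimg.imread(path))
    # Caption each cell with the file name minus its extension, e.g. "i1".
    ax.set_title(path.split('/')[-1].rsplit('.', 1)[0], fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()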