Batch data prefetching using a queue on sharded data files - Python

I generated about 500 sharded numpy data files, each of which contains about 10000 data samples (e.g., an image and its label), for example:
file-000001.npy
file-000002.npy
file-000003.npy
...
file-000500.npy
Each .npy file contains a numpy dictionary whose keys and shapes are {'image': 10000x3x512x64 (dtype=np.float32), 'label': 10000x100 (dtype=np.float32)}. Please note that some of these files contain fewer than 10000 samples, say 8111.
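For reference, assuming each shard was written with np.save on a plain Python dict, reading one shard back and checking its actual sample count would look roughly like this:

import numpy as np

# np.save on a dict stores a 0-d object array, so allow_pickle=True and .item()
# are needed to recover the dict; the shape check handles shards with fewer samples.
shard = np.load('file-000001.npy', allow_pickle=True).item()
num_samples = shard['image'].shape[0]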
During training, each epoch needs to iterate over all 500x10000 samples. The data cannot all be loaded into memory at once due to capacity limits, so a common solution is a data prefetching queue.
My idea is as follows: (1) record all the filenames and the number of samples in each file; (2) for each batch, compute the batch indices, then work out which data files need to be loaded into memory to read the samples that compose the batch.
During step (2), if the batch size is 256, in the worst case we might have to open 256 different files and read just one sample from each to compose the batch. This would be slow and impractical.
With the queue, the data loading would run on background threads, and every batch that has been read is stored in the queue (its capacity can be a large number, depending on available memory). The background threads keep reading data to fill the queue whenever it has free space.
Is this hard to implement? I've searched on Google, and there seem to be more advanced solutions such as caching or mmap, but I'm not familiar with them. Are there any simple examples of this?
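For concreteness, here is a minimal sketch of this scheme (the shard list, batch size, queue capacity, and thread count below are only illustrative choices): whole shards are loaded one at a time, shuffled at the shard level and then within each shard so that a batch never touches more than one file, and background threads keep a bounded queue filled whenever it has free space.

import queue
import threading
import numpy as np

shard_files = ['file-%06d.npy' % i for i in range(1, 501)]
batch_size = 256
batch_queue = queue.Queue(maxsize=32)            # bound chosen to fit in memory

def loader(files):
    while True:
        np.random.shuffle(files)                 # shuffle the shard order each pass
        for fname in files:
            shard = np.load(fname, allow_pickle=True).item()
            n = shard['image'].shape[0]          # some shards hold fewer than 10000 samples
            order = np.random.permutation(n)     # shuffle within the shard
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                # put() blocks when the queue is full, so the loader pauses automatically
                batch_queue.put((shard['image'][idx], shard['label'][idx]))

for _ in range(2):                               # a couple of background loader threads
    threading.Thread(target=loader, args=(list(shard_files),), daemon=True).start()

# The training loop then just pulls ready-made batches:
# images, labels = batch_queue.get()

Batches here are drawn from a single shard, which trades perfect global shuffling for far fewer file reads; if more mixing is needed, several shards could be loaded and pooled before slicing.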

Related

What is the best way to access specific tensors in a shuffled queue?

I am processing video data in Python using TensorFlow and want to compute a loss that uses temporal information from the current frame and the frames before and after it. After I've read in the images, they are shuffled using tf.train.shuffle_batch, as is necessary for training. However, I later want to access the frames before and after the current one. Is there a way to access the specific tensors for those frames by maintaining (for want of a better phrase) a pointer to them?
At the moment I read in every frame three times: once for itself and once each for the frame before and after it, so that they can be shuffled together. This seems inefficient, since the same frame data is read and stored multiple times.
No, there is no other way than the one you already implemented. Shuffling uses a limited buffer in which items are stored and from which they are randomly sampled. If you shuffled individual frames, you would not even have the guarantee that the three frames are in the queue at the same time, let alone the possibility of knowing where they end up in the queue.
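To illustrate the pattern this answer endorses (grouping each triple into a single element before shuffling, so the frames cannot be separated), here is the same idea sketched with the tf.data API rather than tf.train.shuffle_batch; the dummy frames array is only a stand-in for decoded video frames.

import numpy as np
import tensorflow as tf

# Stand-in for decoded video frames: [num_frames, height, width, channels].
frames = np.zeros((100, 64, 64, 3), dtype=np.float32)

ds = tf.data.Dataset.from_tensor_slices(frames)
# Each element becomes a (previous, current, next) triple of shape [3, 64, 64, 3],
# so shuffling moves whole triples around, never individual frames.
triples = ds.window(3, shift=1, drop_remainder=True).flat_map(lambda w: w.batch(3))
batches = triples.shuffle(buffer_size=1000).batch(32)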

Convert a csv larger than RAM into parquet with Dask

I have approximately 60,000 small CSV files of varying sizes, from 1 MB to several hundred MB, that I would like to convert into a single Parquet file. The total size of all the CSVs is around 1.3 TB. This is larger than the memory of the server that I am using (678 GB available).
Since all the CSVs have the same fields, I've concatenated them into a single large file. I tried to process this file with Dask:
ddf = dd.read_csv("large.csv", blocksize="1G").to_parquet("large.pqt")
My understanding was that the blocksize option would prevent Dask from running out of memory when the job was split over multiple workers.
What happens is that eventually Dask does run out of memory and I get a bunch of messages like:
distributed.nanny - WARNING - Restarting worker
Is my approach completely wrong or am I just missing an important detail?
You don't have to concatenate all of your files into one large file. dd.read_csv is happy to accept a list of filenames, or a string with a "*" in it.
If you have text data in your CSV file, then loading it into pandas or dask dataframes can expand the amount of memory used considerably, so your 1GB chunks might be quite a bit bigger than you expect. Do things work if you use a smaller chunk size? You might want to consult this doc entry: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
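A sketch of that suggestion, with a hypothetical glob path and a smaller block size:

import dask.dataframe as dd

# No need to concatenate first: read_csv accepts a glob, and a smaller blocksize
# keeps each partition comfortably within a worker's memory.
ddf = dd.read_csv("csvs/*.csv", blocksize="100MB")
ddf.to_parquet("large.pqt")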
In general I recommend using Dask's dashboard to watch the computation, and see what is taking up your memory. This might help you find a good solution. https://docs.dask.org/en/latest/diagnostics-distributed.html

How does dask work for larger than memory datasets

Would anyone be able to tell me, in simple terms, how Dask works for larger-than-memory datasets? For example, I have a dataset which is 6 GB, with 4 GB of RAM and 2 cores. How would Dask go about loading the data and doing a simple calculation such as the sum of a column?
Does Dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces? Then, once asked to compute, does it bring the chunks into memory one by one and do the computation using each of the available cores? Am I right about this?
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern when loading the data.
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
Thus, each partition should comfortably fit into memory - not be too big - but the time it takes to load and process each should be much longer than the overhead imposed by scheduling the task to run on a worker (the latter <1ms) - not be too small.
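A minimal sketch of the column-sum example described above (the file path and column name are made up):

import dask.dataframe as dd

# Each ~100MB partition is loaded by a worker, its column summed, and the data
# discarded again; only the per-partition sums stay in memory until they are combined.
ddf = dd.read_csv("data/*.csv", blocksize="100MB")
total = ddf["amount"].sum().compute()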

Database vs flat files in python (need speed but can't fit in memory) to be used with generator for NN training

I am dealing with a relatively large dataset (>400 GB) for analytics purposes but have somewhat limited memory (256 GB). I am using python. So far I have been using pandas on a subset of the data but it is becoming obvious that I need a solution that allows me to access data from the entire dataset.
A little bit about the data. Right now the data is segregated over a set of flat files that are pandas dataframes. The files consist of columns with two keys: the primary key, let's call it "record", which I want to be able to use to access the data, and a secondary key, which is basically the row number within the primary key. As in, I want to access row 2 in record "A".
The dataset is used for training a NN (keras/tf). So the task is to partition the entire set into train/dev/test by record, and then pass the data to train/predict generators (I implement keras.utils.Sequence(), which I have to do because the data consists of variable-length sequences that need to be padded for batch learning).
Given my desire to pass examples to the NN as fast as possible and my inability to store all of the examples in memory, should I use a database (MongoDB or SQLite or something else?) and query examples as needed, or should I continue to store things in flat files and load/delete them (and hope that the Python garbage collector works)?
Another complication is that there are about 3 million "records". Right now the pandas dataframes store them in batches of ~10k, but it would benefit me to split training/test/validation randomly, which means I really need to be able to access some but not all of the records in a particular batch. In pandas this seems hard: as far as I know, I have to read the entire flat file to access a particular record, since I don't know in which chunk of the file the data is located. On the other hand, I don't think generating 3 million individual files is smart either.
A further complication is that the model is relatively simple and, due to various bottlenecks, I am unable to saturate my compute power during training, so if I could stream the data to several different models at once, that would help with hyperparameter search, since otherwise I am wasting cycles.
What do you think is the correct (fast, simple) back-end to handle my data needs?
Best,
Ilya
This is a good use case for writing a custom generator, then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator.
Lots of the reshaping and lookup/loading is custom to my problem, but you can see the pattern.
import numpy as np

msk = np.random.rand(len(merged)) < 0.8      # 80/20 split of the original dataframe
train = merged[msk]
valid = merged[~msk]

def train_generator(batch_size):
    while True:
        # Sample batch_size rows, load the corresponding .npy files, and draw 128
        # random rows from each array (the lookup and reshaping are custom to my data).
        sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = [x.reshape(x.shape[0], x.shape[1]) for x in sample_data]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)
It essentially yields batch_size samples each time the generator is advanced. Keras requires your generator to return a tuple of length 2, where the first element is your data in the expected shape (whatever your neural network's input shape is) and the second is the labels, also in the expected shape (N_classes, or whatever).
Here's another relatively useful link regarding generators, which may help you determine when you've truly exhausted all examples. My generator just randomly samples, but the dataset is sufficiently large that I don't care.
https://github.com/keras-team/keras/issues/7729#issuecomment-324627132
Don't forget to write a validation generator as well, which reads from some set of files or dataframes that you randomly put somewhere else, for validation purposes.
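A sketch of such a validation generator, mirroring train_generator above but sampling from the valid split (the file path and label lookup are still specific to my data), matching the valid_generator used in the call below:

def valid_generator(batch_size):
    while True:
        sample_rows = valid[valid['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)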
Lastly, here's calling the generator:
model.fit_generator(train_generator(32),
                    samples_per_epoch=10000, nb_epoch=20,
                    validation_data=valid_generator(32), validation_steps=500)
Depending on the Keras version, you may find the argument names have changed slightly, but a few searches should get you fixed up quickly.

I want to read a large amount of images for deep learning, but what is the solution for when there is insufficient memory?

In a deep learning program written in Python, I wanted to store a large amount of image data in a numpy array at once and extract random batches from that array, but the image data is too large and I run out of memory.
How should I deal with such cases? Do I have no choice but to do I/O and read the image data from storage every time I retrieve a batch?
File I/O would solve the issue, but it will slow down the learning process, since file I/O is a task that takes a long time.
However, you could try to implement a mixture of both using multithreading, e.g.
https://github.com/stratospark/keras-multiprocess-image-data-generator
(I do not know what kind of framework you are using).
Anyhow, back to the basic idea:
Pick some random files, read them, and start training. During training, start a second thread which reads out random files again. That way your learning thread does not have to wait for new data, since the training process might take longer than the reading process.
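A rough sketch of that double-buffering idea (the image directory and the training call are placeholders): while the model trains on the current file's data, a worker thread already reads the next random file.

import glob
import random
import numpy as np
from concurrent.futures import ThreadPoolExecutor

files = glob.glob('images/*.npy')                # placeholder image directory
executor = ThreadPoolExecutor(max_workers=1)

def load_random_file():
    return np.load(random.choice(files))

future = executor.submit(load_random_file)       # start reading the first file
for step in range(1000):
    data = future.result()                       # blocks only if reading is slower than training
    future = executor.submit(load_random_file)   # prefetch the next file in the background
    # train_step(data)                           # placeholder for the actual training call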
Some frameworks have this feature already implemented, check out:
https://github.com/fchollet/keras/issues/1627
or:
https://github.com/pytorch/examples/blob/master/mnist_hogwild/train.py
