I created an HDF5 file with 5000 groups, each containing 1000 datasets, using h5py. The datasets are 1-D arrays of integers, averaging about 10,000 elements, though this can vary from dataset to dataset. The datasets were created without the chunked storage option. The total size of the file is 281 GB. I want to iterate through all datasets to create a dictionary mapping each dataset name to the length of that dataset. I will use this dictionary in an algorithm later.
Here's what I've tried.
import h5py

f = h5py.File('/my/path/to/database.hdf5', 'r')
lens = {}
for group in f.itervalues():
    for dataset in group.itervalues():
        lens[dataset.name] = dataset.len()
This is too slow for my purposes and I'm looking for ways to speed up this procedure. I'm aware that it's possible to parallelize operations with h5py, but wanted to see if there was another approach before going down that road. I'd be open to re-creating the database with different options/structure if it would speed things up.
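For reference, one variant sometimes suggested is to walk the hierarchy with h5py's visititems(), which visits every group and dataset in a single pass and only touches metadata (shapes), never the 281 GB of data itself. This is a minimal sketch, not a guaranteed speed-up; much of the cost is HDF5 metadata I/O either way:

import h5py

lens = {}

def record_length(name, obj):
    # only datasets have a length; groups are skipped
    if isinstance(obj, h5py.Dataset):
        lens[obj.name] = obj.shape[0]

with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    f.visititems(record_length)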
High-level idea:
I need to iterate over, and perform fairly complex operations on, a large dataset (240 million rows) that I have chunked into SQL calls returning ~20 million records each. I can successfully pull each chunk into pandas, but the resulting dataframes are unwieldy and really need to be chunked or partitioned further before I can operate on them. Unfortunately, I cannot make the ingestion calls any smaller (the SQL calls go against AWS S3 via Spectrum and would require many costly scans against a non-partitioned column if I did).
Using Python, how can I efficiently chunk these large datasets further upon ingestion?
Specific details:
I have two primary columns to consider: ID and Date. The already-established chunks (of ~20 million rows each) correspond to months in the Date column. Within each ingested chunk, the operations I need to perform look like:
Sort the data by Date
Iterate through each ID, get a new dataset filtered to that ID
For each row in each ID's dataset, do some stuff
Said stuff will allow me to create a new dataset with one row per ID
...then eventually concatenate the results from all of the months. My inference is that if I can immediately partition each ingested 20-million-record chunk by ID, or by sets of IDs, I'm golden, but I don't know how to get there.
I could save each ID's data as a separate CSV, but then I'd need to iterate over the pandas dataframe (filtering and then saving), which isn't tenable. I've read about alternatives for scaling beyond pandas, such as dask, but it seems like dask doesn't really handle ingestion or setting up a big for loop as I need, but rather typical pandas-like data transformations. Having not worked with data of this size, I'm not sure what tools are available for approaching a problem like this in a Python environment.
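For reference, a hedged sketch of the "partition each ingested chunk by ID" idea described above; process_month and summarise_one_id are illustrative names, not anything from the actual pipeline:

import pandas as pd

def summarise_one_id(id_value, id_df):
    # placeholder for "do some stuff": here it just produces one row per ID
    return {'ID': id_value, 'n_rows': len(id_df)}

def process_month(chunk_df):
    # chunk_df is one ingested ~20-million-row month, already in memory
    chunk_df = chunk_df.sort_values('Date')
    results = []
    # groupby hands back one sub-frame per ID without writing anything to disk
    for id_value, id_df in chunk_df.groupby('ID', sort=False):
        results.append(summarise_one_id(id_value, id_df))
    return pd.DataFrame(results)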
You don't give a minimal reproducible example so I can't work with your specific data.
You can use generators, but chunks are easier and faster to implement so you are on the right track.
I usually chunk huge datasets like that:
chunk_size = 10000

for i in range(0, len(column), chunk_size):
    data_chunked = [j for j in column[i:i+chunk_size]]  # or any other data manipulation
    entries_to_sql = [data_chunked, *other_entries]  # if the others are not chunked: other_entry[i:i+chunk_size]
    insert_many_entries_to_sql(entries_to_sql)
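If the object being chunked is a pandas DataFrame rather than a plain list, the same pattern works with positional slicing via .iloc. A small sketch, with a stand-in frame in place of a real ingested month:

import pandas as pd

df = pd.DataFrame({'ID': range(100000), 'Date': range(100000)})  # stand-in for one ingested month

chunk_size = 10000
for i in range(0, len(df), chunk_size):
    sub = df.iloc[i:i + chunk_size]  # one positional slice of the frame
    # operate on / insert `sub` here, as with entries_to_sql above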
I need a way to efficiently store (in terms of both size and read speed) data using numpy arrays with mixed (heterogeneous) dtypes. Imagine a dataset that has 100M observations and 5 variables per observation (3 of which are int32 and 2 are float32).
I'm currently storing the data in two gzipped .npy files, one for the ints and one for the floats:
import numpy as np
import gzip as gz
with gz.open('array_ints.npy.gz', 'wb') as fObj:
    np.save(fObj, int_ndarray)

with gz.open('array_floats.npy.gz', 'wb') as fObj:
    np.save(fObj, flt_ndarray)
I've also tried storing the data as a structured array, but the final file size is roughly 25% larger than the combined size of storing the ints and floats separately. My data stretches into the TB range, so I'm looking for the most efficient way to store it (but I'd like to avoid changing compression algorithms to something like LZMA).
Is there another way to efficiently store different data types together, so I can read both back in at the same time? I'm starting to look into HDF5, but I'm not sure whether it can help.
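One simple option that keeps both arrays in a single file without changing compression algorithms is numpy's own compressed .npz container; compression is still zlib, so file sizes should be close to the gzipped .npy files. A sketch, reusing the int_ndarray and flt_ndarray from above:

import numpy as np

# write both arrays into one compressed archive
np.savez_compressed('arrays.npz', ints=int_ndarray, floats=flt_ndarray)

# read both back from the same file
with np.load('arrays.npz') as data:
    ints = data['ints']
    floats = data['floats']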
EDIT:
Ultimately, I ended up going down the HDF5 route with h5py. Relative to gzip-compressed .npy arrays, I actually see a 25% decrease in size using h5py, though this is mostly attributable to the shuffle filter rather than to HDF5 itself. More importantly, when saving two arrays in the same file, there is virtually no overhead relative to saving them as individual files.
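For reference, a minimal sketch of what that h5py route can look like; the file name and dataset names are illustrative, not the exact ones used:

import h5py

with h5py.File('arrays.h5', 'w') as f:
    f.create_dataset('ints', data=int_ndarray, compression='gzip', shuffle=True)
    f.create_dataset('floats', data=flt_ndarray, compression='gzip', shuffle=True)

with h5py.File('arrays.h5', 'r') as f:
    ints = f['ints'][:]
    floats = f['floats'][:]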
I realize that the original question was too broad, and sufficient answers can't be given without the specific format of the data and a representative sample (which I can't really disclose). For this reason, I'm closing the question.
I am dealing with a relatively large dataset (>400 GB) for analytics purposes, but have somewhat limited memory (256 GB). I am using Python. So far I have been using pandas on a subset of the data, but it is becoming obvious that I need a solution that lets me access data from the entire dataset.
A little bit about the data. Right now the data is segregated over a set of flat files that are pandas dataframes. The files consist of columns indexed by two keys: the primary key, let's call it "record", which I want to be able to use to access the data, and a secondary key, which is basically the row number within the primary key. As in, I want to access row 2 in record "A".
The dataset is used for training a NN (keras/tf). So the task is to partition the entire set into train/dev/test by record, and then pass the data to train/predict generators (I implement keras.utils.Sequence(), which I have to do because the data is variable-length sequences that need to be padded for batch learning).
Given my desire to pass examples to the NN as fast as possible and my inability to store all of the examples in memory, should I use a database (MongoDB, SQLite, or something else?) and query examples as needed, or should I continue to store things in flat files and load/delete them (and hope that the Python garbage collector does its job)?
Another complication is that there are about 3 million "records". Right now the pandas dataframes store them in batches of ~10k, but it would benefit me to split training/test/validation randomly, which means I really need to be able to access some, but not all, of the records in a particular batch. In pandas this seems hard (as far as I know, I would need to read an entire flat file to access a particular record, since I don't know which chunk of which file the data is in); on the other hand, I don't think generating 3 million individual files is smart either.
A further complication is that the model is relatively simple and, due to various bottlenecks, I am unable to saturate my compute power during training; so if I could stream the training data to several different models, that would help with hyperparameter search, since otherwise I am wasting cycles.
What do you think is the correct (fast, simple) back-end to handle my data needs?
Best,
Ilya
This is a good use case for writing a custom generator, then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator.
Much of the reshaping and lookup/loading is custom to my problem, but you can see the pattern.
import numpy as np

msk = np.random.rand(len(merged)) < 0.8
train = merged[msk]
valid = merged[~msk]

def train_generator(batch_size):
    while True:
        # draw a fresh random batch each time the generator is advanced
        sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = [x.reshape(x.shape[0], x.shape[1]) for x in sample_data]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)
It essentially returns batch_size samples each time the generator is advanced. Keras requires your generator to return a tuple of length 2, where the first element is your data in the expected shape (whatever your neural network's input shape is) and the second element is the labels, also mapped to the expected shape (N_classes, or whatever).
Here's another relatively useful link regarding generators, which may help you determine when you've truly exhausted all examples. My generator just samples randomly, but the dataset is sufficiently large that I don't care.
https://github.com/keras-team/keras/issues/7729#issuecomment-324627132
Don't forget to write a validation_generator as well, which reads from some set of files or dataframes that you have randomly set aside elsewhere, for validation purposes.
Lastly, here's calling the generator:
model.fit_generator(train_generator(32),
                    samples_per_epoch=10000, nb_epoch=20,
                    validation_data=valid_generator(32), validation_steps=500)
Depending on the Keras version, you may find the argument names have changed slightly, but a few searches should get you fixed up quickly.
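Since the question mentions keras.utils.Sequence, here is a minimal hedged sketch of that approach too; it assumes one .npy file per record and pads with zeros, which is a simplification of whatever padding logic the real pipeline needs:

import numpy as np
from tensorflow import keras  # on older installs this may just be `import keras`

class RecordSequence(keras.utils.Sequence):
    """Feeds batches of variable-length records, padded to a common length."""

    def __init__(self, record_paths, batch_size, pad_to):
        self.record_paths = record_paths  # e.g. one .npy file per record (illustrative layout)
        self.batch_size = batch_size
        self.pad_to = pad_to              # assumed >= the longest record

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.record_paths) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.record_paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        seqs = [np.load(p) for p in batch]  # each record is a variable-length 1-D sequence
        x = np.stack([np.pad(s, (0, self.pad_to - len(s))) for s in seqs])
        y = np.zeros(len(batch))  # placeholder labels; replace with a real label lookup
        return x, y

A Sequence like this can be passed to model.fit_generator (or model.fit on newer versions) in place of a plain generator.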
I have a large, sparse, multidimensional lookup table, where cells contain arrays varying in size from 34 kB to circa 10 MB (essentially one or more elements stored in this bin/bucket/cell). My prototype has dimensions of 30**5=24,300,000, of which only 4,568 cells are non-empty (so it's sparse). Prototype non-empty cells contain structured arrays with sizes between 34 kB and 7.5 MB. At 556 MB, the prototype is easily small enough to fit in memory, but the production version will be a lot larger; maybe 100–1000 times (it is hard to estimate). This growth will be mostly due to increased dimensions, rather than due to the data contained in individual cells. My typical use case is write once (or rarely), read often.
I'm currently using a Python dictionary, where the keys are tuples, i.e. db[(29,27,29,29,16)] is a structured numpy.ndarray of around 1 MB. However, as it grows, this won't fit in memory.
A natural and easy-to-implement extension would be the Python shelve module.
I think tables (PyTables) is fast, in particular for the write-once, read-often use case, but I don't think it fits my data structure.
Considering that I will always need access only by the tuple index, a very simple way to store it would be a directory with some thousands of files with names like entry-29-27-29-29-16, each of which stores the numpy.ndarray object in some format (NetCDF, HDF5, npy...).
I'm not sure if a classical database would work, considering that the size of the entries varies considerably.
What is a way to store a data structure as described above, that has efficient storage and a fast retrieval of data?
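For reference, a hedged sketch of the one-file-per-key layout described above, using plain .npy files; the directory name and helper names are illustrative:

import os
import numpy as np

STORE = 'lookup_store'  # illustrative directory name

def key_to_path(key):
    # (29, 27, 29, 29, 16) -> lookup_store/entry-29-27-29-29-16.npy
    return os.path.join(STORE, 'entry-' + '-'.join(str(k) for k in key) + '.npy')

def save_cell(key, arr):
    os.makedirs(STORE, exist_ok=True)
    np.save(key_to_path(key), arr)

def load_cell(key):
    return np.load(key_to_path(key))

# usage:
# save_cell((29, 27, 29, 29, 16), cell_array)
# cell = load_cell((29, 27, 29, 29, 16))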
From what I understand, you might want to look at the amazing pandas package, as it has a specific facility for the sparse data structure you've described.
Also, while this stackoverflow post doesn't specifically address sparse data, it's a great description of using pandas for BIG data, which may be of interest.
Best of luck!
I have a dictionary with many entries and a huge vector as each value. These vectors can be 60,000 dimensions long, and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (like 10.5 MB for a sample of 50 entries with fewer dimensions). I have also read about sparse matrices; as most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, when given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words across all sentences than appear in any one sentence, hence the many zeros. Then, I want to use these arrays to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right that I don't need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
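For reference, a minimal sketch of the sparse-matrix route with scipy.sparse; CountVectorizer is just one convenient way to build word counts as a CSR matrix, and the sentences here are illustrative:

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat", "the dog sat on the cat"]  # illustrative data

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)  # scipy CSR matrix, one row of word counts per sentence

# zeros are not stored, so the on-disk size tracks the number of non-zero counts
sparse.save_npz('word_counts.npz', X)
X_loaded = sparse.load_npz('word_counts.npz')

sklearn classifiers generally accept the sparse matrix directly, so there is no need to densify it before training.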
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
As the links above discuss, you could use klepto -- which provides the ability to easily store dictionaries to disk or to a database, using a common API. klepto also lets you pick a storage format (pickle, json, etc.) -- with HDF5 coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary in a single "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
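A rough sketch of what that can look like with klepto's dir_archive (one file per entry); exact keyword arguments may vary between klepto versions, so treat this as an assumption-laden example rather than exact API documentation:

import numpy as np
from klepto.archives import dir_archive

# a directory-backed archive: each dictionary entry is serialized to its own file
db = dir_archive('my_vectors', cached=True, serialized=True)

db['vec_0'] = np.zeros(60000, dtype=np.uint8)  # illustrative entry
db.dump()  # push the in-memory cache out to disk

# later, possibly in a different process
db2 = dir_archive('my_vectors', cached=True, serialized=True)
db2.load()  # pull the stored entries back into memory
print(db2['vec_0'].shape)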
With 60,000 dimensions, do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling about 3.35 GiB of data.
That data structure pickles to about the same size on disk, too.
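A small sketch of that layout; the vector length comes from the numbers above, and the key names are illustrative:

from array import array

N_WORDS = 60000  # vector length from the question

def new_vector():
    # one unsigned byte per element (type 'B'), fine for counts in the range 0..255
    return array('B', bytes(N_WORDS))

counts = {}
counts['sentence_0'] = new_vector()
counts['sentence_0'][42] += 1  # increment the count for word 42 in sentence 0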