Proper method to save serialized data incrementally - Python

This must be a very standard problem that also must have a standard solution:
What is the correct way to incrementally save feature vectors extracted from data, rather than accumulating all vectors from the entire dataset and then saving them all at once?
In more detail:
I have written a script for extracting custom text features (e.g. next_token, prefix-3, is_number) from text documents. After extraction is done I end up with one big list of scipy sparse vectors. Finally I pickle that list so it is space-efficient to store and time-efficient to load when I want to train a model. The problem is that I am limited by my RAM here: I can only make that list of vectors so big before it, or the pickling step, exceeds my RAM.
Of course, incrementally appending string representations of these vectors would be possible. One could accumulate k vectors, append them to a text file and clear the list again for the next k vectors. But storing the vectors as strings would be space-inefficient and would require parsing the representations upon loading. That does not sound like a good solution.
I could also pickle sets of k vectors and end up with a whole bunch of pickle-files of k vectors. But that sounds messy.
So this must be a standard problem with a more elegant solution. What is the right method to solve this? Is there perhaps even some existing functionality in scikit-learn for this kind of thing that I have overlooked?
I found this: How to load one line at a time from a pickle file?
But it does not work with Python 3.
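One approach that does work in Python 3 (offered here only as a sketch, not an existing scikit-learn utility) is to call pickle.dump repeatedly on the same open file and then read the chunks back one at a time until the file is exhausted. The file name and chunk size below are illustrative:

import pickle

CHUNK_SIZE = 1000  # vectors per dump; pick whatever fits comfortably in RAM

def save_vectors_incrementally(vectors, path='features.pkl'):
    """Write successive chunks of vectors as independent pickle records."""
    with open(path, 'wb') as f:
        chunk = []
        for vec in vectors:
            chunk.append(vec)
            if len(chunk) == CHUNK_SIZE:
                pickle.dump(chunk, f)
                chunk = []
        if chunk:
            pickle.dump(chunk, f)

def load_vectors_incrementally(path='features.pkl'):
    """Yield vectors one by one without loading the whole file."""
    with open(path, 'rb') as f:
        while True:
            try:
                chunk = pickle.load(f)
            except EOFError:
                return
            for vec in chunk:
                yield vec

Each pickle.dump call writes a self-contained record, so memory use stays bounded by the chunk size on both the writing and the reading side.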

Related

How to store a set of arrays for deep learning not consuming too much memory (Python)?

I'm trying to do research in which the observations of my dataset are represented by matrices (arrays composed of numbers, similar to how images for deep learning are represented, but mine are not images) of different shapes.
What I've already tried is to write those arrays as lists in one column of a pandas DataFrame and then save this as CSV/Excel. After that I planned just to load such a file, convert those lists back to arrays of the appropriate shapes, and then convert a set of such arrays to a tensor which I will finally use for training the deep model in Keras.
But it seems this method is extremely inefficient, because only 1/6 of my dataset already occupies about 6 GB (pandas saved as CSV), which is huge, and I won't be able to load it into RAM (I'm using Google Colab to run my experiments).
So my question is: is there any other way of storing a set of arrays of different shapes which won't occupy so much memory? Maybe I can store tensors directly somehow? Or maybe there are ways to store pandas DataFrames in some compressed type of file which is not so heavy?
Yes. Avoid using CSV/Excel for big datasets; there are tons of data formats out there. For this case I would recommend using a compressed format like pd.DataFrame.to_hdf, pd.DataFrame.to_parquet or pd.DataFrame.to_pickle.
There are even more formats to choose from, and compression options within those functions (for example, to_hdf takes the argument complevel, which you can set to 9).
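A minimal sketch of that suggestion (file names are illustrative; to_hdf needs the tables package installed and to_parquet needs pyarrow or fastparquet):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 10), columns=[f'c{i}' for i in range(10)])

# HDF5 with maximum zlib compression
df.to_hdf('data.h5', key='data', complevel=9, complib='zlib')

# Parquet with gzip compression
df.to_parquet('data.parquet', compression='gzip')

# Compressed pickle
df.to_pickle('data.pkl.gz', compression='gzip')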
Are you storing purely (or mostly) continuous variables? If so, maybe you could reduce the precision of these variables (e.g. from float64 to float32) if you don't need such an accurate value per datapoint.
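For example, a one-line downcast along those lines (assuming df holds float64 columns) halves the storage per value:

import numpy as np
df_small = df.astype(np.float32)  # 8 bytes per value -> 4 bytes per value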
There are a bunch of ways of reducing the size of the data that is being stored in your memory, and what's written above is one of many ways to do so. Maybe you could break the process you've mentioned into smaller chunks (i.e. storage of data, extraction of data) and work on each chunk/stage individually, which hopefully will reduce the overall size of your data!
Otherwise, you could perhaps take advantage of database management systems (SQL or NoSQL depending on which fits best) which might be better, though querying that amount of data might constitute yet another issue.
I'm by no means an expert in this; I'm just describing how I've dealt with excessively large datasets (similar to what you're currently experiencing) in the past, and I'm pretty sure someone here will give you a more definitive answer than my 'a little of everything' one. All the best!

Filtering Word Embeddings from word2vec

I have downloaded Google's pretrained word embeddings as a binary file here (GoogleNews-vectors-negative300.bin.gz). I want to be able to filter the embedding based on some vocabulary.
I first tried loading the bin file as a KeyedVectors object, and then creating a dictionary that uses its vocabulary along with another vocabulary as a filter. However, it takes a long time.
# X is the vocabulary we are interested in
embeddings = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
embeddings_filtered = dict((k, embeddings[k]) for k in X if k in list(embeddings.wv.vocab.keys()))
It takes a very long time to run. I am not sure if this is the most efficient solution. Should I filter it out in the load_word2vec_format step first?
Your dict won't have all the features of a KeyedVectors object, and it won't be stored as compactly. The KeyedVectors stores all vectors in a large contiguous native 2D array, with a dict indicating the row for each word's vector. Your new dict, with a separate vector for each word, will involve more overhead. (Further, the vectors you get back from embeddings[k] are "views" into the full array, so your subset may actually, indirectly, retain the larger array even after you try to discard the KeyedVectors.)
Since it's likely that a reason you only want a subset of the original vectors is that the original set was too large, having a dict that takes as much or more memory probably isn't ideal.
You should consider two options:
load_word2vec_format() includes an optional limit parameter that only loads the first N words from the supplied file. As such files are typically sorted from most-frequent to least-frequent words, and the less-frequent words are both far less useful and of lower vector quality, it is often practical to just use the first 1 million, or 500,000, or 100,000, etc. entries for a large memory & speed savings.
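For example (the limit value here is illustrative):

from gensim.models import KeyedVectors

# load only the 500,000 most frequent words from the pretrained file
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)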
You could try filtering on load. You'd need to adapt the loading code to do this. Fortunately you can review the full source code for load_word2vec_format() (it's just a few dozen lines) inside your local gensim installation, or online at the project's source code hosting at:
https://github.com/RaRe-Technologies/gensim/blob/9c5215afe3bc4edba7dde565b6f2db982bba5113/gensim/models/utils_any2vec.py#L123
You'd write your own version of this routine that skips words not of interest. (It might have to do two passes over the file, one to count the words of interest, then a second to actually allocate the right-sized in-memory arrays and do the real reading.)
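As an illustration of that idea, here is a sketch (not gensim's own code) that parses the binary word2vec format by hand and keeps only the words in your filter set. It does a single pass and returns a word-to-row index plus one contiguous 2D array, in the spirit of how KeyedVectors stores vectors:

import gzip
import numpy as np

def load_filtered_word2vec(path, wanted_words):
    """Keep only the vectors for wanted_words from a binary word2vec .bin.gz file."""
    wanted = set(wanted_words)
    words, rows = [], []
    with gzip.open(path, 'rb') as f:            # reads the .gz file directly
        vocab_size, dim = map(int, f.readline().split())
        vec_bytes = dim * 4                     # vectors are float32
        for _ in range(vocab_size):
            # the word is everything up to the first space
            word = b''
            while True:
                ch = f.read(1)
                if ch == b' ' or ch == b'':
                    break
                if ch != b'\n':                 # skip padding newlines
                    word += ch
            vec = f.read(vec_bytes)             # always consume the vector bytes
            key = word.decode('utf-8', errors='ignore')
            if key in wanted:
                words.append(key)
                rows.append(np.frombuffer(vec, dtype=np.float32))
    vectors = np.vstack(rows) if rows else np.empty((0, dim), dtype=np.float32)
    index = {w: i for i, w in enumerate(words)}  # word -> row number
    return index, vectors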

Database vs flat files in python (need speed but can't fit in memory) to be used with generator for NN training

I am dealing with a relatively large dataset (>400 GB) for analytics purposes but have somewhat limited memory (256 GB). I am using python. So far I have been using pandas on a subset of the data but it is becoming obvious that I need a solution that allows me to access data from the entire dataset.
A little bit about the data. Right now the data is segregated over a set of flat files that are pandas dataframes. The data has two keys: the primary key, let's call it "record", which I want to be able to use to access the data, and a secondary key, which is basically the row number within the primary key. As in, I want to access row 2 in record "A".
The dataset is used for training an NN (keras/tf). So the task is to partition the entire set into train/dev/test by record, and then pass the data to train/predict generators (I implement keras.utils.Sequence(), which I have to do because the data is variable-length sequences that need to be padded for batch learning).
Given my desire to pass examples to the NN as fast as possible and my inability to store all of the examples in memory, should I use a database (mongodb or sqlite or something else?) and query examples as needed, or should I continue to store things in flat files and load/delete them (and hope that the Python garbage collector works)?
Another complication is that there are about 3 million "records". Right now the pandas dataframes store them in batches of ~10k, but it would benefit me to split training/test/validation randomly, which means I really need to be able to access some but not all of the records in a particular batch. In pandas this seems hard (as far as I know, I need to read an entire flat file to access a particular record, since I don't know in which chunk of the file the data is located); on the other hand, I don't think generating 3 million individual files is smart either.
A further complication is that the model is relatively simple and, due to various bottlenecks, I am unable to saturate my compute power during training, so if I could stream the training data to several different models, that would help with hyperparameter search, since otherwise I am wasting cycles.
What do you think is the correct (fast, simple) back-end to handle my data needs?
Best,
Ilya
This is a good use case for writing a custom generator, then using Keras' model.fit_generator. Here's something I wrote the other day in conjunction with Pandas.
Note that I first split my main dataframe into training and validation splits (merged was my original dataframe), but you may have to move things around on disk and specify them when selecting in the generator.
Lots of the reshaping and lookup/loading is all custom to my problem, but you see the pattern.
import numpy as np

# 80/20 random split of the original dataframe ("merged") into train/valid
msk = np.random.rand(len(merged)) < 0.8
train = merged[msk]
valid = merged[~msk]

def train_generator(batch_size):
    while True:
        # re-sample a fresh batch on every iteration
        # (npf and labels are specific to my data, as noted above)
        sample_rows = train[train['match_id'].isin(npf.id.values)].sample(n=batch_size)
        sample_file_ids = sample_rows.FILE_NAME.tolist()
        sample_data = [np.load('/Users/jeff/spectro/' + x.split(".")[0] + ".npy").T for x in sample_file_ids]
        sample_data = [x.reshape(x.shape[0], x.shape[1]) for x in sample_data]
        sample_data = np.asarray([x[np.random.choice(x.shape[0], 128, replace=False)] for x in sample_data])
        sample_labels = np.asarray([labels.get(x) for x in sample_file_ids])
        yield (sample_data, sample_labels)
It essentially returns batch_size samples each time it is iterated. Keras requires your generator to yield a tuple of length 2, where the first element is your data in the expected shape (whatever your neural network input shape is) and the second element is the labels, also mapped to the expected shape (N_classes, or whatever).
Here's another relatively useful link regarding generators, which may help you determine when you've truly exhausted all examples. My generator just randomly samples, but the dataset is sufficiently large that I don't care.
https://github.com/keras-team/keras/issues/7729#issuecomment-324627132
Don't forget to write a validation_generator as well, which reads from some set of files or dataframes that you have randomly put somewhere else, for validation purposes.
Lastly, here's calling the generator:
model.fit_generator(train_generator(32),
                    steps_per_epoch=10000, epochs=20,
                    validation_data=valid_generator(32), validation_steps=500)
Depending on the Keras version, you may find the argument names have changed slightly (the call above uses the Keras 2 names), but a few searches should get you fixed up quickly.

Store large dictionary to file in Python

I have a dictionary with many entries and a huge vector as values. These vectors can be 60,000 dimensions large and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using a pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (like 10.5 MB on a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the filesize? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, when given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words in all sentences than appear in one sentence, hence the many zeros. Then, I want to use this array to train at least three, maybe six classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right that I don't need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- where HDF5 is coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary in an "all-in-one" file or in "one entry per" file, and can also leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
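To illustrate the sparse-matrix suggestion concretely (a sketch only; the shapes and file name are made up for the example), scipy can store just the non-zero word counts:

import numpy as np
from scipy import sparse

# 60,000 sentences x 60,000 vocabulary words, almost entirely zeros
counts = sparse.lil_matrix((60000, 60000), dtype=np.uint8)
counts[0, 42] = 3                         # e.g. word 42 appears 3 times in sentence 0

csr = counts.tocsr()                      # compact row-oriented format
sparse.save_npz('word_counts.npz', csr)   # writes only the non-zero entries
loaded = sparse.load_npz('word_counts.npz')

Most scikit-learn estimators accept CSR matrices directly, so the counts never need to be converted to a dense array for training.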
With 60,000 dimensions do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling about 3.35 GiB of data.
That data structure pickles to about the same size on disk, too.
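A sketch of that layout (the key name is illustrative):

from array import array

DIM = 60000

counts = {}
counts['sentence_0'] = array('B', bytes(DIM))   # 60,000 zero counts, 1 byte each
counts['sentence_0'][42] = 3                    # word 42 appears 3 times

Each such array takes roughly 60 kB, which is where the ~3.35 GiB total above comes from.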

Efficiently Reading Large Files with ATpy and numpy?

I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use PyTables, which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. Then you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy so you might be better off asking on an ATpy mailing list or at least some astronomy python users mailing list, as it's possible that ATpy has another solution built in.
From the PyTables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
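A hedged sketch of how the column arithmetic from the question might look with PyTables, assuming the data has already been loaded into an HDF5 table (the file, table and column names are assumptions):

import tables

CHUNK = 100000  # rows processed per step; illustrative

with tables.open_file('satellite_data.h5', mode='r') as h5:
    sat = h5.root.satellite                      # assumed table node
    out = tables.open_file('colors.h5', mode='w')
    colorw1w2 = out.create_earray(out.root, 'colorw1w2',
                                  tables.Float64Atom(), shape=(0,))
    for start in range(0, sat.nrows, CHUNK):
        stop = min(start + CHUNK, sat.nrows)
        w1 = sat.read(start, stop, field='w1_column')
        w2 = sat.read(start, stop, field='w2_column')
        colorw1w2.append(w1 - w2)                # only one chunk in memory at a time
    out.close()

Only a chunk of rows is ever resident in memory, and the result lives on disk as an extendable array; PyTables' tables.Expr can evaluate such column expressions out of core as well.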
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well-structured binary format like HDF5 to get good performance with any solution.
