I have a dictionary with many entries and a huge vector as values. These vectors can be 60,000 dimensions large, and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (around 10.5 MB on a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, when given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words across all sentences than appear in one sentence, hence the many zeros. Then, I want to use this array to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right that I didn't need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, this can be very efficient, as both klepto and joblib understand how to use a minimal state representation for an array. If most elements of your arrays are indeed zero, then by all means convert to sparse matrices... and you will find huge savings in the storage size of the array.
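As a minimal sketch of that dense-to-sparse conversion for the word-count use case above (the dict contents and file name are hypothetical, and this assumes the values are 1-D numpy count vectors):

import numpy as np
from scipy import sparse

# Hypothetical dict: sentence index -> dense 1-D count vector, mostly zeros.
dense_counts = {0: np.array([0, 2, 0, 0, 1]), 1: np.array([1, 0, 0, 3, 0])}

# Stack the rows into one CSR matrix; only the nonzero entries are stored.
X = sparse.csr_matrix(np.vstack([dense_counts[i] for i in sorted(dense_counts)]))

sparse.save_npz('counts.npz', X)      # compact on-disk representation
X_back = sparse.load_npz('counts.npz')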
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or a database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- with HDF5 coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary as an "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
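As a rough sketch of how that might look (treat the exact arguments as an assumption and check the klepto docs; the archive name and keys are hypothetical):

from klepto.archives import dir_archive

# One file per dictionary entry, serialized with pickle, cached in memory.
db = dir_archive('word_counts', cached=True, serialized=True)
db['sentence_0'] = [0, 2, 0, 1]   # any picklable value, e.g. a numpy array
db.dump()    # push the in-memory cache to disk
db.clear()   # drop the in-memory copy
db.load()    # pull the entries back from disk when needed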
With 60,000 dimensions, do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling about 3.35 GB of data. That data structure pickles to about the same size on disk, too.
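A minimal sketch of that layout (the vocabulary size and keys are illustrative):

from array import array

num_words = 60000   # dimension of each count vector

# One unsigned byte per word (type 'B' holds 0..255), all initialised to zero.
counts = {}
counts['sentence_0'] = array('B', bytes(num_words))
counts['sentence_0'][42] += 1    # word 42 seen once more in this sentence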
I'm trying to do research in which the observations of my dataset are represented by matrices (arrays composed of numbers, similar to how images for deep learning are represented, but mine are not images) of different shapes.
What I've already tried is to write those arrays as lists in one column of a pandas dataframe and then save this as a CSV/Excel file. After that I planned just to load such a file, convert those lists back to arrays of appropriate shapes, and then convert a set of such arrays to a tensor which I will finally use for training the deep model in Keras.
But it seems like this method is extremely inefficient, because only 1/6 of my dataset already occupies about 6 GB of memory (pandas saved as CSV), which is huge, and I won't be able to load it into RAM (I'm using Google Colab to run my experiments).
So my question is: is there any other way of storing a set of arrays of different shapes which won't occupy so much memory? Maybe I can store tensors directly somehow? Or maybe there are some ways to store pandas dataframes in compressed file types which are not so heavy?
Yes. Avoid using CSV/Excel for big datasets; there are tons of data formats out there. For this case I would recommend using a compressed format like pd.DataFrame.to_hdf, pd.DataFrame.to_parquet or pd.DataFrame.to_pickle.
There are even more formats to choose from, and compression options within the functions (for example, to_hdf takes the argument complevel, which you can set to 9).
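For example, something along these lines (file names are placeholders; to_hdf needs PyTables installed and to_parquet needs pyarrow or fastparquet):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1000), 'b': np.random.rand(1000)})   # toy frame

# HDF5 with maximum compression level (complevel runs from 0 to 9).
df.to_hdf('data.h5', key='df', complevel=9, complib='blosc')

# Parquet and pickle writers also take a compression argument.
df.to_parquet('data.parquet', compression='gzip')
df.to_pickle('data.pkl.gz', compression='gzip')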
Are you storing purely (or mostly) continuous variables? If so, maybe you could reduce the precision of these variables (e.g., from float64 to float32) if you don't need such an accurate value per data point.
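A small sketch of that downcast on a pandas frame (the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(5), 'y': np.random.rand(5)})   # toy frame

# Halve the memory of continuous columns if float32 precision is enough.
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].astype(np.float32)
print(df.dtypes)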
There are a bunch of ways to reduce the size of the data being stored in memory, and what's written above is just one of them. Maybe you could break the process you've mentioned into smaller chunks (i.e., storage of data, extraction of data) and work on each chunk/stage individually, which hopefully will reduce the overall size of your data!
Otherwise, you could perhaps take advantage of database management systems (SQL or NoSQL depending on which fits best) which might be better, though querying that amount of data might constitute yet another issue.
I'm by no means an expert in this; I'm just describing how I've dealt with excessively large datasets (similar to what you're currently experiencing) in the past, and I'm pretty sure someone here will give you a more definitive answer than my 'a little of everything' one. All the best!
I have downloaded Google's pretrained word embeddings as a binary file here (GoogleNews-vectors-negative300.bin.gz). I want to be able to filter the embedding based on some vocabulary.
I first tried loading the bin file as a KeyedVector object, and then creating a dictionary that uses its vocabulary along with another vocabulary as a filter. However, it takes a long time.
# X is the vocabulary we are interested in
embeddings = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
embeddings_filtered = dict((k, embeddings[k]) for k in X if k in list(embeddings.wv.vocab.keys()))
It takes a very long time to run. I am not sure if this is the most efficient solution. Should I filter it out in the load_word2vec_format step first?
Your dict won't have all the features of a KeyedVectors object, and it won't be stored as compactly. The KeyedVectors stores all vectors in a large contiguous native 2D array, with a dict indicating the row for each word's vector. Your second dict, with a separate vector for each word, will involve more overhead. (And further, as the vectors you get back from embeddings[k] will be "views" into the full vector – so your subset may actually indirectly retain the larger array, even after you try to discard the KeyedVectors.)
Since it's likely that a reason you only want a subset of the original vectors is that the original set was too large, having a dict that takes as much or more memory probably isn't ideal.
You should consider two options:
load_word2vec_format() includes an optional limit parameter that only loads the first N words from the supplied file. As such files are typically sorted from most-frequent to least-frequent words, and the less-frequent words are both far less useful and of lower vector quality, it is often practical to just use the first 1 million, 500,000, or 100,000 entries for a large memory and speed saving (see the sketch after these two options).
You could try filtering on load. You'd need to adapt the loading code to do this. Fortunately, you can review the full source code for load_word2vec_format() (it's just a few dozen lines) inside your local gensim installation, or online at the project's source code hosting:
https://github.com/RaRe-Technologies/gensim/blob/9c5215afe3bc4edba7dde565b6f2db982bba5113/gensim/models/utils_any2vec.py#L123
You'd write your own version of this routine that skips words not of interest. (It might have to do two passes over the file, one to count the words of interest, then a second to actually allocate the right-sized in-memory arrays and do the real reading.)
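A minimal sketch of the first option (the 500,000 cutoff is arbitrary, and this assumes the pre-4.0 gensim API used in the question):

from gensim.models import KeyedVectors

# Read only the first N (most frequent) vectors from the file.
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)

print(len(embeddings.vocab))   # about 500,000 entries instead of 3 million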
This must be a very standard problem that also must have a standard solution:
What is the correct way to incrementally save feature vectors extracted from data, rather than accumulating all vectors from the entire dataset and then saving all of them at once?
In more detail:
I have written a script for extracting custom text features (e.g. next_token, prefix-3, is_number) from text documents. After extraction is done, I end up with one big list of scipy sparse vectors. Finally, I pickle that list so I can store it space-efficiently and load it quickly when I want to train a model. But the problem is that I am limited by my RAM here: I can only make that list of vectors so big before it, or the pickling, exceeds my RAM.
Of course, incrementally appending string representations of these vectors would be possible. One could accumulate k vectors, append them to a text file, and clear the list again for the next k vectors. But storing the vectors as strings would be space-inefficient and would require parsing the representations upon loading. That does not sound like a good solution.
I could also pickle sets of k vectors and end up with a whole bunch of pickle files of k vectors each. But that sounds messy.
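For what it's worth, here is a minimal sketch of that chunked idea using scipy's native .npz format instead of pickle (the chunk size and file name prefix are arbitrary):

import glob
import scipy.sparse as sp

CHUNK = 10000   # how many sparse row vectors to hold in RAM at once

def save_in_chunks(vector_iter, prefix='features'):
    buffer, part = [], 0
    for vec in vector_iter:
        buffer.append(vec)
        if len(buffer) == CHUNK:
            sp.save_npz('%s-%04d.npz' % (prefix, part), sp.vstack(buffer))
            buffer, part = [], part + 1
    if buffer:   # flush the remainder
        sp.save_npz('%s-%04d.npz' % (prefix, part), sp.vstack(buffer))

def load_all(prefix='features'):
    return sp.vstack([sp.load_npz(f) for f in sorted(glob.glob(prefix + '-*.npz'))])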
So this must be a standard problem with a more elegant solution. What is the right method to solve this? Is there maybe even some existing functionality in scikit-learn for this kind of thing already, that I overlooked?
I found this: How to load one line at a time from a pickle file?
But it does not work with Python 3.
I have a large, sparse, multidimensional lookup table, where cells contain arrays varying in size from 34 kB to circa 10 MB (essentially one or more elements stored in this bin/bucket/cell). My prototype has dimensions of 30**5=24,300,000, of which only 4,568 cells are non-empty (so it's sparse). Prototype non-empty cells contain structured arrays with sizes between 34 kB and 7.5 MB. At 556 MB, the prototype is easily small enough to fit in memory, but the production version will be a lot larger; maybe 100–1000 times (it is hard to estimate). This growth will be mostly due to increased dimensions, rather than due to the data contained in individual cells. My typical use case is write once (or rarely), read often.
I'm currently using a Python dictionary, where the keys are tuples, i.e. db[(29,27,29,29,16)] is a structured numpy.ndarray of around 1 MB. However, as it grows, this won't fit in memory.
A natural and easy-to-implement extension would be the Python shelve module.
I think tables is fast, in particular for the write once, read often use case, but I don't think it fits my data structure.
Considering that I will always need access only by the tuple index, a very simple way to store it would be to have a directory with some thousands of files with names like entry-29-27-29-29-16, which then stores the numpy.ndarray object in some format (NetCDF, HDF5, npy...).
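A minimal sketch of that file-per-key layout with .npy files (the directory and naming scheme are only illustrative):

import os
import numpy as np

DB_DIR = 'lookup_table'    # one .npy file per non-empty cell
os.makedirs(DB_DIR, exist_ok=True)

def cell_path(key):
    # key is an index tuple such as (29, 27, 29, 29, 16)
    return os.path.join(DB_DIR, 'entry-' + '-'.join(map(str, key)) + '.npy')

def write_cell(key, arr):
    np.save(cell_path(key), arr)

def read_cell(key):
    return np.load(cell_path(key))   # FileNotFoundError signals an empty cell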
I'm not sure if a classical database would work, considering that the size of the entries varies considerably.
What is a way to store a data structure as described above, that has efficient storage and a fast retrieval of data?
From what I understand, you might want to look at the amazing pandas package, as it has a specific facility for the sparse data structure you've described.
Also, while this stackoverflow post doesn't specifically address sparse data, it's a great description of using pandas for BIG data, which may be of interest.
Best of luck!
I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use pytables which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. Then, you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy so you might be better off asking on an ATpy mailing list or at least some astronomy python users mailing list, as it's possible that ATpy has another solution built in.
From the PyTables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
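A minimal sketch of what that looks like with an extendable on-disk array (the file name and column name are hypothetical):

import numpy as np
import tables

# Write: create an extendable, chunked float column and append data in pieces.
with tables.open_file('satellite_data.h5', mode='w') as f:
    w1 = f.create_earray(f.root, 'w1_column', tables.Float64Atom(), shape=(0,))
    w1.append(np.random.rand(1000000))    # append as many chunks as you like

# Read: pull only the slice you need instead of the whole column.
with tables.open_file('satellite_data.h5', mode='r') as f:
    first_rows = f.root.w1_column[:1000]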
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well-structured binary format like HDF5 to get good performance with any solution.