I have some data stored in an in-memory tree, and I regularly save the tree to disk using pickle.
Recently I noticed that the program was using a large amount of memory, so I checked the saved pickle file: it is around 600 MB. I then wrote another small test program to load the tree back into memory, and found that it takes nearly 10 times as much memory (5 GB) as the size on disk. Is that normal? And what's the best way to avoid it?
No, it's not normal. I suspect your tree is bigger than you think. Write some code to walk it and add up all the space used (and count the nodes).
See memory size of Python data structure
Also, what exactly are you asking? Are you surprised that a 600 MB data structure on disk is 5 GB in memory? That's not particularly surprising. Pickle is a compact serialization format, without the per-object overhead Python adds in memory, so you should expect the data to be smaller on disk. Smaller by a factor of roughly 10 is pretty good.
If you're surprised by the size of your own data that's another thing.
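A minimal sketch of such a walk, assuming nodes are plain objects with a `children` list (your tree's node type will differ). Note that `sys.getsizeof` only counts an object's own header, so the node's `__dict__` and attribute values are added explicitly:

```python
import sys

class Node:
    """Toy node type standing in for whatever your tree uses."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def tree_stats(node):
    """Return (node_count, total_bytes) for the subtree rooted at node."""
    # getsizeof(node) is just the instance header; add the attribute
    # dict, the value, and the children list explicitly.
    size = sys.getsizeof(node) + sys.getsizeof(node.__dict__)
    size += sys.getsizeof(node.value)
    size += sys.getsizeof(node.children)
    count = 1
    for child in node.children:
        c, s = tree_stats(child)
        count += c
        size += s
    return count, size

root = Node(1, [Node(2), Node(3, [Node(4)])])
count, total = tree_stats(root)
print(count)  # 4
```

This still undercounts shared or nested containers; for a more thorough accounting see the recursive recipes linked above.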
Related
I am working on a project where I am combining 300,000 small files into a dataset for training a machine learning model. Because each of these files represents not a single sample but a variable number of samples, the dataset I need can only be formed by iterating through each of these files and concatenating/appending them into a single, unified array. So unfortunately I cannot avoid iterating through these files to build the dataset, and as a result, loading the data before model training is very slow.
My question is therefore this: would it be better to merge these small files into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) ones. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually, working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to:
check whether the file exists,
check whether you have privileges to access the file,
get the file's information from disk (where the file begins on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create a system buffer for data from disk (the system reads extra data into the buffer so that later read() calls can read from the buffer instead of from disk).
With many files, the system has to do all of this for every file, and disk is much slower than a buffer in memory.
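As a rough sketch of the merge step (the file names, the binary concatenation, and the group size are illustrative assumptions, not tied to any particular dataset format):

```python
import os

def merge_files(paths, out_path):
    """Concatenate the given files byte-for-byte into one output file."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())

def merge_in_groups(paths, out_dir, group_size=1000):
    """Merge e.g. 300,000 files into len(paths)/group_size larger ones."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = []
    for i in range(0, len(paths), group_size):
        out_path = os.path.join(out_dir, f"merged_{i // group_size:05d}.bin")
        merge_files(paths[i:i + group_size], out_path)
        outputs.append(out_path)
    return outputs
```

If your samples are not simple byte streams (e.g. pickled or NumPy arrays per file), you would instead load each file and append its samples to one array, then save that array once; the grouping logic stays the same.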
I've noticed that Python handles memory in a way I didn't expect. I have a huge dataset stored in a 70 GB file. I usually load this file with np.loadtxt() and do some math on it. I have 32 GB of RAM, and I've noticed that, when the data is loaded into memory, around 25 GB of RAM is used. But apparently this value can change. For example, once, while I was processing the data, I got a memory error. After the error the dataset was still in memory, and I verified that I could access it, but only around 5 GB of RAM was used. How is this possible? And how can I force Python to use as little memory as possible with my data, so that I can run other applications simultaneously?
Moreover, I sometimes do calculations that return a new dataset as large as the original, so that in the end I have a number of large datasets in memory, yet the total RAM used does not change. Are these variables written to the hard disk in some way? If so, why do I sometimes get memory crashes?
(BTW, I use Spyder as my IDE, if it matters.)
In Python 3, for disk space (not speed) when saving Pandas objects, is HDF5 or pickle better? Preferably, also: how much better?
Prior research:
I searched for an article comparing storage methods, and here is the most popular one. Unfortunately, it only talks about speed, which is less important to me.
I tested an example object myself, and found that HDF5 and Pickle produced basically the same file size for my example object. But I don't want to just trust my own result, because maybe my result is due to the specific structure/type of data I happened to test with.
I am reading a chunk of data from a pytables.Table (version 3.1.1) using the read_where method on a big HDF5 file. The resulting numpy array is about 420 MB; however, the memory consumption of my Python process went up by 1.6 GB during the read_where call, and the memory is not released after the call finishes. Even deleting the array, closing the file, and deleting the HDF5 file handle does not free the memory.
How can I free this memory again?
The huge memory consumption is due to the fact that Python builds a lot of machinery around the data to make it easier to manipulate.
You've got a good explanation of why the memory use is maintained here and there (found on this question). A good workaround would be to open and manipulate your table in a subprocess with the multiprocessing module.
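A sketch of that workaround, with the real read_where call replaced by a stand-in that simulates a large temporary allocation (the `tables` calls shown in the comment are the assumed real code, not executed here):

```python
import multiprocessing as mp

def read_chunk(args):
    # In the real code this would be roughly:
    #   import tables
    #   with tables.open_file(filename) as f:
    #       return f.root.mytable.read_where(condition)
    # Here we only simulate a large intermediate allocation.
    filename, condition = args
    scratch = [0] * 1_000_000        # large temporary buffer
    return len(scratch)              # only the small result is sent back

def read_in_subprocess(filename, condition):
    # "fork" keeps this sketch self-contained (no module re-import);
    # all memory the child allocated is returned to the OS on exit.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=1) as pool:
        return pool.apply(read_chunk, ((filename, condition),))
```

The key point is that only the returned value is pickled back to the parent; whatever extra memory the HDF5 library held onto dies with the child process.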
We would need more context on the details of your Table object, such as how large it is and what the chunk size is. How HDF5 handles chunking is probably one of the biggest culprits for hogging memory in this case.
My advice is to have a thorough read of this: http://pytables.github.io/usersguide/optimization.html#understanding-chunking and to experiment with different chunk sizes (typically making them larger).
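To see why chunk size matters, here is a toy model (not PyTables code) of chunked I/O: a query touching some rows forces the library to read every chunk overlapping them, so very small chunks mean many separate reads, while very large chunks mean reading (and buffering) extra, unneeded rows:

```python
def chunks_touched(start_row, n_rows, chunk_rows):
    """Number of chunks overlapping rows [start_row, start_row + n_rows)."""
    first = start_row // chunk_rows
    last = (start_row + n_rows - 1) // chunk_rows
    return last - first + 1

def rows_actually_read(start_row, n_rows, chunk_rows):
    """Rows the library must pull from disk: whole chunks, not just the query."""
    return chunks_touched(start_row, n_rows, chunk_rows) * chunk_rows

# Reading 10 rows with tiny chunks: 10 separate chunk reads.
print(chunks_touched(0, 100, 10))       # 10
# Reading 10 rows with 100-row chunks: one read, but 100 rows buffered.
print(rows_actually_read(5, 10, 100))   # 100
```

The real cost model in PyTables also involves the chunk cache and compression, but this is the basic trade-off the optimization guide's chunkshape advice is about.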
I am currently storing key-value pairs of type int-int. For fast access, I currently use a BTrees.IIBTree structure held in memory. It is not stored on disk at all, since we need the most recent data.
However, the current solution barely fits into memory, so I am looking for a database or data structure that is more efficient in terms of access time. If it is kept in memory, it also needs to be efficient in terms of memory space.
One idea would be to replace the BTrees.IIBTree structure with an int-to-byte hash written in C as a Python extension, but the data would still be lost if the machine fails (not a terrible thing in our case).
What are your suggestions?