I have a dictionary pickled to disk that is about 780 MB on disk. However, when I load that dictionary into memory, its size swells unexpectedly to around 6 GB. Is there any way to keep the in-memory size close to the on-disk size? (It would be fine if it took around 1 GB in memory, but 6 GB seems like strange behavior.) Is there a problem with the pickle module, or should I save the dictionary in some other format?
Here is how I am loading the file:
import pickle
with open('py_dict.pickle', 'rb') as file:
    py_dict = pickle.load(file)
Any ideas, help, would be greatly appreciated.
If you're using pickle just for storing large values in a dictionary, or a very large number of keys, you should consider using shelve instead.
import shelve
s = shelve.open('shelve.bin')
s['a'] = 'value'
This loads each key/value only as needed, keeping the rest on disk.
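A slightly fuller sketch of the same idea (file name as above): values written in one session are pickled to disk immediately, and a later session reads back only the entries it actually touches.
import shelve

# Write: each assignment is pickled and stored on disk right away.
with shelve.open('shelve.bin') as s:
    s['a'] = 'value'

# Read: opening the shelf does not load everything; each lookup
# fetches just that key's value from disk.
with shelve.open('shelve.bin') as s:
    print(s['a'])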
Alternatively, use SQL to store all the data in a database and use efficient queries to retrieve it.
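If you go the SQL route, even the built-in sqlite3 module is enough for a simple key/value table. This is only a minimal sketch; the table and file names are illustrative.
import sqlite3

# A minimal key/value table; only the rows you SELECT are pulled into memory.
conn = sqlite3.connect('kv_store.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('a', 'value'))
conn.commit()

row = conn.execute('SELECT value FROM kv WHERE key = ?', ('a',)).fetchone()
print(row[0])  # -> 'value'
conn.close()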
Related
I use a Python dictionary to store key-value pairs, but the dictionary gets too large (>100 GB) and hits a memory limit.
What is a more memory-efficient data structure for storing key-value pairs in Python?
For example, we can use generators to replace lists (a minimal illustration follows below).
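A minimal illustration of the generator idea mentioned above: the list materializes every element in RAM, while the generator only keeps its iteration state.
# List comprehension: all ten million integers exist in memory at once.
squares_list = [x * x for x in range(10_000_000)]

# Generator expression: elements are produced one at a time on demand.
squares_gen = (x * x for x in range(10_000_000))
print(next(squares_gen))  # -> 0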
You can use sqlitedict, which provides a key-value interface to an SQLite database. Regarding memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MB.
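A hedged sketch of the sqlitedict interface mentioned above (file name is illustrative); each value is pickled into the SQLite file rather than kept in RAM.
from sqlitedict import SqliteDict

# Entries live in the SQLite file; only SQLite's small page cache stays in RAM.
db = SqliteDict('py_dict.sqlite', autocommit=True)
db['alpha'] = list(range(1000))   # pickled and written to disk
print(len(db['alpha']))           # read back on demand
db.close()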
Maybe that could help: https://github.com/dagnelies/pysos
It keeps only the index in memory and keeps the data on disk.
I wish to concatenate (append) a bunch of small PDFs together efficiently in memory, in pure Python. Specifically, a typical case is 500 single-page PDFs, each about 400 kB, to be merged into one. Let's say the PDFs are available as an iterable in memory, say a list:
my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj] # type is BytesIO
where each pdf_fileobj is of type BytesIO. The base memory usage is therefore about 200 MB (500 PDFs, 400 kB each).
Ideally, I would want the following code to concatenate using no more than 400-500 MB of memory in total (including my_pdfs). However, that doesn't seem to be the case: the debugging statement on the last line reports a maximum memory usage of almost 700 MB. Moreover, the macOS resource monitor shows about 600 MB allocated when the last line is reached.
Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually to get rid of merging garbage, in this case? I have seen this (probably) causing memory build up in a slightly different scenario I'll skip for now.
import io
import resource # For debugging
from PyPDF2 import PdfFileMerger
def merge_pdfs(iterable):
    """Merge pdfs in memory"""
    merger = PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)
    myio = io.BytesIO()
    merger.write(myio)
    merger.close()
    myio.seek(0)
    return myio
my_concatenated_pdf = merge_pdfs(my_pdfs)
# Print the maximum memory usage
print("Memory usage: %s (kB)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
Question summary
Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimize it?
Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
What about this general approach? Is BytesIO suitable to use in this case? merger.write(myio) does seem to run rather slowly, given that everything happens in RAM.
Thank you!
Q: Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimise it?
A: Because .append creates a new stream object, and merger.write(myio) then creates yet another stream object, while you already have 200 MB of PDF files in memory, so roughly 3 × 200 MB.
Q: Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
A: It is a known issue in PyPDF2.
Q: What about this general approach? Is BytesIO suitable to use in this case?
A: Considering the memory issues, you might want to try a different approach: merge the files one by one (or in small chunks), temporarily saving the intermediate result to disk and clearing the already-merged files from memory, as sketched below.
The PyMuPDF library may also be a good alternative, given the performance issues of PdfFileMerger from PyPDF2.
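One way to sketch the "merge a little at a time and flush to disk" suggestion, using the same PdfFileMerger API as in the question. The chunk size and temp-file handling are illustrative choices, and re-appending the running result trades repeated parsing work for bounded memory.
import os
import tempfile
from PyPDF2 import PdfFileMerger

def merge_pdfs_chunked(fileobjs, chunk_size=50):
    """Merge PDFs in chunks, writing the running result to a temp file so
    only ~chunk_size source PDFs are held by the merger at any one time."""
    result_path = None
    for start in range(0, len(fileobjs), chunk_size):
        merger = PdfFileMerger()
        if result_path is not None:
            merger.append(result_path)  # re-append everything merged so far
        for pdf in fileobjs[start:start + chunk_size]:
            merger.append(pdf)
        fd, new_path = tempfile.mkstemp(suffix='.pdf')
        with os.fdopen(fd, 'wb') as out:
            merger.write(out)
        merger.close()
        if result_path is not None:
            os.remove(result_path)
        result_path = new_path
    return result_path  # path to the final merged PDF on disk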
"We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3... This data is about 20GB on disk or 60GB in RAM."
I came across this observation while trying out Dask, a Python framework for handling out-of-memory datasets.
Can someone explain why there is a 3x difference? I'd imagine it has to do with Python objects, but I'm not 100% sure.
Thanks!
You are reading from a CSV on disk into a structured data frame object in memory. The two things are not at all analogous. The CSV data on disk is a single string of text. The data in memory is a complex data structure, with multiple data types, internal pointers, etc.
The CSV itself is not taking up any RAM. There is a complex data structure that is taking up RAM, and it was populated using data sourced from the CSV on disk. This is not at all the same thing.
To illustrate the difference, you could try reading the CSV into an actual single string variable and seeing how much memory that consumes. In this case, it would effectively be a single CSV string in memory:
with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
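To make the comparison concrete, here is a hedged sketch that measures both representations of the same file; pandas stands in for the per-partition frames Dask builds on, and 'data.csv' is the file from the snippet above.
import sys
import pandas as pd

# The raw text is one contiguous block of characters.
with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
print("raw text:  %.1f MB" % (sys.getsizeof(data) / 1e6))

# The parsed frame adds typed columns, an index and per-object overhead.
df = pd.read_csv('data.csv')
print("DataFrame: %.1f MB" % (df.memory_usage(deep=True).sum() / 1e6))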
I have some data stored in a tree in memory, and I regularly save the tree to disk using pickle.
Recently I noticed that the program was using a lot of memory, so I checked the saved pickle file; it is around 600 MB. I then wrote another small test program that loads the tree back into memory, and found that it takes nearly ten times as much memory (5 GB) as the size on disk. Is that normal? And what's the best way to avoid it?
No, it's not normal. I suspect your tree is bigger than you think. Write some code to walk it and add up all the space used (and count the nodes); see the sketch below.
See memory size of Python data structure
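A rough sketch of such a walk, assuming each node keeps its children in a .children list (adapt the attribute name to your own node class). Note that sys.getsizeof only counts each object's own header, not the payloads it references, so treat the total as a lower bound.
import sys

def tree_size(node, seen=None):
    """Return (approximate bytes, node count) for a tree of nodes."""
    if seen is None:
        seen = set()
    if id(node) in seen:            # guard against shared or cyclic references
        return 0, 0
    seen.add(id(node))
    size, count = sys.getsizeof(node), 1
    for child in getattr(node, 'children', []):
        child_size, child_count = tree_size(child, seen)
        size += child_size
        count += child_count
    return size, count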
Also, what exactly are you asking? Are you surprised that a 600 MB data structure on disk is 5 GB in memory? That's not particularly surprising. Pickle stores a compact serialized form of the data, so you expect it to be smaller on disk. It's smaller by a factor of roughly 10, which is pretty good.
If you're surprised by the size of your own data that's another thing.
My first question on stackoverflow :)
I am trying to load pretrained vectors from GloVe and create a dictionary with words as keys and the corresponding vectors as values. I took the usual naive approach:
fp = open(wordEmbdfile)
self.wordVectors = {}
# Create wordVector dictionary
for aline in fp:
    w = aline.rstrip().split()
    self.wordVectors[w[0]] = w[1:]
fp.close()
I see huge memory pressure in Activity Monitor, and eventually, after running for an hour or two, it crashes.
I am going to try splitting the file into multiple smaller files and creating multiple dictionaries.
In the meantime I have the following questions:
To read the word2vec file, is it better to read the gzipped file using gzip.open, or to uncompress it first and then read it with plain open?
The word vector file has text in the first column and floats in the rest; would it be better to use numpy's genfromtxt or loadtxt?
I intend to save this dictionary using cPickle, and I know loading it is going to be hard too. I read the suggestion to use shelve; how does it compare to cPickle in loading time and access time? Maybe it's better to spend some more time loading with cPickle if it improves future accesses (assuming cPickle doesn't crash with 8 GB of RAM). Does anyone have suggestions on this?
Thanks!