Still huge memory usage for Python's json.loads? - python

Yes, this was asked seven years ago, but the 'answers' were not helpful in my opinion. So much open data uses JSON that I'm asking again to see whether any better techniques are available. I'm loading a 28 MB JSON file (with 7,000 lines) and the memory used by json.loads is over 300 MB.
This statement is run repeatedly:
data_2_item = json.loads(data_1_item)
and it eats up memory for the duration of the program. I've tried various other statements, such as pd.read_json(in_file_name, lines=True), with the same results. I've also tried the simplejson and rapidjson packages.

As a commenter observed, json.loads is NOT the culprit. data_2_item can be very large -- sometimes 45 KB. Since it is appended to a list over 7,000 times, the list becomes huge (300 MB) and that memory is NEVER released. So to me the answer is: no solution with the existing packages/loaders. The overall goal is to load a large JSON file into a Pandas DataFrame without using 300 MB (or more) of memory for intermediate processing -- memory that never shrinks back. See also https://github.com/pandas-dev/pandas/issues/17048
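If the file is line-delimited JSON, one option worth trying is to let pandas read it in chunks instead of accumulating a list of decoded objects first. A sketch (the filename and chunk size are illustrative; the final concat still has to hold the complete DataFrame, but the long-lived list of Python dicts goes away):

import pandas as pd

chunks = pd.read_json("in_file_name.json", lines=True, chunksize=1000)   # chunksize requires lines=True
df = pd.concat(chunks, ignore_index=True)                                # each chunk is a small DataFrame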

If, after loading your content, you will only be using part of it, then consider using ijson to load the JSON content in a streaming fashion, with low memory consumption, constructing only the data you need to handle rather than the whole object.
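A sketch of that approach, assuming the file contains one top-level JSON array of records (the filename, the field name and the final DataFrame step are illustrative):

import ijson
import pandas as pd

def wanted_records(path):
    with open(path, "rb") as f:
        # ijson.items() yields one element at a time; the 'item' prefix
        # addresses the elements of a top-level JSON array
        for record in ijson.items(f, "item"):
            if record.get("status") == "active":    # keep only what you actually need
                yield record

# only the filtered records are ever materialized together
df = pd.DataFrame(list(wanted_records("big_file.json")))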

Related

What is faster/more efficient: read/write to a file, or to an io file-like object?

I am working with some large audio files (~500 MB), with a lot of processing and conversion involved. One of the steps involves writing a file, sending it through a network, then reading the file on arrival, then saving it based on some logic.
As the network part is irrelevant for me, I wonder which is faster or more efficient: reading and writing actual files, or an io file-like object.
Also, how significant is the performance difference, if there is one at all?
My intuition says the io object would be more efficient, but I do not know how either process works.
io file-like objects were created to avoid making temporary files that you don't want to store, just so you can pass them to other modules and "fool" them into believing they're dealing with actual file handles (there are limitations, but for most usages it's fine).
So yes, using an io.BytesIO object will be faster: even with an SSD, reading/writing to RAM wins.
class io.BytesIO([initial_bytes])
A stream implementation using an in-memory bytes buffer.
Now if the data is very big, you'll run out of memory or swapping will kick in. So there is a limit to the amount of data you can keep in memory (I remember that old audio editing software did "direct-to-disk" recording for that very reason: memory was limited at the time, and it was not possible to hold several minutes of audio data in memory).
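A minimal sketch of the pattern, using the standard-library wave module as a stand-in for any code that expects a real file handle (the audio parameters are illustrative):

import io
import wave

buf = io.BytesIO()                        # in-memory, file-like buffer
with wave.open(buf, "wb") as w:           # wave writes to the buffer as if it were a file on disk
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)    # one second of 16-bit silence

buf.seek(0)                               # rewind before re-reading or "sending"
payload = buf.read()                      # the bytes that would go over the network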

Memory leak with PyYAML

I think I'm having a memory leak when loading a .yml file with the PyYAML library.
I've followed these steps:
import yaml
d = yaml.load(open(filename, 'r'))
The memory used by the process (measured with top or htop) has grown from 60 KB to 160 MB, while the size of the file is less than 1 MB.
Then I've run the following command:
sys.getsizeof(d)
and it returned a value below 400 KB.
I've also tried running the garbage collector with gc.collect(), but nothing happened.
As you can see, it seems there's a memory leak, but I don't know what is producing it, nor do I know how to free that memory.
Any idea?
Your approach doesn't show a memory leak; it just shows that PyYAML uses a lot of memory while processing a moderately sized YAML file.
If you were to do:
import yaml
X = 10
for x in range(X):
    d = yaml.safe_load(open(filename, 'r'))
and the memory used by the program changed depending on what you set X to, then there would be reason to assume a memory leak.
In the tests that I ran this is not the case. It is just that the default Loader and SafeLoader take about 330x the file size in memory (based on an arbitrary 1 MB simple, i.e. tag-free, YAML file) and the CLoader about 145x.
Loading the YAML data multiple times doesn't increase that, so load() gives back the memory it uses, which means there is no memory leak.
That is not to say that it isn't an enormous amount of overhead.
(I am using safe_load() since PyYAML's documentation indicates that load() is not safe on uncontrolled input files.)
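A sketch of how to run that check yourself on a Unix-like system, watching the process's peak memory as the number of iterations grows (the filename is illustrative; ru_maxrss is reported in kilobytes on Linux and in bytes on macOS):

import resource
import yaml

filename = "data.yml"   # illustrative
X = 10                  # raise this; if there is no leak, the peak stays roughly flat
for _ in range(X):
    with open(filename) as f:
        d = yaml.safe_load(f)
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)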

Using Pickle vs database for loading large amount of data?

I have previously saved a dictionary which maps image_name -> list of feature vectors, with the file being ~32 GB. I have been using cPickle to load the dictionary, but since I only have 8 GB of RAM, this process takes forever. Someone suggested using a database to store all the info and reading from that, but would that be a faster/better solution than reading a file from disk? Why?
Use a database, because it lets you query for just the entries you need instead of loading the whole 32 GB dictionary into RAM at once. I've done this before. I would suggest against using cPickle for data of this size. What specific implementation are you using?
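A minimal sketch of that idea with the standard-library sqlite3 module, pickling each value individually (the file, table and column names are illustrative):

import pickle
import sqlite3

conn = sqlite3.connect("features.db")
conn.execute("CREATE TABLE IF NOT EXISTS features (image_name TEXT PRIMARY KEY, vectors BLOB)")

# store one entry at a time instead of pickling the whole dictionary
def put(image_name, vectors):
    conn.execute("INSERT OR REPLACE INTO features VALUES (?, ?)",
                 (image_name, pickle.dumps(vectors)))
    conn.commit()

# fetch only the entry you need -- nothing close to 32 GB ever sits in RAM
def get(image_name):
    row = conn.execute("SELECT vectors FROM features WHERE image_name = ?",
                       (image_name,)).fetchone()
    return pickle.loads(row[0]) if row else None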

python parallel processing

I am new to Python. I have 2,000 files, each about 100 MB. I have to read each of them and merge them into one big matrix (or table). Can I use parallel processing for this so that I can save some time? If yes, how? I tried searching, and things seem very complicated. Currently it takes about 8 hours to get this done serially. We have a really big server with one terabyte of RAM and a few hundred processors. How can I make efficient use of it?
Thank you for your help.
You may be able to preprocess the files in separate processes using the subprocess module; however, if the final table is kept in memory, then that process will end up being your bottleneck.
There is another possible approach using shared memory with mmap objects. Each subprocess can be responsible for loading the files into a subsection of the mapped memory.
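A simpler starting point than hand-rolled mmap sharing is a multiprocessing pool that parses files in worker processes and ships the pieces back to the parent for the final merge. A sketch, where load_one and the file pattern are placeholders for whatever parsing the files actually need:

import glob
from multiprocessing import Pool

import numpy as np

def load_one(path):
    # placeholder: parse one ~100 MB file into an array
    return np.loadtxt(path)

if __name__ == "__main__":
    paths = sorted(glob.glob("data/*.txt"))    # illustrative location and format
    with Pool(processes=32) as pool:           # far fewer workers than files is fine
        parts = pool.map(load_one, paths)      # parsing happens in parallel
    big_matrix = np.vstack(parts)              # the merge itself runs in the parent process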

Python IMAP search, search results exhaust all memory

I'm trying to fetch all auto-response emails from a specific address in Python using imaplib. Everything worked fine for weeks, but now each time I run my program all my RAM is consumed (several GB!) and the script ends up being killed by the OOM killer.
Here is the code I'm currently using:
import imaplib
import datetime

M = imaplib.IMAP4_SSL('server')
M.login('user', 'pass')
M.select()
# search yesterday's messages from the auto-responder, excluding replies
date = (datetime.date.today() - datetime.timedelta(1)).strftime("%d-%b-%Y")
result, data = M.uid('search', None, '(SENTON %s HEADER FROM "auto#site.com" NOT SUBJECT "RE:")' % date)
...
I'm sure that fewer than 100 emails of a few kilobytes each should be returned. What could be the matter here? Or is there a way to limit the number of emails returned?
Thanks!
There's no way to know for sure what the cause is, without being able to reproduce the problem (and certainly not without seeing the complete program which triggers the problem, and knowing the version of all dependencies you're using).
However, here's my best guess. Several versions of Python include a very memory-wasteful implementation of imaplib. The problem is particularly evident on Windows, but not limited to that platform.
The core of the problem is the way strings are allocated when read from a socket, and the way imaplib reads strings from sockets.
When reading from a socket, Python first allocates a buffer large enough to handle as many bytes as the application asks for. This may be something reasonable sounding, perhaps 16 kB. Then data is read into that buffer and the buffer is resized down to fit the number of bytes actually read.
The efficiency of this operation depends on the quality of the platform's re-allocation implementation. Resizing a buffer may end up moving it to a more suitable location, where the smaller size avoids wasting much memory. Or it may just mark the tail of the region as no longer allocated and re-usable (and it may even be able to re-use it in practice). Or it might end up wasting that technically unallocated memory.
Imagine the cumulative effects of that memory being wasted if you have to read a few dozen kB of data, and the data arrives from the network a few dozen bytes at a time. Worse, imagine if the data is really trickling, and you only get a few bytes at a time. Or if you're reading a very "large" response of several hundred kB.
The amount of memory wasted - effectively allocated by the process, but not usable in any meaningful way - can be huge. 100 kB of data, read 5 bytes at a time, requires 20,480 buffers. If each buffer starts off at 16 kB and is unsuccessfully shrunk, so that it remains at 16 kB, then you've allocated at least 320 MB of memory to hold that 100 kB of data.
Some versions of imaplib exacerbated this problem by introducing multiple layers of buffering and copying. A very old version (hopefully not one you're actually using) even read 1 byte at a time (which would result in 1.6 GB of memory usage in the above scenario).
Of course, this problem usually doesn't show up on Linux, where the re-allocator is not so bad. And at various points in previous Python releases (previous to the most recent 2.x release), the bug was "fixed", so I wouldn't expect to see it show up these days. And this doesn't explain why your program ran fine for a while before failing this way.
But it is my best guess.
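Whatever the root cause, one way to keep peak memory down on the client side is to avoid building any single enormous response: fetch the matching messages one UID at a time rather than all at once. A sketch that continues from the search call above (the handle() step is illustrative):

# data[0] is a space-separated list of matching UIDs from the search
uids = data[0].split()

for uid in uids:
    # one message per round trip, so no individual response gets huge
    result, msg_data = M.uid('fetch', uid, '(RFC822)')
    raw_message = msg_data[0][1]    # the raw RFC822 bytes of this message
    handle(raw_message)             # illustrative processing step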
