Huge memory usage of Python's json module?

When I load the file with json, Python's memory usage spikes to about 1.8GB and I can't seem to get that memory to be released. I put together a very simple test case:
import json

with open("test_file.json", 'r') as f:
    j = json.load(f)
I'm sorry that I can't provide a sample JSON file; my test file has a lot of sensitive information, but for context, I'm dealing with a file on the order of 240MB. After running the above two lines I have the previously mentioned 1.8GB of memory in use. If I then do del j, memory usage doesn't drop at all. If I follow that with a gc.collect(), it still doesn't drop. I even tried unloading the json module and running another gc.collect().
I'm trying to run some memory profiling but heapy has been churning 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu server 11.10.
I'm happy to load up any memory profiler and see if it does better than heapy, and to provide any diagnostics you might think are necessary. I'm hunting around for a large test JSON file that I can provide for anyone else to give it a go.

I think these two links address some interesting points about this not necessarily being a json issue, but rather just a "large object" issue and how memory works with Python versus the operating system.
See Why doesn't Python release the memory when I delete a large object? for why memory released from Python is not necessarily reflected by the operating system:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
About running large object processes in a subprocess to let the OS deal with cleaning up:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
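For the OP's case that could look something like the following sketch (hedged: the file name is the OP's test file, and len(j) is just a stand-in for whatever small summary you actually need back from the worker):
# Hedged sketch of the subprocess idea: do the memory-hungry json.load in a
# worker process so all of its memory goes back to the OS when the worker exits.
import json
import multiprocessing

def load_and_summarise(path):
    with open(path, 'r') as f:
        j = json.load(f)  # the ~1.8GB spike happens inside the child process
    return len(j)         # stand-in: return only the small result you need

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=1)
    try:
        result = pool.apply(load_and_summarise, ("test_file.json",))
    finally:
        pool.close()
        pool.join()
    # the worker has exited here, so its memory is back with the OS
    print(result)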

Related

How to efficiently run multiple PyTorch processes / models at once? Traceback: The paging file is too small for this operation to complete

Background
I have a very small network which I want to test with different random seeds.
The network barely uses 1% of my GPU's compute power, so I could in theory run 50 processes at once to try many different seeds.
Problem
Unfortunately I can't even import PyTorch in multiple processes. When the number of processes exceeds 4 I get a traceback about a too-small paging file.
Minimal reproducible code§ - dispatcher.py
from subprocess import Popen
import sys

procs = []
for seed in range(50):
    procs.append(Popen([sys.executable, "ml_model.py", str(seed)]))
for proc in procs:
    proc.wait()
§I increased the number of seeds so people with better machines can also reproduce this.
Minimal reproducible code - ml_model.py
import torch
import time
time.sleep(10)
Traceback (most recent call last):
  File "ml_model.py", line 1, in <module>
    import torch
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Further Investigation
I noticed that each process loads a lot of DLLs into RAM, and when I close all other programs which use a lot of RAM I can get up to 10 processes instead of 4. So it seems like a resource constraint.
Questions
Is there a workaround?
What's the recommended way to train many small networks with PyTorch on a single GPU?
Should I write my own CUDA kernel instead, or use a different framework to achieve this?
My goal would be to run around 50 processes at once (on a machine with 16GB RAM and 8GB GPU RAM).
I've looked a bit into this tonight. I don't have a solution (edit: I have a mitigation, see the edit at the end), but I have a bit more information.
It seems the issue is caused by NVIDIA fatbins (.nv_fatb) being loaded into memory. Several DLLs, such as cusolver64_xx.dll, torch_cuda_cu.dll, and a few others, have .nv_fatb sections in them. These contain tons of different variations of CUDA code for different GPUs, so it ends up being several hundred megabytes to a couple of gigabytes.
When Python imports 'torch' it loads these DLLs and maps the .nv_fatb section into memory. For some reason, instead of just being a memory-mapped file, it actually takes up memory. The section is set as 'copy on write', so it's possible something writes into it? I don't know. But anyway, if you look at Python using VMMap ( https://learn.microsoft.com/en-us/sysinternals/downloads/vmmap ) you can see that these DLLs commit huge amounts of memory for this .nv_fatb section. The frustrating part is that it doesn't seem to be using the memory. For example, right now my Python.exe has 2.7GB committed, but the working set is only 148MB.
Every Python process that loads these DLLs will commit several GB of memory doing so. So if one Python process is wasting 2GB of memory and you try running 8 workers, you need 16GB of memory to spare just to load the DLLs. It really doesn't seem like this memory is used, just committed.
I don't know enough about these fatbinaries to try to fix it, but from looking at this for the past two hours it really seems like they are the issue. Perhaps it's an NVIDIA problem that these are committing memory?
edit: I made this python script: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5
Get it and install its pefile dependency ( python -m pip install pefile ).
Run it on your torch and CUDA DLLs. In the OP's case, the command line might look like:
python fixNvPe.py --input=C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll
(You also want to run this wherever your cusolver64_*.dll and friends are. This may be in your torch\lib folder, or it may be, eg, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.X\bin . If it is under Program Files, you will need to run the script with administrative privileges)
What this script is going to do is scan through all DLLs specified by the input glob, and if it finds an .nv_fatb section it will back up the DLL, disable ASLR, and mark the .nv_fatb section read-only.
ASLR is 'address space layout randomization.' It is a security feature that randomizes where a DLL is loaded in memory. We disable it for this DLL so that all Python processes will load the DLL into the same base virtual address. If all Python processes using the DLL load it at the same base address, they can all share the DLL. Otherwise each process needs its own copy.
Marking the section 'read-only' lets Windows know that the contents will not change in memory. If you map a file into memory read/write, Windows has to commit enough memory, backed by the pagefile, just in case you make a modification to it. If the section is read-only, there is no need to back it in the pagefile. We know there are no modifications to it, so it can always be found in the DLL.
The theory behind the script is that by changing these two flags, less memory will be committed for the .nv_fatb section, and more memory will be shared between the Python processes. In practice, it works. Not quite as well as I'd hoped (it still commits a lot more than it uses), so my understanding may be flawed, but it significantly decreases memory commit.
In my limited testing I haven't run into any issues, but I can't guarantee there are no code paths that attempt to write to the section we marked read-only. If you start running into issues, though, you can just restore the backups.
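(Side note: if you want to check whether your own DLLs are affected before patching anything, a hedged inspection sketch using the same pefile dependency might look like this; the glob pattern is the OP's path and needs adjusting to your install.)
# Hedged sketch: list DLLs that contain an .nv_fatb section and report whether
# it is marked writable (a writable, copy-on-write section is what forces the
# large memory commit). Requires: python -m pip install pefile
import glob
import pefile

IMAGE_SCN_MEM_WRITE = 0x80000000  # PE section characteristic flag: writable

pattern = r"C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll"
for dll in glob.glob(pattern):
    pe = pefile.PE(dll, fast_load=True)
    for section in pe.sections:
        if section.Name.rstrip(b"\x00") == b".nv_fatb":
            writable = bool(section.Characteristics & IMAGE_SCN_MEM_WRITE)
            size_mb = section.Misc_VirtualSize / (1024 * 1024)
            print(f"{dll}: .nv_fatb is {size_mb:.0f} MB, writable={writable}")
    pe.close()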
edit 2022-01-20:
Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only, this change will be targeting the next major CUDA release 11.7. We are not changing the ASLR, as that is considered a safety feature."
This should certainly help. If the read-only change alone isn't enough and ASLR needs changing as well, the script should still work.
In my case the system was already set to 'system managed size', yet I had the same error; that was because I pass a large variable to multiple processes within a function. I would likely need to set a very large paging file, as Windows cannot grow it on the fly, but instead I opted to reduce the number of processes, as it is not a function that is always used.
If you are on Windows, it may be better to use one (or more) fewer cores than the total number of physical cores, since the multiprocessing module in Python on Windows tends to grab everything it can and will otherwise try to use all logical cores.
>>> import multiprocessing
>>> multiprocessing.cpu_count()
12
# I actually have 6 physical cores; if you use this as the base it will likely hog the system
>>> import psutil
>>> psutil.cpu_count(logical=False)
6   # actual number of physical cores
>>> psutil.cpu_count(logical=True)
12  # logical cores (e.g. hyperthreading)
Please refer here for more detail:
Multiprocessing: use only the physical cores?
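A rough sketch of that advice, assuming psutil as the only extra dependency and a placeholder work function:
# Hedged sketch: size the pool to physical cores minus one instead of
# multiprocessing.cpu_count(), which counts logical cores on Windows.
import multiprocessing
import psutil

def work(seed):
    return seed * seed  # placeholder for the real per-seed job

if __name__ == "__main__":
    n_workers = max(1, psutil.cpu_count(logical=False) - 1)
    pool = multiprocessing.Pool(processes=n_workers)
    try:
        results = pool.map(work, range(50))
    finally:
        pool.close()
        pool.join()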
Well, I managed to resolve this.
Open "Advanced system settings". Go to the Advanced tab, then click the Settings button under Performance.
Again click on the Advanced tab --> Change --> unselect 'Automatically......'. For all the drives, set 'System managed size'. Restart your PC.
Following up on #chris-obryan's answer (I would comment but have no reputation), I've found that memory utilisation drops pretty sharply some time into training with their fix applied (on the order of the mentioned 2GB per process).
To eke out some more performance, it may be worth monitoring memory utilisation and spawning a new instance of the model when these drops in memory occur, leaving enough space (~3 or 4GB to be safe) for a bit of overhead (a rough sketch follows this answer).
I was seeing ~28GB of RAM utilised during the setup phase, which dropped to about 14GB after iterating for a while.
(Note that my use case is a little different here, as I'm bottlenecked by host<->device transfers due to optimising with a GA, and a reasonable amount of CPU-bound processing needs to occur after each generation, so this could play into it. I am also using concurrent.futures.ProcessPoolExecutor() rather than manually using subprocesses.)
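The rough sketch mentioned above, reusing the OP's ml_model.py dispatcher idea; the headroom threshold and polling interval are guesses you would need to tune:
# Hedged sketch: only launch the next worker once enough RAM is free again,
# keeping a few GB of headroom for the setup-phase spike.
import subprocess
import sys
import time

import psutil

HEADROOM_BYTES = 4 * 1024 ** 3  # keep ~4GB spare; tune for your machine
TOTAL_SEEDS = 50

procs = []
next_seed = 0
while next_seed < TOTAL_SEEDS:
    if psutil.virtual_memory().available > HEADROOM_BYTES:
        procs.append(subprocess.Popen([sys.executable, "ml_model.py", str(next_seed)]))
        next_seed += 1
    time.sleep(5)  # give the newest process time to get past its setup phase

for proc in procs:
    proc.wait()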
I changed num_workers = 10 to num_workers = 1. That solved the problem for me.
To fix this problem, I updated to CUDA 11.8.0 and installed PyTorch with the 11.6 cudatoolkit version (PyTorch 1.9.1), using conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Thanks to #chris-obryan I understood the problem and figured an update might already be available. I measured the memory consumption before and after the update, and it dropped sharply.
Since it seems that each import torch loads a bunch of fat DLLs (thanks #chris-obryan), I tried changing this:
import torch

if __name__ == "__main__":
    # multiprocessing stuff, paging file errors
to this...
if __name__ == "__main__":
    import torch
    # multiprocessing stuff
And it worked well (because when the subprocesses are created __name__ is not "__main__").
Not an elegant solution, but perhaps useful to someone.

Fast zip decryption in python

I have a program which processes a zip file using zipfile. It works with an iterator, since the uncompressed file is bigger than 2GB and could become a memory problem.
import zipfile
from io import BytesIO

with zipfile.ZipFile(BytesIO(my_file)) as myzip:
    for file_inside in myzip.namelist():
        with myzip.open(file_inside) as file:
            # Process here
            # for loop ....
Then I noticed that this process was extremely slow at processing my file. I can understand that it may take some time, but at least it should use my machine's resources: let's say the Python process should use the core where it lives at 100%.
Since it doesn't, I started researching the possible root causes. I'm not an expert in compression matters, so I first considered basic things:
Resources seem not to be the problem; there's plenty of RAM available even if my coding approach wouldn't use it.
CPU is not at a high level of usage, not even for one core.
The file being opened is just about 80MB when compressed, so disk reading should not be a slowing issue either.
This made me think that the bottleneck could be in one of the most invisible parameters: RAM bandwidth. However, I have no idea how I could measure this.
Then, on the software side, I found this in the zipfile docs:
Decryption is extremely slow as it is implemented in native Python rather than C.
I guess that if it's using native Python, it's not even using OpenGL acceleration, so that's another point for slowness. I'm also curious about how this method works, again because of the low CPU usage.
So my question is, of course: how could I work in a similar way (not having the full uncompressed file in RAM), but uncompressing in a faster way in Python? Is there another library, or maybe another approach, to overcome this slowness?
There is this lib for python to handle zipping files without memory hassle.
Quoted from the docs:
Buzon - ZipFly
ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without memory inflation.
I've never used it, but it might help.
I've done some research and found the following:
You could "pip install czipfile", more information at https://pypi.org/project/czipfile/
Another solution is to use "Cython", a variant of python -https://www.reddit.com/r/Python/comments/cksvp/whats_a_python_zip_library_with_fast_decryption/
Or you could outsource to 7-Zip, as explained here: Faster alternative to Python's zipfile module?
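If you go the 7-Zip route, a minimal sketch could look like this; it assumes the 7z executable is on your PATH, and the archive path, output directory and password are placeholders:
# Hedged sketch: let 7-Zip do the decryption/extraction in native code instead
# of zipfile's pure-Python decrypter.
import subprocess

def extract_with_7zip(archive_path, out_dir, password=None):
    cmd = ["7z", "x", archive_path, "-o" + out_dir, "-y"]
    if password:
        cmd.append("-p" + password)
    subprocess.run(cmd, check=True)

extract_with_7zip("my_file.zip", "/tmp/extracted", password="secret")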
It's quite stupid that Python doesn't implement zip decryption in pure C.
So I made it in Cython, which is 17 times faster.
Just get the dezip.pyx and setup.py from this gist.
https://gist.github.com/zylo117/cb2794c84b459eba301df7b82ddbc1ec
Then install Cython and build the Cython library:
pip3 install cython
python3 setup.py build_ext --inplace
Then run the original script with two more lines.
import zipfile
# add these two lines
from dezip import _ZipDecrypter_C
setattr(zipfile, '_ZipDecrypter', _ZipDecrypter_C)
z = zipfile.ZipFile('./test.zip', 'r')
z.extractall('/tmp/123', None, b'password')

Can Pickle handle files larger than the RAM installed on my machine?

I'm using pickle to save to disk my NLP classifier built with the TextBlob library.
I'm using pickle after a lot of searching related to this question. At the moment I'm working locally and I have no problem loading the pickle file (which is 1.5GB) on my i7 machine with 16GB of RAM. But the idea is that my program, in the future, has to run on my server, which only has 512MB of RAM installed.
Can pickle handle such a large file or will I face memory issues?
On my server I've got Python 3.5 installed and it is a Linux server (not sure which distribution).
I'm asking because at the moment I can't access my server, so I can't just try it and find out what happens, but at the same time I'm unsure whether I can keep this approach or have to find other solutions.
Unfortunately this is difficult to accurately answer without testing it on your machine.
Here are some initial thoughts:
There is no inherent size limit that the pickle module enforces, but you're pushing the boundaries of its intended use; it's not designed for individual large objects. However, since you're using Python 3.5, you can take advantage of PEP 3154, which adds better support for large objects. You should specify pickle.HIGHEST_PROTOCOL when you dump your data (see the sketch after these notes).
You will likely take a large performance hit because you're trying to deal with an object that is 3x the size of your memory. Your system will probably start swapping, and possibly even thrashing. RAM is so cheap these days that bumping it up to at least 2GB should help significantly.
To handle the swapping, make sure you have enough swap space available (a large swap partition if you're on Linux, or enough space for the swap file on your primary partition on Windows).
As pal sch's comment shows, Pickle is not very friendly to RAM consumption during the pickling process, so you may have to deal with Python trying to get even more memory from the OS than the 1.5GB we may expect for your object.
Given these considerations, I don't expect it to work out very well for you. I'd strongly suggest upgrading the RAM on your target machine to make this work.
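A minimal sketch of the HIGHEST_PROTOCOL suggestion; the classifier object and file name below are placeholders, not the OP's actual model:
# Hedged sketch: dump with the highest available pickle protocol
# (protocol 4 on Python 3.5, per PEP 3154), which handles large objects better.
import pickle

classifier = {"example": "stand-in for the TextBlob classifier"}

with open("classifier.pkl", "wb") as f:
    pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("classifier.pkl", "rb") as f:
    classifier = pickle.load(f)  # on the 512MB server, this load is the risky part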
I don't see how you could load an object into RAM that exceeds the amount of RAM, i.e. bytes(num_bytes_greater_than_ram) will always raise a MemoryError.

Python process consuming increasing amounts of system memory, but heapy shows roughly constant usage

I'm trying to identify a memory leak in a Python program I'm working on. I'm currently running Python 2.7.4 on 64-bit Mac OS. I installed heapy to hunt down the problem.
The program involves creating, storing, and reading large database using the shelve module. I am not using the writeback option, which I know can create memory problems.
Heapy usage shows that the memory is roughly constant during the program's execution. Yet my activity monitor shows rapidly increasing memory usage. Within 15 minutes, the process has consumed all my system memory (16GB), and I start seeing page outs. Any idea why heapy isn't tracking this properly?
Take a look at this fine article. You are, most likely, not seeing memory leaks but memory fragmentation. The best workaround I have found is to identify what the output of your large working set operation actually is, load the large dataset in a new process, calculate the output, and then return that output to the original process.
This answer has some great insight and an example, as well. I don't see anything in your question that seems like it would preclude the use of PyPy.
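One hedged way to apply that fresh-process workaround without much restructuring is a pool that recycles its workers after every task; heavy_job below is a placeholder for the real shelve work:
# Hedged sketch: run each heavy job in a worker that is discarded after one
# task (maxtasksperchild=1), so any fragmented heap goes back to the OS
# instead of accumulating in the long-lived parent process.
import multiprocessing

def heavy_job(key):
    data = [key] * 1000000  # placeholder for the real shelve create/store/read work
    return len(data)        # return only a small summary to the parent

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2, maxtasksperchild=1)
    try:
        results = pool.map(heavy_job, range(10))
    finally:
        pool.close()
        pool.join()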

Why does running SQLite (through python) cause memory to "unofficially" fill up?

I'm dealing with some big (tens of millions of records, around 10GB) database files using SQLite. I'm doing this with Python's standard interface.
When I try to insert millions of records into the database, or create indices on some of the columns, my computer slowly runs out of memory. If I look at the normal system monitor, it looks like the majority of the system memory is free. However, when I use top, it looks like I have almost no system memory free. If I sort the processes by their memory consumption, then none of them uses more than a couple of percent of my memory (including the python process that is running sqlite).
Where is all the memory going? Why do top and Ubuntu's system monitor disagree about how much system memory I have? Why does top tell me that I have very little memory free, and at the same time not show which process(es) is (are) using all the memory?
I'm running Ubuntu 11.04, sqlite3, python 2.7.
10 to 1 says you are confused by Linux's filesystem buffer/cache.
see
ofstream leaking memory
https://superuser.com/questions/295900/linux-sort-all-data-in-memory/295902#295902
Test it by doing (as root)
echo 3 > /proc/sys/vm/drop_caches
The memory may not be assigned to a process, but it can be, e.g., a file on a tmpfs filesystem (/dev/shm, sometimes /tmp). You should show us the output of top or free (please note those tools do not show a single 'memory usage' value) to let us say something more about the memory usage.
In the case of inserting records into a database, it may be a temporary image created for the current transaction before it is committed to the real database. Splitting the insertion into many separate transactions (if applicable) may help; a sketch of this batching idea follows below.
I am just guessing, not enough data.
P.S. It seems I mis-read the original question (I assumed the computer slows down) and there is no problem. sehe's answer is probably better.
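A sketch of the separate-transactions suggestion mentioned above, with a placeholder table schema and batch size:
# Hedged sketch: commit in batches so each transaction's temporary data stays
# bounded, instead of building one huge transaction for millions of rows.
import sqlite3

def insert_in_batches(db_path, records, batch_size=100000):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS data (a, b)")  # placeholder schema
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO data VALUES (?, ?)", batch)
            conn.commit()  # close this transaction before starting the next batch
            batch = []
    if batch:
        cur.executemany("INSERT INTO data VALUES (?, ?)", batch)
        conn.commit()
    conn.close()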
