How to dereference a module and free memory in Python?

I am working on a project where I need to dereference a module and free its memory for further model training. The demo code below prints the process's memory usage after each block.
I used garbage collection as well as del, but neither worked for me.
import psutil
import sys
import gc

sys.path.insert(0, '/scripts_v2')

# memory before importing pandas
process = psutil.Process()
mem = process.memory_info().rss / (1024 ** 2)
print(mem)

import pandas as pd

# memory after importing pandas
process = psutil.Process()
mem = process.memory_info().rss / (1024 ** 2)
print(mem)

# try to "de-import" pandas and force a collection
sys.modules.pop('pandas')
# del pd
gc.collect()

# memory after removing pandas from sys.modules
process = psutil.Process()
mem = process.memory_info().rss / (1024 ** 2)
print(mem)
I measured the memory usage after each block of code.
The output showed that before and after removing the pandas library, the roughly 60.65 MB it added is still held by the process. How can I free that memory?

Deleting a variable, a module, or anything else in Python does not have to free its memory. Python also caches some objects internally when you initialise them.
Your problem is most likely caused by objects the pandas library caches internally, and as far as I am aware, you cannot free Python's internal caches. There might be a way to do it, but it would be too hacky to be worth it. If you need 100% control over memory in a project, Python will not help you much. For more information about the topic, see the answers here, and especially this answer.
Note that del and gc.collect() might release memory; they simply are not required to.
Edit: sys.modules.pop('pandas') does not de-import pandas, as you can see by running, for example, print(pd.__version__). But the same memory problem occurs even if you actually de-import the library using the methods described here.
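A minimal sketch of that edit note (version numbers and memory figures will of course vary): removing the entry from sys.modules only clears the import cache; any existing reference such as pd keeps the module object, and therefore its memory, alive.
import sys
import gc

import pandas as pd

sys.modules.pop('pandas')
gc.collect()

# the name `pd` (and every module that already imported pandas) still references
# the module object, so nothing has been de-imported or freed
print(pd.__version__)           # still works
print('pandas' in sys.modules)  # False, yet the module object stays alive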

Related

How to share zero copy dataframes between processes with PyArrow

I'm trying to work out how to share data between processes with PyArrow (hopefully, at some stage, to share pandas DataFrames). I am at a rather experimental (read: newbie) stage and am trying to figure out how to use PyArrow. I'm a bit stuck and need help.
Going through the documentation, I found an example to create a buffer
import time
import pyarrow as pa

data = b'abcdefghijklmnopqrstuvwxyz'
buf = pa.py_buffer(data)
print(buf)
# <pyarrow.Buffer address=0x7fa5be7d5850 size=26 is_cpu=True is_mutable=False>

# keep the process alive so the buffer stays allocated
while True:
    time.sleep(1)
While this process was running, I used the address and size that the script printed to (try to) access the buffer in another script:
import pyarrow as pa
buf = pa.foreign_buffer(0x7fa5be7d5850, size=26)
print(buf.to_pybytes())
and received... a segmentation fault, most likely because the script is trying to access memory that belongs to another process, which may require different handling.
Is this not possible with PyArrow, or is it just the way I am trying to do it? Do I need other libraries? I'd like to avoid serialisation (or writing to disk in general) if possible, but that may or may not be achievable. Any pointers are appreciated.
You can't share buffers across processes this way; instead, you should use a memory-mapped file.
To write:
import pyarrow as pa

# create a new 20-byte memory-mapped file and write into it
mmap = pa.create_memory_map("hello.txt", 20)
mmap.write(b"hello")
To read:
import pyarrow as pa

# open the existing file as a memory map and read the bytes back
mmap = pa.memory_map("hello.txt")
mmap.read(5)
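To get closer to the original goal of sharing DataFrame-like data, the same memory-mapped approach works with Arrow's IPC file format. A minimal sketch, assuming a scratch file named data.arrow; reading it back through a memory map avoids copying the Arrow column buffers:
import pyarrow as pa

# writer process: serialise a table to an Arrow IPC file on disk
table = pa.table({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
with pa.OSFile('data.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# reader process: memory-map the file; the column buffers are not copied
with pa.memory_map('data.arrow', 'r') as source:
    loaded = pa.ipc.open_file(source).read_all()
    df = loaded.to_pandas()  # converting to pandas may still copy, depending on the types
    print(df)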

Memory leak on pickle inside a for loop forcing a memory error

I have huge array objects that are pickled with the Python pickler.
I am trying to unpickle them and read out the data in a for loop.
Every time I am done reading and assessing, I delete all the references to those objects.
After deletion, I even call gc.collect() along with time.sleep() to see if the heap memory reduces.
The heap memory doesn't reduce, pointing to the fact that the data is still referenced somewhere within the pickle loading. After 15 data files (I have 250+ files to process, 1.6 GB each) I hit the memory error.
I have seen many other questions here pointing out a memory-leak issue that was supposedly solved.
I don't understand what exactly is happening in my case.
Python's memory management does not return memory to the OS while the process is still running.
Running the for loop so that each file is handled by a subprocess that calls the script solved the issue for me.
Thanks for the feedback.
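A minimal sketch of that approach, assuming the per-file work lives in a hypothetical script called process_one_file.py that takes the pickle path as an argument; each file is handled by a fresh interpreter, so its memory is returned to the OS as soon as that interpreter exits:
import subprocess
import sys

# hypothetical names for the 250+ pickle files
paths = ['data_{:03d}.pkl'.format(i) for i in range(250)]

for path in paths:
    # each invocation unpickles, assesses, and exits; the OS then reclaims its memory
    subprocess.run([sys.executable, 'process_one_file.py', path], check=True)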

Profile memory. Find memory leak in loop

This question has been asked a few times already and I have tried some of the suggested methods. Unfortunately, I still can't find out why my Python process uses so much memory.
My setup: Python 3.5.2, Windows 10, and a lot of third-party packages.
The real memory usage of the process is 300 MB (way too much, and sometimes it even explodes to 32 GB):
import os
import psutil

process = psutil.Process(os.getpid())
memory_real = process.memory_info().rss / (1024 * 1024)  # --> 300 MB
What I have tried so far:
memory line profiler (didn't help me)
tracemalloc.start(50) and then
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    log_and_print(stat)
which gives just a few MB as a result
gc.collect()
import objgraph
objgraph.show_most_common_types()
returns:
function 51791
dict 32939
tuple 28825
list 13823
set 10748
weakref 10551
cell 7870
getset_descriptor 6276
type 6088
OrderedDict 5083
(when the process was at 200 MB, the numbers above were even higher)
pympler: the process exits with some error code
So I'm really struggling to find out where the process's memory is allocated. Am I doing something wrong, or is there an easy way to find out what is going on?
PS:
I was able to solve this problem by luck. It was a badly coded while loop in which a list kept being extended because a proper break condition was missing.
Is there a general way to find such memory leaks? What I often see is that memory-profiling packages have to be called explicitly; in this case I would never get the chance to take a memory dump or check the memory from the main thread, since the loop is never left.
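One way around the "loop is never left" problem, sketched below on the assumption that a daemon thread can run alongside the stuck main thread, is to take tracemalloc snapshots from a background thread at a fixed interval; a runaway list then shows up in the top statistics while the loop is still running. The endless loop at the bottom is only a stand-in for the badly coded loop.
import threading
import time
import tracemalloc

def report_memory(interval=10):
    # daemon thread: keeps printing the top allocation sites even if the main thread is stuck
    while True:
        snapshot = tracemalloc.take_snapshot()
        print("[ Top 5 allocation sites ]")
        for stat in snapshot.statistics('lineno')[:5]:
            print(stat)
        time.sleep(interval)

tracemalloc.start(25)
threading.Thread(target=report_memory, daemon=True).start()

# ... the rest of the program, including the loop that never terminates ...
leak = []
while True:
    leak.extend(range(1000))  # stand-in for the loop with the missing break condition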

Huge memory usage of Python's json module?

When I load the file with the json module, Python's memory usage spikes to about 1.8 GB, and I can't seem to get that memory released. I put together a test case that's very simple:
with open("test_file.json", 'r') as f:
j = json.load(f)
I'm sorry that I can't provide a sample JSON file; my test file has a lot of sensitive information, but for context I'm dealing with a file on the order of 240 MB. After running the above two lines, the previously mentioned 1.8 GB of memory is in use. If I then do del j, memory usage doesn't drop at all. If I follow that with a gc.collect(), it still doesn't drop. I even tried unloading the json module and running another gc.collect().
I'm trying to run some memory profiling, but heapy has been churning at 100% CPU for about an hour now and has yet to produce any output.
Does anyone have any ideas? I've also tried the above using cjson rather than the packaged json module. cjson used about 30% less memory but otherwise showed exactly the same issues.
I'm running Python 2.7.2 on Ubuntu Server 11.10.
I'm happy to load up any memory profiler and see if it does better than heapy, and to provide any diagnostics you might think are necessary. I'm hunting around for a large test JSON file that I can provide for anyone else to give it a go.
I think these two links address some interesting points about this not necessarily being a json issue, but rather a "large object" issue, and about how memory works in Python versus the operating system.
See Why doesn't Python release the memory when I delete a large object? for why memory released by Python is not necessarily reflected by the operating system:
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
About running large object processes in a subprocess to let the OS deal with cleaning up:
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
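A minimal sketch of that subprocess approach, assuming all you need back from the 240 MB file is some small summary (the extract function here is hypothetical): the parsed objects live only in the worker process, and the OS reclaims everything when the worker exits.
import json
import multiprocessing as mp

def extract(path, queue):
    # the ~1.8 GB of parsed objects exists only inside this worker process
    with open(path) as f:
        data = json.load(f)
    queue.put(len(data))  # send back only the small result you actually need

if __name__ == '__main__':
    q = mp.Queue()
    worker = mp.Process(target=extract, args=('test_file.json', q))
    worker.start()
    print(q.get())   # the small summary crosses the process boundary
    worker.join()    # all memory used by the worker is returned to the OS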

cProfile taking a lot of memory

I am attempting to profile my project in Python, but I am running out of memory.
My project itself is fairly memory-intensive, but even half-size runs die with a MemoryError when run under cProfile.
Doing smaller runs is not a good option, because we suspect that the run time is scaling super-linearly, and we are trying to discover which functions dominate during large runs.
Why is cProfile taking so much memory? Can I make it take less? Is this normal?
Updated: Since cProfile is built into current versions of Python (the _lsprof extension), it should be using the main allocator. If this doesn't work for you, Python 2.7.1 has a --with-valgrind configure option which causes it to switch to using malloc() at runtime. This is nice since it avoids having to use a suppressions file. You can build a version just for profiling and then run your Python app under valgrind to look at all allocations made by the profiler, as well as by any C extensions that use custom allocation schemes.
(Rest of original answer follows):
Maybe try to see where the allocations are going. If you have a place in your code where you can periodically dump out the memory usage, you can use guppy to view the allocations:
import lxml.html
from guppy import hpy

hp = hpy()
trees = {}
for i in range(10):
    # do something
    trees[i] = lxml.html.fromstring("<html>")
    print hp.heap()

# examine allocations for specific objects you suspect
print hp.iso(*trees.values())
