How to free memory after opening a file in Python

I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.
It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:
with open(filename) as data:
    accounts = dict()
    for line in data:
        username = line.split()[1]
        IP = line.split()[0]
        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()
print "The accounts have been deleted from memory"
time.sleep(5)
print "End of script"
The last lines are there so that I could monitor memory usage.
The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in terminal.
What I don't understand is why Python needs so much memory after I've cleared the dictionary. Is the file still stored in memory? If so, how can I get rid of it? Is it a problem with my OS not seeing the freed memory?
EDIT: I've tried forcing a gc.collect() after clearing the dictionary, to no avail.
EDIT2: I'm running Python 2.7.3 on Ubuntu 12.04 LTS.
EDIT3: I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that, later on, Python does not seem to reuse that memory (it just asks the OS for more).

This really doesn't make sense to me either, and I wanted to figure out how/why this happens (I thought that's how it should work too!). I replicated it on my machine, though with a smaller file.
I saw two discrete problems here:
Why is Python reading the file into memory (with lazy line reading, it shouldn't, right?)
Why isn't Python freeing the memory back to the system?
I'm not knowledgeable at all about Python internals, so I just did a lot of web searching. All of this could be completely off the mark. (I barely develop anymore; I've been on the business side of tech for the past few years.)
Lazy line reading...
I looked around and found this post -
http://www.peterbe.com/plog/blogitem-040312-1
It's from a much earlier version of Python, but this line resonated with me:
readlines() reads in the whole file at once and splits it by line.
Then I saw this, also old, effbot post:
http://effbot.org/zone/readline-performance.htm
The key takeaway was this:
For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.
and this:
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better
Looking at Python's docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:
This method returns the same thing as iter(f)
Deprecated since version 2.3: Use for line in file instead.
It made me think that perhaps some slurping is going on.
So if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...
Read until EOF using readline() and return a list containing the lines thus read.
And it sort of seems like that's what's happening here.
readline, however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]:
Read one entire line from the file
So I tried switching this to readline, and the process never grew above 40 MB (it was growing to 200 MB, the size of the log file, before):
accounts = dict()
data = open(filename)
line = data.readline()
while line:
    info = line.split("LOG:")
    if len(info) == 2:
        (a, b) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
    line = data.readline()
My guess is that we're not really lazy-reading the file with the for x in data construct, although all the docs and Stack Overflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines consumed approximately the same amount of memory as for line in data.
Freeing memory...
In terms of freeing up memory, I'm not very familiar with Python's internals, but I recall from when I worked with mod_perl: if I opened a file that was 500 MB, that Apache child grew to that size. If I freed the memory, it would only be free within that child; garbage-collected memory was never returned to the OS until the process exited.
So I poked around on that idea and found a few links suggesting this might be happening:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
That was sort of old, and I found a bunch of (accepted) patches to Python afterwards suggesting the behavior was changed and that memory could now be returned to the OS (as of 2005, when most of those patches were submitted and apparently approved).
Then I found this posting http://objectmix.com/python/17293-python-memory-handling.html ; note comment #4:
"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.
So with 2.4 under linux (as you tested) you will indeed not always get
the used memory back, with respect to lots of small objects being
collected.
The difference therefore (I think) you see between doing an f.read() and
an f.readlines() is that the former reads in the whole file as one large
string object (i.e. not a small object), while the latter returns a list
of lines where each line is a python object.
If the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? Perhaps it's not a problem of having a single 3 GB object, but instead having millions of 30k objects.
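If you want to see that small-object effect for yourself, here's a rough, Linux-only sketch (it reads VmRSS from /proc, so it won't work elsewhere; the object counts and sizes are arbitrary) that compares millions of small strings against one big string:
import os

def rss_mb():
    # Current resident set size in MB, read from /proc (Linux only).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0
    return 0.0

print "baseline:                 %.1f MB" % rss_mb()

# Millions of small string objects, roughly like the lines of a log file.
small = ['%d.%d.%d.%d username-%d' % (i % 256, i % 256, i % 256, i % 256, i)
         for i in xrange(2000000)]
print "after 2M small strings:   %.1f MB" % rss_mb()
del small
print "after deleting them:      %.1f MB" % rss_mb()

# One large string of comparable total size.
big = 'x' * (100 * 1024 * 1024)
print "after one 100 MB string:  %.1f MB" % rss_mb()
del big
print "after deleting it:        %.1f MB" % rss_mb()
Depending on the allocator, the big string's memory typically comes straight back to the OS, while some of the small-object memory may stay with the process, which is the behaviour the quoted patch note describes.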

Which version of Python are you using?
I did a test on Python 2.7/Win7, and it worked as expected: the memory was released.
Here I generate sample data like yours:
import random

fn = random.randint

with open('ips.txt', 'w') as f:
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0, 255),
            fn(0, 255),
            fn(0, 255),
            fn(0, 255),
            fn(0, 9000000),
        ))
And then your script. I replaced the plain dict with a defaultdict, because raising exceptions makes the code slower:
import time
from collections import defaultdict

def read_file(filename):
    with open(filename) as data:
        accounts = defaultdict(set)
        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)

    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()
    print "The accounts have been deleted from memory"
    time.sleep(5)
    print "End of script"

if __name__ == '__main__':
    read_file('ips.txt')
In my test, memory usage reached about 1.4 GB and was then released, leaving around 36 MB. Using your original script I got the same results, just a bit slower.

There is a difference between when Python releases memory for reuse by Python and when it releases memory back to the OS. Python has internal pools for some kinds of objects and will reuse those itself, but it doesn't give them back to the OS.
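A quick way to check whether Python is at least reusing that memory internally (a rough sketch; it uses ru_maxrss, which is reported in kilobytes on Linux, and an arbitrary dictionary size) is to build the dictionary twice and see whether the peak grows much the second time:
import resource

def peak_mb():
    # Peak resident set size so far; ru_maxrss is in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

accounts = dict((i, set(['10.0.0.%d' % (i % 256)])) for i in xrange(1000000))
print "peak after first build:  %.1f MB" % peak_mb()
accounts.clear()
accounts = dict((i, set(['10.0.0.%d' % (i % 256)])) for i in xrange(1000000))
print "peak after second build: %.1f MB" % peak_mb()
If the second peak is close to the first, Python is reusing its pools even though top still shows the memory as belonging to the process.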

The gc module may be useful, particularly the collect function. I have never needed it myself, but from the documentation it looks like it could help here. I would try running gc.collect() before you run accounts.clear().
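For what it's worth, the placement would look something like this (a minimal sketch around the code from the question):
import gc

# ... build and use accounts ...

gc.collect()        # break any reference cycles first
accounts.clear()    # drop everything the dictionary was holding
gc.collect()        # collect again so the freed objects can be reclaimed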

Related

Is there a way to open hdf5 files with the POSIX_FADV_DONTNEED flag?

We are working with large (1.2 TB) uncompressed, unchunked hdf5 files with h5py in Python for a machine learning application, which requires us to work through the full dataset repeatedly, loading slices of ~15 MB individually in a randomized order. We are working on a Linux (Ubuntu 18.04) machine with 192 GB RAM. We noticed that the program is slowly filling the cache. When the total size of the cache reaches a size comparable to the full machine RAM (free memory in top almost 0, but plenty of ‘available’ memory), swapping occurs, slowing down all other applications. In order to pinpoint the source of the problem, we wrote a separate minimal example to isolate our data-loading procedures, but found that the problem was independent of each part of our method.
We tried:
Building a numpy memmap and accessing the requested slice:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = np.memmap(tv_path, mode="r", shape=hdf5_event_data.shape,
                            offset=hdf5_event_data.id.get_offset(), dtype=hdf5_event_data.dtype)
self.e = np.ones((512,40,40,19))

#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
Reopening the memmap on each call to getitem:
#on __getitem__:
self.event_data = np.memmap(self.path, mode="r", shape=self.shape,
                            offset=self.offset, dtype=self.dtype)
self.e = self.event_data[index,:,:,:19]
return self.e
Addressing the h5 file directly and converting to a numpy array:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = hdf5_event_data
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
We also tried the above approaches within the PyTorch Dataset/DataLoader framework, but it made no difference.
We observe high memory fragmentation as evidenced by /proc/buddyinfo. Dropping the cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn’t help while the application is running. Cleaning the cache before the application starts removes the swapping behaviour until the cache eats up the memory again, and swapping starts again.
Our working hypothesis is that the system is trying to hold on to cached file data, which leads to memory fragmentation. Eventually, when new memory is requested, swapping is performed even though most memory is still ‘available’.
As such, we turned to ways to change the Linux environment’s behaviour around file caching and found this post. Is there a way to set the POSIX_FADV_DONTNEED flag when opening an h5 file in Python, or for a portion of it that we access via numpy memmap, so that this accumulation of cache does not occur? In our use case we will not be revisiting that particular file location for a long time (till we access all other remaining ‘slices’ of the file).
You can use os.posix_fadvise to tell the OS how regions you plan to load will be used. This naturally requires a bit of low-level tweaking to determine your file descriptor, and get an idea of the regions you plan on reading.
The easiest way to get the file descriptor is to supply it yourself:
pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')
You can now set the advice. For the entire file:
os.posix_fadvise(os.fileno(pf), 0, f.id.get_filesize(), os.POSIX_FADV_DONTNEED)
Or for a particular dataset:
os.posix_fadvise(os.fileno(pf), hdf5_event_data.id.get_offset(),
                 hdf5_event_data.id.get_storage_size(), os.POSIX_FADV_DONTNEED)
Other things to look at
H5py does its own chunk caching. You may want to try turning this off:
f = h5py.File(..., rdcc_nbytes=0)
As an alternative, you may want to try using one of the other drivers provided in h5py, like 'sec2':
f = h5py.File(..., driver='sec2')
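If you want the advice applied per slice rather than once up front, something along these lines might work for the unchunked, contiguous dataset described in the question (a sketch only: SliceReader and the row-size arithmetic are mine, and it assumes the dataset really is stored contiguously so byte offsets can be computed directly):
import os
import numpy as np
import h5py

class SliceReader(object):
    # Hypothetical reader that drops each slice's pages from the page cache after use.
    def __init__(self, path, dataset_name="event_data"):
        self.pf = open(path, "rb")            # keep our own fd for posix_fadvise
        self.f = h5py.File(self.pf, "r")
        self.ds = self.f[dataset_name]
        self.base = self.ds.id.get_offset()   # start of the contiguous data block
        self.row_bytes = int(np.prod(self.ds.shape[1:])) * self.ds.dtype.itemsize

    def __getitem__(self, index):
        out = self.ds[index, :, :, :19]
        # Tell the kernel we won't need the pages backing this row again soon.
        os.posix_fadvise(self.pf.fileno(),
                         self.base + index * self.row_bytes,
                         self.row_bytes,
                         os.POSIX_FADV_DONTNEED)
        return out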

Python script being killed

I have a program that runs for a while and then outputs "Killed". I can't imagine that it's a memory issue because the file it's loading is under a gig. I've been trying to Google what else can cause a Python script to be killed, but all I can find are articles about people being eaten by snakes... Here is my code:
import neo
from neo.io import BlackrockIO

dir = '/PHShome/gcw8/Ephys_Test/MG79_d4_Sat.ns3'
reader = BlackrockIO(filename=dir)
blks = reader.read(lazy=False, cascade=True)

for blk in blks:
    for seg in blk.segments:
        print 'Sampling Rate = %s' % seg.analogsignals[0].sampling_rate
        print 'Number of Channels = %d' % len(blk.recordingchannelgroups[0].recordingchannels)
A little background: the file I'm working on is an electrophysiology data file that consists of
1.) a header containing metadata (small)
2.) data (large)
The lazy option of reader.read() loads only the header when set to True and loads the entire file (including the data) when set to False. The code is not killed when lazy=True but does crash when lazy=False. While lazy=False causes much, much more of the file to be read, the whole file is only 719 MB:
[gcw8#database_dev Ephys_Test]$ du -h ./MG79_d4_Sat.ns3
719M ./MG79_d4_Sat.ns3
So I have trouble believing that it is a memory issue. Can anyone think of another reason this is being killed, or a workaround? I'm running Python 2.7 on CentOS.
That BlackrockIO library appears to parse the data and do all sorts of things with it. It could be that you actually ARE running out of memory. You could try monitoring memory usage with e.g. htop.
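Two quick ways to check that hypothesis (a rough, Linux-specific sketch that reuses the reader and dir from the question): look for the kernel's OOM-killer message after the crash, and log the process's peak memory around the load:
# After the script prints "Killed", check the kernel log for the OOM killer:
#   dmesg | grep -i -E 'out of memory|killed process'

import resource

def peak_mb():
    # Peak resident set size; ru_maxrss is in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

reader = BlackrockIO(filename=dir)
print 'peak before read: %.0f MB' % peak_mb()
blks = reader.read(lazy=False, cascade=True)   # the step that gets killed
print 'peak after read:  %.0f MB' % peak_mb()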

Python Memory leak when accessing GetCurrentImage() from DirectShow comtype

I have to debug someone else's code that has a memory leak. It uses up all the RAM and then eventually crashes (at a rate of 4 MB/s).
I isolated it down to a call that grabs a screenshot of a video filter and saves it to an object named lpDib, but Python does not free the memory after it is used. I have tried to del it and then call gc.collect(). I have assigned it to zero, None, etc. But the memory is not freed.
After some googling I came across this:
http://msdn.microsoft.com/en-us/library/windows/desktop/dd390556%28v=vs.85%29.aspx
Which states:
The VMR allocates the memory for the image and returns a pointer to it
in the lpDib variable. The caller must free the memory by calling
CoTaskMemFree.
When I research memory management, I just keep coming across claims that the Python garbage collector is the best thing since sliced bread, and that you're doing something wrong if you're trying to free memory yourself.
This is the line that creates memory that does not get released:
lpDib = vmr_windowless_control.GetCurrentImage()
Where vmr_windowless_control comes from:
vmr_windowless_control = vmr_config.QueryInterface(IVMRWindowlessControl)
Where vmr_config comes from:
vmr_config = self.filter.QueryInterface(IVMRFilterConfig)
vmr_config.SetRenderingMode(Renderer.VMRMode[vmrmode])
Where IVMRFilterConfig comes from COMtype DirectShowLib import:
try:
    from comtypes.gen.DirectShowLib import (ICreateDevEnum, IBaseFilter, IBindCtx,
                                            IMoniker, IFilterGraph,
                                            IVMRFilterConfig,
                                            IVMRWindowlessControl, IAMCrossbar,
                                            IAMTVTuner, IAMStreamConfig,
                                            IAMVideoProcAmp, IFilterMapper2)
except ImportError:
    GetModule("DirectShow.tlb")  # Create gen.DirectShowLib if it doesn't exist
    from comtypes.gen.DirectShowLib import (ICreateDevEnum, IBaseFilter, IBindCtx,
                                            IMoniker, IFilterGraph,
                                            IVMRFilterConfig,
                                            IVMRWindowlessControl, IAMCrossbar,
                                            IAMTVTuner, IAMStreamConfig,
                                            IAMVideoProcAmp, IFilterMapper2)
No idea how to attack this, brain is fried. Thanks in advance for any help or leads.
Just needed to add the following:
from ctypes import windll
windll.ole32.CoTaskMemFree(lpDib)
Thanks guys.
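The same call can be wrapped in a try/finally so the buffer is released even if the code that copies the bitmap out fails (a sketch; the copy step stands in for whatever the surrounding code does with lpDib):
from ctypes import windll

lpDib = vmr_windowless_control.GetCurrentImage()
try:
    # ... copy the bitmap data out of lpDib here ...
    pass
finally:
    # The VMR allocated this buffer; the caller must free it, per MSDN.
    windll.ole32.CoTaskMemFree(lpDib)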

Python object persistence

I'm seeking advice about methods of implementing object persistence in Python. To be more precise, I wish to be able to link a Python object to a file in such a way that any Python process that opens a representation of that file shares the same information, any process can change its object and the changes will propagate to the other processes, and even if all processes "storing" the object are closed, the file will remain and can be re-opened by another process.
I found three main candidates for this in my distribution of Python - anydbm, pickle, and shelve (dbm appeared to be perfect, but it is Unix-only, and I am on Windows). However, they all have flaws:
anydbm can only handle a dictionary of string values (I'm seeking to store a list of dictionaries, all of which have string keys and string values, though ideally I would seek a module with no type restrictions)
shelve requires that a file be re-opened before changes propagate - for instance, if two processes A and B load the same file (containing a shelved empty list), and A adds an item to the list and calls sync(), B will still see the list as being empty until it reloads the file.
pickle (the module I am currently using for my test implementation) has the same "reload requirement" as shelve, and also does not overwrite previous data - if process A dumps fifteen empty strings onto a file, and then the string 'hello', process B will have to load the file sixteen times in order to get the 'hello' string. I am currently dealing with this problem by preceding any write operation with repeated reads until end of file ("wiping the slate clean before writing on it"), and by making every read operation repeated until end of file, but I feel there must be a better way.
My ideal module would behave as follows (with "A>>>" representing code executed by process A, and "B>>>" code executed by process B):
A>>> import imaginary_perfect_module as mod
B>>> import imaginary_perfect_module as mod
A>>> d = mod.load('a_file')
B>>> d = mod.load('a_file')
A>>> d
{}
B>>> d
{}
A>>> d[1] = 'this string is one'
A>>> d['ones'] = 1 #anydbm would sulk here
A>>> d['ones'] = 11
A>>> d['a dict'] = {'this dictionary' : 'is arbitrary', 42 : 'the answer'}
B>>> d['ones'] #shelve would raise a KeyError here, unless A had called d.sync() and B had reloaded d
11 #pickle (with different syntax) would have returned 1 here, and then 11 on next call
(etc. for B)
I could achieve this behaviour by creating my own module that uses pickle, and editing the dump and load behaviour so that they use the repeated reads I mentioned above - but I find it hard to believe that this problem has never occurred to, and been fixed by, more talented programmers before. Moreover, these repeated reads seem inefficient to me (though I must admit that my knowledge of operation complexity is limited, and it's possible that these repeated reads are going on "behind the scenes" in otherwise apparently smoother modules like shelve). Therefore, I conclude that I must be missing some code module that would solve the problem for me. I'd be grateful if anyone could point me in the right direction, or give advice about implementation.
Use the ZODB (the Zope Object Database) instead. Backed with ZEO it fulfills your requirements:
Transparent persistence for Python objects
ZODB uses pickles underneath so anything that is pickle-able can be stored in a ZODB object store.
Full ACID-compatible transaction support (including savepoints)
This means changes from one process propagate to all the other processes when they are good and ready, and each process has a consistent view on the data throughout a transaction.
ZODB has been around for over a decade now, so you are right in surmising this problem has already been solved before. :-)
The ZODB lets you plug in storages; the most common format is the FileStorage, which stores everything in one Data.fs with an optional blob storage for large objects.
Some ZODB storages are wrappers around others to add functionality; DemoStorage for example keeps changes in memory to facilitate unit testing and demonstration setups (restart and you have a clean slate again). BeforeStorage gives you a window in time, only returning data from transactions before a given point in time. The latter has been instrumental in recovering lost data for me.
ZEO is such a plugin that introduces a client-server architecture. Using ZEO lets you access a given storage from multiple processes at a time; you won't need this layer if all you need is multi-threaded access from one process only.
The same could be achieved with RelStorage, which stores ZODB data in a relational database such as PostgreSQL, MySQL or Oracle.
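For a rough idea of what basic use looks like (a minimal sketch; the filename and keys here are just placeholders mirroring the question's example session):
import ZODB, ZODB.FileStorage
import transaction

storage = ZODB.FileStorage.FileStorage('a_file.fs')   # everything lives in this one file
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()                # behaves like a persistent dictionary

root['ones'] = 11
root['a dict'] = {'this dictionary': 'is arbitrary', 42: 'the answer'}
transaction.commit()              # other connections see the change once committed

db.close()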
To get started, you can port your shelve databases to ZODB databases like this:
#!/usr/bin/env python
import shelve
import ZODB, ZODB.FileStorage
import transaction
from optparse import OptionParser
import os
import sys
import re

reload(sys)
sys.setdefaultencoding("utf-8")

parser = OptionParser()
parser.add_option("-i", "--input", dest="in_file", default=False, help="original shelve database filename")
parser.add_option("-o", "--output", dest="out_file", default=False, help="new zodb database filename")
parser.set_defaults()
options, args = parser.parse_args()

if options.in_file == False or options.out_file == False:
    print "Need input and output database filenames"
    exit(1)

db = shelve.open(options.in_file, writeback=True)
zstorage = ZODB.FileStorage.FileStorage(options.out_file)
zdb = ZODB.DB(zstorage)
zconnection = zdb.open()
newdb = zconnection.root()

for key, value in db.iteritems():
    print "Copying key: " + str(key)
    newdb[key] = value

transaction.commit()
I suggest using TinyDB; it's much, much better and simpler to use.
https://tinydb.readthedocs.io/en/stable/
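A minimal sketch of what that looks like (the filename and fields are just examples):
from tinydb import TinyDB, Query

db = TinyDB('accounts.json')      # a plain JSON file on disk
db.insert({'username': 'alice', 'ips': ['10.0.0.1']})

User = Query()
matches = db.search(User.username == 'alice')   # list of matching documents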

How to accelerate reads from batches of files

I read many files from my system. I want to read them faster, maybe like this:
results = []
for file in open("filenames.txt").readlines():
    results.append(open(file, "r").read())
I don't want to use threading. Any advice is appreciated.
The reason I don't want to use threads is that they would make my code unreadable; I want to find some tricky way to make it faster, with less code and easier to understand.
Yesterday I tested another solution with multiprocessing, and it worked badly; I don't know why.
Here is the code:
def xml2db(file):
    s = pq(open(file, "r").read())
    dict = {}
    for field in g_fields:
        dict[field] = s("field[#name='%s']" % field).text()
    p = Product()
    for k, v in dict.iteritems():
        if v is None or v.strip() == "":
            pass
        else:
            if hasattr(p, k):
                setattr(p, k, v)
    session.commit()

#cost_time
#statistics_db
def batch_xml2db():
    from multiprocessing import Pool, Queue
    p = Pool(5)
    #q = Queue()
    files = glob.glob(g_filter)
    #for file in files:
    #    q.put(file)
    def P():
        while q.qsize() <> 0:
            xml2db(q.get())
    p.map(xml2db, files)
    p.join()
results = [open(f.strip()).read() for f in open("filenames.txt").readlines()]
This may be insignificantly faster, but it's probably less readable (depending on the reader's familiarity with list comprehensions).
Your main problem here is that your bottleneck is disk IO - buying a faster disk will make much more of a difference than modifying your code.
Well, if you want to improve performance then improve the algorithm, right? What are you doing with all this data? Do you really need it all in memory at the same time, potentially causing OOM if filenames.txt specifies too many files, or files that are too large?
If you're doing this with lots of files I suspect you are thrashing, hence your 700+ second (over ten minutes) runtime. Even my poor little HD can sustain 42 MB/s writes (42 * 714s = 30 GB). Take that with a grain of salt knowing you must read and write, but I'm guessing you don't have over 8 GB of RAM available for this application. A related SO question/answer suggested you use mmap, and the answer above that suggested an iterative/lazy read like what you get in Haskell for free. These are probably worth considering if you really do have tens of gigabytes to munge.
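A sketch of the mmap/lazy-read idea, in case it helps (it reuses the question's filenames.txt and keeps only one line in Python at a time instead of a list of whole-file strings):
import mmap

def iter_lines(path):
    # Yield lines from a file via mmap instead of reading it all into one list.
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mm.readline, ''):   # readline returns '' at EOF
                yield line
        finally:
            mm.close()

for name in open("filenames.txt"):
    for line in iter_lines(name.strip()):
        pass  # process one line at a time instead of holding whole files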
Is this a one-off requirement or something that you need to do regularly? If it's something you're going to be doing often, consider using MySQL or another database instead of a file system.
Not sure if this is still the code you are using.
A couple of adjustments I would consider making.
Original:
def xml2db(file):
    s = pq(open(file, "r").read())
    dict = {}
    for field in g_fields:
        dict[field] = s("field[#name='%s']" % field).text()
    p = Product()
    for k, v in dict.iteritems():
        if v is None or v.strip() == "":
            pass
        else:
            if hasattr(p, k):
                setattr(p, k, v)
    session.commit()
Updated:
Remove the use of the dict; it is extra object creation, iteration and collection.
def xml2db(file):
    s = pq(open(file, "r").read())
    p = Product()
    for k in g_fields:
        v = s("field[#name='%s']" % k).text()
        if v is None or v.strip() == "":
            pass
        else:
            if hasattr(p, k):
                setattr(p, k, v)
    session.commit()
You could profile the code using the Python profiler.
This might tell you where the time is being spent.
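Something like this would do it (a sketch; the output filename and the number of lines printed are arbitrary):
import cProfile
import pstats

cProfile.run('batch_xml2db()', 'xml2db.prof')            # profile the whole batch
pstats.Stats('xml2db.prof').sort_stats('cumulative').print_stats(20)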
It may be in session.commit(); the commit may need to be reduced to, say, every couple of files.
I have no idea what it does, so that is really a stab in the dark; you could also try running it without sending or writing any output.
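If the commit does turn out to be the expensive part, batching it might look roughly like this (a sketch: xml2db_without_commit is a hypothetical variant of xml2db with its session.commit() line removed, and commit_every is an arbitrary choice):
def batch_xml2db_batched(files, commit_every=50):
    for i, f in enumerate(files, 1):
        xml2db_without_commit(f)      # hypothetical: xml2db minus its session.commit()
        if i % commit_every == 0:
            session.commit()          # flush a batch of Product objects at once
    session.commit()                  # commit whatever is left over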
If you can separate your code into reading, processing and writing stages, you can test the cost of each stage individually:
A) Reading cost: see how long it takes to read all the files.
B) Processing cost: load a single file into memory and process it enough times to represent the entire job without the extra read IO.
C) Output cost: save a whole batch of sessions representative of your job size.
This should show you what is taking the most time and whether any improvement can be made in any area.
