Loading dictionary object causing memory spike - python

I have a dictionary object with about 60,000 keys that I cache and access in my Django view. The view provides basic search functionality where I look for a search term in the dictionary like so:
projects_map = cache.get('projects_map')
projects_map.get('search term')
However, just grabbing the cached object (in the first line) causes a giant spike in memory usage on the server - upwards of 100MB sometimes - and the memory isn't released even after the values are returned and the template is rendered.
How can I keep the memory from jacking up like this? Also, I've tried explicitly deleting the object after I grab the value, but even that doesn't release the memory.
Any help is greatly appreciated.
Update: Solution I ultimately implemented
I decided to implement my own indexing table in which I store the keys and their pickled value. Now, instead of using get() on a dictionary, I use:
ProjectsIndex.objects.get(index_key=<search term>)
and unpickle the value. This seems to take care of the memory issue as I'm no longer loading a giant object into memory. It adds another small query to the page but that's about it. Seems to be the perfect solution...for now.
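For reference, here is a rough sketch of what such an indexing table might look like (the model and field names below are illustrative, not the actual code from the question):
import pickle
from django.db import models

class ProjectsIndex(models.Model):
    # One row per key; the value is stored pickled.
    index_key = models.CharField(max_length=255, unique=True, db_index=True)
    pickled_value = models.BinaryField()

def lookup(term):
    # One indexed query per search; the full 60,000-entry dict never enters memory.
    try:
        row = ProjectsIndex.objects.get(index_key=term)
    except ProjectsIndex.DoesNotExist:
        return None
    return pickle.loads(row.pickled_value)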

What about using an appropriate service for caching, such as redis or memcached, instead of loading the huge object into memory Python-side? This way, you'll even have the ability to scale out to extra machines, should the dictionary grow more.
Anyway, the 100MB of memory contains all the data plus the hash index plus miscellaneous overhead; I noticed myself the other day that memory often doesn't get deallocated until you quit the Python process (I filled up a couple of gigs of memory from the Python interpreter loading a huge JSON object :)); it would be interesting if anybody has a solution for that.
Update: caching with very little memory
Your options with only 512MB of RAM are:
Use redis, and have a look at http://redis.io/topics/memory-optimization (but I suspect 512MB isn't enough, even with optimization)
Use a separate machine (or a cluster of them, since both memcached and redis support sharding) with way more RAM to keep the cache
Use the database cache backend, much slower but less memory-consuming, as it saves everything on the disk
Use filesystem cache (although I don't see the point of preferring this over database cache)
and, in the latter two cases, try splitting up your objects, so that you never retrieve megabytes of objects from the cache at once.
Update: lazy dict spanning over multiple cache keys
You can replace your cached dict with something like this; this way, you can continue treating it as you would with a normal dictionary, but data will be loaded from cache only when you really need it.
from django.core.cache import cache
from UserDict import DictMixin

class LazyCachedDict(DictMixin):
    def __init__(self, key_prefix):
        self.key_prefix = key_prefix

    def __getitem__(self, name):
        return cache.get('%s:%s' % (self.key_prefix, name))

    def __setitem__(self, name, value):
        return cache.set('%s:%s' % (self.key_prefix, name), value)

    def __delitem__(self, name):
        return cache.delete('%s:%s' % (self.key_prefix, name))

    def has_key(self, name):
        return cache.has_key('%s:%s' % (self.key_prefix, name))

    def keys(self):
        ## Just fill the gap, as the cache object doesn't provide
        ## a method to list cache keys..
        return []
And then replace this:
projects_map = cache.get('projects_map')
projects_map.get('search term')
with:
projects_map = LazyCachedDict('projects_map')
projects_map.get('search term')

I don't know how Windows works, but on Linux a process can rarely return memory to the system. That's because a process's heap is a contiguous region of its address space, grown with the brk() system call, which simply moves the pointer marking the last address available to the process.
All the allocators applications use (malloc etc.) are implemented in user space as a library. They operate at the level of byte blocks and use brk() only to grow their internal memory pool. In a running application this pool becomes cluttered with allocated blocks; memory can only be returned to the system when the tail end of the pool has no blocks in use, which is very unlikely to amount to anything significant, because even simple applications allocate and deallocate thousands of objects. (Large allocations can be satisfied with mmap() instead and returned individually, but memory fragmented by many small objects usually cannot be handed back.)
So the bloat caused by a memory spike will stay until the process exits. Solutions:
avoid the spike by optimizing memory usage, even if caused by temporary objects (eg: process a file line by line instead of reading whole contents at once)
put the cache in another process (memcached, as suggested in the first answer)
use a serialized dictionary (gdbm) or some other storage detached from the process's private memory (mmap, shared memory) - see the sketch below
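As a minimal sketch of the gdbm option (assuming the values can be pickled; the file name is made up), the standard-library dbm module keeps the data on disk and reads only the entry you ask for:
import dbm      # typically backed by gdbm on Linux
import pickle

# Build the on-disk table once, e.g. whenever the source data changes.
with dbm.open('projects_map_db', 'c') as db:
    db['some key'] = pickle.dumps({'name': 'some project'})

# In the view: a single keyed read; everything else stays on disk.
with dbm.open('projects_map_db', 'r') as db:
    raw = db.get('some key')
    value = pickle.loads(raw) if raw is not None else None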

If a get on a particular key is the only operation you perform, why not keep every key as a separate entry in the cache? That way each entry ends up in its own slot (a separate file, with the filesystem backend) and Django will be able to access it quickly.
The update will be much more painful of course, but you can abstract it nicely. The first thing I can think of is some cache key prefix.
The code could look like cache.get('{prefix}search_term') then.
Edit:
We're trying to solve the wrong problem here. You don't need caching: the data gets updated, not dumped (after 5 minutes or so).
You need to create a database table with all your entries.
If you don't have access to any database server from your setup, try SQLite. It's file-based and should serve your purpose well.

Related

Python script doesn't terminate until a long time after it's finished

I have a weird problem.
I'm loading a huge file (3.5GB), making a dictionary out of it, and doing some processing.
After everything is finished, my script doesn't terminate immediately; it terminates only after some time.
I think it might be due to memory freeing; what other reasons could there be? I'd appreciate any opinion. Also, how can I make my script run faster?
Here's the corresponding code:
from collections import defaultdict
import codecs

class file_processor:
    def __init__(self):
        self.huge_file_dict = self.upload_huge_file()

    def upload_huge_file(self):
        d = defaultdict(list)
        f = codecs.open('huge_file', 'r', encoding='utf-8').readlines()
        for line in f:
            l = line.strip()
            x, y, z, rb, t = l.split()
            d[rb].append((x, y, z, t))
        return d

    def do_some_processing(self, word):
        if word in self.huge_file_dict:
            # do something with self.huge_file_dict[word]
            pass
My guess is that your horrible slowdown, which doesn't recover until after your program is finished, is caused by using more memory than you actually have, which causes your OS to start swapping VM pages in and out to disk. Once you get enough swapping happening, you end up in "swap hell", where a large percentage of your memory accesses involve a disk read and even a disk write, which takes orders of magnitude more time, and your system won't recover until a few seconds after you finally free up all that memory.
The obvious solution is to not use so much memory.
tzaman's answer, avoiding readlines(), will eliminate some of that memory. A giant list of all the lines in a 3.5GB file has to take at least 3.5GB on Python 3.4 or 2.7 (but realistically at least 20% more than that) and maybe 2x or 4x on 3.0-3.3.
But the dict is going to be even bigger than the list, and you need that, right?
Well, no, you probably don't. Keeping the dict on-disk and fetching the values as-needed may sound slow, but it may still be a lot faster than keeping it in virtual memory, if that virtual memory has to keep swapping back and forth to disk.
You may want to consider using a simple dbm, or a more powerful key-value database (google "NoSQL key value" for some options), or a sqlite3 database, or even a server-based SQL database like MySQL.
Alternatively, if you can keep everything in memory, but in a more compact form, that's the best of both worlds.
I notice that in your example code, the only thing you're doing with the dict is checking word in self.huge_file_dict. If that's true, then you can use a set instead of a dict and not keep all those values around in memory. That should cut your memory use by about 80%.
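As a rough sketch of that idea (assuming the same whitespace-separated five-column layout as your loader, with rb as the lookup key), membership tests only need the keys:
import codecs

def load_keys(path='huge_file'):
    keys = set()
    with codecs.open(path, 'r', encoding='utf-8') as f:
        for line in f:
            x, y, z, rb, t = line.strip().split()
            keys.add(rb)     # keep only the key column, drop the values
    return keys

huge_file_keys = load_keys()
print('someword' in huge_file_keys)   # same check as `word in self.huge_file_dict`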
If you frequently need the keys, but occasionally need the values, you might want to consider a dict that just maps the keys to indices into something you can read off disk as needed (e.g., a file with fixed-length strings, which you can then mmap and slice).
Or you could stick the values in a Pandas frame, which will be a little more compact than native Python storage—maybe enough to make the difference—and use a dict mapping keys to indices.
Finally, you may be able to reduce the amount of swapping without actually reducing the amount of memory. Bisecting a giant sorted list, instead of accessing a giant dict, may—depending on the pattern of your words—give much better memory locality.
Don't call .readlines() -- that loads the entire file into memory beforehand. You can just iterate over f directly and it'll work fine.
with codecs.open('huge_file', 'r', encoding='utf-8') as f:
    for line in f:
        ...

About the speed of random file read (Python)

Please take a look at the following code (kind of pseudo code):
index = db.open()
fh = open('somefile.txt', 'rb')
for i in range(1000):
    x = random_integer(1, 5000)
    pos, length = index[x]
    fh.seek(pos)
    buffer = fh.read(length)
    doSomeThingWith(buffer)
fh.close()
db.close()
I used a database to index the positions and lengths of text segments in a .txt file for random retrieval.
Notably, if the above code is run repeatedly, the execution takes less and less time.
1) What is responsible for this speed-up? Is it because things stay in memory, or because of some kind of caching?
2) Is there any way to control it?
3) I've compared with other methods where the text segments are stored in Berkeley DB and so on. When at its fastest, the above code is faster than retrieval from Berkeley DB. How do I judge the performance of my database+file solution? I mean, is it safe to judge it as at least "fast enough"?
What is responsible for this speed-up?
It could be the operating system's disk cache. http://en.wikipedia.org/wiki/Page_cache
Once you've read a chunk of a file from disk once, it will hang around in RAM for a while. RAM is orders of magnitude faster than disk, so you'll see a lot of variability in the time it takes to read random pieces of a large file.
Or, depending on what "db" is, the database implementation could be doing its own caching.
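One quick way to observe the effect (a hedged sketch; the file name, offset range, and read size are placeholders) is to time the same random-read loop twice and compare the cold-ish first pass with the warm second pass:
import random
import time

def timed_reads(path, n=1000, max_pos=5000000, length=4096):
    # Read n random chunks and return the elapsed wall-clock time.
    with open(path, 'rb') as fh:
        start = time.perf_counter()
        for _ in range(n):
            fh.seek(random.randint(0, max_pos))
            fh.read(length)
    return time.perf_counter() - start

print('first pass :', timed_reads('somefile.txt'))
print('second pass:', timed_reads('somefile.txt'))   # usually faster: the pages are now cached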
Is there any way to control it?
If it's the disk cache:
It depends on the operating system, but it's typically a pretty coarse-grained control; for example, you may be forced to disable caching for an entire volume, which would affect other processes on the system reading from that volume, and would affect every other file that lived on that volume. It would also probably require root/admin access.
See this similar question about disabling caching on Linux: Linux : Disabling File cache for a process?
Depending on what you're trying to do, you can force-flush the disk cache. This can be useful in situations where you want to run a test with a cold cache, letting you get an idea of the worst-case performance. (This also depends on your OS and may require root/admin access.)
If it's the database:
Depends on the database. If it's a local database, you may just be seeing disk cache effects, or the database library could be doing its own caching. If you're talking to a remote database, the caching could be happening locally or remotely (or both).
There may be configuration options to disable or control caching at either of these layers.

Very poor weakref performance in Python/SQL Alchemy

I've spent the day trying to debug a memory problem in my Python script. I'm using SQL Alchemy as my ORM. There are several confounding issues here, and I'm hoping that if I list them all out, somebody will be able to point me in the right direction.
In order to achieve the performance I'm looking for, I read in all the records in a table (~400k), then loop through a spreadsheet, match the records I've previously read in, then create new records (~800k) into another table. Here's roughly what the code looks like:
dimensionMap = {}
for d in connection.session.query(Dimension):
    dimensionMap[d.businessKey] = d.primarySyntheticKey
# len(dimensionMap) == ~400k, sys.getsizeof(dimensionMap) == ~4MB

allfacts = []
sheet = open_spreadsheet(path)
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]
    fact = Fact(dimensionId, metric)
    connection.session.add(fact)
    allfacts.append(fact)
    if row.number % 20000 == 0:
        connection.session.flush()
# len(allfacts) == ~800k, sys.getsizeof(allfacts) == ~50MB

connection.session.commit()
sys.stdout.write('All Done')
400k and 800k don't seem like especially big numbers to me, but I'm nonetheless running into memory problems on a machine with 4GB of memory. This is really strange to me, as I ran sys.getsizeof on my two biggest collections, and they were both well under any size that would cause problems.
While trying to figure this out, I noticed that the script was running really, really slowly. So I ran a profile on it, hoping the results would lead me in the direction of the memory problem, and came up with two confounding issues.
First, 87% of the program time is spent in the commit, specifically on this line of code:
self.transaction._new[state] = True
This can be found in session.py:1367. self.transaction._new is an instance of weakref.WeakKeyDictionary(). Why is weakref:261:__setitem__ taking up so much time?
Second, even when the program is done ('All Done' has been printed to stdout), the script continues, seemingly forever, with 2.2GB of memory used.
I've done some searching on weakrefs, but haven't seen anybody mention the performance issues I'm facing. Ultimately, there isn't a whole lot I can do about this, given it's buried deep in SQL Alchemy, but I'd still appreciate any ideas.
Key Learnings
As mentioned by @zzzeek, there's a lot of overhead required to maintain persistent objects. Here's a little graph to show the growth.
The trendline suggests that each persistent instance takes about 2KB of memory overhead, even though the instance itself is only 30 bytes. This actually brings me another thing I learned, which is to take sys.getsizeof with a huge grain of salt.
This function only returns the shallow size of an object, and doesn't take into account any other objects that need to be there for the first object to make sense (__dict__, for example). You really need to use something like Heapy to get a good understanding of the actual memory footprint of an instance.
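For example (a toy illustration, not the actual objects from this question), the shallow size reported for a container ignores everything it references:
import sys

d = {i: list(range(10)) for i in range(1000)}
print(sys.getsizeof(d))   # shallow: only the dict's own hash-table structure
print(sys.getsizeof(d) + sum(sys.getsizeof(v) for v in d.values()))
# ...and even that sum still ignores the integers inside each list.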
The last thing I learned is that, when Python is on the verge of running out of memory, and is thrashing like crazy, weird stuff happens that shouldn't be taken as part of the problem. In my case, the massive slow-down, the profile pointing to the weakref creation, and the hangup after the program completed, are all effects of the memory issue. Once I stopped creating and keeping around persistent instances, and instead just kept around the objects' properties that I needed, all the other issues went away.
800K ORM objects is very large. These are Python objects, each of which has a __dict__ as well as an _sa_instance_state attribute which is itself an object, which then has weakrefs and other things inside of it; on top of that, the Session holds more than one weakref to your object. An ORM object is identity tracked, a feature which provides a high degree of automation in persistence, but at the cost of lots more memory and function call overhead.
As far as why your profiling is all focused on that one weakref line, that seems very strange; I'd be curious to see the actual profile result there (see How can I profile a SQLAlchemy powered application? for background).
Your code example can be modified to not use any ORM identity-mapped objects as follows.
For more detail on bulk inserts, see Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?.
# 1. only load individual columns - loading simple tuples instead
# of full ORM objects with identity tracking. these tuples can be
# used directly in a dict comprehension
dimensionMap = dict(
    connection.session.query(Dimension.businessKey, Dimension.primarySyntheticKey)
)

# 2. For bulk inserts, use a Table.insert() call with
# multiparams, in chunks
buf = []
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]
    buf.append({"dimensionId": dimensionId, "metric": metric})
    if len(buf) == 20000:
        connection.session.execute(Fact.__table__.insert(), params=buf)
        buf[:] = []

connection.session.execute(Fact.__table__.insert(), params=buf)
sys.stdout.write('All Done')
sys.stdout.write('All Done')

*large* python dictionary with persistent storage for quick look-ups

I have 400 million lines of unique key-value info that I would like to be available for quick look-ups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I'm not sure if there is a way to disk-map the dictionary without using a lot of memory except during dictionary creation.
Pickled dictionary object: not sure if this is an optimal solution for my problem
NoSQL-type databases: ideally I want something with minimal dependency on third-party stuff, plus the key-values are simply numbers. If you feel this is still the best option, I would like to hear that too; maybe it will convince me.
Please let me know if anything is not clear.
Thanks!
-Abhi
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built in support for sqlite3, which gives you an easy database solution backed by a file on disk.
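A minimal sketch of that approach, assuming the keys and values really are plain numbers as described (the file and table names are made up):
import sqlite3

conn = sqlite3.connect('kv.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)')

# Bulk-load once; streaming rows in keeps memory flat during creation.
conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)',
                 [(1, 100), (2, 200), (3, 300)])
conn.commit()

# Later: keyed lookups go through the primary-key index; nothing is held in RAM.
row = conn.execute('SELECT v FROM kv WHERE k = ?', (2,)).fetchone()
print(row[0] if row else None)
conn.close()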
No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.
From the docs https://docs.python.org/3/library/dbm.html
import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError), so this line is left commented out:
    # db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.
I would try this before any of the more exotic forms; loading a single pickled dict pulls everything into memory at once, whereas dbm only reads the entries you ask for.
Cheers
Tim
In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge so you must do some testing, but shelve/BDB is probably up to it.
Note: The bsddb module has been deprecated, so shelve may not support BDB hashes in the future.
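A minimal sketch of the shelve approach (the file name is arbitrary; keys are stored as strings because shelve requires string keys, while values can be any picklable object):
import shelve

# Write once; entries go straight to the backing db file.
with shelve.open('kv_store') as db:
    db['12345'] = 67890

# Read back later without loading the whole store into memory.
with shelve.open('kv_store') as db:
    print(db.get('12345'))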
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
Install redis-server
Start redis server
Install the redis Python package (pip install redis)
Profit.
import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify insertion script to your needs.
import redis

ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do look-ups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis.
Configurable persistence options
Blazingly fast
Offers more than just key / value pairs (other data types)
@antirez
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
I personally use LMDB and its Python binding for a DB of a few million records.
It is extremely fast, even for a database larger than the RAM.
It's embedded in the process, so no server is needed.
Dependencies are managed using pip.
The only downside is that you have to specify the maximum size of the DB. LMDB is going to mmap a file of this size. If it's too small, inserting new data will raise an error; too large, and you create a sparse file.
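A minimal sketch with the lmdb binding (the path and map_size are placeholders; keys and values must be bytes):
import lmdb

# map_size is the maximum DB size; LMDB mmaps a file of this size.
env = lmdb.open('kv.lmdb', map_size=1024 * 1024 * 1024)   # 1 GB

with env.begin(write=True) as txn:   # group writes in one transaction
    txn.put(b'12345', b'67890')

with env.begin() as txn:             # read-only lookups, no server process involved
    print(txn.get(b'12345'))

env.close()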

Data persistence for python when a lot of lookups but few writes?

I am working on a project that basically monitors a set of remote directories (FTP, networked paths, and another); if a file is considered new and meets the criteria, we download it and process it. However, I am stuck on what the best way is to keep track of the files we have already downloaded. I don't want to download any duplicate files, so I need to keep track of what is already downloaded.
Originally I was storing it as a tree:
server->directory->file_name
When the service shuts down, it writes the tree to a file and reads it back in when it starts up. However, once there are around 20,000 or so files in the tree, things start to slow down a lot.
Is there a better way to do this?
EDIT
The lookup times start to slow down a lot; my basic implementation is a dict of dicts. Storing the structure on disk is fine, it's more or less just the lookup time. I know I can optimize the tree and partition it, but that seems excessive for such a small project; I was hoping Python would have something built in for this.
I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multiple-level data structure. The tuples will hash into the set, and give you a O(1) lookup.
You mention that "things start to slow down a lot," but you don't say whether it's the reading and writing time or the lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?
One way to get back some memory is to intern() the server names. This way, each server name will be stored only once in memory.
An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.
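A rough sketch of the pickled-set idea described above (the state-file path and the sample entry are hypothetical):
import pickle

STATE_FILE = 'downloaded.pickle'   # hypothetical state file

def load_seen():
    try:
        with open(STATE_FILE, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return set()

def save_seen(seen):
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(seen, f)

seen = load_seen()
entry = ('ftp.example.com', '/incoming/report.csv')   # (server, full path)
if entry not in seen:                                  # O(1) membership test
    # ... download and process the file ...
    seen.add(entry)
save_seen(seen)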
