I'm currently writing a Python daemon process that monitors a log file in real time and updates entries in a PostgreSQL database based on what it reads. The process only cares about a unique key that appears in the log file and the most recent value it has seen for that key.
I'm using a polling approach and process a new batch every 10 seconds. To reduce the data set and avoid extraneous updates to the database, I only store each key and its most recent value in a dict. Depending on how much activity there has been in the last 10 seconds, this dict can vary from 10 to 1000 unique entries. Then the dict gets "processed" and the results are sent to the database.
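A minimal sketch of the approach described above (the names `tail_new_lines` and `process_batch` are placeholders, not the actual daemon code):

```python
import time

def run_daemon(tail_new_lines, process_batch, interval=10):
    """Poll the log, keep only the newest value per key, flush every `interval` seconds."""
    while True:
        latest = {}                              # key -> most recent value seen this cycle
        deadline = time.time() + interval
        while time.time() < deadline:
            for key, value in tail_new_lines():  # assumed to yield (key, value) for new lines
                latest[key] = value              # later values simply overwrite earlier ones
            time.sleep(0.5)
        if latest:
            process_batch(latest)                # push the de-duplicated batch to PostgreSQL
```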
My main concern revolves around memory management and the dict over time (days, weeks, etc.). Since this is a daemon process that's constantly running, memory usage bloats with the size of the dict but never shrinks appropriately. I've tried resetting the dict with a standard re-assignment and with the dict.clear() method after processing a batch, but noticed no change in memory usage (FreeBSD/top). Forcing a gc.collect() does recover some memory, but usually only around 50%.
Do you guys have any advice on how I should proceed? Is there something more I could be doing in my process? Feel free to chime in if you see a different road around the issue :)
When you clear() the dict or del the objects referenced by the dict, the contained objects are still around in memory. If they aren't referenced anywhere else, they can be garbage-collected, as you have seen, but a collection pass isn't run automatically on a del or clear().
I found this similar question for you: https://stackoverflow.com/questions/996437/memory-management-and-python-how-much-do-you-need-to-know. In short, if you aren't running low on memory, you really don't need to worry much about this. FreeBSD itself does a good job handling virtual memory, so even if you have a huge number of stale objects in your Python program, your machine probably won't be swapping to disk.
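A minimal sketch of the clear-then-collect pattern being discussed, assuming the batch dict is called `batch`; as noted above, how much resident memory this actually returns to the OS depends on the allocator:

```python
import gc

def flush(batch, process_batch):
    process_batch(batch)   # send the de-duplicated entries to the database
    batch.clear()          # drop the dict's references to its values
    freed = gc.collect()   # explicit collection pass; returns the number of unreachable objects found
    return freed
```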
Related
I have a very long list of objects that I would like to load from the db into memory only once (meaning not for each session). This list WILL change its values and grow over time from user input. The reason I need it in memory is that I am running some complex searches on it and want to return a quick answer.
My question is: how do I load the list when the server starts and keep it alive across sessions, letting them all READ/WRITE to it?
Would it be better to do a heavy SQL search instead of keeping the list alive in my server?
The answer is that this is a bad idea; you are opening a Pandora's box, especially since you need write access as well. However, all is not lost. You can quite easily use Redis for this task.
Redis is a persistent data store, but at the same time everything is held in memory. If the Redis server runs on the same machine as the web server, access is almost instantaneous.
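A hedged sketch with the redis-py client; the key and field names here are made up for illustration:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Every worker process / session talks to the same Redis server, so they all
# see the same data without keeping a copy inside each Python process.
r.hset("shared:index", "item:42", "some value")   # write one field of a hash
value = r.hget("shared:index", "item:42")         # read it back (bytes, or None if missing)
r.hincrby("shared:counters", "item:42", 1)        # atomic increment, safe across processes
```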
Some observations on Heroku that don't completely mesh with my mental model.
My understanding is that CPython will never release memory once it has been allocated by the OS. So we should never observe a decrease in resident memory of CPython processes. And this is in fact my observation from occasionally profiling my Django application on Heroku; sometimes the resident memory will increase, but it will never decrease.
However, sometimes Heroku will alert me that my worker dyno is using >100% of its memory quota. This generally happens when a long-running response-data-heavy HTTPS request that I make to an external service (using the requests library) fails due to a server-side timeout. In this case, memory usage will spike way past 100%, then gradually drop back to less than 100% of quota, when the alarm ceases.
My question is, how is this memory released back to the OS? AFAIK it can't be CPython releasing it. My guess is that the incoming bytes from the long-running TCP connection are being buffered by the OS, which has the power to de-allocate. It's murky to me when exactly "ownership" of TCP bytes is transferred to my Django app. I'm certainly not explicitly reading lines from the input stream, I delegate all of that to requests.
Apparently, at one time, CPython did NOT ever release memory back to the OS. Then a patch was introduced in Python 2.5 that allowed memory to be released under certain circumstances, detailed here. Therefore it's no longer true to say that Python doesn't release memory; it's just that Python doesn't often release memory, because it doesn't handle memory fragmentation very well.
At a high level, Python keeps track of its memory in 256K blocks called arenas. Object pools are held in these arenas. Python is smart enough to free arenas back to the OS when they're empty, but it still doesn't handle fragmentation across the arenas very well.
In my particular circumstance, I was reading large HTTP responses. If you dig down the call chain starting with HTTPAdapter.send() in the requests library, you'll eventually find that socket.read() in the Python socket library is making a system call to receive from its socket in chunks of 8192 bytes (the default buffer size). This is the point at which the OS copies bytes from the kernel to the process, where CPython turns them into string objects of size 8K and shoves them into an arena. Note that StringIO, which is the Python-land buffer object for sockets, simply keeps a list of these 8K strings rather than mushing them together into a super-string object.
Since 8K fits precisely 32 times into 256K, I think what is happening is that the received bytes are nicely filling up entire arenas without much fragmentation. These arenas can then be freed to the OS when the 8K strings filling them are deleted.
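A rough sketch of the read loop described above, written against the plain socket API; this is my own simplification, not the actual requests/httplib internals:

```python
import socket

def read_all(sock, bufsize=8192):
    """Collect a response as a list of ~8K chunks, roughly what the stdlib buffer does."""
    chunks = []
    while True:
        data = sock.recv(bufsize)   # each recv() returns at most 8192 bytes
        if not data:                # empty result means the peer closed the connection
            break
        chunks.append(data)         # kept as separate 8K objects rather than joined eagerly
    return b"".join(chunks)         # any joining happens only at the very end
```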
I think I understand why the memory is released gradually (asynchronous garbage collection?), but I still don't understand why it takes so long to release after a connection error. If the memory release always took so long, I should be seeing these memory usage errors all the time, because my python memory usage should spike whenever one of these calls are made. I've checked my logs, and I can sometimes see these violations last for minutes. Seems like an insanely long interval for memory release.
Edit: I have a solid theory on the issue now. This error is being reported to me by a logging system that keeps a reference to the last traceback. The traceback maintains a reference to all the variables in the frames of the traceback, including the StringIO buffer, which in turn holds references to all the 8K-strings read from socket. See the note under sys.exc_clear(): This function is only needed in only a few obscure situations. These include logging and error handling systems that report information on the last or current exception.
Therefore, in exception cases, the 8K-string ref counts don't drop to zero and immediately empty their arenas as they would in the happy path; we have to wait for background garbage collection to detect their reference cycles.
The GC delay is compounded by the fact that when this exception occurs, lots of objects are allocated over 5 minutes until timeout, which I'm guessing is plenty of time for lots of the 8K-strings to make it into the 2nd generation. With default GC thresholds of (700, 10, 10), it would take roughly 700*10 allocations for string objects to make it into the 2nd generation. That comes out to 7000*8192 ~= 57MB, which means that all the strings received before the last 57MB of the bytestream make it into the 2nd gen, maybe even 3rd gen if 570MB is streamed (but that seems high).
Intervals on the order of minutes still seem awfully long for garbage collection of the 2nd generation, but I guess it's possible. Recall that GC isn't triggered only by allocations, the formula is actually trigger == (allocations - deallocations > threshold).
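You can inspect the thresholds and the per-generation counters directly; a quick check:

```python
import gc

print(gc.get_threshold())  # defaults to (700, 10, 10)
print(gc.get_count())      # e.g. (312, 4, 1); the gen-0 counter is allocations minus deallocations
# A gen-0 collection runs once that first counter exceeds 700; every 10th gen-0
# collection also examines gen 1, and every 10th gen-1 collection examines gen 2.
```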
TL;DR Large responses fill up socket buffers that fill up arenas without much fragmentation, allowing Python to actually release their memory back to the OS. In unexceptional cases, this memory will be released immediately upon exit of whatever context referenced the buffers, because the ref count on the buffers will drop to zero, triggering an immediate reclamation. In exceptional cases, as long as the traceback is alive, the buffers will still be referenced, therefore we will have to wait for garbage collection to reclaim them. If the exception occurred in the middle of a connection and a lot of data was already transmitted, then by the time of the exception many buffers will have been classified as members of an elder generation, and we will have to wait even longer for garbage collection to reclaim them.
CPython will release memory, but it's a bit murky.
CPython allocates chunks of memory at a time; let's call them fields.
When you instantiate an object, CPython will use blocks of memory from an existing field if possible; possible meaning there are enough contiguous blocks for said object.
If there aren't enough contiguous blocks, it'll allocate a new field.
Here's where it gets murky.
A field is only freed when it contains zero objects, and while there's garbage collection in CPython, there's no "trash compactor". So if you have a couple of objects spread across a few fields, and each field is only 70% full, CPython won't move those objects together and free up some fields.
It seems pretty reasonable that the large data chunk you're pulling from the HTTP call is getting allocated to "new" fields, but then something goes sideways, the object's reference count goes to zero, then garbage collection runs and returns those fields to the OS.
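If you want to see those fields (arenas) for yourself, newer CPython versions (3.3+) expose a low-level dump of the allocator's state; the output format is an implementation detail and may change between versions:

```python
import sys

# Prints pymalloc statistics to stderr: how many arenas are allocated, how many
# pools/blocks are in use, and how full they are -- handy for spotting fragmentation.
sys._debugmallocstats()
```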
Is it possible to have multiple workers running with Gunicorn and have them accessing some global variable in an ordered manner i.e. without running into problems with race conditions?
Assuming that by global variable you mean another process that keeps the data in memory or on disk, yes, I think so. I haven't checked the source code of Gunicorn, but I did hit a problem with an old piece of code: several users retrieved the same key from a legacy MyISAM table, incremented it, and created a new entry with it, assuming it was unique. The result was that occasionally (under very heavy traffic) only one record was created, the newest one overwriting the older ones, since they all used the same incremented key. The problem was never observed during a hardware upgrade, when I had reduced the site's Gunicorn workers to one; that is what led me to explore this probable cause in the first place.
Now, reducing the workers will usually degrade performance, and it is better to deal with these issues using transactions (if you are using an ACID RDBMS, which MyISAM is not). The same issue should be present in Redis and similar stores.
Also, this shouldn't be a problem with files and sockets, since, to my knowledge, the operating system will block other processes (even children) from accessing an open file.
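For the key-generation race described above, the usual fix is to let the database serialize the increment instead of doing read-increment-write in application code. A hedged sketch with psycopg2 and a hypothetical `counters` table:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # connection details are placeholders

def next_key(name):
    """Atomically bump and return a counter; concurrent workers serialize on the row lock."""
    with conn:                                   # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE counters SET value = value + 1 WHERE name = %s RETURNING value",
                (name,),
            )
            return cur.fetchone()[0]
```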
I am using Ubuntu. I have some management commands which, when run, do lots of database manipulation, so they take nearly 15 minutes.
My system monitor shows that my machine has 4 CPUs and 6 GB of RAM, but this process is not utilising all the CPUs. I think it is using only one of them, and very little RAM. If I could make it use all the CPUs and most of the RAM, I think the process would complete much faster.
I tried renice, setting the priority to -18 (i.e. very high), but it is still slow.
Details:
It's a Python script with a loop count of nearly 10,000, and nearly ten such loops. In every iteration, it saves to a Postgres database.
If you are looking to make this application run across multiple CPUs, then there are a number of things you can try depending on your setup.
The most obvious thing that comes to mind is making the application use threads or multiple processes. This will allow it to "do more" at once. The obvious issue here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiple processes in the first place).
Secondly, make sure you are not opening and closing lots of database connections, ensure your application can hold the connection open for as long as it needs.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings, then things are going to be slow.
Fourthly, do everything you can in SQL, leaving little manipulation to Python; SQL is extremely quick at data manipulation if you let it do the work. As soon as you start pulling data out of the database and into code, things slow down big time.
Fifthly, make use of stored procedures, which can be cached and optimized internally within the database. These can be a lot quicker than application-built queries, which cannot be optimized as easily.
Sixthly, don't save on each iteration of the program. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one batch. This reduces the amount of IO on each iteration and speeds up the process massively.
Django does support a bulk-update approach, and there was also a question on Stack Overflow a while back about saving multiple Django objects at once; a sketch follows the links below.
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
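A hedged sketch of the batch-then-save idea using Django's bulk_create; `LogEntry` is a hypothetical model and `rows` is whatever the loop iterates over:

```python
from myapp.models import LogEntry   # hypothetical model, for illustration only

BATCH_SIZE = 1000
pending = []

for row in rows:
    pending.append(LogEntry(key=row.key, value=row.value))
    if len(pending) >= BATCH_SIZE:
        LogEntry.objects.bulk_create(pending)   # one INSERT for the whole batch
        pending = []

if pending:
    LogEntry.objects.bulk_create(pending)       # flush whatever is left over
```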
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.
We have hundreds of thousands of tasks that need to be run at a variety of arbitrary intervals, some every hour, some every day, and so on. The tasks are resource intensive and need to be distributed across many machines.
Right now tasks are stored in a database with an "execute at this time" timestamp. To find tasks that need to be executed, we query the database for jobs that are due to be executed, then update the timestamps when the task is complete. Naturally this leads to a substantial write load on the database.
As far as I can tell, we are looking for something to release tasks into a queue at a set interval. (Workers could then request tasks from that queue.)
What is the best way to schedule recurring tasks at scale?
For what it's worth we're largely using Python, although we have no problems using components (RabbitMQ?) written in other languages.
UPDATE: Right now we have about 350,000 tasks that run every half hour or so, with some variation. 350,000 tasks * 48 times per day is 16,800,000 tasks executed per day.
UPDATE 2: There are no dependencies. The tasks do not have to be executed in order and do not rely on previous results.
Since ACID isn't needed and you're okay with tasks potentially running twice, I wouldn't keep the timestamps in the database at all. For each task, create a list of [timestamp_of_next_run, task_id] and use a min-heap to store all of the lists. Python's heapq module can maintain the heap for you. You'll be able to very efficiently pop off the task with the soonest timestamp. When you need to run a task, use its task_id to look up in the database what the task needs to do. When a task completes, update the timestamp and put it back into the heap. (Just be careful not to change an item that's currently in the heap, as that will break the heap properties).
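A minimal sketch of that heap-based scheduler; `run_task` and `interval_for` stand in for the database lookup and the per-task schedule:

```python
import heapq
import time

def scheduler(initial, run_task, interval_for):
    """initial: iterable of (next_run_timestamp, task_id) pairs."""
    heap = list(initial)
    heapq.heapify(heap)                   # min-heap keyed on the timestamp
    while heap:
        next_run, task_id = heap[0]       # peek at the soonest task
        now = time.time()
        if next_run > now:
            time.sleep(min(next_run - now, 1.0))
            continue
        heapq.heappop(heap)               # remove it only once we actually run it
        run_task(task_id)                 # the task's details come from the database
        heapq.heappush(heap, (time.time() + interval_for(task_id), task_id))
```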
Use the database only to store information that you will still care about after a crash and reboot. If you won't need the information after a reboot, don't spend the time writing to disk. You will still have a lot of database read operations to load the information about a task that needs to run, but a read is much cheaper than a write.
If you don't have enough RAM to store all of the tasks in memory at the same time, you could go with a hybrid setup where you keep the tasks for the next 24 hours (for example) in RAM and everything else stays in the database. Alternately, you could rewrite the code in C or C++, which are less memory hungry.
If you don't want a database, you could store just the next-run timestamp and task id in memory, and store the properties for each task in a file named [task_id].txt. You would need a data structure that keeps all the tasks sorted by timestamp in memory; an AVL tree seems like it would work, and here's a simple one for Python: http://bjourne.blogspot.com/2006/11/avl-tree-in-python.html. Hopefully Linux (I assume that's what you are running on) can handle millions of files in a directory; otherwise you might need to hash on the task id to get a subfolder.
Your master server would just need to run a loop, popping off tasks out of the AVL tree until the next task's timestamp is in the future. Then you could sleep for a few seconds and start checking again. Whenever a task runs, you would update the next run timestamp in the task file and re-insert it into the AVL tree.
When the master server reboots, there would be the overhead of reloading all the task ids and next-run timestamps back into memory, which might be painful with millions of files. Maybe you could instead keep one giant file, give each task 1K of space for its properties and next-run timestamp, and use [task_id] * 1K as the offset to the task's properties.
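A sketch of that fixed-offset layout, assuming every record is padded to exactly 1 KB:

```python
RECORD_SIZE = 1024   # 1K slot per task, as suggested above

def read_record(f, task_id):
    f.seek(task_id * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00")      # strip the padding

def write_record(f, task_id, payload):
    assert len(payload) <= RECORD_SIZE
    f.seek(task_id * RECORD_SIZE)
    f.write(payload.ljust(RECORD_SIZE, b"\x00"))    # pad out to the fixed-size slot

# usage: with open("tasks.dat", "r+b") as f: write_record(f, 42, b"next_run=1700000000")
```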
If you are willing to use a database, I am confident MySQL could handle whatever you throw at it given the conditions you describe, assuming you have 4GB+ RAM and several hard drives in RAID 0+1 on your master server.
Finally, if you really want to get complicated, Hadoop might work too: http://hadoop.apache.org/
If you're worried about writes, you can have a set of servers that dispatch the tasks (maybe striped to equalize load) and have each server write bulk checkpoints to the DB (that way you won't have so many write queries). You still have to write enough to recover if a scheduling server dies, of course.
In addition, if you don't have a clustered index on timestamp, you will avoid having a hot-spot at the end of the table.
"350,000 tasks * 48 times per day is 16,800,000 tasks executed per day."
To schedule the jobs, you don't need a database.
Databases are for things that are updated. The only update visible here is a change to the schedule to add, remove or reschedule a job.
Cron does this in a totally scalable fashion with a single flat file.
Read the entire flat file into memory, start spawning jobs. Periodically, check the fstat to see if the file changed. Or, even better, wait for a HUP signal and use that to reread the file. Use kill -HUP to signal the scheduler to reread the file.
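A rough sketch of the HUP-driven reread; the file path and line format are made up for the example:

```python
import signal

SCHEDULE_PATH = "/etc/myscheduler/jobs.txt"   # placeholder path

def load_schedule(path):
    """Parse one 'interval_seconds command' entry per line (format invented for this sketch)."""
    with open(path) as f:
        return [line.split(None, 1) for line in f if line.strip() and not line.startswith("#")]

jobs = load_schedule(SCHEDULE_PATH)           # read the whole flat file into memory at startup

def reread(signum, frame):
    global jobs
    jobs = load_schedule(SCHEDULE_PATH)       # triggered from the shell with: kill -HUP <pid>

signal.signal(signal.SIGHUP, reread)
```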
It's unclear what you're updating the database for.
If the database is used to determine the future schedule based on job completion, then a single database is a Very Bad Idea.
If you're using the database to do some analysis of job history, then you have a simple data warehouse.
Record completion information (start time, end time, exit status, all that stuff) in a simple flat log file.
Process the flat log files to create a fact table and dimension updates.
When someone has the urge to do some analysis, load relevant portions of the flat log files into a datamart so they can do queries and counts and averages and the like.
Do not directly record 17,000,000 rows per day into a relational database. No one wants all that data. They want summaries: counts and averages.
Why hundreds of thousands and not hundreds of millions? :evil:
I think you need Stackless Python (http://www.stackless.com/), created by the genius Christian Tismer.
Quoting:
"Stackless Python is an enhanced version of the Python programming language. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads. The microthreads that Stackless adds to Python are a cheap and lightweight convenience which can, if used properly, give the following benefits: improved program structure, more readable code, increased programmer productivity."
It is used for massively multiplayer games.