Variable access in gunicorn with multiple workers - python

Is it possible to have multiple workers running with Gunicorn and have them accessing some global variable in an ordered manner i.e. without running into problems with race conditions?

Assuming that by global variable you mean something held by another process, in memory or on disk, then yes, I think so. I haven't checked the Gunicorn source code, but I ran into a related problem with some old code: several users retrieved the same key from a legacy MyISAM table, incremented it, and used it to create a new record, each assuming the key was unique. The result was that occasionally (under very heavy traffic) only one record was created, with the newest one overwriting the older ones because they all used the same incremented key. The problem was never observed during a hardware upgrade, when I had reduced the website's Gunicorn workers to one, which is what led me to explore this probable cause in the first place.
Reducing the number of workers usually degrades performance, so it is better to handle these issues with transactions (if you are using an ACID-compliant RDBMS, unlike MyISAM). The same issue should be present in Redis and similar stores.
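As a rough illustration of the transaction approach, here is a minimal sketch that hands out unique keys to any number of worker processes. It uses the standard-library sqlite3 module purely as a stand-in for a transactional RDBMS; the function name next_key, the counter table, and the "main" row are all made up for this example.

```python
import sqlite3


def next_key(db_path):
    """Atomically increment a counter stored in SQLite and return the new value.

    Hypothetical helper: the table and row names are illustrative only.
    """
    # isolation_level=None lets us issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        # BEGIN IMMEDIATE takes the write lock *before* reading, so two
        # workers can never read the same value and both increment it.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counter ("
            "name TEXT PRIMARY KEY, value INTEGER NOT NULL)"
        )
        conn.execute("INSERT OR IGNORE INTO counter (name, value) VALUES ('main', 0)")
        conn.execute("UPDATE counter SET value = value + 1 WHERE name = 'main'")
        (value,) = conn.execute(
            "SELECT value FROM counter WHERE name = 'main'"
        ).fetchone()
        conn.execute("COMMIT")
        return value
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()
```

Each worker would call next_key() with the same file path. The read and the increment happen inside one transaction, which is exactly what the MyISAM code described above was missing.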
Also, this shouldn't be a problem with files and sockets since, to my knowledge, the operating system will block other processes (even children) from accessing an open file.

Related

postgres database: When does a job get killed

I am using a postgres database with sql-alchemy and flask. I have a couple of jobs which have to run through the entire database to update entries. When I do this on my local machine I get very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can get from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I have too many entries the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (sqlalchemy, postgres)?
2. Since this does seem to behave differently on my local machine there must be a way to control this. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries which connect to PostgreSQL will read the entire result set into memory, by default. But some libraries have a way to tell it to process the results row by row, so they aren't all read into memory at once. I don't know if flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
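To make the "row by row rather than all at once" idea concrete, here is a small sketch built on the generic DB-API fetchmany() call. It uses the standard-library sqlite3 module only so that it runs anywhere; the entries table and both function names are invented for the example. With PostgreSQL the driver may still buffer the whole result on the client unless you also ask for a server-side cursor (psycopg2 calls these named cursors), so treat this as the shape of the loop, not a guaranteed fix.

```python
import sqlite3


def iter_in_batches(cursor, batch_size=2000):
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def total_of_values(conn, batch_size=2000):
    """Sum a column while holding only one batch of rows in memory at a time."""
    cursor = conn.execute("SELECT value FROM entries")  # hypothetical table
    total = 0
    for batch in iter_in_batches(cursor, batch_size):
        total += sum(value for (value,) in batch)
    return total
```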
Most likely the kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000-entry batches in a loop in one Python process. Python does not release all used memory, so the memory usage grows until the process gets killed. (You can watch this with the top command.)
You should try adapting your script to process 2000 records in a step and then quit. If you run it multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in a separate worker. Run the jobs serially and let them die when they finish. This way they will release the memory back to the OS when they exit.
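A minimal sketch of the "one short-lived process per batch" idea follows. Note that it swaps in the standard-library subprocess module for multiprocessing, simply because it is easier to show in a self-contained way; the principle is the same. The batch script here is a placeholder that only adds up numbers, where a real job would load, update and commit the rows in its range.

```python
import subprocess
import sys

# Placeholder batch job (an assumption for this sketch): a real one would
# process rows [offset, offset + limit) and commit before exiting.
BATCH_SCRIPT = """
import sys
offset, limit = int(sys.argv[1]), int(sys.argv[2])
print(sum(range(offset, offset + limit)))
"""


def run_batches(total, batch_size=2000):
    """Run each batch in its own Python process, one after another."""
    outputs = []
    for offset in range(0, total, batch_size):
        limit = min(batch_size, total - offset)
        done = subprocess.run(
            [sys.executable, "-c", BATCH_SCRIPT, str(offset), str(limit)],
            capture_output=True,
            text=True,
            check=True,  # raise if a batch exits with an error
        )
        # The child has exited here, so all of its memory is back with the OS.
        outputs.append(int(done.stdout))
    return outputs
```

Because every batch starts from a fresh interpreter, memory use cannot accumulate from one batch to the next.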

Can concurrent processes write to a shared database?

I'm using the Python multiprocessing library to generate several processes that each write to a shared (MongoDB) database. Is this safe, or will the writes overwrite each other?
So long as you make sure to create a separate database connection for each worker process, it's perfectly safe to have multiple processes accessing a database at the same time. Any queries they issue which make changes to the database will be applied individually, typically in the order they are received by the database. Under most situations this will be safe, but:
If your processes are all just inserting documents into the database, each insert will typically create a separate object.
The exception is if you explicitly specify an _id for a document, and that identifier has already been used within the collection. This will cause the insert to fail. (So don't do that: leave the _id out, and MongoDB will always generate a unique value for you.)
If your processes are deleting documents from the database, the operation will fail if another process has already deleted the same object. (This is not strictly a failure, though; it just means that someone else got there before you.)
If your processes are updating documents in the database, things get murkier.
So long as each process is updating a different document, you're fine.
If multiple processes are trying to update the same document at the same time, you start needing to be careful. Updates which replace values on an object will be applied in order, which may cause changes made by one process to inadvertently be overwritten by another. You should be careful to avoid specifying fields that you don't intend to change. Using MongoDB's update operators may be helpful to perform complex operations atomically, such as changing the numeric values of fields.
Note that "at the same time" doesn't necessarily mean that operations are occurring at exactly the same time. It means more generally that there's an "overlap" in the time two processes are working with the same document, e.g.
Process A                      Process B
---------                      ---------
Reads object from DB           ...
working...                     Reads object from DB
working...                     working...
updates object with changes    working...
                               updates object with changes
In the above situation, it's possible for some of the changes made by process A to inadvertently be overwritten by process B.
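The timeline above can be replayed deterministically in plain Python. This is only a simulation: a dict stands in for the collection, and the helper names are invented. It contrasts whole-document replacement with field-level changes, which is roughly what MongoDB's update operators such as $set and $inc give you on the server side.

```python
import copy


def new_store():
    """A dict standing in for a collection that holds a single document."""
    return {"doc": {"views": 10, "title": "old"}}


def lost_update():
    """Both processes read, then each writes the whole document back."""
    store = new_store()
    doc_a = copy.deepcopy(store["doc"])  # Process A reads
    doc_b = copy.deepcopy(store["doc"])  # Process B reads the same state
    doc_a["views"] += 1
    store["doc"] = doc_a                 # A replaces the document
    doc_b["title"] = "new"
    store["doc"] = doc_b                 # B replaces it again: A's change is gone
    return store["doc"]


def field_level_update():
    """Each process changes only the field it cares about, in place."""
    store = new_store()
    store["doc"]["views"] += 1           # A: comparable to {"$inc": {"views": 1}}
    store["doc"]["title"] = "new"        # B: comparable to {"$set": {"title": "new"}}
    return store["doc"]
```

In lost_update() the final document still has views equal to 10, because B wrote back a stale copy. In field_level_update() both changes survive.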
In short, yes it is perfectly reasonable (and actually preferred) to let your database worry about the concurrency of your database operations.
Any relevant database driver (MongoDB included) will handle concurrent operations for you automatically.

Ubuntu, Django: run management commands much faster (I tried renice, setting priority -18 on the Python process PID)

I am using Ubuntu. I have some management commands which, when run, do lots of database manipulation, so they take nearly 15 minutes.
My system monitor shows that my system has 4 CPUs and 6 GB of RAM, but this process is not utilising all the CPUs. I think it is using only one of the CPUs, and very little RAM as well. I think that if I could make it use all the CPUs and most of the RAM, the process would complete in much less time.
I tried renice, setting the priority to -18 (which means very high), but it is still slow.
Details:
It's a Python script with about ten loops of nearly 10,000 iterations each. Every iteration saves to the Postgres database.
If you are looking to make this application run across multiple CPUs, there are a number of things you can try, depending on your setup.
The most obvious thing that comes to mind is making the application use threads or multiple processes. This will allow the application to "do more" at once. The issue you might have here is concurrent database access, so you might need to use transactions (at which point you might lose the advantage of using multiple processes in the first place).
Secondly, make sure you are not opening and closing lots of database connections; ensure your application can hold the connection open for as long as it needs.
Thirdly, ensure the database is correctly indexed. If you are doing searches on large strings, things are going to be slow.
Fourthly, do everything you can in SQL, leaving little manipulation to Python. SQL is horrendously quick at data manipulation if you let it. As soon as you start taking data out of the database and into code, things are going to slow down big time.
Fifthly, make use of stored procedures, which can be cached and optimized internally within the database. These can be a lot quicker than application-built queries, which cannot be optimized as easily.
Sixthly, don't save on each iteration of the program. Try to produce a batch-style job whereby you alter a number of records and then save all of them in one batch. This will reduce the amount of I/O on each iteration and speed up the process massively.
Django does support a bulk update method, and there was also a question on Stack Overflow a while back about saving multiple Django objects at once:
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
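To show what "save in batches" looks like in practice, here is a small sketch using the standard-library sqlite3 module as a stand-in for Postgres. The items table and the function name are made up. In Django itself the closest counterpart would be bulk_create() on the model manager rather than raw SQL.

```python
import sqlite3


def save_in_batches(conn, rows, batch_size=1000):
    """Insert rows with one executemany() call and one commit per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, value TEXT)"
    )
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # "with conn" commits once when the block ends (or rolls back on error),
        # instead of paying for a commit on every single row.
        with conn:
            conn.executemany("INSERT INTO items (id, value) VALUES (?, ?)", batch)
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

With ten loops of roughly 10,000 saves each, this turns about 100,000 commits into about 100.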
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.

Python Daemon Process Memory Management

I'm currently writing a Python daemon process that monitors a log file in realtime and updates entries in a Postgresql database based on their results. The process only cares about a unique key that appears in the log file and the most recent value it's seen from that key.
I'm using a polling approach, and process a new batch every 10 seconds. In order to reduce the overall set of data and avoid extraneous updates to the database, I'm only storing the key and the most recent value in a dict. Depending on how much activity there has been in the last 10 seconds, this dict can vary from 10 to 1000 unique entries. Then the dict gets "processed" and those results are sent to the database.
My main concern revolves around memory management and the dict over time (days, weeks, etc.). Since this is a daemon process that's constantly running, memory usage bloats based on the size of the dict, but never shrinks appropriately. I've tried resetting the dict using a standard dereference, and the dict.clear() method after processing a batch, but noticed no changes in memory usage (FreeBSD/top). It seems that forcing a gc.collect() does recover some memory, but usually only around 50%.
Do you guys have any advice on how I should proceed? Is there something more I could be doing in my process? Feel free to chime in if you see a different road around the issue :)
When you clear() the dict or del the objects referenced by the dict, the contained objects are still around in memory. If they aren't referenced anywhere, they can be garbage-collected, as you have seen, but garbage-collection isn't run explicitly on a del or clear().
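A small experiment can show which objects actually wait for the collector. This sketch assumes CPython, where reference counting frees an object as soon as its last reference goes away, and only objects caught in reference cycles are left for gc.collect(). The Entry class is invented for the demonstration.

```python
import gc
import weakref


class Entry:
    """Hypothetical log entry; cyclic=True makes it reference itself."""

    def __init__(self, cyclic):
        self.me = self if cyclic else None


def survives_clear(cyclic):
    """Return (alive after clear(), alive after gc.collect()) for one entry."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector out of the experiment
    try:
        batch = {"key": Entry(cyclic)}
        ref = weakref.ref(batch["key"])  # lets us observe the object's lifetime
        batch.clear()
        alive_after_clear = ref() is not None
        gc.collect()
        alive_after_collect = ref() is not None
        return alive_after_clear, alive_after_collect
    finally:
        if was_enabled:
            gc.enable()
```

If the values in your dict are plain strings or numbers, clear() already releases them under this assumption, and memory that top still shows is more likely being held by the allocator than by live objects.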
I found this similar question for you: https://stackoverflow.com/questions/996437/memory-management-and-python-how-much-do-you-need-to-know. In short, if you aren't running low on memory, you really don't need to worry a lot about this. FreeBSD itself does a good job handling virtual memory, so even if you have a huge amount of stale objects in your Python program, your machine probably won't be swapping to the disk.

SQLite3 and Multiprocessing

I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
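Following the quoted advice, a minimal sketch of "one connection per thread" might look like this. The results table and the function names are invented for the example; the timeout argument tells sqlite3 to wait for a competing writer to finish instead of failing straight away.

```python
import sqlite3
import threading


def insert_rows(db_path, worker_id, count):
    """Each thread opens, uses and closes its own connection."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        for item in range(count):
            with conn:  # one short transaction per insert
                conn.execute(
                    "INSERT INTO results (worker, item) VALUES (?, ?)",
                    (worker_id, item),
                )
    finally:
        conn.close()


def run_workers(db_path, workers=4, rows_each=25):
    """Start the workers, wait for them, and return the number of rows stored."""
    setup = sqlite3.connect(db_path)
    with setup:
        setup.execute(
            "CREATE TABLE IF NOT EXISTS results (worker INTEGER, item INTEGER)"
        )
    setup.close()

    threads = [
        threading.Thread(target=insert_rows, args=(db_path, w, rows_each))
        for w in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    check = sqlite3.connect(db_path)
    try:
        return check.execute("SELECT COUNT(*) FROM results").fetchone()[0]
    finally:
        check.close()
```

No connection is ever shared, so check_same_thread=False is not needed here.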
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions. In other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
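The "check that a field is equal to X, and only then UPDATE" step can often be folded into a single statement, which keeps each transaction independent. A minimal sketch, assuming a made-up jobs table with id and status columns:

```python
import sqlite3


def update_if_equal(conn: sqlite3.Connection, row_id, expected, new_value):
    """Set status to new_value only if it currently equals expected.

    Returns True if the row was changed, False if another writer got there first.
    """
    with conn:  # commit on success, roll back on error
        cursor = conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = ?",
            (new_value, row_id, expected),
        )
    # rowcount tells us whether the WHERE clause matched the row.
    return cursor.rowcount == 1
```

Because the check and the write happen in one statement, there is no window between the SELECT and the UPDATE for another process to slip into.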
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
