I'm using the Python multiprocessing library to generate several processes that each write to a shared (MongoDB) database. Is this safe, or will the writes overwrite each other?
So long as you make sure to create a separate database connection for each worker process, it's perfectly safe to have multiple processes accessing a database at the same time. Any queries they issue which make changes to the database will be applied individually, typically in the order they are received by the database. Under most situations this will be safe, but:
If your processes are all just inserting documents into the database, each insert will typically create a separate object.
The exception is if you explicitly specify an _id for a document, and that identifier has already been used within the collection. This will cause the insert to fail. (So don't do that: leave the _id out, and MongoDB will always generate a unique value for you.)
If your processes are deleting documents from the database, the operation will fail if another process has already deleted the same object. (This is not strictly a failure, though; it just means that someone else got there before you.)
If your processes are updating documents in the database, things get murkier.
So long as each process is updating a different document, you're fine.
If multiple processes are trying to update the same document at the same time, you start needing to be careful. Updates which replace values on an object will be applied in order, which may cause changes made by one process to inadvertently be overwritten by another. Be careful not to specify fields that you don't intend to change, and consider MongoDB's update operators, which let you perform operations such as changing a numeric field atomically (there's a sketch after the timing diagram below).
Note that "at the same time" doesn't necessarily mean that operations are occurring at exactly the same time. It means more generally that there's an "overlap" in the time two processes are working with the same document, e.g.
Process A                      Process B
---------                      ---------
Reads object from DB           ...
working...                     Reads object from DB
working...                     working...
updates object with changes    working...
                               updates object with changes
In the above situation, it's possible for some of the changes made by process A to inadvertently be overwritten by process B.
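To make the update-operator point concrete, here is a minimal PyMongo sketch (the database, collection, and field names are made up) contrasting the racy read-modify-write pattern from the diagram with an atomic operator:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
counters = client.mydb.counters

# Racy read-modify-write: processes A and B can both read count=5
# and both write count=6, losing one increment.
# (Assumes the counter document already exists.)
doc = counters.find_one({"_id": "page_views"})
counters.update_one({"_id": "page_views"},
                    {"$set": {"count": doc["count"] + 1}})

# Atomic operator: the server applies each $inc individually, so
# concurrent increments are never lost.
counters.update_one({"_id": "page_views"}, {"$inc": {"count": 1}})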
In short: yes, it is perfectly reasonable (and actually preferred) to let your database worry about the concurrency of your database operations.
Any production-quality database (MongoDB included) will handle concurrent operations for you automatically.
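And a minimal sketch of the "separate connection per worker process" advice, with PyMongo and multiprocessing (the URI, database, and collection names are illustrative); PyMongo's documentation warns against carrying a MongoClient across a fork, which is why each worker builds its own here:

from multiprocessing import Pool
from pymongo import MongoClient

client = None

def init_worker():
    global client
    # Created after the fork, so each worker process owns its connection.
    client = MongoClient("mongodb://localhost:27017")

def insert_doc(doc):
    client.mydb.documents.insert_one(doc)

if __name__ == "__main__":
    docs = [{"n": i} for i in range(100)]
    with Pool(processes=4, initializer=init_worker) as pool:
        pool.map(insert_doc, docs)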
Related
I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
<do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that connections represent just that… a connection to a database. Information you pass into it will come out pretty plain. It won't keep track of your transactions unless you tell it to, and it won't care about what order you send it data. So if it matters that you create your Widget object before you create your Sprocket object, then you better call that in a thread-safe context. Same generally goes for if you want to keep track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions; without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while you expect to be able to add another object (or several) in another.
In case you don't know what a transaction is, the short version (Wikipedia has the long one) is that transactions help keep your data consistent. If you have 15 inserts and updates, and insert 15 fails, you might not want to apply the other 14. A transaction lets you cancel the entire operation in bulk.
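A minimal sketch of what that looks like with a Core connection (the table and column names are made up, and database_uri is the one from the question):

from sqlalchemy import create_engine, text

engine = create_engine(database_uri)

with engine.connect() as connection:
    trans = connection.begin()
    try:
        connection.execute(text("INSERT INTO widgets (name) VALUES ('w1')"))
        connection.execute(text("INSERT INTO sprockets (name) VALUES ('s1')"))
        trans.commit()      # both rows become visible together
    except Exception:
        trans.rollback()    # neither row is kept if anything fails
        raise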
I'm using SQLAlchemy Core to run a few independent statements. The statements target separate tables and are unrelated. Because of that I can't use the standard table.insert() with multiple dictionaries of params passed in. Right now, I'm doing this:
sql_conn.execute(query1)
sql_conn.execute(query2)
Is there any way I can run these in one shot instead of needing two back-and-forths to the db? I'm on MySQL 5.7 and Python 2.7.11.
Sounds like you want a Transaction:
with engine.connect() as sql_conn:
with sql_conn.begin():
sql_conn.execute(query1)
sql_conn.execute(query2)
There is an implicit commit above (when using the context manager), which commits both statements to the database as a single transaction. If you want to do it explicitly, it's done like this:
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://scott:tiger@localhost/test")
connection = engine.connect()
trans = connection.begin()
connection.execute(text("insert into x (a, b) values (1, 2)"))
trans.commit()
https://docs.sqlalchemy.org/en/14/core/connections.html#basic-usage
While this is mostly geared towards creating real database transactions, it has a useful side effect for your use case, where it will maintain a "virtual" transaction in SQLAlchemy, see this link for more info:
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#session-level-vs-engine-level-transaction-control
The Session tracks the state of a single “virtual” transaction at a time, using an object called SessionTransaction. This object then makes use of the underlying Engine or engines to which the Session object is bound in order to start real connection-level transactions using the Connection object as needed.
This “virtual” transaction is created automatically when needed, or can alternatively be started using the Session.begin() method. To as great a degree as possible, Python context manager use is supported both at the level of creating Session objects as well as to maintain the scope of the SessionTransaction.
The above describes the ORM functionality but the link shows that it has parity with the Core functionality.
It is neither wise nor practical to run two queries at once. By that I mean a single call to the server containing two SQL statements one after another: "SELECT ...; SELECT ...;"
It is not wise because allowing such multi-statement calls gives hackers another way to do nasty things via SQL injection.
On the other hand, it is possible, but not necessarily practical. You would create a Stored Procedure that contains any number of related (or unrelated) queries in it, then CALL that procedure. There are some things that may make it impractical:
The only way to get data in is via a finite number of scalar arguments.
The output comes back as multiple resultsets; you need to code differently to see what happened.
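For what it's worth, calling such a procedure from Python looks roughly like this with mysql-connector-python (the procedure name is hypothetical; other drivers such as PyMySQL expose the extra result sets differently, e.g. via cursor.nextset()):

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="scott",
                               password="tiger", database="test")
cur = conn.cursor()
cur.callproc("fetch_two_things")        # hypothetical stored procedure
for result in cur.stored_results():     # one result set per SELECT inside it
    print(result.fetchall())
cur.close()
conn.close()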
Roundtrip latency is insignificant if you are on the same machine as the MySQL server. It can usually be ignored even if the two servers are in the same datacenter. Latency becomes important when the client and server are separated by a long distance: cross-Atlantic latency is over 100ms, and Brazil to China is about 250ms. (Be glad we are not living on Jupiter.)
Is it possible to have multiple workers running with Gunicorn and have them accessing some global variable in an ordered manner i.e. without running into problems with race conditions?
Assuming that by global variable you mean another process that keeps it in memory or on disk, yes, I think so. I haven't checked the source code of Gunicorn, but this is based on a problem I had with some old code: several users retrieved the same key from a legacy MyISAM table, incremented it, and used it (assuming it was unique) to create a new record. The result was that occasionally, under very heavy traffic, only one record ended up being created: the newest one overwrote the older ones, all of them using the same incremented key. The problem was never observed during a hardware upgrade, when I had reduced the site's Gunicorn workers to one, which was the reason to explore this probable cause in the first place.
Now usually, reducing the workers will degrade performance, and it is better to deal with these issues with transactions (if you are using an ACID RDBMS, unlike MyISAM). The same issue should be present in Redis and similar stores.
Also, this shouldn't be a problem with files and sockets, since, to my knowledge, the operating system will block other processes (even children) from accessing an open file.
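Coming back to the Redis point above: if the shared counter lives in a store with atomic primitives, using them instead of read-modify-write avoids the race entirely. A minimal sketch with redis-py (connection details assumed):

import redis

r = redis.Redis(host="localhost", port=6379)

def next_record_id():
    # INCR is atomic on the Redis server, so concurrent Gunicorn workers
    # can never be handed the same value.
    return r.incr("record_id")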
Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops, reading a document from a queue, extracting the information from it using processdocument, and writing the information to a new entry in a table in the database.
The worker loop breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is I'm not sure exactly what to put into the worker functions for each process to write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The Metadata instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just not bother with the multiprocessing and do this in a single process, but I will have to write code that has several processes reading from the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
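A minimal sketch of the per-subprocess Engine approach (the myschema module, URI, queue handling, and table are placeholders, not part of the original question; processdocument is the function the question describes):

from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

from myschema import documents_table, processdocument  # hypothetical module-level defs

def worker(queue, database_uri):
    # Build a fresh Engine inside the child process; DBAPI connections
    # created in the parent must not be reused across the fork.
    engine = create_engine(database_uri, poolclass=NullPool)
    while True:
        doc = queue.get()
        if doc is None:                 # sentinel pushed by the main routine
            break
        info = processdocument(doc)
        with engine.begin() as conn:    # one short transaction per document
            conn.execute(documents_table.insert().values(**info))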
I noticed that sqlite3 isn't really capable or reliable when I use it in a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread. On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
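A minimal pattern that stays within those rules is for each worker (thread or process) to open its own short-lived connection instead of sharing one; a sketch (the file and table names are made up):

import sqlite3
from multiprocessing import Pool

DB = "results.db"

def insert_row(value):
    conn = sqlite3.connect(DB, timeout=30)   # timeout waits out other writers
    try:
        with conn:                           # commits on success, rolls back on error
            conn.execute("INSERT INTO results (value) VALUES (?)", (value,))
    finally:
        conn.close()

if __name__ == "__main__":
    with sqlite3.connect(DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
    with Pool(4) as pool:
        pool.map(insert_row, range(100))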
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)": perhaps you could try replacing that data fetching with some dummy data to confirm that it is really the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) I found that multiple processes were all able to add through their own connections without issues. Your description, though, exactly matches the problem I'm having when two processes step on each other while going for the web API (a very odd error, actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
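A rough sketch of that detect-and-retry idea; fetch_from_web here is a stand-in for whatever web/API call is actually failing, not real code from the question:

import random
import time

def fetch_from_web(task):
    # Stand-in for the real API call; pretend it occasionally returns nothing.
    return task if random.random() > 0.2 else None

def fetch_with_retry(task, attempts=5):
    # Retry the flaky web call (not the sqlite write), backing off a little
    # between attempts.
    for attempt in range(attempts):
        data = fetch_from_web(task)
        if data is not None:
            return data
        time.sleep(2 ** attempt)
    raise RuntimeError("web fetch kept failing for task %r" % (task,))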
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLite, then I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
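The answer above suggests asynchat for the server piece; a simpler sketch of the same single-writer idea, using a multiprocessing.Queue instead of a socket server (all names here are made up), would look something like this:

import sqlite3
from multiprocessing import Process, Queue

def db_writer(queue, db_path="corpus.db"):
    # The only process that ever touches the SQLite file; it applies each
    # submitted transaction (a list of SQL statements) in order.
    conn = sqlite3.connect(db_path)
    while True:
        statements = queue.get()
        if statements is None:           # sentinel: shut down
            break
        with conn:                       # one transaction per submitted list
            for sql, params in statements:
                conn.execute(sql, params)
    conn.close()

if __name__ == "__main__":
    q = Queue()
    writer = Process(target=db_writer, args=(q,))
    writer.start()
    # Any other process submits its whole transaction as one list.
    q.put([("CREATE TABLE IF NOT EXISTS docs (body TEXT)", ()),
           ("INSERT INTO docs (body) VALUES (?)", ("hello",))])
    q.put(None)
    writer.join()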
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/