I noticed that sqlite3 isnĀ“t really capable nor reliable when i use it inside a multiprocessing enviroment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: Sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together or is there a reliable way to handle multi-connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than
one thread.
On some operating systems, a database connection should
always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information
from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each
process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each others toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions, in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if then capability so that you could SELECT a record, check that a field is equal to X, and only then, UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
Related
When my app runs, I'm very frequently getting issues around the connection pooling (one is "QueuePool limit of size 5 overflow 10 reached", another is "FATAL: remaining connection slots are reserved for non-replication superuser connections").
I have a feeling that it's due to some code not closing connections properly, or other code greedily trying to open new ones when it shouldn't, but I'm using the default SQL Alchemy settings so I assume the pool connection defaults shouldn't be unreasonable. We are using the scoped_session(sessionmaker()) way of creating the session so multiple threads are supported.
So my main question is if there is a tool or way to find out where the connections are going? Short of being able to see as soon as a new one is created (that is not supposed to be created), are there any obvious anti-patterns that might result in this effect?
Pyramid is very un-opinionated and with DB connections, there seem to be two main approaches (equally supported by Pyramid it would seem). In our case, the code base when I started the job used one approach (I'll call it the "globals" approach) and we've agreed to switch to another approach that relies less on globals and more on Pythonic idioms.
About our architecture: the application comprises one repo which houses the Pyramid project and then sources a number of other git modules, each of which had their own connection setup. The "globals" way connects to the database in a very non-ORM fashion, eg.:
(in each repo's __init__ file)
def load_database:
global tables
tables['table_name'] = Table(
'table_name', metadata,
Column('column_name', String),
)
There are related globals that are frequently peppered all over the code:
def function_needing_data(field_value):
global db, tables
select = sqlalchemy.sql.select(
[tables['table_name'].c.data], tables['table_name'].c.name == field_value)
return db.execute(select)
This tables variable is latched onto within each git repo which adds some more tables definitions and somehow the global tables manages to work, providing access to all of the tables.
The approach that we've moved to (although at this time, there are parts of both approaches still in the code) is via a centralised connection, binding all of the metadata to it and then querying the db in an ORM approach:
(model)
class ModelName(MetaDataBase):
__tablename__ = "models_table_name"
... (field values)
(function requiring data)
from models.db import DBSession
from models.model_name import ModelName
def function_needing_data(field_value):
return DBSession.query(ModelName).filter(
ModelName.field_value == field_value).all()
We've largely moved the code over to the latter approach which feels right, but perhaps I'm mistaken in my intentions. I don't know if there is anything inherently good or bad in either approach but could this (one of the approaches) be part of the problem so we keep running out of connections? Is there a telltale sign that I should look out for?
It appears that Pyramid functions best (in terms of handling the connection pool) when you use the Pyramid transaction manager (pyramid_tm). This excellent article by Jon Rosebaugh provides some helpful insight into both how Pyramid apps typically set up their database connections and how they should set them up.
In my case, it was necessary to include the pyramid_tm package and then remove a few occurrences where we were manually committing session changes since pyramid_tm will automatically commit changes if it doesn't see a reason not to.
[Update]
I continued to have connection pooling issues although much fewer of them. After a lot of debugging, I found that the pyramid transaction manager (if you're using it correctly) should not be the issue at all. The issue to the other connection pooling issues I had had to do with scripts that ran via cron jobs. A script will release it's connections when it's finished, but bad code design may result in situations where the same script can be opened up and starts running while the previous one is running (causing them both to run slower, slow enough to have both running while a third instance of the script starts and so on).
This is a more language- and database-agnostic error since it stems from poor job-scripting design but it's worth keeping in mind. In my case, the script had an "&" at the end so that each instance started as a background process, waited 10 seconds, then spawned another, rather than making sure the first job started AND completed, then waited 10 seconds, then started another.
Hope this helps when debugging this very frustrating and thorny issue.
I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the mysqldb library, catch exceptions and reconnect when necessary? Or is there an already made solution as part of sqlalchemy or another package? Any ideas appreciated!
I think the right answer is to try and handle the connection errors; it sounds like you'd only be pulling in a much a larger library just for this feature, while trying and catching is probably how it's done, whatever level of the stack it's at. If necessary, you could multithread these things since they're probably IO-bound (i.e. suitable for Python GIL threading as opposed to multiprocessing) and decouple the production and the consumption with a queue, too, which would maybe take some of the load off of the database connection.
I've got a sqlite3 database and I want to write in it from multiple threads. I've got multiple ideas but I'm not sure which I should implement.
create multiple connection, detect and waif if the DB is locked
use one connection and try to make use of Serialized connections (which don't seem to be implemented in python)
have a background process with a single connection, which collects the queries from all threads and then executes them on their behalft
forget about SQlite and use something like Postgresql
What are the advances of these different approaches and which is most likely to be fruitful? Are there any other possibilities?
Try to use https://pypi.python.org/pypi/sqlitedict
A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
But take into account "Concurrent requests are still serialized internally, so this "multithreaded support" doesn't give you any performance benefits. It is a work-around for sqlite limitations in Python."
PostgreSQL, MySQL, etc. give you the better performance for several connections in one time
I used method 1 before. It is the easiest in coding. Since that project has a small website, each query take only several milliseconds. All the users requests can be processed promptly.
I also used method 3 before. Because when the query take longer time, it is better to queue the queries since frequent "detect and wait" makes no sense here. And would require a classic consumer-producer model. It would require more time to code.
But if the query is really heavy and frequent. I suggest look to other db like MS SQL/MySQL.
Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops, reading a document from a queue, extracting the information from it using processdocument, and writes the information to a new entry in a table in the database.
The worker loops breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is I'm not sure exactly what to put into the worker functions for each process to write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure whether what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The Metadata instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just not bother with the multiprocessing but and do this in a single process, but I will have to write code that has several processes reading for the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
I am using PostgreSQL 8.4. I really like the new unnest() and array_agg() features; it is about time they realize the dynamic processing potential of their Arrays!
Anyway, I am working on web server back ends that uses long Arrays a lot. Their will be two successive processes which will each occur on a different physical machine. Each such process is a light python application which ''manage'' SQL queries to the database on each of their machines as well as requests from the front ends.
The first process will generate an Array which will be buffered into an SQL Table. Each such generated Array is accessible via a Primary Key. When its done the first python app sends the key to the second python app. Then the second python app, which is running on a different machine, uses it to go get the referenced Array found in the first machine. It then sends it to it's own db for generating a final result.
The reason why I send a key is because I am hopping that this will make the two processes go faster. But really what I would like is for a way to have the second database send a query to the first database in the hope of minimizing serialization delay and such.
Any help/advice would be appreciated.
Thanks
Sounds like you want dblink from contrib. This allows some inter-db postgres communication. The pg docs are great and should provide the needed examples.
not sure I totally understand, but you've looked at notify/listen? http://www.postgresql.org/docs/8.1/static/sql-listen.html
I am thinking either listen/notify or something with a cache such as memcache. You would send the key to memcache and have the second python app retrieve it from there. You could even do it with listen/notify... e.g; send the key and notify your second app that the key is in memcache waiting to be retrieved.