Understanding Python sqlite mechanics in multi-module environments - python

First off, I have no idea if "Ownership" is the correct term for this, it's just what I am calling it in Java.
I am currently building a Server that uses SQLite, and I am encountering errors concerning object "ownership":
I have one Module that manages the SQLite Database. Let's call it "pyDB". Simplified:
import threading
import sqlite3
class DB(object):
def __init__(self):
self.lockDB = threading.Lock()
self.conn = sqlite3.connect('./data.sqlite')
self.c = self.conn.cursor()
[...]
def doSomething(self,Param):
with self.lockDB:
self.c.execute("SELECT * FROM xyz WHERE ID = ?", Param)
(Note that the lockDB object is there because the Database-Class can be called by multiple concurrent threads, and although SQLite itself is thread-safe, the cursor-Object is not, as far as I know).
Then I have a worker thread that processes stuff.
import pyDB
self.DB = pyDB.DB()
class Thread(threading.Thread):
[omitting some stuff that is not relevant here]
def doSomethingElse(self, Param):
DB.doSomething(Param)
If I am executing this, I am getting the following exception:
self.process(task)
File "[removed]/ProcessingThread.py", line 67, in process
DB.doSomething(Param)
File "[removed]/pyDB.py", line 101, in doSomething
self.c.execute(self,"SELECT * FROM xyz WHERE ID = ?", Param)
ProgrammingError: SQLite objects created in a thread can only be used in that same
thread.The object was created in thread id 1073867776 and this is thread id 1106953360
Now, as far as I can see, this is the same problem I had earlier (Where Object ownership was given not to the initialized class, but to the one that called it. Or so I understand it), and this has led me to finally accept that I generally don't understand how object ownership in Python works. I have seached the Python Documentation for an understandable explanation, but have not found any.
So, my Questions are:
Who owns the cursor object in this case? The Processing Thread or the DB thread?
Where can I read up on this stuff to finally "get" it?
Is the term "Object ownership" even correct, or is there an other term for this in Python? (Edit: For explanations concerning this, read the comments of the main question)
I will be glad to take specific advice for this case, but am generally more interested in the whole concept of "what belongs to who" in Python, because to me it seems pretty different to the way Java handles it, and since I am planning to use Python a lot in the future, I might as well just learn it now, as this is a pretty important part of Python.

ProgrammingError: SQLite objects created in a thread can only be used in that same
The problem is that you're trying to conserve the cursor for some reason. You should not be doing this. Create a new cursor for every transaction; or if you're not totally sure where transactions start or end, a new cursor per query.
import sqlite3
class DB(object):
def __init__(self):
self.conn_uri = './data.sqlite'
[...]
def doSomething(self,Param):
conn = sqlite.connect(self.conn_uri)
c = conn.cursor()
c.execute("SELECT * FROM xyz WHERE ID = ?", Param)
Edit, Re comments in your question: What's going on here has very little to do with python. When you create a sqlite resource, which is a C library and totally independent of python, sqlite requires that resource be used only in the thread that created it. It verifies this by looking at the thread ID of the currently running thread, and not at all attempting to coordinate the transfer of the resource from one thread to another. As such, you are obligated to create sqlite resources in each thread that needs them.
In your code, you create all of the sqlite resources in the DB object's __init__ method, which is probably called only once, and in the main thread. Thus these resources are only permitted to be used in that thread, threading.Lock not withstanding.
Your questions:
Who owns the cursor object in this case? The Processing Thread or the DB thread?
The thread that created it. Since it looks like you're calling DB() at the module level, it's very likely that it's the main thread.
Where can I read up on this stuff to finally "get" it?
There's not really much of anything to get. Nothing is happening at all behind the scenes, except what SQLite has to say on the matter, when you are using it.
Is the term "Object ownership" even correct, or is there an other term for this in Python?
Python doesn't really have much of anything at all to do with threading, except that it allows you to use threads. It's on you to coordinate multi-threaded applications properly.
EDIT again:
Objects do not live inside particular threads. When you call a method on an object, that method runs in the calling thread. ten threads can call the same method on the same object; all will run concurrently (or whatever passes for that re the GIL), and it's up to the caller or the method body to make sure nothing breaks.

I'm the author of an alternate SQLite wrapper for Python (APSW) and very familiar with this issue. SQLite itself used to require that objects - the database connection and cursors could only be used in the same thread. Around SQLite 3.5 this was changed and you could use objects concurrently although internally SQLite did its own locking so you didn't actually get concurrent performance. The default Python SQLite wrapper (aka pysqlite) supports even old versions of SQLite 3 so it continues to enforce this restriction even though it is no longer necessary for SQLite itself. However the pysqlite code would need to be modified to allow concurrency as the way it wraps SQLite is not safe - eg handling error messages is not safe because of SQLite API design flaws and requires special handling.
Note that cursors are very cheap. Do not try to reuse them or treat them as precious. The actual underlying SQLite objects (sqlite3_stmt) are kept in a cache and reused as needed.
If you do want maximum concurrency then open multiple connections and use them simultaneously.
The APSW doc has more about multi-threading and re-entrancy. Note that it has extra code to allow the actual concurrent usage that pysqlite does not have, but the other tips and info apply to any usage of SQLite.

Related

How to use sqlite across multiple (spawned) python processes via sqlalchemy

I have a file called db.py with the following code:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
engine = create_engine('sqlite:///my_db.sqlite')
session = scoped_session(sessionmaker(bind=engine,autoflush=True))
I am trying to import this file in various subprocesses started using a spawn context (potentially important, since various fixes that worked for fork don't seem to work for spawn)
The import statement is something like:
from db import session
and then I use this session ad libitum without worrying about concurrency, assuming SQLite's internal locking mechanism will order transactions as to avoid concurrency error, I don't really care about transaction order.
This seems to result in errors like the following:
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 139813508335360 and this is thread id 139818279995200.
Mind you, this doesn't directly seem to affect my program, every transaction goes through just fine, but I am still worried about what's causing this.
My understanding was that scoped_session was thread-local, so I could import it however I want without issues. Furthermore, my assumption was that sqlalchemy will always handle the closing of connections and that sqllite will handle ordering (i.e. make a session wait for another seesion to end until it can do any transaction).
Obviously one of these assumptions is wrong, or I am misunderstanding something basic about the mechanism here, but I can't quite figure out what. Any suggestions would be useful.
The problem isn't about thread-local sessions, it's that the original connection object is in a different thread to those sessions. SQLite disables using a connection across different threads by default.
The simplest answer to your question is to turn off sqlite's same thread checking. In SQLAlchemy you can achieve this by specifying it as part of your database URL:
engine = create_engine('sqlite:///my_db.sqlite?check_same_thread=False')
I'm guessing that will do away with the errors, at least.
Depending on what you're doing, this may still be dangerous - if you're ensuring your transactions are serialised (that is, one after the other, never overlapping or simultaneous) then you're probably fine. If you can't guarantee that then you're risking data corruption, in which case you should consider a) using a database backend that can handle concurrent writes, or b) creating an intermediary app or service that solely manages sqlite reads and writes and that your other apps can communicate with. That latter option sounds fun but be warned you may end up reinventing the wheel when you're better off just spinning up a Postgres container or something.

How to diagnose extra SQLAlchemy connections in Pyramid

When my app runs, I'm very frequently getting issues around the connection pooling (one is "QueuePool limit of size 5 overflow 10 reached", another is "FATAL: remaining connection slots are reserved for non-replication superuser connections").
I have a feeling that it's due to some code not closing connections properly, or other code greedily trying to open new ones when it shouldn't, but I'm using the default SQL Alchemy settings so I assume the pool connection defaults shouldn't be unreasonable. We are using the scoped_session(sessionmaker()) way of creating the session so multiple threads are supported.
So my main question is if there is a tool or way to find out where the connections are going? Short of being able to see as soon as a new one is created (that is not supposed to be created), are there any obvious anti-patterns that might result in this effect?
Pyramid is very un-opinionated and with DB connections, there seem to be two main approaches (equally supported by Pyramid it would seem). In our case, the code base when I started the job used one approach (I'll call it the "globals" approach) and we've agreed to switch to another approach that relies less on globals and more on Pythonic idioms.
About our architecture: the application comprises one repo which houses the Pyramid project and then sources a number of other git modules, each of which had their own connection setup. The "globals" way connects to the database in a very non-ORM fashion, eg.:
(in each repo's __init__ file)
def load_database:
global tables
tables['table_name'] = Table(
'table_name', metadata,
Column('column_name', String),
)
There are related globals that are frequently peppered all over the code:
def function_needing_data(field_value):
global db, tables
select = sqlalchemy.sql.select(
[tables['table_name'].c.data], tables['table_name'].c.name == field_value)
return db.execute(select)
This tables variable is latched onto within each git repo which adds some more tables definitions and somehow the global tables manages to work, providing access to all of the tables.
The approach that we've moved to (although at this time, there are parts of both approaches still in the code) is via a centralised connection, binding all of the metadata to it and then querying the db in an ORM approach:
(model)
class ModelName(MetaDataBase):
__tablename__ = "models_table_name"
... (field values)
(function requiring data)
from models.db import DBSession
from models.model_name import ModelName
def function_needing_data(field_value):
return DBSession.query(ModelName).filter(
ModelName.field_value == field_value).all()
We've largely moved the code over to the latter approach which feels right, but perhaps I'm mistaken in my intentions. I don't know if there is anything inherently good or bad in either approach but could this (one of the approaches) be part of the problem so we keep running out of connections? Is there a telltale sign that I should look out for?
It appears that Pyramid functions best (in terms of handling the connection pool) when you use the Pyramid transaction manager (pyramid_tm). This excellent article by Jon Rosebaugh provides some helpful insight into both how Pyramid apps typically set up their database connections and how they should set them up.
In my case, it was necessary to include the pyramid_tm package and then remove a few occurrences where we were manually committing session changes since pyramid_tm will automatically commit changes if it doesn't see a reason not to.
[Update]
I continued to have connection pooling issues although much fewer of them. After a lot of debugging, I found that the pyramid transaction manager (if you're using it correctly) should not be the issue at all. The issue to the other connection pooling issues I had had to do with scripts that ran via cron jobs. A script will release it's connections when it's finished, but bad code design may result in situations where the same script can be opened up and starts running while the previous one is running (causing them both to run slower, slow enough to have both running while a third instance of the script starts and so on).
This is a more language- and database-agnostic error since it stems from poor job-scripting design but it's worth keeping in mind. In my case, the script had an "&" at the end so that each instance started as a background process, waited 10 seconds, then spawned another, rather than making sure the first job started AND completed, then waited 10 seconds, then started another.
Hope this helps when debugging this very frustrating and thorny issue.

SQLAlchemy - Multithreaded Persistent Object Creation, how to merge back into single session to avoid state conflict?

I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion due the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships-- thereby triggering the automatic addition to the identity map despite being created in separate, concurrent, threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
Hence why I believe the identity map has already been populated, because it's when I try to add and commit using the global session outside of the concurrent code, the assertion error is triggered.
The final detail is that whatever session object(s), (scoped or otherwise, as I don't fully understand how automatic addition to the identity map works in the case of multithreading) I cannot find a way / don't know how to get a reference to them so that even if I wanted to deal with a separate session per process I could.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
class ObjA(object):
obj_c = relationship('ObjC', backref='obj_c')
class ObjB(object):
obj_c = relationship('ObjC', backref='obj_c')
class ObjC(object):
obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
obj_b_id = Column(Integer, ForeignKey('obj_b.id'))
def __init__(self, obj_a, obj_b):
self.obj_a = obj_a
self.obj_b = obj_b
def make_a_bunch_of_c(obj_a, list_of_b=None):
return [ObjC(obj_a, obj_b) for obj_b in list_of_b]
def parallel_generate():
list_of_a = session.query(ObjA).all() # assume there are 1000 of these
list_of_b = session.query(ObjB).all() # and 30 of these
fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
pool = multiprocessing.Pool(10)
all_the_things = pool.map(fxn, list_of_a)
return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: The error occurs during the adding phase, the commit attempt never even happens because the assertion occurs pre-commit. As far as I could tell.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via the relationship cascading) is still beyond me and I cannot explain why the conflict originally occurred. All I know is that using the merge function will cause SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious in the why, if anyone happens across this.

How to use simple sqlalchemy calls while using thread/multiprocessing

Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops, reading a document from a queue, extracting the information from it using processdocument, and writes the information to a new entry in a table in the database.
The worker loops breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is I'm not sure exactly what to put into the worker functions for each process to write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure whether what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The Metadata instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just not bother with the multiprocessing but and do this in a single process, but I will have to write code that has several processes reading for the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.

SQLite3 and Multiprocessing

I noticed that sqlite3 isnĀ“t really capable nor reliable when i use it inside a multiprocessing enviroment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: Sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together or is there a reliable way to handle multi-connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than
one thread.
On some operating systems, a database connection should
always be used in the same thread in which it was originally created.
See here for a more detailed information: Is SQLite thread-safe?
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information
from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each
process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each others toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions, in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if then capability so that you could SELECT a record, check that a field is equal to X, and only then, UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/

Categories

Resources