Concurrency on PostgreSQL Database with python subprocesses

I use Python multiprocessing processes to establish multiple connections to a PostgreSQL database via psycopg2.
Every process establishes a connection, creates a cursor, fetches an object from an mp.Queue, and does some work on the database. If everything works fine, the changes are committed and the connection is closed.
If, however, one of the processes raises an error (e.g. an ADD COLUMN request fails because the column is already present), all the processes seem to stop working.
import psycopg2
import multiprocessing as mp
import Queue

def connect():
    C = psycopg2.connect(host="myhost", user="myuser", password="supersafe", port=62013, database="db")
    cur = C.cursor()
    return C, cur

def commit_and_close(C, cur):
    C.commit()
    cur.close()
    C.close()

def commit(C):
    C.commit()

def sub(queue):
    C, cur = connect()
    while not queue.empty():
        work_element = queue.get(timeout=1)
        # do something with the work element, which might produce an SQL error
    commit_and_close(C, cur)
    return 0

if __name__ == '__main__':
    job_queue = mp.Queue()
    # fill job_queue
    print 'Run'
    for i in range(20):
        p = mp.Process(target=sub, args=(job_queue,))
        p.start()
I can see that the processes are still alive (because the job_queue is still full), but no network traffic / SQL actions are happening. Is it possible that an SQL error blocks communication from the other subprocesses? How can I prevent that from happening?

As chance would have it, I was doing something similar today.
It shouldn't be that the state of one connection can affect a different one, so I don't think we should start there.
There is clearly a race condition in your queue handling. You check whether the queue is empty and then try to get a work element. With multiple readers, one of the others can empty the queue between those two steps, leaving the rest blocked on their queue.get. If the queue is empty when they all lock up, I would suspect this.
You also never join your processes back when they complete. I'm not sure what effect that would have in the larger picture, but it's probably good practice to clean up.
The other thing that might be happening is that your erroring process is not rolling back properly. That might leave other transactions waiting to see whether it commits or rolls back. They can wait for quite a long time by default, but you can configure it.
To see what is happening, fire up psql and check out two useful system views pg_stat_activity and pg_locks. That should show where the cause lies.
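A rough sketch of how the worker loop could be restructured along those lines, keeping the names from the question; process_work_element is a hypothetical placeholder for whatever SQL the work element triggers. The loop handles the empty-queue race by catching Queue.Empty instead of relying on queue.empty(), and rolls back the transaction when a statement fails so the connection is left in a usable state rather than stuck in an aborted transaction:

def sub(queue):
    C, cur = connect()
    while True:
        try:
            work_element = queue.get(timeout=1)
        except Queue.Empty:
            break  # another worker drained the queue; just finish up
        try:
            process_work_element(cur, work_element)  # hypothetical placeholder
            C.commit()  # commit each successful work element
        except psycopg2.Error as e:
            C.rollback()  # keep the connection usable after a failed statement
            print 'Skipping work element:', e
    cur.close()
    C.close()
    return 0

And in the parent, joining the workers after starting them keeps the shutdown explicit:

processes = [mp.Process(target=sub, args=(job_queue,)) for i in range(20)]
for p in processes:
    p.start()
for p in processes:
    p.join()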

Related

Snowflake SQLAlchemy connection pooling is rolling back transaction by default, while the documentation says the default is commit transaction

I am doing a PoC where I want to implement connection pooling in Python. I use snowflake-sqlalchemy since my target is a Snowflake database. I wanted to check the following scenarios:
1. Run multiple stored procedures/SQL statements in parallel.
2. If the number of SPs/SQL statements that need to execute in parallel exceeds the pool's connections, the code should wait. I am writing a while loop for this purpose.
I have defined a function func1, and it is executed in parallel multiple times to simulate multi-threaded processing.
Problem Description:
Everything is working fine. The pool size is 2 and max_overflow is 5. When the code runs, it runs 7 parallel SPs/SQLs instead of 10. This is understood: the pool can only support 7 connections running in parallel, and the rest have to wait until one of the thread executions completes. On the Snowflake side, the session is established, the query executes, and then a rollback is triggered. I am unable to figure out why the rollback is triggered. I also tried to explicitly issue a commit statement from my code; it executes the commit and then a rollback is still triggered. The second problem is that, even though I am not closing the connection, the connection is closed automatically after the thread execution is complete and the next set of statements is triggered. Ideally, I was expecting the thread not to release the connection, since the connection is never closed.
Summary: I want to know two things:
1. Why is the rollback happening?
2. Why is the connection released after the thread execution is complete, even though the connection is not returned to the pool?
I would appreciate help in understanding the execution process. I am stuck and unable to figure out what's happening.
import time
from concurrent.futures import ThreadPoolExecutor, wait
from functools import partial
from sqlalchemy.pool import QueuePool
from snowflake.connector import connect

def get_conn():
    return connect(
        user='<username>',
        password='<password>',
        account='<account>',
        warehouse='<WH>',
        database='<DB>',
        schema='<Schema>',
        role='<role>'
    )

pool = QueuePool(get_conn, max_overflow=5, pool_size=2, timeout=10.0)

def func1():
    conn = pool.connect()
    print("Connection is created and SP is triggered")
    print("Pool size is:", pool.size())
    print("Checked out connections are:", pool.checkedout())
    print("Checked in connections are:", pool.checkedin())
    print("Overflow connections are:", pool.overflow())
    print("Current status of pool is:", pool.status())
    c = conn.cursor()
    # c.execute("select current_session()")
    c.execute("call SYSTEM$WAIT(60)")
    # c.execute("commit")
    # conn.close()

func_list = []
for _ in range(0, 10):
    func_list.append(partial(func1))
print(func_list)

proc = []
with ThreadPoolExecutor() as executor:
    for func in func_list:
        print("Calling func")
        while pool.checkedout() == pool.size() + pool._max_overflow:
            print("Max connections reached. Waiting for an existing connection to release")
            time.sleep(15)
        p = executor.submit(func)
        proc.append(p)
    wait(proc)

pool.dispose()
print("process complete")

Python multiprocessing on mysqldb

I'm trying to update a really large amount of data in a MySQL DB and, at the same time, watch in the process list what it is doing!
So I made the following script:
I have a modified MySQL DB class that takes care of connecting. Everything works fine unless I use multiprocessing; if I do, at some point I get a "Lost connection to database" error.
The script is like:
from mysql import DB
import multiprocessing

def check_writing(db):
    result = db.execute("show full processlist").fetchall()
    for i in result:
        if i['State'] == "updating":
            print i['Info']

def main(db):
    # some work to create a big list of tuples, called tuple
    sql = "update `table_name` set `field` = %s where `primary_key_id` = %s"
    monitor = multiprocessing.Process(target=check_writing, args=(db,))  # I create the monitor process
    monitor.start()
    db.execute_many(sql, tuple)  # I start to modify the table
    monitor.terminate()
    monitor.join()

if __name__ == "__main__":
    db = DB(host, user, password, database_name)  # this way I create the connected object
    main(db)
    db.close()
And a part of my MySQL class is:
class DB:
    def __init__(self, host, user, password, db_name):
        self.db = MySQLdb.connect(host=host, ...)  # etc.

    def execute_many(self, sql, data):
        c = self.db.cursor()
        c.executemany(sql, data)
        c.close()
        self.db.commit()
As I said before, if I don't execute check_writing, the script works fine!
Can someone explain the cause and how I can overcome it? Also, I have problems trying to write to MySQL from a ThreadPool using map (or map_async).
Am I missing something related to MySQL?
There is a better way to approach that:
Connector/Python Connection Pooling:
The mysql.connector.pooling module implements pooling.
A pool opens a number of connections and handles thread safety when providing connections to requesters.
The size of a connection pool is configurable at pool creation time. It cannot be resized thereafter.
It is possible to have multiple connection pools. This enables applications to support pools of connections to different MySQL servers, for example.
Check the documentation here.
I think your parallel processes are exhausting your MySQL connections.
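A minimal sketch of the pooled approach described above, assuming the mysql-connector-python package (a different driver from the MySQLdb used in the question) and placeholder credentials; each worker borrows its own connection from the pool instead of sharing one handle across processes:

from mysql.connector import pooling

dbconfig = {
    "host": "localhost",      # placeholder connection details
    "user": "myuser",
    "password": "mypassword",
    "database": "mydb",
}

# The pool size is fixed at creation time and cannot be resized later.
pool = pooling.MySQLConnectionPool(pool_name="mypool", pool_size=5, **dbconfig)

def check_writing():
    conn = pool.get_connection()          # borrow a dedicated connection
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("show full processlist")
        for row in cur.fetchall():
            if row["State"] == "updating":
                print(row["Info"])
        cur.close()
    finally:
        conn.close()                      # returns the connection to the pool

Note that a pool built in the parent cannot be shared with children created via fork; each process (or the monitor process alone) should build its own pool.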

SQLite-connect() blocks due to unrelated connect() on separate process

I want to launch a separate process that connects to an SQLite DB (the ultimate goal is to run a service on that process). This typically works fine. However, when I connect to another DB file before launching the process, the connect() call in the child blocks completely: it neither finishes nor raises an error.
import sqlite3, multiprocessing, time

def connect(filename):
    print 'Creating a file on current process!'
    sqlite3.connect(filename).close()

def connect_process(filename):
    def process_f():
        print 'Gets here...'
        conn = sqlite3.connect(filename)
        print '...but not here when local process has previously connected to any unrelated sqlite-file!!'
        conn.close()
    process = multiprocessing.Process(target=process_f)
    process.start()
    process.join()

if(__name__=='__main__'):
    connect_process('my_db_1')  # Just to show that it generally works
    time.sleep(0.5)
    connect('any_file')         # Connect to unrelated file
    connect_process('my_db_2')  # Does not get to the end!!
    time.sleep(2)
This returns:
> Gets here...
...but not here when local process has connected to any unrelated sqlite-file!!
Creating a file on current process!
Gets here..
So we would expect one more line ("...but not here when...") to be printed at the end.
Remarks:
- I know that SQLite cannot handle concurrent access. It should however work here for two reasons: 1) the file I connect to on my local process is different from the separately created one, and 2) the connection to the former file is long closed by the time the process gets created.
- The only operation I use here is connecting to the DB and then closing the connection immediately (which creates the file if it does not exist). I have of course verified that we get the same behavior if we actually do anything meaningful.
- The code is just a minimal working example for what I really want to do. The goal is to test a service that uses SQLite. Hence, in the test setup, I need to create some mock SQLite-files… The service is then launched on a separate process in order to test it via the respective client.
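One experiment that may help isolate the problem, offered purely as a diagnostic and not a confirmed fix: since the default fork start method copies the parent's process state into the child, starting the child with the spawn start method (Python 3.4+; the question's code is Python 2) rules out any interaction between state inherited across fork and the child's sqlite3.connect(). The target has to be a module-level function for spawn, so the nested process_f would move to module scope, roughly like this:

import multiprocessing
import sqlite3

def process_f(filename):
    # connect and close in the child, as in the question
    conn = sqlite3.connect(filename)
    conn.close()

if __name__ == '__main__':
    # spawn starts the child from a fresh interpreter instead of fork-copying the parent
    ctx = multiprocessing.get_context('spawn')
    p = ctx.Process(target=process_f, args=('my_db_2',))
    p.start()
    p.join()

If the child no longer blocks under spawn, the cause is something inherited from the parent process rather than the DB files themselves.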

Modifying database from Pool does not work (but no Exceptions)

I've created a method which downloads a product from a web page and then stores that product into an SQLite3 database.
This function works fine when it's called normally, but I want to make a pool and run it in parallel in order to send parallel requests (the web page allows bots to send 2000 requests/minute).
The problem is that when I try to put it into the pool, it does not store the data into the database, nor does it raise any error or exception.
Here is the code of main function:
if __name__ == '__main__':
    pu = product_updater()  # class which handles almost all; this class also has a database manager class as an attribute
    pool = Pool(10)
    for line in lines[0:100]:  # the list lines is a list of urls
        # pu.update_product(line[:-1])  # this works correctly
        pool.apply_async(pu.update_product, args=(line[:-1],))  # this runs but does not store products into the database
    pool.close()
    pool.join()

def update_product(self, url):  # This method belongs to the product_updater class
    prod = self.parse_product(url)
    self.man.insert_product(prod)  # man is a class for handling the database
I use this pool: from multiprocessing.pool import ThreadPool as Pool
Do you know what could be wrong?
EDIT: I think it could be caused by there being only one cursor shared between workers, but if that were the problem, I would expect it to raise some exception.
EDIT2: The weird thing is that I tried a Pool with only 1 worker, so there shouldn't be a concurrency problem, but I get the same result: no new rows in the database.
multiprocessing.Pool does not report exceptions raised within the workers as long as you don't ask for the results of the tasks.
This example will be silent.
from multiprocessing import Pool

def function():
    raise Exception("BOOM!")

p = Pool()
p.apply_async(function)
p.close()
p.join()
This example instead will show the exception.
from multiprocessing import Pool

def function():
    raise Exception("BOOM!")

p = Pool()
task = p.apply_async(function)
task.get()  # <---- you will get the exception here
p.close()
p.join()
The root cause of your issue is the sharing of a single cursor object, which is not thread/process safe. As multiple workers read and write through the same cursor, things break, and the Pool silently eats the exception (om nom).
The first step is to collect the task results as shown above, so that problems become visible. Then give each worker a dedicated connection and cursor, as in the sketch below.
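A rough sketch of both points combined, using hypothetical names (product_db.sqlite and the insert statement) since the question's database-manager class is not shown; every task opens its own connection, and the results are collected so that any exception surfaces:

import sqlite3
from multiprocessing.pool import ThreadPool as Pool

def update_product(url):
    prod = parse_product(url)                    # parser from the question, assumed available
    conn = sqlite3.connect("product_db.sqlite")  # dedicated connection per task
    try:
        with conn:                               # commits on success, rolls back on error
            conn.execute(
                "insert into products (url, data) values (?, ?)",  # hypothetical schema
                (url, prod),
            )
    finally:
        conn.close()

pool = Pool(10)
results = [pool.apply_async(update_product, args=(line[:-1],)) for line in lines[0:100]]
pool.close()
pool.join()
for r in results:
    r.get()  # re-raises any exception from the worker instead of swallowing it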

How do I cleanly exit from a multiprocessing script?

I am building a non-blocking chat application for my website, and I decided to implement some multiprocessing to deal with DB querying and real-time messaging.
I assume that when a user lands on a given URL to see their conversation with the other person, I will fire off the script, the multiprocessing will begin, the messages will be added to a queue and displayed on the page, new messages will be sent to a separate queue that interacts with the DB, etc. (Regular message features ensue.)
However, what happens when the user leaves this page? I assume I need to exit these various processes, but currently this does not lend itself to a "clean" exit. I would have to terminate the processes, and according to the multiprocessing docs:
Warning: If this method (terminate()) is used when the associated process is using a pipe or queue then the pipe or queue is liable to become corrupted and may become unusable by other process. Similarly, if the process has acquired a lock or semaphore etc. then terminating it is liable to cause other processes to deadlock.
I have also looked into sys.exit(); however, it doesn't fully exit the script without the use of terminate() on the various processes.
Here is my code that is simplified for the purposes of this question. If I need to change it, that's completely fine. I simply want to make sure I am going about this appropriately.
import multiprocessing
import Queue
import time
import sys

## Get all past messages
def batch_messages():
    # The messages list here will be attained via a db query
    messages = [">> This is the message.", ">> Hello, how are you doing today?", ">> Really good!"]
    for m in messages:
        print m

## Add messages to the DB
def add_messages(q2):
    while True:
        # Retrieve from the queue
        message_to_db = q2.get()
        # For testing purposes only; perform another DB query to add the message to the DB
        print message_to_db, "(Add to DB)"

## Receive new, inputted messages
def receive_new_message(q1, q2):
    while True:
        # Get the new message from the queue:
        new_message = q1.get()
        # Print the message to the (other user's) screen
        print ">>", new_message
        # Add the q1 message to q2 for database manipulation
        q2.put(new_message)

def shutdown():
    print "Shutdown initiated"
    p_rec.terminate()
    p_batch.terminate()
    p_add.terminate()
    sys.exit()

if __name__ == "__main__":
    # Set up the queues
    q1 = multiprocessing.Queue()
    q2 = multiprocessing.Queue()
    # Set up the processes
    p_batch = multiprocessing.Process(target=batch_messages)
    p_add = multiprocessing.Process(target=add_messages, args=(q2,))
    p_rec = multiprocessing.Process(target=receive_new_message, args=(q1, q2,))
    # Start the processes
    p_batch.start()  # Perform batch get
    p_rec.start()
    p_add.start()
    time.sleep(0.1)  # Test: sleep to allow proper formatting
    while True:
        # Enter a new message
        input_message = raw_input("Type a message: ")
        # TEST PURPOSES ONLY: shutdown
        if input_message == "shutdown_now":
            shutdown()
        # Add the new message to the queue:
        q1.put(input_message)
        # Let the processes catch up before printing "Type a message: " again. (Shell purposes only)
        time.sleep(0.1)
How should I deal with this situation? Does my code need to be fundamentally revised, and if so, what should I do to fix it?
Any thoughts, comments, revisions, or resources appreciated.
Thank you!
Disclaimer: I don't actually know Python. But multithreading concepts are similar enough in all the languages I do know that I feel confident enough to try to answer anyway.
When using multiple threads/processes, each one should have a step in its loop that checks a variable (I often call the variable "active" or "keepGoing" or something similar, and it's usually a boolean).
That variable is usually either shared between the threads or sent as a message to each thread, depending on your programming language and on when you want the processing to stop (finish your work first, y/n?).
Once the variable is set, all threads quit their processing loops and proceed to exit.
In your case you have "while True" loops, which never exit. Change them to exit when the variable is set, and each thread will close itself once the end of its function is reached; a sketch of this idea follows below.
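A minimal sketch of that idea applied to the question's code, assuming a multiprocessing.Event as the shared "keep going" flag; the workers poll the queue with a timeout so they notice the flag instead of blocking forever in q.get(). The worker would be started as multiprocessing.Process(target=add_messages, args=(q2, stop_event)), and likewise for the other workers:

import multiprocessing
import Queue  # Python 2 queue module, as in the question; provides the Empty exception

stop_event = multiprocessing.Event()

def add_messages(q2, stop_event):
    while not stop_event.is_set():
        try:
            message_to_db = q2.get(timeout=0.5)  # wake up regularly to re-check the flag
        except Queue.Empty:
            continue
        print message_to_db, "(Add to DB)"

def shutdown(processes, stop_event):
    stop_event.set()           # ask every worker to finish its loop
    for p in processes:
        p.join(timeout=5)      # give them a chance to exit cleanly
    for p in processes:
        if p.is_alive():
            p.terminate()      # terminate() stays as a last resort only

Here processes would be the list [p_batch, p_add, p_rec] from the question, and the same flag-checking pattern applies to receive_new_message.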
