Time out and connection lost in MySQL in multithreading - Python

I am working on an online judge. I am using Python 2.7 and MySQL (I am working on the back-end part).
My Method:
I create a main thread which pulls submissions out of the database (10 at a time) and puts them in a queue. Then I have multiple threads that take submissions from the queue, evaluate them, and write the results back to the database.
My doubts:
1. The main thread and the other threads are each assigned their own database connection at the beginning. But I guess this is not a good approach, because sometimes I get the error Lost connection to MySQL server during query, which I guess happens when the resources of a DB connection are exhausted. Then I looked up psqlpool. So I want to know whether the connections provided by the pool are dedicated or shared (I want dedicated).
2. Also, when I stop my main thread, all the other threads stop (as their daemon flag is set to True), but the DB connections are not closed (since I stop the main thread with Ctrl-Z). So the next time I start my program, I get Lock wait timeout exceeded; try restarting transaction errors, which are caused by the previous connections that were never closed. Rather than manually killing them from SHOW FULL PROCESSLIST, is there another method? Also, how would we solve this with psqlpool, or is it already handled by the library?
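One way to avoid leaving orphaned connections behind is to trap termination signals in the main thread and close every connection before exiting. A minimal sketch, assuming each thread registers its connection in a shared list (connection parameters are placeholders; note that Ctrl-Z only suspends the process, so stopping with Ctrl-C or kill is assumed):

import signal
import sys
import threading
import mysql.connector

connections = []                    # every connection handed out to a thread
connections_lock = threading.Lock()

def make_connection():
    conn = mysql.connector.connect(host="localhost", user="judge",
                                   password="secret", database="judge")
    with connections_lock:
        connections.append(conn)
    return conn

def shutdown(signum, frame):
    # Close every tracked connection so MySQL does not keep stale
    # sessions (and the locks they hold) alive after we exit.
    with connections_lock:
        for conn in connections:
            try:
                conn.close()
            except Exception:
                pass
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)   # Ctrl-C
signal.signal(signal.SIGTERM, shutdown)  # kill <pid>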

Related

What happens when multiple threads are writing to a single table in MySQL?

I am facing a situation where multiple threads are trying to insert data into the same table in MySQL. **Will it be OK without explicitly handling it?** I am afraid that with every thread inserting, some thread will get locked and held up too long, and that will eventually cause the program to fail.
Basically, what I am trying to do is the following:
import threading
import mysql.connector

# One connection and cursor shared by every thread
db = mysql.connector.connect()      # connection parameters omitted
cursor = db.cursor()

def update_to_table(data):
    sql = "INSERT INTO my_db.my_table VALUES (%s)"
    cursor.execute(sql, (data,))
    db.commit()
    print("update complete!")

for i in range(10):
    print("%d-th time..." % i)
    data = get_data(i)              # get_data() is defined elsewhere
    t = threading.Thread(target=update_to_table, args=(data,))
    t.start()
Do I need to check whether other threads are inserting, and hold on and wait until they finish, etc.?
The data for different i has no overlap, so we don't need to worry about the duplicate-key problem.
After experimenting, it seems that some threads hang and never respond.
According to the MySQL Connector/Python Developer Guide, the mysql.connector.threadsafety property is 1.
According to PEP 249, the meaning of the threadsafety property is as follows:
0 - Threads may not share the module.
1 - Threads may share the module, but not connections.
2 - Threads may share the module and connections.
3 - Threads may share the module, connections and cursors.
Sharing in the above context means that two threads may use a resource without wrapping it using a mutex semaphore to implement resource locking. Note that you cannot always make external resources thread safe by managing access using a mutex: the resource may rely on global variables or other external sources that are beyond your control.
In your example, you have threads sharing a single connection without any explicit resource locking. That is liable to lead to threading problems, and the symptoms you observe (threads locking up) are not unexpected.
The simple solution in this example is to give each thread its own connection object.
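A minimal rework of the example along those lines, sketched under the same assumptions as the question's code (connection parameters omitted, get_data() defined elsewhere):

import threading
import mysql.connector

def update_to_table(data):
    # Each thread opens, uses, and closes its own connection.
    db = mysql.connector.connect()  # connection parameters omitted
    cursor = db.cursor()
    cursor.execute("INSERT INTO my_db.my_table VALUES (%s)", (data,))
    db.commit()
    db.close()
    print("update complete!")

for i in range(10):
    data = get_data(i)              # get_data() is defined elsewhere
    threading.Thread(target=update_to_table, args=(data,)).start()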
(If the thread count were large, you would be advised to use a connection pool with a bound on the number of concurrent connections. The DB server will limit the number of connections that one client can have open, to husband server-side resources. Furthermore, there will be a point at which you are using all of some particular server-side resource, e.g. CPU, memory, disk bandwidth, or network bandwidth. Beyond that point, adding more client threads won't increase throughput.)
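Connector/Python ships a simple pool that gives each borrower a dedicated connection until it is returned; a hedged sketch (pool name, size, and credentials are placeholders):

from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(pool_name="writer_pool",
                                   pool_size=5,   # bound on concurrent connections
                                   host="localhost", user="writer",
                                   password="secret", database="my_db")

def update_to_table(data):
    db = pool.get_connection()      # raises PoolError if the pool is exhausted
    try:
        cursor = db.cursor()
        cursor.execute("INSERT INTO my_db.my_table VALUES (%s)", (data,))
        db.commit()
    finally:
        db.close()                  # returns the connection to the pool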

Tornado multiple processes: create multiple MySQL connections

I'm running a Tornado HTTPS server across multiple processes using the first method described at http://www.tornadoweb.org/en/stable/guide/running.html (server.start(n)).
The server is connected to a local MySQL instance, and I would like to have an independent MySQL connection per Tornado process.
However, right now I only have one MySQL connection according to the output of SHOW PROCESSLIST. I guess this happens because I establish the connection before calling server.start(n) and IOLoop.current().start(), right?
What I don't really understand is whether the processes created after calling server.start(n) share some data (for instance, global variables within the same module) or are totally independent.
Should I establish the connection after calling server.start(n)? Or after calling IOLoop.current().start()? If I do so, will I have one MySQL connection per Tornado process?
Thanks
Each child process gets a copy of the variables that existed in the parent process when start(n) was called. For things like connections, this will usually cause problems. When using multi-process mode, it's important to do as little as possible before starting the child processes, so don't create the MySQL connections until after start(n) (but before IOLoop.start(); IOLoop.start() doesn't return until the server is stopped).
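A minimal ordering sketch (the application object and connection parameters are placeholders):

import tornado.httpserver
import tornado.ioloop
import mysql.connector

server = tornado.httpserver.HTTPServer(app)   # app built earlier; ssl_options omitted
server.bind(8443)
server.start(4)   # forks 4 children; everything below runs once per child

# Connect *after* the fork so each child gets its own connection,
# not a copy of the parent's socket.
db = mysql.connector.connect(host="localhost", user="web",
                             password="secret", database="mydb")

tornado.ioloop.IOLoop.current().start()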

Creating separate database connection for every celery worker

I keep running into weird MySQL issues while workers are executing tasks just after creation.
We use Django 1.3, Celery 3.1.17, and djorm-ext-pool 0.5.
We start the Celery process with concurrency 3.
My observation so far is that when the worker processes start, they all get the same MySQL connection. We log the DB connection id as below.
from django.db import connection
connection.cursor()
logger.info("Task %s processing with db connection %s", str(task_id), str(connection.connection.thread_id()))
When all the workers get tasks, the first one executes successfully but the other two give weird MySQL errors. They either fail with "MySQL server has gone away", or with a condition where Django throws a "DoesNotExist" error, even though the objects that Django is querying clearly do exist.
After this error, each worker starts getting its own database connection, after which we don't see any issue.
What is the default behavior of Celery? Is it designed to share the same database connection? If so, how is the inter-process communication handled?
I would ideally prefer a different database connection for each worker.
I tried the code mentioned in the link below, which did not work:
Celery Worker Database Connection Pooling
We have also applied the Celery fix suggested here:
https://github.com/celery/celery/issues/2453
For those who downvote the question, kindly let me know the reason for the downvote.
Celery is started with the command below:
celery -A myproject worker --loglevel=debug --concurrency=3 -Q testqueue
myproject.py, as part of the master process, was making some queries to the MySQL database before forking the worker processes.
As part of the query flow in the main process, the Django ORM creates an SQLAlchemy connection pool if one does not already exist. The worker processes are then created.
Celery, as part of its Django fixups, closes existing connections:
def close_database(self, **kwargs):
    if self._close_old_connections:
        return self._close_old_connections()  # Django 1.6
    if not self.db_reuse_max:
        return self._close_database()
    if self._db_recycles >= self.db_reuse_max * 2:
        self._db_recycles = 0
        self._close_database()
    self._db_recycles += 1
In effect, what could be happening is that the SQLAlchemy pool object, with one unused DB connection, gets copied to the 3 worker processes when they are forked. So the 3 different pools have 3 connection objects pointing to the same connection file descriptor.
When the workers ask for a DB connection while executing tasks, they all get the same unused connection from their SQLAlchemy pool, because it is currently unused. The fact that all these connections point to the same file descriptor is what causes the "MySQL server has gone away" errors.
Connections created thereafter are all new and don't point to the same socket file descriptor.
Solution:
In the main process, add
from django.db import connection
connection.cursor()
before any other import is done, i.e. before even the djorm-ext-pool module is imported.
That way, all the DB queries in the main process will use the connection created by Django outside the pool. When the Celery Django fixup closes the connection, the connection actually gets closed, as opposed to going back to the SQLAlchemy pool. The pool is therefore empty at the time it is copied over to all the workers on fork, and when the workers later ask for a DB connection, SQLAlchemy returns them newly created connections.
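Applied to this setup, the top of myproject.py would look something like the sketch below (the djorm_pool import name is an assumption based on the djorm-ext-pool package):

# myproject.py -- the order of these lines is what matters
from django.db import connection
connection.cursor()   # force Django to open its own, non-pooled connection first

import djorm_pool     # assumed import name for djorm-ext-pool; pooling starts here
# ... remaining Django/Celery setup and imports ...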

ZeroMQ is too fast for database transaction

Inside a web application (Pyramid) I create certain objects on POST which need some work done on them (mainly fetching something from the web). These objects are persisted to a PostgreSQL database with the help of SQLAlchemy. Since these tasks can take a while, the work is not done inside the request handler but rather offloaded to a daemon process on a different host. When the object is created, I take its ID (which is a client-side generated UUID) and send it via ZeroMQ to the daemon process. The daemon receives the ID, fetches the object from the database, does its work, and writes the result to the database.
Problem: The daemon can receive the ID before the transaction that creates the object is committed. Since we are using pyramid_tm, all database transactions are committed when the request handler returns without an error, and I would rather like to leave it this way. On my dev system everything runs on the same box, so ZeroMQ is lightning fast. On the production system this is most likely not an issue, since the web application and the daemon run on different hosts, but I don't want to count on that.
This problem only recently manifested itself, because we previously used MongoDB with a write_concern of 2. Having only two database servers, the write on the entity always blocked the web request until the entity was persisted (which obviously is not the greatest idea).
Has anyone run into a similar problem?
How did you solve it?
I see multiple possible solutions, but most of them don't satisfy me:
Flushing the transaction manually before triggering the ZMQ message. However, I currently use SQLAlchemy's after_created event to trigger it, and this is really nice since it decouples this process completely, eliminating the risk of "forgetting" to tell the daemon to work. I also think that I would still need a READ UNCOMMITTED isolation level on the daemon side; is this correct?
Adding a timestamp to the ZMQ message, causing the worker thread that receives the message to wait before processing the object. This obviously limits the throughput.
Ditch ZMQ completely and simply poll the database. Noooo!
I would just use PostgreSQL's LISTEN and NOTIFY functionality. The worker can connect to the SQL server (which it already has to do anyway) and issue the appropriate LISTEN. PostgreSQL will then let it know when relevant transactions have finished. Your trigger for generating the notifications in the SQL server could probably even send the entire row in the payload, so the worker doesn't even have to request anything:
CREATE OR REPLACE FUNCTION magic_notifier() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('stuffdone', row_to_json(new)::text);
    RETURN new;
END;
$$ LANGUAGE plpgsql;
With that, right as soon as it knows there is work to do, it has the necessary information, so it can begin work without another round-trip.
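On the worker side, a minimal psycopg2 listener might look like the sketch below (the DSN is a placeholder, the channel name follows the trigger above, and attaching magic_notifier() to the relevant table with CREATE TRIGGER is assumed to have been done separately):

import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=mydb user=worker")  # placeholder DSN
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN stuffdone;")

while True:
    # Wait until the connection's socket becomes readable, then
    # drain any notifications that arrived.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue                    # timeout, poll again
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        row_json = notify.payload   # JSON text produced by row_to_json
        # ... process the object described by row_json ...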
This comes close to your second solution:
Create a buffer, drop the IDs from your ZeroMQ messages into it, and let your worker poll this ID pool regularly. If it fails to retrieve an object for an ID from the database, let the ID sit in the pool until the next poll; otherwise remove the ID from the pool.
You have to deal somehow with the asynchronous behaviour of your system. When the IDs regularly arrive before the objects are persisted in the database, it doesn't matter whether pooling the IDs (and re-polling the same ID) reduces throughput, because the bottleneck is earlier.
An upside is that you could run multiple front ends in front of this.
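A sketch of that retry buffer (MyObject, process(), and make_session() are hypothetical names; the query follows SQLAlchemy conventions):

import time

pending = set()   # IDs received via ZeroMQ whose rows are not yet visible

def on_zmq_message(obj_id):
    pending.add(obj_id)

def poll_pending(session):
    # Try each pending ID; keep the ones whose transaction has not committed yet.
    for obj_id in list(pending):
        obj = session.query(MyObject).get(obj_id)  # MyObject is hypothetical
        if obj is not None:
            pending.discard(obj_id)
            process(obj)                           # process() is hypothetical

while True:
    poll_pending(make_session())                   # make_session() is hypothetical
    time.sleep(1)                                  # poll interval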

How to manage DB connections, especially in the case of multithreading

I am working on an online judge. I am using Python 2.7 and MySQL (I am working on the back-end part).
My Method:
I create a main thread which pulls submissions out of the database (10 at a time) and puts them in a queue. Then I have multiple threads that take submissions from the queue, evaluate them, and write the results back to the database.
Now I have some doubts (I know they are doubts on different topics, but an approach to any of them is also highly appreciated).
Currently, when I start the threads, I give them their own DB connections, which they use. Is it good practice to give one connection per thread? Does sharing connections between threads create problems? How do I go about this?
My main thread uses a single connection, as its only work is to pull submissions from the DB and put them in the queue (and also update their status in the DB to Assessing Submission). But sometimes I get the error: Lost connection to MySQL server during query. I keep getting it even when I stop the program and start it again. What do I do about it? Also, should I implement a pool of connections for the main thread only?
Also, does a DB connection stay alive forever? What do I do when its session memory etc. gets exhausted; how do I handle that?
Use a connection pool. Sharing a database connection is not always bad, but you have to be careful about it. You can try SQLAlchemy to manage a lot of this for you: http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#unitofwork-contextual
The server might be out of connections, or your connection might have been killed because it uses too many resources, etc. A connection pool could help you solve this.
It all depends; in theory a connection could stay alive indefinitely, but usually there is a timeout somewhere.
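For example, an SQLAlchemy engine already maintains a pool, and pool_recycle guards against the server-side timeout behind "Lost connection" errors; a sketch with a placeholder DSN and limits:

from sqlalchemy import create_engine

# Pooled engine: at most 10 connections, each recycled after an hour
# so MySQL's wait_timeout never kills a connection that is still in use.
engine = create_engine("mysql://judge:secret@localhost/judge",
                       pool_size=10, pool_recycle=3600)

conn = engine.connect()
result = conn.execute("SELECT id, status FROM submissions LIMIT 10")
conn.close()   # returns the connection to the pool, not to MySQL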
If you give the same connection to every thread, the threads will interfere with each other's queries and race conditions will occur. So you do need to provide a separate connection to every thread, and indeed it is a good idea. Use a connection pool for this purpose; it will help you hand out distinct connections.
A connection pool will surely help.
Release each connection once your work with it is over. There is a limit on how long a connection may be held, termed the connection timeout, so you may want to use a third-party library to handle that; c3p0 is a good library which can help you with this.
Please refer to the link below to configure it:
Best configuration of c3p0
