I'm trying to update a really large amount of data in a MySQL db and, at the same time, watch the process list to see what it is doing.
So I made the following script:
I have a custom DB class for MySQL that takes care of connecting. Everything works fine unless I use multiprocessing; when I do, at some point I get a "Lost connection to database" error.
The script is like:
from mysql import DB
import multiprocessing

def check_writing(db):
    result = db.execute("show full processlist").fetchall()
    for i in result:
        if i['State'] == "updating":
            print i['Info']

def main(db):
    # some work to create a big list of tuples called data
    sql = "update `table_name` set `field` = %s where `primary_key_id` = %s"
    monitor = multiprocessing.Process(target=check_writing, args=(db,))  # I create the monitor process
    monitor.start()
    db.execute_many(sql, data)  # I start to modify the table
    monitor.terminate()
    monitor.join()

if __name__ == "__main__":
    db = DB(host, user, password, database_name)  # this way I create the connected object
    main(db)
    db.close()
And here is part of my mysql class:
class DB:
    def __init__(self, host, user, password, db_name):
        self.db = MySQLdb.connect(host=host, ...)  # etc.

    def execute_many(self, sql, data):
        c = self.db.cursor()
        c.executemany(sql, data)
        c.close()
        self.db.commit()
As I said before, if I don't run check_writing, the script works fine.
Can someone explain what the cause is and how I can overcome it? I also have problems writing to MySQL from a thread pool using map (or map_async).
Am I missing something related to MySQL?
There is a better way to approach this:
Connector/Python Connection Pooling:
The mysql.connector.pooling module implements pooling.
A pool opens a number of connections and handles thread safety when providing connections to requesters.
The size of a connection pool is configurable at pool creation time. It cannot be resized thereafter.
It is possible to have multiple connection pools. This enables applications to support pools of connections to different MySQL servers, for example.
Check the documentation here.
I think your parallel processes are exhausting your mysql connections.
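For example, a minimal sketch of that approach (the pool name, size, and credentials below are placeholders, not taken from your code):

import mysql.connector.pooling

# a small shared pool; size and credentials are placeholders
pool = mysql.connector.pooling.MySQLConnectionPool(
    pool_name="mypool",
    pool_size=5,           # fixed at creation time, cannot be resized later
    host="localhost",
    user="user",
    password="password",
    database="database_name",
)

def run_query(sql):
    conn = pool.get_connection()   # borrow a connection from the pool
    try:
        cur = conn.cursor()
        cur.execute(sql)
        rows = cur.fetchall()
        cur.close()
        return rows
    finally:
        conn.close()               # returns the connection to the pool, not to the server

Note that, as discussed further down, connections are not fork-safe, so each worker process should create its own pool (or connection) after the fork rather than inheriting one from the parent.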
Related
I have a generic multiprocess script that could run any task in a multiprocess set up. I inject the task as a command line argument and use getattr to call the functions in the injected code.
taskModule = importlib.import_module(taskFile.replace(".py", ""))
taskContext = getattr(taskModule, 'init')()
response = pool.map_async(getattr(taskModule, 'run'), inputList)
The init() function creates all relevant variables for the task to execute and returns them as a dict object - the taskContext. inputList is a list of dict objects, each dict containing both the taskContext object as well as the specific item to be processed, so that each process gets a unique item to process along with a copy of the context required by the task.
One of those tasks is meant for FTP and the taskContext in that case contains information on the FTP server along with other details. The run function in the FTP task pretty much opens a connection using the context variables, uploads the required files and closes it, and this works perfectly.
However, I think it'd be good if I could set up a connection pool with multiple FTP connections at the start, as part of the init() function when the context is created, and then use them in an as-available fashion within the run method, very similar to a DB connection pool that prevents the need to open and close connections to the database every time.
Is this even feasible? If so, what's the best way to go about doing it?
I put together a connection_pool module as part of a proof of concept. I'm not sure how robust it is.
I added connection closing to this which is a bugfix of this.
I was able to set up connection pooling of FTP and SFTP connections transferring a few thousand files over 10-20 threads.
You can install my version from conda:
conda install -c jmeppley connectionpool
Creating an FTP pool looks something like this:
import ftplib
from functools import partial

import ftputil.session
# ConnectionPool is provided by the connectionpool package installed above

# this is from snakemake
def connect(*args_to_use, **kwargs_to_use):
    ftp_base_class = (
        ftplib.FTP_TLS if kwargs_to_use["encrypt_data_channel"] else ftplib.FTP
    )
    ftp_session_factory = ftputil.session.session_factory(
        base_class=ftp_base_class,
        port=kwargs_to_use["port"],
        encrypt_data_channel=kwargs_to_use["encrypt_data_channel"],
        debug_level=None,
    )
    return ftputil.FTPHost(
        kwargs_to_use["host"],
        kwargs_to_use["username"],
        kwargs_to_use["password"],
        session_factory=ftp_session_factory,
    )

# a function to create a pool using the connect() method
def create_ftp_pool(pool_size, *args_to_use, **kwargs_to_use):
    create_callback = partial(connect, *args_to_use, **kwargs_to_use)
    connection_pool = ConnectionPool(create_callback,
                                     close=lambda c: c.close(),
                                     max_size=pool_size)
    return connection_pool

# create a pool with the arguments you'd use to create a connection
pool_size = 10
ftp_pool = create_ftp_pool(pool_size, host=...)

# use item() as a context manager
with ftp_pool.item() as connection:
    ...
I followed the below code in order to implement a parallel select query on a postgres database:
https://tech.geoblink.com/2017/07/06/parallelizing-queries-in-postgresql-with-python/
My basic problem is that I have ~6k select queries that need to be executed, and I am trying to optimise their execution. Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with that query using up > 4GB of RAM on the machine it ran on, so I decided to split it out into 6k individual queries, which, when run synchronously, keep memory usage steady. However, it takes a lot longer to run, which is less of an issue for my use case. Even so, I am trying to reduce the time as much as possible.
This is what my code looks like:
import logging
import multiprocessing
from itertools import chain

import psycopg2
from sqlalchemy import create_engine

LOGGER = logging.getLogger(__name__)

class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.engine = self.init_connection()
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS)

    def init_connection(self):
        LOGGER.info('Creating Postgres engine')
        return create_engine(self.db_url)

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            self.pool.close()
            self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        con = psycopg2.connect(self.db_url)
        cur = con.cursor()
        cur.execute(query)
        records = cur.fetchall()
        con.close()
        return list(records)
However whenever this runs, I get the following error:
TypeError: can't pickle _thread.RLock objects
I've read lots of similar questions regarding the use of multiprocessing and pickleable objects, but I can't for the life of me figure out what I am doing wrong.
The pool is created once per process (which I believe is best practice) but shared per instance of the connector class, so that it's not creating a pool for each use of the run_parallel_queries method.
The top answer to a similar question:
Accessing a MySQL connection pool from Python multiprocessing
shows an almost identical implementation to my own, except using MySQL instead of Postgres.
Am I doing something wrong?
Thanks!
EDIT:
I've found this answer:
Python Postgres psycopg2 ThreadedConnectionPool exhausted
which is incredibly detailed and makes it look as though I had misunderstood what multiprocessing.Pool gives me versus a connection pool such as ThreadedConnectionPool. However, the first link doesn't mention needing any connection pools etc. This solution seems good, but it seems like a lot of code for what I think is a fairly simple problem.
EDIT 2:
So the above link solves another problem, which I would likely have run into anyway, so I'm glad I found it, but it doesn't solve the initial issue of not being able to use imap_unordered because of the pickling error. Very frustrating.
Lastly, I think it's probably worth noting that this runs on Heroku, on a worker dyno, using Redis RQ for scheduling, background tasks, etc., and a hosted instance of Postgres as the database.
To put it simply, a Postgres connection and a SQLAlchemy connection pool are thread-safe; however, they are not fork-safe.
If you want to use multiprocessing, you should initialize the engine in each child process after the fork.
You should use multithreading instead if you want to share engines.
Refer to Thread and process safety in the psycopg2 documentation:
libpq connections shouldn't be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.
If you are using multiprocessing.Pool, there is a keyword argument initializer which can be used to run code once on each child process. Try this:
class PostgresConnector(object):
    def __init__(self, db_url):
        self.db_url = db_url
        self.pool = self.init_pool()

    def init_pool(self):
        CPUS = multiprocessing.cpu_count()
        return multiprocessing.Pool(CPUS, initializer=self.init_connection(self.db_url))

    @classmethod
    def init_connection(cls, db_url):
        def _init_connection():
            LOGGER.info('Creating Postgres engine')
            cls.engine = create_engine(db_url)
        return _init_connection

    def run_parallel_queries(self, queries):
        results = []
        try:
            for i in self.pool.imap_unordered(self.execute_parallel_query, queries):
                results.append(i)
        except Exception as exception:
            LOGGER.error('Error whilst executing %s queries in parallel: %s', len(queries), exception)
            raise
        finally:
            pass
            # self.pool.close()
            # self.pool.join()
        LOGGER.info('Parallel query ran producing %s sets of results of type: %s', len(results), type(results))
        return list(chain.from_iterable(results))

    def execute_parallel_query(self, query):
        with self.engine.connect() as conn:
            with conn.begin():
                result = conn.execute(query)
                return result.fetchall()

    def __getstate__(self):
        # this is a hack; if you want to remove this method, you should
        # remove self.pool and just pass the pool explicitly
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict
Now, to address the XY problem.
Initially it was a single query whose where id in (...) clause contained all 6k predicate IDs, but I ran into issues with that query using up > 4GB of RAM on the machine it ran on, so I decided to split it out into 6k individual queries, which, when run synchronously, keep memory usage steady.
What you may want to do instead is one of these options:
write a subquery that generates all 6000 IDs and use the subquery in your original bulk query.
as above, but write the subquery as a CTE
if your ID list comes from an external source (i.e. not from the database), then you can create a temporary table containing the 6000 IDs and then run your original bulk query against the temporary table, as sketched below
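For example, a rough sketch of the temporary-table option with psycopg2 (the names my_table, tmp_ids and id_list are illustrative, not from your schema):

import psycopg2
from psycopg2.extras import execute_values

con = psycopg2.connect(db_url)   # db_url as in your connector
cur = con.cursor()

# the temporary table is only visible to this session
cur.execute("CREATE TEMPORARY TABLE tmp_ids (id bigint PRIMARY KEY)")
execute_values(cur, "INSERT INTO tmp_ids (id) VALUES %s", [(i,) for i in id_list])

# one bulk query joined against the temporary table instead of a 6000-item IN list
cur.execute("SELECT t.* FROM my_table t JOIN tmp_ids USING (id)")
records = cur.fetchall()
con.close()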
However, if you insist on running 6000 IDs through Python, then the fastest query is likely neither to do all 6000 IDs in one go (which will run out of memory) nor to run 6000 individual queries. Instead, you may want to try chunking the queries. Send 500 IDs at once, for example. You will have to experiment with the chunk size to determine the largest number of IDs you can send at one time while still staying comfortably within your memory budget.
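To illustrate the chunking idea, here is a minimal sketch that reuses run_parallel_queries from the question (the chunk size of 500 and the table/column names are placeholders you would tune and adapt):

def chunks(ids, size=500):
    # yield successive slices of the ID list
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# assumes the IDs are integers; otherwise use bind parameters instead of string formatting
queries = [
    "SELECT * FROM my_table WHERE id IN ({})".format(", ".join(str(i) for i in chunk))
    for chunk in chunks(id_list, 500)
]

connector = PostgresConnector(db_url)
results = connector.run_parallel_queries(queries)   # ~12 queries of 500 IDs instead of 6000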
I have a Flask web app that might need to execute some heavy SQL queries via SQLAlchemy, depending on user input. I would like to set a timeout for the query, let's say 20 seconds, so if a query takes more than 20 seconds the server displays an error message to the user, and they can try again later or with smaller inputs.
I have tried both the multiprocessing and threading modules, both with the Flask development server and with Gunicorn, without success: the server keeps blocking and no error message is returned. You'll find an excerpt of the code below.
How do you handle slow SQL queries in Flask in a user-friendly way?
Thanks.
from multiprocessing import Process

@app.route("/long_query")
def long_query():
    query = db.session.query(User)

    def run_query():
        nonlocal query
        query = query.all()

    p = Process(target=run_query)
    p.start()
    p.join(20)  # timeout of 20 seconds
    if p.is_alive():
        p.terminate()
        return render_template("error.html", message="please try with smaller input")
    return render_template("result.html", data=query)
I would recommend using Celery or something similar (people use python-rq for simple workflows).
Take a look at the Flask documentation regarding Celery: http://flask.pocoo.org/docs/0.12/patterns/celery/
As for dealing with the results of a long-running query: you can create an endpoint for requesting task results and have the client application periodically check this endpoint until the results are available.
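A rough sketch of that pattern with Celery (the run_query_task task, the celery and app instances, and the routes are illustrative assumptions, not from the question):

from celery.result import AsyncResult
from flask import jsonify, request

@app.route("/long_query")
def long_query():
    # enqueue the work instead of running it inside the request
    task = run_query_task.delay(request.args.get("input"))
    return jsonify(task_id=task.id), 202

@app.route("/long_query/<task_id>")
def long_query_result(task_id):
    # the client polls this endpoint until the task has finished
    result = AsyncResult(task_id, app=celery)
    if not result.ready():
        return jsonify(status="pending"), 202
    return jsonify(status="done", data=result.get())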
As leovp mentioned, Celery is the way to go if you are working on a long-term project. However, if you are working on a small project where you want something easy to set up, I would suggest going with RQ and its Flask plugin. Also, think very seriously before terminating processes while they are querying the database, since they might not be able to clean up after themselves (e.g. releasing locks they hold on the db).
Well, if you really want to terminate queries on a timeout, I would suggest you use a database that supports it (PostgreSQL is one). I will assume you use PostgreSQL for this section.
from sqlalchemy.interfaces import ConnectionProxy
class ConnectionProxyWithTimeouts(ConnectionProxy):
    def cursor_execute(self, execute, cursor, statement, parameters, context, executemany):
        timeout = context.execution_options.get('timeout', None)
        if timeout:
            c = cursor._parent.cursor()
            c.execute('SET statement_timeout TO %d;' % int(timeout * 1000))
            c.close()
            ret = execute(cursor, statement, parameters, context)
            c = cursor._parent.cursor()
            c.execute('SET statement_timeout TO 0')
            c.close()
            return ret
        else:
            return execute(cursor, statement, parameters, context)
Then, when you create the engine, you would pass your own connection proxy:
engine = create_engine(URL, proxy=ConnectionProxyWithTimeouts(), pool_size=1, max_overflow=0)
And then you could query like this:
User.query.execution_options(timeout=20).all()
If you want to use the code above, use it only as a base for your own implementation, since I am not 100% sure it's bug free.
Do queries executed with the same SQLAlchemy session object use the same underlying connection? If not, is there a way to ensure this?
Some background: I have a need to use MySQL's named lock feature, i.e. GET_LOCK() and RELEASE_LOCK() functions. As far as the MySQL server is concerned, only the connection that obtained the lock can release it - so I have to make sure that I either execute these two commands within the same connection or the connection dies to ensure the lock is released.
To make things nicer, I have created a "locked" context like so:
from contextlib import contextmanager

@contextmanager
def mysql_named_lock(session, name, timeout):
    """Get a named mysql lock on a session
    """
    lock = session.execute("SELECT GET_LOCK(:name, :timeout)",
                           {"name": name, "timeout": timeout}).scalar()
    if lock:
        try:
            yield session
        finally:
            session.execute("SELECT RELEASE_LOCK(:name)", {"name": name})
    else:
        e = "Could not obtain named lock {} within {} seconds".format(
            name, timeout)
        raise RuntimeError(e)

def my_critical_section(session):
    with mysql_named_lock(session, __name__, 10) as lockedsession:
        thing = lockedsession.query(MyStuff).one()
    return thing
I want to make sure that the two execute calls in mysql_named_lock happen on the same underlying connection or the connection is closed.
Can I assume this would "just work" or is there anything I need to be aware of here?
It will "just work" if (a) your session is a scoped_session and (b) you are using it in a non-concurrent fashion (same pid/thread). If you're paranoid, make sure (assert) that you're using the same connection ID via
session.connection().connection.thread_id()
Also, there is no point in passing the session as an argument. Initialize it once, somewhere in your application's global scope; then, called from anywhere in your code, it will give you the same connection ID.
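For example, a minimal sanity check along those lines (assuming session is the scoped_session from the question):

thread_id_before = session.connection().connection.thread_id()
with mysql_named_lock(session, __name__, 10) as lockedsession:
    # same scoped_session in the same thread -> same underlying MySQL connection
    assert lockedsession.connection().connection.thread_id() == thread_id_before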
I use Python multiprocessing processes to establish multiple connections to a PostgreSQL database via psycopg2.
Every process establishes a connection, creates a cursor, fetches an object from an mp.Queue and does some work on the database. If everything works fine, the changes are committed and the connection is closed.
If, however, one of the processes causes an error (e.g. an ADD COLUMN request fails because the column is already present), all the processes seem to stop working.
import psycopg2
import multiprocessing as mp
import Queue

def connect():
    C = psycopg2.connect(host="myhost", user="myuser", password="supersafe", port=62013, database="db")
    cur = C.cursor()
    return C, cur

def commit_and_close(C, cur):
    C.commit()
    cur.close()
    C.close()

def commit(C):
    C.commit()

def sub(queue):
    C, cur = connect()
    while not queue.empty():
        work_element = queue.get(timeout=1)
        # do something with the work element, that might produce an SQL error
    commit_and_close(C, cur)
    return 0

if __name__ == '__main__':
    job_queue = mp.Queue()
    # Fill job_queue
    print 'Run'
    for i in range(20):
        p = mp.Process(target=sub, args=(job_queue,))
        p.start()
I can see that the processes are still alive (because the job_queue is still full), but no network traffic / SQL actions are happening. Is it possible that an SQL error blocks communication from the other subprocesses? How can I prevent that from happening?
As chance would have it, I was doing something similar today.
It shouldn't be that the state of one connection can affect a different one, so I don't think we should start there.
There is clearly a race condition in your queue handling. You check whether the queue is empty and then try to get a statement to execute. With multiple readers, one of the others could empty the queue, leaving the rest blocking on their queue.get. If the queue is empty when they all lock up, then I would suspect this.
You also never join your processes back when they complete. I'm not sure what effect that would have in the larger picture, but it's probably good practice to clean up.
The other thing that might be happening is that your erroring process is not rolling back properly. That might leave other transactions waiting to see whether it commits or rolls back. They can wait for quite a long time by default, but you can configure it.
To see what is happening, fire up psql and check out two useful system views: pg_stat_activity and pg_locks. That should show where the cause lies.
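To make that concrete, here is a hedged sketch of how the worker from the question might be adjusted along those lines: use a timeout on queue.get instead of the empty()/get() race, roll back on SQL errors so locks are released (committing per work element here is an assumption about the desired behaviour), and join the workers at the end.

def sub(queue):
    C, cur = connect()
    while True:
        try:
            work_element = queue.get(timeout=1)
        except Queue.Empty:
            break                   # no more work: exit instead of blocking on get()
        try:
            # do something with work_element that might produce an SQL error
            C.commit()
        except psycopg2.Error:
            C.rollback()            # release any locks so other workers are not left waiting
    commit_and_close(C, cur)
    return 0

if __name__ == '__main__':
    job_queue = mp.Queue()
    # Fill job_queue
    workers = [mp.Process(target=sub, args=(job_queue,)) for _ in range(20)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()                    # clean up the child processes when they finish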