Python: Using a generator when accessing a SQL source - python

I'm trying to refactor some code and have come up with this
def get_inpatients():
"""
Getting all the inpatients currently sitting in A&E
"""
cnxn = pyodbc.connect(f'DRIVER={DB_DRIVER};SERVER={DB_SERVER};DATABASE={DB_NAME};UID={DB_USER};PWD={DB_PASS}')
cursor = cnxn.cursor()
cursor.execute('EXEC spGetInpatients')
row = cursor.fetchone()
while row is not None:
yield row[0]
row = cursor.fetchone()
In the main file I then do this
for nhs_number in get_inpatients():
.... # This then goes and grabs details from several APIs meaning
# it will be a few seconds for each loop
My question is whether a genertaor is a good choice here. I previously had it so that the function would return a list. Thinking about it now, would this then mean the connection is open for as long as the for loop is running in the main file in which case I am better returning a list?

Yes, the connection will remain open. Whether that is a good idea depends on the circumstances. Normally it is a good idea to use the generator because it allows the processing in your application to run concurrently with the fetching of more rows by the database. It also reduces memory consumption and improves CPU cache efficiency in your application. When done right, it also reduces latency which is very user-visible.
But of course you could run into the maximum connection limit sooner. I'd argue that increasing the connection limit is better than artifically making your application perform worse.
Also note that you can have multiple cursors per connection. See for example
Max SQL connections with Python and pyodbc on a local database showing as 1

I am adding this answer for 2 reasons.
To point that the cursor is an iterator
To make more clear that the "maximum connection limit" (as per the answer of #Homer512) is a client side setting and not a server side one and defaults to 0 both for the database connection and the queries.
So:
According to pyodbc wiki you can avoid the boilerplate code:
The fetchall() function returns all remaining rows in a list. Bear
in mind those rows will all be stored in memory so if there a lot
of rows, you may run out of memory. If you are going to process the rows one at a time, you can use the
cursor itself as an iterator:
for row in cursor.execute("select user_id, user_name from users"):
print(row.user_id, row.user_name)
The connection limit lies in the client side and not the server side.
The comment on that answer reads:
You should clarify what server scoped means. SQL Server has a remote
query timeout value that refers to its queries issued on over linked
servers, not to queries issued by clients to it. I believe the query
timeout is a client property, not a server property. The server runs
the query indefinitely. There is such a thing as a query governor for
addressing this issue which is disabled by default.
Indeed, the docs verify:
This value applies to an outgoing connection initiated by the Database
Engine as a remote query. This value has no effect on queries received
by the Database Engine. A query will wait until it completes.
Regarding the question if it is safe to keep open a database connection for a long time, I found this old but relevant question which has an extended answer in favor of "yes, if you know what you are doing".

Related

Why does PostgreSQL say FATAL: sorry, too many clients already when I am nowhere close to the maximum connections?

I am working with an installation of PostgreSQL 11.2 that periodically complains in its system logs
FATAL: sorry, too many clients already
despite being no-where close to its configured limit of connections. This query:
SELECT current_setting('max_connections') AS max,
COUNT(*) AS total
FROM pg_stat_activity
tells me that the database is configured for a maximum of 100 connections. I have never seen over about 45 connections into the database with this query, not even moments before a running program receives a database error saying too many clients backed by the above message in the Postgres logs.
Absolutely everything I can find on issue on the Internet this suggests that the error means you have exceeded the max_connections setting, but the database itself tells me that I am not.
For what it's worth, pyspark is the only database client that triggers this error, and only when it's writing into tables from dataframes. The regular python code using psycopg2 (that is the main client) never triggers it (not even when writing into tables in the same manner from Pandas dataframes), and admin tools like pgAdmin also never trigger it. If I didn't see the error in the database logs directly, I would think that Spark is lying to me about the error. Most of the time, if I use a query like this:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND application_name LIKE 'pgAdmin%';
then the problem goes away for several days. But like I said, I've never seen even 50% of the supposed max of 100 connections in use, according to the database itself. How do I figure out what is causing this error?
This is caused by how Spark reads/writes data using JDBC. Spark tries to open several concurrent connections to the database in order to read/write multiple partitions of data in parallel.
I couldn't find it in the docs but I think by default the number of connections is equal to the number of partitions in the datafame you want to write into db table. This explains the intermittency you've noticed.
However, you can control this number by setting numPartitions option:
The maximum number of partitions that can be used for parallelism in
table reading and writing. This also determines the maximum number of
concurrent JDBC connections. If the number of partitions to write
exceeds this limit, we decrease it to this limit by calling
coalesce(numPartitions) before writing.
Example:
spark.read.format("jdbc") \
.option("numPartitions", "20") \
# ...
Three possibilities:
The connections are very short-lived, and they were already gone by the time you looked.
You have a lower connection limit on that database.
You have a lower connection limit on the database user.
But options 2 and 3 would result in a different error message, so it must be the short-lived connections.
Whatever it is, the answer to your problem would be a well-configured connection pool.

How to use multiple cursors in mysql.connector?

I want to execute multiple queries without each blocking other. I created multiple cursors and did the following but got mysql.connector.errors.OperationalError: 2013 (HY000): Lost connection to MySQL server during query
import mysql.connector as mc
from threading import Thread
conn = mc.connect(#...username, password)
cur1 = conn.cursor()
cur2 = conn.cursor()
e1 = Thread(target=cur1.execute, args=("do sleep(30)",)) # A 'time taking' task
e2 = Thread(target=cur2.execute, args=("show databases",)) # A simple task
e1.start()
e2.start()
But I got that OperationalError. And reading a few other questions, some suggest that using multiple connections is better than multiple cursors. So shall I use multiple connections?
I don't have the full context of your situation to understand the performance considerations. Yes, starting a new connection could be considered heavy if you are operating under strict timing constraints that are short relative to the time it takes to start a new connection and you were forced to do that for every query...
But you can mitigate that with a shared connection pool that you create ahead of time, and then distribute your queries (in separate threads) over those connections as resources allow.
On the other hand, if all of your query times are fairly long relative to the time it takes to create a new connection, and you aren't looking to run more than a handful of queries in parallel, then it can be a reasonable option to create connections on demand. Just be aware that you will run into limits with the number of open connections if you try to go too far, as well as resource limitations on the database system itself. You probably don't want to do something like that against a shared database. Again, this is only a reasonable option within some very specific contexts.

Avoiding or handling "BadRequestError: The requested query has expired."?

I'm looping over data in app engine using chained deferred tasks and query cursors. Python 2.7, using db (not ndb). E.g.
def loop_assets(cursor = None):
try:
assets = models.Asset.all().order('-size')
if cursor:
assets.with_cursor(cursor)
for asset in assets.run():
if asset.is_special():
asset.yay = True
asset.put()
except db.Timeout:
cursor = assets.cursor()
deferred.defer(loop_assets, cursor = cursor, _countdown = 3, _target = version, _retry_options = dont_retry)
return
This ran for ~75 minutes total (each task for ~ 1 minute), then raised this exception:
BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
Reading the docs, the only stated cause of this is:
New App Engine releases may change internal implementation details, invalidating cursors that depend on them. If an application attempts to use a cursor that is no longer valid, the Datastore raises a BadRequestError exception.
So maybe that's what happened, but it seems a co-incidence that the first time I ever try this technique I hit a 'change in internal implementation' (unless they happen often).
Is there another explanation for this?
Is there a way to re-architect my code to avoid this?
If not, I think the only solution is to mark which assets have been processed, then add an extra filter to the query to exclude those, and then manually restart the process each time it dies.
For reference, this question asked something similar, but the accepted answer is 'use cursors', which I am already doing, so it cant be the same issue.
You may want to look at AppEngine MapReduce
MapReduce is a programming model for processing large amounts of data
in a parallel and distributed fashion. It is useful for large,
long-running jobs that cannot be handled within the scope of a single
request, tasks like:
Analyzing application logs
Aggregating related data from external sources
Transforming data from one format to another
Exporting data for external analysis
When I asked this question, I had run the code once, and experienced the BadRequestError once. I then ran it again, and it completed without a BadRequestError, running for ~6 hours in total. So at this point I would say that the best 'solution' to this problem is to make the code idempotent (so that it can be retried) and then add some code to auto-retry.
In my specific case, it was also possible to tweak the query so that in the case that the cursor 'expires', the query can restart w/o a cursor where it left off. Effectively change the query to:
assets = models.Asset.all().order('-size').filter('size <', last_seen_size)
Where last_seen_size is a value passed from each task to the next.

Python MySQLdb - Is there a benefit to leaving the cursor open?

[Python/MySQLdb] - CentOS - Linux - VPS
I have a page that parses a large file and queries the datase up to 100 times for each run. The database is pretty large and I'm trying to reduce the execution time of this script.
My SQL functions are inside a class, currently the connection object is a class variable created when the class is instantiated. I have various fetch and query functions that create a cursor from the connection object every time they are called. Would it be faster to create the cursor when the connection object is created and reuse it or would it be better practice to create the cursor every time it's called?
import MySQLdb as mdb
class parse:
con = mdb.connect( server, username, password, dbname )
#cur = con.cursor() ## create here?
def q( self, q ):
cur = self.con.cursor() ## it's currently here
cur.execute( q )
Any other suggestions on how to speed up the script are welcome too. The insert statement is the same for all the queries in the script.
Opening and closing connections is never free, it always wastes some amount of performance.
The reason you wouldn't want to just leave the connection open is that if two requests were to come in at the same time the second request would have to wait till the first request had finished before it could do any work.
One way to solve this is to use connection pooling. You create a bunch of open connections and then reuse them. Every time you need to do a query you check a connection out of the pool, preform the request and then put it back into the pool.
Setting all this up can be quite tedious, so I would recommend using SQLAlchemy. It has built in connection pooling, relatively low overhead and supports MySQL.
Since you care about speed I would only use the core part of SQLAlchemy since the ORM part comes is a bit slower.

Psycopg / Postgres : Connections hang out randomly

I'm using psycopg2 for the cherrypy app I'm currently working on and cli & phpgadmin to handle some operations manually. Here's the python code :
#One connection per thread
cherrypy.thread_data.pgconn = psycopg2.connect("...")
...
#Later, an object is created by a thread :
class dbobj(object):
def __init__(self):
self.connection=cherrypy.thread_data.pgconn
self.curs=self.connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
...
#Then,
try:
blabla
self.curs.execute(...)
self.connection.commit()
except:
self.connection.rollback()
lalala
...
#Finally, the destructor is called :
def __del__(self):
self.curs.close()
I'm having a problem with either psycopg or postgres (altough I think the latter is more likely). After having sent a few queries, my connections drop dead. Similarly, phpgadmin -usually- gets dropped as well ; it prompts me to reconnect after having made requests several times. Only the CLI remains persistent.
The problem is, these happen very randomly and I can't even track down what the cause is. I can either get locked down after a few page requests or never really encounter anything after having requested hundreds of pages. The only error I've found in postgres log, after terminating the app is :
...
LOG: unexpected EOF on client connection
LOG: could not send data to client: Broken pipe
LOG: unexpected EOF on client connection
...
I thought of creating a new connection every time a new dbobj instance is created but I absolutely don't want to do this.
Also, I've read that one may run into similar problems unless all transactions are committed : I use the try/except block for every single INSERT/UPDATE query, but I never use it for SELECT queries nor do I want to write even more boilerplate code (btw, do they need to be committed ?). Even if that's the case, why would phpgadmin close down ?
max_connections is set to 100 in the .conf file, so I don't think that's the reason either. A single cherrypy worker has only 10 threads.
Does anyone have an idea where I should look first ?
Psycopg2 needs a commit or rollback after every transaction, including SELECT queries, or it leaves the connections "IDLE IN TRANSACTION". This is now a warning in the docs:
Warning: By default, any query execution, including a simple SELECT will start a transaction: for long-running programs, if no further action is taken, the session will remain “idle in transaction”, an undesirable condition for several reasons (locks are held by the session, tables bloat...). For long lived scripts, either ensure to terminate a transaction as soon as possible or use an autocommit connection.
It's a bit difficult to see exactly where you're populating and accessing cherrypy.thread_data. I'd recommend investigating psycopg2.pool.ThreadedConnectionPool instead of trying to bind one conn to each thread yourself.
Even though I don't have any idea why successful SELECT queries block the connection, spilling .commit() after pretty much every single query that doesn't have to work in conjunction with another solved the problem.

Categories

Resources