[Python/MySQLdb] - CentOS - Linux - VPS
I have a page that parses a large file and queries the database up to 100 times for each run. The database is pretty large and I'm trying to reduce the execution time of this script.
My SQL functions are inside a class; currently the connection object is a class variable created when the class is instantiated. I have various fetch and query functions that create a cursor from the connection object every time they are called. Would it be faster to create the cursor when the connection object is created and reuse it, or would it be better practice to create the cursor every time it's called?
import MySQLdb as mdb

class parse:
    con = mdb.connect( server, username, password, dbname )
    #cur = con.cursor() ## create here?

    def q( self, q ):
        cur = self.con.cursor() ## it's currently here
        cur.execute( q )
Any other suggestions on how to speed up the script are welcome too. The insert statement is the same for all the queries in the script.
Opening and closing connections is never free; it always costs some amount of performance.
The reason you wouldn't want to just leave the connection open is that if two requests were to come in at the same time, the second request would have to wait until the first had finished before it could do any work.
One way to solve this is to use connection pooling. You create a bunch of open connections and then reuse them. Every time you need to do a query you check a connection out of the pool, perform the request and then put it back into the pool.
Setting all this up can be quite tedious, so I would recommend using SQLAlchemy. It has built-in connection pooling, relatively low overhead and supports MySQL.
Since you care about speed I would only use the Core part of SQLAlchemy, since the ORM part is a bit slower.
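For illustration, a minimal sketch of that approach with SQLAlchemy Core and its built-in pool (the URL, pool sizes and helper name are placeholders, not taken from the question):

from sqlalchemy import create_engine, text

# One engine per process; it maintains the connection pool internally.
# The URL and pool sizes here are placeholders; adjust them for your setup.
engine = create_engine(
    "mysql+mysqldb://username:password@server/dbname",
    pool_size=5,
    max_overflow=10,
)

def q(statement, params=None):
    # Checks a connection out of the pool and returns it when the block exits
    with engine.connect() as con:
        return con.execute(text(statement), params or {}).fetchall()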
Related
I'm trying to refactor some code and have come up with this
def get_inpatients():
    """
    Getting all the inpatients currently sitting in A&E
    """
    cnxn = pyodbc.connect(f'DRIVER={DB_DRIVER};SERVER={DB_SERVER};DATABASE={DB_NAME};UID={DB_USER};PWD={DB_PASS}')
    cursor = cnxn.cursor()
    cursor.execute('EXEC spGetInpatients')
    row = cursor.fetchone()
    while row is not None:
        yield row[0]
        row = cursor.fetchone()
In the main file I then do this
for nhs_number in get_inpatients():
    ...  # This then goes and grabs details from several APIs, meaning
         # it will be a few seconds for each loop
My question is whether a generator is a good choice here. I previously had the function return a list. Thinking about it now, does the generator mean the connection stays open for as long as the for loop is running in the main file, in which case am I better off returning a list?
Yes, the connection will remain open. Whether that is a good idea depends on the circumstances. Normally it is a good idea to use the generator because it allows the processing in your application to run concurrently with the fetching of more rows by the database. It also reduces memory consumption and improves CPU cache efficiency in your application. When done right, it also reduces latency which is very user-visible.
But of course you could run into the maximum connection limit sooner. I'd argue that increasing the connection limit is better than artificially making your application perform worse.
Also note that you can have multiple cursors per connection. See for example
Max SQL connections with Python and pyodbc on a local database showing as 1
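If you do keep the generator, here is a minimal sketch (assuming the same DB_* settings and stored procedure as in the question) that releases the connection as soon as the loop ends or the generator is discarded:

import pyodbc

def get_inpatients():
    """Yield NHS numbers, closing the connection when iteration finishes."""
    cnxn = pyodbc.connect(f'DRIVER={DB_DRIVER};SERVER={DB_SERVER};DATABASE={DB_NAME};UID={DB_USER};PWD={DB_PASS}')
    try:
        cursor = cnxn.cursor()
        cursor.execute('EXEC spGetInpatients')
        for row in cursor:      # the pyodbc cursor is itself an iterator
            yield row[0]
    finally:
        cnxn.close()            # runs when the loop ends or the generator is closed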
I am adding this answer for two reasons:
To point out that the cursor is an iterator.
To make it clearer that the "maximum connection limit" (as per the answer of @Homer512) is a client-side setting, not a server-side one, and that it defaults to 0 both for the database connection and the queries.
So:
According to pyodbc wiki you can avoid the boilerplate code:
The fetchall() function returns all remaining rows in a list. Bear in mind those rows will all be stored in memory, so if there are a lot of rows, you may run out of memory. If you are going to process the rows one at a time, you can use the cursor itself as an iterator:
for row in cursor.execute("select user_id, user_name from users"):
    print(row.user_id, row.user_name)
The connection limit lies on the client side, not the server side.
The comment on that answer reads:
You should clarify what server scoped means. SQL Server has a remote query timeout value that refers to queries it issues over linked servers, not to queries issued by clients to it. I believe the query timeout is a client property, not a server property. The server runs the query indefinitely. There is such a thing as a query governor for addressing this issue, which is disabled by default.
Indeed, the docs verify:
This value applies to an outgoing connection initiated by the Database
Engine as a remote query. This value has no effect on queries received
by the Database Engine. A query will wait until it completes.
Regarding the question of whether it is safe to keep a database connection open for a long time, I found this old but relevant question, which has an extended answer in favor of "yes, if you know what you are doing".
I am using cx_Oracle with Python 3.7 to connect to an Oracle database and execute stored procedures stored in that database.
Right now I am connecting to the database as follows:
dbconstr = "username/password#databaseip/sid"
db_connection = cx_Oracle.connect(dbconstr)
cursor = db_connection.cursor()
#calling sp here
cursor.close()
db_connection.close()
But in this code the connection time for cx_Oracle.connect(dbconstr) is about 250 ms, while the whole request takes about 500 ms; what I want is to get rid of that 250 ms connection time.
I am using a Flask REST API in Python and this code is used for that; 250 ms for the connection is too long when the entire response time is 500 ms.
I have also tried maintaining a connection for the lifetime of the application by declaring a global variable for the connection object and only creating and closing cursors, as shown below, which brings the response time down to about 250 ms:
dbconstr = "username/password#databaseip/sid"
db_connection = cx_Oracle.connect(dbconstr)
def api_response():
cursor = db_connection.cursor()
#calling sp here
cursor.close()
return result
With this method the response time is reduced, but a connection is kept open even when nobody is using the application. After the connection has been idle for a while, the first request that follows becomes very slow, taking several seconds, which is very bad.
So I want help creating stable code with a good response time.
Creating a connection involves a lot of work on the database server: process startup, memory allocation, authentication etc.
Your solution, or using a connection pool, is the way to reduce connection times in Oracle applications. A pool with an acquire and release around the point of use in the app has benefits for planned and unplanned DB maintenance, due to the internal implementation of the pool.
What's the load on your service? You probably want to start a pool and acquire/release connections; see
How to use cx_Oracle session pool with Flask gracefuly? and Unresponsive requests- understanding the bottleneck (Flask + Oracle + Gunicorn) and others. Pro tip: keep the pool small, and make the minimum and maximum sizes the same.
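For illustration only, a rough sketch of that pattern with cx_Oracle's SessionPool (credentials, pool sizes and the procedure name are placeholders):

import cx_Oracle

# Created once at application start-up; min == max keeps the pool size fixed
pool = cx_Oracle.SessionPool(user="username", password="password",
                             dsn="databaseip/sid",
                             min=4, max=4, increment=0, threaded=True)

def api_response():
    connection = pool.acquire()             # cheap: reuses an already-open session
    try:
        cursor = connection.cursor()
        cursor.callproc("my_stored_proc")   # placeholder for the real stored procedure
        cursor.close()
        return "ok"                         # placeholder result
    finally:
        pool.release(connection)            # hand the session back to the pool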
Is there a problem with having connections open? What is that impacting? There are some solutions such as Shared Servers, or DRCP but generally there shouldn't be any need to use them unless your database server is short of memory.
I want to execute multiple queries without each blocking the other. I created multiple cursors and did the following, but got mysql.connector.errors.OperationalError: 2013 (HY000): Lost connection to MySQL server during query
import mysql.connector as mc
from threading import Thread
conn = mc.connect(#...username, password)
cur1 = conn.cursor()
cur2 = conn.cursor()
e1 = Thread(target=cur1.execute, args=("do sleep(30)",)) # A 'time taking' task
e2 = Thread(target=cur2.execute, args=("show databases",)) # A simple task
e1.start()
e2.start()
But I got that OperationalError. And reading a few other questions, some suggest that using multiple connections is better than multiple cursors. So shall I use multiple connections?
I don't have the full context of your situation to understand the performance considerations. Yes, starting a new connection could be considered heavy if you are operating under strict timing constraints that are short relative to the time it takes to start a new connection and you were forced to do that for every query...
But you can mitigate that with a shared connection pool that you create ahead of time, and then distribute your queries (in separate threads) over those connections as resources allow.
On the other hand, if all of your query times are fairly long relative to the time it takes to create a new connection, and you aren't looking to run more than a handful of queries in parallel, then it can be a reasonable option to create connections on demand. Just be aware that you will run into limits with the number of open connections if you try to go too far, as well as resource limitations on the database system itself. You probably don't want to do something like that against a shared database. Again, this is only a reasonable option within some very specific contexts.
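As a rough sketch of the pooled approach with mysql.connector (pool size, credentials and the run_query helper are illustrative, not part of the question):

import mysql.connector.pooling as pooling
from threading import Thread

# Shared pool created once; each worker thread checks out its own connection
pool = pooling.MySQLConnectionPool(pool_name="workers", pool_size=2,
                                   host="localhost", user="username",
                                   password="password")

def run_query(query):
    conn = pool.get_connection()        # raises PoolError if the pool is exhausted
    try:
        cur = conn.cursor()
        cur.execute(query)
        if cur.with_rows:               # consume any result set before reuse
            cur.fetchall()
        cur.close()
    finally:
        conn.close()                    # returns the connection to the pool

e1 = Thread(target=run_query, args=("DO SLEEP(30)",))    # time-taking task
e2 = Thread(target=run_query, args=("SHOW DATABASES",))  # simple task
e1.start()
e2.start()
e1.join()
e2.join()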
I'm playing around with SQLAlchemy core in Python, and I've read over the documentation numerous times and still need clarification about engine.execute() vs connection.execute().
As I understand it, engine.execute() is the same as doing connection.execute(), followed by connection.close().
The tutorials I've followed led me to use this in my code:
Initial setup in script
try:
    engine = db.create_engine("postgres://user:pass@ip/dbname", connect_args={'connect_timeout': 5})
    connection = engine.connect()
    metadata = db.MetaData()
except exc.OperationalError:
    print_error(f":: Could not connect to {db_ip}!")
    sys.exit()
Then, I have functions that handle my database access, for example:
def add_user(a_username):
    query = db.insert(table_users).values(username=a_username)
    connection.execute(query)
Am I supposed to be calling connection.close() before my script ends? Or is that handled efficiently enough by itself? Would I be better off closing the connection at the end of add_user(), or is that inefficient?
If I do need to be calling connection.close() before the script ends, does that mean interrupting the script will cause hanging connections on my Postgres DB?
I found this post helpful to better understand the different interaction paradigms in sqlalchemy, in case you haven't read it yet.
Regarding your question as to when to close your DB connection: it is indeed very inefficient to create and close connections for every statement execution. However, you should make sure that your application does not have connection leaks in its global flow.
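For example, a sketch reusing the engine and add_user() from your question: check a connection out per operation and release everything at shutdown (engine.begin() also commits the insert):

def add_user(a_username):
    query = db.insert(table_users).values(username=a_username)
    # Checks a connection out of the engine's pool, commits on success,
    # and returns the connection to the pool when the block exits
    with engine.begin() as conn:
        conn.execute(query)

# At the very end of the script (or on shutdown), close all pooled connections
engine.dispose()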
I'm trying to serve database query results to ad hoc client requests, but I do not want to open a connection for each individual query. I'm not sure if I'm doing it right.
Current solution is something like this on the "server" side (heavily cut down for clarity):
import rpyc
from rpyc.utils.server import ThreadedServer
import cx_Oracle

conn = cx_Oracle.connect('whatever connect string')
cursor = conn.cursor()

def get_some_data(barcode):
    # do something
    return cursor.execute("whatever query", {'barcode': barcode})

class data_service(rpyc.Service):
    def exposed_get_some_data(self, brcd):
        return get_some_data(brcd)

if __name__ == '__main__':
    s = ThreadedServer(data_service, port=12345, auto_register=False)
    s.start()
This runs okay for a while. However, from time to time the program crashes, and so far I haven't been able to track down when or why it does that.
What I wish to check is the way the database connection is created outside of the data_service class. Is this in itself likely to cause problems?
Many thanks, any thoughts appreciated.
I don't think the problem is that you're creating the connection outside of the class, that should be fine.
I think the problem is that you are creating just one cursor and using it for a long time, which as far as I understand is not how cursors are meant to be used.
You could instead create a new cursor inside get_some_data() each time: open it, execute the query, fetch the results, and close the cursor before returning the data. Cursors are cheap to create compared to connections. (Note that, unlike some other drivers, cx_Oracle has no execute() method on the connection object itself, so you do need an explicit cursor.)
In the long run, if you wish your server to be more robust, you'll need to add some error-handling for when database operations fail or the connection is lost.
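A rough sketch of both points, keeping the placeholder query and connect string from your code (note it returns the fetched rows rather than the cursor object):

def get_some_data(barcode):
    """Run the query on a fresh cursor, reconnecting once if the connection has died."""
    global conn
    for attempt in range(2):
        try:
            cursor = conn.cursor()
            try:
                cursor.execute("whatever query", {'barcode': barcode})
                return cursor.fetchall()
            finally:
                cursor.close()              # short-lived cursor, closed on every call
        except cx_Oracle.DatabaseError:
            if attempt == 1:
                raise                       # reconnect already tried, give up
            conn = cx_Oracle.connect('whatever connect string')  # naive reconnect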
A final note: essentially you've written a very basic database proxy server. There are probably various existing solutions for this already that handle many of the issues you are likely to run into. I recommend at least considering using an existing solution.