Slow MySQL Inserts from Python

I'm trying to insert some data into a MySQL database from Python (using the PyMySQL connector) and I'm getting really poor performance: around 10 rows inserted per second. The table is InnoDB, I'm using a multiple-values INSERT statement, and I've made sure autocommit is turned off. Any ideas why my inserts are still so slow?
I initially thought autocommit wasn't actually being disabled, but I've added code to verify that it is disabled (= 0) for each connection.
Here is my example code:
params = []
for i in range(1, 500):
    params.append([i, i, i, i, i, i])
insertDB(params)

def insertDB(params):
    query = """INSERT INTO test (o_country_id, i_country_id, c_id, period_id, volume, date_created, date_updated)
               VALUES (%s,%s,%s,%s,%s,NOW(),NOW())
               ON DUPLICATE KEY UPDATE trade_volume = %s, date_updated = NOW();"""
    db.insert_many(query, params)

# in the db helper module:
def insert_many(query, params=None):
    cur = _connection.cursor()
    try:
        _connection.autocommit(False)
        cur.executemany(query, params)
        _connection.commit()
    except pymysql.Error as e:
        print("MySQL error %d: %s" % (e.args[0], e.args[1]))
    cur.close()
What else could be the issue? The example above takes about 110 seconds to execute.
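For reference, the autocommit check I mentioned is roughly this (a sketch, not the exact code):

cur = _connection.cursor()
cur.execute("SELECT @@autocommit")
print(cur.fetchone()[0])  # prints 0 when autocommit is disabled for this session
cur.close()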

Not sure what is wrong, but I would try the MySQLdb and/or mysql-connector modules instead and see if you get the same performance numbers.
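Another thing worth checking (an assumption on my part, not verified against your pymysql version): I believe executemany only rewrites an INSERT into a single multi-row statement when the VALUES clause contains nothing but placeholders, so the NOW() calls may be forcing one round trip per row. Here is a sketch that builds the multi-row INSERT by hand, using the table and columns from your question:

rows = [(i, i, i, i, i) for i in range(1, 500)]

# One placeholder group per row; the ON DUPLICATE KEY clause reuses each row's
# own value via VALUES(volume) instead of an extra parameter - adjust if the
# update target really is a separate trade_volume column.
row_placeholder = "(%s,%s,%s,%s,%s,NOW(),NOW())"
sql = (
    "INSERT INTO test "
    "(o_country_id, i_country_id, c_id, period_id, volume, date_created, date_updated) "
    "VALUES " + ",".join([row_placeholder] * len(rows)) +
    " ON DUPLICATE KEY UPDATE volume = VALUES(volume), date_updated = NOW()"
)
flat_params = [value for row in rows for value in row]

cur = _connection.cursor()
cur.execute(sql, flat_params)   # one statement, one round trip for the whole batch
_connection.commit()
cur.close()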

Related

PostgreSQL: All queries running slow except in pgAdmin or DBeaver

I tried to run a Python program that queries our database, but unfortunately every query I run with psycopg2 is extremely slow.
As an example, the same query took 47 ms in DBeaver but takes more than 3 minutes in Python!
In the past I tried to move from DBeaver to the Oracle client, but all my queries in Oracle were so slow that I decided to stay on DBeaver. However, scripting and querying the database is a requirement for my project.
Here is an example of the table "bex" I am querying:
ID  Name    code  code_acr
1   Paris   PAR   PAR
2   Dijon   DIJ   DIJ
3   Brest   BRS   BRT
4   Toulon  TLN   TLN
Here is the code I am using in Python :
import psycopg2

try:
    conn = psycopg2.connect(
        host="xxxxx.sogate-pacy.xxxxxx.fr",
        dbname="xxxxxx",
        user="xxxxxxx",
        password="<xxxxx>",
        port="5432",
        options="-c search_path=xxx",
        sslmode="disable"
    )
    cursor = conn.cursor()
    postgreSQL_select_Query = "SELECT * FROM bex"
    cursor.execute(postgreSQL_select_Query)
    ouvrage = cursor.fetchone()
    print("Print each row and its column values")
    print(cursor.fetchone())
except (Exception, psycopg2.Error) as error:
    print("Error while fetching data from PostgreSQL", error)
finally:
    # closing database connection.
    if conn:
        cursor.close()
        conn.close()
        print("PostgreSQL connection is closed")
I wrote this Python script to get data from the database.
Note that this table has only 10 rows in total, and the slowness happens even if the SELECT returns only one row.
Your Python code fetches the entire bex table into memory in your Python process, then processes the first row and throws the rest away. pgAdmin4 and DBeaver, by contrast, both use cursors (or something equivalent to them) to fetch only a small number of rows until you do something that calls for more. You can use a psycopg2 "named cursor" to get the same behavior in your own Python code as you get with pgAdmin4.
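A minimal sketch of that named-cursor approach (connection details are the placeholders from the question; bex is the table being queried):

import psycopg2

conn = psycopg2.connect(
    host="xxxxx.sogate-pacy.xxxxxx.fr",
    dbname="xxxxxx",
    user="xxxxxxx",
    password="<xxxxx>",
    port="5432",
)
cur = conn.cursor(name="bex_cursor")  # giving the cursor a name makes it server-side
cur.itersize = 100                    # rows pulled per round trip when iterating
cur.execute("SELECT * FROM bex")

print(cur.fetchone())                 # fetches a single row from the server-side cursor

cur.close()
conn.close()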

updating a mysql database from a pandas dataframe with executemany

I am using mysqldb to try to update a lot of records in a database.
cur.executemany("""UPDATE {} set {} =%s Where id = %s """.format(table, ' = %s, '.join(col)),updates.values.tolist())
I get the error message...
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near...
So I tried outputting the actual SQL update statement, since that error message wasn't helpful, using the following code:
cur.execute('set profiling = 1')
try:
    cur.executemany("""UPDATE {} set {} =%s Where id = %s """.format(table, ' = %s, '.join(col)), updates.values.tolist())
except Exception:
    cur.execute('show profiles')
    for row in cur:
        print(row)
That print statement seems to cut off the update statement at 300 characters. I can't find anything in the documentation about limits, so I am wondering: is the print statement doing the truncating, or is it mysqldb?
Is there a way I can generate the update statement with just python rather than mysqldb to see the full statement?
To see exactly what the cursor was executing, you can use the cursor.statement attribute described in the API documentation. That may help with the debugging.
I don't have experience with the MySQL adapter, but I work with the PostgreSQL adapter on a daily basis. At least in that context, it is recommended not to format your query string directly but to let the second argument of cursor.execute do the substitution. This avoids problems with quoted strings and such. Here are two examples; the second one is correct (at least for Postgres):
cur.execute("""UPDATE mytbl SET mycol = %s WHERE mycol2 = %s""".format(val, cond))
cur.execute("""UPDATE mytbl SET mycol = %(myval)s WHERE mycol2 = %(mycond)s""", {'myval': val, 'mycond': cond})
The first (str.format) version can result in the query
UPDATE mytbl SET mycol = abc WHERE mycol2 = xyz
instead of
UPDATE mytbl SET mycol = 'abc' WHERE mycol2 = 'xyz'.
You would have needed to add those quotes explicitly if you did the value substitution in the query yourself, which becomes annoying and circumvents the type handling of the database adapter (keep in mind this is only a text example). See the API for a bit more information on this notation and on the cursor.executemany command.
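For the executemany case in the question, the same idea might look roughly like this (a sketch: "mytbl", "mycol" and "id" are illustrative names, conn is the connection object, and it assumes the adapter accepts pyformat-style %(name)s placeholders, as both psycopg2 and MySQLdb do):

params = [
    {"myval": row.mycol, "mycond": row.id}
    for row in updates.itertuples(index=False)  # updates is the DataFrame from the question
]
cur.executemany(
    "UPDATE mytbl SET mycol = %(myval)s WHERE id = %(mycond)s",
    params,
)
conn.commit()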

Python code running too slow (SQLITE)

I have located a piece of code that runs quite slowly (in my opinion) and would like to know what you think. The code in question is as follows and is supposed to:
Query a database and get 2 fields, a field and its value
Populate the object dictionary with their values
The code is:
query = "SELECT Field, Value FROM metrics " \
"WHERE Status NOT LIKE '%ERROR%' AND Symbol LIKE '{0}'".format(self.symbol)
query = self.db.run(query, True)
if query is not None:
for each in query:
self.metrics[each[0].lower()] = each[1]
The query is run using a db class I created that is very simple:
def run(self, query, onerrorkeeprunning=False):
    # Run query provided and return result
    try:
        con = lite.connect(self.db)
        cur = con.cursor()
        cur.execute(query)
        con.commit()
        runsql = cur.fetchall()
        data = []
        for rows in runsql:
            line = []
            for element in rows:
                line.append(element)
            data.append(line)
        return data
    except lite.Error, e:
        if onerrorkeeprunning is True:
            if con:
                con.close()
            return
        else:
            print 'Error %s:' % e.args[0]
            sys.exit(1)
    finally:
        if con:
            con.close()
I know there are tons of ways of writing this code and I was trying to keep things simple, but for 24 fields this takes 0.03 s, so for 1,000 elements that would be 30 s, which I find a little too long!
EDIT: on further review, runsql = cur.fetchall() is the line that takes the longest to run.
Any help will be much appreciated.
2nd EDIT: Looking further online, I have found that the issue lies with the fetchall() command and not with my query or the initialization of the DB. Has anybody been able to improve the performance of the result fetching? (Some people mentioned changing the SQL code, but that is not to blame; it runs pretty fast, and the slowness comes when you try to grab the results.)
fetchall() reads all results, and returns them in a temporary list.
Your run() function then just puts all the results into another list.
Your top-level code then copies these values into yet another dictionary.
You should instead fetch each row as you need it (which can be done by iterating the cursor directly) and handle it on the spot:
cur.execute("SELECT Field, Value ...")
for row in cur:
self.metrics[row[0].lower()] = row[1]
Note: this only distributes the cost of the SQL query over the iteration; the overall time spent in the database does not change.
What it saves is the time that would have been spent building all the temporary lists.
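As a usage sketch (not the poster's exact classes): the same query run with the symbol bound as a parameter instead of formatted into the SQL string, filling the dictionary straight off the cursor:

import sqlite3

con = sqlite3.connect(db_path)   # db_path: path to the sqlite file (assumed name)
cur = con.cursor()
cur.execute(
    "SELECT Field, Value FROM metrics "
    "WHERE Status NOT LIKE '%ERROR%' AND Symbol LIKE ?",
    (symbol,),                   # symbol: the same value as self.symbol above
)
metrics = {}
for field, value in cur:         # iterate the cursor; no intermediate fetchall() list
    metrics[field.lower()] = value
con.close()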

How to update records in SQL Alchemy in a Loop

I am trying to use SQLSoup - the SQLAlchemy extension - to update records in a SQL Server 2008 database. I am using pyodbc for the connections. There are a number of issues that make it hard to find a relevant example.
I am reprojecting a geometry field in a very large table (2 million+ records), so many of the standard ways of updating fields cannot be used. I need to extract the coordinates from the geometry field as text, convert them, and pass them back in. All this is fine, and all the individual pieces are working.
However, I want to execute a SQL UPDATE statement on each row while looping through the records one by one. I assume this places locks on the recordset, or that the connection is in use, because with the code below it hangs after successfully updating the first record.
Any advice on how to create a new connection, reuse the existing one, or accomplish this another way is appreciated.
s = select([text("%s as fid" % id_field),
            text("%s.STAsText() as wkt" % geom_field)],
           from_obj=[feature_table])
rs = s.execute()
for row in rs:
    new_wkt = ReprojectFeature(row.wkt)
    update_value = "geometry :: STGeomFromText('%s',%s)" % (new_wkt, "3785")
    update_sql = ("update %s set GEOM3785 = %s where %s = %i" %
                  (full_name, update_value, id_field, row.fid))
    conn = db.connection()
    conn.execute(update_sql)
    conn.close()  # or not - no effect..
Updated working code now looks like this. It works fine on a few records, but hangs on the whole table, so I guess it is reading in too much data.
db = SqlSoup(conn_string)
# create outer query
Session = sessionmaker(autoflush=False, bind=db.engine)
session = Session()
rs = session.execute(s)
for row in rs:
    # ...create update sql...
    session.execute(update_sql)
session.commit()
I now get connection busy errors.
DBAPIError: (Error) ('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
It looks like this could be a problem with the ODBC driver - http://sourceitsoftware.blogspot.com/2008/06/connection-is-busy-with-results-for.html
Further Update:
On the server using profiler, it shows the select statement then the first update statement are "starting" but neither complete.
If I set the Select statement to return the top 10 rows, then it does complete and the updates run.
SQL: Batch Starting Select...
SQL: Batch Starting Update...
I believe this is an issue with pyodbc and the SQL Server drivers. If I remove SQLAlchemy and execute the same SQL with pyodbc, it also hangs, even if I create a new connection object for the updates.
I also tried the SQL Server Native Client 10.0 driver, which is meant to allow MARS (Multiple Active Result Sets), but it made no difference. In the end I resorted to "paging the results" and updating these batches using pyodbc and SQL (see below), although I had thought SQLAlchemy would be able to do this for me automatically.
Try using a Session.
rs = s.execute() then becomes rs = session.execute(s), and you can replace the last three lines with session.execute(update_sql). I'd also suggest configuring your Session with autocommit off and calling session.commit() at the end.
Can I suggest that when your process hangs you run sp_who2 on the SQL box and see what is happening. Check for blocked SPIDs and see if you can find anything in the SQL code that suggests what is going on. If you do find a SPID that is blocking others, you can run dbcc inputbuffer(spid) and see if that tells you what query it executed. Otherwise you can also attach the SQL Profiler and trace your calls.
In some cases it could also be parallelism on the SQL Server that causes blocking. Unless this is a data warehouse, I suggest turning your max DOP down (set it to 1). Let me know; I'll check this again in the morning, and if you need help, I'll be glad to help.
Until I find another solution I am using a single connection and custom SQL to return sets of records, and updating these in batches. I don't think what I am doing is a particularly unique case, so I am not sure why I cannot handle multiple result sets simultaneously.
Below works but is very, very slow..
cnxn = pyodbc.connect(conn_string, autocommit=True)
cursor = cnxn.cursor()

# get total recs in the database
s = "select count(fid) as count from table"
count = cursor.execute(s).fetchone().count

# choose number of records to update in each iteration
batch_size = 100

for i in range(1, count, batch_size):
    # sql to bring back relevant records in each batch
    s = """SELECT fid, wkt from(select ROW_NUMBER() OVER(ORDER BY FID ASC) AS 'RowNumber'
           ,FID
           ,GEOM29902.STAsText() as wkt
           FROM %s) features
           where RowNumber >= %i and RowNumber <= %i""" % (full_name, i, i + batch_size)
    rs = cursor.execute(s).fetchall()
    for row in rs:
        new_wkt = ReprojectFeature(row.wkt)
        # ...create update sql statement for the record
        cursor.execute(update_sql)
        counter += 1

cursor.close()
cnxn.close()
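One detail that might be worth double-checking (an assumption on my part, not something verified against this setup): with the Native Client driver, MARS is off by default and has to be requested explicitly in the connection string, for example:

import pyodbc

# Untested sketch: server and database names are placeholders. MARS_Connection=yes
# asks the SQL Server Native Client ODBC driver to enable Multiple Active Result
# Sets; without it, a second statement on the same connection raises the
# "Connection is busy" error shown above.
conn_string = (
    "DRIVER={SQL Server Native Client 10.0};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    "MARS_Connection=yes;"
)
cnxn = pyodbc.connect(conn_string)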

sqlite3 Operation Error when doing many commits rapidly

I get
sqlite3.OperationalError: SQL logic error or missing database
when I run an application I've been working on. What follows is a narrowed-down but complete sample that exhibits the problem for me. This sample uses two tables; one to store users and one to record whether user information is up-to-date in an external directory system. (As you can imagine, the tables are a fair bit longer in my real application). The sample creates a bunch of random users, and then goes through a list of (random) users and adds them to the second table.
#!/usr/bin/env python
import sqlite3
import random

def random_username():
    # Returns one of 10 000 four-letter placeholders for a username
    seq = 'abcdefghij'
    return random.choice(seq) + random.choice(seq) + \
        random.choice(seq) + random.choice(seq)

connection = sqlite3.connect("test.sqlite")
connection.execute('''CREATE TABLE IF NOT EXISTS "users" (
    "entry_id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
    "user_id" INTEGER NOT NULL ,
    "obfuscated_name" TEXT NOT NULL)''')
connection.execute('''CREATE TABLE IF NOT EXISTS "dir_x_user" (
    "user_id" INTEGER PRIMARY KEY NOT NULL)''')

# Create a bunch of random users
random.seed(0)  # get the same results every time
for i in xrange(1500):
    connection.execute('''INSERT INTO users
        (user_id, obfuscated_name) VALUES (?, ?)''',
        (i, random_username()))
connection.commit()

#random.seed()
for i in xrange(4000):
    username = random_username()
    result = connection.execute(
        'SELECT user_id FROM users WHERE obfuscated_name = ?',
        (username, ))
    row = result.fetchone()
    if row is not None:
        user_id = row[0]
        print " %4d %s" % (user_id, username)
        connection.execute(
            'INSERT OR IGNORE INTO dir_x_user (user_id) VALUES(?)',
            (user_id, ))
    else:
        print " ? %s" % username
    if i % 10 == 0:
        print "i = %s; committing" % i
        connection.commit()

connection.commit()
Of particular note is the line near the end that says,
if i % 10 == 0:
In the real application, I'm querying the data from a network resource, and want to commit the users every now and then. Changing that line changes when the error occurs; it seems that when I commit, there is a non-zero chance of the OperationalError. It seems to be somewhat related to the data I'm putting in the database, but I can't determine what the problem is.
Most of the time if I read all the data and then commit only once, an error does not occur. [Yes, there is an obvious work-around there, but a latent problem remains.]
Here is the end of a sample run on my computer:
? cgha
i = 530; committing
? gegh
? aabd
? efhe
? jhji
? hejd
? biei
? eiaa
? eiib
? bgbf
759 bedd
i = 540; committing
Traceback (most recent call last):
File "sqlitetest.py", line 46, in <module>
connection.commit()
sqlite3.OperationalError: SQL logic error or missing database
I'm using Mac OS X 10.5.8 with the built-in Python 2.5.1 and Sqlite3 3.4.0.
As the "lite" part of the name implies, sqlite3 is meant for light-weight database use, not massive scalable concurrency like some of the Big Boys. Seems to me that what's happening here is that sqlite hasn't finished writing the last change you requested when you make another request
So, some options I see for you are:
1. You could spend a lot of time learning about file locking, concurrency, and transactions in sqlite3.
2. You could add some more error-proofing simply by having your app retry the action after the first failure (see the sketch below), as suggested by some on this Reddit post, which includes tips such as "If the code has an effective mechanism for simply trying again, most of sqlite's concurrency problems go away" and "Passing isolation_level=None to connect seems to fix it".
3. You could switch to using a more scalable database like PostgreSQL.
(For my money, #2 or #3 are the way to go.)
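For reference, a minimal retry wrapper along the lines of the second option might look like this (a sketch, using the same connection object as the sample above; the retry count and delay are arbitrary choices):

import time
import sqlite3

def commit_with_retry(connection, retries=5, delay=0.1):
    # Retry commit() a few times if sqlite reports a transient OperationalError.
    for attempt in range(retries):
        try:
            connection.commit()
            return
        except sqlite3.OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)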
