I have located a piece of code that runs quite slow (in my opinion) and would liek to know what you guys think. The code in question is as follows and is supposed to:
Query a database and get 2 fields, a field and its value
Populate the object dictionary with their values
The code is:
query = "SELECT Field, Value FROM metrics " \
"WHERE Status NOT LIKE '%ERROR%' AND Symbol LIKE '{0}'".format(self.symbol)
query = self.db.run(query, True)
if query is not None:
for each in query:
self.metrics[each[0].lower()] = each[1]
The query is run using a db class I created that is very simple:
def run(self, query, onerrorkeeprunning=False):
# Run query provided and return result
try:
con = lite.connect(self.db)
cur = con.cursor()
cur.execute(query)
con.commit()
runsql = cur.fetchall()
data = []
for rows in runsql:
line = []
for element in rows:
line.append(element)
data.append(line)
return data
except lite.Error, e:
if onerrorkeeprunning is True:
if con:
con.close()
return
else:
print 'Error %s:' % e.args[0]
sys.exit(1)
finally:
if con:
con.close()
I know there are tons of ways of writting this code and I was trying to keep things simple but for 24 fields this takes 0.03s so if I have 1,000 elements that is 30s and I find it a little too long!
EDIT: on further review, runsql = cur.fetchall() is the line that takes the most to run.
Any help will be much appreciated.
2nd EDIT: Looking online further, I have found the issue lies with the fetchall() commant and not with my query or the initialization of the DB. Has anybody been able to imporve the performance of the result fetching? (Some people mentioned changing the SQL code but this is not to blame, it runs pretty fast but then the slowness comes when you try to grab those results)
fetchall() reads all results, and returns them in a temporary list.
Your run() function then just puts all the results into another list.
Your top-level code then copies these values into yet another dictionary.
You should fetch only the row you need (which can be done directly on the cursor), and handle it directly:
cur.execute("SELECT Field, Value ...")
for row in cur:
self.metrics[row[0].lower()] = row[1]
Note: this distributes the cost of the SQL query over all for iteration; the overall time spent in the database does not change.
This code improves only on the time that would have been spent handling all the temporary variables.
Related
I'm new to SQL and psycopg2. I'm playing around a bit and try to find our how to display the results of a query. I have a small script where I make a connection to the database and create a cursor to run the query.
from psycopg2 import connect
conn = connect(host="localhost", user="postgres", dbname="portfolio",
password="empty")
cur = conn.cursor()
cur.execute("SELECT * FROM portfolio")
for record in cur:
print("ISIN: {}, Naam: {}".format(record[0], record[1]))
print(cur.fetchmany(3))
cur.close()
conn.close()
If I run this code, the first print is fine, but the second print-statement returns [].
If I run only one of the two print-statements, I get a result every time.
Can someone explain me why?
The cursor loops over the results and returns one at a time. When it has returned all of them, it can't return any more. This is precisely like when you loop over the lines in a file (there are no more lines once you reach the end of the file) or even looping over a list (there are no more entries in the list after the last one).
If you want to manipulate the results in Python, you should probably read them into a list, which you can then traverse as many times as you like, or search, sort, etc, or access completely randomly.
cur.execute("SELECT * FROM portfolio")
result = cur.fetchall()
for record in result:
print("ISIN: {}, Naam: {}".format(record[0], record[1]))
print(result[0:3]))
I'm working on a project but I'm kinda stuck due to a weird problem. I pull data from an external API and save it into my own SQL database with a Python script. After pulling the data I check if it's already present in my database. I do this with the following code snippet:
def getDatabaseMatchesForSummoner(summonerId):
sqlSelect = 'SELECT gameId FROM playerMatchHistory WHERE playerId=%s'
try:
cur.execute(sqlSelect,(summonerId,))
db.commit()
except MySQLdb as e:
db.rollback()
print e
gameIds = []
print cur.rowcount
for i in range(cur.rowcount):
gameIds += [str(cur.fetchone()[0])]
return gameIds
Now the problem is the following: this piece of code tends to return an amount of rows that are not in agreement with my actual database. For instance, for a particular summoner ID, it returns 7 rows whereas if I enter the query into phpMyAdmin I get 10, the correct amount. I've been searching for some hours now and I can't honestly find anything wrong with it. I tried some other things like fetchall(), other string formatting, etc. I really hope someone can point out what's wrong.
I'm trying to insert some data into a MySQL database using python (pymysql connector) and I'm getting really poor performance (around 10 rows inserted per second). The table is InnoDB, and I'm using a multiple values insert statement and have ensured that autocommit is turned off. Any ideas why my inserts are still so slow?
I initially thought that autocommit wasn't properly being disabled but I've added code to test that it is disabled (=0) during each connection.
Here my example code:
for i in range(1,500):
params.append([i,i,i,i,i,i])
insertDB(params)
def insertDB(params):
query = """INSERT INTO test (o_country_id, i_country_id,c_id,period_id,volume,date_created,date_updated)
VALUES (%s,%s,%s,%s,%s,NOW(),NOW())
ON DUPLICATE KEY UPDATE trade_volume = %s, date_updated = NOW();"""
db.insert_many(query,params)
def insert_many(query,params=None):
cur = _connection.cursor()
try:
_connection.autocommit(False)
cur.executemany(query,params)
_connection.commit()
except pymysql.Error, e:
print ("MySQL error %d: %s" %
(e.args[0], e.args[1]))
cur.close()
What else could be the issue? The above example takes an eternity of about 110 seconds to execute.
Not sure what is wrong, but I would try the mysqldb and/or the mysql connector modules instead and see if I get the same performance numbers.
I have the following code:
def executeQuery(conn, query):
cur = conn.cursor()
cur.execute(query)
return cur
def trackTagsGenerator(chunkSize, baseCondition):
""" Returns a dict of trackId:tag limited to chunkSize. """
sql = """
SELECT track_id, tag
FROM tags
WHERE {baseCondition}
""".format(baseCondition=baseCondition)
limit = chunkSize
offset = 0
while True:
trackTags = {}
# fetch the track ids with the coresponding tag
limitPhrase = " LIMIT %d OFFSET %d" % (limit, offset)
query = sql + limitPhrase
offset += limit
cur = executeQuery(smacConn, query)
rows = cur.fetchall()
if not rows:
break
for row in rows:
trackTags[row['track_id']] = row['tag']
yield trackTags
I want to use it like this:
for trackTags in list(trackTagsGenerator(DATA_CHUNK_SIZE, baseCondition)):
print trackTags
break
This code produces the following error without even fetching one chunk of track tags:
Exception _mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now") in <bound method SSDictCursor.__del__ of <MySQLdb.cursors.SSDictCursor object at 0x10b067b90>> ignored
I suspect it's because I have the query execute logic in the body of loop in the generator function.
Is someone able to tell me how to fetch chunks of data using mysqldb in such way?
I'm pretty sure this is because it can run into situations where you've got two queries
running simultaniously because of the yield. Depending on how you call the function (threads, async, etc..) I'm pretty sure your cursor might get clobbered too?
As well, you're opening yourself up to (sorry, but I can't sugar coat this part) horrific SQL injection holes by inserting baseConditional using essentially a printf. Take a look at the DB-API’s parameter substitution docs for help.
Yield isn't going to save you time or energy here at all, the full sql command will always need to run before you'll get a single result. (Hence you're using LIMIT and OFFSET to make it more friendly, kudos)
i.e. someone updates the table while you're yielding out some data, in this particular case - not the end of the world. In many others, it gets ugly.
If you're just goofing around and you want this to work 'right-now-dammit', it'd probably work to modify executeQuery as such:
def executeQuery(conn, query):
cur = conn.cursor()
cur.execute(query)
cur = executeQuery(smacConn, query)
rows = cur.fetchall()
cur.close()
return rows
One thing that also kinda jumps out at me - you define trackTags = {}, but then you update tagTrackIds, and yield trackTags.. Which will always be empty dict.
My suggestion would be to not bother yourself with the headache of hand writing SQL if you're just trying to get a hobby project working. Take a look at Elixir which is built on top of SQLAlchemy.
Using an ORM (object-relational-mapper) can be a much more friendly introduction to databases. Defining what your objects look like in Python, and having it automatically generate your schema for you - and being able to add/modify/delete things in a Pythonic manner is really nifty.
If you really need to be async, check out ultramysql python module.
You use a SSDictCursor, something that maps to mysql_use_result() on MySQL-API-side. This requires that you read out the complete result before you can issue a new command.
As this happens before you receive the first chunk of data after all: are yu sure that this doesn't happen in the context of the query before this part of code is executed? The results of that last query might be still in the line, and executing the next one (i. e., the fist one in this context) might break things...
I am trying to use SQLSoup - the SQLAlchemy extention, to update records in a SQL Server 2008 database. I am using pyobdc for the connections. There are a number of issues which make it hard to find a relevant example.
I am reprojection a geometry field in a very large table (2 million + records), so many of the standard ways of updating fields cannot be used. I need to extract coordinates from the geometry field to text, convert them and pass them back in. All this is fine, and all the individual pieces are working.
However I want to execute a SQL Update statement on each row, while looping through the records one by one. I assume this places locks on the recordset, or the connection is in use - as if I use the code below it hangs after successfully updating the first record.
Any advice on how to create a new connection, reuse the existing one, or accomplish this another way is appreciated.
s = select([text("%s as fid" % id_field),
text("%s.STAsText() as wkt" % geom_field)],
from_obj=[feature_table])
rs = s.execute()
for row in rs:
new_wkt = ReprojectFeature(row.wkt)
update_value = "geometry :: STGeomFromText('%s',%s)" % (new_wkt, "3785")
update_sql = ("update %s set GEOM3785 = %s where %s = %i" %
(full_name, update_value, id_field, row.fid))
conn = db.connection()
conn.execute(update_sql)
conn.close() #or not - no effect..
Updated working code now looks like this. It works fine on a few records, but hangs on the whole table, so I guess it is reading in too much data.
db = SqlSoup(conn_string)
#create outer query
Session = sessionmaker(autoflush=False, bind=db.engine)
session = Session()
rs = session.execute(s)
for row in rs:
#create update sql...
session.execute(update_sql)
session.commit()
I now get connection busy errors.
DBAPIError: (Error) ('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
It looks like this could be a problem with the ODBC driver - http://sourceitsoftware.blogspot.com/2008/06/connection-is-busy-with-results-for.html
Further Update:
On the server using profiler, it shows the select statement then the first update statement are "starting" but neither complete.
If I set the Select statement to return the top 10 rows, then it does complete and the updates run.
SQL: Batch Starting Select...
SQL: Batch Starting Update...
I believe this is an issue with pyodbc and SQL Server drivers. If I remove SQL Alchemy and execute the same SQL with pyodbc it also hangs. Even if I create a new connection object for the updates.
I also tried the SQL Server Native Client 10.0 driver which is meant to allow MARS - Multiple Active Record Sets but it made no difference. In the end I have resorted to "paging the results" and updating these batches using pyodbc and SQL (see below), however I thought SQLAlchemy would have been able to do this for me automatically.
Try using a Session.
rs = s.execute() then becomes session.execute(rs) and you can replace the last three lines with session.execute(update_sql). I'd also suggest configuring your Session with autocommit off and call session.commit() at the end.
Can I suggest that when your process hangs you do a sp_who2 on the Sql box and see what is happening. Check for blocked spid's and see if you can find anything in the Sql code that can suggest what is happening. If you do find a spid that is blocking others you can do a dbcc inputbuffer(*spidid*) and see if that tells you what the query was it executed. Otherwise you can also attach the Sql profiler and trace your calls.
In some cases it could also be parallelism on the Sql server that cause blocks. Unless this is a data warehouse, I suggest turn your Max DOP off, (set it to 1). Let me know and when I check this again in the morning and you need help, I'll be glad to help.
Until I find another solution I am using a single connection and custom SQL to return sets of records, and updating these in batches. I don't think what I am doing is a particulary unique case, so I am not sure why I cannot handle multiple result sets simultaneously.
Below works but is very, very slow..
cnxn = pyodbc.connect(conn_string, autocommit=True)
cursor = cnxn.cursor()
#get total recs in the database
s = "select count(fid) as count from table"
count = cursor.execute(s).fetchone().count
#choose number of records to update in each iteration
batch_size = 100
for i in range(1,count, batch_size):
#sql to bring back relevant records in each batch
s = """SELECT fid, wkt from(select ROW_NUMBER() OVER(ORDER BY FID ASC) AS 'RowNumber'
,FID
,GEOM29902.STAsText() as wkt
FROM %s) features
where RowNumber >= %i and RowNumber <= %i""" % (full_name,i,i+batch_size)
rs = cursor.execute(s).fetchall()
for row in rs:
new_wkt = ReprojectFeature(row.wkt)
#...create update sql statement for the record
cursor.execute(update_sql)
counter += 1
cursor.close()
cnxn.close()