I've traced a slowness in my application to MySQLdb's execute() function. I crafted a simple SQL query that exemplifies the problem:
SELECT * FROM `cid444_agg_big` c WHERE 1
>>> import MySQLdb as mdb
>>> import time;
>>>
>>> dbconn = mdb.connect('localhost','*****','*****','*****');
>>> cursorconn = dbconn.cursor()
>>>
>>> sql="SELECT * FROM `cid444_agg_big` c WHERE 1";
>>>
>>> startstart=time.time();
>>> cursorconn.execute(sql);
21600L #returned 21600 records
>>> print time.time()-startstart, "for execute()"
2.86254501343 for execute() #why does this take so long?
>>>
>>> startstart=time.time();
>>> rows = cursorconn.fetchall()
>>> print time.time()-startstart, "for fetchall()"
0.0021288394928 for fetchall() #this is very fast, no problem with fetchall()
Running the same query in the mysql shell yields 0.27 seconds, about ten times faster!
My only thought is the size of the data being returned. The query returns 21600 "wide" rows, so that's a lot of data being sent to Python. The database is on localhost, so there's no network latency.
Why does this take so long?
UPDATE: MORE INFORMATION
I wrote a similar script in PHP:
$c = mysql_connect ( 'localhost', '*****', '****', true );
mysql_select_db ( 'cachedata', $c );
$time_start = microtime_float();
$sql="SELECT * FROM `cid444_agg_big` c WHERE 1";
$q=mysql_query($sql);$c=0;
while($r=mysql_fetch_array($q))
    $c++; //do something?
echo "Did ".$c." loops in ".(microtime_float() - $time_start)." seconds\n";
function microtime_float(){ //function taken from php.net
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
This prints:
Did 21600 loops in 0.56120800971985 seconds
This loops over all the data instead of retrieving it all at once. PHP appears to be roughly five times faster than the Python version ...
The default MySQLdb cursor fetches the complete result set to the client on execute, and fetchall() will just copy the data from memory to memory.
If you want to store the result set on the server and fetch it on demand, you should use SSCursor instead.
Cursor:
This is the standard Cursor class that returns rows as tuples and stores the result set in the client.
SSCursor:
This is a Cursor class that returns rows as tuples and stores the result set in the server.
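As a minimal sketch of the switch (reusing the masked connection parameters from the question), something like the following should work; with SSCursor, execute() returns almost immediately and the rows are streamed from the server as you iterate or call fetchone()/fetchmany():
import MySQLdb
import MySQLdb.cursors

# Placeholder credentials, mirroring the question's masked values.
dbconn = MySQLdb.connect('localhost', '*****', '*****', '*****',
                         cursorclass=MySQLdb.cursors.SSCursor)
cursorconn = dbconn.cursor()

cursorconn.execute("SELECT * FROM `cid444_agg_big` c WHERE 1")
for row in cursorconn:    # rows are fetched from the server on demand
    pass                  # process each row here
cursorconn.close()        # finish the result set before reusing the connection
Note that with an SSCursor you must consume (or close) the whole result set before issuing another query on the same connection.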
Very old discussion, but I'll try to add my two cents.
My script had to select many rows filtered by timestamp. With the standard setup (an index on id only, plus name and timestamp columns) it was very, very slow (I didn't time it exactly, but many minutes). After adding an index on timestamp as well, the query took under 10 seconds. Much better.
ALTER TABLE BTC ADD INDEX(timestamp)
I hope this can help.
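To check that the new index is actually used, a quick sketch with MySQLdb (assuming an open connection dbconn and the BTC table from above; the timestamp range values are made up):
cur = dbconn.cursor()
cur.execute("ALTER TABLE BTC ADD INDEX (timestamp)")
cur.execute("EXPLAIN SELECT * FROM BTC WHERE timestamp BETWEEN %s AND %s",
            (1514764800, 1514851200))
for row in cur.fetchall():
    print row   # the `key` column of the plan should name the new timestamp index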
Related
Below is the SQL string I pass to an Oracle server via an Oracle API within Python. I suspect that the LISTAGG function is the reason for the extended processing time; without it, the results come back in under half the time the SQL below takes. Any suggestions or fixes are more than welcome.
SELECT to_char(a.transaction_dt, 'MM/DD/YYYY'),
a.sub_acct_nbr,
a.trn_ldgr_entr_desc,
a.fdoc_nbr,
a.fin_object_cd,
a.fin_sub_obj_cd,
a.fin_obj_cd_nm,
b.explanation,
(select listagg(c.txt, ';') WITHIN GROUP (order by a.fdoc_nbr) from View3 c where a.fdoc_nbr = c.fdoc_nbr) Notes,
to_char(a.trn_ldgr_entr_amt, '9,999,999.99'),
a.trn_debit_crdt_cd
FROM View1 a
LEFT OUTER JOIN View2 b
ON a.fdoc_nbr = b.doc_hdr_id
WHERE a.account_nbr = 123456
AND a.univ_fiscal_prd_cd = 12
AND (a.fin_object_cd BETWEEN '5000' AND '7999'
OR a.fin_object_cd BETWEEN '9902' AND '9905')
ORDER BY a.transaction_dt;
How can I use executemany here to speed up the process?
with dest_conn.cursor() as dcur:
    while True:
        rows = scur.fetchmany(size=25)
        if rows:
            place_holders = "(%s)" % ','.join("?"*len(rows[0]))
            place_holders_list = ', '.join([place_holders] * len(rows))
            insert_query = "INSERT IGNORE INTO `%s` VALUES %s" % (tname, place_holders_list)
            dcur.execute(insert_query, (val for row in rows for val in row))
        else:
            log("No more rows found to insert")
            break
Here dcur is the destination cursor (where the data is copied to) and scur is the source cursor (where I am fetching the data from).
Even though I am inserting 25 rows at once (I found this number is optimal for my db), I am still building a prepared statement and executing it on every iteration. The manual of oursql says executemany is faster because it can send all the values in one batch. How can I use it here instead of execute?
There are a few things you can change in your code. First, you really should create the insert_query string only once; it never changes in the loop. Also, the placeholder construction needed adjusting for executemany, so I corrected that as well.
Using oursql
import oursql
# ...
place_holders = '(' + ','.join(['?'] * len(scur.description)) + ')'
insert_query = "INSERT IGNORE INTO `%s` VALUES %s" % (tname, place_holders)
with dest_conn.cursor() as dcur:
    while True:
        rows = scur.fetchmany(size=25)
        if not rows:
            log("No more rows found to insert")
            break
        dcur.executemany(insert_query, rows)
However, I do not see much optimisation done with the executemany() method. It will always use MySQL Prepared Statements and execute each insert one by one.
MySQL general log entries when executing with oursql:
..
14 Prepare SELECT * FROM t1
14 Execute SELECT * FROM t1
15 Prepare INSERT INTO `t1copy` VALUES (?)
15 Execute INSERT INTO `t1copy` VALUES (1)
15 Execute INSERT INTO `t1copy` VALUES (2)
15 Execute INSERT INTO `t1copy` VALUES (3)
..
Using MySQL Connector/Python
If you use MySQL Connector/Python (note, I'm the maintainer), you'll see different queries going to the MySQL server. Here's similar code, reworked so it runs with mysql.connector:
import mysql.connector
# ...
place_holders = ','.join(['%s'] * len(scur.description))
insert_query = "INSERT INTO `{0}` VALUES ({1})".format(tname, place_holders)
dcur = dest_conn.cursor()
while True:
    rows = scur.fetchmany(size=25)
    if not rows:
        log("No more rows found to insert")
        break
    dcur.executemany(insert_query, rows)
    dest_conn.commit()
MySQL general log entries when executing with mysql.connector:
..
18 Query SELECT * FROM t1
19 Query INSERT INTO `t1copy` VALUES (1),(2),(3),(4),(5),(6),(1),(2),(3),(4),(5),(6)
19 Query COMMIT
Which is faster will have to be benchmarked. oursql uses the MySQL C library; MySQL Connector/Python is pure Python, so the magic that builds the optimised multi-row INSERT is also pure Python string parsing. You'll have to check it for your workload.
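A minimal timing sketch for such a benchmark (insert_query, rows and the cursors are the hypothetical objects from the snippets above; each driver's cursor is passed in unchanged):
import time

def bench(label, cursor, insert_query, rows):
    # Time one executemany() call; repeat with realistic batch sizes
    # before drawing any conclusions.
    start = time.time()
    cursor.executemany(insert_query, rows)
    print "%s: %.3f seconds for %d rows" % (label, time.time() - start, len(rows))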
Conclusion
oursql is not optimising the INSERT statement itself. Instead, executemany() is only creating the MySQL Prepared Statement once. So that's good.
I need to fetch a huge amount of data from Oracle (using cx_Oracle) in Python 2.6 and produce a CSV file.
The data size is about 400k records x 200 columns x 100 chars each.
What is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
    ...
... the script stays in the loop for hours and nothing is written to the file... (is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SqlDeveloper is super fast.
Thank you, gian
You should use cur.fetchmany() instead.
It will fetch a chunk of rows whose size is defined by arraysize (256).
Python code:
def chunks(cur):  # 256
    global log, d
    while True:
        # log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        pass  # process your rows here
That is exactly how I do it in my TableHunter for Oracle.
add print statements after each line
add a counter to your loop indicating progress after every N rows (a minimal sketch follows after this list)
look into a module like 'progressbar' for displaying a progress indicator
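For the counter idea, a minimal sketch, assuming the ctemp cursor and output file from the question:
row_count = 0
for row in ctemp:
    file.write(row[1])
    row_count += 1
    if row_count % 10000 == 0:   # report progress every 10,000 rows
        print "processed %d rows" % row_count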
I think your code is asking the database for the data one row at a time, which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
results = ctemp.fetchall()
for row in results:
    file.write(row[1])
I have a netCDF file with eight variables (sorry, can't share the actual file).
Each variable has two dimensions, time and station. Time is about 14 steps and station currently holds 38000 different ids.
So for 38000 different "locations" (actually just an id) we have 8 variables and 14 different times.
$ncdump -h stationdata.nc
netcdf stationdata {
dimensions:
station = 38000 ;
name_strlen = 40 ;
time = UNLIMITED ; // (14 currently)
variables:
int time(time) ;
time:long_name = "time" ;
time:units = "seconds since 1970-01-01" ;
char station_name(station, name_strlen) ;
station_name:long_name = "station_name" ;
station_name:cf_role = "timeseries_id" ;
float var1(time, station) ;
var1:long_name = "Variable 1" ;
var1:units = "m3/s" ;
float var2(time, station) ;
var2:long_name = "Variable 2" ;
var2:units = "m3/s" ;
...
This data needs to be loaded into a PostgreSQL database so that it can be joined to some geometries matching the station_name for later visualization.
Currently I have done this in Python with the netCDF4 module. It works, but it takes forever!
Now I am looping like this:
times = rootgrp.variables['time']
stations = rootgrp.variables['station_name']
var1 = rootgrp.variables['var1']
for timeindex, time in enumerate(times):
    for stationindex, stationnamearr in enumerate(stations):
        var1val = var1[timeindex][stationindex]
        print "INSERT INTO ncdata (validtime, stationname, var1) " \
              "VALUES ('%s', '%s', %s);" % (time, stationnamearr, var1val)
This takes several minutes to run on my machine, and I have a feeling it could be done in a much more clever way.
Does anyone have an idea how this can be done in a smarter way? Preferably in Python.
Not sure this is the right way to do it, but I found a good way to solve it and thought I should share.
In the first version the script took about one hour to run. After a rewrite of the code it now runs in less than 30 seconds!
The big thing was to use numpy arrays and transpose the variable arrays from the NetCDF reader so they become rows, and then stack all columns into one matrix. This matrix was then loaded into the db using the psycopg2 copy_from function. I got the code for that from this question:
Use binary COPY table FROM with psycopg2
Parts of my code:
import cStringIO

import numpy as np
import psycopg2
from netCDF4 import num2date

# rootgrp is the netCDF4 Dataset opened from stationdata.nc (see the question).
dates = num2date(rootgrp.variables['time'][:], units=rootgrp.variables['time'].units)
# stationnames is assumed to be a 1-D array of station name strings,
# e.g. netCDF4.chartostring(rootgrp.variables['station_name'][:]).
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']

cpy = cStringIO.StringIO()

for timeindex, time in enumerate(dates):
    validtimes = np.empty(var1[timeindex].size, dtype="object")
    validtimes.fill(time)

    # Transpose and stack the arrays of parameters
    # [a,a,a,a]    [[a,b,c],
    # [b,b,b,b] =>  [a,b,c],
    # [c,c,c,c]     [a,b,c],
    #               [a,b,c]]
    a = np.hstack((
        validtimes.reshape(validtimes.size, 1),
        stationnames.reshape(stationnames.size, 1),
        var1[timeindex].reshape(var1[timeindex].size, 1),
        var2[timeindex].reshape(var2[timeindex].size, 1)
    ))

    # Fill the cStringIO with a text representation of the created array
    for row in a:
        cpy.write(row[0].strftime("%Y-%m-%d %H:%M") + '\t' + row[1] + '\t' +
                  '\t'.join([str(x) for x in row[2:]]) + '\n')

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()

cpy.seek(0)
curs.copy_from(cpy, 'ncdata', columns=('validtime', 'stationname', 'var1', 'var2'))
conn.commit()
There are a few simple improvements you can make to speed this up. All these are independent, you can try all of them or just a couple to see if it's fast enough. They're in roughly ascending order of difficulty:
Use the psycopg2 database driver, it's faster
Wrap the whole block of inserts in a transaction. If you're using psycopg2 you're already doing this - it auto-opens a transaction you have to commit at the end.
Collect up several rows worth of values in an array and do a multi-valued INSERT every n rows.
Use more than one connection to do the inserts via helper processes - see the multiprocessing module. Threads won't work as well because of GIL (global interpreter lock) issues.
If you don't want to use one big transaction you can set synchronous_commit = off and set a commit_delay so the connection can return before the disk flush actually completes (a minimal sketch follows after this list). This won't help you much if you're doing all the work in one transaction.
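For that last item, a minimal sketch assuming an open psycopg2 connection named conn; synchronous_commit is a per-session setting, so it can be switched off once before running the inserts:
curs = conn.cursor()
# Async commit for this session only: the server may lose the very latest
# transactions on a crash, but the data cannot end up corrupted.
curs.execute("SET synchronous_commit TO OFF")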
Multi-valued inserts
Psycopg2 doesn't directly support multi-valued INSERT but you can just write:
curs.execute("""
INSERT INTO blah(a,b) VALUES
(%s,%s),
(%s,%s),
(%s,%s),
(%s,%s),
(%s,%s);
""", parms);
and loop with something like:
parms = []
rownum = 0
for x in input_data:
    parms.extend([x.firstvalue, x.secondvalue])
    rownum += 1
    if rownum % 5 == 0:
        curs.execute("""INSERT ...""", tuple(parms))
        del parms[:]
# Remember to flush any leftover parms (fewer than 5 rows) with a final,
# smaller INSERT after the loop.
Organize your loop to access all the variables for each time. In other words, read and write a record at a time rather than a variable at a time. This can speed things up enormously, especially if the source netCDF dataset is stored on a file system with large disk blocks, e.g. 1MB or larger. For an explanation of why this is faster and a discussion of order-of-magnitude resulting speedups, see this NCO speedup discussion, starting with entry 7.
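A hedged sketch of that record-at-a-time access pattern with netCDF4 (the dataset and variable names reuse the ones from the question; how each record is written out is left open):
import numpy as np
from netCDF4 import Dataset

rootgrp = Dataset('stationdata.nc')
varnames = ['var1', 'var2']   # extend with the remaining variables
variables = [rootgrp.variables[name] for name in varnames]

for timeindex in range(len(rootgrp.variables['time'])):
    # One read per variable per time step: a whole record (all stations)
    # at once, rather than one value per (time, station) pair.
    record = np.column_stack([v[timeindex, :] for v in variables])
    # ... write `record` out, e.g. append it to the COPY buffer shown earlier ...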
Hey guys,
I have the following problem:
One process executes a very large query and writes the results to a file; in between, the process should write a status update to the database.
First thought: no problem, pseudo code:
db = mysqldb.connect()
cursor = db.cursor()
large = cursor.execute(SELECT * FROM VERYLARGETABLE)
for result in large.fetchall():
    file.write(result)
    if timetoUpdateStatus: cursor.execute(UPDATE STATUS)
Problem: with 9 million results, the line "large = cursor.execute(SELECT * FROM VERYLARGETABLE)" never finishes... I found a boundary at around 2 million entries with 4 columns: the MySQL server finishes the query after 30 seconds, but the Python process keeps running for hours... That may be a bug in the Python MySQLdb library.
SO, SECOND TRY: the db.query function with db.use_result() and fetch_row():
db = mysqldb.connect()
cursor = db.cursor()
db.query(SELECT * FROM VERYLARGETABLE)
large = db.use_result()
while True:
    for row in large.fetch_row(100000):
        file.write(row)
        if timetoUpdateStatus: cursor.execute(UPDATE STATUS) <-- ERROR (2014, "Commands out of sync; you can't run this command now")
So the THIRD TRY was using two MySQL connections... which doesn't work: when I open a second connection, the first one disappears...
Any suggestions?
Try using a MySQL SSCursor. It will keep the result set in the server (MySQL data structure), rather than transfer the result set to the client (Python data structure) which is what the default cursor does. Using an SSCursor will avoid the long initial delay caused by the default cursor trying to build a Python data structure -- and allocate memory for -- the huge result set. Thus, the SSCursor should also require less memory.
import MySQLdb
import MySQLdb.cursors
import config

cons = [MySQLdb.connect(
            host=config.HOST, user=config.USER,
            passwd=config.PASS, db=config.MYDB,
            cursorclass=MySQLdb.cursors.SSCursor)
        for i in range(2)]

select_cur, update_cur = [con.cursor() for con in cons]
select_cur.execute("SELECT * FROM VERYLARGETABLE")

for i, row in enumerate(select_cur):
    print(row)
    if i % 100000 == 0 or timetoUpdateStatus:
        update_cur.execute("UPDATE STATUS")
Try splitting up the "select * from db" query into smaller chunks
index = 0
while True:
    cursor.execute("SELECT * FROM verylargetable LIMIT %s, %s", (index, 10000))
    records = cursor.fetchall()
    if len(records) == 0:
        break
    for record in records:
        file.write(str(record))
    index += 10000
file.close()
Use the LIMIT statement in your big select:
limit = 0
step = 10000
query = "SELECT * FROM VERYLARGETABLE LIMIT %s, %s"
db = mysqldb.connect()
cursor = db.cursor()
while True:
    cursor.execute(query, (limit, step))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        file.write(row)
    if timetoUpdateStatus:
        cursor.execute(update_query)
    limit += step
Code is not tested, but you should get the idea.