Hey guys,
I have the following problem:
One process executes a very large query and writes the results to a file; in between, the process should write a status update to the database.
First thought: no problem. Pseudo code:
db = MySQLdb.connect()
cursor = db.cursor()
cursor.execute("SELECT * FROM VERYLARGETABLE")
for result in cursor.fetchall():
    file.write(result)
    if timetoUpdateStatus: cursor.execute("UPDATE STATUS")
Problem: with 9 million results, the cursor.execute("SELECT * FROM VERYLARGETABLE") call never finishes... I found a threshold at around 2 million rows of 4 columns where the MySQL server finishes the query after 30 seconds but the Python process keeps running for hours... maybe that's a bug in the Python MySQLdb library.
So, second try: using db.query() with db.use_result() and fetch_row():
db = MySQLdb.connect()
cursor = db.cursor()
db.query("SELECT * FROM VERYLARGETABLE")
large = db.use_result()
while True:
    for row in large.fetch_row(100000):
        file.write(row)
    if timetoUpdateStatus: cursor.execute("UPDATE STATUS")  # <-- ERROR (2014, "Commands out of sync; you can't run this command now")
So the third try was using two MySQL connections... which doesn't work: when I open a second connection, the first one disappears....
Any suggestions??
Try using a MySQL SSCursor. It keeps the result set on the server (as a MySQL data structure) rather than transferring it to the client (as a Python data structure), which is what the default cursor does. Using an SSCursor avoids the long initial delay caused by the default cursor building a Python data structure for -- and allocating memory for -- the huge result set. The SSCursor should therefore also require far less client-side memory.
import MySQLdb
import MySQLdb.cursors
import config

cons = [MySQLdb.connect(
            host=config.HOST, user=config.USER,
            passwd=config.PASS, db=config.MYDB,
            cursorclass=MySQLdb.cursors.SSCursor) for i in range(2)]

select_cur, update_cur = [con.cursor() for con in cons]
select_cur.execute("SELECT * FROM VERYLARGETABLE")

for i, row in enumerate(select_cur):
    print(row)
    if i % 100000 == 0 or timetoUpdateStatus:
        update_cur.execute("UPDATE STATUS")
Try splitting the "select * from verylargetable" query into smaller chunks:

index = 0
while True:
    cursor.execute('SELECT * FROM verylargetable LIMIT %s, %s', (index, 10000))
    records = cursor.fetchall()
    if len(records) == 0:
        break
    for record in records:
        file.write(record)
    index += 10000
file.close()
Use a LIMIT clause in your big SELECT:

limit = 0
step = 10000
query = "SELECT * FROM VERYLARGETABLE LIMIT %s, %s"

db = MySQLdb.connect()
cursor = db.cursor()

while True:
    cursor.execute(query, (limit, step))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        file.write(row)
    if timetoUpdateStatus:
        cursor.execute(update_query)
    limit += step
Code is not tested, but you should get the idea.
I am migrating my ETL code to Python. I was using pyhs2, but am going to switch to PyHive since it is actively supported and maintained, while no one has taken ownership of pyhs2.
My question is how to structure fetchmany() to iterate over the dataset.
Here is how I did it using pyhs2:
while hive_cur.hasMoreRows:
    hive_stg_result = hive_cur.fetchmany(size=200000)
    hive_stg_df = pd.DataFrame(hive_stg_result)
    hive_stg_df[27] = etl_load_key
    if len(hive_stg_df) == 0:
        call("rm -f /tmp/{0} ".format(filename), shell=True)
        print("No data delta")
    else:
        print(str(len(hive_stg_df)) + " delta records identified")
        for i, row in hive_stg_df.iterrows():
With PyHive I had fetchmany(size=100000), but it fails when it returns an empty set:
hive_stg_result = pyhive_cur.fetchmany(size=100000)
hive_stg_df = pd.DataFrame(hive_stg_result)
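A minimal sketch of one way to structure it (untested; it assumes pyhive_cur is an open PyHive cursor that has already executed the query): loop on fetchmany() and stop when it returns an empty list.

import pandas as pd

def fetch_in_chunks(cursor, size=100000):
    # Yield a DataFrame per fetchmany() batch; stop on an empty batch.
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            break
        yield pd.DataFrame(batch)

# e.g.:
# for hive_stg_df in fetch_in_chunks(pyhive_cur):
#     hive_stg_df[27] = etl_load_key
#     ...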
How can I use executemany() here to speed up the process?
with dest_conn.cursor() as dcur:
    while True:
        rows = scur.fetchmany(size=25)
        if rows:
            place_holders = "(%s)" % ','.join("?"*len(rows[0]))
            place_holders_list = ', '.join([place_holders] * len(rows))
            insert_query = "INSERT IGNORE INTO `%s` VALUES %s" % (tname, place_holders_list)
            dcur.execute(insert_query, (val for row in rows for val in row))
        else:
            log("No more rows found to insert")
            break
Here dcur is the destination cursor I am copying data to and scur is the source cursor I am fetching data from.
Even though I am inserting 25 rows at once (I found this number to be optimal for my db), I am still creating a prepared statement and executing it on every iteration. The manual of oursql says executemany() is faster because it can send all the values in a batch. How can I use it here instead of execute()?
There are a few things you can change in your code. First, you really should create the insert_query string only once; it will never change in the loop. Also, it seems like you have some errors, like '?'*nr not returning a sequence, so I corrected these as well.
Using oursql
import oursql

# ...
place_holders = '(' + ','.join(['?'] * len(scur.description)) + ')'
insert_query = "INSERT IGNORE INTO `%s` VALUES %s" % (tname, place_holders)

with dest_conn.cursor() as dcur:
    while True:
        rows = scur.fetchmany(size=25)
        if not rows:
            log("No more rows found to insert")
            break
        dcur.executemany(insert_query, rows)
However, I do not see much optimisation done with the executemany() method. It will always use MySQL Prepared Statements and execute each insert one by one.
MySQL General log entries executing using oursql:
..
14 Prepare SELECT * FROM t1
14 Execute SELECT * FROM t1
15 Prepare INSERT INTO `t1copy` VALUES (?)
15 Execute INSERT INTO `t1copy` VALUES (1)
15 Execute INSERT INTO `t1copy` VALUES (2)
15 Execute INSERT INTO `t1copy` VALUES (3)
..
Using MySQL Connector/Python
If you use MySQL Connector/Python (note, I'm the maintainer), you'll see different queries going to the MySQL server. Here's the similar code, but reworked so it runs with mysql.connector:
import mysql.connector

# ...
place_holders = ','.join(['%s'] * len(scur.description))
insert_query = "INSERT INTO `{0}` VALUES ({1})".format(tname, place_holders)

dcur = dest_conn.cursor()

while True:
    rows = scur.fetchmany(size=25)
    if not rows:
        log("No more rows found to insert")
        break
    dcur.executemany(insert_query, rows)

dest_conn.commit()
MySQL General log entries executing using mysql.connector:
..
18 Query SELECT * FROM t1
19 Query INSERT INTO `t1copy` VALUES (1),(2),(3),(4),(5),(6),(1),(2),(3),(4),(5),(6)
19 Query COMMIT
What is faster will have to be benchmarked. oursql is using the MySQL C library; MySQL Connector/Python is pure Python. The magic to make the optimised insert is thus also pure Python string parsing, so you'll have to check it.
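If you want to check it against your own data, a minimal timing sketch could look like this (the helper name and label are made up for illustration, not part of either driver):

import time

def time_batch_insert(cursor, insert_query, rows, label):
    # Time a single executemany() call for one batch of row tuples.
    start = time.time()
    cursor.executemany(insert_query, rows)
    print("%s: %d rows in %.3f seconds" % (label, len(rows), time.time() - start))

# e.g. time_batch_insert(dcur, insert_query, rows, "mysql.connector")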
Conclusion
oursql is not optimising the INSERT statement itself. Instead, executemany() is only creating the MySQL Prepared Statement once. So that's good.
The (regrettably lengthy) MWE at the end of this question is cut down from a real application. It is supposed to work like this: There are two tables. One includes both already-processed and not-yet-processed data, the other has the results of processing the data. On startup, we create a temporary table that lists all of the data that has not yet been processed. We then open a read cursor on that table and scan it from beginning to end; for each datum, we do some crunching (omitted in the MWE) and then insert the results into the processed-data table, using a separate cursor.
This works correctly in autocommit mode. However, if the write operation is wrapped in a transaction -- and in the real application, it has to be, because the write actually touches several tables (all but one of which have been omitted from the MWE) -- then the COMMIT operation has the side-effect of resetting the read cursor on the temp table, causing rows that have already been processed to be reprocessed, which not only prevents forward progress, it causes the program to crash with an IntegrityError upon trying to insert a duplicate row into data_out. If you run the MWE you should see this output:
0
1
2
3
4
5
6
7
8
9
10
0
---
127 rows remaining
Traceback (most recent call last):
File "sqlite-test.py", line 85, in <module>
test_main()
File "sqlite-test.py", line 83, in test_main
test_run(db)
File "sqlite-test.py", line 71, in test_run
(row[0], b"output"))
sqlite3.IntegrityError: UNIQUE constraint failed: data_out.value
What can I do to prevent the read cursor from being reset by a COMMIT touching unrelated tables?
Notes:
All of the INTEGERs in the schema are ID numbers; in the real application there are several more ancillary tables that hold more information for each ID, and the write transaction touches two or three of them in addition to data_out, depending on the result of the computation.
In the real application, the temporary "data_todo" table is potentially very large -- millions of rows; I started down this road precisely because a Python list was too big to fit in memory.
The MWE's shebang is for python3, but it will behave exactly the same under python2 (provided the interpreter is new enough to understand b"..." strings).
Setting PRAGMA locking_mode = EXCLUSIVE; and/or PRAGMA journal_mode = WAL; has no effect on the phenomenon.
I am using SQLite 3.8.2.
#! /usr/bin/python3

import contextlib
import sqlite3
import sys
import tempfile
import textwrap

def init_db(db):
    db.executescript(textwrap.dedent("""\
        CREATE TABLE data_in (
            origin    INTEGER,
            origin_id INTEGER,
            value     INTEGER,
            UNIQUE(origin, origin_id)
        );
        CREATE TABLE data_out (
            value     INTEGER PRIMARY KEY,
            processed BLOB
        );
    """))
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (1, x, x) for x in range(100) ])
    db.executemany("INSERT INTO data_in VALUES(?, ?, ?);",
                   [ (2, x, 200 - x*2) for x in range(100) ])
    db.executemany("INSERT INTO data_out VALUES(?, ?);",
                   [ (x, b"already done") for x in range(50, 130, 5) ])
    db.execute(textwrap.dedent("""\
        CREATE TEMPORARY TABLE data_todo AS
            SELECT DISTINCT value FROM data_in
            WHERE value NOT IN (SELECT value FROM data_out)
            ORDER BY value;
    """))

def test_run(db):
    init_db(db)
    read_cur = db.cursor()
    write_cur = db.cursor()
    read_cur.arraysize = 10
    read_cur.execute("SELECT * FROM data_todo;")
    try:
        while True:
            block = read_cur.fetchmany()
            if not block: break
            for row in block:
                # (in real life, data actually crunched here)
                sys.stdout.write("{}\n".format(row[0]))
                write_cur.execute("BEGIN TRANSACTION;")
                # (in real life, several more inserts here)
                write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                                  (row[0], b"output"))
                db.commit()
    finally:
        read_cur.execute("SELECT COUNT(DISTINCT value) FROM data_in "
                         "WHERE value NOT IN (SELECT value FROM data_out)")
        result = read_cur.fetchone()
        sys.stderr.write("---\n{} rows remaining\n".format(result[0]))

def test_main():
    with tempfile.NamedTemporaryFile(suffix=".db") as tmp:
        with contextlib.closing(sqlite3.connect(tmp.name)) as db:
            test_run(db)

test_main()
Use a second, separate connection for the temporary table, it'll be unaffected by commits on the other connection.
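A rough sketch of that approach, adapted from the MWE above (untested; the function name is made up, init_db() is the helper from the MWE, and WAL mode is enabled so the reading and writing connections can coexist):

import sqlite3

def test_run_two_connections(db_path):
    # Reader connection: owns the temp table and the scanning cursor.
    read_db = sqlite3.connect(db_path)
    # Writer connection: performs the inserts and the commits.
    write_db = sqlite3.connect(db_path)

    # WAL lets the reader and the writer work on the same file concurrently.
    read_db.execute("PRAGMA journal_mode = WAL;")

    init_db(read_db)   # init_db() as defined in the MWE above
    read_db.commit()   # make data_in/data_out visible to write_db

    read_cur = read_db.cursor()
    write_cur = write_db.cursor()

    read_cur.execute("SELECT * FROM data_todo;")
    while True:
        block = read_cur.fetchmany()
        if not block:
            break
        for row in block:
            write_cur.execute("INSERT INTO data_out VALUES(?, ?);",
                              (row[0], b"output"))
            write_db.commit()  # no longer resets read_cur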
I need to fetch a huge amount of data from Oracle (using cx_Oracle) in Python 2.6 and produce a CSV file.
The data size is about 400k records x 200 columns x 100 chars each.
Which is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
    ...
... the script stays in the loop for hours and nothing is written to the file... (Is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SqlDeveloper is super fast.
Thank you, gian
You should use cur.fetchmany() instead.
It will fetch a chunk of rows defined by arraysize (256).
Python code:

def chunks(cur):  # 256
    global log, d
    while True:
        # log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        pass  # process your rows here
That is exactly how I do it in my TableHunter for Oracle.
Add print statements after each line.
Add a counter to your loop indicating progress after every N rows (see the sketch below).
Look into a module like 'progressbar' for displaying a progress indicator.
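A rough illustration of the counter idea (a sketch only; ctemp is the cursor from the question after ctemp.execute(sql), and the chunk size of 1000 is arbitrary):

rows_written = 0
while True:
    rows = ctemp.fetchmany(1000)   # fetch in batches instead of one row at a time
    if not rows:
        break
    for row in rows:
        file.write(row[1])
    rows_written += len(rows)
    print("processed %d rows so far" % rows_written)   # progress message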
I think your code is asking the database for the data one row at a time, which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
results = ctemp.fetchall()
for row in results:
    file.write(row[1])
I've traced a slowness in my application to MySQLdb's execute() function. I crafted a simple SQL query that exemplifies the problem:
SELECT * FROM `cid444_agg_big` c WHERE 1
>>> import MySQLdb as mdb
>>> import time;
>>>
>>> dbconn = mdb.connect('localhost','*****','*****','*****');
>>> cursorconn = dbconn.cursor()
>>>
>>> sql="SELECT * FROM `cid444_agg_big` c WHERE 1";
>>>
>>> startstart=time.time();
>>> cursorconn.execute(sql);
21600L #returned 21600 records
>>> print time.time()-startstart, "for execute()"
2.86254501343 for execute() #why does this take so long?
>>>
>>> startstart=time.time();
>>> rows = cursorconn.fetchall()
>>> print time.time()-startstart, "for fetchall()"
0.0021288394928 for fetchall() #this is very fast, no problem with fetchall()
Running this query in the mysql shell yields 0.27 seconds, or 10 times faster!!!
My only thought is the size of the data being returned. This returns 21600 "wide" rows. So that's a lot of data being sent to python. The database is localhost, so there's no network latency.
Why does this take so long?
UPDATE: MORE INFORMATION
I wrote a similar script in PHP:
$c = mysql_connect('localhost', '*****', '****', true);
mysql_select_db('cachedata', $c);

$time_start = microtime_float();
$sql = "SELECT * FROM `cid444_agg_big` c WHERE 1";
$q = mysql_query($sql);
$c = 0;
while ($r = mysql_fetch_array($q))
    $c++; // do something?
echo "Did ".$c." loops in ".(microtime_float() - $time_start)." seconds\n";

function microtime_float() { // function taken from php.net
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
This prints:
Did 21600 loops in 0.56120800971985 seconds
This loops over all the data instead of retrieving it all at once. PHP appears to be 6 times faster than the Python version....
The default MySQLdb cursor fetches the complete result set to the client on execute, and fetchall() will just copy the data from memory to memory.
If you want to store the result set on the server and fetch it on demand, you should use SSCursor instead.
Cursor:
This is the standard Cursor class that returns rows as tuples and stores the result set in the client.
SSCursor:
This is a Cursor class that returns rows as tuples and stores the result set in the server.
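For the query above, that would look roughly like this (untested; the connection arguments are the placeholders from the question):

import MySQLdb
import MySQLdb.cursors

dbconn = MySQLdb.connect('localhost', '*****', '*****', '*****',
                         cursorclass=MySQLdb.cursors.SSCursor)
cursorconn = dbconn.cursor()
cursorconn.execute("SELECT * FROM `cid444_agg_big` c WHERE 1")

# execute() now returns quickly; rows are streamed from the server as you
# iterate, so consume the whole result before issuing another query on
# this connection.
for row in cursorconn:
    pass  # process the row here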
Very old discussion, but I'll try to add my 2 cents.
The script had to select many rows by timestamp. In the standard setup (indexed id, name, timestamp) it was very, very slow (I didn't measure it exactly, but minutes, lots of minutes). I added an index on timestamp too... and the query took under 10 seconds. Much better.
"ALTER TABLE BTC ADD INDEX(timestamp)"
I hope this can help.