Retrieve huge data table from MySQL within Jupyter notebook - python

I'm currently trying to fetch 100 million rows from a MySQL table using a Jupyter notebook. I have made some attempts with pymysql.cursors to open a MySQL connection. I have tried to fetch in batches in order to speed up the selection process, because selecting all the rows at once is far too heavy an operation. Here is my test:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='XXX',
                             user='XXX',
                             password='XXX',
                             db='XXX',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        print(cursor.execute("SELECT count(*) FROM `table`"))
        count = cursor.fetchone()[0]
        batch_size = 50
        for offset in xrange(0, count, batch_size):
            cursor.execute(
                "SELECT * FROM `table` LIMIT %s OFFSET %s",
                (batch_size, offset))
            for row in cursor:
                print(row)
finally:
    connection.close()
For now the test just prints out each row (not very useful in itself), but in my opinion the best solution would be to store everything in a pandas DataFrame.
Unfortunately, when I run it, I get this error:
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
        print(cursor.execute("SELECT count(*) FROM `table`"))
---->   count = cursor.fetchone()[0]
        batch_size = 50

KeyError: 0
Does anyone have an idea of what the problem might be?
Maybe using chunksize would be a better idea?
Thanks in advance!
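(A likely cause, for reference: because the connection uses pymysql.cursors.DictCursor, fetchone() returns a dict keyed by column name rather than a tuple, so indexing it with 0 raises KeyError: 0. A minimal sketch of the two usual workarounds, assuming the same connection object as above; the cnt alias is introduced only to make the dictionary key predictable:)
# Option 1: keep DictCursor and index by column name
cursor.execute("SELECT COUNT(*) AS cnt FROM `table`")
count = cursor.fetchone()['cnt']

# Option 2: request a plain tuple cursor for this query, so positional indexing works
with connection.cursor(pymysql.cursors.Cursor) as tuple_cursor:
    tuple_cursor.execute("SELECT COUNT(*) FROM `table`")
    count = tuple_cursor.fetchone()[0]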
UPDATE
I have rewritten the code without batch_size, storing the query result in a pandas DataFrame. It finally seems to run, but of course the execution time looks pretty much 'infinite' given that the volume of data is 100 million rows:
connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM `table`"
        cursor.execute(query)
        cursor.fetchall()
        df = pd.read_sql(query, connection)
finally:
    connection.close()
What would be a correct approach to speed up the process? Maybe passing chunksize = 250 as a parameter?
Also, if I print the type of df, it turns out to be a generator, not a DataFrame.
If I print df the output is:
<generator object _query_iterator at 0x11358be10>
How can I get the data in DataFrame format? If I print the result of the fetchall command I can see the correct output of the query, so up to that point everything works as expected.
If I try to call DataFrame() on the result of the fetchall command I get:
ValueError: DataFrame constructor not properly called!
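(For reference, a minimal sketch of building the DataFrame directly from the cursor results; the cursor and connection names are assumed to match the code above, and the column names are taken from cursor.description:)
cursor.execute("SELECT * FROM `table`")
rows = cursor.fetchall()

# With the default tuple cursor, pass the column names explicitly
columns = [col[0] for col in cursor.description]
df = pd.DataFrame(rows, columns=columns)

# With DictCursor, fetchall() returns a list of dicts, which
# pd.DataFrame() accepts directly:
# df = pd.DataFrame(rows)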
Another UPDATE
I was able to output the result by iterating pd.read_sql like this:
chunks = []
for chunk in pd.read_sql(query, connection, chunksize=250):
    chunks.append(chunk)

result = pd.concat(chunks, ignore_index=True)
print(type(result))
#print(result)
And finally I got a single DataFrame called result.
Now the questions are:
Is it possible to query all the data without a LIMIT?
What exactly influences the performance of the process?
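(Regarding querying without a LIMIT: one hedged option is a server-side cursor, which streams rows from the server instead of buffering the whole result set in client memory. A minimal sketch, assuming the same placeholder connection parameters as above; the batch size of 10000 is arbitrary:)
import pymysql
import pymysql.cursors

# SSCursor is an unbuffered (server-side) cursor: rows are streamed as they
# are fetched, so the client never has to hold 100 million rows at once.
connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX',
                             cursorclass=pymysql.cursors.SSCursor)
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM `table`")
        while True:
            rows = cursor.fetchmany(10000)   # pull a manageable batch at a time
            if not rows:
                break
            # process the batch here (e.g. append to a list of DataFrames)
finally:
    connection.close()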

Related

Pandas .to_sql() Output exceeds the size limit

I'm using pyodbc and sqlalchemy to insert data into a table in SQL Server, and I'm getting this error.
https://i.stack.imgur.com/miSp9.png
Here are snippets of the functions I use.
This is the function I use to connect to the SQL server (using fast_executemany)
def connect(server, database):
    global cnxn_str, cnxn, cur, quoted, engine
    cnxn_str = ("Driver={SQL Server Native Client 11.0};"
                "Server=<server>;"
                "Database=<database>;"
                "UID=<user>;"
                "PWD=<password>;")
    cnxn = pyodbc.connect(cnxn_str)
    cur = cnxn.cursor()
    cur.fast_executemany = True
    quoted = quote_plus(cnxn_str)
    engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted), fast_executemany=True)
And this is the function I'm using to query and insert the data into the SQL server
def insert_to_sql_server():
    global df, np_array

    # Dataframe df is from a numpy array dtype = object
    df = pd.DataFrame(np_array[1:,], columns=np_array[0])

    # add new columns, data processing
    df['comp_key'] = df['col1']+"-"+df['col2'].astype(str)
    df['comp_key2'] = df['col3']+"-"+df['col4'].astype(str)+"-"+df['col5'].astype(str)
    df['comp_statusID'] = df['col6']+"-"+df['col7'].astype(str)

    convert_dict = {'col1': 'string', 'col2': 'string', ..., 'col_n': 'string'}

    # convert data types of columns from objects to strings
    df = df.astype(convert_dict)

    connect(<server>, <database>)
    cur.rollback()

    # Delete old records
    cur.execute("DELETE FROM <table>")
    cur.commit()

    # Insert dataframe to table
    df.to_sql(<table name>, engine, index=False,
              if_exists='replace', schema='dbo', chunksize=1000, method='multi')
The insert function runs for about 30 minutes before finally returning the error message.
I encountered no errors when doing it with a smaller df size. The current df size I have is 27963 rows and 9 columns. One thing which I think contributes to the error is the length of the string. By default the numpy array is dtype='<U25', but I had to override this to dtype='object' because it was truncating the text data.
I'm out of ideas because it seems like the error is referring to limitations of either Pandas or the SQL Server, which I'm not familiar with.
Thanks
Thanks for all the input (still new here)! I accidentally stumbled upon the solution, which is reducing the chunksize in df.to_sql from
df.to_sql(chunksize=1000)
to
df.to_sql(chunksize=200)
After digging, it turns out there's a limitation on the SQL Server side (https://discuss.dizzycoding.com/to_sql-pyodbc-count-field-incorrect-or-syntax-error/).
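(The limitation in question is very likely SQL Server's cap of 2100 parameters per statement: with method='multi', pandas emits one bind parameter per cell, so chunksize times the number of columns has to stay below that cap. A hedged sketch for picking a safe chunksize, reusing the placeholder table name and engine from the snippets above:)
# SQL Server allows at most 2100 parameters per statement; with method='multi'
# every cell in a chunk becomes one bind parameter, so cap the chunk size accordingly.
max_params = 2100
n_cols = len(df.columns)                    # 9 columns in the case above
safe_chunksize = max_params // n_cols - 1   # leave a little headroom

df.to_sql('<table name>', engine, index=False, if_exists='replace',
          schema='dbo', chunksize=safe_chunksize, method='multi')
With the 9 columns mentioned above this works out to roughly 230 rows per chunk, which is consistent with chunksize=200 working while chunksize=1000 failed.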
In my case, I had the same "Output exceeds the size limit" error, and I fixed it by adding method='multi' in df.to_sql(method='multi').
First I tried the "chunksize" solution and it didn't work.
So... check whether you're in the same scenario!
with engine.connect().execution_options(autocommit=True) as conn:
    df.to_sql('mytable', con=conn, method='multi', if_exists='replace', index=True)

cx_Oracle: fetchall() stops working with big SELECT statements

I'm trying to read data from an Oracle DB.
I have to read, in Python, the results of a simple select that returns a million rows.
I use the fetchall() function, changing the arraysize property of the cursor.
select_qry = db_functions.read_sql_file('src/data/scripts/03_perimetro_select.sql')
dsn_tns = cx_Oracle.makedsn(ip, port, sid)
con = cx_Oracle.connect(user, pwd, dsn_tns)
start = time.time()
cur = con.cursor()
cur.arraysize = 1000
cur.execute('select * from bigtable where rownum < 10000')
res = cur.fetchall()
# print res # uncomment to display the query results
elapsed = (time.time() - start)
print(elapsed, " seconds")
cur.close()
con.close()
If I remove the WHERE condition where rownum < 10000, the Python environment freezes and the fetchall() function never ends.
After some trials I found a limit for this particular select: it works up to 50k rows, but it fails if I select 60k rows.
What is causing this problem? Do I have to find another way to fetch this amount of data, or is the problem the ODBC connection? How can I test it?
Consider running in batches using Oracle's ROWNUM. To combine the batches back into a single object, append to a growing list. The code below assumes the total row count for the table is 1 million; adjust as needed:
table_row_count = 1000000
batch_size = 10000

# PREPARED STATEMENT (note: Oracle requires the inner "*" to be qualified when
# combined with ROWNUM, does not accept AS for table aliases, and cx_Oracle
# statements must not end with a semicolon)
sql = """SELECT t.* FROM
             (SELECT sub_t.*, ROWNUM AS row_num
                FROM (SELECT * FROM bigtable ORDER BY primary_id) sub_t) t
         WHERE t.row_num BETWEEN :LOWER_BOUND AND :UPPER_BOUND"""

data = []
for lower_bound in range(0, table_row_count, batch_size):
    # BIND PARAMS WITH BOUND LIMITS
    cur.execute(sql, {'LOWER_BOUND': lower_bound,
                      'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cur.fetchall():
        data.append(row)
You are probably running out of memory on the computer running cx_Oracle. Don't use fetchall() because that requires cx_Oracle to hold the entire result set in memory. Use something like this to fetch batches of records:
cursor = connection.cursor()
cursor.execute("select employee_id from employees")
res = cursor.fetchmany(numRows=3)
print(res)
res = cursor.fetchmany(numRows=3)
print(res)
Stick the fetchmany() calls in a loop, process each batch of rows in your app before fetching the next set of rows, and exit the loop when there is no more data.
Whatever solution you use, tune cursor.arraysize to get the best performance.
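(A minimal sketch of that loop, reusing the connection and example table from the snippet above; the batch size of 1000 is arbitrary:)
cursor = connection.cursor()
cursor.arraysize = 1000                 # internal fetch size; tune for fewer round trips
cursor.execute("select employee_id from employees")

while True:
    rows = cursor.fetchmany(1000)       # fetch one batch of rows at a time
    if not rows:                        # empty list means no more data
        break
    for row in rows:
        pass                            # process each row here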
The already given suggestion to repeat the query and select subsets of rows is also worth considering. If you are using Oracle DB 12 there is a newer (easier) syntax like SELECT * FROM mytab ORDER BY id OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY.
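(A hedged sketch of batching with that 12c OFFSET/FETCH syntax, using bind variables; the table and column names are taken from the earlier examples in this thread:)
batch_size = 10000
offset = 0
data = []

while True:
    cur.execute(
        """SELECT * FROM bigtable
           ORDER BY primary_id
           OFFSET :off ROWS FETCH NEXT :lim ROWS ONLY""",
        {'off': offset, 'lim': batch_size})
    rows = cur.fetchall()
    if not rows:
        break
    data.extend(rows)
    offset += batch_size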
PS cx_Oracle does not use ODBC.

Fastest way to read huge MySQL table in python

I was trying to read a very large MySQL table made of several million rows. I have used the pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"

        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
            #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
Actually, the execution time is acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example, 1 million rows), then the execution time dramatically increases.
Is there maybe a better/faster way to select the data from a relational database within Python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a time series, or there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show different cases.
Example 1
import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import multiprocessing
import datetime

import pandas as pd
import MySQLdb

def worker(a, b):
    # where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= '{0}' AND x < '{1}'".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary here, and will depend on your data / knowing your data beforehand (i.e. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))

# starmap unpacks each (a, b) pair into the two worker arguments
data = p.starmap(worker, date_pairs)
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe).
pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
For those using Windows and having trouble installing MySQLdb: I use this approach to fetch data from a huge table.
import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0           # MySQL LIMIT offsets are 0-based
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):   # break the loop when no more rows
        print("Done!")
        break
    for row in rows:    # do something with results
        print(row)
    i += limit
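(One caveat with this pattern, noted earlier in the thread: as the LIMIT offset grows, MySQL still has to scan and discard all the skipped rows, so later batches get slower and slower. A hedged alternative is keyset pagination on an indexed column; the sketch below assumes super_table has an indexed integer primary key called id, which is an assumption about the schema:)
last_id = 0
batch_size = 1000

while True:
    # Each batch continues from the last id seen, so MySQL can seek the index
    # directly instead of scanning past an ever-growing OFFSET.
    cursor.execute(
        "SELECT * FROM super_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, batch_size))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        print(row)
    last_id = rows[-1][0]   # assumes id is the first column in the result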

Pyodbc slow on table with a million+ rows

The question is: when querying my database using pyodbc, it can take around a minute to return my data. This table in my database is hovering around a million+ rows. How can I speed this up? I've tried setting values for the main columns in the table and it is still slow. I've also tried limiting the data returned, to no avail. Is there something else out there that is faster, or can I possibly change something to make the following faster? Any help would be greatly appreciated!
Here is my code:
import pyodbc

conn = pyodbc.connect(r'DSN=mydb;UID=myuserid;PWD=mypass')

class Visual(object):

    def get_resource_ids(self, *xargs):
        resource_ids = []
        cur = conn.cursor()
        cur.execute("select * from operation where status = 'C' and workorder_type = 'W' and workorder_base_id = ? and workorder_lot_id = ? and workorder_split_id = ? and workorder_sub_id = ? and rownum <= 10", xargs[0], xargs[1], xargs[2], xargs[3])
        try:
            return [dict(count=str(index), resource_id=row[6]) for index, row in enumerate(cur, 1)]
        except ValueError:
            print "Error"
        finally:
            cur.close()
            conn.close()
Basically, due to poor table design, this was self-inflicted and not a pyodbc issue. Moral of the story: design tables better.

pyodbc and direct query differ in Teradata

I'm using pyodbc to connect to a Teradata database and it seems that something is not working properly:
This:
conn = connect(params)
cur = conn.cursor()

if len(argv) > 1:
    query = ''.join(open(argv[1]).readlines())
else:
    query = "SELECT count(*) FROM my_table"

cur.execute(query)
print "...done"
print cur.fetchall()
This returns what seems to be an overflow, a number like 140630114173190, but in fact there are only 260 entries in the table (which I do get when querying directly in the SQL Assistant from Teradata).
However, when doing a select * the result seems to be correct.
Any idea of what could be going on?
Running on:
Linux eron-redhat-100338 2.6.32-131.0.15.el6.x86_64
Thanks
EDIT: I don't think this is a fetchall() issue. That's only going to change whether I get a list, a tuple, or whatever, but the number won't change.
Interestingly, I discovered that changing the query to
query = "SELECT CAST(count(*) AS DECIMAL(10,2)) FROM my_table"
does get the right number, only as a float. Something is going on with the integers.
fetchall() returns a record set, and since you need the 1st column of the 1st record, you should use something like:
print('# of rows: [%s]' % (c.fetchall()[0][0]))
or:
for row in c.fetchall():
    print('# of rows: [%s]' % (row[0]))
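(For a single-value query like this count, a slightly simpler hedged option is fetchone(), which returns just the first row; the cursor name c matches the snippet above:)
c.execute("SELECT count(*) FROM my_table")
row = c.fetchone()                      # first (and only) row of the result
print('# of rows: [%s]' % row[0])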
