Pyodbc slow on table with a million+ rows - python

When querying my database using pyodbc, it can take around a minute to return my data. The table is hovering around a million+ rows. How can I speed this up? I've tried supplying values for the main columns in the table and it is still slow. I've also tried limiting the data returned, to no avail. Is there something else out there that is faster, or can I change something to make the following faster? Any help would be greatly appreciated!
Here is my code:
import pyodbc

conn = pyodbc.connect(r'DSN=mydb;UID=myuserid;PWD=mypass')

class Visual(object):
    def get_resource_ids(self, *xargs):
        resource_ids = []
        cur = conn.cursor()
        cur.execute("select * from operation "
                    "where status = 'C' and workorder_type = 'W' "
                    "and workorder_base_id = ? and workorder_lot_id = ? "
                    "and workorder_split_id = ? and workorder_sub_id = ? "
                    "and rownum <= 10",
                    xargs[0], xargs[1], xargs[2], xargs[3])
        try:
            return [dict(count=str(index), resource_id=row[6])
                    for index, row in enumerate(cur, 1)]
        except ValueError:
            print "Error"
        finally:
            cur.close()
            conn.close()

It turned out this was self-inflicted, due to poor table design, and not a pyodbc issue. Moral of the story: design tables better.
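For anyone landing here with the same symptom, a hedged sketch of the kind of fix that usually helps: an index covering the columns in the WHERE clause. The table and column names are the ones from the query above; the index name is made up for illustration, and whether this is the right index depends on your actual schema and data distribution.

import pyodbc

conn = pyodbc.connect(r'DSN=mydb;UID=myuserid;PWD=mypass')
cur = conn.cursor()
# Hypothetical index name; composite index over the filtered columns
cur.execute(
    "create index ix_operation_workorder on operation "
    "(workorder_base_id, workorder_lot_id, workorder_split_id, "
    "workorder_sub_id, status, workorder_type)"
)
conn.commit()
cur.close()
conn.close()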

Related

Python cx_Oracle Insert Into table with multiple columns automating the values (:1, :2 ... :100)

I am working on a script to read from an Oracle table with about 75 columns in one environment and load it into the same table definition in a different environment. So far I have been using the cx_Oracle cur.execute() method with 'INSERT INTO TABLENAME VALUES (:1, :2, :3 ... :75)' and then loading the data with 'cur.execute(sql, conn)'.
However, this table has 75+ columns, and writing out (:1, :2 ... :75) by hand is tedious and, I'm guessing, not best practice.
Is there an automated way to loop over the number of columns and fill in the VALUES() portion of the SQL query?
import getpass
import cx_Oracle

user = 'username'
password = getpass.getpass()  # 'pass' is a Python keyword, so use another name

connection_prod = cx_Oracle.makedsn(host, port, service_name='')
cursor_prod = connection_prod.cursor()
connection_dev = cx_Oracle.makedsn(host, port, service_name='')
cursor_dev = connection_dev.cursor()

SQL_Read = """SELECT * FROM Table_name_Prod"""
Data = cursor_prod.execute(SQL_Read)

for row in Data:
    # This part is ugly and tedious -- this is where I need help
    SQL_Load = "INSERT INTO TABLE_NAME_DEV VALUES (:1, :2, :3, :4 ...:75)"
    cursor_dev.execute(SQL_Load, row)

connection_prod.commit()
cursor_prod.close()
connection_prod.close()
You can do the following which should help not only in reducing code but also in improving performance:
connection_prod = cx_Oracle.connect(...)
cursor_prod = connection_prod.cursor()

# set array size for source cursor to some reasonable value
# increasing this value reduces round-trips but increases memory usage
cursor_prod.arraysize = 500

connection_dev = cx_Oracle.connect(...)
cursor_dev = connection_dev.cursor()

cursor_prod.execute("select * from table_name_prod")
bind_names = ",".join(":" + str(i + 1)
                      for i in range(len(cursor_prod.description)))
sql_load = "insert into table_name_dev values (" + bind_names + ")"

while True:
    rows = cursor_prod.fetchmany()
    if not rows:
        break
    cursor_dev.executemany(sql_load, rows)
    # can call connection_dev.commit() here if you want to commit each batch
The use of cursor.executemany() will significantly help in terms of performance. Hope this helps you out!
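To isolate the part that answers the original question, the placeholder generation on its own looks like this (a minimal sketch; 75 is just the column count from the question, and in practice you would take it from len(cursor.description) as above):

# Build ":1,:2,...,:75" instead of typing it out by hand
num_cols = 75  # from the question; normally len(cursor.description)
placeholders = ",".join(":" + str(i + 1) for i in range(num_cols))
sql_load = "INSERT INTO TABLE_NAME_DEV VALUES (" + placeholders + ")"
print(sql_load)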

Fastest way to read huge MySQL table in python

I was trying to read a very huge MySQL table made of several million rows. I have used the Pandas library with chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
            # print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        # print(type(result))
        # print(result)
finally:
    print("Done!")
    connection.close()
Actually the execution time is acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example 1 million rows), then the execution time increases dramatically.
Is there a better/faster way to select the data from a relational database within Python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a timeseries, or there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show different cases.
Example 1
import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import multiprocessing
import datetime

import pandas as pd
import MySQLdb

def worker(bounds):
    # where bounds is an (a, b) pair of timestamps; p.map passes one argument,
    # so the pair is unpacked here
    a, b = bounds
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    # timestamps quoted so they form valid SQL literals
    query = "SELECT * FROM example_table WHERE x >= '{0}' AND x < '{1}'".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary and will depend on your data / knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
data = p.map(worker, date_pairs)
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
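As one such variation (not from the original answer, just a hedged sketch along the same lines): the standard-library concurrent.futures module gives the same fan-out with a bit less ceremony, and a bound parameter replaces the string formatting. The connection details, example_table, col_x and col_x_categories are the placeholders used above.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import MySQLdb

def worker(y):
    # each worker opens its own connection, as in the examples above
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    try:
        return pd.read_sql("SELECT * FROM example_table WHERE col_x = %s",
                           connection, params=(y,))
    finally:
        connection.close()

with ThreadPoolExecutor(max_workers=10) as pool:
    data = list(pool.map(worker, col_x_categories))

results = pd.concat(data, ignore_index=True)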
You could try using a different mysql connector. I would recommend trying mysqlclient which is the fastest mysql connector (by a considerable margin I believe).
pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
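Whichever driver you pick, a server-side (streaming) cursor is also worth knowing about for tables this size: instead of buffering the whole result set in client memory, rows are fetched from the server as you iterate. A minimal sketch with MySQLdb's SSCursor (pymysql has the equivalent pymysql.cursors.SSCursor); connection details are placeholders:

import MySQLdb
import MySQLdb.cursors

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx',
                             cursorclass=MySQLdb.cursors.SSCursor)
cursor = connection.cursor()
cursor.execute("SELECT * FROM example_table")

# rows stream from the server as you iterate instead of being
# loaded all at once by fetchall()
for row in cursor:
    pass  # do something with each row

cursor.close()
connection.close()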
For those using Windows and having trouble installing MySQLdb: I'm using this way to fetch data from a huge table.
import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

offset = 0        # MySQL LIMIT offsets start at 0
limit = 1000
while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(offset, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):      # break the loop when no more rows
        print("Done!")
        break
    for row in rows:       # do something with the results
        print(row)
    offset += limit
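A hedged aside: as the earlier answer notes, LIMIT/OFFSET paging gets slower the deeper you go, because MySQL still has to scan and throw away all the skipped rows. If the table has an indexed, monotonically increasing id column (an assumption here), keyset pagination avoids that:

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

last_id = 0
batch_size = 1000
while True:
    # seek straight to the next batch via the indexed id column
    cursor.execute(
        "SELECT * FROM super_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, batch_size))
    rows = cursor.fetchall()
    if not rows:
        print("Done!")
        break
    for row in rows:           # do something with the results
        print(row)
    last_id = rows[-1][0]      # assumes id is the first selected column

cursor.close()
connection.close()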

Retrieve huge data table from MySQL within Jupyter notebook

I'm currently trying to fetch 100 million rows from a MySQL table using a Jupyter notebook. I have made some attempts with pymysql.cursors to open a MySQL connection. I have tried to use batches in order to speed up the selection process, because selecting all the rows at once is far too heavy an operation. Here below is my test:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='XXX',
                             user='XXX',
                             password='XXX',
                             db='XXX',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        print(cursor.execute("SELECT count(*) FROM `table`"))
        count = cursor.fetchone()[0]
        batch_size = 50
        for offset in xrange(0, count, batch_size):
            cursor.execute(
                "SELECT * FROM `table` LIMIT %s OFFSET %s",
                (batch_size, offset))
            for row in cursor:
                print(row)
finally:
    connection.close()
For now the test should just print out each row (not very useful in itself), but the best solution in my opinion would be to store everything in a pandas dataframe.
Unfortunately, when I run it, I get this error:
KeyError                                  Traceback (most recent call last)
in ()
      print(cursor.execute("SELECT count(*) FROM `table`"))
----> count = cursor.fetchone()[0]
      batch_size = 50

KeyError: 0
Does someone have an idea of what the problem could be?
Maybe the use of chunksize would be a better idea?
Thanks in advance!
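For what it's worth, a hedged note on the traceback above: the connection is created with cursorclass=pymysql.cursors.DictCursor, so fetchone() returns a dict keyed by column name rather than a tuple, and indexing it with [0] raises KeyError: 0. A minimal sketch of two ways around that (the alias n is made up for illustration):

# Option 1: keep the DictCursor and index by column name (alias the count)
cursor.execute("SELECT count(*) AS n FROM `table`")
count = cursor.fetchone()["n"]

# Option 2: use a plain tuple cursor for this one query and keep [0]
with connection.cursor(pymysql.cursors.Cursor) as plain_cursor:
    plain_cursor.execute("SELECT count(*) FROM `table`")
    count = plain_cursor.fetchone()[0]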
UPDATE
I have rewritten the code without batch_size, storing the query result in a pandas dataframe. It finally seems to run, but of course the execution time is pretty much 'infinite', given that the volume of data is 100 million rows:
connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM `table`"
        cursor.execute(query)
        cursor.fetchall()
        df = pd.read_sql(query, connection)
finally:
    connection.close()
What would be a correct approach to speed up the process? Maybe by passing chunksize=250 as a parameter?
Also, if I try to print the type of df, it says it is a generator, so it is actually not a dataframe.
If I print df the output is:
<generator object _query_iterator at 0x11358be10>
How can I get the data in dataframe format? If I print the result of fetchall I can see the correct output of the query, so up to that point everything works as expected.
If I try to use DataFrame() with the result of the fetchall command I get:
ValueError: DataFrame constructor not properly called!
Another UPDATE
I was able to output the result by iterating pd.read_sql like this:
chunks = []
for chunk in pd.read_sql(query, connection, chunksize=250):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
print(type(result))
# print(result)
And finally I got just one dataframe called result.
Now the questions are:
Is it possible to query all the data without a LIMIT?
What exactly influences the performance of the process?

python - SQL Select Conditional statements in python

This is my piece of R code, but I want to do the same thing in Python. As I am new to it, I am having problems writing the correct code; can anybody guide me on how to write this in Python? I have already made the database connection and also tried simple queries, but here I am struggling.
sql_command <- "SELECT COUNT(DISTINCT Id) FROM \"Bowlers\";"
total <- as.numeric(dbGetQuery(con, sql_command))

data <- setNames(data.frame(matrix(ncol=8, nrow=total)),
                 c("Name","Wkts","Ave","Econ","SR","WicketTaker","totalovers","Matches"))

for (i in 1:total){
    sql_command <- paste("SELECT * FROM \"Bowlers\" where Id = ", i, ";", sep="")
    p <- dbGetQuery(con, sql_command)
    p[is.na(p)] <- 0
    data$Name[i] = p$bowler[1]
}
After this part, which works fine, how should I proceed to write the loop code?
with engine.connect() as con:
    rs = con.execute('SELECT COUNT(DISTINCT id) FROM "Bowlers"')
    for row in rs:
        print(row)
Use the format method for strings in python to achieve it.
I am using postgresql, but your connection should be similar. Something like:
connect to test database:
import psycopg2
con = psycopg2.connect("dbname='test' user='your_user' host='your_host' password='your_password'")
cur = con.cursor() # cursor method may differ for your connection
loop over your id's:
for i in range(1, total+1):
    sql_command = 'SELECT * FROM "Bowlers" WHERE id = {}'.format(i)
    cur.execute(sql_command)   # execute and fetchall method may differ
    rows = cur.fetchall()      # check for your connection
    print("output first row for id = {}".format(i))
    print(rows[0])             # sanity check, printing first row for ids
    print('\n')                # rows is a list of tuples
                               # you can convert them into numpy arrays
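Since the question's snippet already uses a SQLAlchemy engine, here is a hedged sketch of the same loop written against engine.connect(), using a bound parameter instead of .format() (which also sidesteps SQL injection); the connection URL is a placeholder and the table/column names are the ones from the R code:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://your_user:your_password@your_host/test")

with engine.connect() as con:
    total = con.execute(text('SELECT COUNT(DISTINCT id) FROM "Bowlers"')).scalar()
    for i in range(1, total + 1):
        # bound parameter, so the value is escaped by the driver
        rows = con.execute(text('SELECT * FROM "Bowlers" WHERE id = :id'),
                           {"id": i}).fetchall()
        if rows:
            print(rows[0])   # first row for this id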

pyodbc and direct query differ in Teradata

I'm using pyodbc to connect to a Teradata database and it seems that something is not working properly:
This:
conn = connect(params)
cur = conn.cursor()

if len(argv) > 1:
    query = ''.join(open(argv[1]).readlines())
else:
    query = "SELECT count(*) FROM my_table"

cur.execute(query)
print "...done"
print cur.fetchall()
returns what seems to be an overflow, a number like 140630114173190, but in fact there are only 260 entries in the table (which I do get by querying directly in the SQL Assistant from Teradata).
However, when doing a select * the result seems to be correct.
Any idea of what could be going on?
Running on:
Linux eron-redhat-100338 2.6.32-131.0.15.el6.x86_64
Thanks
EDIT: I don't think this is a fetchall() issue. That's only going to change whether I get a list or a tuple or whatever, but the number won't change.
Interestingly, I discovered that changing to
query = "SELECT CAST(count(*)) AS DECIMAL(10,2) FROM my_table"
does get the right number, only in as float number. Something is going on with the integers.
fetchall() returns a record set, and you need the 1st column of the 1st record, so you should use something like:
print('# of rows: [%s]' % (c.fetchall()[0][0]))
or:
for row in c.fetchall():
    print('# of rows: [%s]' % (row[0]))
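A small hedged variant of the same idea: for a single-row aggregate like count(*), cursor.fetchone() is enough, and reusing the CAST workaround from the question avoids whatever is going wrong with the driver's integer handling. DECIMAL(18,0) is my assumption for keeping the value integer-valued; how the driver maps it back may vary.

cur.execute("SELECT CAST(count(*) AS DECIMAL(18,0)) FROM my_table")
row = cur.fetchone()                  # a single row with a single column
print('# of rows: [%s]' % row[0])     # row[0] is the count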
