I am currently writing an application in IronPython/WPF and will heavily utilize SQL select and set statements during production.
I have successfully connected to the server and can grab the data I wish via queries. However I am having issues parsing the response data. See below code
import clr
clr.AddReference('System.Data')
from System.Data import *
query = 'sqlquery'
conn = SqlClient.SqlConnection(--sql connection properties--)
conn.Open()
result = SqlClient.SqlCommand(query, conn)
data = result.ExecuteReader()
while data.Read():
print(data[0])
data.Close()
conn.Close()
The issue I am having is print(data[0]) is required to print any of the SQL response, a simple print(data) returns:
<System.Data.SqlClient.SqlDataReader object at 0x00000000000002FF [System.Data.SqlClient.SqlDataReader]>
However a print(data[0]) only returns the index of the row in SQL, [1] the next column etc etc.
I would like to access all data from the row (where rows can be variable lengths, different queries etc)
How could I get this to work?
EDIT:
I have successfully extracted all data from one row of the response with the following code;
for i in range(1, data.FieldCount):
print(data.GetName(i))
print(data.GetValue(i))
Just need to determine how to perform this iteration over all returned rows so I can pass it to a dict/datagrid
I have found a solution that works for me.
Utilizing the SqlAdapter and Datasets to store/view data later as required
def getdataSet(self, query):
conn = SqlClient.SqlConnection(---sql cred---))
conn.Open()
result = SqlClient.SqlCommand(query, conn)
data = SqlClient.SqlDataAdapter(result)
ds = DataSet()
data.Fill(ds)
return ds
def main(self):
ds = self.getdataSet("query")
self.maindataGrid.ItemsSource = ds.Tables[0].DefaultView
Related
I have a dask dataframe which has 220 partitions and 7 columns. I have imported this file from a bcp file as and completed some wrangling in dask. I then want to write this whole file to mssql using turboodbc. I connect to the DB as follows:
mydb = 'TEST'
from turbodbc import connect, make_options
connection = connect(driver="ODBC Driver 17 for SQL Server",
server="TEST SERVER",
port="1433",
database=mydb,
uid="sa",
pwd="5pITfir3")
I then use the function i found from a medium article to write to a test table in the DB:
def turbo_write(mydb, df, table):
"""Use turbodbc to insert data into sql."""
start = time.time()
# preparing columns
columns = '('
columns += ', '.join(df.columns)
columns += ')'
# preparing value place holders
val_place_holder = ['?' for col in df.columns]
sql_val = '('
sql_val += ', '.join(val_place_holder)
sql_val += ')'
# writing sql query for turbodbc
sql = f"""
INSERT INTO {mydb}.dbo.{table} {columns}
VALUES {sql_val}
"""
print(sql)
print(sql_val)
# writing array of values for turbodbc
values_df = [df[col].values for col in df.columns]
print(values_df)
# cleans the previous head insert
with connection.cursor() as cursor:
cursor.execute(f"delete from {mydb}.dbo.{table}")
connection.commit()
# inserts data, for real
with connection.cursor() as cursor:
#try:
cursor.executemanycolumns(sql, values_df)
connection.commit()
# except Exception:
# connection.rollback()
# print('something went wrong')
stop = time.time() - start
return print(f'finished in {stop} seconds')
This works when I upload a small amount of rows as follows:
turbo_write(mydb, df_train.head(1000), table)
When i try to do a larger number of rows it fails:
turbo_write(mydb, df_train.head(10000), table)
I get the error:
RuntimeError: Unable to cast Python instance to C++ type (compile in
debug mode for details)
How do i write the whole dask dataframe to mssql without any errors?
I needed to convert to a maskedarray by changing:
# writing array of values for turbodbc
values_df = [df[col].values for col in df.columns]
to
values_df = [np.ma.MaskedArray(df[col].values, pd.isnull(df[col].values)) for col in df.columns]
I can then write all the data using:
for i in range(df_train.npartitions):
partition = df_train.get_partition(i)
turbo_write(mydb, partition, table)
i += 1
This still takes a long time to write in comparison to saving the file and writing to the DB using BCP. If anyone has any more efficient suggestions I would love to see them.
I was trying to read a very huge MySQL table made of several millions of rows. I have used Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors
connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')
try:
with connection.cursor() as cursor:
query = "SELECT * FROM example_table;"
chunks=[]
for chunk in pd.read_sql(query, connection, chunksize = 1000):
chunks.append(chunk)
#print(len(chunks))
result = pd.concat(chunks, ignore_index=True)
#print(type(result))
#print(result)
finally:
print("Done!")
connection.close()
Actually the execution time is acceptable if I limit the number of rows to select. But if want to select also just a minimum of data (for example 1 mln of rows) then the execution time dramatically increases.
Maybe is there a better/faster way to select the data from a relational database within python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g if it's a timeseries, or there some kind of appropriate index column to use, it might make sense). I've put in two examples below to show different cases.
Example 1
import pandas as pd
import MySQLdb
def worker(y):
#where y is value in an indexed column, e.g. a category
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
return pd.read_sql(query, connection)
p = multiprocessing.Pool(processes=10)
#(or however many process you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
#assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import pandas as pd
import MySQLdb
import datetime
def worker(a,b):
#where a and b are timestamps
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
query = "SELECT * FROM example_table WHERE x >= {0} AND x < {1}".format(a,b)
return pd.read_sql(query, connection)
p = multiprocessing.Pool(processes=10)
#(or however many process you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this arbitrary here, and will depend on your data /knowing your data before hand (ie. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
data = p.map(worker, date_pairs)
p.close()
results = pd.concat(data)
Probably nicer ways doing this (and haven't properly tested etc). Be interested to know how it goes if you try it.
You could try using a different mysql connector. I would recommend trying mysqlclient which is the fastest mysql connector (by a considerable margin I believe).
pymysql is a pure python mysql client, whereas mysqlclient is wrapper around the (much faster) C libraries.
Usage is basically the same as pymsql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
For those using Windows and having troubles to install MySQLdb. I'm using this way to fetch data from huge table.
import mysql.connector
i = 1
limit = 1000
while True:
sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
cursor.execute(sql)
rows = self.cursor.fetchall()
if not len(rows): # break the loop when no more rows
print("Done!")
break
for row in rows: # do something with results
print(row)
i += limit
I'm currently trying to fetch 100 million of rows from a MySQL table using the Jupyter Notebook. I have made some attempts with pymysql.cursors provided for open a MySQL connection. Actually I have tried to use batches in order to speed-up the query selection process cause it's a too much heavy operation to select all the rows together. Here below my test:
import pymysql.cursors
# Connect to the database
connection = pymysql.connect(host='XXX',
user='XXX',
password='XXX',
db='XXX',
charset='utf8mb4',
cursorclass=pymysql.cursors.DictCursor)
try:
with connection.cursor() as cursor:
print(cursor.execute("SELECT count(*) FROM `table`"))
count = cursor.fetchone()[0]
batch_size = 50
for offset in xrange(0, count, batch_size):
cursor.execute(
"SELECT * FROM `table` LIMIT %s OFFSET %s",
(batch_size, offset))
for row in cursor:
print(row)
finally:
connection.close()
For now the test should just print out each row (more or less not so worth), but the best solution in my opinion would be to store everything in a pandas dataframe.
Unfortunately when I run it, I got this error:
KeyError Traceback (most recent call
last) in ()
print(cursor.execute("SELECT count(*) FROM `table`"))
---> count = cursor.fetchone()[0]
batch_size = 50
KeyError: 0
Someone has an idea of what would be the problem?
Maybe the use of chunksize would be a better idea?
Thanks in advance!
UPDATE
I have rewrite again the code without batch_size and storing the query result in a pandas dataframe. Finally it seems running but of course the execution time seems pretty much 'infinite' due to the fact that are 100mln rows as volume of data:
connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX')
try:
with connection.cursor() as cursor:
query = "SELECT * FROM `table`"
cursor.execute(query)
cursor.fetchall()
df = pd.read_sql(query, connection)
finally:
connection.close()
What should be a correct approach for speed-up the process? Maybe by passing as parameter chunksize = 250?
And also If I try to print the type of df then it ouputs that is a generator. Actually this is not a dataframe.
If I print df the output is:
<generator object _query_iterator at 0x11358be10>
How can I get the data in a dataframe format? Cause if I print the fetch_all command I can see the correct output selection of the query, so till that point everything works as expected.
If I try to use Dataframe() with the result of the fetchAll command I get:
ValueError: DataFrame constructor not properly called!
Another UPDATE
I was able to output the result by iterating pd.read_sql like this:
for chunk in pd.read_sql(query, connection, chunksize = 250):
chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
print(type(result))
#print(result)
And finally I got just one dataframe called result.
Now the questions are:
Is it possible to query all the data without a LIMIT?
What exactly influence the process benchmark?
I have below code to use cassandra python driver for pagination
I tried both overriding query and set session default_fetch_size. but none of them working, the results always all rows from the table. what am I missing?
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
# setup
cluster = Cluster(["10.40.10.xxx","10.40.10.xxx","10.40.22.xxx","10.40.22.xxx"])
session = cluster.connect()
session.set_keyspace("ad_realtime")
# session.default_fetch_size = 10
query = "SELECT * from campaign_offset"
statement = SimpleStatement(query, fetch_size=10)
results = session.execute(statement)
for row in results:
print row
Paging in the Python Driver doesn't mean getting only part of your query. It means only getting parts of your query at a time.
Your code
for row in results:
print row
Is invoking the paging machinery. Basically this is creating an iterator which only requests fetch_size rows at a time out of the resultset defined by your query.
Use LIMIT and WHERE clauses to restrict your actual result.
cassandra pagination: the difference between driver and CQL
You can get the rows of the current page by using current_rows, like:
for row in results.current_rows:
print row
Following code might help you to fetch the results in a paginated way-
def fetch_rows(stmt: Statement, fetch_size: int = 10):
stmt.fetch_size = fetch_size
rs: ResultSet = session.execute(stmt)
has_pages = rs.has_more_pages
while has_pages:
yield from rs.current_rows
print ('-----------------------------------------')
has_pages = rs.has_more_pages
rs = session.execute(ps, paging_state=rs.paging_state)
def execute():
query = "SELECT col1, col2 FROM my_table WHERE some_partition_key='part_val1' AND some_clustering_col='clus_val1'"
ps = session.prepare(query)
for row in fetch_rows(ps, 20):
print(row)
# Process the row and perform desired operation
Im using Pyodbc to connect to sqlserver to get few rows. The select query I execute fetches almost 200,000 rows causing a memory issue.
To resolve this issue Im using a generator object, to fetch 5000 rows at any point in time..
The problem with this kind of execution is the generator object. I lose the data column names..
For example, if my table1 has column NAME, through normal exection I can access the result set as result.NAME
but I can't do the same with the generator object..It doesn't allow me to access through column names.
Any inputs will be useful?
Using Cursor.fetchmany() to process query result in batches returns a list of pyodbc.Row objects, which allows reference by column name.
Take these examples of a SQL Server query that returns database names in batches of 5:
Without generator
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}',
server='localhost', database='master',
trusted_connection='yes')
sql = 'select name from sys.databases'
cursor = connection.cursor().execute(sql)
while True:
rows = cursor.fetchmany(5)
if not rows:
break
for row in rows:
print row.name
With generator (modified from sample here)
def rows(cursor, size=5):
while True:
rows = cursor.fetchmany(size)
if not rows:
break
for row in rows:
yield row
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}',
server='localhost', database='master',
trusted_connection='yes')
sql = 'select name from sys.databases'
cursor = connection.cursor().execute(sql)
for row in rows(cursor):
print row.name