I have python code that reads a very large Oracle db of unknown row numbers to extract some data confined by lat/lon guidelines but it takes about 20 minutes per query. I am trying to re-write or add something to my code to improve this efficiency time since i have many queries to run one at a time. Here is my code that i'm using now:
plant_name = 'NEW HARVEST'
conn= cx_Oracle.connect('DOMINA_CONSULTA/password#ex021-orc.corp.companyname.com:1540/domp_domi_bi')
try:
query1 = '''
SELECT * FROM DOMINAGE.DGE_RAYOS WHERE FECHA_RAYO >= '01-JAN-19' AND FECHA_RAYO < '01-JAN-
20' AND COORDENADA_X>=41.82 AND COORDENADA_X<=42.52 AND COORDENADA_Y>=-95.83 AND
COORDENADA_Y<=-95.13
'''
dfp = pd.read_sql(con = conn, sql = query1)
finally:
conn.close()
dfp.head()
#drop col's not needed
dfp = dfp[['FECHA_RAYO','INTENSIDAD_KA','COORDENADA_X','COORDENADA_Y']]
dfp = dfp.assign(SITE=plant_name)
Related
I have a dask dataframe which has 220 partitions and 7 columns. I have imported this file from a bcp file as and completed some wrangling in dask. I then want to write this whole file to mssql using turboodbc. I connect to the DB as follows:
mydb = 'TEST'
from turbodbc import connect, make_options
connection = connect(driver="ODBC Driver 17 for SQL Server",
server="TEST SERVER",
port="1433",
database=mydb,
uid="sa",
pwd="5pITfir3")
I then use the function i found from a medium article to write to a test table in the DB:
def turbo_write(mydb, df, table):
"""Use turbodbc to insert data into sql."""
start = time.time()
# preparing columns
columns = '('
columns += ', '.join(df.columns)
columns += ')'
# preparing value place holders
val_place_holder = ['?' for col in df.columns]
sql_val = '('
sql_val += ', '.join(val_place_holder)
sql_val += ')'
# writing sql query for turbodbc
sql = f"""
INSERT INTO {mydb}.dbo.{table} {columns}
VALUES {sql_val}
"""
print(sql)
print(sql_val)
# writing array of values for turbodbc
values_df = [df[col].values for col in df.columns]
print(values_df)
# cleans the previous head insert
with connection.cursor() as cursor:
cursor.execute(f"delete from {mydb}.dbo.{table}")
connection.commit()
# inserts data, for real
with connection.cursor() as cursor:
#try:
cursor.executemanycolumns(sql, values_df)
connection.commit()
# except Exception:
# connection.rollback()
# print('something went wrong')
stop = time.time() - start
return print(f'finished in {stop} seconds')
This works when I upload a small amount of rows as follows:
turbo_write(mydb, df_train.head(1000), table)
When i try to do a larger number of rows it fails:
turbo_write(mydb, df_train.head(10000), table)
I get the error:
RuntimeError: Unable to cast Python instance to C++ type (compile in
debug mode for details)
How do i write the whole dask dataframe to mssql without any errors?
I needed to convert to a maskedarray by changing:
# writing array of values for turbodbc
values_df = [df[col].values for col in df.columns]
to
values_df = [np.ma.MaskedArray(df[col].values, pd.isnull(df[col].values)) for col in df.columns]
I can then write all the data using:
for i in range(df_train.npartitions):
partition = df_train.get_partition(i)
turbo_write(mydb, partition, table)
i += 1
This still takes a long time to write in comparison to saving the file and writing to the DB using BCP. If anyone has any more efficient suggestions I would love to see them.
I have a task to run 8 equal queries (1 query per 1 country) and doing so return data from MySQL database. The reason I can't run 1 query with all countries in one is that each country needs to have different column names. Also, results need to be updated daily with a dynamic date range (last 7 days). Yes, I could run all countries and do the column naming and everything with Pandas but I thought that the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists with all the countries their respective dimensions and date range variables that change according to the current date. The problem I'm having is that MySQL query running in the loop takes so much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from DWH. The things is that I don't know which part exactly is causing the problem and how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
coon = mysql.connector.connect(
host="the host goes here",
user="the user goes here",
passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = coon.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c,s in zip(countries, score_dim):
start_time = time.time()
#Create the query using string formating:
query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
from aio.CreditApplication ca
join aio.ScoringResult sr
on sr.creditApplication_ID = ca.ID
join aio.ScorecardVariableLine svl
on svl.id = sr.scorecardVariableLine_ID
join aio.ScorecardVariable sv
on sv.ID = svl.scorecardVariable_ID
where sv.country='{c}'
#and sv.subType ="asc"
and sv.subType != 'fsc'
and sr.created >= '2020-01-01'
and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
group by ca.id,sv.subType"""
#Check of sql query
print('query is done', time.time()-start_time)
start_time = time.time()
sql = pd.read_sql_query(query, coon)
#check of assigning sql:
print ('sql is assigned',time.time()-start_time)
start_time = time.time()
df = pd.DataFrame(sql
#, columns = ['created','ID','state']
)
#Check the df assignment:
print ('df has been assigned', time.time()-start_time)
#Create a .csv file from the final dataframe:
start_time = time.time()
df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index = False, header=True, encoding='utf-8', sep=';')
#Check csv file creation:
print ('csv has been created',time.time()-start_time)
#Close the session
start_time = time.time()
cursor.close()
#Check the session closing:
print('The cursor is closed',time.time()-start_time)
This example has 4 countries because I tried cutting the amount in half but that doesn't help also. That was me thinking that I have some sort of query restrictions on the DWH end because major slow down always started with the 5th country. Running them separately takes almost the same time for each but it still takes too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but querying time goes up to 140-500, sometimes even more, seconds as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for subquery, bet separately joined tables will have indexes. That improved the time of the whole script dramatically, from ~40 minutes run time to ~ less than 1 minute.
I'm trying to read data from an oracle db.
I have to read on python the results of a simple select that returns a million of rows.
I use the fetchall() function, changing the arraysize property of the cursor.
select_qry = db_functions.read_sql_file('src/data/scripts/03_perimetro_select.sql')
dsn_tns = cx_Oracle.makedsn(ip, port, sid)
con = cx_Oracle.connect(user, pwd, dsn_tns)
start = time.time()
cur = con.cursor()
cur.arraysize = 1000
cur.execute('select * from bigtable where rownum < 10000')
res = cur.fetchall()
# print res # uncomment to display the query results
elapsed = (time.time() - start)
print(elapsed, " seconds")
cur.close()
con.close()
If I remove the where condition where rownum < 10000 the python environment freezes and the fetchall() function never ends.
After some trials I found a limit for this precise select, it works till 50k lines, but it fails if I select 60k lines.
What is causing this problem? Do I have to find another way to fetch this amount of data or the problem is the ODBC connection? How can I test it?
Consider running in batches using Oracle's ROWNUM. To combine back into single object append to a growing list. Below assumes total row count for table is 1 mill. Adjust as needed:
table_row_count = 1000000
batch_size = 10000
# PREPARED STATEMENT
sql = """SELECT t.* FROM
(SELECT *, ROWNUM AS row_num
FROM
(SELECT * FROM bigtable ORDER BY primary_id) sub_t
) AS t
WHERE t.row_num BETWEEN :LOWER_BOUND AND :UPPER_BOUND;"""
data = []
for lower_bound in range(0, table_row_count, batch_size):
# BIND PARAMS WITH BOUND LIMITS
cursor.execute(sql, {'LOWER_BOUND': lower_bound,
'UPPER_BOUND': lower_bound + batch_size - 1})
for row in cur.fetchall():
data.append(row)
You are probably running out of memory on the computer running cx_Oracle. Don't use fetchall() because this will require cx_Oracle to hold all result in memory. Use something like this to fetch batches of records:
cursor = connection.cursor()
cursor.execute("select employee_id from employees")
res = cursor.fetchmany(numRows=3)
print(res)
res = cursor.fetchmany(numRows=3)
print(res)
Stick the fetchmany() calls in a loop, process each batch of rows in your app before fetching the next set of rows, and exit the loop when there is no more data.
What ever solution you use, tune cursor.arraysize to get best performance.
The already given suggestion to repeat the query and select subsets of rows is also worth considering. If you are using Oracle DB 12 there is a newer (easier) syntax like SELECT * FROM mytab ORDER BY id OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY.
PS cx_Oracle does not use ODBC.
I was trying to read a very huge MySQL table made of several millions of rows. I have used Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors
connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')
try:
with connection.cursor() as cursor:
query = "SELECT * FROM example_table;"
chunks=[]
for chunk in pd.read_sql(query, connection, chunksize = 1000):
chunks.append(chunk)
#print(len(chunks))
result = pd.concat(chunks, ignore_index=True)
#print(type(result))
#print(result)
finally:
print("Done!")
connection.close()
Actually the execution time is acceptable if I limit the number of rows to select. But if want to select also just a minimum of data (for example 1 mln of rows) then the execution time dramatically increases.
Maybe is there a better/faster way to select the data from a relational database within python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g if it's a timeseries, or there some kind of appropriate index column to use, it might make sense). I've put in two examples below to show different cases.
Example 1
import pandas as pd
import MySQLdb
def worker(y):
#where y is value in an indexed column, e.g. a category
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
return pd.read_sql(query, connection)
p = multiprocessing.Pool(processes=10)
#(or however many process you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
#assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import pandas as pd
import MySQLdb
import datetime
def worker(a,b):
#where a and b are timestamps
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
query = "SELECT * FROM example_table WHERE x >= {0} AND x < {1}".format(a,b)
return pd.read_sql(query, connection)
p = multiprocessing.Pool(processes=10)
#(or however many process you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this arbitrary here, and will depend on your data /knowing your data before hand (ie. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
data = p.map(worker, date_pairs)
p.close()
results = pd.concat(data)
Probably nicer ways doing this (and haven't properly tested etc). Be interested to know how it goes if you try it.
You could try using a different mysql connector. I would recommend trying mysqlclient which is the fastest mysql connector (by a considerable margin I believe).
pymysql is a pure python mysql client, whereas mysqlclient is wrapper around the (much faster) C libraries.
Usage is basically the same as pymsql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
For those using Windows and having troubles to install MySQLdb. I'm using this way to fetch data from huge table.
import mysql.connector
i = 1
limit = 1000
while True:
sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
cursor.execute(sql)
rows = self.cursor.fetchall()
if not len(rows): # break the loop when no more rows
print("Done!")
break
for row in rows: # do something with results
print(row)
i += limit
Experts,
I am struggling to find an efficient way to work with pandas and sqlite.
I am building a tool that let's users
extract part of a sql database (sub_table) based on some filters
change part of sub_table
upload changed sub_table back to
overall sql table replacing old values
Users will only see excel data (so I need to write back and forth to excel which is not part of my example as out of scope).
Users can
replace existing rows (entries) with new data
delete existing rows
add new rows
Question: how can I most efficiently do this "replace/delete/add" using Pandas / sqlite3?
Here is my example code. If I use df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace") at the bottom than obviously the entire table is replaced...so there must be another way I cannot think of.
import pandas as pd
import sqlite3
import numpy as np
#### SETTING EXAMPLE UP
### Create DataFrame
data = dict({"City": ["London","Frankfurt","Berlin","Paris","Brondby"],
"Population":[8,2,4,9,0.5]})
df = pd.DataFrame(data,index = pd.Index(np.arange(5)))
### Create SQL DataBase
conn = sqlite3.connect("MyDB.db")
### Upload DataFrame as Table into SQL Database
df.to_sql("MyTable", con = conn, index = False, if_exists="replace")
### Read DataFrame from SQL DB
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
#### CREATE SUB_TABLE AND AMEND
#### EXTRACT sub_table FROM SQL TABLE
query = "SELECT * from MyTable WHERE Population > 2"
df_sub = pd.read_sql_query(query, con = conn)
df_sub
#### Amend Sub DF
df_sub[df_sub["City"] == "London"] = ["Brussel",4]
df_sub
#### Replace new data in SQL DB
df_sub.to_sql("MyTable", con = conn, index = False, if_exists="replace")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con = conn)
Thanks for your help!
Note: I did try to achieve via pure SQL queries but gave up. As I am not an expert on SQL I would want to go with pandas if a solution exists. If not a hint on how to achieve this via sql would be great!
I think there is no way around using SQL queries for this task.
With pandas it is only possible to read a query to a DataFrame and to write a DataFrame to a Database (replace or append).
If you want to update specific values/ rows or want to delete rows, you have to use SQL queries.
Commands you should look into are for example:
UPDATE, REPLACE, INSERT, DELETE
# Update the database, change City to 'Brussel' and Population to 4, for the first row
# (Attention! python indices start at 0, SQL indices at 1)
cur = conn.cursor()
cur.execute('UPDATE MyTable SET City=?, Population=? WHERE ROWID=?', ('Brussel', 4, 1))
conn.commit()
conn.close()
# Display the changes
conn = sqlite3.connect("MyDB.db")
query = "SELECT * from MyTable"
pd.read_sql_query(query, con=conn)
For more examples on sql and pandas you can look at
https://www.dataquest.io/blog/python-pandas-databases/