Can tqdm be used with Database Reads? - python

While reading large relations from a SQL database into a pandas dataframe, it would be nice to have a progress bar, because the number of tuples is known statically and the I/O rate could be estimated. It looks like the tqdm module has a function tqdm_pandas which reports progress on mapping functions over columns, but by default it does not report progress on I/O like this. Is it possible to use tqdm to make a progress bar on a call to pd.read_sql?

Edit: this answer is misleading - chunksize has no effect on the database side of the operation. See the comments below.
You could use the chunksize parameter to do something like this:
chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)
df = pd.DataFrame()
for chunk in tqdm(chunks):
    df = pd.concat([df, chunk])
I think this would use less memory as well.
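One caveat: pd.concat inside the loop copies the growing dataframe on every iteration. A sketch of the same idea (reusing conn and the same table), which collects the chunks in a list and concatenates once at the end, is usually faster:
chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)
pieces = []
for chunk in tqdm(chunks):
    pieces.append(chunk)
df = pd.concat(pieces, ignore_index=True)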

Yes, you can!
Expanding on the answer here, and Alex's answer, to include tqdm, we get:
# get total number of rows
q = f"SELECT COUNT(*) FROM table"
total_rows = pd.read_sql_query(q, conn).values[0, 0]
# note that the COUNT implementation should not download the whole table.
# some engines will prefer SELECT MAX(ROWID) or similar...
# read table with tqdm status bar
q = f"SELECT * FROM table"
rows_in_chunk = 1_000
chunks = pd.read_sql_query(q, conn, chunksize=rows_in_chunk)
df = tqdm(chunks, total=total_rows / rows_in_chunk)
df = pd.concat(df)
output example:
39%|███▉ | 99/254.787 [01:40<02:09, 1.20it/s]
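If you would rather have the bar count rows than chunks, one hedged variation is to drive tqdm manually with update(); this reuses q, conn, rows_in_chunk and total_rows from above:
pbar = tqdm(total=total_rows, unit='rows')
pieces = []
for chunk in pd.read_sql_query(q, conn, chunksize=rows_in_chunk):
    pieces.append(chunk)
    pbar.update(len(chunk))  # advance the bar by the number of rows in this chunk
pbar.close()
df = pd.concat(pieces, ignore_index=True)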

Related

Batching the bulk updates of millions of rows with peewee

I have an SQLite3 database with a table that has twenty million rows.
I would like to update the values of some of the columns in the table (for all rows).
I am running into performance issues (about only 1'000 rows processed per second).
I would like to continue using the peewee module in python to interact with the
database.
So I'm not sure if I am taking the right approach with my code. After trying some ideas that all failed, I attempted to perform the update in batches. My first solution was to iterate over the cursor with islice, like so:
import math, itertools
from tqdm import tqdm
from cool_project.database import db, MyTable

def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()
total_rows = rows.count()
page_size = 1000
total_pages = math.ceil(total_rows / page_size)

# Start #
with db.atomic():
    for page_num in tqdm(range(total_pages)):
        page = list(itertools.islice(rows, page_size))
        for row in page:
            update_row(row)
        MyTable.bulk_update(page, fields=fields)
This failed, because it would attempt to put the result of the whole query into memory. So I adapted the code to use the paginate function.
import math
from tqdm import tqdm
from cool_project.database import db, MyTable

def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()
total_rows = rows.count()
page_size = 1000
total_pages = math.ceil(total_rows / page_size)

# Start #
with db.atomic():
    for page_num in tqdm(range(1, total_pages + 1)):
        # Get a batch #
        page = MyTable.select().paginate(page_num, page_size)
        # Update #
        for row in page:
            update_row(row)
        # Commit #
        MyTable.bulk_update(page, fields=fields)
But it's still quite slow, and would take >24 hours to complete.
What is strange is that the speed (in rows per second) notably decreases as time goes by. The script starts at ~1000 rows per second, but after half an hour it's down to 250 rows per second.
Am I missing something? Thanks!
The issues are twofold -- you are pulling all results into memory and you are using the bulk update API, which is quite complex/special and which is also completely unnecessary for SQLite.
Try the following:
def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()

with db.atomic():
    for row in rows.iterator():  # Add ".iterator()" to avoid caching rows
        update_row(row)
        row.save(only=fields)
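If you still want the progress bar from the question, wrapping the iterator in tqdm should work; a small sketch reusing the names defined above (the count() call adds one extra COUNT query up front):
from tqdm import tqdm

with db.atomic():
    for row in tqdm(rows.iterator(), total=rows.count()):
        update_row(row)
        row.save(only=fields)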

SAS like Macros in Python

I am very new to the world of Python and have started to learn coding gradually. I am trying to implement all my SAS code in Python to see how it works. One of my programs involves using macros.
The code looks something like this
%macro bank_data (var,month);
proc sql;
create table work_&var. as select a.accountid, a.customerseg. a.product, a.month, b.deliquency_bucket from
table one as a left join mibase.corporate_&month. as b
on a.accountid=b. accountid and a.month=b.&month;
quit
%mend;
% bank_data (1, 202010);
%bank_data(2,202011);
%bank_data(3,202012);
I am quite comfortable with the merging step in Python, but I want to understand how to do this macro step in Python.
Swati, this is a great way to learn Python; I hope that my answer helps.
Background
First, the data structure that best resembles a SAS dataset is the Pandas DataFrame. If you have not installed the Pandas library, I strongly encourage you to do so and follow these examples.
Second, I assume that the table 'one' is already a Pandas DataFrame. If that is not a helpful assumption, you may need to see code to import SAS datasets, assign file paths, or connect to a database, depending on your data management choices.
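For reference, a minimal sketch of that import step, assuming 'one' lives in a one.sas7bdat file on disk (adjust the path and encoding to your setup):
import pandas as pd

# read a SAS dataset from disk into a DataFrame
one = pd.read_sas('one.sas7bdat', format='sas7bdat', encoding='utf-8')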
Also, here is my interpretation of your code:
%macro bank_data(var,month);
proc sql;
create table work.work_&var. as
select a.accountid,
a.customerseg,
a.product,
a.month,
b.deliquency_bucket
from work.one as a
left join mibase.corporate_&month. as b
on a.accountid = b.accountid
and a.month = b.&month;
quit;
%mend;
Pandas Merge
Generally, to do a left join in Pandas, use Dataframe.merge(). If the left table is called "one" and the right table is called "corporate_month", then the merge statement looks as follows. The argument left_on applies to the left-dataset "one" and the right_on argument applies to the right-dataset "corporate_month".
month = 202010
# load the right-hand table first (assumed to come from the corresponding .sas7bdat file)
corporate_month = pd.read_sas('corporate_{}.sas7bdat'.format(month))
work_var = one.merge(right=corporate_month, how='left', left_on=['accountid', 'month'], right_on=['accountid', month])
Dynamic Variable Assignment
Now, to name the resulting dataset based on a variable: SAS macros are simply text replacement, but you cannot use that concept in variable assignment in Python. Instead, if you insist on doing this, you will need to get comfortable with dictionaries. Below is how I would implement your requirement.
var = 1
month = 202010
dict_of_dfs = {}
# as above, the right-hand table is assumed to come from a .sas7bdat file
corporate_month = pd.read_sas('corporate_{}.sas7bdat'.format(month))
work_var = 'work_{}'.format(var)
dict_of_dfs[work_var] = one.merge(right=corporate_month, how='left', left_on=['accountid', 'month'], right_on=['accountid', month])
As a Function
Lastly, to turn this into a function where you pass "var" and "month" as arguments:
dict_of_dfs = {}

def bank_data(var, month):
    corporate_month = pd.read_sas('corporate_{}.sas7bdat'.format(month))  # assumed .sas7bdat source, as above
    work_var = 'work_{}'.format(var)
    dict_of_dfs[work_var] = one.merge(right=corporate_month, how='left', left_on=['accountid', 'month'], right_on=['accountid', month])

bank_data(1, 202010)
bank_data(2, 202011)
bank_data(3, 202012)
Export
If you want to export each of the resulting tables as SAS datasets, look into the SASPy library.
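A rough sketch of what that export could look like, assuming a working SASPy configuration; the session setup and the 'work' libref are placeholders, and the exact call is from memory, so please check the SASPy documentation for your environment:
import saspy

sas = saspy.SASsession()
for name, df in dict_of_dfs.items():
    # df2sd converts a pandas DataFrame into a SAS dataset in the given libref
    sas.df2sd(df, table=name, libref='work')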
If you are trying to run this SQL from Python, I would suggest something like this:
import pyodbc

var = [1, 2, 3]
months = [202010, 202011, 202012]

def bank_data(var, month):
    # search for how to format the connection string for your server
    conn_str = "DRIVER={SQL Server};SERVER=YOURSERVER;DATABASE=YOURDATABASE;Trusted_Connection=yes"
    query = f"""
    proc sql;
    create table work_{var}. as select a.accountid, a.customerseg. a.product, a.month, b.deliquency_bucket from
    table one as a left join mibase.corporate_{month}. as b
    on a.accountid=b. accountid and a.month=b.{month};
    quit
    """
    with pyodbc.connect(conn_str) as conn:
        conn.execute(query)

for v, m in zip(var, months):
    bank_data(v, m)
Also, I got a bit lazy here; you should really parameterize this to prevent SQL injection. See: pyodbc - How to perform a select statement using a variable for a parameter.
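A hedged sketch of what a parameterized query looks like with pyodbc (values go through ? placeholders; the table and column names here are placeholders, since identifiers cannot be parameterized):
safe_query = "SELECT accountid, deliquency_bucket FROM some_table WHERE month = ?"
with pyodbc.connect(conn_str) as conn:
    # the value is passed separately from the SQL text, so it cannot inject SQL
    rows = conn.execute(safe_query, (202010,)).fetchall()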

How to divide pandas dataframe in other blocks of data? [duplicate]

I have trouble querying a table of > 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This works:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other methods to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work to execute a SELECT * query. Surely there is a simpler approach.
As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
It might also be possible that the whole dataframe is simply too large to fit in memory, in that case you will have no other option than to restrict the number of rows or columns you're selecting.
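For instance, selecting only the columns you actually need (col_a and col_b are placeholder names here) already shrinks the result considerably:
sql = "SELECT TOP 2000000 col_a, col_b FROM MyTable"
data = psql.read_frame(sql, cnxn)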
Code solution and remarks.
# Create empty list
dfl = []

# Create empty dataframe
dfs = pd.DataFrame()

# Start chunking
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):
    # Append each chunk from the SQL result set to the list
    dfl.append(chunk)

# Concatenate the list of chunks into one dataframe
dfs = pd.concat(dfl, ignore_index=True)
However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list grows bigger and bigger and occupies that memory, resulting in no net gain in free RAM.
Would love to hear what the author / others have to say.
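If an in-memory dataframe is not actually required, one sketch of a way to keep RAM flat is to persist or aggregate each chunk as it arrives instead of appending it to the list (output.csv is just a placeholder target):
first = True
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):
    # write each chunk out immediately; write the header only once, then append
    chunk.to_csv('output.csv', mode='w' if first else 'a', header=first, index=False)
    first = False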
The best way I found to handle this is to leverage the SQLAlchemy stream_results connection option:
conn = engine.connect().execution_options(stream_results=True)
And then passing the conn object to pandas:
pd.read_sql("SELECT *...", conn, chunksize=10000)
This will ensure that the cursor is handled server-side rather than client-side
You can use Server Side Cursors (a.k.a. stream results)
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(
        stream_results=True)
    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe w/{len(chunk_dataframe)} rows")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
As mentioned in the comments by others, using the chunksize argument in pd.read_sql("SELECT * FROM users", engine, chunksize=1000) does not solve the problem as it still loads the whole data in the memory and then gives it to you chunk by chunk.
More explanation here
chunksize alone still loads all the data into memory; stream_results=True is the answer. It is a server-side cursor that loads the rows in the given chunk size and saves memory. I use it in many pipelines; it may also help when you load historical data.
stream_conn = engine.connect().execution_options(stream_results=True)
Then use pd.read_sql with the chunksize:
pd.read_sql("SELECT * FROM SOURCE", stream_conn, chunksize=5000)
You can also update your Airflow version.
For example, I had that error in version 2.2.3 using docker-compose:
AIRFLOW__CORE__EXECUTOR=CeleryExecutor

mysql:  # 6.7
  cpus: "0.5"
  mem_reservation: "10M"
  mem_limit: "750M"
redis:
  cpus: "0.5"
  mem_reservation: "10M"
  mem_limit: "250M"
airflow-webserver:
  cpus: "0.5"
  mem_reservation: "10M"
  mem_limit: "750M"
airflow-scheduler:
  cpus: "0.5"
  mem_reservation: "10M"
  mem_limit: "750M"
airflow-worker:
  #cpus: "0.5"
  #mem_reservation: "10M"
  #mem_limit: "750M"
error: Task exited with return code Negsignal.SIGKILL
But after updating the image to
FROM apache/airflow:2.3.4
the pulls run without problems, using the same resources configured in docker-compose.
My DAG extractor function:
import logging
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def getDataForSchema(table, conecction, tmp_path, **kwargs):
    conn = connect_sql_server(conecction)

    # count the rows so progress can be reported against the real total
    query_count = f"select count(1) from {table['schema']}.{table['table_name']}"
    logging.info(f"query: {query_count}")
    real_count_rows = pd.read_sql_query(query_count, conn)

    # get the table schema from information_schema
    metadataquery = f"SELECT COLUMN_NAME, DATA_TYPE FROM information_schema.columns \
        where table_name = '{table['table_name']}' and table_schema = '{table['schema']}'"
    #logging.info(f"query metadata: {metadataquery}")
    metadata = pd.read_sql_query(metadataquery, conn)
    schema = generate_schema(metadata)
    #logging.info(f"schema: {schema}")

    # query the table to extract
    query = f"SELECT {table['custom_column_names']} FROM {table['schema']}.{table['table_name']}"
    logging.info(f"query data: {query}")
    chunksize = table["partition_field"]
    data = pd.read_sql_query(query, conn, chunksize=chunksize)

    count_rows = 0
    pqwriter = None
    iteraccion = 0
    for df_row in data:
        print(f"chunk {iteraccion}, {count_rows} rows so far out of {real_count_rows.iat[0, 0]}")
        #logging.info(df_row.to_markdown())
        if iteraccion == 0:
            parquetName = f"{tmp_path}/{table['table_name']}_{iteraccion}.parquet"
            pqwriter = pq.ParquetWriter(parquetName, schema)
        tableData = pa.Table.from_pandas(df_row, schema=schema, safe=False, preserve_index=True)
        #logging.info(f"tabledata {tableData.column(17)}")
        pqwriter.write_table(tableData)
        #logging.info(f"parquet name: {parquetName}")
        # alternatively, write the chunk to parquet directly:
        #df_row.to_parquet(parquetName)
        iteraccion = iteraccion + 1
        count_rows += len(df_row)
        del df_row
        del tableData
    if pqwriter:
        print("Closing parquet file")
        pqwriter.close()
    del data
    del chunksize
    del iteraccion
Here is a one-liner. I was able to load 49 million records into the dataframe without running out of memory.
dfs = pd.concat(pd.read_sql(sql, engine, chunksize=500000), ignore_index=True)
Full code using sqlalchemy and a with block:
db_engine = sqlalchemy.create_engine(db_url, pool_size=10, max_overflow=20)
with Session(db_engine) as session:
    sql_qry = text("Your query")
    data = pd.concat(pd.read_sql(sql_qry, session.connection().execution_options(stream_results=True), chunksize=500000), ignore_index=True)
You can try to change chunksize to find the optimal size for your case.
You can use the chunksize option, but you may need to set it to a 6-7 digit value if you have RAM issues. Like this:
df1 = []
for chunk in pd.read_sql(sql, engine, params=(fromdt, todt, filecode), chunksize=100000):
    df1.append(chunk)
dfs = pd.concat(df1, ignore_index=True)
If you want to limit the number of rows in output, just use:
data = psql.read_frame(sql, cnxn, chunksize=1000000).__next__()

Fastest way to read huge MySQL table in python

I am trying to read a huge MySQL table made up of several million rows. I have used the Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors
connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')
try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
            #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
The execution time is acceptable if I limit the number of rows selected. But if I want to select even a modest amount of data (for example, 1 million rows), then the execution time increases dramatically.
Is there a better/faster way to select the data from a relational database within Python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. It makes sense if there is a useful way to split up the query (e.g. if it's a timeseries, or if there is some kind of appropriate index column to use). I've put in two examples below to show different cases.
Example 1
import multiprocessing
import pandas as pd
import MySQLdb

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import multiprocessing
import pandas as pd
import MySQLdb
import datetime

def worker(a, b):
    # where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= {0} AND x < {1}".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary here, and will depend on knowing your data beforehand (i.e. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)  # starmap so each (a, b) pair is unpacked into worker's two arguments
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
You could try using a different mysql connector. I would recommend trying mysqlclient which is the fastest mysql connector (by a considerable margin I believe).
pymysql is a pure python mysql client, whereas mysqlclient is wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
For those using Windows who have trouble installing MySQLdb: I'm using this approach to fetch data from a huge table.
import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):  # break the loop when no more rows
        print("Done!")
        break
    for row in rows:  # do something with results
        print(row)
    i += limit
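As another answer notes, LIMIT with a growing offset gets slower the deeper you page. A hedged alternative sketch is keyset pagination, assuming super_table has an indexed integer id column and that it is the first column selected:
last_id = 0
limit = 1000
while True:
    # seek past the last id seen instead of skipping an ever-growing offset
    cursor.execute("SELECT * FROM super_table WHERE id > %s ORDER BY id LIMIT %s", (last_id, limit))
    rows = cursor.fetchall()
    if not rows:
        print("Done!")
        break
    for row in rows:
        print(row)
    last_id = rows[-1][0]  # assumes id is the first selected column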

Appending to a Pandas Dataframe From a pd.read_sql Output

I'm coming from R but need to do this in Python for various reasons.
This very well could be a basic PEBKAC issue with my Python more than anything with Pandas, PyODBC or anything else.
Please bear with me.
My current Python 3 code:
import pandas as pd
import pyodbc
cnxn = pyodbc.connect(DSN="databasename", uid = "username", pwd = "password")
querystring = 'select order_number, creation_date from table_name where order_number = ?'
orders = ['1234',
          '2345',
          '3456',
          '5678']

for i in orders:
    print(pd.read_sql(querystring, cnxn, params=[i]))
What I need is a dataframe with the column names of "order_number" and "creation_date."
What the code outputs is a separate small dataframe for each order rather than one combined dataframe (shown as a screenshot in the original post; sorry, I couldn't get the formatting right here).
Having read the dataframe.append page, I tried this:
df = pd.DataFrame()
for i in orders:
    df.append(pd.read_sql(querystring, cnxn, params=[i]))
That appears to run fine (no errors thrown, anyway).
But when I try to output df, I get
Empty DataFrame
Columns: []
Index: []
So surely it must be possible to do a pd.read_sql with params from a list (or tuple, or dictionary, ymmv) and add those results as rows into a pd.DataFrame().
However, I am failing either at my Stack searching, Googling, or Python in general (with a distinct possibility of all three).
Any guidance here would be greatly appreciated.
How about
for i in orders:
    df = df.append(pd.read_sql(querystring, cnxn, params=[i]))
Need to assign the result:
df = df.append(pd.read_sql(querystring, cnxn, params = [i]))
you may try to do it this way:
df = pd.concat([pd.read_sql(querystring, cnxn, params=[i]) for i in orders], ignore_index=True)
so you don't need an extra loop ...
alternatively if your orders list is relatively small, you can select all your rows "in one shot":
querystring = 'select order_number, creation_date from table_name where order_number in ({})'.format(','.join(['?']*len(orders)))
df = pd.read_sql(querystring, cnxn, params=orders)
generated SQL
In [8]: querystring
Out[8]: 'select order_number, creation_date from table_name where order_number in (?,?,?,?)'
You can create the dataframe directly by using the following code:
df = pd.read_sql_query(querystring, cnxn)
