How to divide pandas dataframe in other blocks of data? [duplicate] - python

I have trouble querying a table of > 5 million records from a MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This works:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other method to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work just to execute a SELECT * query. Surely there is a simpler approach.

As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
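The engine object above is a SQLAlchemy engine. As a minimal sketch (not from the original answer; driver name, server, and credentials are placeholders), it could be created for SQL Server via pyodbc like this:
import pandas as pd
from urllib.parse import quote_plus
from sqlalchemy import create_engine

# build an ODBC connection string and wrap it in a SQLAlchemy engine
odbc_str = quote_plus(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<server>;Database=<database>;UID=<user>;PWD=<password>;"
)
engine = create_engine("mssql+pyodbc:///?odbc_connect=" + odbc_str)

for chunk in pd.read_sql_query("SELECT * FROM My_Table", engine, chunksize=100000):
    process(chunk)  # process() is a placeholder for your own per-chunk logic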

Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET for the paging to be deterministic
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
It might also be that the whole dataframe is simply too large to fit in memory; in that case you will have no option but to restrict the number of rows or columns you're selecting.
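Note that the question is about MS SQL Server, which does not support LIMIT/OFFSET; a minimal sketch of the same chunk-wise loop using T-SQL's OFFSET ... FETCH paging (an assumption on my part, requiring SQL Server 2012 or later and an indexed ID column):
chunk_size = 10000
offset = 0
dfs = []
while True:
    sql = ("SELECT * FROM MyTable ORDER BY ID "
           "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, chunk_size))
    chunk = psql.read_frame(sql, cnxn)
    dfs.append(chunk)
    offset += chunk_size
    if len(chunk) < chunk_size:
        break
full_df = pd.concat(dfs)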

Code solution and remarks.
# Create an empty list
dfl = []

# Create an empty dataframe
dfs = pd.DataFrame()

# Start chunking
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):
    # Append each chunk of the SQL result set to the list
    dfl.append(chunk)

# Concatenate the list of chunks into a single dataframe
dfs = pd.concat(dfl, ignore_index=True)
However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list keeps growing and occupying that memory, resulting in no net gain in free RAM.
Would love to hear what the author / others have to say.
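One way to actually reduce peak memory (a sketch, not the original poster's code) is to process and discard each chunk instead of accumulating every chunk in a list; process_chunk is a placeholder for whatever aggregation, filtering, or file write you need:
for chunk in pd.read_sql(query, con=conct, chunksize=100000):
    process_chunk(chunk)  # e.g. aggregate, filter, or append to a file
    # the chunk goes out of scope here, so its memory can be reclaimed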

The best way I found to handle this is to leverage the SQLAlchemy stream_results connection option:
conn = engine.connect().execution_options(stream_results=True)
and pass the conn object to pandas:
pd.read_sql("SELECT *...", conn, chunksize=10000)
This will ensure that the cursor is handled server-side rather than client-side.
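A minimal end-to-end sketch of this pattern (the connection URL is a placeholder); wrapping the connection in a with block ensures the server-side cursor and connection are closed once iteration finishes:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")

with engine.connect() as conn:
    streaming_conn = conn.execution_options(stream_results=True)
    for chunk in pd.read_sql("SELECT * FROM my_table", streaming_conn, chunksize=10000):
        handle(chunk)  # handle() is a placeholder for your own per-chunk processing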

You can use Server Side Cursors (a.k.a. stream results)
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(
        stream_results=True)

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe w/{len(chunk_dataframe)} rows")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
As mentioned in the comments by others, using the chunksize argument in pd.read_sql("SELECT * FROM users", engine, chunksize=1000) does not solve the problem on its own, as it still loads the whole dataset into memory and then gives it to you chunk by chunk.
More explanation here

chunksize alone still loads all the data in memory; stream_results=True is the answer. It is a server-side cursor that loads the rows in the given chunks and saves memory. I use it in many pipelines; it may also help when you load historical data.
stream_conn = engine.connect().execution_options(stream_results=True)
Use pd.read_sql with the chunksize:
pd.read_sql("SELECT * FROM SOURCE", stream_conn, chunksize=5000)

You can update the Airflow version.
For example, I had that error in version 2.2.3 using docker-compose.
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
mysql 6.7:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "750M"
redis:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "250M"
airflow-webserver:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "750M"
airflow-scheduler:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "750M"
airflow-worker:
    #cpus: "0.5"
    #mem_reservation: "10M"
    #mem_limit: "750M"
error: Task exited with return code Negsignal.SIGKILL
But after updating to the version
FROM apache/airflow:2.3.4
the pulls complete without problems, using the same resources configured in the docker-compose.
My DAG extractor function:
import logging
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def getDataForSchema(table, conecction, tmp_path, **kwargs):
    conn = connect_sql_server(conecction)
    query_count = f"select count(1) from {table['schema']}.{table['table_name']}"
    logging.info(f"query: {query_count}")
    real_count_rows = pd.read_sql_query(query_count, conn)

    ## get the schema of the table
    metadataquery = f"SELECT COLUMN_NAME, DATA_TYPE FROM information_schema.columns \
        where table_name = '{table['table_name']}' and table_schema= '{table['schema']}'"
    #logging.info(f"query metadata: {metadataquery}")
    metadata = pd.read_sql_query(metadataquery, conn)
    schema = generate_schema(metadata)
    #logging.info(f"schema: {schema}")

    # query the table to extract
    query = f"SELECT {table['custom_column_names']} FROM {table['schema']}.{table['table_name']}"
    logging.info(f"query data: {query}")

    chunksize = table["partition_field"]
    data = pd.read_sql_query(query, conn, chunksize=chunksize)
    count_rows = 0
    pqwriter = None
    iteraccion = 0
    for df_row in data:
        print(f"block {iteraccion}, {count_rows} rows so far out of {real_count_rows.iat[0, 0]}")
        #logging.info(df_row.to_markdown())
        if iteraccion == 0:
            parquetName = f"{tmp_path}/{table['table_name']}_{iteraccion}.parquet"
            pqwriter = pq.ParquetWriter(parquetName, schema)
        tableData = pa.Table.from_pandas(df_row, schema=schema, safe=False, preserve_index=True)
        #logging.info(f"tabledata {tableData.column(17)}")
        pqwriter.write_table(tableData)
        #logging.info(f"parquet name:::{parquetName}")
        ## alternative: write the chunk to parquet directly
        #df_row.to_parquet(parquetName)
        iteraccion = iteraccion + 1
        count_rows += len(df_row)
        del df_row
        del tableData
    if pqwriter:
        print("Closing parquet file")
        pqwriter.close()
    del data
    del chunksize
    del iteraccion

Here is a one-liner. I was able to load 49 million records into the dataframe without running out of memory.
dfs = pd.concat(pd.read_sql(sql, engine, chunksize=500000), ignore_index=True)

Full one-liner code using sqlalchemy and the with statement:
import pandas as pd
import sqlalchemy
from sqlalchemy import text
from sqlalchemy.orm import Session

db_engine = sqlalchemy.create_engine(db_url, pool_size=10, max_overflow=20)
with Session(db_engine) as session:
    sql_qry = text("Your query")
    data = pd.concat(
        pd.read_sql(sql_qry, session.connection().execution_options(stream_results=True), chunksize=500000),
        ignore_index=True)
You can try to change chunksize to find the optimal size for your case.

You can use the chunksize option, but you may need to set it to 6-7 digits if you have RAM issues, like this:
df1 = []
for chunk in pd.read_sql(sql, engine, params=(fromdt, todt, filecode), chunksize=100000):
    df1.append(chunk)
dfs = pd.concat(df1, ignore_index=True)

If you just want to limit the number of rows in the output, take the first chunk:
data = next(psql.read_frame(sql, cnxn, chunksize=1000000))
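In recent pandas versions psql.read_frame has been removed; an equivalent sketch with pd.read_sql (assuming a SQLAlchemy engine as in the other answers):
import pandas as pd

# take only the first chunk of 1,000,000 rows
data = next(pd.read_sql(sql, engine, chunksize=1000000))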

Related

Pandas .to_sql() Output exceeds the size limit

I'm using pyodbc and sqlalchemy to insert data into a table in SQL Server, and I'm getting this error:
https://i.stack.imgur.com/miSp9.png
Here are snippets of the functions I use.
This is the function I use to connect to the SQL server (using fast_executemany)
import pyodbc
from urllib.parse import quote_plus
from sqlalchemy import create_engine

def connect(server, database):
    global cnxn_str, cnxn, cur, quoted, engine
    cnxn_str = ("Driver={SQL Server Native Client 11.0};"
                "Server=<server>;"
                "Database=<database>;"
                "UID=<user>;"
                "PWD=<password>;")
    cnxn = pyodbc.connect(cnxn_str)
    cur = cnxn.cursor()
    cur.fast_executemany = True
    quoted = quote_plus(cnxn_str)
    engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted), fast_executemany=True)
And this is the function I'm using to query and insert the data into the SQL server
def insert_to_sql_server():
    global df, np_array

    # Dataframe df is from a numpy array, dtype = object
    df = pd.DataFrame(np_array[1:, ], columns=np_array[0])

    # add new columns, data processing
    df['comp_key'] = df['col1'] + "-" + df['col2'].astype(str)
    df['comp_key2'] = df['col3'] + "-" + df['col4'].astype(str) + "-" + df['col5'].astype(str)
    df['comp_statusID'] = df['col6'] + "-" + df['col7'].astype(str)

    convert_dict = {'col1': 'string', 'col2': 'string', ..., 'col_n': 'string'}

    # convert data types of columns from objects to strings
    df = df.astype(convert_dict)

    connect(<server>, <database>)
    cur.rollback()

    # Delete old records
    cur.execute("DELETE FROM <table>")
    cur.commit()

    # Insert dataframe to table
    df.to_sql(<table name>, engine, index=False,
              if_exists='replace', schema='dbo', chunksize=1000, method='multi')
The insert function runs for about 30 minutes before finally returning the error message.
I encountered no errors when doing it with a smaller df size. The current df size I have is 27963 rows and 9 columns. One thing which I think contributes to the error is the length of the string. By default the numpy array is dtype='<U25', but I had to override this to dtype='object' because it was truncating the text data.
I'm out of ideas because it seems like the error is referring to limitations of either Pandas or the SQL Server, which I'm not familiar with.
Thanks
Thanks for all the input (still new here)! I accidentally stumbled upon the solution, which is reducing the chunksize in df.to_sql from
df.to_sql(chunksize=1000)
to
df.to_sql(chunksize=200)
After some digging, it turns out it's a limitation on the SQL Server side (https://discuss.dizzycoding.com/to_sql-pyodbc-count-field-incorrect-or-syntax-error/)
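The limitation in question is SQL Server's cap of 2100 parameters per statement; with method='multi', each inserted row consumes one parameter per column, so a safe chunksize can be derived from the column count. A minimal sketch (the 2100 figure is the documented SQL Server limit; the rest is illustrative, not the original poster's code):
# each row uses len(df.columns) parameters, so stay under the 2100-parameter cap
safe_chunksize = 2100 // len(df.columns) - 1
df.to_sql('<table name>', engine, index=False, if_exists='replace',
          schema='dbo', chunksize=safe_chunksize, method='multi')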
In my case, I had the same "Output exceeds the size limit" error, and I fixed it by adding method='multi' in df.to_sql(method='multi').
First I tried the "chunksize" solution and it didn't work.
So... check whether you're in the same scenario!
with engine.connect().execution_options(autocommit=True) as conn:
    df.to_sql('mytable', con=conn, method='multi', if_exists='replace', index=True)

Error when writing dask dataframe to mssql via turbodbc

I have a dask dataframe which has 220 partitions and 7 columns. I imported it from a bcp file and completed some wrangling in dask. I then want to write this whole file to mssql using turbodbc. I connect to the DB as follows:
mydb = 'TEST'
from turbodbc import connect, make_options
connection = connect(driver="ODBC Driver 17 for SQL Server",
                     server="TEST SERVER",
                     port="1433",
                     database=mydb,
                     uid="sa",
                     pwd="5pITfir3")
I then use a function I found in a Medium article to write to a test table in the DB:
import time

def turbo_write(mydb, df, table):
    """Use turbodbc to insert data into sql."""
    start = time.time()

    # preparing columns
    columns = '('
    columns += ', '.join(df.columns)
    columns += ')'

    # preparing value place holders
    val_place_holder = ['?' for col in df.columns]
    sql_val = '('
    sql_val += ', '.join(val_place_holder)
    sql_val += ')'

    # writing sql query for turbodbc
    sql = f"""
    INSERT INTO {mydb}.dbo.{table} {columns}
    VALUES {sql_val}
    """
    print(sql)
    print(sql_val)

    # writing array of values for turbodbc
    values_df = [df[col].values for col in df.columns]
    print(values_df)

    # cleans the previous head insert
    with connection.cursor() as cursor:
        cursor.execute(f"delete from {mydb}.dbo.{table}")
        connection.commit()

    # inserts data, for real
    with connection.cursor() as cursor:
        #try:
        cursor.executemanycolumns(sql, values_df)
        connection.commit()
        # except Exception:
        #     connection.rollback()
        #     print('something went wrong')

    stop = time.time() - start
    return print(f'finished in {stop} seconds')
This works when I upload a small number of rows, as follows:
turbo_write(mydb, df_train.head(1000), table)
When I try to insert a larger number of rows, it fails:
turbo_write(mydb, df_train.head(10000), table)
I get the error:
RuntimeError: Unable to cast Python instance to C++ type (compile in
debug mode for details)
How do I write the whole dask dataframe to mssql without any errors?
I needed to convert to a masked array by changing:
# writing array of values for turbodbc
values_df = [df[col].values for col in df.columns]
to
values_df = [np.ma.MaskedArray(df[col].values, pd.isnull(df[col].values)) for col in df.columns]
I can then write all the data using:
for i in range(df_train.npartitions):
    partition = df_train.get_partition(i)
    turbo_write(mydb, partition, table)
    # no need to increment i manually; the for loop handles it
This still takes a long time to write in comparison to saving the file and writing to the DB using BCP. If anyone has any more efficient suggestions, I would love to see them.
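One tweak that may help (a sketch on my part, not from the original post): get_partition returns another dask dataframe, so it can be safer to materialise each partition as a pandas DataFrame before handing it to turbo_write, for example via to_delayed():
# compute each partition to a pandas DataFrame before writing,
# so turbo_write receives plain numpy arrays rather than lazy dask collections
for delayed_part in df_train.to_delayed():
    pdf = delayed_part.compute()
    turbo_write(mydb, pdf, table)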

Fastest way to read huge MySQL table in python

I was trying to read a huge MySQL table made of several million rows. I used the pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
        #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
The execution time is actually acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example, 1 million rows), the execution time increases dramatically.
Maybe there is a better/faster way to select the data from a relational database within Python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking, I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT/OFFSET; performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a timeseries, or there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show different cases.
Example 1
import pandas as pd
import MySQLdb
import multiprocessing

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import pandas as pd
import MySQLdb
import datetime
import multiprocessing

def worker(a, b):
    # where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= {0} AND x < {1}".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary here, and will depend on knowing your data beforehand (i.e. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
# starmap unpacks each (a, b) pair into the two worker arguments
data = p.starmap(worker, date_pairs)
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe).
pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
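To combine this with the chunked reading from the question, the MySQLdb connection can be passed straight to pandas; a minimal sketch (table name and credentials are placeholders):
import pandas as pd
import MySQLdb

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
chunks = []
for chunk in pd.read_sql("SELECT * FROM example_table;", connection, chunksize=10000):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)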
For those using Windows and having trouble installing MySQLdb: I'm using this way to fetch data from a huge table.
import mysql.connector

# connection details are placeholders
connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0  # offset starts at 0 so the first row is included
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):  # break the loop when no more rows
        print("Done!")
        break
    for row in rows:  # do something with results
        print(row)
    i += limit
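If the goal is to end up with a pandas dataframe rather than printed rows, the fetched batches can be collected and converted at the end; a minimal sketch building on the loop above (column names are taken from cursor.description, which mysql.connector populates after execute):
import pandas as pd

all_rows = []
i = 0
limit = 1000
while True:
    cursor.execute("SELECT * FROM super_table LIMIT {}, {}".format(i, limit))
    rows = cursor.fetchall()
    if not rows:
        break
    all_rows.extend(rows)
    i += limit

columns = [desc[0] for desc in cursor.description]
df = pd.DataFrame(all_rows, columns=columns)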

Retrieve huge data table from MySQL within Jupyter notebook

I'm currently trying to fetch 100 million rows from a MySQL table using a Jupyter Notebook. I have made some attempts with the pymysql.cursors provided for opening a MySQL connection. I tried to use batches in order to speed up the query selection process, because selecting all the rows at once is too heavy an operation. Here below is my test:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='XXX',
                             user='XXX',
                             password='XXX',
                             db='XXX',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        print(cursor.execute("SELECT count(*) FROM `table`"))
        count = cursor.fetchone()[0]
        batch_size = 50
        for offset in xrange(0, count, batch_size):
            cursor.execute(
                "SELECT * FROM `table` LIMIT %s OFFSET %s",
                (batch_size, offset))
            for row in cursor:
                print(row)
finally:
    connection.close()
For now the test should just print out each row (which is not very useful in itself), but the best solution in my opinion would be to store everything in a pandas dataframe.
Unfortunately, when I run it, I get this error:
KeyError                         Traceback (most recent call last)
 in ()
      print(cursor.execute("SELECT count(*) FROM `table`"))
----> count = cursor.fetchone()[0]
      batch_size = 50
KeyError: 0
Does anyone have an idea of what the problem might be?
Maybe the use of chunksize would be a better idea?
Thanks in advance!
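A likely explanation (my reading, not stated in the original post): the connection uses pymysql.cursors.DictCursor, so fetchone() returns a dict keyed by column name rather than a tuple, and indexing it with [0] raises KeyError: 0. A minimal sketch of two ways around it:
# with DictCursor, index by the column name (or give the count an alias)
cursor.execute("SELECT count(*) AS n FROM `table`")
count = cursor.fetchone()['n']

# or use the default tuple cursor, where [0] works as expected:
# connection = pymysql.connect(..., cursorclass=pymysql.cursors.Cursor)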
UPDATE
I have rewritten the code without batch_size, storing the query result in a pandas dataframe. It finally seems to run, but of course the execution time seems pretty much 'infinite', due to the fact that the data volume is 100 million rows:
connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM `table`"
        cursor.execute(query)
        cursor.fetchall()
        df = pd.read_sql(query, connection)
finally:
    connection.close()
What would be a correct approach to speed up the process? Maybe by passing chunksize = 250 as a parameter?
Also, if I try to print the type of df, the output says it is a generator; it is not actually a dataframe.
If I print df the output is:
<generator object _query_iterator at 0x11358be10>
How can I get the data in dataframe format? If I print the fetchall result I can see the correct output of the query selection, so up to that point everything works as expected.
If I try to use DataFrame() with the result of the fetchall command I get:
ValueError: DataFrame constructor not properly called!
Another UPDATE
I was able to output the result by iterating pd.read_sql like this:
chunks = []
for chunk in pd.read_sql(query, connection, chunksize=250):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
print(type(result))
#print(result)
And finally I got just one dataframe called result.
Now the questions are:
Is it possible to query all the data without a LIMIT?
What exactly influences the performance of the process?

Can tqdm be used with Database Reads?

While reading large relations from a SQL database to a pandas dataframe, it would be nice to have a progress bar, because the number of tuples is known statically and the I/O rate could be estimated. It looks like the tqdm module has a function tqdm_pandas which will report progress on mapping functions over columns, but by default calling it does not have the effect of reporting progress on I/O like this. Is it possible to use tqdm to make a progress bar on a call to pd.read_sql?
Edit: Answer is misleading - chunksize has no effect on database side of the operation. See comments below.
You could use the chunksize parameter to do something like this:
import pandas as pd
from tqdm import tqdm

chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)
df = pd.DataFrame()
for chunk in tqdm(chunks):
    df = pd.concat([df, chunk])
I think this would use less memory as well.
Yes, you can!
Expanding on the answer here, and on Alex's answer, to include tqdm, we get:
import pandas as pd
from tqdm import tqdm

# get the total number of rows
q = f"SELECT COUNT(*) FROM table"
total_rows = pd.read_sql_query(q, conn).values[0, 0]
# note that the COUNT implementation should not download the whole table.
# some engines will prefer you to use SELECT MAX(ROWID) or whatever...

# read the table with a tqdm status bar
q = f"SELECT * FROM table"
rows_in_chunk = 1_000
chunks = pd.read_sql_query(q, conn, chunksize=rows_in_chunk)
df = tqdm(chunks, total=total_rows / rows_in_chunk)
df = pd.concat(df)
output example:
39%|███▉ | 99/254.787 [01:40<02:09, 1.20it/s]
