Problem/Context: I am trying to insert a Dask dataframe into a Postgres database, but nothing I try works. I have attempted 20-30 variations of parameters and several different methods and still hit the same errors. This might look like a duplicate, but it isn't.
Error: TypeError: Provided index column is of type "object". If divisions is not provided the index column type must be numeric or datetime. OR cannot process UnicodeResultProcessor objects
Some functions to help get started:
This will help in creating the table in Postgres:
import psycopg2

def create_table(db, table):
    """
    Create the table if it does not exist.
    """
    try:
        conn = psycopg2.connect(f'postgresql://postgres:root@localhost:5432/{db}')
    except psycopg2.OperationalError:
        print("Unable to connect to the database!")
        return
    cur = conn.cursor()
    if table == "one_table":
        try:
            cur.execute("""CREATE TABLE one_table (date date, Sales integer, X_1 integer,
                           X_2 integer, X_3 integer, X_4 integer, X_5 integer);""")
        except Exception as e:
            print(e)
    conn.commit()
    cur.close()
    conn.close()
#Call Fn
create_table('dask_ml_table','one_table')
This will help in creating the dataframe to insert:
from faker import Faker
from dask.dataframe import from_pandas
import pandas as pd
import random
fake = Faker()
def create_rows_faker(num=1):
    output = [{"X_5": fake.name(),
               "X_4": fake.address(),
               "X_3": fake.email(),
               # "bs": fake.bs(),
               "X_2": fake.city(),
               "X_1": fake.state(),
               "date": fake.date_time(),
               # "paragraph": fake.paragraph(),
               # "Conrad": fake.catch_phrase(),
               "Sales": random.randint(1000, 2000)} for x in range(num)]
    return output
df_faker = pd.DataFrame(create_rows_faker(1000))
ddf = from_pandas(df_faker, npartitions=1)
Action Plan:
Insert the generated dataframe above into the Postgres DB
Fetch all rows back into a Dask dataframe
My Approach:
i) First Write
ddf.to_sql('one_table', uri='postgresql://postgres:root@localhost/dask_ml_test',
           if_exists='append', method='multi', index=False)
# (dask's to_sql computes by default, so no separate .compute() call is needed)
Error: TypeError: can't pickle sqlalchemy.cprocessors.UnicodeResultProcessor objects
ii) Read
from sqlalchemy import create_engine, MetaData, Table, Column, select
import dask.dataframe as dd

username = 'postgres'
password = 'XXXX'
server = 'localhost'
database = 'dask_ml_table'
connection_string = f'postgresql+psycopg2://{username}:{password}@{server}/{database}'
engine = create_engine(connection_string)

metadata = MetaData()
t = Table('one_table', metadata,
          Column('date'))
sel = select([t]).limit(5).alias('foo')
dd.read_sql_table(sel, connection_string, index_col='date')
Error: TypeError: Provided index column is of type "object". If divisions is not provided the index column type must be numeric or datetime
iii) 2nd Read
ndf = dd.read_sql_table("select * from one_table", "postgresql://postgres:root@localhost:5432/dask_ml_table", 'date', npartitions=1)
Error: Table not found
Please help. I need to do both the insert and the read from Dask itself. I know this might look like a duplicate, but I have already tried most of the solutions I could find on Stack Overflow and hardly any of them worked.
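For what it's worth, here is a minimal sketch of a round trip that avoids both errors, assuming a reasonably recent dask/SQLAlchemy and that the URI really uses @ (not #) between the password and host. The id column is an assumption added so that read_sql_table has a numeric index to partition on, and if_exists='replace' recreates the table with that extra column.
import dask.dataframe as dd

uri = "postgresql://postgres:root@localhost:5432/dask_ml_table"

# add a plain integer column to index on when reading back
ddf2 = ddf.reset_index().rename(columns={"index": "id"})

# write: dask's to_sql computes by default, so no separate .compute() is needed
ddf2.to_sql("one_table", uri=uri, if_exists="replace", index=False,
            method="multi", parallel=False)

# read back: read_sql_table expects a table name (not a SELECT string) and a
# numeric or datetime index_col, or explicit divisions
ndf = dd.read_sql_table("one_table", uri, index_col="id", npartitions=4)
print(ndf.head())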
I'm losing my sanity over this error. I've uploaded dozens of tables before, but this one keeps giving me the error in the title.
I have a dataframe with 11 columns and an SQL table already set up for it. All the column names match.
import io
import psycopg2

df_rates = df_rates.replace('\t', '', regex=True)  # strip any stray tabs from the data itself

data_to_upload_output = io.StringIO()  # object to store the csv output in
df_rates.to_csv(data_to_upload_output, sep='\t', header=False, index=False,
                date_format='%Y-%m-%d')  # write the dataframe as tab-separated text
data_to_upload_output.seek(0)  # return to start of file

conn = psycopg2.connect(host='xxxxx-xxx-x-x',
                        dbname='xxxx',
                        user=uid,
                        password=pwd,
                        port=xxxx,
                        options="-c search_path=dbo,development")
db_table = 'sandbox.gm_dt_input_dist_rates'

with conn:
    with conn.cursor() as cur:
        # null values become '', column names should be lowercase, at least for PostgreSQL
        cur.copy_from(data_to_upload_output, db_table, null='', columns=df_rates.columns)
    conn.commit()
conn.close()
The error continues saying:
CONTEXT: COPY gm_dt_input_dist_rates, line 43:
"IE00B44CGS96 USD 0.9088 0.9088 10323906 97.2815 97.2815 2022-05-12 2022-05-11 cfsad 2022-05-20"
Which makes me think the \t delimiter isn't being recognized. But this same code works perfectly for all the other tables I'm uploading. I've checked posts with the same error but I couldn't find a way to apply their solutions to what I'm experiencing.
Thanks for your help!
It is much appreciated, have a great weekend!
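One thing worth trying (a hedged sketch, not a confirmed diagnosis): copy_from does not accept a schema-qualified table name in recent psycopg2 releases and gives little control over parsing, while copy_expert with an explicit COPY statement lets you state the delimiter and NULL handling yourself. The table and column names below are just the ones from the question.
# assumes the same df_rates, data_to_upload_output and conn as above
copy_sql = (
    "COPY sandbox.gm_dt_input_dist_rates ({cols}) "
    "FROM STDIN WITH (FORMAT csv, DELIMITER E'\\t', NULL '')"
).format(cols=", ".join(c.lower() for c in df_rates.columns))

with conn:
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql, data_to_upload_output)
conn.close()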
I am trying to update a SQL table with updated information that is in a pandas dataframe.
I have about 100,000 rows to iterate through and it's taking a long time. Is there any way I can make this code more efficient? Do I even need to truncate the data? Most rows will probably be the same.
conn = pyodbc.connect("Driver={xxx};"
                      "Server=xxx;"
                      "Database=xxx;"
                      "Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('TRUNCATE TABLE dbo.Sheet1$')

for index, row in df_union.iterrows():
    print(row)
    cursor.execute("INSERT INTO dbo.Sheet1$ (Vendor, Plant) VALUES (?, ?)", row.Vendor, row.Plant)
conn.commit()
Update: This is what I ended up doing.
params = urllib.parse.quote_plus(r'DRIVER={xxx};SERVER=xxx;DATABASE=xxx;Trusted_Connection=yes')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
df = pd.read_excel('xxx.xlsx')
print("loaded")
df.to_sql(name='tablename',schema= 'dbo', con=engine, if_exists='replace',index=False, chunksize = 1000, method = 'multi')
Don't use a for loop or cursors, just SQL:
insert into TABLENAMEA (A,B,C,D)
select A,B,C,D from TABLENAMEB
Take a look at this link to see another demo:
https://www.sqlservertutorial.net/sql-server-basics/sql-server-insert-into-select/
You just need to update this part to run a normal insert:
conn = pyodbc.connect("Driver={xxx};"
                      "Server=xxx;"
                      "Database=xxx;"
                      "Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute('insert into TABLENAMEA (A,B,C,D) select A,B,C,D from TABLENAMEB')
You don't need to store the dataset in a variable; just run the query directly as normal SQL. Performance will be better than iterating row by row.
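If the rows only exist in the dataframe (so there is no source table to SELECT from), a hedged alternative is pyodbc's fast_executemany, which batches the parameterised INSERT instead of doing one round trip per row; the table and column names below are taken from the question.
cursor = conn.cursor()
cursor.fast_executemany = True  # send the parameters in bulk
cursor.execute('TRUNCATE TABLE dbo.Sheet1$')
cursor.executemany(
    "INSERT INTO dbo.Sheet1$ (Vendor, Plant) VALUES (?, ?)",
    list(df_union[["Vendor", "Plant"]].itertuples(index=False, name=None)),
)
conn.commit()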
I have trouble querying a table of more than 5 million records from a MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory.
This works:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other methods to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work to execute a SELECT * query. Surely there is a simpler approach.
As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql
chunk_size = 10000
offset = 0
dfs = []
while True:
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
It might also be possible that the whole dataframe is simply too large to fit in memory; in that case, you will have no other option than to restrict the number of rows or columns you're selecting.
Code solution and remarks.
# Create an empty list
dfl = []

# Create an empty dataframe
dfs = pd.DataFrame()

# Start chunking
for chunk in pd.read_sql(query, con=conct, chunksize=10000000):
    # Append each chunk of the SQL result set to the list
    dfl.append(chunk)

# Concatenate the chunks in the list into a single dataframe
dfs = pd.concat(dfl, ignore_index=True)
However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list keeps growing and occupying that memory, so there is no net gain in free RAM.
Would love to hear what the author / others have to say.
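If keeping memory flat is the goal, one option (a sketch assuming you only need an aggregate rather than every raw row) is to process each chunk as it arrives and let it go, instead of accumulating the chunks in a list; some_key is a hypothetical column name.
totals = None
for chunk in pd.read_sql(query, con=conct, chunksize=100000):
    # do the per-chunk work here, then let the chunk go out of scope
    agg = chunk.groupby("some_key").size()
    totals = agg if totals is None else totals.add(agg, fill_value=0)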
The best way I found to handle this is to leverage the SQLAlchemy stream_results connection option:
conn = engine.connect().execution_options(stream_results=True)
And pass the conn object to pandas:
pd.read_sql("SELECT *...", conn, chunksize=10000)
This will ensure that the cursor is handled server-side rather than client-side.
You can use Server Side Cursors (a.k.a. stream results):
import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(
        stream_results=True)

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe with {len(chunk_dataframe)} rows")
        # ... do something with the dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()
As mentioned in the comments by others, using the chunksize argument in pd.read_sql("SELECT * FROM users", engine, chunksize=1000) does not solve the problem, as it still loads all the data into memory and then hands it to you chunk by chunk.
More explanation here
chunksize alone still loads all the data into memory; stream_results=True is the answer. It is a server-side cursor that loads the rows in the given chunk size and saves memory. I use it in many pipelines, and it may also help when you load historical data.
stream_conn = engine.connect().execution_options(stream_results=True)
Use pd.read_sql with the chunksize:
pd.read_sql("SELECT * FROM SOURCE", stream_conn, chunksize=5000)
You can update the Airflow version.
For example, I had that error in version 2.2.3 using docker-compose.
AIRFLOW__CORE__EXECUTOR=CeleryExecutor

mysql 6.7
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "750M"
redis:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "250M"
airflow-webserver:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "750M"
airflow-scheduler:
    cpus: "0.5"
    mem_reservation: "10M"
    mem_limit: "750M"
airflow-worker:
    # cpus: "0.5"
    # mem_reservation: "10M"
    # mem_limit: "750M"
error: Task exited with return code Negsignal.SIGKILL
But after updating to the image version
FROM apache/airflow:2.3.4
the DAG performs the pulls without problems, using the same resources configured in the docker-compose.
My DAG extractor function:
import logging
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def getDataForSchema(table, conecction, tmp_path, **kwargs):
    # connect_sql_server and generate_schema are the author's own helpers
    conn = connect_sql_server(conecction)

    # total row count, used only for progress reporting
    query_count = f"select count(1) from {table['schema']}.{table['table_name']}"
    logging.info(f"query: {query_count}")
    real_count_rows = pd.read_sql_query(query_count, conn)

    # get the schema of the table
    metadataquery = f"SELECT COLUMN_NAME, DATA_TYPE FROM information_schema.columns \
        where table_name = '{table['table_name']}' and table_schema = '{table['schema']}'"
    metadata = pd.read_sql_query(metadataquery, conn)
    schema = generate_schema(metadata)

    # query the table to extract
    query = f"SELECT {table['custom_column_names']} FROM {table['schema']}.{table['table_name']}"
    logging.info(f"query data: {query}")

    chunksize = table["partition_field"]
    data = pd.read_sql_query(query, conn, chunksize=chunksize)

    count_rows = 0
    pqwriter = None
    iteraccion = 0
    for df_row in data:
        print(f"block {iteraccion}, {count_rows} rows so far out of {real_count_rows.iat[0, 0]}")
        if iteraccion == 0:
            parquetName = f"{tmp_path}/{table['table_name']}_{iteraccion}.parquet"
            pqwriter = pq.ParquetWriter(parquetName, schema)
        tableData = pa.Table.from_pandas(df_row, schema=schema, safe=False, preserve_index=True)
        pqwriter.write_table(tableData)
        iteraccion = iteraccion + 1
        count_rows += len(df_row)
        del df_row
        del tableData
    if pqwriter:
        print("Closing parquet file")
        pqwriter.close()
    del data
    del chunksize
    del iteraccion
Here is a one-liner. I was able to load 49 million records into the dataframe without running out of memory.
dfs = pd.concat(pd.read_sql(sql, engine, chunksize=500000), ignore_index=True)
Full code using SQLAlchemy and the with statement:
import pandas as pd
import sqlalchemy
from sqlalchemy import text
from sqlalchemy.orm import Session

db_engine = sqlalchemy.create_engine(db_url, pool_size=10, max_overflow=20)
with Session(db_engine) as session:
    sql_qry = text("Your query")
    data = pd.concat(pd.read_sql(sql_qry, session.connection().execution_options(stream_results=True),
                                 chunksize=500000), ignore_index=True)
You can try changing chunksize to find the optimal size for your case.
You can use the chunksize option, but you may need to set it to 6-7 digits if you have a RAM issue, then do this:
df1 = []
for chunk in pd.read_sql(sql, engine, params=(fromdt, todt, filecode), chunksize=100000):
    df1.append(chunk)
dfs = pd.concat(df1, ignore_index=True)
If you want to limit the number of rows in the output, just use:
data = psql.read_frame(sql, cnxn, chunksize=1000000).__next__()
I'm currently trying to fetch 100 million rows from a MySQL table using a Jupyter Notebook. I have made some attempts with pymysql.cursors to open a MySQL connection. I tried to use batches in order to speed up the selection process, because selecting all the rows at once is far too heavy an operation. Here is my test:
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='XXX',
                             user='XXX',
                             password='XXX',
                             db='XXX',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    with connection.cursor() as cursor:
        print(cursor.execute("SELECT count(*) FROM `table`"))
        count = cursor.fetchone()[0]
        batch_size = 50
        for offset in xrange(0, count, batch_size):
            cursor.execute(
                "SELECT * FROM `table` LIMIT %s OFFSET %s",
                (batch_size, offset))
            for row in cursor:
                print(row)
finally:
    connection.close()
For now the test just prints out each row (not very useful in itself), but the best solution in my opinion would be to store everything in a pandas dataframe.
Unfortunately, when I run it I get this error:
KeyError                                  Traceback (most recent call last)
      print(cursor.execute("SELECT count(*) FROM `table`"))
----> count = cursor.fetchone()[0]
      batch_size = 50

KeyError: 0
Does anyone have an idea of what the problem might be?
Maybe using chunksize would be a better idea?
Thanks in advance!
UPDATE
I have rewritten the code without batch_size, storing the query result in a pandas dataframe. It finally seems to run, but of course the execution time is practically 'infinite' because there are 100 million rows of data:
connection = pymysql.connect(user='XXX', password='XXX', database='XXX', host='XXX')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM `table`"
        cursor.execute(query)
        cursor.fetchall()
    df = pd.read_sql(query, connection)
finally:
    connection.close()
What would be the correct approach to speed up the process? Maybe passing chunksize=250 as a parameter?
Also, if I print the type of df, the output says it is a generator, not a dataframe.
If I print df, the output is:
<generator object _query_iterator at 0x11358be10>
How can I get the data in dataframe format? If I print the result of the fetchall command I can see the correct output of the query, so up to that point everything works as expected.
If I try to use DataFrame() with the result of the fetchall command I get:
ValueError: DataFrame constructor not properly called!
Another UPDATE
I was able to output the result by iterating pd.read_sql like this:
chunks = []
for chunk in pd.read_sql(query, connection, chunksize=250):
    chunks.append(chunk)

result = pd.concat(chunks, ignore_index=True)
print(type(result))
# print(result)
And finally I got just one dataframe called result.
Now the questions are:
Is it possible to query all the data without a LIMIT?
What exactly influences the performance of the process?
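One possible answer to the first question (a hedged sketch, not a verified benchmark): pymysql's SSCursor is an unbuffered, server-side cursor, so rows are streamed from the server instead of being buffered on the client, which lets you drop the LIMIT/OFFSET batching entirely. The connection parameters are placeholders and the fetchmany size is arbitrary.
import pandas as pd
import pymysql

# unbuffered, server-side cursor: rows stream from the server instead of being
# materialized on the client all at once
connection = pymysql.connect(host='XXX', user='XXX', password='XXX', db='XXX',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.SSCursor)
chunks = []
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM `table`")
        columns = [col[0] for col in cursor.description]
        while True:
            rows = cursor.fetchmany(100000)  # arbitrary batch size
            if not rows:
                break
            chunks.append(pd.DataFrame(rows, columns=columns))
finally:
    connection.close()

result = pd.concat(chunks, ignore_index=True)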