pandas .to_sql timing out with RDS - python

I have a 22 million row .csv file (~850 MB) that I am trying to load into a postgres db on Amazon RDS. It fails every time (I get a time-out error), even when I split the file into smaller parts (each of 100,000 rows) and even when I use chunksize.
All I am doing at the moment is loading the .csv as a dataframe and then writing it to the db using df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000)
I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
Is there anything else I can try?

The pandas DataFrame.to_sql() method is not designed for large inserts, since it does not use the PostgreSQL COPY command.
Regular SQL queries can time out; this is not the fault of pandas, it is controlled by the database server, but it can be modified per connection (see the PostgreSQL documentation and search for 'statement_timeout').
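For example, with the psycopg2 driver you can raise the timeout for just the connection pandas uses by passing a libpq option through connect_args. A minimal sketch, where the connection string and the 10-minute value are only placeholders:

from sqlalchemy import create_engine

# raise statement_timeout to 10 minutes (600000 ms) for this connection only
engine = create_engine(
    'postgresql+psycopg2://user:password@host:5432/dbname',
    connect_args={'options': '-c statement_timeout=600000'},
)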
I would recommend considering Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.

You can write the dataframe to an in-memory string buffer and then load it into the database using the copy_from method in psycopg2, which I believe implements the PostgreSQL COPY command that @firelynx mentions.
import io

# `cursor` and `connection` come from a psycopg2 connection
dboutput = io.StringIO()
rows = output.T.to_dict().values()  # one dict per row of the dataframe
dboutput.write('\n'.join(['\t'.join([row['1_str'],
                                     row['2_str'],
                                     str(row['3_float'])])
                          for row in rows]))
dboutput.seek(0)
cursor.copy_from(dboutput, 'TABLE_NAME')
connection.commit()
where output is a pandas dataframe with columns [1_str, 2_str, 3_float] that you want to write to the database.

Related

pyodbc - slow insert into SQL Server for large dataframe

I have large text files that need parsing and cleaning. I am using a Jupyter notebook to do it. I have a SQL Server DB that I want to insert the data into after it is prepared. I used pyodbc to insert the final dataframe into SQL. df is my dataframe and my SQL insert query is in the variable sqlInsertQuery:
df_records = df.values.tolist()
cursor.executemany(sqlInsertQuery,df_records)
cursor.commit()
For a few rows it works fine, but when I try to insert the whole dataframe at once with the code above, executemany() runs for hours and keeps running until I stop the kernel.
I exported one file/dataframe to an Excel file and it is about 83 MB, as my dataframe contains very large strings and lists.
Someone recommended using fast_executemany instead, but it seems to be faulty.
Others recommended packages other than pyodbc.
Some said it is better not to use Jupyter and to use PyCharm or IPython instead.
I have not figured out the best/fastest way to insert my data into my DB in my case. I am not a developer and I would really appreciate your help with this.
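For what it's worth, the fast_executemany suggestion is a one-line change with plain pyodbc (4.0.19 or newer). A rough sketch, reusing the names from the question, with connection_string standing in for whatever connection you already build; note that fast_executemany is reported to misbehave with very long string columns, which may be the "faulty" behaviour mentioned, so test it on a small slice first.

import pyodbc

conn = pyodbc.connect(connection_string)  # placeholder for the existing connection setup
cursor = conn.cursor()
# ship the parameters to the driver in one batch instead of one round trip per row
cursor.fast_executemany = True

df_records = df.values.tolist()
cursor.executemany(sqlInsertQuery, df_records)
conn.commit()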

Insert 2 million rows into TimescaleDB with Python dataframes

I want to insert about 2 million rows from a csv into PostgreSQL.
There are 2 ways: with dataframes in Python, or directly with a csv import in PostgreSQL.
The Python way:
engine = create_engine("postgresql+psycopg2://postgres:passwd@127.0.0.1/postgres")
con = engine.connect()
df = pd.read_csv(r"C:\2million.csv",delimiter=',',names=['x','y'],skiprows=1)
df.to_sql(name='tablename',con=con,schema='timeseries',if_exists='append',index=False)
print("stored")
Took 800 seconds to insert.
The way with a direct import in PostgreSQL took just 10 seconds.
I thought that the insert time with TimescaleDB would be much faster than 800 seconds for inserting 2 million rows.
Or is the way I am trying to insert the rows simply the limiting factor?
I'm not an expert in TimescaleDB, but I don't think it does anything merely by being installed. You have to invoke it on each table you want to use it for, and you are not doing that. So you are just using plain PostgreSQL here.
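If TimescaleDB is actually wanted here, the table has to be turned into a hypertable explicitly. A rough sketch, hypothetically assuming the x column is the timestamp (run this before loading the data, or look at the migrate_data option of create_hypertable):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:passwd@127.0.0.1/postgres")

with engine.begin() as con:
    # turn the plain table into a TimescaleDB hypertable,
    # partitioned on the (hypothetical) timestamp column x
    con.execute(text("SELECT create_hypertable('timeseries.tablename', 'x');"))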
Pandas' to_sql is infamously slow. By default it inserts one row per INSERT statement, which is quite bad for performance. If you are using a newer version of pandas (>=0.24.0), you can specify to_sql(..., method='multi', chunksize=10000) to make it suck a bit less by putting multiple rows in each INSERT statement. I think pandas implemented it this way, rather than using bulk import, because every database system does bulk import differently.
You are fundamentally misusing pandas. It is a data analysis library, not a database bulk loader library. Not only do you not take advantage of database-specific bulk import features, but you are also parsing the entire csv file into an in-memory dataframe before you start writing any of it to the database.
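If you do stay with pandas, you can at least combine the two points above and stream the csv in chunks instead of parsing it all up front. A rough sketch (pandas >= 0.24.0; names taken from the question, chunk sizes are just examples):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:passwd@127.0.0.1/postgres")

# read the csv 100,000 rows at a time and write each piece with
# multi-row INSERT statements instead of one INSERT per row
reader = pd.read_csv(r"C:\2million.csv", delimiter=',', names=['x', 'y'],
                     skiprows=1, chunksize=100000)
for chunk in reader:
    chunk.to_sql(name='tablename', con=engine, schema='timeseries',
                 if_exists='append', index=False, method='multi', chunksize=10000)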
Here's one way of doing it that's much faster than pandas' to_sql:
(python) df.to_csv('tmp.csv')
(sql) COPY foobar FROM 'tmp.csv' DELIMITER ',' CSV HEADER;
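Note that COPY foobar FROM 'tmp.csv' reads a path on the database server's filesystem. If the server can't see the file, roughly the same thing can be driven from the client with psycopg2's copy_expert, which streams the file over the connection. A sketch, reusing the connection details, file and table name from the question:

import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=passwd")
with conn, conn.cursor() as cur, open(r"C:\2million.csv") as f:
    # streams the csv over the connection; the server never needs file access
    cur.copy_expert("COPY timeseries.tablename FROM STDIN WITH (FORMAT csv, HEADER)", f)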

Processing a large SQL query in Python using Pandas?

I would like to back-test some data which will be pulled from a Postgres database, using Python, psycopg2 and Pandas.
The data which will be pulled from Postgres is very large (over 10 GB); my system will not be able to hold this in RAM, even if a Pandas data frame were able to store this much data.
As an overview, I expect my Python program will need to do the following:
1: Connect to a remote (LAN based) Postgres database server
2: Run a basic select query against a database table
3: Store the result of the query in a Pandas data frame
4: Perform calculation operations on the data within the Pandas data frame
5: Write the result of these operations back to an existing table within the database.
I expect the data that will be returned in step 2 will be very large.
Is it possible to stream the result of a large query into a Pandas data frame, so that my Python script can process the data in smaller chunks, say of 1 GB, for example?
Any ideas, suggestions or resources you can point to, on how best to do this, or if I am not approaching this in the right way, will be much appreciated, and I am sure that this will be useful to others going forward.
Thank you.
Demo - how to read data from a SQL DB in chunks and process each chunk:
import pandas as pd
from sqlalchemy import create_engine

# conn = create_engine('postgresql://user:password@host:port/dbname')
conn = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
qry = "select * from table where ..."
sql_reader = pd.read_sql(qry, con=conn, chunksize=10**4)

for df in sql_reader:
    # process `df` (chunk of 10,000 rows) here
    ...
UPDATE: very good point from @jeremycg: depending on the exact setup, OP might also need to use conn.dialect.server_side_cursors = True and conn.execution_options(stream_results=True), as the database driver will otherwise fetch all the results locally and then stream them to Python in chunks.
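A rough sketch combining the two, assuming the psycopg2 driver; the connection details are placeholders and process() stands in for the calculations of step 4:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')

# stream_results makes the driver use a server-side cursor, so each chunk is
# fetched from the server as it is consumed instead of everything at once
conn = engine.connect().execution_options(stream_results=True)
for df in pd.read_sql("select * from table where ...", conn, chunksize=10**4):
    process(df)  # placeholder for the per-chunk calculations
conn.close()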

SQLAlchemy slow with Redshift

I have a 44k-row table in a pandas DataFrame. When I try to export this table (or any other table) to a Redshift database, the process takes ages. I'm using SQLAlchemy to create a connection like this:
import sqlalchemy as sal
engine = sal.create_engine('redshift+psycopg2://blablamyhost/myschema')
The method I use to export the tables is Pandas to_sql like this:
dat.to_sql(name="olap_comercial", con=engine, schema="monetization", index=False, if_exists="replace", dtype={"description": sal.types.String(length=271), "date_postoffer": sal.types.DATE})
Is it normal that it is so slow? I'm talking about more than 15 minutes.
Yes, it is normal for it to be that slow (and possibly slower for large clusters). Regular SQL inserts (as generated by SQLAlchemy) are very slow for Redshift and should be avoided.
You should consider using S3 as an intermediate staging layer; your data flow will be:
dataframe -> S3 -> Redshift
Ideally, you should also gzip your data before uploading to S3; this will improve performance as well.
This can be coordinated from your Python script using boto3 and psycopg2:
https://boto3.readthedocs.io/en/latest/
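A rough sketch of that flow, assuming boto3 and psycopg2 are installed; the bucket name, IAM role ARN and credentials are placeholders you would replace with your own:

import boto3
import psycopg2

# 1. dump the dataframe to a gzipped csv (no header, so COPY can read it as-is)
dat.to_csv('olap_comercial.csv.gz', index=False, header=False, compression='gzip')

# 2. upload it to S3
boto3.client('s3').upload_file('olap_comercial.csv.gz', 'my-bucket', 'olap_comercial.csv.gz')

# 3. let Redshift ingest it in parallel with its COPY command
conn = psycopg2.connect("host=blablamyhost port=5439 dbname=mydb user=myuser password=mypassword")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY monetization.olap_comercial
        FROM 's3://my-bucket/olap_comercial.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        GZIP CSV;
    """)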

Copying data from postgres to ZODB using pandas - read_csv or read_sql or blaze?

I am creating a new application which uses ZODB and I need to import legacy data, mainly from a postgres database but also from some csv files. There is a limited amount of manipulation needed on the data (SQL joins to merge linked tables and create properties, changing the names of some properties, dealing with empty columns, etc.).
With a subset of the postgres data I did a dump to csv files of all the relevant tables, read these into pandas dataframes and did the manipulation. This works but there are errors which are partly due to transferring the data into a csv first.
I now want to load all of the data in (and get rid of the errors). I am wondering if it makes sense to connect directly to the database and use read_sql or to carry on using the csv files.
The largest table (csv file) is only 8 MB, so I shouldn't have memory issues, I hope. Most of the errors are to do with encoding and/or choice of separator (the data contains |, ;, : and ').
Any advice? I have also read about something called Blaze and wonder if I should actually be using that.
If your CSV files aren't very large (as you say) then I'd try loading everything into postgres with odo, then using blaze to perform the operations, then finally dumping to a format that ZODB can understand. I wouldn't worry about the performance of operations like join inside the database versus in memory at the scale you're talking about.
Here's some example code:
from blaze import odo, Data, join

for csv, tablename in zip(csvs, tablenames):
    odo(csv, 'postgresql://localhost/db::%s' % tablename)

db = Data('postgresql://localhost/db')

# see the blaze documentation for more operations
expr = join(db.table1, db.table2, 'column_to_join_on')

# execute `expr` and dump the result to a CSV file for loading into ZODB
odo(expr, 'joined.csv')
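Since most of the errors apparently come from the detour through csv, it may also be worth reading the postgres tables directly into pandas and skipping the csv dump entirely. A small sketch, reusing the connection string and names from the example above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost/db')

# reading straight from postgres avoids the csv round trip and its
# encoding / separator problems
df1 = pd.read_sql_table('table1', engine)
df2 = pd.read_sql_table('table2', engine)
merged = df1.merge(df2, on='column_to_join_on')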
