I have a 44k-row table in a pandas DataFrame. When I try to export this table (or any other table) to a Redshift database, the process takes ages. I'm using sqlalchemy to create a connection like this:
import sqlalchemy as sal
engine = sal.create_engine('redshift+psycopg2://blablamyhost/myschema')
The method I use to export the tables is Pandas to_sql like this:
dat.to_sql(name="olap_comercial", con=engine, schema="monetization", index=False, if_exists="replace", dtype={"description": sal.types.String(length=271), "date_postoffer": sal.types.DATE})
Is it normal that it is so slow? I'm talking about more than 15 minutes.
Yes, it is normal for it to be that slow (and possibly slower for large clusters). Regular SQL inserts (as generated by sqlalchemy) are very slow against Redshift and should be avoided.
You should consider using S3 as an intermediate staging layer; your data flow will be:
dataframe -> S3 -> Redshift
Ideally, you should also gzip your data before uploading to S3; this will improve performance as well.
This can be coordinated from your Python script using boto3 and psycopg2.
https://boto3.readthedocs.io/en/latest/
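Roughly, that flow can look like the sketch below. The bucket name, key, IAM role ARN and connection details are all placeholders, the target table is assumed to already exist, and dat is the dataframe from your question.

import gzip
import io

import boto3
import psycopg2

# 1) dump the dataframe to a gzipped CSV in memory
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(dat.to_csv(index=False, header=False).encode("utf-8"))
buf.seek(0)

# 2) upload it to S3 (bucket and key are placeholders)
s3 = boto3.client("s3")
s3.upload_fileobj(buf, "my-bucket", "staging/olap_comercial.csv.gz")

# 3) COPY from S3 into Redshift over a psycopg2 connection
conn = psycopg2.connect(host="blablamyhost", port=5439, dbname="mydb",
                        user="myuser", password="mypassword")
cur = conn.cursor()
cur.execute("""
    COPY monetization.olap_comercial
    FROM 's3://my-bucket/staging/olap_comercial.csv.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    CSV GZIP;
""")
conn.commit()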
Say that I have a system that does not support SQL queries. This system can store tabular or maybe even non-tabular data.
This system has a REST API that allows me to access its data objects (a table, for example).
Now, my solution for allowing SQL queries to be executed on this data has been to download the contents of the entire data object (table) into a pandas DataFrame and then use duckdb to execute SQL statements.
The obvious drawback of this method is that I am storing all of this data that I don't even need in a DataFrame before the query is even executed. This can potentially cause memory issues, especially when querying large data objects.
What is a more efficient way to approach this? I am open to approaches using duckdb or otherwise.
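For reference, my current approach looks roughly like this; the REST endpoint and column names here are made up, and duckdb is simply pointed at the in-memory dataframe:

import duckdb
import pandas as pd
import requests

# hypothetical REST endpoint that returns all rows of the object as JSON
rows = requests.get("https://example.com/api/objects/my_table").json()
df = pd.DataFrame(rows)

# duckdb can run SQL directly against a pandas DataFrame that is in scope
result = duckdb.query("SELECT col_a, count(*) AS n FROM df GROUP BY col_a").df()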
I want to insert about 2 million rows from a CSV into PostgreSQL.
There are two ways: with dataframes in Python, or directly with a CSV import in PostgreSQL.
The Python way:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:passwd@127.0.0.1/postgres")
con = engine.connect()
df = pd.read_csv(r"C:\2million.csv", delimiter=',', names=['x', 'y'], skiprows=1)
df.to_sql(name='tablename', con=con, schema='timeseries', if_exists='append', index=False)
print("stored")
Took 800 seconds to insert.
The direct import in PostgreSQL took just 10 seconds.
I thought that the insert time with timescaledb would be much faster than 800 seconds for inserting 2 million rows.
Or is the way I am trying to insert the rows simply the limiting factor?
I'm not an expert in timescaledb, but I don't think it does anything merely by being installed. You have to invoke it on each table you want to use it for, and you are not doing that. So you are just using plain PostgreSQL here.
Pandas' to_sql is infamously slow. By default it issues one INSERT statement per row, which is quite bad for performance. If you are using a newer version of pandas (>= 0.24.0), you can specify to_sql(..., method='multi', chunksize=10000) to make it suck a bit less by putting multiple rows in each INSERT statement, as sketched below. I think pandas implemented it this way, rather than using bulk import, because every database system does bulk import differently.
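A minimal sketch of that call, reusing the df and con from your code:

# batch many rows into each INSERT instead of one statement per row
df.to_sql(name='tablename',
          con=con,
          schema='timeseries',
          if_exists='append',
          index=False,
          method='multi',
          chunksize=10000)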
You are also fundamentally misusing pandas. It is a data analysis library, not a database bulk-loading library. Not only do you not take advantage of database-specific bulk import features, but you are parsing the entire CSV file into an in-memory dataframe before you start writing any of it to the database.
Here's one way of doing it that's much faster than pandas' to_sql:
(python) df.to_csv('tmp.csv')
(sql) COPY foobar FROM 'tmp.csv' DELIMITER ',' CSV HEADER;
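If you want to drive that COPY from the same Python script, and avoid needing the file to sit on the database server, psycopg2's copy_expert can stream it over the connection; the connection details below are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres password=passwd host=127.0.0.1")
with conn, conn.cursor() as cur, open('tmp.csv') as f:
    # HEADER skips the column-name line that df.to_csv() wrote
    cur.copy_expert("COPY foobar FROM STDIN WITH (FORMAT csv, HEADER)", f)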
I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, built from data coming in from a MySQL db every day. The data is inserted into the GBQ table using a Python pandas dataframe (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example, binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
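If you want a feel for what the binlog side looks like in Python, the python-mysql-replication package exposes the stream of row events. This is only a rough sketch; the connection settings and server_id are placeholders, and the actual loading into BigQuery (e.g. batching rows and calling .to_gbq() or a load job) is left as a comment:

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "mysql-host", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,  # must be unique among the server's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,
)

for event in stream:
    for row in event.rows:
        # buffer the change here and periodically flush the batch to BigQuery
        print(event.table, row)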
I am creating a new application which uses ZODB and I need to import legacy data, mainly from a postgres database but also from some CSV files. There is a limited amount of manipulation needed on the data (SQL joins to merge linked tables and create properties, changing the names of some properties, dealing with empty columns, etc.).
With a subset of the postgres data I did a dump to CSV files of all the relevant tables, read these into pandas dataframes and did the manipulation. This works, but there are errors which are partly due to transferring the data through CSV first.
I now want to load all of the data in (and get rid of the errors). I am wondering if it makes sense to connect directly to the database and use read_sql, or to carry on using the CSV files.
The largest table (CSV file) is only 8 MB, so I shouldn't have memory issues, I hope. Most of the errors are to do with encoding and/or the choice of separator (the data contains |, ;, : and ').
Any advice? I have also read about something called Blaze and wonder if I should actually be using that.
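If I go the read_sql route, I assume it would look roughly like this (connection string and table names are placeholders), which would sidestep the CSV encoding/separator problems entirely:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/legacy_db')

# read a whole table ...
df = pd.read_sql_table('some_table', engine)

# ... or push the joins down to postgres and read the result
joined = pd.read_sql_query(
    'SELECT a.*, b.extra_col FROM table_a a JOIN table_b b ON a.id = b.a_id',
    engine)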
If your CSV files aren't very large (as you say) then I'd try loading everything into postgres with odo, then using blaze to perform the operations, then finally dumping to a format that ZODB can understand. I wouldn't worry about the performance of operations like join inside the database versus in memory at the scale you're talking about.
Here's some example code:
from blaze import odo, Data, join

# load each CSV into its own postgres table
for csv, tablename in zip(csvs, tablenames):
    odo(csv, 'postgresql://localhost/db::%s' % tablename)

db = Data('postgresql://localhost/db')

# see the link above for more operations
expr = join(db.table1, db.table2, 'column_to_join_on')

# execute `expr` and dump the result to a CSV file for loading into ZODB
odo(expr, 'joined.csv')
I have a 22 million row .csv file (~850 MB) that I am trying to load into a postgres db on Amazon RDS. It fails every time (I get a timeout error), even when I split the file into smaller parts (each of 100,000 rows), and even when I use chunksize.
All I am doing at the moment is loading the .csv as a dataframe and then writing it to the db using df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000)
I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
Is there anything else I can try?
The pandas DataFrame.to_sql() method is not especially designed for large inserts, since it does not utilize the PostgreSQL COPY command.
Regular SQL queries can time out; it's not the fault of pandas, it's controlled by the database server but can be modified per connection. See this page and search for 'statement_timeout'.
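For example, the timeout can be raised for every connection an engine makes by passing the setting through to psycopg2 (a sketch; the value is in milliseconds and the connection string is a placeholder):

from sqlalchemy import create_engine

# allow statements to run for up to one hour on these connections
engine = create_engine('postgresql://user:password@host/database',
                       connect_args={'options': '-c statement_timeout=3600000'})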
What I would recommend you do is to consider using Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.
You can write the dataframe to a cStringIO buffer and then write that to the database using the copy_from method in psycopg2, which I believe does implement the PostgreSQL COPY command that #firelynx mentions.
import cStringIO

dboutput = cStringIO.StringIO()
# turn the dataframe into a list of row dicts
output = output.T.to_dict().values()
# build one tab-separated line per row
dboutput.write('\n'.join([''.join([row['1_str'], '\t',
                                   row['2_str'], '\t',
                                   str(row['3_float'])
                                   ]) for row in output]))
dboutput.seek(0)
cursor.copy_from(dboutput, 'TABLE_NAME')
connection.commit()
where output is originally a pandas dataframe with columns [1_str, 2_str, 3_float] that you want to write to the database.
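On Python 3 the same idea works with io.StringIO, and DataFrame.to_csv can build the tab-separated buffer for you; this is only a sketch, assuming output is still the original dataframe and the same cursor/connection objects as above:

import io

buf = io.StringIO()
output[['1_str', '2_str', '3_float']].to_csv(buf, sep='\t', header=False, index=False)
buf.seek(0)
cursor.copy_from(buf, 'TABLE_NAME', columns=('1_str', '2_str', '3_float'))
connection.commit()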