Insert 2 million rows into TimescaleDB with Python dataframes - python

I want to insert about 2 million rows from a CSV file into PostgreSQL.
There are two ways:
with dataframes in Python,
or directly with a CSV import in PostgreSQL.
The Python way:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:passwd@127.0.0.1/postgres")
con = engine.connect()
df = pd.read_csv(r"C:\2million.csv", delimiter=',', names=['x', 'y'], skiprows=1)
df.to_sql(name='tablename', con=con, schema='timeseries', if_exists='append', index=False)
print("stored")
That took 800 seconds to insert.
The direct import in PostgreSQL took just 10 seconds.
I thought that the insert time with TimescaleDB would be much faster than 800 seconds for inserting 2 million rows.
Or is the way I am trying to insert the rows simply the limiting factor?

I'm not an expert in TimescaleDB, but I don't think it does anything merely by being installed. You have to enable it for each table you want to use it with (by turning that table into a hypertable), and you are not doing that. So you are just using plain PostgreSQL here.
Pandas' to_sql is infamously slow. By default it inserts one row per INSERT statement, which is quite bad for performance. If you are using a newer version of pandas (>=0.24.0), you can specify to_sql(..., method='multi', chunksize=10000) to make it suck a bit less by putting multiple rows in each INSERT statement. I think pandas implemented it this way, rather than using bulk import, because every database system does bulk import differently.
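For example, a minimal sketch reusing the names from the question (the chunksize is just a suggestion):
# pandas >= 0.24.0: batch many rows into each INSERT statement
df.to_sql(name='tablename', con=con, schema='timeseries',
          if_exists='append', index=False,
          method='multi',      # multiple rows per INSERT
          chunksize=10000)     # rows per batch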
You are fundamentally misusing pandas. It is a data analysis library, not a database bulk-loading library. Not only do you not take advantage of database-specific bulk import features, but you are parsing the entire CSV file into an in-memory dataframe before you start writing any of it to the database.

Here's one way of doing it that's much faster than pandas' to_sql:
(python) df.to_csv('tmp.csv', index=False)
(sql) COPY foobar FROM 'tmp.csv' DELIMITER ',' CSV HEADER;
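If you want to drive the COPY from Python rather than from psql, something along these lines should work with psycopg2 (connection details, table, and file path are placeholders taken from the question):
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres password=passwd host=127.0.0.1")
with conn, conn.cursor() as cur, open(r"C:\2million.csv") as f:
    # COPY ... FROM STDIN streams the file over the client connection,
    # so the CSV does not need to sit on the database server.
    cur.copy_expert(
        "COPY timeseries.tablename (x, y) FROM STDIN WITH (FORMAT csv, HEADER true)", f)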

Related

Efficient SQL Using duckdb

Say that I have a system that does not support SQL queries. This system can store tabular or maybe even non-tabular data.
This system has a REST API that allows me to access its data objects (a table, for example).
Now, my solution for allowing SQL queries to be executed on this data has been to download the contents of the entire data object (table) into a pandas DataFrame and then use duckdb to execute SQL statements.
The obvious drawback of this method is that I am storing all of this data that I don't even need in a DataFrame before the query is even executed. This can potentially cause memory issues, especially when querying large data objects.
What is a more efficient way to approach this? I am open to approaches using duckdb or otherwise.
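For concreteness, the current approach looks roughly like this (the endpoint, object name, and column names are made up):
import duckdb
import pandas as pd
import requests

# Download the entire data object into memory...
records = requests.get("https://example.com/api/objects/my_table").json()
df = pd.DataFrame(records)

# ...then let duckdb run SQL directly against the in-memory DataFrame.
result = duckdb.query("SELECT col_a, count(*) AS n FROM df GROUP BY col_a").to_df()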

Is there a way to set columns to null within dask read_sql_table?

I'm connecting to an Oracle database and trying to bring across a table with roughly 77 million rows. At first I tried using chunksize in pandas, but I always got a memory error no matter what chunksize I set. I then tried using Dask, since I know it's better for large amounts of data. However, there are some columns that need to be made NULL; is there a way to do this within the read_sql_table call, like there is in pandas where you can write out your own SQL query?
Cheers
If possible, I recommend setting this up on the Oracle side, making a view with the correct data types, and using read_sql_table with that.
You might be able to do it directly, since read_sql_table accepts sqlalchemy expressions. If you can phrase it as such, it ought to work.
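A very rough sketch of the direct route (untested; it assumes dask.dataframe.read_sql_table called with a connection URI, and the table, index, and column names are placeholders; the exact signature varies between dask versions):
import dask.dataframe as dd
import sqlalchemy as sa

uri = "oracle+cx_oracle://user:password@host:1521/?service_name=MYDB"

# Ask the database to emit NULL for the columns you want blanked,
# rather than fixing them up afterwards in dask.
columns = [
    sa.column("value"),
    sa.cast(sa.null(), sa.Float).label("col_to_null"),
]

ddf = dd.read_sql_table("my_table", uri, index_col="id", columns=columns)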

Improve CSV push to MySQL with virtual files in python

I am trying to find a way to improve the speed while pushing data to a MySQL database using pandas in python.
After my performance tests I arrived at the same conclusion other people did: the best way to push data to a MySQL database is to use the native LOAD DATA INFILE query instead of the to_sql pandas method (even with improvements like this one or this one).
My problem is that when I want to push my data, it is in memory. So in order to use the native MySQL query, I need to dump it first into a file on disk and then use the LOAD DATA query.
So here is my question: is there a way to 'simulate' a file written on the disk so I can avoid dumping my big files (200 MB+) on it?
It might happen that dumping a big file can take some minutes, so I would not want to lose too much time there...
This approach may be a viable alternative without touching disk (for the load file):
Write code to create multi-row INSERT statements and execute them. I suggest 1000 rows at a time, with autocommit=ON.
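A minimal sketch of that approach, assuming pymysql and a dataframe df with two columns x and y (connection details and names are placeholders):
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="user", password="pw",
                       database="mydb", autocommit=True)
rows = list(df.itertuples(index=False, name=None))
with conn.cursor() as cur:
    for start in range(0, len(rows), 1000):
        batch = rows[start:start + 1000]
        # one INSERT statement carrying up to 1000 rows
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        sql = "INSERT INTO tablename (x, y) VALUES " + placeholders
        cur.execute(sql, [value for row in batch for value in row])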

pandas .to_sql timing out with RDS

I have a 22 million row .csv file (~850 MB) that I am trying to load into a Postgres database on Amazon RDS. It fails every time (I get a timeout error), even when I split the file into smaller parts (each of 100,000 rows) and even when I use chunksize.
All I am doing at the moment is loading the .csv as a dataframe and then writing it to the db using df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000)
I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
Is there anything else I can try?
The pandas DataFrame.to_sql() method is not especially designed for large inserts, since it does not utilize the PostgreSQL COPY command.
Regular SQL queries can time out; it's not the fault of pandas. The timeout is controlled by the database server but can be modified per connection, see this page and search for 'statement_timeout'.
What I would recommend is to consider using Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.
You can write the dataframe to an in-memory string buffer and then write this to the database using the copy_from method in psycopg2, which I believe does implement the PostgreSQL COPY command that @firelynx mentions.
import io

dboutput = io.StringIO()
# `output` starts out as the dataframe; convert it to a list of row dicts.
output = output.T.to_dict().values()
# Build one tab-separated line per row.
dboutput.write('\n'.join(['\t'.join([row['1_str'],
                                     row['2_str'],
                                     str(row['3_float'])])
                          for row in output]))
dboutput.seek(0)
# copy_from streams the buffer to the server via COPY ... FROM STDIN.
cursor.copy_from(dboutput, 'TABLE_NAME')
connection.commit()
where output is originally a pandas dataframe with columns [1_str, 2_str, 3_float] that you want to write to the database, and cursor and connection come from an already-open psycopg2 connection.
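As a variation on the same idea, pandas can produce the tab-separated buffer for you, which tends to be less error-prone than joining the fields by hand (a sketch, with df standing in for the original dataframe):
import io

buf = io.StringIO()
df.to_csv(buf, sep='\t', header=False, index=False)
buf.seek(0)
cursor.copy_from(buf, 'TABLE_NAME')
connection.commit()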

How to export a large table (100M+ rows) to a text file?

I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do with the data whatever it wants.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory-consuming.
In this case, it may be better to use mysql_use_result() (C) resp. SSCursor / SSDictCursor (Python). This restricts you to consuming the whole result set and doing nothing else with the database connection in the meantime. But it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
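A sketch of the server-side cursor approach in Python, assuming pymysql (credentials, table, and output file are placeholders):
import pymysql
import pymysql.cursors

conn = pymysql.connect(host="127.0.0.1", user="user", password="pw",
                       database="mydb", cursorclass=pymysql.cursors.SSCursor)
with conn.cursor() as cur, open("export.txt", "w") as out:
    cur.execute("SELECT * FROM big_table")
    # Rows are streamed from the server instead of being buffered client-side.
    for row in cur:
        out.write('\t'.join(str(v) for v in row) + '\n')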
I don't know exactly what query you have used because you have not given it here, but I suppose you're specifying LIMIT and OFFSET. These queries are quite quick at the beginning of the data, but become very slow as the offset grows.
If you have a unique column such as ID, you can still fetch only N rows at a time, but modify the WHERE clause on each iteration:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
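A rough sketch of that keyset-pagination loop, assuming an open DB-API connection conn (e.g. pymysql) and made-up table and column names:
last_id = 0
batch_size = 10000
with conn.cursor() as cur, open("export.txt", "w") as out:
    while True:
        cur.execute(
            "SELECT id, col_a, col_b FROM big_table "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size))
        rows = cur.fetchall()
        if not rows:
            break
        for row in rows:
            out.write('\t'.join(str(v) for v in row) + '\n')
        last_id = rows[-1][0]   # resume after the last id seen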
However, it should generally be faster to simply do
SELECT * FROM table
and open a cursor for that query, with a reasonably big fetch size.
