I would like to back-test some data which will be pulled from a Postgres database, using Python, psycopg2 and Pandas.
The data which will be pulled from Postgres is very large (over 10 GB); my system will not be able to hold this in RAM, even if a Pandas data frame could store that much data.
As an overview, I expect my Python program will need to do the following:
1: Connect to a remote (LAN based) Postgres database server
2: Run a basic select query against a database table
3: Store the result of the query in a Pandas data frame
4: Perform calculation operations on the data within the Pandas data frame
5: Write the result of these operations back to an existing table within the database.
I expect the data that will be returned in step 2 will be very large.
Is it possible to stream the result of a large query into a Pandas data frame, so that my Python script can process the data in smaller chunks, say 1 GB at a time?
Any ideas, suggestions or resources on how best to do this (or whether I am approaching it the wrong way) would be much appreciated, and I am sure this will be useful to others going forward.
Thank you.
Demo - how to read data from a SQL DB in chunks and process each chunk:
import pandas as pd
from sqlalchemy import create_engine

# conn = create_engine('postgresql://user:password@host:port/dbname')
conn = create_engine('postgresql+psycopg2://user:password@host:port/dbname')
qry = "select * from table where ..."
sql_reader = pd.read_sql(qry, con=conn, chunksize=10**4)
for df in sql_reader:
    # process `df` (a chunk of 10,000 rows) here
    ...
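The demo covers steps 2-4 from the question; below is a minimal sketch of step 5 as well (writing the result of each processed chunk back to an existing table). The 'close'/'signal' columns and the results_table name are hypothetical placeholders.

import pandas as pd
from sqlalchemy import create_engine

conn = create_engine('postgresql+psycopg2://user:password@host:port/dbname')

for df in pd.read_sql("select * from table where ...", con=conn, chunksize=10**4):
    # placeholder calculation - replace with your own back-test logic
    df['signal'] = df['close'].rolling(20).mean()
    # append this chunk's results to an existing table
    df.to_sql('results_table', conn, if_exists='append', index=False)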
UPDATE: a very good point from @jeremycg:
depending on the exact setup, the OP might also need to use
conn.dialect.server_side_cursors = True and
conn.execution_options(stream_results=True), as the database driver
will otherwise fetch all the results locally and only then stream them to
Python in chunks.
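A minimal sketch of the execution_options route (the connection string is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')

# Ask for a server-side cursor so rows are streamed to the client
# instead of being fetched into memory all at once.
conn = engine.connect().execution_options(stream_results=True)

for df in pd.read_sql("select * from table where ...", con=conn, chunksize=10**4):
    ...  # each `df` arrives without the full result set being materialised first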
Related
I came across a problem: when I query for a large amount of data (35M rows, 22 GB), the same query gets executed multiple times (e.g. 400 times) in the background. I understand that the data is partitioned/shuffled in some way. This greatly increases the query cost.
This is how I query for data:
from google.cloud import bigquery
bqclient = bigquery.Client(project)
query_job = bqclient.query(query).result()
df_result = query_job.to_dataframe()
Where project and query are Python strings.
I am using google-cloud-bigquery==2.30.1.
I am looking for any programmatic solutions to reduce query costs. E.g. is there a different class/config/object/method/library that would handle such queries in a better way?
I suspect it's because you're calling result() twice: once when you run query_job = bqclient.query(query).result() and again when you run df_result = query_job.to_dataframe() (which invokes query_job again). I'm not sure why it's running so many times, but it probably has to do with how result() works (https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query should have more info).
The "basic" answer to what you want is df_result = bqclient.query(query).to_dataframe(). However, if you're querying a large dataset, this will likely take time. See
Reading from BigQuery into a Pandas DataFrame and performance issues for a better way to do this.
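For reference, a minimal sketch of that faster path, assuming the google-cloud-bigquery-storage package is installed so that to_dataframe() can download rows via the BigQuery Storage API:

from google.cloud import bigquery

bqclient = bigquery.Client(project=project)

df_result = (
    bqclient.query(query)
    .result()                                     # runs (and bills) the query once
    .to_dataframe(create_bqstorage_client=True)   # download rows via the Storage API
)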
Side note on reducing query costs: if you're working in a local Python environment, you probably don't want to be processing 22 GB of data there. If you're, say, building an ML model, you probably want to extract around 1/1000th of your data and work on that (a simple LIMIT in SQL won't reduce your query costs; you want to either partition your table on a date column and filter on that, or create a new table with a subset of rows and query that). Note that a model trained on the subset won't be accurate; the point is just to make sure your Python code works on that data. You'd then deploy your code to a cloud environment and run it on the full data. Good end-to-end example here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/deepdive2/structured/solutions.
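A sketch of the "create a new table with a subset of rows" idea; the dataset, table and column names below are made up for illustration:

from google.cloud import bigquery

bqclient = bigquery.Client(project=project)

# One-off: materialise a small development subset by filtering on a date column.
subset_sql = """
CREATE OR REPLACE TABLE my_dataset.events_subset AS
SELECT *
FROM my_dataset.events
WHERE event_date BETWEEN '2021-01-01' AND '2021-01-07'
"""
bqclient.query(subset_sql).result()

# Day-to-day development queries then only scan the small subset table.
df_small = bqclient.query("SELECT * FROM my_dataset.events_subset").to_dataframe()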
I just hoped you could help me understand this:
In the code below, does chunksize submit lots of queries to the database, or does it run one query and just process the result in chunks, one by one?
I have a super slow database with lots of data, so ideally I want to query once but handle the result in chunks to avoid memory overload on the EC2 instance.
Thanks
for data in pd.read_sql('SELECT * FROM TABLE', iconn, chunksize=10000000):
    ...  # process each chunk here
I have read access to a SQL Server and I reference 2 separate databases on that server. I need to run a query on a set of filtered ids, ranging from 500 to 10,000 ids depending on the day, received as an Excel spreadsheet and loaded into Python via a pandas DataFrame.
Note, I don't have access to this database via SSMS, so Python is my only way in.
The query is very simple,
query = "SELECT case.id as m, case.owner as o FROM case WHERE case.id = ? "
I put this through a loop and append the data to a list ref:
ref = []
for i in case['id']:
    ref.append(cursor.execute(query, i).fetchone())
or append to a dataframe:
for i in case['id']:
    df = df.append(pd.read_sql_query(query, con, params=[i]))  # append returns a new frame, so reassign
Fairly straightforward, however it is agonizingly slow. Am I doing something wrong here?
I used to do this with Visual Basic and, using arrays and loops, it was blindingly fast.
Any advice on this would be duly appreciated.
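(Not from the thread, but for illustration: the usual way to avoid one round trip per id is to send the whole id list in a single query, e.g. via an IN clause built with one placeholder per id. A rough sketch, reusing the question's names:)

import pandas as pd

ids = list(case['id'])
placeholders = ",".join("?" * len(ids))          # one '?' placeholder per id (pyodbc style)
batched_query = ("SELECT case.id AS m, case.owner AS o "
                 "FROM case WHERE case.id IN (" + placeholders + ")")
df = pd.read_sql_query(batched_query, con, params=ids)   # one round trip instead of thousands

Note that SQL Server caps a statement at 2,100 parameters, so for the longer id lists you would split ids into batches of, say, 2,000 and concatenate the resulting frames.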
I have a 22 million row .csv file (~850mb) that I am trying to load into a postgres db on Amazon RDS. It fails every time (I get a time-out error), even when I split the file into smaller parts (each of 100,000 rows) and even when I use chunksize.
All I am doing at the moment is loading the .csv as a dataframe and then writing it to the db using df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000)
I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
Is there anything else I can try?
The pandas DataFrame.to_sql() method is not especially designed for large inserts, since it does not utilize the PostgreSQL COPY command.
Regular SQL queries can time out; this is not the fault of pandas, it is controlled by the database server, but it can be modified per connection (see this page and search for 'statement_timeout').
What I would recommend is to consider using Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.
You can write the dataframe to an in-memory string buffer and then load it into the database using the copy_from method in psycopg2, which I believe does use the PostgreSQL COPY command that @firelynx mentions.
import cStringIO  # Python 2; on Python 3 use io.StringIO instead

# `output` is the pandas dataframe described below
rows = output.T.to_dict().values()               # list of per-row dicts

dboutput = cStringIO.StringIO()
dboutput.write('\n'.join(
    '\t'.join([row['1_str'], row['2_str'], str(row['3_float'])])
    for row in rows))
dboutput.seek(0)

cursor.copy_from(dboutput, 'TABLE_NAME')         # COPY FROM STDIN under the hood
connection.commit()
where output is originally a pandas dataframe with columns [1_str, 2_str, 3_float] that you want to write to the database.
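For completeness, the snippet above assumes an open psycopg2 connection and cursor along these lines (the connection details are placeholders):

import psycopg2

connection = psycopg2.connect(host='host', port=5432, dbname='dbname',
                              user='user', password='password')
cursor = connection.cursor()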
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table in chunks (typically 10,000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory consuming.
In this case, it may be better to use mysql_use_result() (C), resp. SSCursor / SSDictCursor (Python). This limits you to consuming the whole result set while doing nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
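A minimal sketch of the SSCursor approach with PyMySQL (one of the drivers that provides SSCursor); the connection details, table name and output file are placeholders:

import pymysql
import pymysql.cursors

conn = pymysql.connect(host='host', user='user', password='password',
                       database='dbname',
                       cursorclass=pymysql.cursors.SSCursor)  # unbuffered, server-side

with conn.cursor() as cur, open('export.txt', 'w') as out:
    cur.execute("SELECT * FROM big_table")
    for row in cur:                     # rows are streamed, not cached client-side
        out.write('\t'.join(str(v) for v in row) + '\n')
conn.close()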
I don't know exactly what query you used because you have not given it here, but I suppose you're specifying LIMIT and OFFSET. These queries are quite quick at the beginning of the data, but become very slow later on.
If you have a unique column such as ID, you can still fetch only the first N rows each time, but modify the WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
However, it should generally be faster to simply do
SELECT * FROM table
and open a cursor for that query, with a reasonably big fetch size.
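An illustrative sketch of the keyset approach (WHERE ID > last_id) using PyMySQL; the connection details, table and column names are made up:

import pymysql

conn = pymysql.connect(host='host', user='user', password='password', database='dbname')
cur = conn.cursor()

last_id = 0
batch_size = 10000
while True:
    cur.execute(
        "SELECT id, col1, col2 FROM big_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, batch_size))
    rows = cur.fetchall()
    if not rows:
        break
    # ... transform and write `rows` to the output file here ...
    last_id = rows[-1][0]              # resume after the last id seen

conn.close()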