I just hoped you could help me understand this:
In the code below, does chunksize submit lots of queries to the database, or does it run one query and just process the result in chunks one by one?
I have a super slow database with lots of data, so ideally I want to query once but handle the result in chunks to avoid memory overload on the EC2 instance.
Thanks
for data in pd.read_sql('SELECT * FROM TABLE', iconn, chunksize=10000000):
    ...  # process each chunk of rows here
I came across a problem: when I query for a large amount of data (35M rows, 22 GB), the same query gets executed multiple times (e.g. 400 times) in the background. I understand that the data is partitioned/shuffled in some way, but this greatly increases the query cost.
This is how I query for data:
from google.cloud import bigquery
bqclient = bigquery.Client(project)
query_job = bqclient.query(query).result()
df_result = query_job.to_dataframe()
Where project and query are Python strings.
I am using google-cloud-bigquery==2.30.1.
I am looking for any programmatic solutions to reduce query costs. E.g. is there a different class/config/object/method/library that would handle such queries in a better way?
I suspect it's because you're calling result() twice: once when you run query_job = bqclient.query(query).result() and again when you run df_result = query_job.to_dataframe() (by calling query_job again). I'm not sure why it's running so many times, but it probably has to do with how result() works (https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query should have more info).
The "basic" answer to what you want is df_result = bqclient.query(query).to_dataframe(). However, if you're querying a large dataset, this will likely take time. See
Reading from BigQuery into a Pandas DataFrame and performance issues for a better way to do this.
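For illustration, a minimal sketch of the faster path (this assumes the google-cloud-bigquery-storage package is installed; project and table names are placeholders), where the Storage API downloads the finished result in parallel streams instead of paging through the REST API:
from google.cloud import bigquery
from google.cloud import bigquery_storage

project = "my-project"  # placeholder
query = "SELECT * FROM `my-project.my_dataset.my_table`"  # placeholder

bqclient = bigquery.Client(project)
bqstorage_client = bigquery_storage.BigQueryReadClient()

df_result = (
    bqclient.query(query)
    .result()  # wait for the job; the query itself runs once
    .to_dataframe(bqstorage_client=bqstorage_client)
)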
Side note on reducing query costs: in case you're working in a local Python environment, you probably don't want to be processing 22 GB worth of data there. If you're, say, building an ML model, you probably want to extract, say, 1/1000th of your data (a simple LIMIT in SQL won't reduce your query costs; you want to partition your table on a date column and filter on that, OR create a new table with a subset of rows and query that) and work on that. Note that a model trained on your subset of data won't be accurate; the point is just to make sure your Python code works on that data. You'd then deploy your code to a cloud environment and run it on the full data. Good end-to-end example here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/deepdive2/structured/solutions.
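To make the partition-filter idea concrete, a hypothetical sketch (project, dataset, table, and column names are made up; it assumes the table is partitioned on a DATE column, so the filter prunes partitions and reduces the bytes billed, unlike a bare LIMIT):
sample_query = """
    SELECT *
    FROM `my-project.my_dataset.events`      -- hypothetical partitioned table
    WHERE event_date BETWEEN '2021-11-01' AND '2021-11-02'  -- filter on the partition column
"""
df_sample = bqclient.query(sample_query).to_dataframe()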
I have a large database table of more than a million records and Django is taking a really long time to retrieve the data. When I had fewer records, the data was retrieved quickly.
I am using the get() method to retrieve the data from the database. I did try the filter() method, but it gave me the entire table rather than filtering on the given condition.
Currently I retrieve the data using the code shown below:
context['variables'] = variable.objects.get(id=self.kwargs['pk'])
I know why it is slow: it's trying to go through all the records and get the ones whose id matches. But I was wondering if there is a way to restrict the search to the last 100 records, or if there is something I am not doing correctly with the filter() function. Any help would be appreciated.
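For reference, a minimal sketch of how a keyword condition with filter() and queryset slicing typically look in Django (the variable model and pk kwarg are taken from the snippet above; the slice is only there to illustrate limiting a result set):
# filter() needs a keyword condition; called without one it returns the whole table
matches = variable.objects.filter(id=self.kwargs['pk'])

# slicing a QuerySet adds a LIMIT to the SQL, e.g. the 100 newest rows by id
latest_100 = variable.objects.order_by('-id')[:100]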
I would like to back-test some data which will be pulled from a Postgres database, using Python, psycopg2 and Pandas.
The data which will be pulled from Postgres is very large (over 10 GB); my system will not be able to hold this in RAM, even if a Pandas data frame were able to store this much data.
As an overview, I expect my Python program will need to do the following:
1: Connect to a remote (LAN based) Postgres database server
2: Run a basic select query against a database table
3: Store the result of the query in a Pandas data frame
4: Perform calculation operations on the data within the Pandas data frame
5: Write the result of these operations back to an existing table within the database.
I expect the data that will be returned in step 2 will be very large.
Is it possible to stream the result of a large query into a Pandas data frame, so that my Python script can process the data in smaller chunks, say 1 GB at a time?
Any ideas, suggestions or resources you can point to, on how best to do this, or if I am not approaching this in the right way, will be much appreciated, and I am sure that this will be useful to others going forward.
Thank you.
Demo - how to read data from a SQL DB in chunks and process each chunk:
import pandas as pd
from sqlalchemy import create_engine

# conn = create_engine('postgresql://user:password@host:port/dbname')
conn = create_engine('postgresql+psycopg2://user:password@host:port/dbname')

qry = "select * from table where ..."

sql_reader = pd.read_sql(qry, con=conn, chunksize=10**4)

for df in sql_reader:
    # process `df` (chunk of 10,000 rows) here
    ...
UPDATE: very good point from @jeremycg: depending on the exact setup, the OP might also need to use conn.dialect.server_side_cursors = True and conn.execution_options(stream_results=True), as the database driver will otherwise fetch all the results locally and only then stream them to Python in chunks.
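A sketch of how that might look in practice (connection string and query are placeholders; exact option handling can vary with the SQLAlchemy version):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://user:password@host:port/dbname')

# stream_results=True makes psycopg2 use a server-side cursor, so rows are
# fetched from the server as each chunk is consumed instead of all up front
with engine.connect().execution_options(stream_results=True) as conn:
    for df in pd.read_sql("select * from table where ...", con=conn, chunksize=10**4):
        ...  # process each 10,000-row chunk here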
I've just set up a delta-loading data flow between multiple MySQL DBs and a Postgres DB. It's only copying tens of MB every 15 minutes.
Yet, I'd like to set up a process to fully load the data between them in case of emergency...
Python is just crashing and seems not to be fast enough when using SQLAlchemy etc.
I've read that the best approach might be to just dump everything from MySQL into CSV and then use file_fdw to load the entire tables into Postgres.
Has anyone faced a similar issue? If yes, how did you proceed?
Long story short, ORM overhead is killing your performance.
When you're not manipulating the objects involved, it's better to use SQLAlchemy Core expressions ("SQL Expressions"), which are almost as fast as pure SQL.
Solution:
Of course I'm presuming your MySQL and Postgres models have been meticulously synchronized (i.e. values from an object from MySQL are not a problem for creating an object in the Postgres model, and vice versa).
Overview:
get Table objects out of declarative classes
select (SQLAlchemy Expression) from one database
convert rows to dicts
insert into the other database
More or less:
from sqlalchemy import select, and_

# get tables
m_table = ItemMySQL.__table__
pg_table = ItemPG.__table__

# SQL Expression that gets a range of rows quickly
pg_q = select([pg_table]).where(
    and_(
        pg_table.c.id >= id_start,
        pg_table.c.id <= id_end,
    ))

# get PG DB rows
eng_pg = DBSessionPG.get_bind()
conn_pg = eng_pg.connect()
result = conn_pg.execute(pg_q)
rows_pg = result.fetchall()

# get a MySQL connection to run the inserts against
# (assumes a corresponding DBSessionMySQL session exists)
eng_m = DBSessionMySQL.get_bind()
conn_m = eng_m.connect()

for row_pg in rows_pg:
    # convert PG row object into dict
    value_d = dict(row_pg)
    # insert into MySQL (the statement must actually be executed)
    conn_m.execute(m_table.insert().values(**value_d))

# close row proxy object and connections, else suffer leaks
result.close()
conn_pg.close()
conn_m.close()
For background on performance, see the accepted answer (by the SQA principal author himself) to:
Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?
Since you seem to have Python crashing, perhaps you're using too much memory? Hence I suggest reading and writing rows in batches.
A further improvement could be using .values to insert a number of rows in one call; see here: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#inserts-and-updates
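As a sketch only (reusing the names from the snippet above; row_dicts stands in for the per-row dicts built in the loop):
# build all the row dicts first, then insert them in one call
row_dicts = [dict(row_pg) for row_pg in rows_pg]

# executemany-style batch: one call, many rows
conn_m.execute(m_table.insert(), row_dicts)

# alternatively, a single multi-valued INSERT statement:
# conn_m.execute(m_table.insert().values(row_dicts))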
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table in chunks (typically 10,000 records at a time) using this subquery trick, perform the transformation on each row, and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do with the data whatever it wants.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory-consuming.
In this case, it may be better to use mysql_use_result() (C), resp. SSCursor / SSDictCursor (Python). This limits you to consuming the whole result set while doing nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
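For illustration, a sketch of the SSCursor variant (PyMySQL is an assumption here; MySQLdb exposes the same cursor classes, and host/table names are placeholders):
import pymysql
import pymysql.cursors

# SSCursor wraps mysql_use_result(): rows are streamed from the server instead
# of being buffered in the client process (the default Cursor buffers everything)
conn = pymysql.connect(host="dbhost", user="user", password="password",
                       db="mydb", cursorclass=pymysql.cursors.SSCursor)
with conn.cursor() as cur:
    cur.execute("SELECT * FROM big_table")
    # iterate row by row; avoid other queries on this connection in the meantime
    for row in cur:
        ...  # transform and write the row to the export file here
conn.close()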
I don't know exactly what query you have used because you have not given it here, but I suppose you're specifying LIMIT and OFFSET. These are quite quick queries at the beginning of the data, but they become very slow as the offset grows.
If you have a unique column such as ID, you can still fetch only N rows at a time, but modify the query's WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
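A hypothetical sketch of that keyset approach (PyMySQL assumed; table and column names are placeholders):
import pymysql

conn = pymysql.connect(host="dbhost", user="user", password="password", db="mydb")
batch_size = 10000
last_id = 0
with conn.cursor() as cur:
    while True:
        # seek past the last seen id instead of using OFFSET, so every batch
        # stays an index range scan and does not slow down over time
        cur.execute(
            "SELECT * FROM big_table WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        rows = cur.fetchall()
        if not rows:
            break
        for row in rows:
            ...  # transform and write the row here
        last_id = rows[-1][0]  # assumes id is the first selected column
conn.close()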
However, it should generally be faster to simply do
SELECT * FROM table
and open a cursor for that query, with a reasonably big fetch size.
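A short sketch of that single-query variant (again assuming PyMySQL with an unbuffered SSCursor, as in the earlier sketch, so the client never holds the full result set):
import pymysql
import pymysql.cursors

conn = pymysql.connect(host="dbhost", user="user", password="password",
                       db="mydb", cursorclass=pymysql.cursors.SSCursor)
with conn.cursor() as cur:
    cur.execute("SELECT * FROM big_table")
    while True:
        chunk = cur.fetchmany(10000)  # a reasonably big fetch size
        if not chunk:
            break
        for row in chunk:
            ...  # transform and write the row here
conn.close()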