Python BigQuery client executing same query multiple times - python

I ran into a problem: when I query a large amount of data (35M rows, 22 GB), the same query gets executed multiple times (e.g. 400 times) in the background. I understand that the data is partitioned/shuffled in some way, but this greatly increases the query cost.
This is how I query for data:
from google.cloud import bigquery
bqclient = bigquery.Client(project)
query_job = bqclient.query(query).result()
df_result = query_job.to_dataframe()
Where project and query are Python strings.
I am using google-cloud-bigquery==2.30.1.
I am looking for any programmatic solution to reduce query costs. For example, is there a different class, config, object, method, or library that would handle such queries in a better way?

I suspect it's because you're calling result() twice: once when you run query_job = bqclient.query(query).result() and once when you run df_result = query_job.to_dataframe() (by calling query_job again). I'm not sure why it runs so many times, but it probably has to do with how result() works (https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query should have more info).
The "basic" answer to what you want is df_result = bqclient.query(query).to_dataframe(). However, if you're querying a large dataset, this will likely take time. See "Reading from BigQuery into a Pandas DataFrame and performance issues" for a better way to do this.
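A minimal sketch of the single-call form, assuming client is a google.cloud.bigquery.Client. The helper name query_to_dataframe is made up for illustration; it also returns the job ID and the billed bytes so you can check in the job history whether more than one job was started.

```python
def query_to_dataframe(client, sql):
    """Run `sql` as one query job and download its result as a DataFrame.

    `client` is assumed to be a google.cloud.bigquery.Client; this helper
    name is hypothetical and not part of the library.
    """
    job = client.query(sql)      # submits a single query job
    df = job.to_dataframe()      # waits for the job, then downloads the rows
    # total_bytes_billed may be None until the job statistics are available
    return df, job.job_id, job.total_bytes_billed
```

Usage would look like df_result, job_id, billed = query_to_dataframe(bqclient, query).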
Side note on reducing query costs: if you're working in a local Python environment, you probably don't want to process 22 GB of data there. If you're, say, building an ML model, you probably want to extract about 1/1000th of your data and work on that. A simple LIMIT in SQL won't reduce your query costs; instead, partition your table on a date column and filter on that, or create a new table with a subset of rows and query that. Note that a model trained on your subset won't be accurate; the point is just to make sure your Python code works on that data. You'd then deploy your code to a cloud environment and run it on the full data. Good end-to-end example here: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/deepdive2/structured/solutions.

Related

Reading from BigQuery into a Pandas DataFrame and performance issues

I have around 4M (million) rows that I am reading into a dataframe from BQ, but it no longer seems to work. As I cannot identify anything that has changed, I want to know whether there is anything I can change in the code to make it more performant.
My code is the following:
def get_df_categories(table_name):
    query = """
    select cat, ref, engine from `{table_name}`
    """.format(table_name=table_name)
    df = client.query(query).to_dataframe()
    return df
It is better to read the table in batches with the list_rows method. That way you can use multiple threads to read fixed-size chunks of data, which lets you see output much sooner and handle heavy data loads systematically. You can also pass the fields you want in the output (selected_fields), which plays the role of the column names in the select clause of your SQL query.
Here is the documentation that will help you get started: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html
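A rough sketch of reading in batches with list_rows, assuming client is a google.cloud.bigquery.Client and table_id is a "project.dataset.table" string. The function name and the page size are illustrative, not part of the library.

```python
def iter_table_pages(client, table_id, columns, page_size=50_000):
    """Yield a table's rows page by page, restricted to `columns`.

    `client` is assumed to be a google.cloud.bigquery.Client; the function
    name and the default page size are illustrative.
    """
    table = client.get_table(table_id)
    # selected_fields takes SchemaField objects, so pick them from the schema
    fields = [field for field in table.schema if field.name in columns]
    rows = client.list_rows(table, selected_fields=fields, page_size=page_size)
    for page in rows.pages:
        # convert each Row of the current page into a plain dict
        yield [dict(row.items()) for row in page]
```

For the table in the question this would be called as iter_table_pages(client, table_name, {"cat", "ref", "engine"}). Note that list_rows reads the stored table rows as they are, so it cannot apply a where clause.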

How to efficiently query a large database on an hourly basis?

Background:
I have one asset table per city stored in a Redshift database, 8 cities in total. These asset tables display status updates on an hourly basis. That makes 8 SQL tables and about 500 million rows of data per year.
(I also have access to the server that updates this data every minute.)
Example: One market can have 20k assets displaying 480k (20k*24 hrs) status updates a day.
These status updates are in a raw format and need to undergo a transformation process that is currently written in a SQL view. The end state is going into our BI tool (Tableau) for external stakeholders to look at.
Problem:
The current way the data is processed is slow and inefficient, and probably not realistic to run this job on an hourly basis in Tableau. The status transformation requires that I look back at 30 days of data, so I do need to look back at the history throughout the query.
Possible Solutions:
Here are some solutions that I think might work, I would like to get feedback on what makes the most sense in my situation.
Run a Python script as a cron job that looks at the most recent update, queries the last 30 days of the large history table, and writes the result to a table in the Redshift database.
Materialize the SQL view and run an incremental refresh every hour
Put the view in Tableau as a datasource and run an incremental refresh every hour
Please let me know how you would approach this problem. My knowledge is in SQL, limited Data Engineering experience, Tableau (Prep & Desktop) and scripting in Python or R.
So first things first: you say that the data processing is "slow and inefficient" and ask how to efficiently query a large database. First I'd look at how to improve this process. You indicate that the process is based on the past 30 days of data: are the large tables sorted by time, vacuumed and analyzed? It is important to take maximum advantage of metadata when working with large tables. Make sure your where clauses are effective at eliminating fact table blocks; don't rely on dimension table where clauses to select the date range.
Next look at your distribution keys and how these are impacting the need for your critical query to move large amounts of data across the network. The internode network has the lowest bandwidth in a Redshift cluster and needlessly pushing lots of data across it will make things slow and inefficient. Using EVEN distribution can be a performance killer depending on your query pattern.
Now let me get to your question and let me paraphrase - "is it better to use summary tables, materialized views, or external storage (tableau datasource) to store summary data updated hourly?" All 3 work and each has its own pros and cons.
Summary tables are good because you can select the distribution of the data storage, and if this data needs to be combined with other database tables it can be done most efficiently. However, there is more data management to be performed to keep this data up to date and in sync.
Materialized views are nice as there is a lot less management action to worry about: when the data changes, just refresh the view. The data is still in the database, so it is easy to combine with other data tables, but since you don't have control over storage of the data these actions may not be the most efficient.
External storage is good in that the data is in your BI tool, so if you need to refetch the results during the hour the data is local. However, it is now locked into your BI tool and far less efficient to combine with other database tables.
Summary data usually isn't that large, so how it is stored isn't a huge concern, and I'm a bit lazy, so I'd go with a materialized view. Like I said at the beginning, I'd first look at the "slow and inefficient" queries I'm running every hour.
Hope this helps

How to get an individual row from a BigQuery table in less than a second?

I have an aggregated data table in BigQuery that has millions of rows. This table is growing every day.
I need a way to get 1 row from this aggregate table in milliseconds to append data to a real-time event.
What is the best way to tackle this problem?
BigQuery is not built to respond in milliseconds, so you need another solution in between. It is perfectly fine to use BigQuery to do the large aggregation calculation, but you should never serve directly from BQ when the response time has to be a matter of milliseconds.
Also be aware that, if this is a web application for example, many reloads of a page could cost you a lot of money, as you pay per query.
There are many architectural solutions to fix such issues, but which one you should use is hard to tell without any project context and objectives.
For realtime data we often use PubSub to connect somewhere in between, but that might be an issue if the (near) realtime demand is an aggregate.
You could also use the materialized view concept by exporting the aggregated data to a sub-component, for example Cloud Storage -> PubSub, or a SQL instance / Memorystore, or any other kind of microservice.
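As one illustration of the "something in between" idea, here is a small in-process cache sketch. It assumes a hypothetical load_rows callable that returns the whole aggregate as a {key: row} dict (for example from one bulk BigQuery query); in practice a store such as Memorystore or a SQL instance would usually take the place of the Python dict.

```python
import time


class AggregateCache:
    """Serve pre-aggregated rows from memory, refreshing them periodically.

    `load_rows` is a hypothetical callable returning {key: row}; it is the
    only place where the warehouse would be queried.
    """

    def __init__(self, load_rows, max_age_seconds=300, clock=time.monotonic):
        self._load_rows = load_rows
        self._max_age = max_age_seconds
        self._clock = clock
        self._rows = {}
        self._loaded_at = None

    def get(self, key):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at > self._max_age:
            self._rows = self._load_rows()  # one bulk load per refresh
            self._loaded_at = now
        return self._rows.get(key)          # plain dict lookup, no query
```

Each lookup is then a dictionary access, and the number of warehouse queries depends on the refresh interval rather than on the number of requests.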

Continuous aggregates over large datasets

I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has about (order of) 10^5 rows and adds new in the order of 10^2 every day.
Table B has on the order of 10^6 rows and adds new at 10^3 every day. There's a one to many relation from A to B (many B rows for some row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like to have a job that runs every ~10mins and does this: For every row in A, find every row in B related to it that were created in the last day, week and month (and then sort by count) and save them in a different DB or cache them.
If this is confusing, here's a practical example: Say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with highest reviews in the last 4hrs, day, week etc. New products and reviews are added at a fast pace, and we'd like the said list to be as up-to-date as possible.
Current implementation I have is just a for loop (pseudo-code):
result = {}
for product in db_products:
    reviews = db_reviews(product_id=product.id, create>=some_time)
    reviews_count = len(reviews)
    result[product] = {'reviews': reviews, 'reviews_count': reviews_count}
sort(result, by=reviews_count)
return result
I do this every hour, and save the result in a json file to serve. The problem is that this doesn't really scale well, and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
You have two larger tables in a database and need to regularly create aggregates for past time periods (hour, day, week, etc.) and store the results in another database.
I will assume that once a time period is past, there are no changes to the related records; in other words, the aggregate for a past period always has the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
Write a simple Task instance, which defines the required input data, the output data (called the Target), and the process that creates the target output.
Tasks can be parametrized; a typical parameter is a time period (a specific day, hour, week, etc.).
Luigi can stop tasks in the middle and start them again later. It considers any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to make it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop, on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity: for example, files are first created in a temporary location and moved to the final destination only after they are completed.
In databases there are structures to denote that a database import is completed.
You are free to create your own target implementations (each has to create something and provide an exists method to check whether the result exists).
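The "if the target exists, the task is done" rule and the atomic file write can be illustrated with the standard library alone. This is only a sketch of the concept, not Luigi's API; the function name run_if_missing and the compute callable are made up for the example.

```python
import os
import tempfile


def run_if_missing(target_path, compute):
    """Run `compute` only if `target_path` does not exist yet.

    Returns True if the task ran, False if the target was already there.
    """
    if os.path.exists(target_path):
        return False  # target exists, so the task counts as done
    directory = os.path.dirname(os.path.abspath(target_path))
    # write to a temporary file in the same directory, then move it into place
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(compute())
        os.replace(tmp_path, target_path)  # atomic rename on one filesystem
    except BaseException:
        os.unlink(tmp_path)  # never leave a half-written file behind
        raise
    return True
```

Calling it twice with the same path runs compute only once, which is the behaviour that makes rerunning a whole pipeline for past periods cheap.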
Using Luigi for your task
For the task you describe you will probably find everything you need already present. Just a few tips:
The class luigi.postgres.CopyToTable allows storing records in a Postgres database. The target automatically creates a so-called "marker table" where it marks all completed tasks.
There are similar classes for other types of databases; one of them uses SQLAlchemy, which will probably cover the database you use, see class luigi.contrib.sqla.CopyToTable.
The Luigi documentation has a working example of importing data into an SQLite database.
A complete implementation is beyond what is feasible in a StackOverflow answer, but I am sure you will experience the following:
The code to do the task is really clear - no boilerplate coding, just write only what has to be done.
Nice support for working with time periods, even from the command line, see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating so many tasks that they overload your servers (the default values are very reasonable and can be changed).
An option to run the tasks on multiple servers (using the central scheduler, which is provided with Luigi).
I have processed huge amounts of XML files with Luigi and also written some tasks that import aggregated data into a database, and I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task takes too long because of the database query, you have a few options:
If you are counting reviews per product in Python, consider trying an SQL query instead; it is often much faster. It should be possible to write an SQL query that uses count on the proper records and directly returns the number you need. With group by you can even get the summary information for all products in one run.
Set up a proper index, probably on the "reviews" table on the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
With an optimized SQL query, you may get a working solution even without Luigi.
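A small sketch of the count / group by approach, using an in-memory SQLite database so it runs anywhere. The table and column names (reviews, product_id, created) are assumptions for the example, not taken from the question's schema.

```python
import sqlite3


def review_counts(conn, since):
    """Return (product_id, review_count) pairs, most-reviewed first."""
    return conn.execute(
        """
        SELECT product_id, COUNT(*) AS reviews_count
        FROM reviews
        WHERE created >= ?
        GROUP BY product_id
        ORDER BY reviews_count DESC, product_id
        """,
        (since,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (id INTEGER PRIMARY KEY, product_id INTEGER, created TEXT)"
)
# index on the time column and the product column used by the query
conn.execute("CREATE INDEX idx_reviews_created_product ON reviews (created, product_id)")
conn.executemany(
    "INSERT INTO reviews (product_id, created) VALUES (?, ?)",
    [(1, "2024-01-01"), (1, "2024-01-03"), (2, "2024-01-03"), (1, "2024-01-04")],
)
print(review_counts(conn, "2024-01-02"))  # [(1, 2), (2, 1)]
```

One query replaces the per-product loop, and the database does the counting and sorting.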
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating Summary Tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
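A sketch of updating a summary table at insert time. It uses SQLite's ON CONFLICT ... DO UPDATE (available in SQLite 3.24 and later) as a stand-in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE; the table and column names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE review_summary (
        day TEXT NOT NULL,
        product_id INTEGER NOT NULL,
        reviews_count INTEGER NOT NULL,
        PRIMARY KEY (day, product_id)
    )
    """
)


def record_review(conn, day, product_id):
    """Bump the per-day counter for a product when a review is inserted."""
    conn.execute(
        """
        INSERT INTO review_summary (day, product_id, reviews_count)
        VALUES (?, ?, 1)
        ON CONFLICT (day, product_id)
        DO UPDATE SET reviews_count = reviews_count + 1
        """,
        (day, product_id),
    )


for day, product_id in [("2024-01-03", 1), ("2024-01-03", 1), ("2024-01-03", 2)]:
    record_review(conn, day, product_id)
```

Reports for a day, week or month then sum over a handful of summary rows per product instead of scanning the raw reviews.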
I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is indexed as well, since you are using it in your WHERE clause.

How to export a large table (100M+ rows) to a text file?

I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() on the C level, which corresponds to using a Cursor or DictCursor on the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can consume a lot of memory.
In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python). This requires you to read the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
I don't know exactly what query you used because you have not given it here, but I suppose you're specifying a limit and an offset. Such queries are quick at the beginning of the data but become very slow further in.
If you have a unique column such as ID, you can fetch only the first N rows each time, but modify the where clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
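A sketch of this chunking by ID, shown with sqlite3 so it is runnable as is; the same loop applies with a MySQL connection. The table name big_table and the column payload are placeholders, and the starting value assumes IDs are positive integers.

```python
import sqlite3


def iter_chunks(conn, chunk_size=10_000):
    """Yield lists of (id, payload) rows, resuming after the last seen id."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the next chunk starts right after this id
```

Unlike LIMIT with a growing OFFSET, every chunk starts from an index lookup, so later chunks cost about the same as the first one.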
However, it should generally be faster to simply run
SELECT * FROM table
and open a cursor for that query with a reasonably large fetch size.
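A sketch of the single-query export with a fixed fetch size, again using sqlite3 as a runnable stand-in. With MySQL you would combine the same fetchmany loop with a server-side cursor such as SSCursor, so that the client does not buffer the whole result. The table name big_table and the CSV output format are assumptions for the example.

```python
import csv
import sqlite3


def export_table(conn, out_file, fetch_size=10_000):
    """Stream every row of big_table to `out_file` as CSV; return the row count."""
    cursor = conn.execute("SELECT * FROM big_table")
    writer = csv.writer(out_file)
    total = 0
    while True:
        rows = cursor.fetchmany(fetch_size)  # at most fetch_size rows per call
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total
```

Any per-row transformation can be applied to rows before writerows, so only one batch is held in Python at a time.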
