BigQuery query limits upper and lower bounds - Python

On MySQL I would enter the following query, but running the same on Google BigQuery throws an error for the upper limit. How do I specify limits on a query? Say I have a query that returns 20 results and I want results between 5 and 10 only, how should I frame the query on Google BigQuery?
For example:
SELECT id,
       COUNT(total) AS total
FROM ABC.data
GROUP BY id
ORDER BY total DESC
LIMIT 5, 10;
If I only put "LIMIT 5" at the end of the query, I get the top 5, and if I put "LIMIT 10" I get the top 10, but what syntax do I use to get the rows between 5 and 10?
Could someone please shed some light on this? Any help is much appreciated.

I would use window functions, something like:
SELECT *
FROM (
  SELECT id, total, ROW_NUMBER() OVER (ORDER BY total DESC) AS rnb
  FROM (
    SELECT id,
           COUNT(total) AS total
    FROM ABC.data
    GROUP BY id
  )
)
WHERE rnb >= 5 AND rnb <= 10
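If you're running this from Python, a minimal sketch with the google-cloud-bigquery client library might look like the following (assuming standard SQL, that ABC.data is the table from the question, and that credentials are already configured):
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT id, total
FROM (
  SELECT id, total, ROW_NUMBER() OVER (ORDER BY total DESC) AS rnb
  FROM (
    SELECT id,
           COUNT(total) AS total
    FROM ABC.data
    GROUP BY id
  )
)
WHERE rnb BETWEEN 5 AND 10
"""
# Iterating over the query job waits for it to finish and streams the rows.
for row in client.query(sql):
    print(row.id, row.total)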

The windowing function answer is a good one, but I thought I'd give another option that involves how your result is fetched rather than how the query is run.
If you only need the first N rows, you can add a LIMIT N to your query. But if you don't need the first M rows, you can change how you fetch the results. If you're using the Java API, you can use the setStartIndex() method on either the TableData.list() or the Jobs.getQueryResults() call to only fetch rows starting from a particular index.
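The Python client exposes the same idea; a minimal sketch, assuming the google-cloud-bigquery library and a placeholder table name:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("ABC.data")  # placeholder table reference
# Skip the first 5 rows, then fetch up to 5 more.
for row in client.list_rows(table, start_index=5, max_results=5):
    print(dict(row))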

That question makes no sense for an ever-changing dataset. If you have a one-second delay between asking for the first 5 rows and the next 5, the data could have changed: its order is now different, and you will miss data or get duplicate results. That's why databases like BigTable have a method for running one query against the data and handing you the result set in groups. If that were the case here, what you are looking for is called query cursors; see the documentation on them.
But since you said the data does not change, fetch() will work just fine. fetch() has two options you will want to take note of: limit and offset. limit is the maximum number of results to return (if set to None, all available results are retrieved); offset is how many results to skip.
Check out the other options here: https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
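In code, that might look like the following minimal sketch, assuming the legacy App Engine db API from the linked docs (Question is a hypothetical model, used only for illustration):
from google.appengine.ext import db

class Question(db.Model):  # hypothetical model, for illustration only
    text = db.StringProperty()

# Skip the first 5 results, then return up to 5 more.
results = Question.all().fetch(limit=5, offset=5)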

Related

Postgres/Python - is it better to use one large query, or several smaller ones? How do you decide the number of items to include in a 'WHERE IN' clause?

I am writing a Python script that will be run regularly in a production environment where efficiency is key.
Below is an anonymized query that I have which pulls sales data for 3,000 different items.
I think I am getting slower results querying for all of them at once. When I try querying for different sizes, the amount of time it takes varies inconsistently (likely due to my internet connection). For example, sometimes querying for 1000 items 3 times is faster than all 3000 at once. However, running the same test 5 minutes later gets me different results. It is a production database where performance may be dependent on current traffic. I am not a database administrator but work in data science, using mostly similar select queries (I do the rest in Python).
Is there a best practice here? Some sort of logic that determines how many items to put in the WHERE IN clause?
import pandas as pd

date_min = pd.to_datetime('2021-11-01')
date_max = pd.to_datetime('2022-01-31')
sql = f"""
SELECT
    product_code,
    sales_date,
    n_sold,
    revenue
FROM
    sales_daily
WHERE
    product_code IN {tuple(item_list)}
    AND sales_date >= DATE('{date_min}')
    AND sales_date <= DATE('{date_max}')
ORDER BY
    sales_date DESC, revenue
"""
df_act = pd.read_sql(sql, di.db_engine)
df_act
If your sales_date column is indexed in the database, using a function (DATE) in the WHERE clause might prevent the planner from using that index. I believe you will have better luck if you interpolate date_min and date_max as plain strings (YYYY-MM-DD) into the SQL string and drop the function. Also, use BETWEEN ... AND rather than >= ... AND ... <=.
As for IN with 1000 items: I strongly recommend you don't do that. Create a single-column temp table of those values, index the item column, then join it to product_code.
Generally, something like this:
DROP TABLE IF EXISTS _item_list;

CREATE TEMP TABLE _item_list AS
SELECT item
FROM (VALUES (etc)) AS t(item);

CREATE INDEX idx_items ON _item_list (item);

SELECT
    product_code,
    sales_date,
    n_sold,
    revenue
FROM
    sales_daily x
    INNER JOIN _item_list y ON x.product_code = y.item
WHERE
    sales_date BETWEEN '{date_min}' AND '{date_max}'
ORDER BY
    sales_date DESC, revenue
As an addendum, try to have the items in the item list in the same order as the index on product_code.
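Wiring this together from Python might look like the following minimal sketch, assuming di.db_engine from the question is a SQLAlchemy engine and item_list holds the product codes. The temp table only lives as long as the connection, so everything runs on one connection:
import pandas as pd
from sqlalchemy import text

# Build a bound-parameter VALUES list so item values are escaped safely.
values_clause = ", ".join(f"(:item_{i})" for i in range(len(item_list)))
params = {f"item_{i}": item for i, item in enumerate(item_list)}

with di.db_engine.begin() as conn:
    conn.execute(text("DROP TABLE IF EXISTS _item_list"))
    conn.execute(
        text(
            "CREATE TEMP TABLE _item_list AS "
            f"SELECT item FROM (VALUES {values_clause}) AS t(item)"
        ),
        params,
    )
    conn.execute(text("CREATE INDEX idx_items ON _item_list (item)"))
    df_act = pd.read_sql(
        text(
            "SELECT product_code, sales_date, n_sold, revenue "
            "FROM sales_daily x "
            "JOIN _item_list y ON x.product_code = y.item "
            "WHERE sales_date BETWEEN '2021-11-01' AND '2022-01-31' "
            "ORDER BY sales_date DESC, revenue"
        ),
        conn,
    )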

Passing a Parameter into a Query

I'm trying to create a method where I can pass a parameter (a number) and have the query return that many rows. See below:
def get_data(i):
    for i in range(0, i):
        TNG = "SELECT DISTINCT hub, value, date_inserted FROM ZE_DATA.AESO_CSD_SUMMARY where opr_date >= trunc(sysdate) order by date_inserted desc fetch first i rows only"
where i is a number. Inside the query, I want "fetch first i rows only" to fetch i rows.
Thoughts on the syntax?
Seems like you're looking for a limit argument. You didn't mention which SQL dialect you're using, but each has its own syntax for this (e.g. LIMIT in MySQL, TOP in SQL Server, FETCH FIRST in Oracle).
I'm also a little confused by the structure of that function; it seems like you may want to run the query once and iterate through the result set, rather than run the query i times.
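Since trunc(sysdate) and FETCH FIRST suggest Oracle, a minimal sketch with the python-oracledb driver and a bind variable might look like this (connection details are placeholders):
import oracledb

def get_data(n):
    sql = """
        SELECT DISTINCT hub, value, date_inserted
        FROM ZE_DATA.AESO_CSD_SUMMARY
        WHERE opr_date >= trunc(sysdate)
        ORDER BY date_inserted DESC
        FETCH FIRST :n ROWS ONLY
    """
    with oracledb.connect(user="...", password="...", dsn="...") as conn:
        with conn.cursor() as cur:
            # Bind n as a parameter instead of pasting it into the string.
            cur.execute(sql, {"n": n})
            return cur.fetchall()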

Select a random row from the table using Python

Below is my table. I use MySQL for the database queries.
[Structure of the table]
I want to print questions randomly, taking them from the table. How can I do that using Python?
from random import randint
num = randint(1,5)
Then db query:
SELECT question FROM your_table WHERE ques_id = num;
Alternatively:
SELECT question FROM your_table LIMIT num-1, 1;
num would be a random number between 1 and 5; replace num in the query and it returns only one row. Be aware that LIMIT starts from index 0, so the first argument should be num-1 rather than num; the second argument is always 1 because you only want one row per query.
If all the IDs are sequential, get the max ID and use the random library to pick a random number from 1 to the max ID in the database.
from random import randint
random_id = randint(1,my_max_id)
then use random_id to get the item from the database.
If you have not set up your Python MySQL connection yet, you can refer to this question:
How do I connect to a MySQL Database in Python?
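Putting the pieces together, a minimal sketch, assuming the PyMySQL driver and placeholder connection details (table and column names follow the question):
from random import randint
import pymysql

conn = pymysql.connect(host="localhost", user="...", password="...", database="...")
with conn.cursor() as cur:
    cur.execute("SELECT MAX(ques_id) FROM your_table")
    max_id = cur.fetchone()[0]
    # Pick a random existing ID, assuming IDs run 1..max_id without gaps.
    cur.execute("SELECT question FROM your_table WHERE ques_id = %s",
                (randint(1, max_id),))
    row = cur.fetchone()
conn.close()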
You could also do it at the database level (in MySQL), gaining some speed by doing the work at a lower level of the software stack.
In MySQL, you can get all the questions you are going to show, already in random order:
SELECT qus_id, question FROM your_table ORDER BY RAND();
Then, in Python, show them by iterating over the rows, which MySQL has already shuffled:
for question in rows:
    show_question(question)
Any "Random" operation is costly to process, so the lower the software level at which it is calculated, the more optimal your program will be.

Deduping a table while keeping record structures

I've got a weekly process which does a full replace operation on a few tables. The process is weekly since there is a large amount of data as a whole. However, we also want daily/hourly delta updates, so the system is more in sync with production.
When we update data, we create duplicate rows (updates of existing rows), which I want to get rid of. To achieve this, I've written a Python script which runs the following query on a table, inserting the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table columns; I can't use * here because that would only work for 1 run (the next will already have a field called max_record_insert_time and that would cause an ambiguity issue).
Everything is working as expected, with one exception - some of the columns in the table are of RECORD datatype; despite not using aliases for them, and selecting their fully qualified name (e.g. record_name.child_name), when the output is written back into the table, the results are flattened. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps you can use, in the outer statement,
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure (for more detailed documentation, see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except).
An alternative approach would be to include in {fields} only top-level columns, even if they are non-leaves, i.e. just record_name and not record_name.*
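Running the deduping query from Python and writing the result back over the same table might look like this minimal sketch, assuming the google-cloud-bigquery client (the table ID is a placeholder for {client_name}_{environment}.{table}):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "project.dataset.table_name"  # placeholder

# Overwrite the table with the query result, keeping RECORD structure.
job_config = bigquery.QueryJobConfig(
    destination=table_id,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = f"""
SELECT * EXCEPT (max_record_insert_time)
FROM (
  SELECT *,
         MAX(record_insert_time) OVER (PARTITION BY id) AS max_record_insert_time
  FROM `{table_id}`
)
WHERE record_insert_time = max_record_insert_time
"""
client.query(sql, job_config=job_config).result()  # wait for completion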
The answer below is definitely not better than the straightforward SELECT * EXCEPT modifier, but I wanted to present an alternative version.
SELECT t.*
FROM (
  SELECT
    id,
    MAX(record_insert_time) AS max_record_insert_time,
    ARRAY_AGG(t) AS all_records_for_id
  FROM yourTable AS t
  GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
What the query above does is: first, group all records for each id into an array of the respective rows, along with the max value of insert_time. Then, for each id, simply flatten all the (previously aggregated) rows and pick only those whose insert_time matches the max time. The result is as expected. No analytic function is involved, just simple aggregation, at the cost of an extra UNNEST.
Still, at least it's a different option :o)

How to get num results in mysqldb

I have the following query:
self.cursor.execute("SELECT platform_id_episode, title, from table WHERE asset_type='movie'")
Is there a way to get the number of results returned directly? Currently I am doing this, which is inefficient:
r = self.cursor.fetchall()
num_results = len(r)
If you don't actually need the results,* don't ask MySQL for them; just use COUNT:**
self.cursor.execute("SELECT COUNT(*) FROM table WHERE asset_type='movie'")
Now, you'll get back one row, with one column, whose value is the number of rows your other query would have returned.
Notice that I ignored your specific columns and just did COUNT(*). A COUNT(platform_id_episode) would also be legal, but it means the number of found rows with non-NULL platform_id_episode values; COUNT(*) is the number of found rows, full stop.***
* If you do need the results… well, you have to call fetchall() or equivalent to get them, so I don't see the problem.
** If you've never used aggregate functions in SQL before, make sure to look over some of the examples on that page; you've probably never realized you can do things like that so simply (and efficiently).
*** If someone taught you "never use * in a SELECT", well, that's good advice, but it's not relevant here. The problem with SELECT * is that it drags every column into your result set, in table-definition order, instead of just the columns you actually need, in the order you need them. SELECT COUNT(*) doesn't do that.
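For completeness, fetching that single value might look like this minimal sketch, reusing the cursor from the question:
self.cursor.execute("SELECT COUNT(*) FROM table WHERE asset_type='movie'")
(num_results,) = self.cursor.fetchone()  # one row, one column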
