Difference of query results in BigQuery Python Client Library

I'd like to know the difference between the return value of query() and that of query().result() in the BigQuery Python client library.

from google.cloud import bigquery

bigquery_client = bigquery.Client()
myQuery = "SELECT * FROM `mytable`"
## NOTE: This query result has just 1 row.
job = bigquery_client.query(myQuery)
for row in job:
    val1 = row

result = job.result()
for row in result:
    val2 = row

print(job == result)  # False. I know a QueryJob object differs from a RowIterator object.
print(val1 == val2)   # True
Why are val1 and val2 equivalent?
Can the values be different for a very large query?

This is a self-answer, one year later.
Basically, 'job' and 'result' are different objects in my code.
bigquery_client.query() returns a QueryJob instance.
(See https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.query)
But the QueryJob class has its own __iter__ method, which returns iter(self.result()).
(See https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/job/query.py#L1778)
So 'job' behaves as an iterator over result() in the for-in loop.
Thus job != result, but val1 == val2.
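In other words, the delegation looks roughly like this (a simplified sketch of the linked library source, not the full class):

class QueryJob:
    def result(self):
        ...  # returns a RowIterator over the query's rows

    def __iter__(self):
        # Iterating the job itself just iterates its result() rows,
        # which is why both loops above see identical Row objects.
        return iter(self.result())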

The query() method is used to execute a SQL query.
BigQuery saves all query results to a table, which is either temporary or permanent. After a query finishes, the temporary table exists for up to 24 hours, and if we write the results to a new table, it becomes a permanent table.
When writing a query result to a permanent table, you can use the Python code that calls the result() method, i.e. query().result(), which is used to write the query data to a new permanent table.
So basically query() and query().result() give the same output, but with query().result() the fetched data gets stored in a new table, while with query() the data resides in a temporary table.
As for your question, val1 == val2 is True because the data fetched by query() and query().result() is the same, just stored differently.
I am providing the public documentation link related to this:
writing query results
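For reference, writing a query result to a permanent table looks roughly like this (a minimal sketch following the documentation above; the destination table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    # Hypothetical destination table; the results land here permanently.
    destination="my_project.my_dataset.my_dest_table"
)
job = client.query("SELECT * FROM `mytable`", job_config=job_config)
job.result()  # wait for completion; the rows now live in the destination table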

How to order and read from a very large bigquery table line by line in Python?

Here is a piece of code I use to read from large BigQuery tables line by line in Python:

from google.cloud import bigquery
from google.cloud.bigquery import dbapi

client = bigquery.Client('YOUR_CLIENT_NAME')
conn = dbapi.connect(client)
cursor = conn.cursor()
cursor.execute('SELECT * FROM MY_LARGE_TABLE ORDER BY COLUMN_A')
line = cursor.fetchone()
while line is not None:
    print('Do something with line')
    line = cursor.fetchone()
And this works fine for some tables. However, it is showing the following error for very large tables:
google.cloud.bigquery.dbapi.exceptions.DatabaseError: 403 Response too large to return. Consider specifying a destination table in your job configuration. For more details, see https://cloud.google.com/bigquery/troubleshooting-errors
Basically, I have a very large table, MY_LARGE_TABLE, on GCP. There is a column in that table, COLUMN_A. I need to iterate over the table (in Python), extract all records with the same COLUMN_A, do some analysis on those records, and repeat this for all unique COLUMN_A values. My plan was (see the above Python script) to use ORDER BY COLUMN_A in my query so that the results returned by cursor.execute() are ordered and all records with the same COLUMN_A are next to each other, letting me iterate over the table using fetchone() and do the task in one pass.
As confirmed by @khemedi, providing a destination table as suggested by the error solves the issue. See the code snippet below for adding a destination table on execute; with the results stored in that table, the cursor can page through them without hitting the response-size limit.

from google.cloud.bigquery import QueryJobConfig

cursor.execute(query, job_config=QueryJobConfig(destination="your_project.your_dataset.your_dest_table"))

Pass a Python Variable containing multiple ID Numbers into external BigQuery Script

I have created a python class, and one of my methods is meant to take in either a single ID number or a list of ID numbers. The function will then use the ID numbers to query from a table in BigQuery using a .sql script. Currently, the function works fine for a single ID number using the following:
def state_data(self, state, id_number):
    if state == 'NY':
        sql_script = self.sql_scripts['get_data_ny']
    else:
        sql_script = self.sql_scripts['get_data_rest']
    sql_script = sql_script.replace('##id_number##', id_number)
I'm having issues with passing in multiple ID numbers at once. There are 3 different ways that I've tried without success:
The above method, passing in the multiple ID numbers as a tuple to use with WHERE ID_NUM IN('##id_number##'). This doesn't work: when the .sql script is run, a syntax error is returned, because parentheses and quotes are automatically added. For example, the SQL statement attempts to run as WHERE ID_NUM IN('('123', '124')'). It would run fine without one of the two sets of parentheses and quotes, but no matter what I pass in, they always get added.
The second technique I have tried is to create a table, populate it with the passed in ID numbers, and then join with the larger table in BQ. It goes as follows:
CREATE OR REPLACE TABLE ID_Numbers
(
    ID_Number STRING
);
INSERT INTO ID_Numbers (ID_Number)
VALUES ('##id_number##');
-- rest of script is a simple left join of the above created table with the BQ table containing the data for each ID
This again works fine for single ID numbers, but passing in multiple VALUES (in this case ID numbers) would require one ('##id_number##') per unique ID. One thing I have not yet attempted is to assign a variable to each unique ID and pass each one in as a new VALUE; I am not sure if that technique would work.
The third technique I've tried is to include the full SQL query in the function, rather than calling a .sql script. The list of ID numbers get passed in as tuple, and the query goes as follows:
id_nums = tuple(id_number)
query = ("""SELECT * FROM `data_table`
WHERE ID_NUM IN{}""").format(id_nums)
This technique also does not work, as I get the following error:
AttributeError: 'QueryJob' object has no attribute 'format'.
I've attempted to look into this error but I cannot find anything that helps me out effectively.
Finally, I'll note that none of the posts asking the same or similar questions have solved my issues so far.
I am looking for any and all advice for a way that I can successfully pass a variable containing multiple ID numbers into my function that ultimately calls and runs a BQ query.
You should be able to use *args to get the id_numbers as a sequence, and f-strings plus str.join() to build the SQL query:

class MyClass:
    def state_data(self, state, *id_numbers):
        print(f"{state=}")
        query = f"""
SELECT * FROM `data_table`
WHERE ID_NUM IN ({", ".join(str(id_number) for id_number in id_numbers)})
"""
        print(query)

my_class = MyClass()
my_class.state_data("some state", 123)
my_class.state_data("some more state", 123, 124)
On my machine, this prints:
➜ sql python main.py
state='some state'
SELECT * FROM `data_table`
WHERE ID_NUM IN (123)
state='some more state'
SELECT * FROM `data_table`
WHERE ID_NUM IN (123, 124)
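As an aside, instead of building the IN list with string formatting, the BigQuery client can also take the IDs as an array query parameter, which sidesteps the quoting problems from the question entirely. A minimal sketch, assuming the google-cloud-bigquery client and the same hypothetical data_table:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        # The @ids parameter carries the whole list of ID numbers.
        bigquery.ArrayQueryParameter("ids", "INT64", [123, 124]),
    ]
)
query = "SELECT * FROM `data_table` WHERE ID_NUM IN UNNEST(@ids)"
rows = client.query(query, job_config=job_config).result()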

Adding a column from an existing BQ table to another BQ table using Python

I am trying to experiment with creating new tables from existing BQ tables, all within python. So far I've successfully created the table using some similar code, but now I want to add another column to it from another table - which I have not been successful with. I think the problem comes somewhere within my SQL code.
Basically what I want here is to add another column named "ip_address" and put all the info from another table into that column.
I've tried splitting up the two SQL statements and running them separately, and I've tried many different combinations of the commands (taking out CHAR, adding (32) after it, combining everything into one statement, etc.), and still I run into problems.
from google.cloud import bigquery

def alter(client, sql_alter, job_config, table_id):
    query_job = client.query(sql_alter, job_config=job_config)
    query_job.result()
    print(f'Query results appended to table {table_id}')

def main():
    client = bigquery.Client.from_service_account_json('my_json')
    table_id = 'ref.datasetid.tableid'
    job_config = bigquery.QueryJobConfig()
    sql_alter = """
        ALTER TABLE `ref.datasetid.tableid`
        ADD COLUMN ip_address CHAR;
        INSERT INTO `ref.datasetid.tableid` ip_address
        SELECT ip
        FROM `ref.datasetid.table2id`;
    """
    alter(client, sql_alter, job_config, table_id)

if __name__ == '__main__':
    main()
With this code, the current error is "400 Syntax error: Unexpected extra token INSERT at [4:9]". Also, do I have to keep referencing my table as ref.datasetid.tableid, or can I write just tableid? I've run into errors before it gets there, so I'm still not sure. Still a beginner, so help is greatly appreciated!
BigQuery does not support ALTER TABLE or other DDL statements. Take a look at Modifying table schemas; there you can find an example of how to add a new column when you append data to a table during a load or query job.
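Following that documentation, the append-with-schema-update pattern looks roughly like this (a minimal sketch reusing the table names from the question; SELECT ip AS ip_address is an assumed column mapping):

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('my_json')

job_config = bigquery.QueryJobConfig(
    destination='ref.datasetid.tableid',
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Allow the appended query result to introduce the new ip_address column.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
sql = "SELECT ip AS ip_address FROM `ref.datasetid.table2id`"  # assumed mapping
client.query(sql, job_config=job_config).result()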

What is the difference between querying tables using Delta format with Pyspark-SQL versus Pyspark?

I am querying tables but I have different results using two manners, I would like to understand the reason.
I created a table using Delta location. I want to query the data that I stored in that location. I'm using Amazon S3.
I created the table like this:
spark.sql("CREATE TABLE bronze_client_trackingcampaigns.TRACKING_BOUNCES (ClientID INT, SendID INT, SubscriberKey STRING) USING DELTA LOCATION 's3://example/bronze/client/trackingcampaigns/TRACKING_BOUNCES/delta'")
I want to query the data using the next line:
spark.sql("SELECT count(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
But the result is not okay: it should be 41832, but it returns 1.
When I did the same query in other way:
spark.read.option("header", True).option("inferSchema", True).format("delta").table("bronze_client_trackingcampaigns.TRACKING_BOUNCES").count()
I obtained the result 41832.
I want to have the same results in both ways.
The 1 you got back is actually the row count of the returned result (a COUNT(*) query yields exactly one row), not the count value itself. Change the SQL statement to:
df = spark.sql("SELECT COUNT(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES")
df.show()
You should now get the same result.
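If you want the number itself rather than a displayed table, you can also pull the scalar out of the first result row (a small sketch using the same query):

count = spark.sql(
    "SELECT COUNT(*) FROM bronze_client_trackingcampaigns.TRACKING_BOUNCES"
).collect()[0][0]
print(count)  # should print 41832 in the example above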

Python SQLite Return Default String For Non-Existent Row

I have a DB with ID/Topic/Definition columns. When a select query is made, with possibly hundreds of parameters, I would like the fetchall call to also return the topic of any non-existent rows with a default text (i.e. "Not Found").
I realize this could be done in a loop, but that would query the DB every cycle and have a significant performance hit. With the parameters joined by "OR" in a single select statement the search is nearly instantaneous.
Is there a way to get a return of the query (topic) with default text for non-existent rows in SQLite?
Table Structure (named "dictionary")
ID|Topic|Definition
1|wd1|def1
2|wd3|def3
Sample Query
SELECT Topic, Definition FROM dictionary WHERE Topic = "wd1" OR Topic = "wd2" OR Topic = "wd3"
Desired Return
[(wd1, def1), (wd2, "Not Found"), (wd3, def3)]
To get data like wd2 out of a query, such data must be in the database in the first place.
You could put it into a temporary table, or use a common table expression.
To include rows without a match, use an outer join. Note that the join must be on Topic, since that is the column the lookup values correspond to:
WITH Topics(Topic) AS ( VALUES ('wd1'), ('wd2'), ('wd3') )
SELECT Topic,
       IFNULL(Definition, 'Not Found') AS Definition
FROM Topics
LEFT JOIN dictionary USING (Topic);
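From Python, the same query can be built with placeholders so the lookup topics are passed as parameters (a minimal sqlite3 sketch using the table and column names from the question; the database file name is a placeholder):

import sqlite3

topics = ["wd1", "wd2", "wd3"]
placeholders = ", ".join("(?)" for _ in topics)  # one (?) per topic
sql = f"""
    WITH Topics(Topic) AS (VALUES {placeholders})
    SELECT Topic, IFNULL(Definition, 'Not Found') AS Definition
    FROM Topics
    LEFT JOIN dictionary USING (Topic)
"""
conn = sqlite3.connect("mydb.sqlite")  # hypothetical database file
print(conn.execute(sql, topics).fetchall())
# e.g. [('wd1', 'def1'), ('wd2', 'Not Found'), ('wd3', 'def3')]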
