I have a set of IDs (~200k) and I need to get all the rows in a BigQuery table with those IDs. I tried to construct a list in Python and pass it as a parameter to the SQL query using @, but I get a TypeError: 'ArrayQueryParameter' object is not iterable error. Here is the code I tried (very similar to https://cloud.google.com/bigquery/querying-data#running_parameterized_queries):
id_list = ['id1', 'id2']
query = """
SELECT id
FROM `my-db`
WHERE id IN UNNEST(@ids)
"""
query_job = client.run_async_query(
    str(uuid.uuid4()),
    query,
    query_parameters=(
        bigquery.ArrayQueryParameter('ids', 'ARRAY<STRING>', id_list)
    )
)
Probably the issue here is that you are not passing a tuple to the function.
Try adding a comma before closing the parenthesis, like so:
id_list = ['id1', 'id2']
query = """
SELECT id
FROM `my-db`
WHERE id IN UNNEST(@ids)
"""
query_job = client.run_async_query(
    str(uuid.uuid4()),
    query,
    query_parameters=(
        bigquery.ArrayQueryParameter('ids', 'STRING', id_list),
    )
)
In Python, if you do:
t = (1)
and then run:
type(t)
you will find the result to be int. But if you do:
t = (1,)
then it results in a tuple.
You need to use 'STRING' rather than 'ARRAY<STRING>' for the array element type, e.g.:
query_parameters=(
    bigquery.ArrayQueryParameter('ids', 'STRING', id_list),
)
The example from the querying data topic is:
def query_array_params(gender, states):
    client = bigquery.Client()
    query = """
        SELECT name, sum(number) as count
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE gender = @gender
        AND state IN UNNEST(@states)
        GROUP BY name
        ORDER BY count DESC
        LIMIT 10;
        """
    query_job = client.run_async_query(
        str(uuid.uuid4()),
        query,
        query_parameters=(
            bigquery.ScalarQueryParameter('gender', 'STRING', gender),
            bigquery.ArrayQueryParameter('states', 'STRING', states)))
    query_job.use_legacy_sql = False

    # Start the query and wait for the job to complete.
    query_job.begin()
    wait_for_job(query_job)
    print_results(query_job.results())
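Note that run_async_query and the uuid-based job naming come from an older version of the google-cloud-bigquery library. As a rough sketch (assuming a current client library, where client.query plus QueryJobConfig replaces run_async_query), the same example would look like:

from google.cloud import bigquery

def query_array_params(gender, states):
    client = bigquery.Client()
    query = """
        SELECT name, sum(number) as count
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE gender = @gender
        AND state IN UNNEST(@states)
        GROUP BY name
        ORDER BY count DESC
        LIMIT 10;
    """
    # In the newer API, parameters travel in a QueryJobConfig.
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter('gender', 'STRING', gender),
            bigquery.ArrayQueryParameter('states', 'STRING', states),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    for row in query_job.result():  # result() waits for the job to finish
        print(row['name'], row['count'])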
The answers above are a better solution, but you may find a use for this too when quickly drafting something in notebooks: turn the list into a single string of comma-separated, quoted values, then pass the string into the query. Note that this is plain string interpolation with no escaping, so only use it with values you trust. Like so:
id_list = ['id1', 'id2']
# format the list into a quoted, comma-separated string: "id1","id2"
id_string = '"' + '","'.join(id_list) + '"'
client = bigquery.Client()
query = f"""
SELECT id
FROM `my-db`
WHERE id IN ({id_string})
"""
query_job = client.query(query)
results = query_job.result()
If you want to use a simple client.query call rather than client.run_async_query as shown in the answers above, you can pass an additional job_config parameter of type QueryJobConfig. Simply add your arrays to its query_parameters using bigquery.ArrayQueryParameter.
The following code worked for me:
query = f"""
SELECT distinct pipeline_commit_id, pipeline_id, name
FROM `{self.project_id}.{self.dataset_id}.pipelines_{self.table_suffix}`,
UNNEST(labels) AS label
where label.value IN UNNEST(#labels)
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ArrayQueryParameter('labels', 'STRING', labels)
]
)
query_job = self.client.query(query, job_config=job_config)
Based on those examples:
https://cloud.google.com/bigquery/docs/parameterized-queries
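To read the matching rows back from that job, iterating the result works as usual; a minimal sketch based on the labels query above:

rows = query_job.result()  # blocks until the query completes
for row in rows:
    print(row['pipeline_commit_id'], row['pipeline_id'], row['name'])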
My application uses SQLAlchemy/SQL to query a database. I want to print out the result of the query, but I am getting a <sqlalchemy.engine.result.ResultProxy> object in response.
I tried out the suggestions in "How to access the results of queries?" but I am getting an "Uncaught exception".
See code below:
query = f"SELECT COUNT(DISTINCT id)"\
f"FROM group"
result = db.session.execute(query)
id_count = result.first()[0]
print(id_count)
Try this one (note the space before FROM: without it the two strings concatenate into COUNT(DISTINCT id)FROM group, which is invalid SQL, and the f-prefixes are unnecessary since there is no interpolation):
query = "SELECT COUNT(DISTINCT id) " \
        "FROM group"
result = db.session.execute(query)
for row in result:
    print(row[0])               # access by positional index
    print(row['my_column'])     # access by column name as a string
    r_dict = dict(row.items())  # convert to a dict keyed by column names
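Since this query returns a single value, SQLAlchemy's scalar() shortcut is also handy; it returns the first column of the first row directly:

id_count = db.session.execute(query).scalar()
print(id_count)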
I'm using a tuple to set the params in a SQL query, but the interpolation drops the quotes, so my str params end up looking like ints. I need the params to stay strings.
I build the query in a function that reads it from a .sql file.
My .sql file is something like this:
update table t
set t.status = %s
where t.id in (%s)
My function to create a query from file is:
def create_query(where=[]):
path = f'./file.sql'
with open(path, 'r') as file:
query = file.readlines()
return ' '.join(query)
return None
I call my function this way, passing the parameters:
status = 'CREATED'
ids = ('123', '1324', '124512')
params = list([ status, ids ])
query = create_query() % tuple(params)
I get a query like this:
update table t set t.status = CREATED where t.id in (22457, 22458, 22459)
I would like to do the interpolation keeping the quotation marks. So, the query should look like this:
update table t set t.status = 'CREATED' where t.id in (22457, 22458, 22459)
If I do this:
status = ('CREATED',)
ids = ('123', '1324', '124512')
params = list([ status, ids ])
query = create_query() % params
I get this:
update table t set t.status = ('CREATED',) where t.id in (22457, 22458, 22459)
And that breaks my SQL (the parentheses and trailing comma around status).
I'm using SQLAlchemy.
I solved the problem by modifying my .sql file to add single quotes:
update table t
set t.status = '%s'
where t.id in (%s)
One approach here is to generate the IN clause with the correct number of placeholders based on the number of values, and let the driver handle the quoting. For example:
status = ('CREATED',)
ids = ('123', '1324', '124512',)
params = status + ids
where = '(%s' + ',%s'*(len(ids) - 1) + ')'
sql = 'update some_table t set t.status = %s where t.id in ' + where
cursor.execute(sql, params)
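For the three IDs above this builds update some_table t set t.status = %s where t.id in (%s,%s,%s), and the database driver substitutes the values with the proper quoting.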
You can use ? placeholders in the query so the parameters are bound by the driver and you don't have to worry about the quotes:
status = ('CREATED',)
ids = (123, 1324, 124512, 33333, 44444)
params = status + ids
sql = f'update t set status = ? where id in (?{", ?"*(len(ids)-1)});'
cursor.execute(sql, params)
(assuming id is an INT)
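For the five IDs above, the f-string expands to update t set status = ? where id in (?, ?, ?, ?, ?); note that ? is the paramstyle of drivers like sqlite3 and pyodbc, while others (e.g. psycopg2 or MySQLdb) expect %s instead.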
I am using the BigQuery hook in my Airflow code.
Query example: select count(*) from `table-name`;
So it will return only one integer as a result.
How can I save it in an integer Python variable instead of an entire pandas DataFrame?
Below is my code example:
hook = BigQueryHook(bigquery_conn_id=BQ_CON, use_legacy_sql=False)
bq_client = bigquery.Client(project=hook._get_field("project"), credentials=hook._get_credentials())
query = "select count(*) from dataset1.table1;"
df = bq_client.query(query).to_dataframe()
If it is just a single row, you could name the column col1 and access it by this key name:
query = "select count(*) as col1 from dataset1.table1;"
rows = bq_client.query(query).result()  # a RowIterator
result = list(rows)[0]['col1']
or, if you have already called to_dataframe():
result = int(df.iloc[0, 0])
I have a permanent table in BigQuery that I want to append to with data coming from a CSV in Google Cloud Storage. I first read the CSV file into a BigQuery temp table:
table_id = "incremental_custs"
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "gs://location/to/csv/customers_5083983446185_test.csv"
]
external_config.schema = schema
external_config.options.skip_leading_rows = 1
job_config = bigquery.QueryJobConfig(table_definitions={table_id: external_config})
sql_test = "SELECT * FROM `{table_id}`;".format(table_id=table_id)
query_job = bq_client.query(sql_test, job_config=job_config)
customer_updates = query_job.result()
print(customer_updates.total_rows)
Up to this point everything works and I retrieve the records from the temp table. The issue arises when I then try to combine it with the permanent table:
sql = """
create table `{project_id}.{dataset}.{table_new}` as (
select customer_id, email, accepts_marketing, first_name, last_name,phone,updated_at,orders_count,state,
total_spent,last_order_name,tags,ll_email,points_approved,points_spent,guest,enrolled_at,ll_updated_at,referral_id,
referred_by,referral_url,loyalty_tier_membership,insights_segment,rewards_claimed
from (
select * from `{project_id}.{dataset}.{old_table}`
union all
select * from `{table_id}`
ORDER BY customer_id, orders_count DESC
))
order by orders_count desc
""".format(project_id=project_id, dataset=dataset_id, table_new=table_new, old_table=old_table, table_id=table_id)
query_job = bq_client.query(sql)
query_result = query_job.result()
I get the following error:
BadRequest: 400 Table name "incremental_custs" missing dataset while no default dataset is set in the request.
Am I missing something here? Thanks!
Arf, you forgot the external config! You don't pass it in your second query:
query_job = bq_client.query(sql)
Simply update it like in the first one
query_job = bq_client.query(sql_test,job_config=job_config)
A fresh look is always easier!
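In other words, the second query needs the same job_config that carries the external table definition, so that incremental_custs can be resolved. Roughly:

# Reuse the job_config holding table_definitions so the temp table resolves.
query_job = bq_client.query(sql, job_config=job_config)
query_result = query_job.result()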
I'd like to be able to add a restriction to the query if user_id != None, for example:
"AND user_id = 5"
but I am not sure how to add this to the function below.
Thank you.
def get(id, user_id=None):
    query = """SELECT *
               FROM USERS
               WHERE text LIKE %s AND
                     id = %s
            """
    values = (search_text, id)
    results = DB.get(query, values)
This way I can call:
get(5)
get(5, 103524234)  (contains the user_id restriction)
def get(id, user_id=None):
    query = """SELECT *
               FROM USERS
               WHERE text LIKE %s AND
                     id = %s
            """
    values = [search_text, id]
    if user_id is not None:
        query += ' AND user_id = %s'
        values.append(user_id)
    results = DB.get(query, values)
As you see, the main difference wrt your original code is the small if block in the middle, which enriches query string and values if needed. I also made values a list, rather than a tuple, so it can be enriched with the more natural append rather than with
values += (user_id,)
which is arguably less readable - however, you can use it if you want to keep values a tuple for some other reasons.
edit: the OP now clarifies in a comment (!) that his original query has an ending LIMIT clause. In this case I would suggest a different approach, such as:
query_pieces = ["""SELECT *
                   FROM USERS
                   WHERE text LIKE %s AND
                         id = %s
                """, "LIMIT 5"]
values = [search_text, id]
if user_id is not None:
    query_pieces.insert(1, ' AND user_id = %s')
    values.append(user_id)
query = ' '.join(query_pieces)
results = DB.get(query, values)
You could do it in other ways, but keeping a list of query pieces in the proper order, enriching it as you go (e.g. by insert), and joining it with some whitespace at the end, is a pretty general and usable approach.
What's wrong with something like:
def get(id, user_id=None):
    query = "SELECT * FROM USERS WHERE text LIKE %s"
    if user_id != None:
        query = query + " AND user_id = %s" % (user_id)
    :
    :
That syntax may not be perfect, I haven't done Python for a while - I'm just trying to get the basic idea across. This defaults to the None case and only adds the extra restriction if you give a real user ID.
You could build the SQL query using a list of conditions:
def get(id, user_id=None):
    query = """SELECT *
               FROM USERS
               WHERE
            """
    values = [search_text, id]
    conditions = [
        'text LIKE %s',
        'id = %s']
    if user_id is not None:
        conditions.append('user_id = %s')
        values.append(user_id)
    query += ' AND '.join(conditions) + ' LIMIT 1'
    results = DB.get(query, values)
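As a quick sanity check (assuming DB.get executes the statement with the bound values), the two calls from the question would then run, whitespace aside:

get(5)
# SELECT * FROM USERS WHERE text LIKE %s AND id = %s LIMIT 1
get(5, 103524234)
# SELECT * FROM USERS WHERE text LIKE %s AND id = %s AND user_id = %s LIMIT 1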