Downloading Large data from bigquery dataset and pandas - python

I'm trying to download data from a BigQuery public dataset and store it locally in a CSV file. When I add LIMIT 10 at the end of the query, my code works, but without it I get this error:
Response too large to return. Consider setting allowLargeResults to true in your job configuration.
Thank you in advance!
Here is my code:
import pandas as pd
import pandas_gbq as gbq
import tqdm


def get_data(query, project_id):
    data = gbq.read_gbq(query, project_id=project_id, configuration={"allow_large_results": True})
    data.to_csv('blockchain.csv', header=True, index=False)


if __name__ == "__main__":
    query = """SELECT * FROM `bigquery-public-data.crypto_bitcoin.transactions` WHERE block_timestamp>='2017-09-1' and block_timestamp<'2017-10-1';"""
    project_id = "bitcoin-274091"
    get_data(query, project_id)

As @Graham Polley mentioned, you can first save the results of your source query to a BigQuery table and then extract the data from that table to GCS. Because of current pandas_gbq limitations, I would recommend the google-cloud-bigquery package, the officially recommended Python library for working with the BigQuery API.
In the following example, I use the bigquery.Client.query() method to run a query job whose job_config sets a destination table, and then call bigquery.Client.extract_table() to export the data to a GCS bucket:
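from google.cloud import bigquery

client = bigquery.Client()

# Placeholder: fully qualified staging table that receives the query results.
table_id = "project_id.dataset.table"
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """SELECT * FROM ..."""

# Run the query and wait until the results are written to the staging table.
query_job = client.query(sql, job_config=job_config)
query_job.result()

# Export the staging table to GCS (CSV is the default export format).
# Exports larger than 1 GB need a wildcard URI such as "gs://bucket/test-*.csv".
gs_path = "gs://bucket/test.csv"
extract_job = client.extract_table(table_id, gs_path, location='US')
extract_job.result()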
At the end, you can delete the table that holds the staging data.
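The cleanup step can be sketched as a small helper. This is a minimal sketch, assuming `client` is a `bigquery.Client` and `table_id` is the staging table used above; the helper name `drop_staging_table` is hypothetical and not part of the library.

```python
def drop_staging_table(client, table_id):
    """Delete the staging table once the extract job has finished.

    `client` is expected to be a google.cloud.bigquery.Client (assumption).
    """
    # not_found_ok=True avoids an exception if the table is already gone.
    client.delete_table(table_id, not_found_ok=True)
    return table_id
```

Call it only after `extract_job.result()` has returned, for example `drop_staging_table(client, "project_id.dataset.table")`.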

Related

Python Bigquery create temp table

When I create a temp table via Python, this error is thrown:
400 Use of CREATE TEMPORARY TABLE requires a script or session
How can I create a session?
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
client = bigquery.Client(project=project, location=location)
client.query('''
    create temp table t_acquisted_users as
    select *
    from table_a
    limit 10
''').result()
You can create a session through the BigQuery API by setting the create_session parameter in a job config, for example:
job_config=bigquery.QueryJobConfig(create_session=True)
More details in this excellent article:
https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5
Here is how I fixed it as a quick workaround. I hope someone else can provide a better answer.
# create session
client0 = bigquery.Client(project=project, location=location)
job = client0.query(
    "SELECT 1;",  # a trivial query that cannot fail
    job_config=bigquery.QueryJobConfig(create_session=True)
)
session_id = job.session_info.session_id
job.result()

# set default session
client = bigquery.Client(
    project=project,
    location=location,
    default_query_job_config=bigquery.QueryJobConfig(
        connection_properties=[
            bigquery.query.ConnectionProperty(
                key="session_id", value=session_id
            )
        ]
    ),
)
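The error message also names a second route, a script. A multi-statement query is treated as a script, so the temporary table can be created and read inside a single `client.query()` call without a session. This is a minimal sketch, assuming `client` is a `bigquery.Client`; the helper name `run_script` is hypothetical, and `table_a` is taken from the question.

```python
def run_script(client, statements):
    """Join several SQL statements into one script and run it as a single job."""
    # Strip trailing semicolons so each statement is terminated exactly once.
    script = ";\n".join(s.strip().rstrip(";") for s in statements) + ";"
    job = client.query(script)
    # Waits for the whole script to finish.
    return script, job.result()


STATEMENTS = [
    "CREATE TEMP TABLE t_acquisted_users AS SELECT * FROM table_a LIMIT 10",
    "SELECT * FROM t_acquisted_users",
]
```

With a real client this would be called as `run_script(client, STATEMENTS)`. The temporary table only lives for the duration of the script, so use the session approach when later, separate queries need it.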

Pandas Dataframe from Cloud Functions to BigQuery - only PARQUET and CSV source_formats?

I'm querying an API with GCP Cloud Functions and would like to write the result to BigQuery. I'm getting this error:
Got unexpected source_format: 'NEWLINE_DELIMITED_JSON'. Currently, only PARQUET and CSV are supported
This is my code:
from google.cloud import bigquery
import pandas as pd
import requests
import datetime
def hello_pubsub(event, context):
    response = requests.get("https://api.openweathermap.org/data/2.5/weather?q=berlin&appid=12345&units=metric&lang=de")
    responseJson = response.json()
    # Creates DataFrame
    df = pd.DataFrame({'datetime': pd.to_datetime(format(datetime.datetime.now())),
                       'name': str(responseJson['name']),
                       'temp': float(responseJson['main']['temp']),
                       'windspeed': float(responseJson['wind']['speed']),
                       'winddeg': int(responseJson['wind']['deg'])
                       }, index=[0])
    project_id = 'myproj'
    client = bigquery.Client(project=project_id)
    dataset_id = 'weather'
    dataset_ref = client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.write_disposition = "WRITE_APPEND"
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    load_job = client.load_table_from_dataframe(df, dataset_ref.table("weather_de"), job_config=job_config)
What's the best way to do this?
The BigQuery client library reference states that this is intended behavior when loading into a table from a dataframe using load_table_from_dataframe():
By default, this method uses the parquet source format. To override this, supply a value for source_format with the format name. Currently only CSV and PARQUET are supported.
Something you can try is replacing that method with load_table_from_json(), which uses NEWLINE_DELIMITED_JSON as the source format. That method does not accept a dataframe as input, so I would recommend storing the data you need from the API response in a JSON object (a list of dictionaries, one per row). Alternatively, you can convert the dataframe you already created to JSON with the pandas to_json() method.
You can read more about how the BigQuery client works in the reference, which also lists the supported source formats.
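The first suggestion, building the row straight from the API response, could look roughly like this. It is a minimal sketch, assuming `client` is a `bigquery.Client` and that the response has the fields used in the question; the helper names `build_weather_row` and `load_rows` are hypothetical.

```python
import datetime


def build_weather_row(response_json, now=None):
    """Pick the needed fields from the API response as a JSON-serializable dict."""
    now = now or datetime.datetime.now()
    return {
        # ISO 8601 string, because datetime objects are not JSON-serializable.
        "datetime": now.isoformat(),
        "name": str(response_json["name"]),
        "temp": float(response_json["main"]["temp"]),
        "windspeed": float(response_json["wind"]["speed"]),
        "winddeg": int(response_json["wind"]["deg"]),
    }


def load_rows(client, rows, table_id):
    """Load a list of dicts; `client` is expected to be a bigquery.Client."""
    load_job = client.load_table_from_json(rows, table_id)
    return load_job.result()  # Waits for the load job to finish.
```

Inside the Cloud Function this would be used as `load_rows(client, [build_weather_row(responseJson)], "myproj.weather.weather_de")`. A `LoadJobConfig` can still be passed through `job_config=` if you need to set the schema or write disposition explicitly.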

Can you delete rows in BigQuery from Python script?

Hi, is there a way to delete rows in BigQuery from a Python script? I looked through the documentation and searched for an example online, but I could not find anything.
Something that looks like this:
table_id = "a.dataset.table" # Table ID for faulty_gla_entry
statement = """ DELETE FROM a.dataset.table where value = 2 """
client.delete(table_id, statement)
As @SergeyGeron stated, https://googleapis.dev/python/bigquery/latest/usage/index.html#bigquery-basics has useful material.
I wrote something like this:
from google.cloud import bigquery
client = bigquery.Client()
query = """DELETE FROM a.dataset.table WHERE value = 4"""
query_job = client.query(query)
print(query_job.result())
Here you can find sample code and documentation for executing a query with Python.
This example code uses a DELETE statement:
from google.cloud import bigquery
client = bigquery.Client()
dml_statement = (
    "Delete from dataset.Inventory where ID=5"
)
query_job = client.query(dml_statement)  # API request
query_job.result()  # Waits for statement to finish
How to build queries in BigQuery
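Printing `query_job.result()` for a DELETE does not show how many rows were removed. A minimal sketch of a helper that reports the count instead, assuming `client` is a `bigquery.Client`; the helper name `delete_rows` is hypothetical.

```python
def delete_rows(client, dml_statement):
    """Run a DML statement and return how many rows it affected."""
    query_job = client.query(dml_statement)  # API request
    query_job.result()  # Waits for the statement to finish
    # For DML statements the finished job reports the number of modified rows.
    return query_job.num_dml_affected_rows
```

For example, `delete_rows(client, "DELETE FROM a.dataset.table WHERE value = 4")` would return the number of deleted rows once the job has completed.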

Appending CSV to BigQuery table with Python client

I have a new CSV file each week in the same format, which I need to append to a BigQuery table using the Python client. I successfully created the table using the first CSV, but I am unsure how to append subsequent CSVs going forward. The only way I have found is the google.cloud.bigquery.client.Client().insert_rows() method. See API link here. This would require me to first read the CSV in as a list of dictionaries. Is there a better way to append data from a CSV to a BigQuery table?
See the simple example below:
# from google.cloud import bigquery
# client = bigquery.Client()
# table_ref = client.dataset('my_dataset').table('existing_table')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = "gs://your_bucket/path/your_file.csv"
load_job = client.load_table_from_uri(
    uri, table_ref, job_config=job_config
)  # API request
print("Starting job {}".format(load_job.job_id))
load_job.result() # Waits for table load to complete.
print("Job finished.")
destination_table = client.get_table(table_ref)
print("Loaded {} rows.".format(destination_table.num_rows))
See more details in the BigQuery documentation.
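If the weekly CSV sits on the local disk rather than in GCS, the same kind of load job can read from a file object through `load_table_from_file()`. A minimal sketch, assuming `client` is a `bigquery.Client` and `job_config` is a `LoadJobConfig` set up as in the example above (WRITE_APPEND, skip_leading_rows = 1); the helper name `append_local_csv` is hypothetical.

```python
def append_local_csv(client, csv_path, table_id, job_config):
    """Append a local CSV file to an existing table with a load job."""
    # The client expects a file object opened in binary mode.
    with open(csv_path, "rb") as source_file:
        load_job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
        load_job.result()  # Waits for the table load to complete.
    return load_job
```

A weekly run would then be `append_local_csv(client, "your_file.csv", "project.my_dataset.existing_table", job_config)`, where the path and table ID are placeholders.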

How to run a BigQuery query in Python

This is the query I have been running in BigQuery, and I want to run it from my Python script. What do I have to change or add for it to run in Python?
#standardSQL
SELECT
  Serial,
  MAX(createdAt) AS Latest_Use,
  SUM(ConnectionTime/3600) as Total_Hours,
  COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
From what I have researched, it seems that I can't save this query as a permanent table using Python. Is that true? And if it is, is it still possible to export a temporary table?
You need to use the BigQuery Python client lib, then something like this should get you up and running:
from google.cloud import bigquery

client = bigquery.Client(project='PROJECT_ID')
query = "SELECT...."

# Client.run_async_query() was removed in google-cloud-bigquery 0.28;
# Client.query() with a QueryJobConfig is the current equivalent.
job_config = bigquery.QueryJobConfig(
    destination='PROJECT_ID.dataset.table',
    write_disposition='WRITE_TRUNCATE',
)
job = client.query(query, job_config=job_config)  # Starts the job
https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html
See the current BigQuery Python client tutorial.
Here is another way using a JSON file for the service account:
>>> from google.cloud import bigquery
>>>
>>> CREDS = 'test_service_account.json'
>>> client = bigquery.Client.from_service_account_json(json_credentials_path=CREDS)
>>> job = client.query('select * from dataset1.mytable')
>>> for row in job.result():
...     print(row)
This is a good usage guide:
https://googleapis.github.io/google-cloud-python/latest/bigquery/usage/index.html
To simply run and write a query:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'your_dataset_id'
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table("your_table_id")
job_config.destination = table_ref
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(
    sql,
    # Location must match that of the dataset(s) referenced in the query
    # and of the destination table.
    location="US",
    job_config=job_config,
)  # API request - starts the query
query_job.result() # Waits for the query to finish
print("Query results loaded to table {}".format(table_ref.path))
I personally prefer querying using pandas:
# BQ authentication
import pandas as pd
import pydata_google_auth

SCOPES = [
    'https://www.googleapis.com/auth/cloud-platform',
    'https://www.googleapis.com/auth/drive',
]
credentials = pydata_google_auth.get_user_credentials(
    SCOPES,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note, this doesn't work if you're running from a
    # notebook on a remote server, such as over SSH or with Google Colab.
    auth_local_webserver=True,
)
query = "SELECT * FROM my_table"
# MY_PROJECT_ID is a placeholder for your own project ID.
data = pd.read_gbq(query, project_id=MY_PROJECT_ID, credentials=credentials, dialect='standard')
The pythonbq package is very simple to use and a great place to start. It uses python-gbq.
To get started you would need to generate a BQ json key for external app access. You can generate your key here.
Your code would look something like:
from pythonbq import pythonbq
myProject = pythonbq(
    bq_key_path='path/to/bq/key.json',
    project_id='myGoogleProjectID'
)
SQL_CODE = """
SELECT
  Serial,
  MAX(createdAt) AS Latest_Use,
  SUM(ConnectionTime/3600) as Total_Hours,
  COUNT(DISTINCT DeviceID) AS Devices_Connected
FROM `dataworks-356fa.FirebaseArchive.testf`
WHERE Model = "BlueBox-pH"
GROUP BY Serial
ORDER BY Serial
LIMIT 1000;
"""
output = myProject.query(sql=SQL_CODE)
