How to use projection_fields parameter in Bigquery Python API - python

I am trying to load the Cloud Firestore export in Google Cloud Storage into Bigquery using the Python API. I need to load only a select few fields for which I want to use --projection_fields parameter. However, I couldn't able to successfully use this parameter in my code. I'm referring this doc: https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore
I am using google.cloud library.
Cannot find this field in the bigquery or firestore libraries.
Any tip on how to use this field using the Python API will be of great help.
import os
from google.cloud import bigquery
creds_file_path = "xxxx.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = creds_file_path
bigquery_client = bigquery.Client()
dataset_ref = bigquery_client.dataset('abcd')
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
job_config.source_format = bigquery.SourceFormat.DATASTORE_BACKUP

Reviewing the pyhon client library changelog it seems that does not support this option yet. However, you can use this workaround to include the projectionFields property, and for that matter, any property that is not supported by the client yet, but it is for the API.
my_list_of_properties = [] # The properties you want to include on the table
job_config._set_sub_prop('projectionFields', my_list_of_properties)

Related

How can I see the service account that the python bigquery client uses?

To create a default bigquery client I use:
from google.cloud import bigquery
client = bigquery.Client()
This uses the (default) credentials available in the environment.
But how I see then which (default) service account is used?
While you can interrogate the credentials directly (be it json keys, metadata server, etc), I have occasionally found it valuable to simply query bigquery using the SESSION_USER() function.
Something quick like this should suffice:
client = bigquery.Client()
query_job = client.query("SELECT SESSION_USER() as whoami")
results = query_job.result()
for row in results:
print("i am {}".format(row.whoami))
This led me in the right direction:
Google BigQuery Python Client using the wrong credentials
To see the service-account used you can do:
client._credentials.service_account_email
However:
This statement above works when you run it on a jupyter notebook (in Vertex AI), but when you run it in a cloud function with print(client._credentials.service_account_email) then it just logs 'default' to Cloud Logging. But the default service account for a Cloud Function should be: <project_id>#appspot.gserviceaccount.com.
This will also give you the wrong answer:
client.get_service_account_email()
The call to client.get_service_account_email() does not return the credential's service account email address. Instead, it returns the BigQuery service account email address used for KMS encryption/decryption.
Following John Hanley's comment (when running on a Compute Engine) you can query the metadata service to get the email user name:
https://cloud.google.com/compute/docs/metadata/default-metadata-values
So you can either use linux:
curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email" -H "Metadata-Flavor: Google"
Or python:
import requests
headers = {'Metadata-Flavor': 'Google'}
response = requests.get(
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email",
headers=headers
)
print(response.text)
The default in the url used is the alias of the actual service account used.

Google Big Query from Python

I am trying to run a simple query on BigQuery from Python and follow this document. To set the client I generated the JSON file for my project via service account:
import pandas as pd
from google.cloud import bigquery
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=*****
client = bigquery.Client()
QUERY = (
'SELECT name FROM `mythic-music-326213.mytestdata.trainData` '
'LIMIT 100')
query_job = client.query(QUERY)
However, I am getting the following error:
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
Technically, I want to be able to query my dataset from Python. Any help would be appreciated.
I've tried your code snippet with my service account JSON file and dataset in my project. It worked as expected. Not clear why it's not working in your case.
Hovewer you can try to use service account JSON file directly like that:
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file('<path to JSON file>')
client = bigquery.Client(credentials=credentials)
QUERY = (
'SELECT state FROM `so-project-a.test.states` '
'LIMIT 100')
query_job = client.query(QUERY)

Cannot query tables from sheets in BigQuery

I am trying to use BigQuery inside python to query a table that is generated via a sheet:
from google.cloud import bigquery
# Prepare connexion and query
bigquery_client = bigquery.Client(project="my_project")
query = """
select * from `table-from-sheets`
"""
df = bigquery_client.query(query).to_dataframe()
I can usually do queries to BigQuery tables, but now I am getting the following error:
Forbidden: 403 Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
What do I need to do to access drive from python?
Is there another way around?
You are missing the scopes for the credentials. I'm pasting the code snippet from the official documentation.
In addition, do not forget to give at least VIEWER access to the Service Account in the Google sheet.
from google.cloud import bigquery
import google.auth
# Create credentials with Drive & BigQuery API scopes.
# Both APIs must be enabled for your project before running this code.
credentials, project = google.auth.default(
scopes=[
"https://www.googleapis.com/auth/drive",
"https://www.googleapis.com/auth/bigquery",
]
)
# Construct a BigQuery client object.
client = bigquery.Client(credentials=credentials, project=project)

bigQuery Google Cloud how to share dataset with other users? using python

I have a bigQuery dataset defined in Google Cloud with my userA account, and I want my colleague userB, who is a member of the same group, to be able to see the dataset that I have defined. Using the bq command-line interface, userB can see the project, but not the dataset. How can I share the dataset created by userA with userB using python script?
Another thing you may run into is that you must give access at the data set level in BigQuery. Depending on how you have setup user roles in cloud platform and BigQuery, you may need to give the service account direct access to the Bigquery data set.
To do this go into BigQuery, hover on your dataset and click the down arrow, select 'share data set'. A modal will open where you can then specify which email address's and service accounts to share the data set with and control their access rights.
Let me know if my instructions are too confusing and I'll upload some images showing exactly how to do this.
Good Luck!!
An example using the Python Client Library. Adapted from here but adding a get_dataset call to get the current ACL policy for already existing datasets:
from google.cloud import bigquery
project_id = "PROJECT_ID"
dataset_id = "DATASET_NAME"
group_name= "google-group-name#google.com"
role = "READER"
client = bigquery.Client(project=project_id)
dataset_info = client.get_dataset(client.dataset(dataset_id))
access_entries = dataset_info.access_entries
access_entries.append(
bigquery.AccessEntry(role, "groupByEmail", group_name)
)
dataset_info.access_entries = access_entries
dataset_info = client.update_dataset(
dataset_info, ['access_entries'])
Another way to do it is using the Google Python API Client and the get and patch methods. First, we retrieve the existing dataset ACL, add the group as READER to the response and patch the dataset metadata:
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
project_id="PROJECT_ID"
dataset_id="DATASET_NAME"
group_name="google-group-name#google.com"
role="READER"
credentials = GoogleCredentials.get_application_default()
bq = discovery.build("bigquery", "v2", credentials=credentials)
response = bq.datasets().get(projectId=project_id, datasetId=dataset_id).execute()
response['access'].append({u'role': u'{}'.format(role), u'groupByEmail': u'{}'.format(group_name)})
bq.datasets().patch(projectId=project_id, datasetId=dataset_id, body=response).execute()
Replace the project_id, dataset_id, group_name and role variables accordingly.
Versions used:
$ pip freeze | grep -E 'bigquery|api-python'
google-api-python-client==1.7.7
google-cloud-bigquery==1.8.1

How to use Bigquery streaming insertall on app engine & python

I would like to develop an app engine application that directly stream data into a BigQuery table.
According to Google's documentation there is a simple way to stream data into bigquery:
http://googlecloudplatform.blogspot.co.il/2013/09/google-bigquery-goes-real-time-with-streaming-inserts-time-based-queries-and-more.html
https://developers.google.com/bigquery/streaming-data-into-bigquery#streaminginsertexamples
(note: in the above link you should select the python tab and not Java)
Here is the sample code snippet on how streaming insert should be coded:
body = {"rows":[
{"json": {"column_name":7.7,}}
]}
response = bigquery.tabledata().insertAll(
projectId=PROJECT_ID,
datasetId=DATASET_ID,
tableId=TABLE_ID,
body=body).execute()
Although I've downloaded the client api I didn't find any reference to a "bigquery" module/object referenced in the above Google's example.
Where is the the bigquery object (from snippet) should be located?
Can anyone show a more complete way to use this snippet (with the right imports)?
I've Been searching for that a lot and found documentation confusing and partial.
Minimal working (as long as you fill in the right ids for your project) example:
import httplib2
from apiclient import discovery
from oauth2client import appengine
_SCOPE = 'https://www.googleapis.com/auth/bigquery'
# Change the following 3 values:
PROJECT_ID = 'your_project'
DATASET_ID = 'your_dataset'
TABLE_ID = 'TestTable'
body = {"rows":[
{"json": {"Col1":7,}}
]}
credentials = appengine.AppAssertionCredentials(scope=_SCOPE)
http = credentials.authorize(httplib2.Http())
bigquery = discovery.build('bigquery', 'v2', http=http)
response = bigquery.tabledata().insertAll(
projectId=PROJECT_ID,
datasetId=DATASET_ID,
tableId=TABLE_ID,
body=body).execute()
print response
As Jordan says: "Note that this uses the appengine robot to authenticate with BigQuery, so you'll to add the robot account to the ACL of the dataset. Note that if you also want to use the robot to run queries, not just stream, you need the robot to be a member of the project 'team' so that it is authorized to run jobs."
Here is a working code example from an appengine app that streams records to a BigQuery table. It is open source at code.google.com:
http://code.google.com/p/bigquery-e2e/source/browse/sensors/cloud/src/main.py#124
To find out where the bigquery object comes from, see
http://code.google.com/p/bigquery-e2e/source/browse/sensors/cloud/src/config.py
Note that this uses the appengine robot to authenticate with BigQuery, so you'll to add the robot account to the ACL of the dataset.
Note that if you also want to use the robot to run queries, not just stream, you need to robot to be a member of the project 'team' so that it is authorized to run jobs.

Categories

Resources