Firestore to BigQuery using Cloud Functions - Python

I have a BigQuery table that is linked using the Firestore-to-BigQuery extension. However, I want the load to BigQuery to run based on an event trigger from Cloud Firestore (e.g. whenever a document is changed or inserted in a specific collection), so that new data is streamed directly into the BigQuery table.
I also want the Cloud Function to be written in Python, but most of the online examples are in JavaScript (example: https://blog.questionable.services/article/from-firestore-to-bigquery-firebase-functions/).
When I tried to follow https://cloud.google.com/functions/docs/calling/cloud-firestore and deploy the Cloud Function below, it kept failing.
import json
import os

import firebase_admin
from firebase_admin import credentials, firestore
from google.cloud import bigquery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YYYYY.json"
cred = credentials.Certificate("YYYYY.json")
firebase_admin.initialize_app(cred)

db = firestore.client()
bq = bigquery.Client()
collection = db.collection('XXXX')

def hello_firestore(data, context):
    """Triggered by a change to a Firestore document.
    Args:
        data (dict): The event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    trigger_resource = context.resource
    print('Function triggered by change to: %s' % trigger_resource)
    print('\nOld value:')
    print(json.dumps(data["oldValue"]))
    print('\nNew value:')
    print(json.dumps(data["value"]))
Any ideas or suggestions would be appreciated. Thank you in advance.
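Not an official answer, but a minimal sketch of the Python approach, assuming the destination dataset and table already exist and that the documents only contain simple scalar fields; the table ID, field handling and function name are illustrative placeholders, not taken from the question.
from google.cloud import bigquery

# On Cloud Functions the runtime service account is used automatically,
# so no key file or GOOGLE_APPLICATION_CREDENTIALS is needed.
bq = bigquery.Client()

# Placeholder destination table; adjust to your own project/dataset/table.
TABLE_ID = "my_project.my_dataset.my_table"

def firestore_to_bigquery(data, context):
    """Triggered by a Firestore document create/update; streams the new
    document state into BigQuery."""
    # The new document state is in data["value"]["fields"]; each field is
    # wrapped in a type key such as stringValue or integerValue. This only
    # unwraps simple scalar fields.
    fields = data["value"]["fields"]
    row = {name: list(value.values())[0] for name, value in fields.items()}
    row["document_name"] = context.resource

    errors = bq.insert_rows_json(TABLE_ID, [row])
    if errors:
        print("BigQuery insert errors: %s" % errors)
Deployed with a Firestore trigger, for example gcloud functions deploy firestore_to_bigquery --runtime python310 --trigger-event providers/cloud.firestore/eventTypes/document.write --trigger-resource "projects/my_project/databases/(default)/documents/XXXX/{docId}", the function then runs on every write to the XXXX collection.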

Related

How to point to the ARN of a DynamoDB table instead of using the name when using boto3

I'm trying to access a DynamoDB table in another account without having to make any code changes, if possible. I've set up the IAM users, roles and policies to make this possible, and have succeeded with other services such as SQS and S3.
The problem I have now is with DynamoDB, as the code to initialise the boto3.resource connection seems to only allow me to point to the table name (per the docs).
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-2')
table = dynamodb.Table(config['dynamo_table_1'])
This causes the code to try to access a table with that name in the account the code is executing in, which errors out because the table exists in a different AWS account.
Is there a way to pass the ARN of the table or some identifier that would allow me to specify the accountID?
There's sample code at https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/configure-cross-account-access-to-amazon-dynamodb.html which shows how to do cross-account access. Here is a snippet from the attached zip. I expect you could use .resource() as well as .client() with the same arguments (a sketch of that follows the snippet).
import boto3

sts_client = boto3.client('sts')
sts_session = sts_client.assume_role(
    RoleArn='arn:aws:iam::<Account-A ID>:role/DynamoDB-FullAccess-For-Account-B',
    RoleSessionName='test-dynamodb-session')

KEY_ID = sts_session['Credentials']['AccessKeyId']
ACCESS_KEY = sts_session['Credentials']['SecretAccessKey']
TOKEN = sts_session['Credentials']['SessionToken']

dynamodb_client = boto3.client('dynamodb',
                               region_name='us-east-2',
                               aws_access_key_id=KEY_ID,
                               aws_secret_access_key=ACCESS_KEY,
                               aws_session_token=TOKEN)
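And the .resource() variant mentioned above, as a minimal sketch using the same assumed-role credentials; the table name is a placeholder standing in for config['dynamo_table_1'] from the question.
import boto3

sts_client = boto3.client('sts')
sts_session = sts_client.assume_role(
    RoleArn='arn:aws:iam::<Account-A ID>:role/DynamoDB-FullAccess-For-Account-B',
    RoleSessionName='test-dynamodb-session')
creds = sts_session['Credentials']

# Same credentials, but through boto3.resource so Table() works as before.
dynamodb = boto3.resource('dynamodb',
                          region_name='us-east-2',
                          aws_access_key_id=creds['AccessKeyId'],
                          aws_secret_access_key=creds['SecretAccessKey'],
                          aws_session_token=creds['SessionToken'])

# The name still refers to the table in Account A, because the credentials
# (not the name) decide which account is targeted.
table = dynamodb.Table('dynamo_table_1')
print(table.item_count)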

Cannot query tables from Sheets in BigQuery

I am trying to use BigQuery from Python to query a table that is generated from a Google Sheet:
from google.cloud import bigquery

# Prepare connection and query
bigquery_client = bigquery.Client(project="my_project")

query = """
select * from `table-from-sheets`
"""

df = bigquery_client.query(query).to_dataframe()
I can usually query BigQuery tables, but now I am getting the following error:
Forbidden: 403 Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
What do I need to do to access Drive from Python?
Is there another way around this?
You are missing the scopes for the credentials. I'm pasting the code snippet from the official documentation.
In addition, do not forget to give the service account at least Viewer access on the Google Sheet.
from google.cloud import bigquery
import google.auth

# Create credentials with Drive & BigQuery API scopes.
# Both APIs must be enabled for your project before running this code.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/drive",
        "https://www.googleapis.com/auth/bigquery",
    ]
)

# Construct a BigQuery client object.
client = bigquery.Client(credentials=credentials, project=project)
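With the scoped client in place, the original query should then work unchanged; a short usage sketch, reusing the placeholder table name from the question:
query = """
select * from `table-from-sheets`
"""

# The Drive scope lets BigQuery read the external, Sheets-backed table.
df = client.query(query).to_dataframe()
print(df.head())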

BigQuery unit testing using Python

I am trying to test a BigQuery class with a Mock object to represent the table. Instances of my BigQueryRequest class must provide the BigQuery table URI. Would it be possible for me to create a mock BigQuery table directly from Python? How would that be possible?
class BigQueryRequest:
    """BigQueryRequest
    Contains a BigQuery request with its parameters.
    Receives a table uri ($project_id.$dataset.$table) to run a query.
    Args:
        uri (str): BigQuery table uri
    Properties:
        BigQueryRequest.project: return the project running BigQuery
        BigQueryRequest.dataset: return the dataset
        BigQueryRequest.table: return the table to query
        BigQueryRequest.destination_project: same as project but for the destination project
        BigQueryRequest.destination_dataset: same as dataset but for the destination dataset
        BigQueryRequest.destination_table: same as table but for the destination table
    Methods:
        from_uri(): (classmethod) parse a BigQuery uri into its project, dataset, table
        destination(): return a uri of the BigQuery request destination table
        query(): run the given BigQuery query
    Private methods:
        __set_destination(): generate a destination uri following the nomenclature or reuse the entry uri
    """

    def __init__(self, uri="", step="", params={}):
        self.project, self.dataset, self.table = self.from_uri(uri)
        self.step = step
        self.params = self.set_params(params)
        self.overwrite = False
        (
            self.destination_project,
            self.destination_dataset,
            self.destination_table,
        ) = self.__set_destination()
You'd have to do it yourself; Google Cloud does not provide an official mocking library for GCP products or services.
You could also try https://github.com/Khan/tinyquery as an alternative.
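For the do-it-yourself route, here is a minimal sketch using unittest.mock, assuming BigQueryRequest lives in a module named bigquery_request that does "from google.cloud import bigquery" and creates a bigquery.Client inside query(); the module path, the call signature and the asserted behaviour are assumptions for illustration, not part of the original code.
from unittest import mock

# Hypothetical module path for the class under test.
from bigquery_request import BigQueryRequest

@mock.patch("bigquery_request.bigquery.Client")
def test_query_uses_table_uri(mock_client_class):
    # The mocked Client returns a fake query job whose result() is empty,
    # so no real BigQuery table or project is needed.
    mock_client = mock_client_class.return_value
    mock_client.query.return_value.result.return_value = []

    req = BigQueryRequest(uri="my_project.my_dataset.my_table")
    req.query()

    # Assert how the client was used rather than what BigQuery would return.
    mock_client.query.assert_called_once()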
If you plan to test your SQL and assert your results based on input data, I would suggest bq-test-kit. This framework allows you to interact with BigQuery in Python and makes tests reliable.
It gives you three ways to inject data:
Create datasets and tables, with the ability to isolate their names and therefore have your own namespace
Rely on temp tables, where data is inserted with data literals
Merge data literals directly into your query
Hope this helps.

How to use body parameters defined in Google Cloud Scheduler in a Google Cloud Function (Python)?

In my Cloud Function I execute a query and write the job result into a new BigQuery table. I want the query to be dynamic based on some unit values (parameters supplied from outside). I trigger this Cloud Function from Google Cloud Scheduler, which sends some parameter values in the Body section of an HTTP POST call. Can anybody suggest how to use these parameter values from the body of the Cloud Scheduler job inside my Cloud Function to make my query dynamic?
I am passing certain parameters in the Body section of the Cloud Scheduler job but don't know how to use them in the Cloud Function.
Body of the Cloud Scheduler job:
{
    "unit": "myunitname",
    "interval": "1"
}
Cloud Function:
import flask
from google.cloud import bigquery

app = flask.Flask(__name__)

def main(request):
    with app.app_context():
        # unit and interval are not defined yet -- this is the open question:
        # how do I get them from the request body sent by Cloud Scheduler?
        query = """
            SELECT unitId FROM `myproject.mydataset.mytable`
            WHERE unit = '{}' AND interval = '{}'
        """.format(unit, interval)

        client = bigquery.Client()
        job_config = bigquery.QueryJobConfig()
        dest_dataset = client.dataset('mydataset', 'myproject')
        dest_table = dest_dataset.table('mytable')
        job_config.destination = dest_table
        job_config.create_disposition = 'CREATE_IF_NEEDED'
        job_config.write_disposition = 'WRITE_APPEND'

        job = client.query(query, job_config=job_config)
        job.result()
    return "Triggered"
The best way to do this is to keep everything internal to Google Cloud using Pub/Sub:
[official Google tutorial]
Cloud Scheduler sends a message to Pub/Sub with a payload containing the information your Cloud Function needs. The Cloud Function should be triggered off the Pub/Sub topic, and you can then access the message data and attributes you need for the dynamic portion you are referencing. A sketch of such a function follows the diagram below.
Cloud Scheduler -- publishes to --> Pub/Sub Topic -- subscriber push --> Cloud Function
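A minimal sketch of the Pub/Sub-triggered variant, assuming the Cloud Scheduler job publishes the JSON body from the question as the message data; the function name, decoding and use of query parameters are illustrative choices, not from the original answer.
import base64
import json

from google.cloud import bigquery

def main(event, context):
    """Background function triggered by a Pub/Sub message from Cloud Scheduler."""
    # The Scheduler payload arrives base64-encoded in event["data"].
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    unit = payload["unit"]
    interval = payload["interval"]

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string("myproject.mydataset.mytable"),
        create_disposition="CREATE_IF_NEEDED",
        write_disposition="WRITE_APPEND",
        # Query parameters avoid string formatting and SQL injection.
        query_parameters=[
            bigquery.ScalarQueryParameter("unit", "STRING", unit),
            bigquery.ScalarQueryParameter("interval", "STRING", interval),
        ],
    )
    query = """
        SELECT unitId FROM `myproject.mydataset.mytable`
        WHERE unit = @unit AND interval = @interval
    """
    client.query(query, job_config=job_config).result()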
If you want to keep using HTTP, you can follow here where it describes how to handle a POST in your function. In a Python HTTP Cloud Function the handler receives a Flask request object, so you can call request.get_json() to pull the body values that you passed in your Cloud Scheduler job.
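And a minimal sketch of the HTTP variant the question already uses, assuming Cloud Scheduler POSTs the JSON body shown above with a Content-Type of application/json; only the body parsing and parameter handling are shown, and the destination-table job_config from the question can be kept as it is.
from google.cloud import bigquery

def main(request):
    """HTTP Cloud Function; `request` is a Flask request object."""
    # Cloud Scheduler sends the Body section as the JSON payload of the POST.
    body = request.get_json(silent=True) or {}
    unit = body["unit"]
    interval = body["interval"]

    client = bigquery.Client()
    query = """
        SELECT unitId FROM `myproject.mydataset.mytable`
        WHERE unit = @unit AND interval = @interval
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("unit", "STRING", unit),
            bigquery.ScalarQueryParameter("interval", "STRING", interval),
        ]
    )
    client.query(query, job_config=job_config).result()
    return "Triggered"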

How to use BigQuery streaming insertAll on App Engine & Python

I would like to develop an App Engine application that streams data directly into a BigQuery table.
According to Google's documentation there is a simple way to stream data into BigQuery:
http://googlecloudplatform.blogspot.co.il/2013/09/google-bigquery-goes-real-time-with-streaming-inserts-time-based-queries-and-more.html
https://developers.google.com/bigquery/streaming-data-into-bigquery#streaminginsertexamples
(note: in the above link you should select the Python tab, not the Java one)
Here is the sample snippet showing how a streaming insert should be coded:
body = {"rows": [
    {"json": {"column_name": 7.7}}
]}
response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body=body).execute()
Although I've downloaded the client API, I didn't find any reference to the "bigquery" module/object used in Google's example above.
Where should the bigquery object (from the snippet) come from?
Can anyone show a more complete way to use this snippet (with the right imports)?
I've been searching for this a lot and found the documentation confusing and partial.
A minimal working example (as long as you fill in the right IDs for your project):
import httplib2
from apiclient import discovery
from oauth2client import appengine

_SCOPE = 'https://www.googleapis.com/auth/bigquery'

# Change the following 3 values:
PROJECT_ID = 'your_project'
DATASET_ID = 'your_dataset'
TABLE_ID = 'TestTable'

body = {"rows": [
    {"json": {"Col1": 7}}
]}

credentials = appengine.AppAssertionCredentials(scope=_SCOPE)
http = credentials.authorize(httplib2.Http())

bigquery = discovery.build('bigquery', 'v2', http=http)

response = bigquery.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body=body).execute()

print(response)
As Jordan says: "Note that this uses the App Engine robot to authenticate with BigQuery, so you'll need to add the robot account to the ACL of the dataset. Note that if you also want to use the robot to run queries, not just stream, you need the robot to be a member of the project 'team' so that it is authorized to run jobs."
Here is a working code example from an App Engine app that streams records to a BigQuery table. It is open source at code.google.com:
http://code.google.com/p/bigquery-e2e/source/browse/sensors/cloud/src/main.py#124
To find out where the bigquery object comes from, see
http://code.google.com/p/bigquery-e2e/source/browse/sensors/cloud/src/config.py
Note that this uses the App Engine robot to authenticate with BigQuery, so you'll need to add the robot account to the ACL of the dataset.
Note that if you also want to use the robot to run queries, not just stream, you need the robot to be a member of the project 'team' so that it is authorized to run jobs.
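For what it's worth, the answers above use the legacy Python 2 App Engine libraries (oauth2client, apiclient). On a current runtime the same streaming insert can be done with the google-cloud-bigquery client; a minimal sketch, reusing the placeholder project/dataset/table names from the answer:
from google.cloud import bigquery

# Uses Application Default Credentials (e.g. the App Engine service account).
client = bigquery.Client(project='your_project')

rows = [{"Col1": 7}]
errors = client.insert_rows_json('your_project.your_dataset.TestTable', rows)
if errors:
    print('Streaming insert errors: %s' % errors)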
