How to create BigQuery Data Transfer Service using Python

I tried creating a Data Transfer Service with bigquery_datatransfer. I installed the following Python library:
pip install --upgrade google-cloud-bigquery-datatransfer
and used the method
create_transfer_config(parent, transfer_config)
I have defined the transfer_config values for the data_source_id amazon_s3:
transfer_config = {
    "destination_dataset_id": "My Dataset",
    "display_name": "test_bqdts",
    "data_source_id": "amazon_s3",
    "params": {
        "destination_table_name_template": "destination_table_name",
        "data_path": <data_path>,
        "access_key_id": args.access_key_id,
        "secret_access_key": args.secret_access_key,
        "file_format": <>
    },
    "schedule": "every 10 minutes"
}
But while running the script I'm getting the following error,
ValueError: Protocol message Struct has no "destination_table_name_template" field.
The fields given inside params are not recognized, and I couldn't find documentation for which fields belong inside the "params" struct.
Which fields must be defined inside the "params" of transfer_config to create the Data Transfer job successfully?

As shown in the documentation, you should pass your config dict through the google.protobuf.json_format.ParseDict() function:
transfer_config = google.protobuf.json_format.ParseDict(
    {
        "destination_dataset_id": dataset_id,
        "display_name": "Your Scheduled Query Name",
        "data_source_id": "scheduled_query",
        "params": {
            "query": query_string,
            "destination_table_name_template": "your_table_{run_date}",
            "write_disposition": "WRITE_TRUNCATE",
            "partitioning_field": "",
        },
        "schedule": "every 24 hours",
    },
    bigquery_datatransfer_v1.types.TransferConfig(),
)
Please let me know if this helps.
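Applying the same pattern to the amazon_s3 source from the question, a minimal sketch (the dataset, bucket path, credential and file-format values are placeholders, and the client calls in the closing comment assume the current google-cloud-bigquery-datatransfer API):

```python
def build_s3_transfer_config(dataset_id, data_path, access_key_id, secret_access_key):
    """Build the plain dict that ParseDict() turns into a TransferConfig."""
    return {
        "destination_dataset_id": dataset_id,
        "display_name": "test_bqdts",
        "data_source_id": "amazon_s3",
        "params": {
            "destination_table_name_template": "destination_table_name",
            "data_path": data_path,            # e.g. "s3://your-bucket/path/*"
            "access_key_id": access_key_id,
            "secret_access_key": secret_access_key,
            "file_format": "CSV",              # assumption: CSV source files
        },
        "schedule": "every 10 minutes",
    }

# To actually create the transfer you would then (sketch, needs GCP credentials):
#   from google.cloud import bigquery_datatransfer_v1
#   from google.protobuf import json_format
#   cfg = json_format.ParseDict(
#       build_s3_transfer_config(...),
#       bigquery_datatransfer_v1.types.TransferConfig(),
#   )
#   client = bigquery_datatransfer_v1.DataTransferServiceClient()
#   client.create_transfer_config(parent=parent, transfer_config=cfg)
```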

Related

BigQueryInsertJobOperator with Export Configuration

I am trying to retrieve GA data from BigQuery using the operators provided in the Airflow documentation.
The documentation is not very explicit about the usage of BigQueryInsertJobOperator, which replaces BigQueryExecuteQueryOperator.
My DAG works as follows:
In a dataset, list the table names.
Using BigQueryInsertJobOperator, query all the tables using this syntax from the cookbook:
`{my-project}.{my-dataset}.events_*`
WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'
select_query_job = BigQueryInsertJobOperator(
    task_id="select_query_job",
    gcp_conn_id='big_query',
    configuration={
        "query": {
            "query": build_query.output,
            "useLegacySql": False,
            "allowLargeResults": True,
            "useQueryCache": True,
        }
    }
)
Retrieve the job id from the XCom and use BigQueryInsertJobOperator with extract in the configuration to get the query results, as in this API.
However, I receive an error message and I am unable to access the data. All the steps before step 3 work perfectly; I can see it from the cloud console.
The operators I tried:
retrieve_job_data = BigQueryInsertJobOperator(
    task_id="get_job_data",
    gcp_conn_id='big_query',
    job_id=select_query_job.output,
    project_id=project_name,
    configuration={
        "extract": {
        }
    }
)
# Or
retrieve_job_data = BigQueryInsertJobOperator(
    task_id="get_job_data",
    gcp_conn_id='big_query',
    configuration={
        "extract": {
            "jobId": select_query_job.output,
            "projectId": project_name
        }
    }
)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/{my-project}/jobs?prettyPrint=false: Required parameter is missing
[2022-08-16, 09:44:01 UTC] {taskinstance.py:1415} INFO - Marking task as FAILED. dag_id=BIG_QUERY, task_id=get_job_data, execution_date=20220816T054346, start_date=20220816T054358, end_date=20220816T054401
[2022-08-16, 09:44:01 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 628 for task get_job_data (400 POST https://bigquery.googleapis.com/bigquery/v2/projects/{my-project}/jobs?prettyPrint=false: Required parameter is missing; 100144)
Following the above link gives:
{
    "error": {
        "code": 401,
        "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
        "errors": [
            {
                "message": "Login Required.",
                "domain": "global",
                "reason": "required",
                "location": "Authorization",
                "locationType": "header"
            }
        ],
        "status": "UNAUTHENTICATED",
        "details": [
            {
                "@type": "type.googleapis.com/google.rpc.ErrorInfo",
                "reason": "CREDENTIALS_MISSING",
                "domain": "googleapis.com",
                "metadata": {
                    "service": "bigquery.googleapis.com",
                    "method": "google.cloud.bigquery.v2.JobService.ListJobs"
                }
            }
        ]
    }
}
Following the link gives an HTTP 401, and I don't have access to GCP there, which is not normal since my gcp_conn_id works in the other operators (even when specifying the project ID!).
For the ExtractJob type, you must pass a sourceTable and a destinationUri (or destinationUris).
This explains the 400 "Required parameter is missing" error message.
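A minimal extract configuration therefore needs to carry both fields; a sketch with placeholder project, dataset, table and bucket values:

```python
# Sketch: the fields BigQuery's ExtractJob requires (values are placeholders).
extract_configuration = {
    "extract": {
        "sourceTable": {
            "projectId": "my-project",
            "datasetId": "my_dataset",
            "tableId": "my_table",
        },
        "destinationUris": ["gs://your-bucket/some-path/export-*.csv"],
    }
}
```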
Now that you have a job_id, you can implement a pre_execute hook in your operator to fetch the job.
The destinationTable field in the query job's configuration is what you need for the extract job's sourceTable. Even though you configured the query job to useQueryCache, BigQuery stores the results in an anonymous table.
The configuration for the Query job when it is retrieved looks like:
{
    /*...*/
    "configuration": {
        "query": {
            "query": "SELECT weight_pounds, state, year, gestation_weeks FROM [bigquery-public-data:samples.natality] ORDER BY weight_pounds DESC LIMIT 10;",
            "destinationTable": {
                "projectId": "redacted",
                "datasetId": "_redacted",
                "tableId": "anon0d85adcadde61fa17550f9841810e343fb5bc82d"
            },
            "writeDisposition": "WRITE_TRUNCATE",
            "priority": "INTERACTIVE",
            "useQueryCache": true,
            "useLegacySql": true
        },
        "jobType": "QUERY"
    },
    /*...*/
}
retrieve_job_data = BigQueryInsertJobOperator(
    task_id="get_job_data",
    gcp_conn_id='big_query',
    job_id=select_query_job.output,
    project_id=project_name,
    pre_execute=populate_extract_source_table,
    configuration={
        "extract": {
            "destinationUris": ["gs://your-bucket/some-path"]
        }
    }
)
def populate_extract_source_table(ctx):
    task = ctx['task']
    job_id = task.job_id  # the job id of the query job
    hook = BigQueryHook(
        gcp_conn_id=task.gcp_conn_id,
        delegate_to=task.delegate_to,
        impersonation_chain=task.impersonation_chain,
    )
    # Retrieve the query job
    job = hook.get_job(
        project_id=task.project_id,
        location=task.location,
        job_id=job_id,
    )
    # Set the sourceTable for the extract job to the query job's destinationTable.
    jr = job.to_api_repr()
    task.configuration['extract']['sourceTable'] = jr["configuration"]["query"]["destinationTable"]

Timeout errors when testing Azure function app

Using Azure functions in a python runtime env. to take latitude/longitude tuples from a snowflake database and return the respective countries. We also want to convert any non-english country names into English.
We initially found that although the script would show output in the terminal while testing on Azure, it would soon return a 503 error (although the script continued to run at this point). If we cancelled the script it would show as a success in the monitor screen of the Azure portal, but leaving the script to run to completion would result in it failing. We decided (partially based on this post) that this was due to the runtime exceeding the maximum HTTP response time allowed. To work around this we tried a number of solutions.
First we extended the function timeout value in the host.json file:
{
    "version": "2.0",
    "logging": {
        "applicationInsights": {
            "samplingSettings": {
                "isEnabled": true,
                "excludedTypes": "Request"
            }
        }
    },
    "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[2.*, 3.0.0)"
    },
    "functionTimeout": "00:10:00"
}
We then modified our script to use a queue trigger by adding the output binding
def main(req: func.HttpRequest, msg: func.Out[func.QueueMessage]) -> func.HttpResponse:
to the main .py script. We then also modified the function.json file to
{
    "scriptFile": "__init__.py",
    "bindings": [
        {
            "authLevel": "function",
            "type": "httpTrigger",
            "direction": "in",
            "name": "req",
            "methods": [
                "get",
                "post"
            ]
        },
        {
            "type": "http",
            "direction": "out",
            "name": "$return"
        },
        {
            "type": "queue",
            "direction": "out",
            "name": "msg",
            "queueName": "processing",
            "connection": "QueueConnectionString"
        }
    ]
}
and the local.settings.json file to
{
    "IsEncrypted": false,
    "Values": {
        "FUNCTIONS_WORKER_RUNTIME": "python",
        "AzureWebJobsStorage": "{AzureWebJobsStorage}",
        "QueueConnectionString": "<Connection String for Storage Account>",
        "STORAGE_CONTAINER_NAME": "testdocs",
        "STORAGE_TABLE_NAME": "status"
    }
}
We also then added a check to see if the country name was already in English. The intention here was to cut down on calls to the translate function.
After each of these changes we redeployed to the functions app and tested again. Same result. The function will run, and print output to terminal, however after a few seconds it will show a 503 error and eventually fail.
I can show a code sample but cannot provide the tables unfortunately.
from snowflake import connector
import pandas as pd
import pyarrow
from geopy.geocoders import Nominatim
from deep_translator import GoogleTranslator
from pprint import pprint
import langdetect
import logging
import azure.functions as func

def main(req: func.HttpRequest, msg: func.Out[func.QueueMessage]) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    # Connection to Snowflake
    conn = connector.connect(
        user='<USER>',
        password='<PASSWORD>',
        account='<ACCOUNT>',
        warehouse='<WH>',
        database='<DB>',
        schema='<SCH>'
    )
    # Creating objects for Snowflake, geolocation, translation
    cur = conn.cursor()
    geolocator = Nominatim(user_agent="geoapiExercises")
    translator = GoogleTranslator(target='en')
    # Fetching weblog data to get the current latlong list
    fetchsql = "SELECT PAGELATLONG FROM <TABLE_NAME> WHERE PAGELATLONG IS NOT NULL GROUP BY PAGELATLONG;"
    logging.info(fetchsql)
    cur.execute(fetchsql)
    df = pd.DataFrame(cur.fetchall(), columns=['PageLatLong'])
    logging.info('created data frame')
    # Creating and inserting the mapping into the final table
    for index, row in df.iterrows():
        latlong = row['PageLatLong']
        location = geolocator.reverse(row['PageLatLong']).raw['address']
        logging.info('got addresses')
        city = str(location.get('state_district'))
        country = str(location.get('country'))
        countrycd = str(location.get('country_code'))
        logging.info('got countries')
        # Detect the language of the country name; res[0] looks like "en:0.997"
        res = langdetect.detect_langs(country)
        lang = str(res[0]).split(':')[0]
        conf = float(str(res[0]).split(':')[1])
        if lang != 'en' and conf > 0.99:
            country = translator.translate(country)
            logging.info('translated non-english country names')
        insertstmt = "INSERT INTO <RESULTS_TABLE> VALUES('"+latlong+"','"+city+"','"+country+"','"+countrycd+"')"
        logging.info(insertstmt)
        try:
            cur.execute(insertstmt)
        except Exception:
            pass
    return func.HttpResponse("success")
If anyone had an idea what may be causing this issue I'd appreciate any suggestions.
Thanks.
To resolve the timeout errors, you can try the following:
As suggested by MayankBargali-MSFT, you can define retry policies. Note that retries for triggers like HTTP and timer don't resume on a new instance, so the max retry count is a best effort: in rare cases an execution can be retried more than the maximum, or, for HTTP and timer triggers, less than the maximum. You can also navigate to Diagnose and solve problems to find the root cause of the 503 error, as there can be multiple reasons for it.
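Retry policies can be declared per function; a sketch of the corresponding function.json fragment (the strategy and values here are examples only):

```json
{
    "retry": {
        "strategy": "fixedDelay",
        "maxRetryCount": 3,
        "delayInterval": "00:00:10"
    }
}
```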
As suggested by ryanchill, a 503 is typically the result of memory consumption exceeding the limits of the Consumption plan. The best resolution is switching to a dedicated hosting plan, which provides more resources; if that isn't an option, explore reducing the amount of data retrieved per invocation.
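One way to reduce per-invocation work is to have the HTTP function only enqueue batches and return immediately, leaving the geocoding to a queue-triggered function. A sketch, assuming the msg binding is declared as func.Out[typing.List[str]] so several messages can be set in one call (the chunk size is arbitrary):

```python
import json

def chunk_messages(latlongs, chunk_size=50):
    """Split the lat/long list into JSON queue messages of chunk_size items each."""
    return [
        json.dumps({"latlongs": latlongs[i:i + chunk_size]})
        for i in range(0, len(latlongs), chunk_size)
    ]

# In the HTTP function body (sketch):
#   msg.set(chunk_messages(latlong_list))
#   return func.HttpResponse("accepted", status_code=202)
# A separate queue-triggered function would then geocode each batch.
```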
References: https://learn.microsoft.com/en-us/answers/questions/539967/azure-function-app-503-service-unavailable-in-code.html , https://learn.microsoft.com/en-us/answers/questions/522216/503-service-unavailable-while-executing-an-azure-f.html , https://learn.microsoft.com/en-us/answers/questions/328952/azure-durable-functions-timeout-error-in-activity.html and https://learn.microsoft.com/en-us/answers/questions/250623/azure-function-not-running-successfully.html

Azure vm provisioning and user assigned identity?

I am trying to assign a user-assigned identity to the VM I am creating with the Python SDK. The code:
print("Creating VM " + resource_name)
compute_client.virtual_machines.begin_create_or_update(
    resource_group_name,
    resource_name,
    {
        "location": "eastus",
        "storage_profile": {
            "image_reference": {
                # Image ID can be retrieved from `az sig image-version show -g $RG -r $SIG -i $IMAGE_DEFINITION -e $VERSION --query id -o tsv`
                "id": "/subscriptions/..image id"
            }
        },
        "hardware_profile": {
            "vm_size": "Standard_F8s_v2"
        },
        "os_profile": {
            "computer_name": resource_name,
            "admin_username": "adminuser",
            "admin_password": "somepassword",
            "linux_configuration": {
                "disable_password_authentication": True,
                "ssh": {
                    "public_keys": [
                        {
                            "path": "/home/adminuser/.ssh/authorized_keys",
                            # Add the public key for a key pair that can SSH to the runners
                            "key_data": "ssh-rsa …"
                        }
                    ]
                }
            }
        },
        "network_profile": {
            "network_interfaces": [
                {
                    "id": nic_result.id
                }
            ]
        },
        "identity": {
            "type": "UserAssigned",
            "user_assigned_identities": {
                "identity_id": { myidentity }
            }
        }
    }
)
The last part, the identity block, I found somewhere on the web (not sure where), but it fails with a strange set/get error when I try to use it. The VM creates fine if I comment out the identity block, but I need the user-assigned identity. I spent the better part of today trying to find information on the options for begin_create_or_update and on the identity piece, with no luck. How can I apply a user-assigned identity to the VM I am creating with Python?
The Set and Get error occurs because you are declaring the identity block the wrong way.
If you have an existing user-assigned identity, you can use the identity block as below:
"identity": {
    "type": "UserAssigned",
    "user_assigned_identities": {
        '/subscriptions/948d4068-xxxxxxxxxxxxxxx/resourceGroups/ansumantest/providers/Microsoft.ManagedIdentity/userAssignedIdentities/mi-identity': {}
    }
}
As you can see, each entry inside user_assigned_identities is:
'User Assigned Identity ResourceID': {}
instead of
"identity_id": {'User Assigned Identity ResourceID'}
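Put together, the identity argument passed to begin_create_or_update would look like this sketch (the resource ID is a placeholder for your own identity's ID):

```python
# Placeholder resource ID for an existing user-assigned identity.
identity_resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.ManagedIdentity/userAssignedIdentities/<identity-name>"
)

# The key is the identity's resource ID; the value is an empty dict.
identity_block = {
    "type": "UserAssigned",
    "user_assigned_identities": {identity_resource_id: {}},
}
```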

How to access EventGrid trigger Event properties in Azure functions input bindings

I want to pull Cosmos documents into my Azure function based on the contents of the Event Grid events that it triggers on (Python worker runtime). Is this possible?
I have below function.json:
{
    "scriptFile": "__init__.py",
    "bindings": [
        {
            "type": "eventGridTrigger",
            "name": "event",
            "direction": "in"
        },
        {
            "name": "documents",
            "type": "cosmosDB",
            "databaseName": "Metadata",
            "collectionName": "DataElementMappings",
            "sqlQuery": "SELECT * from c where c.id = {subject}",
            "connectionStringSetting": "MyCosmosDBConnectionString",
            "direction": "in"
        }
    ]
}
I want to use properties of the event in the query for the cosmos input binding. I tried it with subject here. It fails with:
[6/4/2020 5:34:45 PM] System.Private.CoreLib: Exception while executing function: Functions.orch_taskchecker. System.Private.CoreLib: The given key 'subject' was not present in the dictionary.
So I am not using the right key name. The only key I got to work is {data} but unfortunately I was not able to access any properties inside the event data using the syntax {data.property} as it says in the docs. I would expect all of the event grid schema keys to be available to use in other bindings since an event is a JSON payload. I got none of them to work, e.g. eventTime, event_type, topic, etc.
Since I saw an example of the same idea for storage queues and they used for example {Queue.id} I tried things like {Event.subject}, {EventGridEvent.subject} all to no avail.
Nowhere in the docs can I find samples of event grid trigger data being used in other bindings. Can this be achieved?
For an EventGridTrigger (C# script), a custom data object of the EventGridEvent class can be used in input bindings.
The following example shows a blob input binding for a storage account event:
{
    "bindings": [
        {
            "type": "eventGridTrigger",
            "name": "eventGridEvent",
            "direction": "in"
        },
        {
            "type": "blob",
            "name": "body",
            "path": "{data.url}",
            "connection": "rk2018ebstg_STORAGE",
            "direction": "in"
        }
    ],
    "disabled": false
}
Note that only properties of the data object can be referenced in other bindings.
In Python, there is also the event.get_json() method that you can use to access the data field in code; I did not look further into the bindings.
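In the Python worker that looks roughly like this; the event payload below is a made-up example following the Event Grid event schema:

```python
import json

# Made-up Event Grid event, trimmed to the fields used below.
raw_event = json.dumps({
    "id": "1",
    "subject": "DataElementMappings/42",
    "eventType": "Mapping.Updated",
    "data": {"url": "https://example.invalid/container/blob"},
})

event = json.loads(raw_event)
# Binding expressions can only reference {data.xyz}, but in code
# every top-level field of the event is available:
doc_id = event["subject"].split("/")[-1]
blob_url = event["data"]["url"]
```

In a real function, the azure.functions.EventGridEvent object exposes event.subject, event.event_type and event.get_json() (the data payload) directly.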

Translating Elasticsearch request from Kibana into elasticsearch-dsl

Recently migrated from AWS Elasticsearch Service (used Elasticsearch 1.5.2) to Elastic Cloud (currently using Elasticsearch 5.1.2). Glad I did it, but with that change comes a newer version of Elasticsearch and newer API's. Struggling to get my head around the new way of requesting stuff. Formerly, I could more or less copy/paste from Kibana's "Elasticsearch Request Body", adjust a few things, run elasticsearch.Elasticsearch.search() and get what I expect.
Here's my Elasticsearch Request Body from Kibana (for brevity, removed some of the extraneous stuff that Kibana usually inserts):
{
    "size": 500,
    "sort": [
        {
            "Time.ISO8601": {
                "order": "desc",
                "unmapped_type": "boolean"
            }
        }
    ],
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "Message\\ ID: 2003",
                        "analyze_wildcard": true
                    }
                },
                {
                    "range": {
                        "Time.ISO8601": {
                            "gte": 1484355455678,
                            "lte": 1484359055678,
                            "format": "epoch_millis"
                        }
                    }
                }
            ],
            "must_not": []
        }
    },
    "stored_fields": [
        "*"
    ],
    "script_fields": {}
}
Now I want to use elasticsearch-dsl to do it, since that seems to be the recommended method (instead of using elasticsearch-py). How would I translate the above into elasticsearch-dsl?
Here's what I have so far:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

client = Elasticsearch(
    hosts=['HASH.REGION.aws.found.io/elasticsearch'],
    use_ssl=True,
    port=443,
    http_auth=('USER', 'PASS')
)
s = Search(using=client, index="emp*")
s = s.query("query_string", query="Message\ ID:2003", analyze_wildcard=True)
s = s.query("range", **{"Time.ISO8601": {"gte": 1484355455678, "lte": 1484359055678, "format": "epoch_millis"}})
s = s.sort("Time.ISO8601")
response = s.execute()
for hit in response:
    print '%s %s' % (hit['Time']['ISO8601'], hit['Message ID'])
My code as written above is not giving me what I expect: I'm getting results that include things that don't match "Message\ ID:2003", and it's also giving me hits outside the requested range of Time.ISO8601.
Totally new to elasticsearch-dsl and ES 5.1.2's way of doing things, so I know I've got lots to learn. What am I doing wrong? Thanks in advance for the help!
I don't have Elasticsearch running right now, but the query looks like what you wanted (you can always inspect the query produced by looking at s.to_dict()), with the exception of escaping the \ sign: it was escaped in the original query, yet in Python the result might differ due to different escaping rules.
I would strongly advise against having spaces in your field names, and also recommend a more structured query than query_string:
s = Search(using=client, index="emp*")
s = s.filter("term", message_id=2003)
s = s.query("range", Time__ISO8601={"gte": 1484355455678, "lte": 1484359055678, "format": "epoch_millis"})
s = s.sort("Time.ISO8601")
Note that I also changed query() to filter() for a slight speedup, and used __ instead of . in the field name keyword argument; elasticsearch-dsl will automatically expand that to a dot.
Hope this helps...
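For reference, the chained calls above should serialize to roughly the following body (hand-built here as a sketch; compare it against s.to_dict(), since the exact shape can vary between elasticsearch-dsl versions):

```python
# Approximate body produced by the filter/query/sort chain above.
expected_body = {
    "query": {
        "bool": {
            "filter": [{"term": {"message_id": 2003}}],
            "must": [{
                "range": {
                    "Time.ISO8601": {
                        "gte": 1484355455678,
                        "lte": 1484359055678,
                        "format": "epoch_millis",
                    }
                }
            }],
        }
    },
    "sort": ["Time.ISO8601"],
}
```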
