BigQueryInsertJobOperator with Export Configuration - python

I am trying to retrieve GA data from BigQuery using the operators provided in the Airflow documentation.
The documentation is not very explicit about the usage of BigQueryInsertJobOperator, which replaces BigQueryExecuteQueryOperator.
My DAG works as follows:
1. In a dataset, list the table names.
2. Using BigQueryInsertJobOperator, query all the tables using this syntax from the cookbook (a sketch of the build_query task that produces this string is shown after the steps):
`{my-project}.{my-dataset}.events_*` WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'
select_query_job = BigQueryInsertJobOperator(
    task_id="select_query_job",
    gcp_conn_id='big_query',
    configuration={
        "query": {
            "query": build_query.output,
            "useLegacySql": False,
            "allowLargeResults": True,
            "useQueryCache": True,
        }
    }
)
3. Retrieve the job id from the XCom and use BigQueryInsertJobOperator with extract in the configuration to get the query results, as in this API.
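A minimal, hypothetical sketch of what the build_query task in step 2 could look like (the task id, project, dataset and date values are assumed placeholders, not the actual code):

from airflow.operators.python import PythonOperator

def _build_query(start, end):
    # Assemble the wildcard query over the daily GA export tables.
    # The return value is pushed to XCom and read via build_query.output.
    return (
        "SELECT * FROM `my-project.my-dataset.events_*` "
        f"WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'"
    )

build_query = PythonOperator(
    task_id="build_query",
    python_callable=_build_query,
    op_kwargs={"start": "20220801", "end": "20220815"},
)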
However, I receive an error message and I am unable to access the data. All the steps before step 3 work perfectly; I can see it in the Cloud Console.
The operators I tried:
retrieve_job_data = BigQueryInsertJobOperator(
    task_id="get_job_data",
    gcp_conn_id='big_query',
    job_id=select_query_job.output,
    project_id=project_name,
    configuration={
        "extract": {
        }
    }
)

# Or

retrieve_job_data = BigQueryInsertJobOperator(
    task_id="get_job_data",
    gcp_conn_id='big_query',
    configuration={
        "extract": {
            "jobId": select_query_job.output,
            "projectId": project_name
        }
    }
)
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/{my-project}/jobs?prettyPrint=false: Required parameter is missing
[2022-08-16, 09:44:01 UTC] {taskinstance.py:1415} INFO - Marking task as FAILED. dag_id=BIG_QUERY, task_id=get_job_data, execution_date=20220816T054346, start_date=20220816T054358, end_date=20220816T054401
[2022-08-16, 09:44:01 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 628 for task get_job_data (400 POST https://bigquery.googleapis.com/bigquery/v2/projects/{my-project}/jobs?prettyPrint=false: Required parameter is missing; 100144)
Following the above link gives:
{
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
    "errors": [
      {
        "message": "Login Required.",
        "domain": "global",
        "reason": "required",
        "location": "Authorization",
        "locationType": "header"
      }
    ],
    "status": "UNAUTHENTICATED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "CREDENTIALS_MISSING",
        "domain": "googleapis.com",
        "metadata": {
          "service": "bigquery.googleapis.com",
          "method": "google.cloud.bigquery.v2.JobService.ListJobs"
        }
      }
    ]
  }
}
I see that the error is HTTP 401 and that I don't have access to GCP, which is not normal since my gcp_conn_id works in the other operators (and I am specifying the project id!).

For the ExtractJob type, you must pass a destinationUri (or destinationUris) and a sourceTable.
This explains the 400 "Required parameter is missing" error message.
Now that you have a job_id, you can pass a pre_execute hook to the operator's constructor to fetch the job.
The destinationTable field in the query job's configuration is what you need to configure the extract job's sourceTable. Even though you configured the query job with useQueryCache, BigQuery still stores the results in an anonymous table.
When retrieved, the configuration of the query job looks like this:
{
  /*...*/
  "configuration": {
    "query": {
      "query": "SELECT weight_pounds, state, year, gestation_weeks FROM [bigquery-public-data:samples.natality] ORDER BY weight_pounds DESC LIMIT 10;",
      "destinationTable": {
        "projectId": "redacted",
        "datasetId": "_redacted",
        "tableId": "anon0d85adcadde61fa17550f9841810e343fb5bc82d"
      },
      "writeDisposition": "WRITE_TRUNCATE",
      "priority": "INTERACTIVE",
      "useQueryCache": true,
      "useLegacySql": true
    },
    "jobType": "QUERY"
  },
  /*...*/
}
Putting it together, the pre_execute hook populates the extract job's sourceTable from the query job just before execution:

from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


def populate_extract_source_table(ctx):
    task = ctx['task']
    job_id = task.job_id  # the job id of the query job, passed in via select_query_job.output
    hook = BigQueryHook(
        gcp_conn_id=task.gcp_conn_id,
        delegate_to=task.delegate_to,
        impersonation_chain=task.impersonation_chain,
    )
    # Retrieve the query job
    job = hook.get_job(
        project_id=task.project_id,
        location=task.location,
        job_id=job_id,
    )
    # Set the sourceTable of the extract job to the (anonymous) destinationTable of the query job.
    jr = job.to_api_repr()
    task.configuration['extract']['sourceTable'] = jr["configuration"]["query"]["destinationTable"]


retrieve_job_data = BigQueryInsertJobOperator(
    task_id="get_job_data",
    gcp_conn_id='big_query',
    job_id=select_query_job.output,
    project_id=project_name,
    pre_execute=populate_extract_source_table,
    configuration={
        "extract": {
            "destinationUris": ["gs://your-bucket/some-path"]
        }
    }
)

Related

INVALID_ARGUMENT Error for Google Cloud Dataflow

I've got a Python pipeline which takes a file from Cloud Storage, removes some columns, then uploads the result to BigQuery. If I run it locally using the Dataflow runner, everything works as expected, but whenever I try to set it up with the Dataflow UI so I can schedule it etc., no job gets created and this INVALID_ARGUMENT error gets thrown in the logs (IDs etc. removed):
{
  "insertId": "...",
  "jsonPayload": {
    "url": "https://datapipelines.googleapis.com/v1/projects/.../locations/europe-west1/pipelines/cloudpipeline:run",
    "jobName": "projects/.../locations/europe-west1/jobs/...",
    "@type": "type.googleapis.com/google.cloud.scheduler.logging.AttemptFinished",
    "targetType": "HTTP",
    "status": "INVALID_ARGUMENT"
  },
  "httpRequest": {
    "status": 400
  },
  "resource": {
    "type": "cloud_scheduler_job",
    "labels": {
      "location": "europe-west1",
      "project_id": "...",
      "job_id": "..."
    }
  },
  "timestamp": "2022-11-01T10:45:00.657145402Z",
  "severity": "ERROR",
  "logName": "projects/.../logs/cloudscheduler.googleapis.com%2Fexecutions",
  "receiveTimestamp": "2022-11-01T10:45:00.657145402Z"
}
I can't find anything about this error, and GCP doesn't seem to provide any additional info.
I've tried to use GCP logging in the Python code, but the error seems to get thrown before any code is executed. I've also removed all of my optional parameters, so there shouldn't be anything I'm required to enter in the GCP set-up that isn't default.

SQS, Lambda & SES param undefined

I had this stack set up and working perfectly before; however, all of a sudden I am seeing a strange error in CloudWatch.
This is my function (Python) for posting a message to SQS (which triggers a Lambda function to send an email with SES):
def post_email(data, creatingUser=None):
    sqs = boto3.client("sqs", region_name=settings.AWS_REGION)
    # Send message to SQS queue
    response = sqs.send_message(
        QueueUrl=settings.QUEUE_EMAIL,
        DelaySeconds=10,
        MessageAttributes={
            "ToAddress": {"DataType": "String", "StringValue": data.get("ToAddress")},
            "Subject": {"DataType": "String", "StringValue": data.get("Subject")},
            "Source": {
                "DataType": "String",
                "StringValue": data.get("Source", "ANS <noreply@ansfire.net>"),
            },
        },
        MessageBody=(data.get("BodyText"))
        # When SQS pulls this message off, need to ensure that the email was
        # actually delivered, if so create a notification
    )
I print the params out and the above attributes are set correctly; however, when I look in CloudWatch this is the message:
2020-02-03T20:41:59.847Z f483293f-e48b-56e5-bb85-7f8d6341c0bf INFO {
  Destination: { ToAddresses: [ undefined ] },
  Message: {
    Body: { Text: [Object] },
    Subject: { Charset: 'UTF-8', Data: undefined }
  },
  Source: undefined
}
Any idea of what is going on?
I figured out the error: I needed to get the attributes from the data dictionary prior to calling the send_message function.
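In other words, the fix is to read each value out of the data dictionary up front and validate it before calling send_message, so the SQS message never carries empty attributes. A minimal sketch of the reworked function (the ValueError check and the Django-style settings import are assumptions, not the original code):

import boto3
from django.conf import settings  # hypothetical settings module, as used in the question


def post_email(data, creatingUser=None):
    sqs = boto3.client("sqs", region_name=settings.AWS_REGION)

    # Pull the values out of the data dict first and fail fast if any are missing.
    to_address = data.get("ToAddress")
    subject = data.get("Subject")
    source = data.get("Source", "ANS <noreply@ansfire.net>")
    body_text = data.get("BodyText")
    if not all([to_address, subject, body_text]):
        raise ValueError("post_email called without ToAddress, Subject or BodyText")

    return sqs.send_message(
        QueueUrl=settings.QUEUE_EMAIL,
        DelaySeconds=10,
        MessageAttributes={
            "ToAddress": {"DataType": "String", "StringValue": to_address},
            "Subject": {"DataType": "String", "StringValue": subject},
            "Source": {"DataType": "String", "StringValue": source},
        },
        MessageBody=body_text,
    )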

Problems creating new query DoubleClick Bid Manager - Python

dict = {
    "kind": "doubleclickbidmanager#query",
    "metadata": {
        "dataRange": "LAST_30_DAYS",
        "format": "CSV",
        "title": "test API"
    },
    "params": {
        "filters": [
            {
                "type": "FILTER_PARTNER",
                "value": "Nestle (GCC&Levant)_PM MENA (2410734)"
            }
        ],
        "metrics": [
            "METRIC_CLICKS",
            "METRIC_UNIQUE_REACH_CLICK_REACH",
            "METRIC_UNIQUE_REACH_IMPRESSION_REACH"
        ]
    }
}

r = requests.post('https://www.googleapis.com/doubleclickbidmanager/v1.1/query', data=dict)
This is the code I am trying to use to create a query for an offline report in Google Bid Manager.
It gives me the following error:
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Login Required",
        "locationType": "header",
        "location": "Authorization"
      }
    ],
    "code": 401,
    "message": "Login Required"
  }
}
I have tried different ways, even using the request-type call and putting the authorization keys in the API call, but it didn't work. Surely something is missing; can anyone confirm?
You can follow these Python examples for the login: https://github.com/googleads/googleads-bidmanager-examples/tree/master/python
But anyway, there is always something wrong after the login; I posted another question: HttpError 500 Backend Error and HttpError 403 using DoubleClick Bid Manager API in Python
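The 401 itself just means the requests.post call carries no OAuth credentials (and data= also sends the body form-encoded rather than as JSON). A minimal sketch of an authenticated call using a service-account key with the google-auth library (the key path, the query_body name, and the assumption that the service account has Bid Manager access are all illustrative):

from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json",  # hypothetical key file
    scopes=["https://www.googleapis.com/auth/doubleclickbidmanager"],
)
authed_session = AuthorizedSession(credentials)

# json= sends the body as JSON; the session adds the Authorization header.
r = authed_session.post(
    "https://www.googleapis.com/doubleclickbidmanager/v1.1/query",
    json=query_body,  # the query dict from the question above
)
print(r.status_code, r.json())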

How to create BigQuery Data Transfer Service using Python

I tried creating a Data Transfer Service job using bigquery_datatransfer. I installed the following Python library:
pip install --upgrade google-cloud-bigquery-datatransfer
and used the method
create_transfer_config(parent, transfer_config)
I have defined the transfer_config values for the data_source_id amazon_s3:
transfer_config = {
    "destination_dataset_id": "My Dataset",
    "display_name": "test_bqdts",
    "data_source_id": "amazon_s3",
    "params": {
        "destination_table_name_template": "destination_table_name",
        "data_path": <data_path>,
        "access_key_id": args.access_key_id,
        "secret_access_key": args.secret_access_key,
        "file_format": <>
    },
    "schedule": "every 10 minutes"
}
But while running the script I'm getting the following error:
ValueError: Protocol message Struct has no "destination_table_name_template" field.
The fields given inside params are not recognized, and I couldn't find which fields are supposed to be defined inside the "params" struct.
What fields need to be defined inside "params" of transfer_config to create the Data Transfer job successfully?
As you can see in the documentation, you should try putting your config dict through the google.protobuf.json_format.ParseDict() function:
transfer_config = google.protobuf.json_format.ParseDict(
    {
        "destination_dataset_id": dataset_id,
        "display_name": "Your Scheduled Query Name",
        "data_source_id": "scheduled_query",
        "params": {
            "query": query_string,
            "destination_table_name_template": "your_table_{run_date}",
            "write_disposition": "WRITE_TRUNCATE",
            "partitioning_field": "",
        },
        "schedule": "every 24 hours",
    },
    bigquery_datatransfer_v1.types.TransferConfig(),
)
Please let me know if it helps you
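For completeness, a sketch of how the parsed config might then be submitted, assuming the pre-2.0 client style the question itself uses with a positional create_transfer_config(parent, transfer_config) call (the project id is a placeholder):

from google.cloud import bigquery_datatransfer_v1

client = bigquery_datatransfer_v1.DataTransferServiceClient()
parent = "projects/your-project-id"  # placeholder project

# transfer_config is the TransferConfig built with ParseDict above,
# e.g. with data_source_id "amazon_s3" and the S3 params from the question.
response = client.create_transfer_config(parent, transfer_config)
print("Created transfer config:", response.name)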

Login failed during python Insert job with BigQuery API

I am trying to load a local file to BigQuery by setting up server-to-server auth.
I've done the following steps:
1. Created a service account
2. Created a JSON key file for this account
3. Activated the service account with the gcloud auth activate-service-account command
4. Logged in with gcloud auth login
5. Tried to execute a Python script to upload the file to BigQuery:
scopes = ['https://www.googleapis.com/auth/bigquery',
          'https://www.googleapis.com/auth/bigquery.insertdata']

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    '/path/privatekey.json', scopes)

# Construct the service object for interacting with the BigQuery API.
service = build('bigquery', 'v2', credentials=credentials)

# Load configuration with the destination specified.
load_config = {
    'destinationTable': {
        'projectId': "project id",
        'datasetId': "data set id",
        'tableId': "table name"
    }
}

# Setup the job here.
# load[property] = value
load_config['schema'] = {
    'fields': [
        <several field>
    ]
}

upload = MediaFileUpload('/path/to/csv/file',
                         mimetype='application/octet-stream',
                         # This enables resumable uploads.
                         resumable=True)

# End of job configuration.
run_load.start_and_wait(service.jobs(),
                        "my project id",
                        load_config,
                        media_body=upload)
The result is:
"error": {
"errors": [
{
"domain": "global",
"reason": "required",
"message": "Login Required",
"locationType": "header",
"location": "Authorization"
}
],
"code": 401,
"message": "Login Required"
}
}
But I have enough rights to create query jobs:
query_request = service.jobs()
query_data = {
    'query': (
        'SELECT COUNT(*) FROM [dmrebg.testDay];')
}

query_response = query_request.query(
    projectId=project_id,
    body=query_data).execute()

print('Query Results:')
for row in query_response['rows']:
    print('\t'.join(field['v'] for field in row['f']))
What did I miss? I thought I was already logged in.
The problem is that any call from the browser to https://www.googleapis.com/bigquery/v2/projects/project_id/jobs/* causes the same problem:
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Login Required",
        "locationType": "header",
        "location": "Authorization"
      }
    ],
    "code": 401,
    "message": "Login Required"
  }
}
So it is a problem with my browser auth; the Python auth is fine.
And the root cause is that my CSV schema and data do not match:
Errors:
Too many errors encountered. (error code: invalid)
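A short sketch of how those row-level errors can be surfaced with the same googleapiclient service object (the job_id variable is an assumption here; the run_load helper above is the asker's own and is not shown):

# Fetch the finished load job and print its error details; with a schema/data
# mismatch, status.errors lists the individual problems behind
# "Too many errors encountered.".
job = service.jobs().get(projectId="my project id", jobId=job_id).execute()
status = job.get("status", {})

if "errorResult" in status:
    print("Job failed:", status["errorResult"].get("message"))
for err in status.get("errors", []):
    print("-", err.get("message"))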
