Pass Dynamic Parameters to AWS Glue

Pass Dynamic Parameters to AWS Glue - python

I am trying to pass dynamic parameters to a glue job. I followed this question: AWS Glue Job Input Parameters
And configured my parameters like so:
I'm triggering the glue job with boto3 with the following code:
event = {
'--ncoa': "True",
'--files': 'file.csv',
'--group_file': '3e93475d45b4ebecc9a09533ce57b1e7.csv',
'--client_slug': 'test',
'--slm_id': '12345'
}
glueClient.start_job_run(JobName='TriggerNCOA', Arguments=event)
and when I run this glue code:
args = getResolvedOptions(sys.argv, ['NCOA','Files','GroupFile','ClientSlug', 'SLMID'])
v_list=[{"ncoa":args['NCOA'],"files":args['Files'],"group_file":args['GroupFile'], "client_slug":args['ClientSlug'], "slm_id":args['SLMID']}]
print(v_list)
It just gives me 'a' for every value, not the values of the original event that I passed in from boto3. how do I fix that? Seems like im missing something very slight, but ive looked around and haven't found anything conclusive.

You are using CamelCase and Capital letters into Glue Job Parameters, but you are using small letters in python code to override the Parameters.
Ex.
The key of the job parameter in Glue is --ClientSlug but the key for Argument set in python code is --client_slug

Related

Airflow XCOM pull and push for a BigQueryInsertJobOperator and BigQueryOperator

I am very new to airflow and I am trying to create a DAG based on the below requirement.
Task 1 - Run a Bigquery query to get a value which I need to push to 2nd task in the dag
Task 2 - Use the value from the above query and run another query and export the data into google cloud bucket.
I have read other answers related to this and I understand we cannot use xcom_pull or xcom_push in bigqueryoperator in airflow. So what I am doing is using a python operator where I can use jinja template variables by using "provide_context=True".
Below is the snipped of my code. Just the task 1 where I want to do "task_instance.xcom_push" in order to see the value in airflow under logs xcom.
def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
bq_operator = BigQueryInsertJobOperator(
task_id=task_id,
configuration=configuration,
gcp_conn_id=gcp_connection_id,
dag=dag,
params=table_params,
trigger_rule=trigger_rule,
**task_instance.xcom_push(key='yr_wk', value=yr_wk),**
)
return bq_operator
def get_bq_wm_yr_wk():
get_bq_operator(dag,app_name,bigquery_util.get_bq_job_configuration(
bq_query,
query_params=None))
get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
python_callable=get_bq_wm_yr_wk,
provide_context=True,
on_failure_callback=failure_callback,
on_retry_callback=failure_callback,
dag=dag)
"bq_query" is the one I am passing the sql file which has my query and the query returns the value of yr_wk which I need to use in my 2nd task.
The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk), in get_bq_operator is failing and the errror i am getting is as below
raise KeyError(f'Variable {key} does not exist')
KeyError: 'Variable ei_migration_hour does not exist'
If I comment the line above , the DAG runs fine. However, how do I validate the value of yr_wk?? I want to push it so that I can view the value in logs.

I do not fully understand your code :), but if you want to do something with results of BigQuery query, then by far better way to approach it is to use BigQueryHook in your python callable.
Operators in Airflow are usually thin wrappers around Hooks that really provide a "complete" taks (for example you can use it run an update operation) but if you want to do something with the result of it and you already do it via Python Operator, it is far better to use Hooks directly as you do not make all the assumptions that operators have in execute method.
In your case it should be something like (and I am using here the new TaskFlow syntax which is preferred to do this kind of operations. See https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the tutorial on Task Flow API. Aspecially in Airflow 2 it became the de-facto default way of writing tasks.
#task(.....)
def my_task():
hook = BigQueryHook(....) # initialize it with the right parameters
result = hook.run(sql='YOUR_QUERY', ...) # add other necessary params
processed_result = process_result(result) # do something with the result
return processed_result
This way you do not evey have to run xcom_push (task_flow API will do it for you automatically and other tasks will be able to use by just doing :
#task
next_task(input):
pass
And then:
result = my_task()
next_task(result)
Then all the xcom push/pull will be handled for you automatically via TaskFlow.

Can't Schedule Query in BigQuery Via Python SDK

I'll preface this by saying I'm fairly new to BigQuery. I'm running into an issue when trying to schedule a query using the Python SDK. I used the example on the documentation page and modified it a bit but I'm running into errors.
Note that my query does use scripting to set some variables, and it's using a MERGE statement to update one of my tables. I'm not sure if that makes a huge difference.
def create_scheduled_query(dataset_id, project, name, schedule, service_account, query):
parent = transfer_client.common_project_path(project)
transfer_config = bigquery_datatransfer.TransferConfig(
destination_dataset_id=dataset_id,
display_name=name,
data_source_id="scheduled_query",
params={
"query": query
},
schedule=schedule,
)
transfer_config = transfer_client.create_transfer_config(
bigquery_datatransfer.CreateTransferConfigRequest(
parent=parent,
transfer_config=transfer_config,
service_account_name=service_account,
)
)
print("Created scheduled query '{}'".format(transfer_config.name))
I was able to successfully create a query with the function above. However the query errors out with the following message:
Error code 9 : Dataset specified in the query ('') is not consistent with Destination dataset '{my_dataset_name}'.
I've tried changing passing in "" as the dataset_id parameter, but I get the following error from the Python SDK:
google.api_core.exceptions.InvalidArgument: 400 Cannot create a transfer with parent projects/{my_project_name} without location info when destination dataset is not specified.
Interestingly enough I was able to successfully create this scheduled query in the GUI; the same query executed without issue.
I saw that the GUI showed the scheduled query's "Resource name" referenced a transferConfig, so I used the following command to see what that transferConfig looked like, to see if I could apply the same parameters using my Python script:
bq show --format=prettyjson --transfer_config {my_transfer_config}
Which gave me the following output:
{
"dataSourceId": "scheduled_query",
"datasetRegion": "us",
"destinationDatasetId": "",
"displayName": "test_scheduled_query",
"emailPreferences": {},
"name": "{REDACTED_TRANSFER_CONFIG_ID}",
"nextRunTime": "2021-06-18T00:35:00Z",
"params": {
"query": ....
So it looks like the GUI was able to use "" for destinationDataSetId but for whatever reason the Python SDK won't let me use that value.
Any help would be appreciated, since I prefer to avoid the GUI whenever possible.
UPDATE:
This does appear to be related to the scripting I used in my query. I removed the scripts from the query and it's working. I'm going to leave this open because I feel like this should be possible using the SDK since the query with scripting works in the console without issue.

This same thing also threw me through a loop but I managed to figure out what was wrong. The problem is with the
parent = transfer_client.common_project_path(project)
line that is given in the example query. By default, this returns something of the form projects/{project_id}. However, the CreateTransferConfigRequest documentation says of the parent parameter:
The BigQuery project id where the transfer configuration should be created. Must be in the format projects/{project_id}/locations/{location_id} or projects/{project_id}. If specified location and location of the destination bigquery dataset do not match - the request will fail.
Sure enough, if you use the projects/{project_id}/locations/{location_id} format instead, it resolves the error and allows you to pass a null destination_dataset_id.

I had the exact same issue. the fix for the issue is as below.
The below method returns Projects/{projectid}
parent = transfer_client.common_project_path(project_id)
instead use the below method , which returns projects/{project}/locations/{location}
parent = transfer_client.common_location_path(project_id , "EU")
I had tried with the above change , i am able to schedule a script in BQ.

How do you get the run parameters and runId within Databricks notebook?

When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from documentation how you actually fetch them. I'd like to be able to get all the parameters as well as job id and run id.

Job/run parameters
When the notebook is run as a job, then any job parameters can be fetched as a dictionary using the dbutils package that Databricks automatically provides and imports. Here's the code:
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
If the job parameters were {"foo": "bar"}, then the result of the code above gives you the dict {'foo': 'bar'}. Note that Databricks only allows job parameter mappings of str to str, so keys and values will always be strings.
Note that if the notebook is run interactively (not as a job), then the dict will be empty. The getCurrentBinding() method also appears to work for getting any active widget values for the notebook (when run interactively).
Getting the jobId and runId
To get the jobId and runId you can get a context json from dbutils that contains that information. (Adapted from databricks forum):
import json
context_str = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
context = json.loads(context_str)
run_id_obj = context.get('currentRunId', {})
run_id = run_id_obj.get('id', None) if run_id_obj else None
job_id = context.get('tags', {}).get('jobId', None)
So within the context object, the path of keys for runId is currentRunId > id and the path of keys to jobId is tags > jobId.

Nowadays you can easily get the parameters from a job through the widget API. This is pretty well described in the official documentation from Databricks. Below, I'll elaborate on the steps you have to take to get there, it is fairly easy.
Create or use an existing notebook that has to accept some parameters. We want to know the job_id and run_id, and let's also add two user-defined parameters environment and animal.
# Get parameters from job
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")
environment = dbutils.widgets.get("environment")
animal = dbutils.widgets.get("animal")
print(job_id)
print(run_id)
print(environment)
print(animal)
Now let's go to Workflows > Jobs to create a parameterised job. Make sure you select the correct notebook and specify the parameters for the job at the bottom. According to the documentation, we need to use curly brackets for the parameter values of job_id and run_id. For the other parameters, we can pick a value ourselves.
Note: The reason why you are not allowed to get the job_id and run_id directly from the notebook, is because of security reasons (as you can see from the stack trace when you try to access the attributes of the context). Within a notebook you are in a different context, those parameters live at a "higher" context.
Run the job and observe that it outputs something like:
dev
squirrel
137355915119346
7492
Command took 0.09 seconds
You can even set default parameters in the notebook itself, that will be used if you run the notebook or if the notebook is triggered from a job without parameters. This makes testing easier, and allows you to default certain values.
# Adding widgets to a notebook
dbutils.widgets.text("environment", "tst")
dbutils.widgets.text("animal", "turtle")
# Removing widgets from a notebook
dbutils.widgets.remove("environment")
dbutils.widgets.remove("animal")
# Or removing all widgets from a notebook
dbutils.widgets.removeAll()
And last but not least, I tested this on different cluster types, so far I found no limitations. My current settings are:
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.repl.allowedLanguages python,sql

AWS Glue: Failed to start job run due to missing metadata

In order to run a job using boto3, the documentation states only JobName is required. However, my code:
def start_job_run(self, name):
print("The name of the job to be run via client is: {}".format(name))
self.response_de_start_job = self.client.start_job_run(
JobName=name
)
print(self.response_de_start_job)
and the client is:
self.client = boto3.client(
'glue',
region_name='ap-south-1',
aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
)
when executed via Python3, gives out an error:
botocore.errorfactory.EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the StartJobRun operation: Failed to start job run due to missing metadata
but when I do the same operation on the same job from UI and from the cli(aws glue start-job-run --job-name march15_9), it works all alright.

In my experience the error often means Can not find the job.
As soon as jobs are bound to regions, a combination of name and region uniquely identifies a job, and errors in any of these (including trivial mistypes) will lead to the error you experience(d).
As an example, a job I am using is in the us-east-1, thus the following statement executes successfully.
glue_client = boto3.client('glue', region_name='us-east-1')
response = glue_client.start_job_run(
JobName = glue_job_name)
However, the snippet below will produce the same error as you have
glue_client = boto3.client('glue', region_name='us-west-1')
response = glue_client.start_job_run(
JobName = glue_job_name)
botocore.errorfactory.EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the StartJobRun operation: Failed to start job run due to missing metadata
In the case above, it's relatively easy to check, by specifying running cli with --region parameter
It would be something like:
aws glue start-job-run --job-name march15_9 --region ap-south-1
If this runs successfully(thus the region is indeed ap-south-1), I'd explicitly set parameters in the code to remove unknown factors, and instead of passing them through environment variables you can temporary put string values in the code.
Once the code works with hardcoded values you may remove them one by one, thus finding one (or a few) that need to passed correctly.
All the best
P.S. Indeed, the documentation is correct, only JobName needs to be set as a parameter, I have code that works this way

I too faced the same error, the problem is passing ARN of glue job as JobName. Resolved by passing only Name of the glue job.
response = client.start_job_run(
JobName='Glue Job Name not ARN'
)

Check if the name of your glue job is correctly written. I had a similar case and I fixed it that way. (For example: Job_ 01 instead of Job_01)

What glue error log indicating?
You may be using some parameters in glue job which you are not passing while calling job

Error when changing instance type in a python for loop

I have a Python 2 script which uses boto3 library.
Basically, I have a list of instance ids and I need to iterate over it changing the type of each instance from c4.xlarge to t2.micro.
In order to accomplish that task, I'm calling the modify_instance_attribute method.
I don't know why, but my script hangs with the following error message:
EBS-optimized instances are not supported for your requested configuration.
Here is my general scenario:
Say I have a piece of code like this one below:
def change_instance_type(instance_id):
client = boto3.client('ec2')
response = client.modify_instance_attribute(
InstanceId=instance_id,
InstanceType={
'Value': 't2.micro'
}
)
So, If I execute it like this:
change_instance_type('id-929102')
everything works with no problem at all.
However, strange enough, if I execute it in a for loop like the following
instances_list = ['id-929102']
for instance_id in instances_list:
change_instance_type(instance_id)
I get the error message above (i.e., EBS-optimized instances are not supported for your requested configuration) and my script dies.
Any idea why this happens?

When I look at EBS optimized instances I don't see that T2 micros are supported:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
I think you would need to add EbsOptimized=false as well.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.