AWS Glue error - Invalid input provided while running Python shell program

I have a Glue job, a Python shell script. When I try to run it I end up getting the below error.
Job Name : xxxxx Job Run Id : yyyyyy failed to execute with exception Internal service error : Invalid input provided
It is not specific to the code; even if I just put
import boto3
print('loaded')
I am getting the error right after clicking the run job option. What is the issue here?

It happened to me too, but the same job works in a different account.
The AWS documentation is not really explanatory about this error:
The input provided was not valid.
I suspect this is an Amazon issue, as @Quartermass mentioned.

Same issue here in eu-west-2 yesterday, working now. This was only happening with Python shell jobs, not PySpark ones, and job runs weren't getting as far as outputting any log streams. I can only assume it was an AWS issue they've since fixed without issuing a service announcement.

I think Quartermass is right; the jobs started working out of the blue the next day without any changes.

I too received this super helpful error message.
What worked for me was explicitly setting properties like worker type, number of workers, Glue version and Python version.
In Terraform code:
resource "aws_glue_job" "my_job" {
name = "my_job"
role_arn = aws_iam_role.glue.arn
worker_type = "Standard"
number_of_workers = 2
glue_version = "4.0"
command {
script_location = "s3://my-bucket/my-script.py"
python_version = "3"
}
default_arguments = {
"--enable-job-insights" = "true",
"--additional-python-modules" : "boto3==1.26.52,pandas==1.5.2,SQLAlchemy==1.4.46,requests==2.28.2",
}
}
Update
After doing some more digging, I realised that what I needed was a Python shell script Glue job, not an ETL (Spark) job. By choosing this flavour of job, setting the Python version to 3.9 and "ticking the box" for Glue's pre-installed analytics libraries, my script, incidentally, had access to all the libraries I needed.
My Terraform code ended up looking like this:
resource "aws_glue_job" "my_job" {
name = "my-job"
role_arn = aws_iam_role.glue.arn
glue_version = "1.0"
max_capacity = 1
connections = [
aws_glue_connection.redshift.name
]
command {
name = "pythonshell"
script_location = "s3://my-bucket/my-script.py"
python_version = "3.9"
}
default_arguments = {
"--enable-job-insights" = "true",
"--library-set" : "analytics",
}
}
Note that I have switched to using Glue version 1.0. I arrived at this after some trial and error, and could not find this explicitly stated as the compatible version for pythonshell jobs… but it works!

Well, in my case, I get this error from time to time without any clear reason. The only thing that seems to cause the issue is modifying some job parameter and saving the modifications. As soon as I save and try to execute the job, I usually get this error, and the only way to solve it is destroying the job and then re-creating it. Has anybody solved this issue by other means? As I saw in the accepted answer, the job simply began to work again without any manual action, suggesting the problem was a bug on the AWS side that was corrected.
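For what it's worth, the destroy-and-recreate workaround can be scripted. A hypothetical boto3 sketch (the job name and region are placeholders; get_job, delete_job and create_job are the actual Glue client calls):
import boto3

glue = boto3.client('glue', region_name='eu-west-2')  # region is an assumption

# Capture the current definition, then recreate the job from it.
job = glue.get_job(JobName='my_job')['Job']
glue.delete_job(JobName='my_job')
glue.create_job(
    Name=job['Name'],
    Role=job['Role'],
    Command=job['Command'],
    DefaultArguments=job.get('DefaultArguments', {}),
)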

I was facing a similar issue. I was invoking my job from a workflow. I could solve it by adding WorkerType, GlueVersion, NumberOfWorkers to the job before adding the job to the workflow. I could see it consistently fail before and succeed after this addition.
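A minimal boto3 sketch of that fix (the job name, role ARN and script location are hypothetical; adjust the worker settings to your workload):
import boto3

glue = boto3.client('glue')

# Explicitly set WorkerType, GlueVersion and NumberOfWorkers on the job
# before wiring it into the workflow.
glue.update_job(
    JobName='my_job',
    JobUpdate={
        'Role': 'arn:aws:iam::123456789012:role/glue-role',  # placeholder
        'Command': {
            'Name': 'glueetl',
            'ScriptLocation': 's3://my-bucket/my-script.py',  # placeholder
        },
        'GlueVersion': '4.0',
        'WorkerType': 'G.1X',
        'NumberOfWorkers': 2,
    },
)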

Related

Can't Schedule Query in BigQuery Via Python SDK

I'll preface this by saying I'm fairly new to BigQuery. I'm running into an issue when trying to schedule a query using the Python SDK. I used the example on the documentation page and modified it a bit but I'm running into errors.
Note that my query does use scripting to set some variables, and it's using a MERGE statement to update one of my tables. I'm not sure if that makes a huge difference.
# Client setup as in the documentation example:
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

def create_scheduled_query(dataset_id, project, name, schedule, service_account, query):
    parent = transfer_client.common_project_path(project)

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_id,
        display_name=name,
        data_source_id="scheduled_query",
        params={
            "query": query
        },
        schedule=schedule,
    )

    transfer_config = transfer_client.create_transfer_config(
        bigquery_datatransfer.CreateTransferConfigRequest(
            parent=parent,
            transfer_config=transfer_config,
            service_account_name=service_account,
        )
    )

    print("Created scheduled query '{}'".format(transfer_config.name))
I was able to successfully create a query with the function above. However the query errors out with the following message:
Error code 9 : Dataset specified in the query ('') is not consistent with Destination dataset '{my_dataset_name}'.
I've tried passing in "" as the dataset_id parameter, but then I get the following error from the Python SDK:
google.api_core.exceptions.InvalidArgument: 400 Cannot create a transfer with parent projects/{my_project_name} without location info when destination dataset is not specified.
Interestingly enough I was able to successfully create this scheduled query in the GUI; the same query executed without issue.
I saw that the GUI showed the scheduled query's "Resource name" referenced a transferConfig, so I used the following command to see what that transferConfig looked like, to see if I could apply the same parameters using my Python script:
bq show --format=prettyjson --transfer_config {my_transfer_config}
Which gave me the following output:
{
  "dataSourceId": "scheduled_query",
  "datasetRegion": "us",
  "destinationDatasetId": "",
  "displayName": "test_scheduled_query",
  "emailPreferences": {},
  "name": "{REDACTED_TRANSFER_CONFIG_ID}",
  "nextRunTime": "2021-06-18T00:35:00Z",
  "params": {
    "query": ....
So it looks like the GUI was able to use "" for destinationDatasetId, but for whatever reason the Python SDK won't let me use that value.
Any help would be appreciated, since I prefer to avoid the GUI whenever possible.
UPDATE:
This does appear to be related to the scripting I used in my query. I removed the scripts from the query and it's working. I'm going to leave this open because I feel like this should be possible using the SDK since the query with scripting works in the console without issue.
This same thing also threw me for a loop, but I managed to figure out what was wrong. The problem is with the
parent = transfer_client.common_project_path(project)
line given in the example. By default, this returns something of the form projects/{project_id}. However, the CreateTransferConfigRequest documentation says of the parent parameter:
The BigQuery project id where the transfer configuration should be created. Must be in the format projects/{project_id}/locations/{location_id} or projects/{project_id}. If specified location and location of the destination bigquery dataset do not match - the request will fail.
Sure enough, if you use the projects/{project_id}/locations/{location_id} format instead, it resolves the error and allows you to pass a null destination_dataset_id.
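A minimal sketch of that fix (the "us" location is an assumption; it must match your dataset's region):
# Build the parent with an explicit location, per the documented format:
parent = f"projects/{project}/locations/us"
# or, equivalently, with the client helper:
parent = transfer_client.common_location_path(project, "us")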
I had the exact same issue. The fix is as follows.
The method below returns projects/{project_id}:
parent = transfer_client.common_project_path(project_id)
Instead, use the method below, which returns projects/{project}/locations/{location}:
parent = transfer_client.common_location_path(project_id, "EU")
With the above change, I was able to schedule a script in BQ.

Pyinstaller with APScheduler - TypeError with IntervalTrigger

I had the same problem as here (see link below); briefly: unable to create an .exe of a Python script that uses APScheduler.
Pyinstaller 3.3.1 & 3.4.0-dev build with apscheduler
So I did as suggested:
from apscheduler.triggers import interval

scheduler.add_job(Run, 'interval', interval.IntervalTrigger(minutes=time_int),
                  args=(input_file, output_dir, time_int),
                  id=theID, replace_existing=True)
And indeed importing interval.IntervalTrigger and passing it as an argument to add_job solved this particular error.
However, now I am encountering:
TypeError: add_job() got multiple values for argument 'args'
I tested it and I can ascertain it is occurring because of the way the trigger is passed now. I also tried defining trigger = interval.IntervalTrigger(minutes=time_int) separately and then just passing trigger, and the same thing happens.
If I ignore the error with try/except, I see that it does not add the job to the sql database at all (I am using SQLAlchemy as a jobstore). Initially I thought it is because I am adding several jobs in a for loop, but it happens with a single job add as well.
Does anyone know of some other workaround for the initial problem, or have any idea why this error might occur? I can't find anything online either :(
Things always work better in the morning.
For anyone else who encounters this: you don't need both 'interval' and interval.IntervalTrigger() as arguments; that duplication is where the error comes from. add_job's second positional parameter is already the trigger, so passing both means the IntervalTrigger instance lands in the next positional parameter, args, hence the "multiple values for argument 'args'" error. The code should be:
scheduler.add_job(Run, interval.IntervalTrigger(minutes=time_int),
                  args=(input_file, output_dir, time_int),
                  id=theID, replace_existing=True)

bingads V13 report request fails in Python SDK

I'm trying to download a Bing Ads report using the Python SDK, but I keep getting an error that says "Type not found: 'Aggregation'" after submitting a report request. I've tried all 4 options mentioned in the following link:
https://github.com/BingAds/BingAds-Python-SDK/blob/master/examples/v13/report_requests.py
The authentication process prior to the request works just fine.
I execute the following:
report_request = get_report_request(authorization_data.account_id)

reporting_download_parameters = ReportingDownloadParameters(
    report_request=report_request,
    result_file_directory=FILE_DIRECTORY,
    result_file_name=RESULT_FILE_NAME,
    overwrite_result_file=True,  # Set this value to True to overwrite the same file.
    timeout_in_milliseconds=TIMEOUT_IN_MILLISECONDS
)
output_status_message("-----\nAwaiting download_report...")
download_report(reporting_download_parameters)
After careful debugging, it seems that the program fails when trying to execute a command within "reporting_service_manager.py". Here is the workflow:
def download_report(self, download_parameters):
    report_file_path = self.download_file(download_parameters)
then:
def download_file(self, download_parameters):
    operation = self.submit_download(download_parameters.report_request)
then:
def submit_download(self, report_request):
    self.normalize_request(report_request)
    response = self.service_client.SubmitGenerateReport(report_request)
SubmitGenerateReport starts a sequence of events ending with a call to the "_ServiceCall.init" function within "service_client.py", returning the exception "Type not found: 'Aggregation'":
try:
    response = self.service_client.soap_client.service.__getattr__(self.name)(*args, **kwargs)
    return response
except Exception as ex:
    if need_to_refresh_token is False \
            and self.service_client.refresh_oauth_tokens_automatically \
            and self.service_client._is_expired_token_exception(ex):
        need_to_refresh_token = True
    else:
        raise ex
Can anyone shed some light?
Thanks
Please be sure to set Aggregation, e.g. as shown here:
aggregation = 'Daily'
If the report type does not use aggregation, you can set Aggregation=None.
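In the SDK's example code the aggregation is set directly on the report request object, along these lines (the report type here is just an illustration; treat this as a sketch):
report_request = reporting_service.factory.create('CampaignPerformanceReportRequest')
report_request.Aggregation = 'Daily'  # or None for report types without aggregation
report_request.Format = 'Csv'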
Does this help?
This may be a bit late, 2 months after the fact, but maybe this will help someone else. I had the same error (though I suppose it may not be the same issue). It looks like you did what I did (and I'm sure others will as well): copy-pasted the Microsoft example code and tried to run it, only to find that it didn't work.
I spent quite some time trying to debug the issue and it looked to me like the XML wasn't being searched correctly. I was using suds-py3 for the script at the time so I tried suds-community and everything just worked after that.
I also re-read the Bing Ads API getting-started walkthrough and found that they recommend suds-jurko instead.
Long story short: if you want to use the bingads API, don't use suds-py3; use either suds-community (which I can confirm works for everything I've used the API for) or suds-jurko (which is the one recommended by Microsoft).

Can someone suggest an alternative to HdfsSensor for Airflow with Python 3?

I am trying to listen for changes in HDFS to trigger my ETL pipeline in Airflow using HdfsSensor with Python 3. I am getting the following error, as snakebite is not supported on Python 3:
ImportError: This HDFSHook implementation requires snakebite, but snakebite is not compatible with Python 3
Thanks to the suggestion by @AyushGoyal, I solved the same problem using WebHDFSSensor. This sensor works like HdfsSensor and you can just swap the class names. Just remember to make sure:
you pass the connection id via the webhdfs_conn_id parameter (in HdfsSensor the parameter name was hdfs_conn_id)
the port with which you should try to connect to the name node is 50070 (not 8020)
The rest is the same!
example:
from airflow.sensors.web_hdfs_sensor import WebHdfsSensor

file_sensor = WebHdfsSensor(
    task_id='check_if_data_is_ready',
    filepath="some_file_path",
    webhdfs_conn_id='hdfs_conn_id',
    poke_interval=10,
    timeout=5,
    dag=dag,
    env={
        'JAVA_HOME': '/usr/java/latest'
    }
)

AWS Glue: Failed to start job run due to missing metadata

In order to run a job using boto3, the documentation states only JobName is required. However, my code:
def start_job_run(self, name):
    print("The name of the job to be run via client is: {}".format(name))
    self.response_de_start_job = self.client.start_job_run(
        JobName=name
    )
    print(self.response_de_start_job)
and the client is:
self.client = boto3.client(
    'glue',
    region_name='ap-south-1',
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
)
when executed via Python 3, throws this error:
botocore.errorfactory.EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the StartJobRun operation: Failed to start job run due to missing metadata
but when I do the same operation on the same job from the UI and from the CLI (aws glue start-job-run --job-name march15_9), it works just fine.
In my experience the error often means "cannot find the job".
Since jobs are bound to regions, the combination of name and region uniquely identifies a job, and an error in either (including a trivial mistype) will lead to the error you experienced.
As an example, a job I am using is in us-east-1, so the following statement executes successfully:
glue_client = boto3.client('glue', region_name='us-east-1')
response = glue_client.start_job_run(JobName=glue_job_name)
However, the snippet below will produce the same error you have:
glue_client = boto3.client('glue', region_name='us-west-1')
response = glue_client.start_job_run(JobName=glue_job_name)
botocore.errorfactory.EntityNotFoundException: An error occurred (EntityNotFoundException) when calling the StartJobRun operation: Failed to start job run due to missing metadata
In the case above it's relatively easy to check by running the CLI with the --region parameter. It would be something like:
aws glue start-job-run --job-name march15_9 --region ap-south-1
If this runs successfully (thus the region is indeed ap-south-1), I'd explicitly set parameters in the code to remove unknown factors; instead of passing them through environment variables, you can temporarily put string values in the code.
Once the code works with hardcoded values, you may remove them one by one, finding the one (or few) that need to be passed correctly.
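Something like this hypothetical debugging version (the literals are placeholders; never commit real credentials):
glue_client = boto3.client(
    'glue',
    region_name='ap-south-1',                # hardcoded instead of inferred
    aws_access_key_id='AKIA...placeholder',  # temporary literal for debugging only
    aws_secret_access_key='placeholder',
)
response = glue_client.start_job_run(JobName='march15_9')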
All the best
P.S. Indeed, the documentation is correct: only JobName needs to be set as a parameter. I have code that works this way.
I too faced the same error; the problem was passing the ARN of the Glue job as JobName. It was resolved by passing only the name of the Glue job:
response = client.start_job_run(
    JobName='Glue Job Name not ARN'
)
Check whether the name of your Glue job is correctly written. I had a similar case and fixed it that way. (For example: Job_ 01 instead of Job_01.)
What is the Glue error log indicating?
You may be using some parameters in the Glue job which you are not passing while calling the job.
