My apache beam pipeline (using Python SDK+ DirecrRunner for testing purpose…) is reading from Pubsub topic
The message & attributes published are as follows:
message: [{"col1": "test column 1", "col2": "test column 1"}]
attributes:{
'event_time_v1': str(time.time()),
'record_id': 'row-1’,
}
I’m using the function beam.io.gcp.pubsub.ReadFromPubSub. The code/doc mentions id_label and timestamp_attribute arguments (I believe these are very new additions?! Updated only 13 days ago..)
When I use id_label in order to assign each element a unique id for dedupe purpose, I get following error:
NotImplementedError: DirectRunner: id_label is not supported for PubSub reads```
why so? am I correct in my understanding that some code implementation is still missing or am I missing something here ?
When I use timestamp_attribute = 'event_time_v1’, in order to assign my own timestamp to each element (client side event time passed in message attribute event_time_v1), I notice timestamp actually assigned to the element is still the message publish time
why so? I expected it would be the time passed in event_time_v1
I'm using following DoFn to print element's timestamp
class PrintFn(beam.DoFn):
print(element, timestamp)
return [element]
Thanks a lot in advance for any explanation to that
I have had the same problem with this today, there is actually an open issue on Jira for id_label and timestamp_attribute being unavailable in the direct runner (and I'm assuming from reading, any non dataflow runners). I've successfully been able to use id_label when specifying DataflowRunner as the runner (with some other issues, but that's by the by).
The Jira issue is below:
https://issues.apache.org/jira/browse/BEAM-4275?jql=text%20~%20%22python%20id_label%22
So, at the moment, it would appear this is not yet possible to do using the direct runner.
Related
I have Glue job, a python shell code. When I try to run it I end up getting the below error.
Job Name : xxxxx Job Run Id : yyyyyy failed to execute with exception Internal service error : Invalid input provided
It is not specific to code, even if I just put
import boto3
print('loaded')
I am getting the error right after clicking the run job option. What is the issue here?
It happend to me but the same job is working on a different account.
AWS documentation is not really explainative about this error:
The input provided was not valid.
I doubt this is an Amazon issue as mentionned #Quartermass
Same issue here in eu-west-2 yesterday, working now. This was only happening with Pythonshell jobs, not Pyspark ones, and job runs weren't getting as far as outputting any log streams. I can only assume it was an AWS issue they've now fixed and not issued a service announcement for.
I think Quatermass is right, the jobs started working out of the blue the next day without any changes.
I too received this super helpful error message.
What worked for me was explicitly setting properties like worker type, number of workers, Glue version and Python version.
In Terraform code:
resource "aws_glue_job" "my_job" {
name = "my_job"
role_arn = aws_iam_role.glue.arn
worker_type = "Standard"
number_of_workers = 2
glue_version = "4.0"
command {
script_location = "s3://my-bucket/my-script.py"
python_version = "3"
}
default_arguments = {
"--enable-job-insights" = "true",
"--additional-python-modules" : "boto3==1.26.52,pandas==1.5.2,SQLAlchemy==1.4.46,requests==2.28.2",
}
}
Update
After doing some more digging, I realised that what I needed was a Python shell script Glue job, not an ETL (Spark) job. By choosing this flavour of job, setting the Python version to 3.9 and "ticking the box" for Glue's pre-installed analytics libraries, my script, incidentally, had access to all the libraries I needed.
My Terraform code ended up looking like this:
resource "aws_glue_job" "my_job" {
name = "my-job"
role_arn = aws_iam_role.glue.arn
glue_version = "1.0"
max_capacity = 1
connections = [
aws_glue_connection.redshift.name
]
command {
name = "pythonshell"
script_location = "s3://my-bucket/my-script.py"
python_version = "3.9"
}
default_arguments = {
"--enable-job-insights" = "true",
"--library-set" : "analytics",
}
}
Note that I have switched to using Glue version 1.0. I arrived at this after some trial and error, and could not find this explicitly stated as the compatible version for pythonshell jobs… but it works!
Well, in my case, I get this error from time to time without any clear reason. The only thing that seems to cause the issue, is modifying some job parameter and saving the modifications. As soon as I save and try to execute the job, I usually get this error and, the only way to solve the issue, is destroying the job and, then, re-creating it again. Does anybody solved this issue by other means? As I saw in the accepted answer, the job simply begun to work again wthout any manual action, giving an understanding that the problem was a bug in AWS that was corrected.
I was facing a similar issue. I was invoking my job from a workflow. I could solve it by adding WorkerType, GlueVersion, NumberOfWorkers to the job before adding the job to the workflow. I could see it consistently fail before and succeed after this addition.
I am accessing an Intersystems cache 2017.1.xx instance through a python process to get various attributes about the database in able to monitor the database.
One of the items I want to monitor is license usage. I wrote a objectscript script in a Terminal window to access license usage by user:
s Rset=##class(%ResultSet).%New("%SYSTEM.License.UserListAll")
s r=Rset.Execute()
s ncol=Rset.GetColumnCount()
While (Rset.Next()) {f i=1:1:ncol w !,Rset.GetData(i)}
But, I have been unable to determine how to convert this script into a Python equivalent. I am using the intersys.pythonbind3 import for connecting and accessing the cache instance. I have been able to create python functions that accessing most everything else in the instance but this one piece of data I can not figure out how to translate it to Python (3.7).
Following should work (based on the documentation):
query = intersys.pythonbind.query(database)
query.prepare_class("%SYSTEM.License","UserListAll")
query.execute();
# Fetch each row in the result set, and print the
# name and value of each column in a row:
while 1:
cols = query.fetch([None])
if len(cols) == 0: break
print str(cols[0])
Also, notice that InterSystems IRIS -- successor to the Caché now has Python as an embedded language. See more in the docs
Since the noted query "UserListAll" is not defined correctly in the library; not SqlProc. So to resolve this issue would require a ObjectScript with the query and the use of #Result set or similar in Python to get the results. So I am marking this as resolved.
Not sure which Python interface you're using for Cache/IRIS, but this Open Source 3rd party one is worth investigating for the kind of things you're trying to do:
https://github.com/chrisemunt/mg_python
I'll preface this by saying I'm fairly new to BigQuery. I'm running into an issue when trying to schedule a query using the Python SDK. I used the example on the documentation page and modified it a bit but I'm running into errors.
Note that my query does use scripting to set some variables, and it's using a MERGE statement to update one of my tables. I'm not sure if that makes a huge difference.
def create_scheduled_query(dataset_id, project, name, schedule, service_account, query):
parent = transfer_client.common_project_path(project)
transfer_config = bigquery_datatransfer.TransferConfig(
destination_dataset_id=dataset_id,
display_name=name,
data_source_id="scheduled_query",
params={
"query": query
},
schedule=schedule,
)
transfer_config = transfer_client.create_transfer_config(
bigquery_datatransfer.CreateTransferConfigRequest(
parent=parent,
transfer_config=transfer_config,
service_account_name=service_account,
)
)
print("Created scheduled query '{}'".format(transfer_config.name))
I was able to successfully create a query with the function above. However the query errors out with the following message:
Error code 9 : Dataset specified in the query ('') is not consistent with Destination dataset '{my_dataset_name}'.
I've tried changing passing in "" as the dataset_id parameter, but I get the following error from the Python SDK:
google.api_core.exceptions.InvalidArgument: 400 Cannot create a transfer with parent projects/{my_project_name} without location info when destination dataset is not specified.
Interestingly enough I was able to successfully create this scheduled query in the GUI; the same query executed without issue.
I saw that the GUI showed the scheduled query's "Resource name" referenced a transferConfig, so I used the following command to see what that transferConfig looked like, to see if I could apply the same parameters using my Python script:
bq show --format=prettyjson --transfer_config {my_transfer_config}
Which gave me the following output:
{
"dataSourceId": "scheduled_query",
"datasetRegion": "us",
"destinationDatasetId": "",
"displayName": "test_scheduled_query",
"emailPreferences": {},
"name": "{REDACTED_TRANSFER_CONFIG_ID}",
"nextRunTime": "2021-06-18T00:35:00Z",
"params": {
"query": ....
So it looks like the GUI was able to use "" for destinationDataSetId but for whatever reason the Python SDK won't let me use that value.
Any help would be appreciated, since I prefer to avoid the GUI whenever possible.
UPDATE:
This does appear to be related to the scripting I used in my query. I removed the scripts from the query and it's working. I'm going to leave this open because I feel like this should be possible using the SDK since the query with scripting works in the console without issue.
This same thing also threw me through a loop but I managed to figure out what was wrong. The problem is with the
parent = transfer_client.common_project_path(project)
line that is given in the example query. By default, this returns something of the form projects/{project_id}. However, the CreateTransferConfigRequest documentation says of the parent parameter:
The BigQuery project id where the transfer configuration should be created. Must be in the format projects/{project_id}/locations/{location_id} or projects/{project_id}. If specified location and location of the destination bigquery dataset do not match - the request will fail.
Sure enough, if you use the projects/{project_id}/locations/{location_id} format instead, it resolves the error and allows you to pass a null destination_dataset_id.
I had the exact same issue. the fix for the issue is as below.
The below method returns Projects/{projectid}
parent = transfer_client.common_project_path(project_id)
instead use the below method , which returns projects/{project}/locations/{location}
parent = transfer_client.common_location_path(project_id , "EU")
I had tried with the above change , i am able to schedule a script in BQ.
I try to download a bingads report using python SDK, but I keep getting an error says: "Type not found: 'Aggregation'" after submitting a report request. I've tried all 4 options mentioned in the following link:
https://github.com/BingAds/BingAds-Python-SDK/blob/master/examples/v13/report_requests.py
Authentication process prior to request works just fine.
I execute the following:
report_request = get_report_request(authorization_data.account_id)
reporting_download_parameters = ReportingDownloadParameters(
report_request=report_request,
result_file_directory=FILE_DIRECTORY,
result_file_name=RESULT_FILE_NAME,
overwrite_result_file=True, # Set this value true if you want to overwrite the same file.
timeout_in_milliseconds=TIMEOUT_IN_MILLISECONDS
)
output_status_message("-----\nAwaiting download_report...")
download_report(reporting_download_parameters)
after a careful debugging, it seems that the program fails when trying to execute a command within "reporting_service_manager.py". Here is workflow:
download_report(self, download_parameters):
report_file_path = self.download_file(download_parameters)
then:
download_file(self, download_parameters):
operation = self.submit_download(download_parameters.report_request)
then:
submit_download(self, report_request):
self.normalize_request(report_request)
response = self.service_client.SubmitGenerateReport(report_request)
SubmitGenerateReport starts a sequence of events ending with a call to "_SeviceCall.init" function within "service_client.py", returning an exception "Type not found: 'Aggregation'"
try:
response = self.service_client.soap_client.service.__getattr__(self.name)(*args, **kwargs)
return response
except Exception as ex:
if need_to_refresh_token is False \
and self.service_client.refresh_oauth_tokens_automatically \
and self.service_client._is_expired_token_exception(ex):
need_to_refresh_token = True
else:
raise ex
Can anyone shed some light? .
Thanks
Please be sure to set Aggregation e.g., as shown here.
aggregation = 'Daily'
If the report type does not use aggregation, you can set Aggregation=None.
Does this help?
This may be a bit late 2 months after the fact but maybe this will help someone else. I had the same error (though I suppose it may not be the same issue). It does look like you did what I did (and I'm sure others will as well): copy-paste the Microsoft example code and tried to run it only to find that it didn't work.
I spent quite some time trying to debug the issue and it looked to me like the XML wasn't being searched correctly. I was using suds-py3 for the script at the time so I tried suds-community and everything just worked after that.
I also re-read the Bing Ads API walkthrough for getting started again and found that they recommend suds-jurko instead.
Long story short: If you want to use the bingads API don't use suds-py3, use either suds-community (which I can confirm works for everything I've used the API for) or suds-jurko (which is the one recommended by Microsoft).
I have a Python 2 script which uses boto3 library.
Basically, I have a list of instance ids and I need to iterate over it changing the type of each instance from c4.xlarge to t2.micro.
In order to accomplish that task, I'm calling the modify_instance_attribute method.
I don't know why, but my script hangs with the following error message:
EBS-optimized instances are not supported for your requested configuration.
Here is my general scenario:
Say I have a piece of code like this one below:
def change_instance_type(instance_id):
client = boto3.client('ec2')
response = client.modify_instance_attribute(
InstanceId=instance_id,
InstanceType={
'Value': 't2.micro'
}
)
So, If I execute it like this:
change_instance_type('id-929102')
everything works with no problem at all.
However, strange enough, if I execute it in a for loop like the following
instances_list = ['id-929102']
for instance_id in instances_list:
change_instance_type(instance_id)
I get the error message above (i.e., EBS-optimized instances are not supported for your requested configuration) and my script dies.
Any idea why this happens?
When I look at EBS optimized instances I don't see that T2 micros are supported:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
I think you would need to add EbsOptimized=false as well.