Can someone suggest an alternative to HdfsSensor for Airflow on Python 3? - python

I am trying to watch for changes in HDFS to trigger my ETL pipeline in Airflow, using HdfsSensor on Python 3. I get the following error because snakebite does not support Python 3:
ImportError: This HDFSHook implementation requires snakebite, but snakebite is not compatible with Python 3

Thanks to the suggestion by @AyushGoyal, I solved the same problem using WebHdfsSensor. This sensor works like HdfsSensor, so you can simply swap the class name. Just make sure that:
you pass the connection id via the webhdfs_conn_id parameter (in HdfsSensor the parameter name was hdfs_conn_id)
the connection points at the NameNode's WebHDFS (HTTP) port, by default 50070 on Hadoop 2.x and 9870 on Hadoop 3.x, not the RPC port 8020
The rest is the same!
Example:
from airflow.sensors.web_hdfs_sensor import WebHdfsSensor
file_sensor = WebHdfsSensor(
    task_id='check_if_data_is_ready',
    filepath="some_file_path",
    webhdfs_conn_id='hdfs_conn_id',
    poke_interval=10,
    timeout=5,
    dag=dag,
    env={
        'JAVA_HOME': '/usr/java/latest'
    }
)

Related

Python KustoManagementClient using my own credentials

I'm trying to use the azure-mgmt-kusto package for some Kusto cluster operations via KustoManagementClient. The client requires a TokenCredential in its constructor. In my scenario I would like to use my own AAD credentials, preferably through interactive login or IWA (Integrated Windows Authentication). The closest I have come is the following code:
creds = DefaultAzureCredential(exclude_interactive_browser_credential=False).get_token('')
kusto_client = azure.mgmt.kusto.KustoManagementClient(credential=creds, subscription_id='<>')
but the second line raises an error:
Expected type 'TokenCredential', got 'AccessToken' instead
which I couldn't find any way around.
Any suggestions on how to resolve this, or other methods to use?
Actually, simply trying it despite the PyCharm warning worked. The client wants the credential object itself, not the AccessToken returned by get_token():
from azure.identity import DefaultAzureCredential
from azure.mgmt.kusto import KustoManagementClient
credential = DefaultAzureCredential()
kusto_management_client = KustoManagementClient(credential, subId)

AWS Glue error - Invalid input provided while running python shell program

I have a Glue job that runs Python shell code. When I try to run it, I get the error below.
Job Name : xxxxx Job Run Id : yyyyyy failed to execute with exception Internal service error : Invalid input provided
It is not specific to the code; I get it even if I just put
import boto3
print('loaded')
I am getting the error right after clicking the run job option. What is the issue here?
It happened to me too, but the same job works on a different account.
The AWS documentation does not really explain this error:
The input provided was not valid.
I suspect this is an Amazon issue, as mentioned by @Quartermass.
Same issue here in eu-west-2 yesterday, working now. This was only happening with Pythonshell jobs, not Pyspark ones, and job runs weren't getting as far as outputting any log streams. I can only assume it was an AWS issue they've now fixed and not issued a service announcement for.
I think Quatermass is right: the jobs started working out of the blue the next day without any changes.
I too received this super helpful error message.
What worked for me was explicitly setting properties like worker type, number of workers, Glue version and Python version.
In Terraform code:
resource "aws_glue_job" "my_job" {
  name              = "my_job"
  role_arn          = aws_iam_role.glue.arn
  worker_type       = "Standard"
  number_of_workers = 2
  glue_version      = "4.0"
  command {
    script_location = "s3://my-bucket/my-script.py"
    python_version  = "3"
  }
  default_arguments = {
    "--enable-job-insights"       = "true",
    "--additional-python-modules" = "boto3==1.26.52,pandas==1.5.2,SQLAlchemy==1.4.46,requests==2.28.2",
  }
}
Update
After doing some more digging, I realised that what I needed was a Python shell script Glue job, not an ETL (Spark) job. By choosing this flavour of job, setting the Python version to 3.9 and "ticking the box" for Glue's pre-installed analytics libraries, my script, incidentally, had access to all the libraries I needed.
My Terraform code ended up looking like this:
resource "aws_glue_job" "my_job" {
  name         = "my-job"
  role_arn     = aws_iam_role.glue.arn
  glue_version = "1.0"
  max_capacity = 1
  connections = [
    aws_glue_connection.redshift.name
  ]
  command {
    name            = "pythonshell"
    script_location = "s3://my-bucket/my-script.py"
    python_version  = "3.9"
  }
  default_arguments = {
    "--enable-job-insights" = "true",
    "--library-set"         = "analytics",
  }
}
Note that I have switched to using Glue version 1.0. I arrived at this after some trial and error, and could not find this explicitly stated as the compatible version for pythonshell jobs… but it works!
Well, in my case, I get this error from time to time without any clear reason. The only thing that seems to trigger it is modifying a job parameter and saving the change. As soon as I save and try to execute the job, I usually get this error, and the only fix I have found is destroying the job and re-creating it. Has anybody solved this issue by other means? As I saw in the accepted answer, the job simply began to work again without any manual action, which suggests the problem was a bug in AWS that has since been corrected.
I was facing a similar issue. I was invoking my job from a workflow. I could solve it by adding WorkerType, GlueVersion, NumberOfWorkers to the job before adding the job to the workflow. I could see it consistently fail before and succeed after this addition.
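For a job created through the API rather than the console, the same three properties can be set explicitly. The following is only a minimal sketch, assuming boto3's Glue create_job call; the helper name build_glue_job_args and the default values (G.1X, 2 workers, Glue 4.0) are illustrative assumptions, not values taken from the answer above:

```python
def build_glue_job_args(name, role_arn, script_location,
                        worker_type="G.1X", number_of_workers=2,
                        glue_version="4.0"):
    """Build keyword arguments for a Glue create_job call.

    Hypothetical helper: it only spells out WorkerType, NumberOfWorkers
    and GlueVersion instead of relying on service-side defaults.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job; assumption for this sketch
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
        "GlueVersion": glue_version,
    }
```

With boto3 installed and credentials configured, the result would be passed along as boto3.client("glue").create_job(**build_glue_job_args(...)).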

Get SMQ queues in Python?

I'm using the PyRFC library and so far I've managed to connect to SAP.
from pyrfc import Connection
conn = Connection(ashost='xxxxxxxxx', sysnr='02', client='100', user='xxxxx', passwd='xxxxxxxx.')
result = conn.call('STFC_CONNECTION', REQUTEXT=u'Hello SAP!')
print(result)
I ran the connection test and everything was OK. Now I'm trying to run the queues created in SAP.
I performed some tests trying to simulate F8, but without success.
Is there any way to trigger this execution from Python?
You cannot query SMQ1/SMQ2 results directly; for an extended explanation, read my answer here.
To get the SMQ results in Python you need a function module with equivalent functionality, and the good news is that I found one for you: TRFC_QIN_GET_CURRENT_QUEUES.
Any pyrfc tutorial shows how to call a function module. A probable sample:
from pyrfc import Connection

def func():
    # Fill in the remaining connection parameters (sysnr, client, ...) for your system.
    conn = Connection(user='', passwd='', ashost='')
    queue_name = '*'
    client = '100'
    result = conn.call("TRFC_QIN_GET_CURRENT_QUEUES", qname=queue_name,
                       client=client)
    print(result['QVIEW'])
All you need to supply here is the queue name (the qname parameter) and the client value (what is it in your system?).

Error when changing instance type in a python for loop

I have a Python 2 script which uses boto3 library.
Basically, I have a list of instance ids and I need to iterate over it changing the type of each instance from c4.xlarge to t2.micro.
In order to accomplish that task, I'm calling the modify_instance_attribute method.
I don't know why, but my script fails with the following error message:
EBS-optimized instances are not supported for your requested configuration.
Here is my general scenario:
Say I have a piece of code like this one below:
import boto3

def change_instance_type(instance_id):
    client = boto3.client('ec2')
    response = client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={
            'Value': 't2.micro'
        }
    )
So, if I execute it like this:
change_instance_type('id-929102')
everything works with no problem at all.
However, strangely enough, if I execute it in a for loop like the following
instances_list = ['id-929102']
for instance_id in instances_list:
    change_instance_type(instance_id)
I get the error message above (i.e., EBS-optimized instances are not supported for your requested configuration) and my script dies.
Any idea why this happens?
Looking at the list of EBS-optimized instance types, I don't see t2.micro among them:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
I think you would need to set EbsOptimized to false as well.
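A minimal sketch of that suggestion follows. It assumes the instance is already stopped and relies on modify_instance_attribute accepting only one attribute per call, so EBS optimization is switched off in a separate call before the type change. Passing the client in as an argument and the new_type parameter are choices made for this sketch, not part of the original script:

```python
def change_instance_type(client, instance_id, new_type="t2.micro"):
    """Disable EBS optimization, then change the type of a stopped instance.

    `client` is expected to be a boto3 EC2 client, e.g. boto3.client('ec2').
    """
    # First call: turn EBS optimization off, since the target type
    # may not support it.
    client.modify_instance_attribute(
        InstanceId=instance_id,
        EbsOptimized={"Value": False},
    )
    # Second call: change the instance type itself.
    client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
```

In the loop from the question this would be called as change_instance_type(boto3.client('ec2'), instance_id) for each id in instances_list.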

Check Domain Availability using Boto Route53

I love using the Boto API for Amazon Web Services, but right now I can't find where the error is.
I'm using AWS to check domain availability, and I have written a Python script that uses the class at this link:
https://www.codatlas.com/github.com/boto/boto/develop/boto/route53/domains/layer1.py?line=67
I call the check_domain_availability() method, passing the domain name:
Route53DomainsConnection.check_domain_availability('example.com',None)
but the method returns this error:
AttributeError: 'str' object has no attribute 'make_request'
I have tried passing the parameters in many different ways, without success.
Where am I going wrong? Thanks in advance.
P.S.: I use Debian wheezy and Python 3.2
More on status of subdomains
I have found a way to get the status of a record just created with Route 53.
This is the code:
changes = ResourceRecordSets(conn, "ZONEID")
change = changes.add_change("STRING FOR ADD NEW SUBDOMAIN")
change.add_value(MY_IP)
status = changes.commit()
If I print the status variable, it contains the response of commit, including the status:
{u'ChangeResourceRecordSetsResponse':{u'ChangeInfo': {u'Status: u'PENDING etc.....
Now I would like to move on to another operation only once the status of the subdomain is "INSYNC", but I am not able to access the string dynamically to check the status.
Can I use a while loop? Can I use the sleep command? Can anyone help me resolve my problem? Thanks
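On the status question, a while loop combined with sleep is a reasonable approach. The sketch below is generic and makes no claim about boto's API: fetch_status is a hypothetical callable that you would write yourself to re-query the change and return its current status string, and the interval and timeout defaults are arbitrary.

```python
import time

def wait_for_status(fetch_status, target="INSYNC", interval=5.0,
                    timeout=300.0, sleep=time.sleep):
    """Poll fetch_status() until it returns `target`.

    Returns True once the target status is seen, or False if roughly
    `timeout` seconds of waiting pass first.
    """
    waited = 0.0
    while True:
        if fetch_status() == target:
            return True
        if waited >= timeout:
            return False
        sleep(interval)  # pause before asking again
        waited += interval
```

The next operation would then run only if wait_for_status(...) returns True. The sleep argument exists so the loop can be exercised without real delays.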
You don't show your code, which makes it harder to debug, but this line:
Route53DomainsConnection.check_domain_availability('example.com',None)
looks suspicious. It looks like you are trying to access the check_domain_availability method from the class rather than an instance of the class. I just did the following and it worked for me:
In [1]: import boto.route53.domains
In [2]: c = boto.route53.domains.connect_to_region('us-east-1')
In [3]: c.check_domain_availability('foobar.com')
Out[3]: {u'Availability': u'UNAVAILABLE'}
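To see why calling the method on the class produces that exact error, here is a small self-contained sketch. FakeConnection is a hypothetical stand-in written for this illustration, not boto code; it only mimics the shape of a method that calls self.make_request:

```python
class FakeConnection:
    """Hypothetical stand-in for a connection class."""

    def make_request(self, action, params):
        return {"action": action, "params": params}

    def check_domain_availability(self, domain_name, idn_lang_code=None):
        return self.make_request("CheckDomainAvailability",
                                 {"DomainName": domain_name})


def call_on_class():
    """Call the method on the class itself and return the error text."""
    try:
        # 'example.com' is bound to `self`, so self.make_request
        # is looked up on a str and fails.
        FakeConnection.check_domain_availability("example.com", None)
    except AttributeError as exc:
        return str(exc)
    return None


def call_on_instance():
    """Call the method on an instance, as in the session above."""
    return FakeConnection().check_domain_availability("example.com")
```

When the method is called on the class, the first positional argument takes the place of self, which is why the error mentions a 'str' object. Creating an instance first, as connect_to_region does in the session above, avoids it.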
