MWAA EKSPodOperator error when assuming role - python

We have the following basic EKSPodOperator on MWAA (Airflow version 2.2.2):
start_pod = EKSPodOperator(
    aws_conn_id="eks-connection",
    task_id='start_pod',
    namespace="airflow",
    cluster_name="eks-data-stg",
    in_cluster=False,
    service_account_name="airflow-sa",
    image='amazon/aws-cli:latest',
    cmds=['sh', '-c', 'echo Test Airflow; date'],
    labels={'demo': 'hello_world'},
    get_logs=True,
    # Delete the pod when it reaches its final state, or the execution is interrupted.
    is_delete_operator_pod=True,
)
This fails with the following error:
airflow.exceptions.AirflowConfigException: `[logging] logging_level` should not be 'fatal'. Possible values: CRITICAL, FATAL, ERROR, WARN, WARNING, INFO, DEBUG.
We traced this back to the following issue, which is fixed in a newer version of the operator: https://github.com/apache/airflow/issues/21421.
Unfortunately we are not able to override the airflow-providers-amazon package.
Has anyone found a way around this bug by either overriding the dependency or fixing the operator?


How to fix mypy type Any error in sqlalchemy Table.columns

I'm still new to type hints. Here's the minimal code example of the error I'm getting:
import sqlalchemy as sa
t = sa.Table("a", sa.MetaData(), sa.Column("id_", sa.Integer))
cols = t.columns
This raises the following error when I run mypy:
error: Expression type contains "Any" (has type "ReadOnlyColumnCollection[str, Column[Any]]") [misc]
I'm running mypy with the following configuration turned on (link):
disallow_any_expr = true
I've looked at the SQLAlchemy source code, and the .columns attribute of the Table class does indeed have the return type that mypy states.
I don't know, however, how I could go about altering that to remove the Any. Would that even be the correct approach?
If it's not source code that you control (a), the easiest option is usually to drop a # type: ignore[xxx] at the end of the offending line.
I usually also add a comment stating why it's needed, so that anyone looking at the code later understands; during our PR process we have to justify any disabling of checks for the compliance tools (mypy/pylint/bandit/isort/black).
(a) We also follow this guideline if the effort to fix our legacy code is more than a certain threshold. For example, we don't want to have to refactor 10,000 lines of code to make a one-line bug fix :-) All new code (that we control) has to comply, however.
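Applied to the snippet in the question, a minimal sketch of that suppression (using the [misc] error code reported above) could look like this:
import sqlalchemy as sa

t = sa.Table("a", sa.MetaData(), sa.Column("id_", sa.Integer))
# Column is parametrised with Any upstream, so silence disallow_any_expr here.
cols = t.columns  # type: ignore[misc]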
Are you sure you are using the latest SQLAlchemy version?
For me, when I run mypy to test your code, I find no issues:
f.py:5: note: Revealed type is "sqlalchemy.sql.base.ReadOnlyColumnCollection[builtins.str, sqlalchemy.sql.schema.Column[Any]]"
Success: no issues found in 1 source file
when running it on the following:
import sqlalchemy as sa
t = sa.Table("a", sa.MetaData(), sa.Column("id_", sa.Integer))
cols = t.columns
reveal_type(cols)
Mypy: 1.0.0
sqlalchemy: 2.0.2

AWS Glue error - Invalid input provided while running python shell program

I have a Glue job, a Python shell script. When I try to run it, I end up getting the error below.
Job Name : xxxxx Job Run Id : yyyyyy failed to execute with exception Internal service error : Invalid input provided
It is not specific to my code; even if I just put
import boto3
print('loaded')
I am getting the error right after clicking the run job option. What is the issue here?
It happened to me, but the same job is working in a different account.
The AWS documentation is not very informative about this error:
The input provided was not valid.
I suspect this is an Amazon issue, as mentioned by @Quartermass.
Same issue here in eu-west-2 yesterday, working now. This was only happening with Python shell jobs, not PySpark ones, and job runs weren't getting as far as outputting any log streams. I can only assume it was an AWS issue they've now fixed and not issued a service announcement for.
I think Quartermass is right; the jobs started working out of the blue the next day without any changes.
I too received this super helpful error message.
What worked for me was explicitly setting properties like worker type, number of workers, Glue version and Python version.
In Terraform code:
resource "aws_glue_job" "my_job" {
name = "my_job"
role_arn = aws_iam_role.glue.arn
worker_type = "Standard"
number_of_workers = 2
glue_version = "4.0"
command {
script_location = "s3://my-bucket/my-script.py"
python_version = "3"
}
default_arguments = {
"--enable-job-insights" = "true",
"--additional-python-modules" : "boto3==1.26.52,pandas==1.5.2,SQLAlchemy==1.4.46,requests==2.28.2",
}
}
Update
After doing some more digging, I realised that what I needed was a Python shell script Glue job, not an ETL (Spark) job. By choosing this flavour of job, setting the Python version to 3.9 and "ticking the box" for Glue's pre-installed analytics libraries, my script, incidentally, had access to all the libraries I needed.
My Terraform code ended up looking like this:
resource "aws_glue_job" "my_job" {
name = "my-job"
role_arn = aws_iam_role.glue.arn
glue_version = "1.0"
max_capacity = 1
connections = [
aws_glue_connection.redshift.name
]
command {
name = "pythonshell"
script_location = "s3://my-bucket/my-script.py"
python_version = "3.9"
}
default_arguments = {
"--enable-job-insights" = "true",
"--library-set" : "analytics",
}
}
Note that I have switched to using Glue version 1.0. I arrived at this after some trial and error, and could not find this explicitly stated as the compatible version for pythonshell jobs… but it works!
Well, in my case, I get this error from time to time without any clear reason. The only thing that seems to cause the issue is modifying some job parameter and saving the modifications. As soon as I save and try to execute the job, I usually get this error, and the only way to solve it is to destroy the job and then re-create it. Has anybody solved this issue by other means? As I saw in the accepted answer, the job simply began to work again without any manual action, suggesting the problem was a bug on the AWS side that was later corrected.
I was facing a similar issue. I was invoking my job from a workflow. I could solve it by adding WorkerType, GlueVersion, and NumberOfWorkers to the job before adding the job to the workflow. I could see it consistently fail before and succeed after this addition.
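A minimal boto3 sketch of that idea (the job name, role ARN, script location, and chosen values here are illustrative, not taken from the setup above):
import boto3

glue = boto3.client("glue")

# Define the job with WorkerType, GlueVersion and NumberOfWorkers set explicitly
# before it is referenced from a workflow.
glue.create_job(
    Name="my-workflow-job",
    Role="arn:aws:iam::123456789012:role/my-glue-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/my-script.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)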

Why pylint returns `unsubscriptable-object` for numpy.ndarray.shape?

I just put together the following "minimum" repro case (minimum in quotes because I wanted to ensure pylint threw no other errors, warnings, hints, or suggestions - meaning there's a bit of boilerplate):
pylint_error.py:
"""
Docstring
"""
import numpy as np
def main():
"""
Main entrypoint
"""
test = np.array([1])
print(test.shape[0])
if __name__ == "__main__":
main()
When I run pylint on this code (pylint pylint_error.py) I get the following output:
$> pylint pylint_error.py
************* Module pylint_error
pylint_error.py:13:10: E1136: Value 'test.shape' is unsubscriptable (unsubscriptable-object)
------------------------------------------------------------------
Your code has been rated at 1.67/10 (previous run: 1.67/10, +0.00)
It claims that test.shape is not subscriptable, even though it quite clearly is. When I run the code it works just fine:
$> python pylint_error.py
1
So what's causing pylint to become confused, and how can I fix it?
Some additional notes (consolidated in the snippet after this list):
If I declare test as np.arange(1) the error goes away
If I declare test as np.zeros(1), np.zeros((1)), np.ones(1), or np.ones((1)) the error does not go away
If I declare test as np.full((1), 1) the error goes away
Specifying the type (test: np.ndarray = np.array([1])) does not fix the error
Specifying a dtype (np.array([1], dtype=np.uint8)) does not fix the error
Taking a slice of test (test[:].shape) makes the error go away
My first instinct says that the inconsistent behavior across the various NumPy methods (arange vs zeros vs full, etc.) suggests it's just a bug in NumPy. However, it's possible there's some underlying concept of NumPy that I'm misunderstanding. I'd like to be sure I'm not writing code with undefined behavior that's only working by accident.
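Consolidating those notes into one runnable snippet (the comments record the behavior described above, as seen with the pylint/astroid versions in use at the time):
import numpy as np

a = np.array([1])
print(a.shape[0])      # triggers E1136 (unsubscriptable-object)

b = np.arange(1)
print(b.shape[0])      # no error

c = np.zeros(1)
print(c.shape[0])      # still triggers the error

d = np.full((1), 1)
print(d.shape[0])      # no error

e: np.ndarray = np.array([1])
print(e.shape[0])      # the type annotation does not help

print(a[:].shape[0])   # taking a slice first makes the error go away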
I don't have enough reputation to comment, but it looks like this is an open issue: https://github.com/PyCQA/pylint/issues/3139
Until the issue is resolved on their end, I would just change the line to
print(test.shape[0])  # pylint: disable=E1136  # pylint/issues/3139
or add the corresponding disable to my pylintrc file.
As of November 2019:
As mentioned by one of the users in the discussion on GitHub, you could resolve the problem by downgrading both pylint and astroid, e.g. in requirements.txt:
astroid>=2.0, <2.3
pylint>=2.3, <2.4
or
pip install astroid==2.2.5 && pip install pylint==2.3.1
This was finally fixed with the release of astroid 2.4.0 in May 2020.
https://github.com/PyCQA/pylint/issues/3139

Ansible ERROR! no action detected in task

I am trying to run the playbook https://github.com/Datanexus/dn-cassandra.
Of the different deployment scenarios listed there, I am going for the multi-node Cassandra setup described here: deployment scenarios.
I have set up a static inventory file:
cassandra-seed-01 ansible_ssh_host=192.168.0.17 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
cassandra-seed-02 ansible_ssh_host=192.168.0.18 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
cassandra-non-seed-01 ansible_ssh_host=192.168.0.22 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
[cassandra_seed]
192.168.0.17
192.168.0.18
[cassandra]
192.168.0.22
However, when I try running the playbook, it throws the following error:
ERROR! no action detected in task
The error appears to have been in
'/home/laumair/workspace/dn-cassandra/provision-cassandra.yml': line
21, column 9, but may be elsewhere in the file depending on the exact
syntax problem.
The offending line appears to be:
# then, build the seed and non-seed host groups
- include_role:
^ here
I would appreciate any sort of direction with this error, as I have tried solutions for similar errors but have had no luck so far.
include_role is available since Ansible 2.2.
Please upgrade your Ansible installation.

How to Set spark.sql.parquet.output.committer.class in pyspark

I'm trying to set spark.sql.parquet.output.committer.class and nothing I do seems to get the setting to take effect.
I'm trying to have many threads write to the same output folder, which would work with org.apache.spark.sql.parquet.DirectParquetOutputCommitter since it wouldn't use the _temporary folder. I'm getting the following error, which is how I know it's not working:
Caused by: java.io.FileNotFoundException: File hdfs://path/to/stuff/_temporary/0/task_201606281757_0048_m_000029/some_dir does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:849)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:382)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
Note the call to org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob, the default class.
I've tried the following, based on other SO answers and searches:
sc._jsc.hadoopConfiguration().set(key, val) (this does work for settings like parquet.enable.summary-metadata)
dataframe.write.option(key, val).parquet
Adding --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call
Adding --conf "spark.sql.parquet.output.committer.class"=" org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call.
That's all I've been able to find, and nothing works. It looks like it's not hard to set in Scala but appears impossible in Python.
The approach in this comment definitely worked for me:
16/06/28 18:49:59 INFO ParquetRelation: Using user defined output committer for Parquet: org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
That log message was buried in the flood of output Spark produces, and the error I was seeing turned out to be unrelated. It's all moot anyway, since DirectParquetOutputCommitter has since been removed from Spark.
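For reference, a minimal PySpark sketch of the hadoopConfiguration().set(...) mechanics listed in the question, using the committer class name shown in the log line above. Whether this particular route takes effect depends on the Spark version, and it only applies to the old releases that still shipped DirectParquetOutputCommitter; the application name and output path are illustrative:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="direct-committer-example")
# Set the committer class on the Hadoop configuration backing the context.
sc._jsc.hadoopConfiguration().set(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter",
)

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# If the setting took effect, the INFO logs should mention
# "Using user defined output committer for Parquet".
df.write.mode("overwrite").parquet("hdfs:///tmp/direct_committer_test")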
