I have a setup where Airflow is running in Kubernetes (EKS) and a remote worker is running in docker-compose on a VM behind a firewall in a different location.
Problem
The Airflow webserver in EKS gets a 403 Forbidden error when trying to fetch task logs from the remote worker.
Build Version
Airflow - 2.2.2
OS - Linux - Ubuntu 20.04 LTS
Kubernetes - 1.22 (EKS)
Redis (Celery Broker) - Service Port exposed on 6379
PostgreSQL (Celery Backend) - Service Port exposed on 5432
Airflow ENV config setup
AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
AIRFLOW__CELERY__BROKER_URL: redis://<username>:<password>@redis-master.airflow-dev.svc.cluster.local:6379/0
AIRFLOW__CELERY__RESULT_BACKEND: >-
db+postgresql://<username>:<password>@db-postgresql.airflow-dev.svc.cluster.local/<db>
AIRFLOW__CLI__ENDPOINT_URL: http://{hostname}:8080
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
AIRFLOW__CORE__EXECUTOR: CeleryExecutor
AIRFLOW__CORE__FERNET_KEY: <fernet_key>
AIRFLOW__CORE__HOSTNAME_CALLABLE: socket.getfqdn
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
AIRFLOW__CORE__SQL_ALCHEMY_CONN: >-
postgresql+psycopg2://<username>:<password>@db-postgresql.airflow-dev.svc.cluster.local/<db>
AIRFLOW__LOGGING__BASE_LOG_FOLDER: /opt/airflow/logs
AIRFLOW__LOGGING__WORKER_LOG_SERVER_PORT: '8793'
AIRFLOW__WEBSERVER__BASE_URL: http://{hostname}:8080
AIRFLOW__WEBSERVER__SECRET_KEY: <secret_key>
_AIRFLOW_DB_UPGRADE: 'true'
_AIRFLOW_WWW_USER_CREATE: 'true'
_AIRFLOW_WWW_USER_PASSWORD: <password-webserver>
_AIRFLOW_WWW_USER_USERNAME: <username-webserver>
Airflow is using CeleryExecutor
Setup Test
Network reachability by ping - OK
Celery broker reachability from both EKS and the remote worker - OK
Celery backend reachability from both EKS and the remote worker - OK
Firewall port opened for the remote worker's Gunicorn API - OK
curl -v telnet://<remote-worker>:8793 test - OK (Connected)
Airflow Flower recognizes both the Kubernetes worker and the remote worker - OK
All the ENV values on the webserver, the workers (EKS and remote), and the scheduler are identical
A queue is set up so the DAG runs on that particular worker only
Time on the Docker container, the VM, and EKS is UTC; there is a slight 5-8 second difference between the Docker container and the pod in EKS
Ran a webserver on the remote VM as well, and it can pick up and show the logs
Description
Airflow is able to execute the DAG on the remote worker, and the logs can be seen on the remote worker itself. I have tried all combinations of settings but still keep getting 403.
Another test that was done was a plain curl with the webserver auth.
This curl was run both from EKS and from the remote server that hosts docker-compose. The results are the same on all servers.
curl --user <username-webserver> -vvv http://<remote-worker>:8793/logs/?<rest-of-the-log-url>
Getting 403 Forbidden
I might have misconfigured it, but I doubt that is the case.
Any tips on what I am missing here? Many thanks in advance.
https://github.com/apache/airflow/discussions/26624#discussioncomment-3715688
Following the discussion above with the Airflow community on GitHub, I synced the servers with NTP; EKS and the remote worker had a 135-second time drift.
Then I worked on the auth.
I rebuilt the curl auth from this file of the 2.2 branch: https://github.com/apache/airflow/blob/main/airflow/utils/log/file_task_handler.py
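For reference, the worker's log endpoint does not accept basic auth at all: around 2.2 the webserver signs the log-relative path with the shared webserver secret_key using itsdangerous and sends the token in the Authorization header, and the worker's log server verifies it within a short time window. Below is a rough reproduction sketch; the serializer arguments, the 30-second window, and the /log/ route are assumptions based on that file, so check your installed version:

# Rough sketch of the signed request the webserver sends to the worker's
# log server. Exact serializer settings and route are assumptions; verify
# against file_task_handler.py / serve_logs.py of your Airflow version.
# Requires an itsdangerous release that still ships
# TimedJSONWebSignatureSerializer (it was removed in itsdangerous 2.1).
import requests
from itsdangerous import TimedJSONWebSignatureSerializer

secret_key = "<same AIRFLOW__WEBSERVER__SECRET_KEY on webserver and worker>"
log_relative_path = "<dag_id>/<task_id>/<execution_date>/<try_number>.log"

signer = TimedJSONWebSignatureSerializer(secret_key, expires_in=30)
token = signer.dumps(log_relative_path)
if isinstance(token, bytes):
    token = token.decode("ascii")

resp = requests.get(
    "http://<remote-worker>:8793/log/" + log_relative_path,
    headers={"Authorization": token},
    timeout=10,
)
# 403 here usually means the secret keys differ or the clocks have drifted
# beyond the token's validity window.
print(resp.status_code)

This is why both the identical secret_key and the clock sync matter: a drift larger than the token's validity window is enough to get a 403 even with the right key.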
I later realized that the auth doesn't like special characters in the secret key; on top of that, the NTP time drift of 135 seconds (2 min 15 s) also factored in and caused confusion.
I would recommend that people facing this problem avoid special characters in the secret key. This is just one Airflow user's recommendation; I wouldn't say it is the only solution, but it is what helped me.
The special characters combined with the NTP drift made the issue confusing to debug; resolving NTP should come first, then the auth.
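A simple way to generate a key with no special characters is Python's secrets module; any alphanumeric generator will do, this is just one option:

# Generate a hex-only secret key (letters and digits, no special characters)
# and set it as AIRFLOW__WEBSERVER__SECRET_KEY on the webserver, scheduler,
# and every worker.
import secrets

print(secrets.token_hex(32))  # 64 hexadecimal characters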
Related
Setup
I have Docker installed and connected 9 machines, 1 manager and 8 worker nodes, using Docker swarm. This arrangement has been used in our development servers for ~5 years now.
I'm using this to launch a task queue that uses Celery for Python. Celery is using RabbitMQ as its broker and Redis for the results backend.
I have created an overlay network in Docker so that all my Celery workers launched by Docker swarm can reference their broker and results backend by name; i.e., rabbitmq or redis, instead of by IP address. The network was created by running the following command:
docker network create -d overlay <network_name>
The RabbitMQ service and Redis service were launched on the manager node under this overlay network using the following commands:
docker service create --network <my_overlay_network> --name redis --constraint "node.hostname==manager" redis
docker service create --network <my_overlay_network> --name rabbitmq --constraint "node.hostname==manager" rabbitmq
Once both of these have been launched, I deploy my Celery workers, one per each Docker swarm worker node, on the same overlay network using the following command:
docker service create --network <my_overlay_network> --name celery-worker --constraint "node.hostname!=manager" --replicas 8 --replicas-max-per-node 1 <my_celery_worker_image>
Before someone suggests it: yes, I know I should be using a Docker Compose file to launch all of this. I'm currently testing, and I'll write one up once I can get everything working.
The Problem
The Celery workers are configured to reference their broker and backend by the container name:
app = Celery('tasks', backend='redis://redis', broker='pyamqp://guest@rabbitmq//')
Once all the services have been launched and verified by Docker, 3 of the 8 start successfully, connect to the broker and backend, and allow me to begin running tasks on them. The other 5 continuously time out when attempting to connect to RabbitMQ and report the following message:
consumer: Cannot connect to amqp://guest:**@rabbitmq:5672//: timed out.
I'm at my wits' end trying to find out why only 3 of my worker nodes allow the connection to occur while the other 5 cause a continuous timeout. All launched services are connected over the same overlay network.
The issue persists when I attempt to use brokers other than RabbitMQ, leading me to think that it's not specific to any one broker. I'd likely have issues connecting to any service by name on the overlay network from the machines that are reporting the timeout. Stopping the service and launching it again always produces the same results - the same 3 nodes work while the other 5 time out.
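To separate a DNS failure from a blocked port, a quick check like the following can be run inside one of the timing-out worker containers (a minimal sketch; the service names and ports come from the setup above):

# Check, from inside a worker container, whether the service names resolve
# on the overlay network and whether the broker/backend ports are reachable.
import socket

for name, port in [("rabbitmq", 5672), ("redis", 6379)]:
    try:
        ip = socket.gethostbyname(name)              # overlay-network DNS lookup
        print(f"{name} resolves to {ip}")
        with socket.create_connection((ip, port), timeout=5):
            print(f"{name}:{port} is reachable")
    except OSError as exc:
        print(f"{name}:{port} failed: {exc}")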
All nodes are running the same version of Docker (19.03.4, build 9013bf583a), and the machines were created from identical images. They're virtually the same. The only difference among them is their hostnames, e.g., manager, worker1, worker2, and etc.
I have been able to replicate this setup outside of Docker swarm (all on one machine) by using a bridge network instead of overlay when developing my application on my personal computer without issue. I didn't experience problems until I launched everything on our development server, using the steps detailed above, to test it before pushing it to production.
Any ideas on why this is occurring and how I can remedy it? Switching from Docker swarm to Kubernetes isn't an option for me currently.
It's not the answer I wanted, but this appears to be an ongoing bug in Docker swarm. For anyone who is interested, I'll include the issue page:
https://github.com/docker/swarmkit/issues/1429
There's a workaround listed by one user there that may work for some, but your mileage may vary. It didn't work for me. The workaround is listed in the bullets below:
Don't try to use Docker for Windows to get a multi-node mesh network (swarm) running. It's simply not (yet) supported. If you google around, you'll find some Microsoft blogs about it. The Docker documentation also mentions it somewhere. It would be nice if the docker command itself printed an error/warning when trying to set something up under Windows that simply doesn't work. It does work on a single node, though.
Don't try to use Linux in a VirtualBox under Windows hoping to work around it. That, of course, doesn't work, since it has the same limitations as the underlying Windows.
Make sure you open at least ports 7946 tcp/udp and 4789 udp for worker nodes, and additionally 2377 tcp for the master. Use e.g. netcat -vz for the tcp check and add -u for the udp check.
Make sure to pass --advertise-addr on the Docker worker node (!) when executing the join swarm command. Put in the external IP address of the worker node that has the mentioned ports open. Double-check that the ports are really open!
Use ping to check that DNS resolution for container names works. Forgetting --advertise-addr or not opening port 7946 results in DNS resolution not working on worker nodes!
I suggest attempting all of the above first if you encounter the same issue. To clarify a few things in the above bullet points: the --advertise-addr flag should be used on a worker node when joining it to the swarm. If your worker node doesn't have a static IP address, you can use the interface name instead. Run ifconfig to view your interfaces; you'll need the one that has your external-facing IP address. For most people this will probably be eth0, but you should still check before running the command. With that, the command you would issue on the worker is:
docker swarm join --advertise-addr eth0:2377 --token <your_token> <manager_ip>:2377
2377 is the port Docker swarm uses. Verify that you joined with the correct IP address by going to your manager node and running the following:
docker node inspect <your_node_name>
If you don't know your node name, it should be the hostname of the machine you joined as a worker node. You can see it by running:
docker node ls
If you joined on the right interface, you will see something like this near the bottom of the inspect output:
"Status": {
    "State": "ready",
    "Addr": "<your_workers_external_ip_addr>"
}
If you verified that everything has joined correctly but the issue still persists, you can try launching your services with the additional flag --dns-option use-vc when running docker service create, like so:
docker service create --dns-option use-vc --network <my_overlay> ...
Lastly, if all of the above fails for you as it did for me, you can publish the port of the running service you wish to connect to in the swarm. In my case, I wanted to connect the services on my worker nodes to RabbitMQ and Redis on the manager node, so I published those services' ports. You can do this at creation time by running:
docker service create -p <port>:<port> ...
Or after the service has been launched by running:
docker service update --publish-add <port>:<port> <service_name>
After this, your worker node services can connect to the manager node service by the IP address of the worker node host and the port you exposed. For example, using RabbitMQ, this would be:
pyamqp://<user>:<pass>@<worker_host_ip_addr>:<port>/<vhost>
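On the Celery side, one way to keep both options open is to read the broker and backend URLs from environment variables, so the same image can use either the service-name URLs or the published host:port URLs above (the variable names here are just illustrative):

import os

from celery import Celery

# Defaults use the overlay-network service names; override the environment
# variables with the published <host_ip>:<port> form when needed.
broker_url = os.environ.get("CELERY_BROKER_URL", "pyamqp://guest:guest@rabbitmq:5672//")
result_backend = os.environ.get("CELERY_RESULT_BACKEND", "redis://redis:6379/0")

app = Celery("tasks", broker=broker_url, backend=result_backend)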
Hopefully this helps someone who stumbles on this post.
Context
I am running Apache Airflow and trying to run a sample Docker container using Airflow's DockerOperator. I am testing with docker-compose and deploying to Kubernetes (EKS). Whenever I run my task, I receive the error: ERROR - Error while fetching server API version. The error happens both on docker-compose and on EKS (Kubernetes).
I guess your Airflow Docker container is trying to launch a worker on the same Docker machine where it is running. To do so, you need to give Airflow's container special permissions and, as you said, access to the Docker socket. This is called Docker in Docker (DinD). There is more than one way to do it; this tutorial explains three different approaches. It also depends on where those containers run: Kubernetes, Docker machines, external services (like GitLab or GitHub), etc.
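As a rough sketch (not the exact setup from the question), assuming the host's /var/run/docker.sock is mounted into the container that runs the task, a DockerOperator task pointing at that socket looks roughly like this; with a DinD sidecar or a remote daemon you would point docker_url at its TCP endpoint instead:

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="docker_operator_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    DockerOperator(
        task_id="hello_docker",
        image="alpine:3.16",
        command="echo hello",
        # The Docker daemon the operator talks to; this is the default value.
        # "Error while fetching server API version" means this endpoint is
        # not reachable from inside the Airflow container.
        docker_url="unix://var/run/docker.sock",
        auto_remove=True,
    )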
I am having issues running my Python Flask application from a Docker pull (remote pull).
In my app I use RabbitMQ as the message broker and Celery as the task scheduler. It works as expected when running locally, but when I put my application in Docker and pull it on a remote system, the app itself runs fine while Celery and RabbitMQ are not running with it, so all tasks (called with method.delay()) run indefinitely and the HTTP request is never processed.
I need help putting my Python Flask application into Docker, as my application has asynchronous tasks to be processed with Celery. I am not sure how to modify docker-compose.yml to include a Celery service.
Thanks in advance.
I think you need to link the Celery container with rabbitmq.
From https://docs.docker.com/compose/compose-file/#links
Link to containers in another service. Either specify both the service name and a link alias (SERVICE:ALIAS), or just the service name.
links:
- rabbitmq
Or
- rabbitmq:rabbitmq
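On the application side, once the containers are linked (or share a network), the Celery instance only needs to reference the rabbitmq service name, and the Celery worker must run as its own service with the same code. A minimal Flask + Celery sketch along those lines (module and task names are illustrative):

# app.py - "rabbitmq" must resolve from inside both the web and worker containers
from celery import Celery
from flask import Flask, jsonify

flask_app = Flask(__name__)

celery_app = Celery(
    "tasks",
    broker="pyamqp://guest:guest@rabbitmq//",
    backend="rpc://",
)

@celery_app.task
def add(x, y):
    return x + y

@flask_app.route("/add")
def run_add():
    result = add.delay(2, 3)   # returns immediately; a Celery worker executes it
    return jsonify(task_id=result.id)

The worker container then runs something like celery -A app.celery_app worker as a separate docker-compose service alongside the Flask and rabbitmq services.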
Question
How do I specify the correct address of Dask workers on a remote resource to a Dask scheduler running locally?
Situation
I have a remote resource I can ssh into. There, I have a docker container that runs an image containing all the dependencies I need to run Dask, Distributed.
When run, the container executes the following:
dask-worker --nprocs 14 --nthreads 1 {inet_addr_local}:878
In the same network, but on my laptop, I run another container of the same image. In this container, I run the Dask scheduler, like so:
dask-scheduler --port 8786
When I start up the scheduler, everything is fine. When I start up the container of workers, it seems to connect to the scheduler. In the status I see the following:
Waiting to connect to: tcp://{this_matches_inet_address_of_local}:8786
On the scheduler, I see the following logged repeatedly, in a loop as it continually tries to contact/respond to each of the workers:
distributed.scheduler - INFO - Remove worker tcp://172.18.0.10:41508
distributed.scheduler - INFO - Removed worker tcp://172.18.0.10:41508
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://172.18.0.10:44590': Timed out trying to connect to 'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host
The issue (I think) can be seen here: tcp://172.18.0.10 is incorrect. The workers are running on a resource db.foo.net that I can ssh into via me@db.foo.net.
From the scheduler container, I can see that I am able to ping db.foo.net successfully. I think the workers are assuming their address is the local address of the container they are in, rather than db.foo.net. I need to override this default via some sort of configuration for the workers. I thought the --host flag would do it, but that causes Tornado to throw the following error: OSError: [Errno 99] Cannot assign requested address.
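The Errno 99 from --host is consistent with the worker trying to bind to an address the container does not own; the same error can be reproduced in a couple of lines (the IP below is a placeholder for the host's external address):

# Binding to an IP that is not assigned to this container raises error 99.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("203.0.113.10", 8786))   # replace with the host's external IP
except OSError as exc:
    print(exc)                       # [Errno 99] Cannot assign requested address
finally:
    s.close()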
Dask workers need to be able to contact the scheduler with the address given to them. It sounds like this isn't happening for you. This could be for many reasons associated to your network. A couple of possibilities:
You've mis-typed the address (for example I noticed that you used port 878 in one place in your question and port 8786 in another)
Your network doesn't allow communication on certain ports (check with your system administrator)
Your docker containers aren't set up to publish ports externally (you may need to do some docker-wiring or use the host network explicitly)
Unfortunately there isn't much that Dask itself can do to help you identify these network issues. You might try running other services on the relevant ports and seeing if you can recreate the lack of connectivity with common tools like ping or python -m http.server 8786.
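As a concrete version of that last suggestion, a throwaway listener and client can stand in for the scheduler and a worker to test a given port between the two containers, independently of Dask (the host and port below are placeholders):

# listener.py - run in the container that should be reachable on the port
import socketserver

class Ok(socketserver.BaseRequestHandler):
    def handle(self):
        self.request.sendall(b"ok\n")

with socketserver.TCPServer(("0.0.0.0", 8786), Ok) as server:
    server.serve_forever()

# client.py - run from the other container; a timeout or "No route to host"
# here means the problem is the network/port setup, not Dask
import socket

with socket.create_connection(("db.foo.net", 8786), timeout=5) as conn:
    print(conn.recv(16))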
I'm a newcomer to web applications and AWS, so please forgive me if the answer is a bit trivial!
I'm hosting a Python web application on an AWS EC2 server using nginx + uWSGI. This all works perfectly, except that when I terminate my connection (using PuTTY), my uWSGI application stops running, producing a "502 Bad Gateway" error from nginx.
I'm aware of adding the "&" to the uwsgi startup command (below), but that does not keep it running after I close my connection.
uwsgi --socket 127.0.0.1:8000 --master -s /tmp/uwsgi.sock --chmod-socket=666 -w wsgi2 &
How do I persist my uWSGI application to continue hosting my web app after I log out/terminate my connection?
Thanks in advance!
You'll typically want to start uwsgi from an init script. There are several ways to do this and the exact procedure depends on the Linux distribution you're using:
SystemV init script
upstart script (Ubuntu)
supervisor (Python-based process manager)
For the purposes of my use case, I will not have to reboot my machine in the near future, so Linux's nohup (no hangup) command works perfectly for this. It's a quick and dirty hack that's super powerful.