Question
How do I specify the correct address of Dask workers on a remote resource to a Dask scheduler running locally?
Situation
I have a remote resource I can ssh into. There, I have a Docker container that runs an image containing all the dependencies I need to run Dask Distributed.
When run, the container executes the following:
dask-worker --nprocs 14 --nthreads 1 {inet_addr_local}:878
In the same network, but on my laptop, I run another container of the same image. In this container, I run the Dask scheduler, like so:
dask-scheduler --port 8786
When I start up the scheduler, everything is fine. When I start up the container of workers, it seems to connect to the scheduler. In the status I see the following:
Waiting to connect to: tcp://{this_matches_inet_address_of_local}:8786
On the scheduler, I see the following logged repeatedly, in a loop as it continually tries to contact/respond to each of the workers:
distributed.scheduler - INFO - Remove worker tcp://172.18.0.10:41508
distributed.scheduler - INFO - Removed worker tcp://172.18.0.10:41508
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://172.18.0.10:44590': Timed out trying to connect to 'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host
The issue (I think) can be seen here: tcp://172.18.0.10 is incorrect. The workers are running on a resource db.foo.net that I can ssh into via me@db.foo.net.
From the scheduler container, I can see that I am able to ping db.foo.net successfully. I think that the workers are assuming their address is the local address of the container they are in, and not db.foo.net. I need to override this default via some sort of configuration for the workers. I thought the --host flag would do it, but that causes Tornado to throw the following error: OSError: [Errno 99] Cannot assign requested address.
Dask workers need to be able to contact the scheduler with the address given to them. It sounds like this isn't happening for you. This could be for many reasons associated with your network. A couple of possibilities:
You've mistyped the address (for example, I noticed that you used port 878 in one place in your question and port 8786 in another)
Your network doesn't allow communication on certain ports (check with your system administrator)
Your docker containers aren't set up to publish ports externally (you may need to do some docker-wiring or use the host network explicitly)
Unfortunately there isn't much that Dask itself can do to help you identify these network issues. You might try running other services on the relevant ports and seeing if you can recreate the lack of connectivity with common tools like ping or python -m http.server 8786
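For instance, a minimal sketch of that kind of check, assuming the worker host is db.foo.net and that a test port such as 8786 is free there:
# on the worker host: occupy the port with a dummy server
python -m http.server 8786
# from the scheduler container: check that the port is reachable
curl http://db.foo.net:8786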
Related
Setup
I have Docker installed and have connected 9 machines, 1 manager and 8 worker nodes, using Docker swarm. This arrangement has been used on our development servers for ~5 years now.
I'm using this to launch a task queue that uses Celery for Python. Celery is using RabbitMQ as its broker and Redis for the results backend.
I have created an overlay network in Docker so that all my Celery workers launched by Docker swarm can reference their broker and results backend by name; i.e., rabbitmq or redis, instead of by IP address. The network was created by running the following command:
docker network create -d overlay <network_name>
The RabbitMQ service and Redis service were launched on the manager node under this overlay network using the following commands:
docker service create --network <my_overlay_network> --name redis --constraint "node.hostname==manager" redis
docker service create --network <my_overlay_network> --name rabbitmq --constraint "node.hostname==manager" rabbitmq
Once both of these have been launched, I deploy my Celery workers, one per Docker swarm worker node, on the same overlay network using the following command:
docker service create --network <my_overlay_network> --name celery-worker --constraint "node.hostname!=manager" --replicas 8 --replicas-max-per-node 1 <my_celery_worker_image>
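Once everything is launched, placement can be verified with the standard swarm commands, e.g.:
docker service ls                  # replica counts per service
docker service ps celery-worker    # which node each replica landed on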
Before someone suggests it: yes, I know I should be using a Docker Compose file to launch all of this. I'm currently testing, and I'll write one up once I can get everything working.
The Problem
The Celery workers are configured to reference their broker and backend by the container name:
app = Celery('tasks', backend='redis://redis', broker='pyamqp://guest@rabbitmq//')
Once all the services have been launched and verified by Docker, 3 of the 8 start successfully, connect to the broker and backend, and allow me to begin running tasks on them. The other 5 continuously time out when attempting to connect to RabbitMQ and report the following message:
consumer: Cannot connect to amqp://guest:**@rabbitmq:5672//: timed out.
I'm at my wits' end trying to find out why only 3 of my worker nodes allow the connection to occur while the other 5 cause a continuous timeout. All launched services are connected over the same overlay network.
The issue persists when I attempt to use brokers other than RabbitMQ, leading me to think that it's not specific to any one broker. I'd likely have issues connecting to any service by name on the overlay network from the machines that are reporting the timeout. Stopping the service and launching it again always produces the same results: the same 3 nodes work while the other 5 time out.
All nodes are running the same version of Docker (19.03.4, build 9013bf583a), and the machines were created from identical images. They're virtually the same. The only difference among them is their hostnames, e.g., manager, worker1, worker2, etc.
I was able to replicate this setup outside of Docker swarm (all on one machine, using a bridge network instead of an overlay) while developing my application on my personal computer, without issue. I didn't experience problems until I launched everything on our development servers, using the steps detailed above, to test it before pushing it to production.
Any ideas on why this is occurring and how I can remedy it? Switching from Docker swarm to Kubernetes isn't an option for me currently.
It's not the answer I wanted, but this appears to be an ongoing bug in Docker swarm. For any who are interested, I'll include the issue page:
https://github.com/docker/swarmkit/issues/1429
There's a workaround listed by one user there that may work for some, but your mileage may vary. It didn't work for me. The workaround is listed in the bullets below:
Don't try to use Docker for Windows to get a multi-node mesh network (swarm) running. It's simply not (yet) supported. If you google around, you'll find some Microsoft blogs telling about it. The Docker documentation also mentions it somewhere. It would be nice if the docker command itself printed an error/warning when trying to set something up under Windows which simply doesn't work. It does work on a single node, though.
Don't try to use Linux in a VirtualBox under Windows hoping to work around it. That, of course, doesn't work, since it has the same limitations as the underlying Windows.
Make sure you open at least ports 7946 tcp/udp and 4789 udp for worker nodes. For the master, also 2377 tcp. Use e.g. netcat -vz -u for a udp check, and without -u for tcp (see the sketch after this list).
Make sure to pass --advertise-addr on the docker worker node (!) when executing the join swarm command. Here, put the external IP address of the worker node which has the mentioned ports open. Double-check that the ports are really open!
Use ping to check that DNS resolution for container names works. Forgetting --advertise-addr or not opening port 7946 results in DNS resolution not working on worker nodes!
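A minimal sketch of those port checks, with placeholder addresses (10.0.0.2 for the manager, 10.0.0.11 for a worker):
# swarm management port on the manager (tcp)
netcat -vz 10.0.0.2 2377
# gossip port on a worker (tcp and udp)
netcat -vz 10.0.0.11 7946
netcat -vz -u 10.0.0.11 7946
# overlay/VXLAN data plane on a worker (udp)
netcat -vz -u 10.0.0.11 4789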
I suggest attempting all of the above first if you encounter the same issue. To clarify a few things in the above bullet points: the --advertise-addr flag should be used on a worker node when joining it to the swarm. If your worker node doesn't have a static IP address, you can pass the interface name instead. Run ifconfig to view your interfaces. You'll need to use the interface that has your external-facing IP address. For most people this will probably be eth0, but you should still check before running the command. Doing this, the command you would issue on the worker is:
docker swarm join --advertise-addr eth0:2377 --token <your_token> <manager_ip>:2377
With 2377 being the port Docker uses. Verify that you joined with your correct IP address by going into your manager node and running the following:
docker node inspect <your_node_name>
If you don't know your node name, it should be the host name of the machine which you joined as a worker node. You can see it by running:
docker node ls
If you joined on the right interface, you will see this at the bottom of the output when running inspect:
{
    "Status": "ready",
    "Addr": <your_workers_external_ip_addr>
}
If you verified that everything has joined correctly but the issue still persists, you can try launching your services with the additional flag --dns-option use-vc when running docker service create, like so:
docker service create --dns-option use-vc --network <my_overlay> ...
Lastly, if all of the above fails for you as it did for me, then you can expose the port of the running service you wish to connect to in the swarm. In my case, I wished to connect the services on my worker nodes to RabbitMQ and Redis on my manager node, so I exposed those services' ports. You can do this at creation by running:
docker service create -p <port>:<port> ...
Or after the service has been launched, by running:
docker service update --publish-add <port>:<port> <service_name>
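For the RabbitMQ and Redis services above, that would look like the following (the default ports 5672 and 6379 are an assumption about your setup):
docker service update --publish-add 5672:5672 rabbitmq
docker service update --publish-add 6379:6379 redis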
After this, your worker node services can connect to the manager node service by the IP address of the worker node host and the port you exposed. For example, using RabbitMQ, this would be:
pyamqp://<user>:<pass>@<worker_host_ip_addr>:<port>/<vhost>
Hopefully this helps someone who stumbles on this post.
I have a problem connecting to remote database instances from a docker container.
I have a Docker container with a simple piece of Python code that connects to a remote MongoDB instance:
from pymongo import MongoClient

client = MongoClient('mongodb-example_conn_string.com')
db = client.test_db
collection = db.test_collection
print(collection.find_one())
I can run this piece of code from my host machine (a laptop running Linux Mint 20) and it prints the result as expected.
When I build a Docker image (based on python:3.6.10-alpine) for this script and docker run the image, I get an error message. The container is running on the host laptop.
e.g.
docker build . -t py_connection_test
docker run --rm py_connection_test run
I get this error:
pymongo.errors.ServerSelectionTimeoutError: mongodb-example_conn_string.com:27017: [Errno -2] Name does not resolve, Timeout: 30s, Topology Description: <TopologyDescription id: 60106f40288b81e007fe75a8, topology_type: Single, servers: [<ServerDescription ('mongodb-example_conn_string.com', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('mongodb-example_conn_string.com:27017: [Errno -2] Name does not resolve',)>]>
The MongoDB remote instance is an internal database at work, and a VPN (using OpenVPN) is required to access it. I've used traceroute on the host machine and in the Docker container to confirm that network traffic is routed through the VPN; all seems to be fine there.
I've tried docker run with the flag
--network="host"
but the same thing happens.
I'm scratching my head at this. Why does the same connection URL not work in both cases? Is there something simple I've missed?
I've figured out the issue, thanks to Max for pointing me to look into DNS.
My problem was a faulty /etc/resolv.conf file on my host machine that the Docker container was picking up. It contained 2 nameserver entries.
In my case I could create the file /etc/docker/daemon.json on my host and add my DNS entry there for the container to pick up when run, e.g. adding the lines:
{
    "dns": ["172.31.0.2"]
}
Editing or creating this file requires a Docker service restart.
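On a systemd-based host, that restart is typically:
sudo systemctl restart docker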
I got some helpful hints from https://l-lin.github.io/post/2018/2018-09-03-docker_ubuntu_18_dns/
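To see which DNS configuration a container actually picks up, it can help to compare the host's file with the container's view (using the same base image as above):
cat /etc/resolv.conf
docker run --rm python:3.6.10-alpine cat /etc/resolv.conf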
If you are not using DNS to specify the connection but are using a VPN to reach the resource and run into this issue, it is most likely related to a Docker network covering the IP range of your VPN; see this GitHub issue for more details.
For a temporary solution, try docker network prune. If that does not help, try stopping all containers and then pruning; and if that still does not help, either try a full Docker restart followed by a prune (see the commands below), or move on to the next step.
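As concrete commands, that escalation looks roughly like this (the container-stopping step assumes it's safe to stop everything):
docker network prune
docker stop $(docker ps -q) && docker network prune
sudo systemctl restart docker && docker network prune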
For a permanent (or at least longer-lasting) solution, check this guide (it will require restarting the Docker daemon).
I'm using an SQL server and RabbitMQ as the result backend/broker for Celery workers. Everything works fine, but in the future we plan to use several remote workers on different machines that need to monitor this broker/backend. The problem is that you need to provide direct access to your broker and database URLs, which opens up many security risks. Is there a way to provide a remote Celery worker the remote broker/database via SSH?
It seems like SSH port forwarding is working, but I still have some reservations.
My plan works as follows:
1. Port forward both the remote database and the broker to local ports (via autossh) on the remote Celery workers' machine.
2. Have the Celery workers consume tasks and write to the remote database through those forwarded local ports (a sketch follows below).
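A minimal sketch of that plan, with placeholder hosts and the default RabbitMQ (5672) and SQL Server (1433) ports assumed:
# on the remote worker machine: keep the tunnels alive with autossh
autossh -M 0 -f -N \
    -L 5672:localhost:5672 \
    -L 1433:localhost:1433 \
    me@central.example.com
# the workers then point at the forwarded local ports, e.g.
# broker='pyamqp://user:pass@localhost:5672//'
# backend='db+mssql+pyodbc://user:pass@localhost:1433/celery_results'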
Is this implementation bad? No one seems to use remote Celery workers like this.
Any alternative answers will be appreciated.
I am running an AWS instance with 2 CPUs, 8 GB RAM, and 450 Mbps bandwidth, with a Docker container that holds a Python application.
The container load average is around 6.0 during the day when Python is running. After the container has been up for about 10 hours, the host machine and container are still running, but the container fails to connect to any domain by name; it can still connect by IP address. The host machine's DNS also still works fine.
Here are the details:
`nslookup google.com` results:
`nslookup: isc_socket_bind: address in use`
I'm aware that running at a ~6.0 load average can lead to many problems, but in my case the DNS problem keeps happening over time, so I need to understand why before upgrading the AWS instance.
This is solved.
wc -l /proc/net/udp  # ~16000 hanging connections
I then needed to stop the Python application from establishing so many UDP connections. It was actually the logging component with SysLogHandler, which implicitly opens UDP sockets.
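A quick way to spot such a leak, building on the command above (assuming ss is available and you have root access):
# total UDP sockets on the host
wc -l /proc/net/udp
# UDP sockets held by python processes
sudo ss -uapn | grep -c python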
Use case:
I'm behind a firewall, and I have a remote Spark cluster I can access; however, those machines cannot connect directly to me.
As the Spark docs state, it is necessary for the workers to be able to reach the driver program:
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
The suggested solution is to have a server process running on the cluster that listens for RPCs and executes the Spark driver program locally itself.
Does such a program already exist? Such a process should manage one or more RPCs, returning exceptions and handling logs.
Also, in that case, is it my local program or the Spark driver that has to create the SparkContext?
Note:
I have a standalone cluster
Solution 1:
A simple way would be to use cluster mode (similar to --deploy-mode cluster) for the standalone cluster; however, the docs say:
Currently, standalone mode does not support cluster mode for Python applications.
Just a few options:
Connect to a cluster node using ssh, start screen, submit the Spark application, and come back to check the results.
Deploy middleware like Job Server, Livy or Mist on your cluster, and use it for submissions.
Deploy notebook (Zeppelin, Toree) on your cluster and submit applications from the notebook.
Set a fixed spark.driver.port and ssh forward all connections through one of the cluster nodes, using its IP as spark.driver.bindAddress (a sketch follows below).
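A hedged sketch of that last option, with placeholder names (gateway.example.com is a cluster node reachable from my machine; here the driver binds locally and advertises the node via spark.driver.host, and further ports such as the block manager port may need the same forwarding):
# reverse-forward the fixed driver port from the cluster node to this machine
ssh -N -R 35000:localhost:35000 me@gateway.example.com &
# run the driver locally, advertising the cluster node's address
spark-submit \
    --master spark://master.example.com:7077 \
    --conf spark.driver.port=35000 \
    --conf spark.driver.host=gateway.example.com \
    --conf spark.driver.bindAddress=localhost \
    my_app.py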