Submit jobs to Apache-Spark while being behind a firewall

Submit jobs to Apache-Spark while being behind a firewall - python

Usecase:
I'm behind a firewall, and I have a remote spark cluster I can access to, however those machines cannot connect directly to me.
As Spark doc states it is necessary for the worker to be able to reach the driver program:
Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network.
If you’d like to send requests to the cluster remotely, it’s better to
open an RPC to the driver and have it submit operations from nearby
than to run a driver far away from the worker nodes.
The suggested solution is to have a server process running on the cluster listening to RPC and let it execute itself the spark driver program locally.
Does such a program already exists? Such a process should manage 1+ RPC, returning exceptions and handling logs.
Also in that case, is my local program or the spark driver who has to create the SparkContext?
Note:
I have a standalone cluster
Solution1:
A simple way would be to use cluster mode (similar to --deploy-mode cluster) for the standalone cluster, however the doc says:
Currently, standalone mode does not support cluster mode for Python
applications.

Just a few options:
Connect to the cluster node using ssh, start screen, submit Spark application, go back to check the results.
Deploy middleware like Job Server, Livy or Mist on your cluster, and use it for submissions.
Deploy notebook (Zeppelin, Toree) on your cluster and submit applications from the notebook.
Set fixed spark.driver.port and ssh forward all connections through one of the cluster nodes, using its IP as spark.driver.bindAddress.

Related

Docker Swarm Failing to Resolve DNS by Service Name With Python Celery Workers Connecting to RabbitMQ Broker Resulting in Connection Timeout

Setup
I have Docker installed and connected 9 machines, 1 manager and 8 worker nodes, using Docker swarm. This arrangement has been used in our development servers for ~5 years now.
I'm using this to launch a task queue that uses Celery for Python. Celery is using RabbitMQ as its broker and Redis for the results backend.
I have created an overlay network in Docker so that all my Celery workers launched by Docker swarm can reference their broker and results backend by name; i.e., rabbitmq or redis, instead of by IP address. The network was created by running the following command:
docker network create -d overlay <network_name>
The RabbitMQ service and Redis service were launched on the manager node under this overlay network using the following commands:
docker service create --network <my_overlay_network> --name redis --constraint "node.hostname==manager" redis
docker service create --network <my_overlay_network> --name rabbitmq --constraint "node.hostname==manager" rabbitmq
Once both of these have been launched, I deploy my Celery workers, one per each Docker swarm worker node, on the same overlay network using the following command:
docker service create --network <my_overlay_network> --name celery-worker --constraint "node.hostname!=manager" --replicas 8 --replicas-max-per-node 1 <my_celery_worker_image>
Before someone suggest it, yes I know I should be using a Docker compose file to launch all of this. I'm currently testing, and I'll write up one after I can get everything working.
The Problem
The Celery workers are configured to reference their broker and backend by the container name:
app = Celery('tasks', backend='redis://redis', broker='pyamqp://guest#rabbitmq//')
Once all the services have been launched and verified by Docker, 3 of the 8 start successfully, connect to the broker and backend, and allow me to begin running task on them. The other 5 continuously time out when attempting to connect to RabbitMQ and report the following message:
consumer: Cannot connect to amqp://guest:**#rabbitmq:5672//: timed out.
I'm at my wits end trying to find out why only 3 of my worker nodes allow the connection to occur while the other 5 cause a continuous timeout. All launched services are connected over the same overlay network.
The issue persist when I attempt to use brokers other than RabbitMQ, leading me to think that it's not specific to any one broker. I'd likely have issues connecting to any service by name on the overlay network when on the machines that are reporting the timeout. Stopping the service and launching again always produces the same results - the same 3 nodes work while the other 5 timeout.
All nodes are running the same version of Docker (19.03.4, build 9013bf583a), and the machines were created from identical images. They're virtually the same. The only difference among them is their hostnames, e.g., manager, worker1, worker2, and etc.
I have been able to replicate this setup outside of Docker swarm (all on one machine) by using a bridge network instead of overlay when developing my application on my personal computer without issue. I didn't experience problems until I launched everything on our development server, using the steps detailed above, to test it before pushing it to production.
Any ideas on why this is occurring and how I can remedy it? Switching form Docker swarm to Kubernetes isn't an option for me currently.

It's not the answer I wanted, but this appears to be an on-going bug in Docker swarm. For any who are interested, I'll include the issue page.
https://github.com/docker/swarmkit/issues/1429
There's a work around listed by one user on there that may wake for some, but your mileage may vary. It didn't work for me. The work around is listed in the bullet below:
Don't try to use docker for Windows to get multi-node mesh network (swarm) running. It's simply not (yet) supported. If you google around, you find some Microsoft blogs telling about it. Also the docker documentation mentions it somewhere. It would be nice, if docker cmd itself would print an error/warning when trying to set something up under Windows - which simply doesn't work. It does work on a single node though.
Don't try to use a Linux in a Virtualbox under Windows and hoping to workaround with it. It, of course, doesn't work since it has the same limitations as the underlying Windows.
Make sure you open ports at least 7946 tcp/udp and 4789 udp for worker nodes. For master also 2377 tcp. Use e.g. netcat -vz -u for udp check. Without -u for tcp.
Make sure to pass --advertise-addr on the docker worker node (!) when executing the join swarm command. Here put the external IP address of the worker node which has the mentioned ports open. Doublecheck that the ports are really open!
Using ping to check the DNS resolution for container names works. If you forget the --advertise-addr or not opening port 7946 results in DNS resolution not working on worker nodes!
I suggest attempting all of the above first if you encounter the same issue. To clarify a few things in the above bullet points, the --advertise-addr flag should be used on a worker node when joining it to the swarm. If your worker node doesn't have a static IP address, you can use the interface to connect it. Run ifconfig to view your interfaces. You'll need to use the interface that has your external facing IP address. For most people, this will probably be eth0, but you should still check before running the command. Doing this, the command you would issue on the worker is:
docker swarm join --advertise-addr eth0:2377 --token <your_token> <manager_ip>:2377
With 2377 being the port Docker uses. Verify that you joined with your correct IP address by going into your manager node and running the following:
docker node inspect <your_node_name>
If you don't know your node name, it should be the host name of the machine which you joined as a worker node. You can see it by running:
docker node ls
If you joined on the right interface, you will see this at the bottom of the return when running inspect:
{
"Status": "ready",
"Addr": <your_workers_external_ip_addr>
}
If you verified that everything has joined correctly, but the issue still persist, you can try launching your services with the additional flag of --dns-option use-vc when running Docker swarm create as such:
docker swarm create --dns-option use-vc --network <my_overlay> ...
Lastly, if all the above fails for you as it did for me, then you can expose the port of the running service you wish connect to in the swarm. For me, I wished to connect my services on my worker nodes to RabbitMQ and Redis on my manager node. I did so by exposing the services port. You can do this at creation by running:
docker swarm create -p <port>:<port> ...
Or after the services has been launched by running
docker service update --publish-add <port>:<port> <service_name>
After this, your worker node services can connect to the manager node service by the IP address of the worker node host and the port you exposed. For example, using RabbitMQ, this would be:
pyamqp://<user>:<pass>#<worker_host_ip_addr>:<port>/<vhost>
Hopefully this helps someone who stumbles on this post.

Google Dataflow warnings when using Service & Host Projects, Shared VPCs & Firewall Rules

I currently run a Cloud Dataflow Streaming Job via the Python SDK.
To explain my setup:
one service project, "miles-qa"
hosts the dataflow job
one host project, "miles-qa-networking"
hosts the network, "miles-qa-vpc"
hosts the subnetwork the dataflow job targets
For example, some of the args I'm calling my job with:
--project <host project name>
--network <shared vpc project name>
--subnetwork "https://www.googleapis.com/compute/v1/projects/<shared vpc project name>/regions/<region job is running in service project>/subnetworks/<name of subnetwork in shared vpc project>"
--service_account_email=<service account with Compute Network User permission for both projects, shared vpc network & subnetwork>
When running, I receive the following error:
The network miles-qa-vpc doesn't have rules that open TCP ports 1-65535 for internal connection with other VMs. Only rules with a target tag 'dataflow' or empty target tags set apply. If you don't specify such a rule, any pipeline with more than one worker that shuffles data will hang. Causes: No firewall rules associated with your network.
However, firewall rules are configured to allow access.
I believe this an issue outside of the Apache Beam SDK; this error is not found in the GitHub for the DataflowRunner or ApacheBeam Python(or Java) SDK.
As suggested in the warning, the job will hang once we perform larger shuffles across nodes.
I've tried the following:
Only passing "--subnetwork" without "--network" only modifies the warning to state "default" instead of "miles-qa-vpc", which sounds like a logging error to me.
Firewall rules have been configured to:
allow all traffic
allow all internal traffic
allow all traffic with the source tag 'dataflow'
allow all traffic with the target tag 'dataflow'
Service Account has been configured to have Compute Network User permissions in both projects.
Ensured subnetwork is in the same region as the job.
Network in the service project is happily serving a dedicated cluster for other purposes in the host project.
It genuinely seems like the spawned Compute Instances are not gaining the configuration.
I expect the Dataflow job not to report the firewall issue and successfully deal with shuffling (GroupBys etc.)

Understanding smb and DCERPC for remote command execution capabilities

I'm trying to understand all the methods available to execute remote commands on Windows through the impacket scripts:
https://www.coresecurity.com/corelabs-research/open-source-tools/impacket
https://github.com/CoreSecurity/impacket
I understand the high level explanation of psexec.py and smbexec.py, how they create a service on the remote end and run commands through cmd.exe -c but I can't understand how can you create a service on a remote windows host through SMB. Wasn't smb supposed to be mainly for file transfers and printer sharing? Reading the source code I see in the notes that they use DCERPC to create this services, is this part of the smb protocol? All the resources on DCERPC i've found were kind of confusing, and not focused on its service creating capabilities. Looking at the sourcecode of atexec.py, it says that it interacts with the task scheduler service of the windows host, also through DCERPC. Can it be used to interact with all services running on the remote box?
Thanks!

DCERPC (https://en.wikipedia.org/wiki/DCE/RPC) : the initial protocol, which was used as a template for MSRPC (https://en.wikipedia.org/wiki/Microsoft_RPC).
MSRPC is a way to execute functions on the remote end and to transfer data (parameters to these functions). It is not a way to directly execute remote OS commands on the remote side.
SMB (https://en.wikipedia.org/wiki/Server_Message_Block ) is the file sharing protocol mainly used to access files on Windows file servers. In addition, it provides Named Pipes (https://msdn.microsoft.com/en-us/library/cc239733.aspx), a way to transfer data between a local process and a remote process.
One common way for MSRPC is to use it via Named Pipes over SMB, which has the advantage that the security layer provided by SMB is directly approached for MSRPC.
In fact, MSRPC is one of the most important, yet very less known protocols in the Windows world.
Neither MSRPC, nor SMB has something to do with remote execution of shell commands.
One common way to execute remote commands is:
Copy files (via SMB) to the remote side (Windows service EXE)
Create registry entries on the remote side (so that the copied Windows Service is installed and startable)
Start the Windows service.
The started Windows service can use any network protocol (e.g. MSRPC) to receive commands and to execute them.
After the work is done, the Windows service can be uninstalled (remove registry entries and delete the files).
In fact, this is what PSEXEC does.
All the resources on DCERPC i've found were kind of confusing, and not
focused on its service creating capabilities.
Yes, It’s just a remote procedure call protocol. But it can be used to start a procedure on the remote side, which can just do anything, e.g. creating a service.
Looking at the sourcecode of atexec.py, it says that it interacts with
the task scheduler service of the windows host, also through DCERPC.
Can it be used to interact with all services running on the remote
box?
There are some MSRPC commands which handle Task Scheduler, and others which handle generic service start and stop commands.
A few final words at the end:
SMB / CIFS and the protocols around are really complex and hard to understand. It seems ok trying to understand how to deal with e.g. remote service control, but this can be a very long journey.
Perhaps this page (which uses Java for trying to control Windows service) may also help understanding.
https://dev.c-ware.de/confluence/pages/viewpage.action?pageId=15007754

connect local python script to remote spark master

I am using python 2.7 with spark standalone cluster.
When I start the master on the same machine running the python script. it works smoothly.
When I start the master on a remote machine, and try to start spark context on the local machine to access the remote spark master. nothing happens and i get a massage saying that the task did not get any resources.
When i access the master's UI. i see the job, but nothing happens with it, it's just there.
How do i access a remote spark master via a local python script?
Thanks.
EDIT:
I read that in order to do this i need to run the cluster in cluster mode (not client mode), and I found that currently standalone mode does not support this for python application.
Ideas?

in order to do this i need to run the cluster in cluster mode (not client mode), and I found here that currently standalone mode does not support this for python application.

Why Spark Driver read local file

I use Spark Cluster Standalone.
The master and single slave are in the same server (server B).
I use Luigi (on Server A) to submit my application and deploy (client mode).
My application read local files on Server B. However, the application tries to read the files also on the server A. Why ?
sc.textFile('/path/to/the/file/*')

In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster.
You should use cluster mode.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.