Why does the Spark driver read local files - python

I use Spark Cluster Standalone.
The master and single slave are in the same server (server B).
I use Luigi (on Server A) to submit and deploy my application (client mode).
My application reads local files on Server B. However, it also tries to read those files on Server A. Why?
sc.textFile('/path/to/the/file/*')

In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster.
Because your driver runs on Server A in client mode, the local path is resolved there as well. You should use cluster mode, so the driver runs inside the cluster on Server B.
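Another way to avoid the ambiguity (a sketch going beyond the original answer, with a hypothetical helper and placeholder host/port) is to point sc.textFile at storage that every node resolves identically, such as HDFS:

```python
def as_shared_path(path, scheme="hdfs", host="serverB", port=9000):
    # Hypothetical helper: qualify a bare local path with a shared-storage
    # URI so the driver and every executor read the same files.
    return f"{scheme}://{host}:{port}{path}"

# With pyspark this would be used as (requires a running SparkContext `sc`):
# rdd = sc.textFile(as_shared_path('/path/to/the/file/*'))
```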

Related

Python celery backend/broker access via ssh

I'm using an SQL server and RabbitMQ as the result backend/broker for Celery workers. Everything works fine, but in the future we plan to use several remote workers on different machines that need to reach this broker/backend. The problem is that you need to provide direct access to your broker and database URLs, which opens up many security risks. Is there a way to give a remote Celery worker access to the remote broker/database via SSH?
SSH port forwarding seems to work, but I still have some reservations.
My plan works as follows:
Port-forward both the remote database and the broker to local ports (via autossh) on the remote Celery worker's machine.
The Celery workers then consume tasks and write to the remote database through those forwarded local ports.
Is this implementation bad? No one seems to use remote Celery workers like this.
Any alternative approach would be appreciated.
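The plan above can be sketched as follows. Hostnames, credentials, and ports are placeholders, the tunnel command is only an illustration, and the final Celery line assumes the celery package is installed:

```python
# Assumes the tunnels are already up on the worker machine, e.g. something like:
#   autossh -M 0 -f -N -L 5672:broker-host:5672 -L 1433:db-host:1433 user@gateway

def local_url(scheme, user, password, port, path=""):
    # Build a broker/backend URL that points at a forwarded local port,
    # so the worker never needs direct network access to the real hosts.
    return f"{scheme}://{user}:{password}@127.0.0.1:{port}/{path}"

broker_url = local_url("amqp", "guest", "guest", 5672)
backend_url = local_url("db+pyodbc", "sa", "secret", 1433, "celerydb")
# app = Celery("tasks", broker=broker_url, backend=backend_url)  # requires celery
```

The worker only ever talks to 127.0.0.1, so nothing beyond the SSH port needs to be exposed.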

Understanding smb and DCERPC for remote command execution capabilities

I'm trying to understand all the methods available to execute remote commands on Windows through the impacket scripts:
https://www.coresecurity.com/corelabs-research/open-source-tools/impacket
https://github.com/CoreSecurity/impacket
I understand the high-level explanation of psexec.py and smbexec.py: how they create a service on the remote end and run commands through cmd.exe /c. But I can't understand how you can create a service on a remote Windows host through SMB. Wasn't SMB supposed to be mainly for file transfers and printer sharing? Reading the source code, I see in the notes that they use DCERPC to create these services; is this part of the SMB protocol? All the resources on DCERPC I've found were kind of confusing, and not focused on its service-creation capabilities. Looking at the source code of atexec.py, it says that it interacts with the Task Scheduler service of the Windows host, also through DCERPC. Can it be used to interact with all services running on the remote box?
Thanks!
DCERPC (https://en.wikipedia.org/wiki/DCE/RPC) is the original protocol, which served as the template for MSRPC (https://en.wikipedia.org/wiki/Microsoft_RPC).
MSRPC is a way to execute functions on the remote end and to transfer data (the parameters to those functions). It is not a way to directly execute OS commands on the remote side.
SMB (https://en.wikipedia.org/wiki/Server_Message_Block) is the file-sharing protocol mainly used to access files on Windows file servers. In addition, it provides Named Pipes (https://msdn.microsoft.com/en-us/library/cc239733.aspx), a way to transfer data between a local process and a remote process.
One common way to use MSRPC is via Named Pipes over SMB, which has the advantage that the security layer provided by SMB applies directly to MSRPC as well.
In fact, MSRPC is one of the most important, yet least known, protocols in the Windows world.
Neither MSRPC nor SMB has anything to do with remote execution of shell commands.
One common way to execute remote commands is:
Copy a file (the Windows service EXE) to the remote side via SMB.
Create registry entries on the remote side (so that the copied Windows service is installed and startable).
Start the Windows service.
The started Windows service can use any network protocol (e.g. MSRPC) to receive commands and execute them.
After the work is done, the Windows service can be uninstalled (remove the registry entries and delete the files).
In fact, this is what PSEXEC does.
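To make "MSRPC via Named Pipes over SMB" concrete: such an RPC endpoint is addressed by a string binding that names the pipe. The helper below only assembles that string; the commented lines are an assumption about how impacket-style code would connect, not verbatim impacket usage:

```python
def ncacn_np_binding(host, pipe):
    # MSRPC-over-named-pipe endpoint, e.g. ncacn_np:10.0.0.5[\pipe\svcctl]
    return rf"ncacn_np:{host}[\pipe\{pipe}]"

# "svcctl" is the Service Control Manager pipe that psexec.py drives to
# install and start its service; "atsvc" is the Task Scheduler pipe that
# atexec.py uses. With impacket installed, connecting would look roughly like:
#   from impacket.dcerpc.v5 import transport
#   rpc = transport.DCERPCTransportFactory(ncacn_np_binding("10.0.0.5", "svcctl"))
```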
All the resources on DCERPC I've found were kind of confusing, and not focused on its service-creation capabilities.
Yes, it's just a remote procedure call protocol. But it can be used to start a procedure on the remote side that can do just about anything, e.g. create a service.
Looking at the source code of atexec.py, it says that it interacts with the Task Scheduler service of the Windows host, also through DCERPC. Can it be used to interact with all services running on the remote box?
There are specific MSRPC commands that handle the Task Scheduler, and others that handle generic service start and stop operations.
A few final words:
SMB/CIFS and the protocols around it are really complex and hard to understand. It's fine to try to understand how to deal with, e.g., remote service control, but it can be a very long journey.
Perhaps this page (which uses Java to try to control a Windows service) may also aid understanding:
https://dev.c-ware.de/confluence/pages/viewpage.action?pageId=15007754

connect local python script to remote spark master

I am using Python 2.7 with a Spark standalone cluster.
When I start the master on the same machine that runs the Python script, it works smoothly.
When I start the master on a remote machine and try to create a SparkContext on the local machine to access the remote Spark master, nothing happens, and I get a message saying that the task did not get any resources.
When I access the master's UI, I see the job, but nothing happens with it; it's just there.
How do I access a remote Spark master from a local Python script?
Thanks.
EDIT:
I read that in order to do this I need to run the cluster in cluster mode (not client mode), and I found that currently standalone mode does not support this for Python applications.
Ideas?
In order to do this I need to run the cluster in cluster mode (not client mode), and I found here that currently standalone mode does not support this for Python applications.

Submit jobs to Apache-Spark while being behind a firewall

Use case:
I'm behind a firewall, and I have a remote Spark cluster I can access; however, those machines cannot connect directly to me.
As the Spark docs state, the workers need to be able to reach the driver program:
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
The suggested solution is to have a server process running on the cluster that listens for RPCs and executes the Spark driver program locally.
Does such a program already exist? Such a process should manage one or more RPCs, returning exceptions and handling logs.
Also, in that case, is it my local program or the Spark driver that has to create the SparkContext?
Note:
I have a standalone cluster
Solution 1:
A simple way would be to use cluster mode (i.e. --deploy-mode cluster) for the standalone cluster; however, the docs say:
Currently, standalone mode does not support cluster mode for Python applications.
Just a few options:
Connect to a cluster node using ssh, start a screen session, submit the Spark application, and come back later to check the results.
Deploy middleware like Spark Job Server, Livy, or Mist on your cluster, and use it for submissions.
Deploy a notebook (Zeppelin, Toree) on your cluster and submit applications from the notebook.
Set a fixed spark.driver.port and ssh-forward all connections through one of the cluster nodes, using its IP as spark.driver.bindAddress.
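The last option can be sketched like this. The port numbers, tunnel command, and host name are placeholders, and the helper only renders spark-submit flags from a config dict:

```python
# Hypothetical fixed-port configuration for tunnelling; these ports would be
# forwarded through a cluster node, e.g. with something like:
#   ssh -N -R 40000:localhost:40000 -R 40001:localhost:40001 user@cluster-node
driver_conf = {
    "spark.driver.port": "40000",
    "spark.driver.blockManager.port": "40001",
    "spark.driver.bindAddress": "0.0.0.0",
}

def conf_flags(conf):
    # Render a dict of Spark settings as spark-submit --conf arguments.
    return [arg for key in sorted(conf) for arg in ("--conf", f"{key}={conf[key]}")]

# spark-submit would then be invoked roughly as:
#   spark-submit <flags> app.py, where <flags> = " ".join(conf_flags(driver_conf))
```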

Execute a Hadoop Job in remote server and get its results from python webservice

I have a Hadoop job packaged in a jar file that I can execute on a server from the command line, storing the results in that server's HDFS.
Now I need to create a web service in Python (Tornado) that must execute the Hadoop job and fetch the results to present them to the user. The web service is hosted on a different server.
I googled a lot for how to call the job from outside the server in a Python script, but unfortunately found no answers.
Does anyone have a solution for this?
Thanks
One option could be to install the Hadoop binaries on your webservice server, using the same configuration as in your Hadoop cluster. You need that to be able to talk to the cluster; you don't need to launch any Hadoop daemons there. At a minimum, configure HADOOP_HOME, HADOOP_CONF_DIR, HADOOP_LIBS, and set the PATH environment variable properly.
You need the binaries because you will use them to submit the job, and the configuration tells the Hadoop client where the cluster is (the NameNode and the ResourceManager).
Then in python you can execute the hadoop jar command using subprocess: https://docs.python.org/2/library/subprocess.html
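A minimal sketch of that subprocess call; the jar path, main class, and arguments are placeholders, and it assumes the hadoop binary is on the webservice server's PATH:

```python
import subprocess

def build_hadoop_cmd(jar, main_class, *args):
    # Assemble the `hadoop jar` command line as an argument list.
    return ["hadoop", "jar", jar, main_class, *args]

def run_job(jar, main_class, *args):
    # Blocks until the job finishes; raises CalledProcessError on failure.
    result = subprocess.run(build_hadoop_cmd(jar, main_class, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example invocation (placeholder names):
# run_job("wordcount.jar", "WordCount", "/input", "/output")
```

Passing the command as a list (not a shell string) avoids quoting issues with paths and arguments.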
You can configure the job to notify your server when it has finished, using a callback: https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
And finally, you could read the results from HDFS using WebHDFS (the HDFS web API) or a Python HDFS package like https://pypi.python.org/pypi/hdfs/
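For the WebHDFS route, a sketch using only the standard library; the NameNode host and result path are placeholders, and 50070 is the classic default HTTP port (newer clusters often use 9870):

```python
import urllib.request

def webhdfs_open_url(namenode, path, port=50070):
    # WebHDFS OPEN operation; the NameNode redirects to a DataNode
    # that streams the file contents.
    return f"http://{namenode}:{port}/webhdfs/v1{path}?op=OPEN"

# Reading a result file (requires network access to the cluster):
# data = urllib.request.urlopen(webhdfs_open_url("namenode-host", "/results/part-r-00000")).read()
```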
