OpenMPI with mpi4py doesn't work on multiple nodes - Python

I have a parallel Python program written with mpi4py and I'm trying to make it distributed. I set up a virtual machine, installed OpenMPI and an OpenSSH server, exchanged keys, and all that. On the local machine I have this hostfile:
127.0.0.1 slots=4
192.168.1.104 slots=2
and I try to run the program with:
mpirun -np 2 --hostfile hostfile python2 algen.py 0.85 0.02 20 70
but I get the following error:
[Kreutz:13090] tcp_peer_recv_connect_ack: invalid header type: 0
ORTE was unable to reliably start one or more daemons. This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
And I don't know what to do now. Do you have any ideas about what I could try?
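For reference, even a minimal mpi4py script (a sketch like the one below, not my actual algen.py) should fail the same way, since the error occurs while the ORTE daemons are starting, before any Python code runs:

# hello_mpi.py - minimal sanity check (hypothetical file name)
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Report rank and host so you can see which nodes actually started
print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

run with, e.g.:

mpirun -np 2 --hostfile hostfile python2 hello_mpi.py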

Related

remote debug python on aarch64 edge device

I wish to debug python code which is deployed remotely on an edge device. The device has aarch64 cpu architecture and does not have connectivity to internet (It serves as an access point itself).
I tried debugging using VS Code Remote-SSH, but that doesn't work. I tried a few versions of the plugin (0.49, 0.51) and also changed settings such as Remote.SSH: Use Local Server, but it always fails, since the host can't run wget/curl to download data.
I also tried using pydevd and debugpy, but it seems they do not support aarch64.
Any suggestions on how to use one of the above, or another tool to get the job done, are most appreciated.
To answer my own question,
I was able to clone debugpy and do the following changes:
deleted:
src/debugpy/_vendored/pydevd/pydevd_attach_to_process/attach_linux_amd64.so
src/debugpy/_vendored/pydevd/pydevd_attach_to_process/attach_linux_x86.so
modified:
src/debugpy/_vendored/pydevd/pydevd_attach_to_process/linux_and_mac/compile_linux.sh
src/debugpy/_vendored/pydevd/pydevd_attach_to_process/linux_and_mac/compile_mac.sh
(commented out the lines for building amd64/x86)
I was then able to build and manually deploy the Python site package.
When running it on the device, there is a warning about amd64, but it seems to be working.
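For anyone else going this route: once the rebuilt package is installed on the device, attaching follows the standard debugpy flow. A minimal sketch (the port and addresses are placeholders):

import debugpy

# Listen on all interfaces so the IDE on another machine can reach the device
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for debugger to attach...")
debugpy.wait_for_client()  # block until the IDE connects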

Local matplotlib display in PyCharm with remote ssh interpreter from Google Cloud Platform's Compute Engine

I am currently running Python 3.5 scripts on two VM instances on GCP from a local PyCharm session running on my Mac (see below for detailed environment specifications).
I have two different projects in GCP which look similar. I reviewed their setup with our cloud admin and we can't see any major difference, at least any trivial one. I created two Deep Learning images on GCP using the following cloud SDK command line, one within each project:
export PROJECT=[MY_PROJECT_NAME]
export INSTANCE_ROOT=$USER-vm
export ZONE=europe-west4-a
export IMAGE_FAMILY=tf-latest-gpu
export INSTANCE_TYPE=n1-highmem-8
export GPU_TYPE=v100
export GPU_COUNT=1
export INSTANCE_NAME=$INSTANCE_ROOT-$GPU_TYPE-$GPU_COUNT
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator=type=nvidia-tesla-$GPU_TYPE,count=$GPU_COUNT \
--machine-type=$INSTANCE_TYPE \
--boot-disk-size=200GB \
--metadata=install-nvidia-driver=True \
--scopes=storage-rw
Both images are identical.
I configured two remote ssh interpreters in PyCharm and deployed my Python code on both virtual machines. Everything is identical in terms of VM instance configuration (OS, Python version/libs, source code, etc.) and PyCharm remote interpreter configuration.
In both cases, the ssh ingress connection to the instance (on port 22) works pretty well.
Yet, when calling plt.show() to display images using matplotlib, the images get displayed in one setup but not in the other one.
This is not a matter of setting the proper ssh configuration (-X option on the command line, X11Forwarding, etc.). I already checked that, and anyway one of my VMs does a pretty good job of displaying my images within this configuration.
I debugged the execution and discovered that PyCharm automatically handles X display by implementing its own matplotlib FigureCanvas. When in remote ssh, the show() function actually opens a socket on the defined host (i.e. my local Mac) and sends the buffer to be displayed:
sock = socket.socket()        # created by PyCharm's FigureCanvas on the remote side
sock.connect((HOST, PORT))    # HOST/PORT point back to the local machine
[..]
sock.send(buffer)             # the rendered figure buffer
This is precisely where my two configurations diverge:
The working one tries to connect to localhost:53725 and succeeds:
<socket.socket fd=28, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 42316), raddr=('127.0.0.1', 53725)>
The failing one also tries to connect to localhost:53725, but gets an exception.
My strongest assumption is that some network configuration in the two GCP projects differs somehow and prevents the connection on localhost:53725 for the second one.
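One way to test that assumption by hand (a sketch; the port number comes from the socket trace above) is to retry the same connection from a plain Python shell on each remote VM:

import socket

# Mimic the connection PyCharm's FigureCanvas makes; this should hit the
# remote end of the tunnel PyCharm opens back to the local machine
sock = socket.socket()
sock.connect(("127.0.0.1", 53725))  # raises an exception on the failing VM
print("connected:", sock)
sock.close()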
However, beyond that I have no idea what might happen and/or how to fix it.
Any idea / suggestion will be appreciated.
Thanks,
Laurent
--
Detailed environment specifications:
PyCharm 2018.2.4 (Professional Edition)
Build #PY-182.4505.26, built on September 19, 2018
Licensed to PyCharm Evaluator
Expiration date: October 27, 2018
JRE: 1.8.0_152-release-1248-b8 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14
Ok. It seems to be a bug and I found a workaround.
I share it as it might save hours of troubleshooting and debugging to anyone stumbling on the same problem.
The problem actually occurs when you remain in the same PyCharm session and switch from one interpreter to the other one.
If you quit PyCharm and start it again, the local display will work with whichever of the interpreters/VMs you run first. Then, if you switch to the second one, it fails.
Everything looks as if PyCharm sets some kind of lock on the port, or somewhere else, which prevents you from switching seamlessly from one interpreter to another.
I'll share these insights with the PyCharm support team. BTW, other than that, this local display feature with remote interpreters is awesome and works just fine.

Run Spyder /Python on remote server

So there are variants of this question - but none quite hit the nail on the head.
I want to run Spyder and do interactive analysis on a server. I have two servers; neither has Spyder. They both have Python (Linux servers), but I don't have sudo rights to install the packages I need.
In short, the use case is: open Spyder on the local machine, do something (need help here) to use the server's computation power, and then return results to the local machine.
Update:
I have updated Python with my packages on one server. Now I need to figure out the kernel name and link it to Spyder.
Leaving previous version of question up, as that is still useful.
The Docker approach is a little intimidating, as is paramiko. What are my options?
(Spyder maintainer here) What you need to do is create a Spyder kernel on your remote server and connect to it through SSH. That's the only facility we provide to do what you want.
You can find the precise instructions to do that in our docs.
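Roughly (a sketch from memory; check the docs for the exact steps and flags), you start a kernel on the server and point Spyder at its connection file:

# On the remote server (requires the spyder-kernels package)
python -m spyder_kernels.console

# This prints the name of a connection file, e.g. kernel-NNNN.json.
# Copy that file to your local machine, then in Spyder use
# "Consoles > Connect to an existing kernel" and fill in the SSH details.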
I did a long search for something like this in my past job, when we wanted to quickly iterate on code which had to run across many workers in a cluster. All the commercial and open source task-queue projects that I found were based on running fixed code with arbitrary inputs, rather than running arbitrary code.
I'd also be interested to see if there's something out there that I missed. But in my case, I ended up building my own solution (unfortunately not open source).
My solution was:
1) I made a Redis queue where each task consisted of a zip file with a bash setup script (for pip installs, etc), a "payload" Python script to run, and a pickle file with input data.
2) The "payload" Python script would read in the pickle file or other files contained in the zip file. It would output a file named output.zip.
3) The task worker was a Python script (running on the remote machine, listening to the Redis queue) that would unzip the file, run the bash setup script, then run the Python script. When the script exited, the worker would upload output.zip.
There were various optimizations, like the worker wouldn't run the same bash setup script twice in a row (it remembered the SHA1 hash of the most recent setup script). So, anyway, in the worst case you could do that. It was a week or two of work to set up.
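A compressed sketch of that worker loop, with hypothetical file names and assuming redis-py (this is not the original code, which wasn't open-sourced):

import hashlib
import subprocess
import zipfile

import redis

r = redis.Redis()
last_setup_hash = None

while True:
    # Block until a task zip is pushed onto the queue (here the queue
    # carries a path to the zip; the real system may have differed)
    _, task = r.blpop("tasks")
    with zipfile.ZipFile(task.decode()) as z:
        z.extractall("work")
    # Skip the setup script if it is identical to the previous one
    setup = open("work/setup.sh", "rb").read()
    digest = hashlib.sha1(setup).hexdigest()
    if digest != last_setup_hash:
        subprocess.run(["bash", "work/setup.sh"], check=True)
        last_setup_hash = digest
    # Run the payload, then upload its output.zip (upload omitted here)
    subprocess.run(["python", "work/payload.py"], check=True)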
Edit:
A second (much more manual) option, if you just need to run on one remote machine, is to use sshfs to mount the remote filesystem locally, so you can quickly edit the files in Spyder. Then keep an ssh window open to the remote machine, and run Python from the command line to test-run the scripts on that machine. (That's my standard setup for developing Raspberry Pi programs.)
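For example (user, host, and paths are placeholders):

# Mount the remote project directory locally, edit it in Spyder,
# then test-run over a separate ssh session
sshfs user@server:/home/user/project ~/remote_project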

BlueZ-4.101 on embedded ARM device

I'm implementing Bluetooth on an embedded device and have a few questions about the BlueZ protocol stack. I'm using BlueZ-4.101 (do not have the option to upgrade to BlueZ-5), and do not have Python available.
Here are my questions after spending some time looking into BlueZ:
Is bluetoothd needed in my situation? As in, is it just a daemon that handles Python D-Bus messages between user space and the kernel, or is it more? I've looked through the source and mostly find D-Bus related calls.
How does one determine the value of DeviceID in /etc/bluetooth/main.conf? I found these instructions (section 3.4), but they are for a different platform using BlueZ 5
Will sdptool work without setting the DeviceID value? I've tried the following command and receive timeouts every time (only for my local device):
# sdptool browse local
Browsing FF:FF:FF:00:00:00 ...
Service Search failed: Connection timed out
Is it viable to replace all of the Python simple-agent scripts with libbluetooth instead, or do I need to try to port them over to a supported scripting language?
Any help would be greatly appreciated!!!
If more logs are needed I can try and get them.

Host key verification failed using mpi4py

I am building an MPI application using mpi4py (1.3.1) and OpenMPI (1.8.6-1) on Arch Linux ARM (on a Raspberry Pi cluster, to be more specific). I've run my program successfully on 3 nodes (4 processes), and when trying to add a new node, here's what happens:
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
The funny thing is, the SSH keys are fine, since I'm using the same nodes: I can remove any entry of the host file, add the new node, and it will work, so I am pretty sure the problem is not a misconfigured SSH setup. It only happens when I use 5 processes.
Could this be a bug in the library of some sort?
Here's my host file:
192.168.1.26 slots=2
192.168.1.188 slots=1
#192.168.1.202 slots=1 If uncommented and run with -np 5, it will raise the error
192.168.1.100 slots=1
Thanks in advance!
I was having the same problem on a Linux x86_64 mini cluster running Fedora 22 and OpenMPI 1.8. I could SSH into any of my 5 machines from my launch machine, but when I tried to launch MPI with 3 or more nodes, it would give me an authentication error. Like you, it seemed that 3 was a magic number, and it turns out it is: OpenMPI uses a tree-based launch, so when you have more than two nodes, one or more of the intermediate nodes execute an ssh themselves. In my case, I was not using a password-less setup. I had an SSH identity on the launch machine that I had added to my key chain. It was able to launch the first two nodes because I had that authenticated identity in my key chain. Then each of those nodes tried to launch more, and those nodes did not have that key authenticated (I would have needed to add it on each of them).
So the solution appears to be moving to a password-less SSH identity setup, but you obviously have to be careful how you do that. I created a specific identity (key pair) on my launch machine. I added the key to my authorized keys on the nodes I want to use (which is easy since they are all using NFS, but you could manually distribute the key once if you need to). Then I modified my SSH config to use that password-less identity when trying to go to my node machines. My ~/.ssh/config looks like:
Host node0
HostName node0
IdentityFile ~/.ssh/passwordless_rsa
Host node1
HostName node1
IdentityFile ~/.ssh/passwordless_rsa
...
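For reference, creating and distributing such an identity is roughly (a sketch; user and host names are placeholders):

# Generate a key with an empty passphrase, then install it on each node
ssh-keygen -t rsa -f ~/.ssh/passwordless_rsa -N ""
ssh-copy-id -i ~/.ssh/passwordless_rsa.pub user@node0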
I'm sure there is some way to scale this for N nodes with wildcards. Or you could consider changing the default identity file at the system level, in the system ssh config file (I bet a similar option is available there).
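Something like this should do it, since ssh_config Host patterns accept globs (an untested sketch):

Host node*
IdentityFile ~/.ssh/passwordless_rsa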
And that did the trick. Now I can spin up all 5 nodes without any authentication issues. The flaw in my thinking was assuming that the launch node would launch all the other nodes directly; the tree-based launch means you need to chain logins, which you cannot do with a passphrase-authenticated identity, since you never get the chance to authenticate it.
Having a password-less key still freaks me out, so to keep things extra safe on these nodes connected to an open network, I changed the sshd config (system level) to deny logins to everyone except me coming from my launch node.
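For example, something along these lines in /etc/ssh/sshd_config (the names are placeholders; check sshd_config(5) for the exact pattern syntax):

# Only allow my user, and only when connecting from the launch node
AllowUsers myuser@launchnode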
