pxssh does not work between compute nodes in a slurm cluster - python

I'm using the following script for connecting two compute nodes in a slurm cluster.
from getpass import getuser
from socket import gethostname
from pexpect import pxssh
import sys

python = sys.executable
worker_command = "%s -m worker" % python + " %i " + server_socket
pid = 0
children = []
for node, ntasks in node_list.items():
    if node == gethostname():
        continue
    if node != gethostname():
        pid_range = range(pid, pid + ntasks)
        pid += ntasks
        ssh = pxssh.pxssh()
        ssh.login(node, getuser())
        for worker in pid_range:
            ssh.sendline(worker_command % worker + '&')
            children.append(ssh)
node_list is a dictionary, {'cn000': 28, 'cn001': 28}. worker is a Python module placed in the working directory.
I expected ssh.sendline to behave like pexpect.spawn. However, nothing happens after I run the script.
Although an SSH session is established by ssh.login(node, getuser()), the line ssh.sendline(worker_command % worker) seems to have no effect, because the script that worker_command should launch never runs.
How can I fix this? Or should I try something else?
How can I create a socket on one compute node and connect it to a socket on another compute node?

A '%s' seems to be missing from the content of worker_command. If it contains something like "/usr/bin/python3 -m worker", then worker_command % worker should raise an error.
If it does not (which is possible, since this source looks like a short excerpt of the original program), then add ">>workerprocess.log 2>&1" before the '&', run your program again, and take a look at workerprocess.log on the server. If your $HOME is writable on the server, you should find the error message(s) in it.
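For illustration, a minimal sketch of the inner loop with that redirection added (assuming the same node_list and worker_command as in the question); calling ssh.prompt() after each sendline makes pxssh wait for the shell prompt, so the line is actually consumed before the session is reused:
ssh = pxssh.pxssh()
ssh.login(node, getuser())
for worker in pid_range:
    # append the log redirection before backgrounding the worker
    ssh.sendline(worker_command % worker + ' >>workerprocess.log 2>&1 &')
    ssh.prompt()  # wait for the prompt so the shell actually consumes the line
    children.append(ssh)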

Related

In Python, how do I get user's remote IP (their last hop) if they're connected over SSH?

I want to detect whether the user is connected over SSH. In a terminal, the env command shows an SSH_CONNECTION line, which can be accessed in Python in one of two ways:
#python:
import os
print os.getenv("SSH_CONNECTION") #works
print os.environ.get("SSH_CONNECTION") #works
But if the user has run my program using sudo (as they will need to), env doesn't show SSH_CONNECTION, so Python can't see it:
#sudo python:
import os
print os.getenv("SSH_CONNECTION") #not set
print os.environ.get("SSH_CONNECTION") #not set
The aim is to achieve the following:
#Detect if user is over remote IP
lRemoteIP = ""  #Is set if user on SSH
lStr = os.environ.get("SSH_CONNECTION")  #Value looks like "client_ip client_port server_ip server_port"
if lStr: lRemoteIP = lStr.split()[0]  #Store user's last-hop IP
#Later on in the code, for multiple purposes:
if lRemoteIP: pass  #Do stuff (or not) depending on whether they're on SSH
How do I retrieve the SSH_CONNECTION environment variable under sudo, when it's not present in env?
Or more precisely: how can I detect whether the current session is via SSH when running under sudo?
I'm not a natural at Linuxy-type things, so be gentle with me...
[EDIT] METHOD 2: Giving up on env, I've tried the following:
pstree -ps $$ | grep "sshd("
If it returns anything, it means that the SSH daemon sits above the session; ergo, it's an SSH connection. The results show me the PIDs of the SSH daemons. Results of the pstree command:
init(1)---sshd(xxx)---sshd(xxx)---sshd(xxx)---bash(xxx)-+-grep(xxx)
But I'm struggling to get a src IP from the PID. Any ideas on this avenue?
[EDIT] METHOD 3: /var/run/utmp contains details of SSH logins. In Python:
import os
import sys
lStr=open("/var/run/utmp").read().replace('\x00','') #Remove all those null values which make things hard to read
#Get the pseudo-session ID (pts) minus the /dev/ that it starts with:
lCurSess=os.ttyname(sys.stdout.fileno()).replace('/dev/','')
#Answer is like pts/10 (pseudo-term session number 10)
#Search lStr for pts/10
lInt=lStr.find(lCurSess.replace('/dev/',''))
#Print /var/utmp starting with where it first mentions current pts:
print lStr[lInt:]
So far, so good. This gives the following results (I've changed the IP and username to USERNAME)
pts/10/10USERNAME\x9e\x958Ym\xb2\x05\x0 74\x14pts/10s/10USERNAME192.168.1.1\xbf\x958Y\xf8\xa3\r\xc0\xa88\x01
So, when it comes to extracting the IP from the file, there's some bumf in between the occurrences of pts/10 and the IP. What's the best way to parse it, given that (I reckon) the precise distance from the match to the IP will differ under different circumstances?
The OpenSSH daemon writes an entry to /var/run/utmp with the current terminal, the IP and the name of the user. Check the output of the w or who commands that parse /var/run/utmp.
It's just a question of getting the current terminal (similar to the tty command) and extracting the information you want.
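As a sketch of that idea, you could shell out to who -m (the portable spelling of "who am i"), which prints only the utmp entry for the current terminal; this assumes the usual who output format, where the remote host appears in parentheses at the end of the line:
import subprocess

# 'who -m' prints only the entry for the terminal attached to stdin, e.g.
#   USERNAME pts/10   2017-06-07 10:00 (192.168.1.1)
out = subprocess.check_output(['who', '-m']).decode()
lRemoteIP = out.rsplit('(', 1)[1].rstrip().rstrip(')') if '(' in out else ''
print(lRemoteIP)  # empty string for a local (non-SSH) session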
Use pyutmp like this:
from pyutmp import UtmpFile
import os
import sys

for utmp in UtmpFile():
    if os.ttyname(sys.stdout.fileno()) == utmp.ut_line:
        print '%s logged from %s on tty %s' % (utmp.ut_user, utmp.ut_host, utmp.ut_line)
Then filter using the ut_pid field to parse the /proc/<ut_pid>/cmdline file, which should contain:
sshd: ut_user [priv]
GOT IT AT LAST!!!
The "last" command has the list of users and their IPs! So simple.
It marks sessions as "still logged in". Filter by those,
and then filter by the current pts ID.
To get the IP for the current SSH session in Python, do this:
import os,sys,subprocess
(out, err) = subprocess.Popen(['last | grep "still logged in" | grep "' + os.ttyname(sys.stdout.fileno()).replace('/dev/','') + '"'], stdout=subprocess.PIPE, shell=True).communicate()
RemoteIP=out.split()[2].replace(":0.0","") #Returns "" if not SSH
For readability, across multiple lines:
import os,sys,subprocess
pseudoTermID = os.ttyname(sys.stdout.fileno()).replace('/dev/','')
cmdStr = 'last | grep "still logged in" | grep "'+pseudoTermID+'"'
sp = subprocess.Popen([cmdStr], stdout=subprocess.PIPE, shell=True)
(out, err) = sp.communicate()
RemoteIP = out.split()[2].replace(":0.0","") #Returns "" if not SSH

Python: Parallel execution pysphere commands

My current for loop removes snapshots from my 16 VMs one by one:
for vmName in vmList:
    snapshots = vmServer.get_vm_by_name(vmName).get_snapshots()
    for i in range(len(snapshots) - 3):
        snapshotName = snapshots[i].get_name()
        print "Deleting snapshot " + snapshotName + " of " + vmName
        vmServer.get_vm_by_name(vmName).delete_named_snapshot(snapshotName)
I need to run it in parallel (so it doesn't wait for the previous job to finish before starting the next one).
I was trying to apply multiprocessing; here's the full code:
import argparse
from pysphere import VIServer  # Tested with vCenter Server 5.5.0 and pysphere package 0.1.7
from CONFIG import *  # Contains username and password for vCenter connection, list of VM names to take snapshot
from multiprocessing.pool import ThreadPool as Pool

def purgeSnapshotStage(vmList):
    # Connect to vCenter
    vmServer = VIServer()
    vmServer.connect("VM_ADDRESS", username, password)
    snapshots = vmServer.get_vm_by_name(vmName).get_snapshots()
    for i in range(len(snapshots) - 3):
        snapshotName = snapshots[i].get_name()
        print "Deleting snapshot " + snapshotName + " of VM: " + vmName
        vmServer.get_vm_by_name(vmName).delete_named_snapshot(snapshotName)
    vmServer.disconnect()

# Get the environment to delete snapshot from command line
parser = argparse.ArgumentParser(description="Take snapshot of VMs for stage or stage2")
parser.add_argument('env', choices=("stage", "stage2", "stage3"), help="Valid value stage or stage2 or stage3")
env = parser.parse_args().env
vmList = globals()[env + "VmList"]

pool_size = 5  # your "parallelness"
pool = Pool(pool_size)
for vmName in vmList:
    pool.apply_async(purgeSnapshotStage, (vmList,))
pool.close()
pool.join()
But there is a mistake: it tries to execute the remove command only on the last VM.
I didn't find a good guide about multiprocessing, and I can't figure out how to debug it.
I need help finding the mistake.
You have an error here:
for vmName in vmList:
    pool.apply_async(purgeSnapshotStage, (vmList,))
It should be:
for vmName in vmList:
    pool.apply_async(purgeSnapshotStage, (vmName,))
And then the function header needs to match:
def purgeSnapshotStage(vmName):
Then, there might be other errors in your code.
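Put together, a minimal sketch of the corrected function and loop (reusing the pysphere calls and the ThreadPool from the question):
def purgeSnapshotStage(vmName):
    # Each worker opens its own vCenter connection
    vmServer = VIServer()
    vmServer.connect("VM_ADDRESS", username, password)
    snapshots = vmServer.get_vm_by_name(vmName).get_snapshots()
    for i in range(len(snapshots) - 3):
        snapshotName = snapshots[i].get_name()
        vmServer.get_vm_by_name(vmName).delete_named_snapshot(snapshotName)
    vmServer.disconnect()

pool = Pool(5)
for vmName in vmList:
    pool.apply_async(purgeSnapshotStage, (vmName,))  # pass the single name
pool.close()
pool.join()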
Generally: I doubt that parallelizing this will give you any performance benefit. Your bottleneck will be the VMware server; it will not get faster when you start many delete jobs at the same time.

subprocess kills child processes but not the processes the child spawns

I have been having an issue whereby I can kill the process that spawns the nodes, but the nodes themselves do not get killed. Does anyone have any suggestions for how I can do this?
Some of my latest failed attempts to accomplish this are:
node.terminate()
and
node.send_signal(signal.SIGINT)
below is the code:
from subprocess import Popen
import json
import sys
import os
import signal
import requests

FNULL = open(os.devnull, 'w')
json_data = open('nodes.json', 'r').read()
data = json.loads(json_data)
port = data['port']

# launch hub
hub = Popen('java -jar selenium-server-standalone-2.37.0.jar -role hub -port %s' % port, stdout=FNULL, stderr=FNULL, shell=True)

# launch nodes
nodes = []
for node in data['nodes']:
    options = ''
    if node['name'] == 'CHROME':
        options += '-Dwebdriver.chrome.driver=../grid/chromedriver '
    #options += ' -browser browserName='+node['name']+' maxInstances='+str(node['maxInstances'])
    nodes.append(Popen('java -jar selenium-server-standalone-2.37.0.jar -role node -hub http://localhost:%i/grid/register %s' % (port, options), stdout=FNULL, stderr=FNULL, shell=True))

# wait for user input
print "type 'q' and ENTER to close the grid:"
while True:
    line = sys.stdin.readline()
    if line == 'q\n':
        break

# close nodes
for node in nodes:
    #node.terminate()
    node.send_signal(signal.SIGINT)

# close hub
r = requests.get('http://localhost:'+str(port)+'/lifecycle-manager?action=shutdown')
As far as I'm aware, I'm basically forced to use shell=True to get the redirections to work.
Processing the child's stdout/stderr in the parent Python process is not an option, since I couldn't find a way to do it without blocking (and the parent Python process must do other things while the child is running).
# close nodes
for node in nodes:
    node.send_signal(signal.SIGINT)
    node.terminate()
This seems to kill all the processes except one of the nodes, and it's not always the same one.
You could try using os.killpg. This function sends the signal to the whole process group; it should work as long as your processes do not change their process group.
import os
import signal
os.killpg(os.getpgid(pid), signal.SIGINT)
Note that the process group changes if you create the process under a shell (bash, zsh, etc.); in that case a more complicated technique is needed.
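A minimal sketch of that more complicated technique, applied to the node-launching code above: start each child in its own session with preexec_fn=os.setsid, so the shell and everything it spawns share a process group that can be signalled as a whole:
import os
import signal
from subprocess import Popen

# os.setsid runs in the child just before exec: the child becomes a session
# leader, so its pid equals its pgid, and the java process it spawns inherits it
node = Popen('java -jar selenium-server-standalone-2.37.0.jar -role node',
             shell=True, preexec_fn=os.setsid)

# later, signal the whole group instead of just the shell
os.killpg(os.getpgid(node.pid), signal.SIGINT)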

how to get console output from a remote computer (ssh + python)

I have googled "python ssh". There is a wonderful module, pexpect, which can access a remote computer using ssh (with a password).
After the remote computer is connected, I can execute other commands. However, I cannot get the result back in Python.
p = pexpect.spawn("ssh user@remote_computer")
print "connecting..."
p.waitnoecho()
p.sendline(my_password)
print "connected"
p.sendline("ps -ef")
p.expect(pexpect.EOF) # this will take very long time
print p.before
How to get the result of ps -ef in my case?
Have you tried an even simpler approach?
>>> from subprocess import Popen, PIPE
>>> stdout, stderr = Popen(['ssh', 'user@remote_computer', 'ps -ef'],
...                        stdout=PIPE).communicate()
>>> print(stdout)
Granted, this only works because I have ssh-agent running preloaded with a private key that the remote host knows about.
child = pexpect.spawn("ssh user@remote_computer ps -ef")
print "connecting..."
i = child.expect(['user@remote_computer\'s password:'])
child.sendline(user_password)
i = child.expect([' .*'])  # or use i = child.expect([pexpect.EOF])
if i == 0:
    print child.after  # uncomment when using the [' .*'] pattern
    #print child.before  # uncomment when using the EOF pattern
else:
    print "Unable to capture output"
Hope this helps.
You might also want to investigate paramiko, which is another SSH library for Python.
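A minimal paramiko sketch (the host name and credentials are placeholders; this assumes password authentication):
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # don't fail on unknown hosts
client.connect('remote_computer', username='user', password=my_password)
stdin, stdout, stderr = client.exec_command('ps -ef')
print(stdout.read())
client.close()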
Try to send
p.sendline("ps -ef\n")
IIRC, the text you send is interpreted verbatim, so the other computer is probably waiting for you to complete the command.

Check to see if python script is running

I have a Python daemon running as part of my web app. How can I quickly check (using Python) whether my daemon is running and, if not, launch it?
I want to do it that way to fix any crashes of the daemon, so the script does not have to be run manually: it will automatically run as soon as it is called and then stay running.
How can I check (using Python) if my script is running?
A technique that is handy on a Linux system is using domain sockets:
import socket
import sys
import time

def get_lock(process_name):
    # Without holding a reference to our socket somewhere it gets garbage
    # collected when the function exits
    get_lock._lock_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # The null byte (\0) means the socket is created
        # in the abstract namespace instead of being created
        # on the file system itself.
        # Works only in Linux
        get_lock._lock_socket.bind('\0' + process_name)
        print 'I got the lock'
    except socket.error:
        print 'lock exists'
        sys.exit()

get_lock('running_test')
while True:
    time.sleep(3)
It is atomic and avoids the problem of lock files lying around if your process gets sent a SIGKILL.
You can read in the documentation for socket.close that sockets are automatically closed when garbage collected.
Drop a pidfile somewhere (e.g. /tmp). Then you can check to see if the process is running by checking to see if the PID in the file exists. Don't forget to delete the file when you shut down cleanly, and check for it when you start up.
#!/usr/bin/env python
import os
import sys

pid = str(os.getpid())
pidfile = "/tmp/mydaemon.pid"

if os.path.isfile(pidfile):
    print "%s already exists, exiting" % pidfile
    sys.exit()

file(pidfile, 'w').write(pid)
try:
    pass  # Do some actual work here
finally:
    os.unlink(pidfile)
Then you can check to see if the process is running by checking to see if the contents of /tmp/mydaemon.pid are an existing process. Monit (mentioned above) can do this for you, or you can write a simple shell script to check it for you using the return code from ps.
ps up `cat /tmp/mydaemon.pid ` >/dev/null && echo "Running" || echo "Not running"
For extra credit, you can use the atexit module to ensure that your program cleans up its pidfile whenever it exits normally or with an unhandled exception.
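A sketch of that atexit variant (note that atexit handlers run on normal interpreter shutdown, including unhandled exceptions, but not on SIGKILL):
import atexit
import os
import sys

pid = str(os.getpid())
pidfile = "/tmp/mydaemon.pid"

if os.path.isfile(pidfile):
    print "%s already exists, exiting" % pidfile
    sys.exit()

file(pidfile, 'w').write(pid)
atexit.register(os.unlink, pidfile)  # cleanup runs however the script exits

# Do some actual work here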
The pid library can do exactly this.
from pid import PidFile

with PidFile():
    do_something()
It will also automatically handle the case where the pidfile exists but the process is not running.
My solution is to check for the process name and command-line arguments.
Tested on Windows and Ubuntu Linux.
import psutil
import os

def is_running(script):
    for q in psutil.process_iter():
        if q.name().startswith('python'):
            if len(q.cmdline()) > 1 and script in q.cmdline()[1] and q.pid != os.getpid():
                print("'{}' Process is already running".format(script))
                return True
    return False

if not is_running("test.py"):
    n = input("What is Your Name? ")
    print("Hello " + n)
Of course the example from Dan will not work as it should.
Indeed, if the script crashes, raises an exception, or does not clean up its pid file, the script will be run multiple times.
I suggest the following, based on another website:
This checks whether a lock file already exists:
#!/usr/bin/env python
import os
import sys

if os.access(os.path.expanduser("~/.lockfile.vestibular.lock"), os.F_OK):
    # if the lockfile is already there then check the PID number
    # in the lock file
    pidfile = open(os.path.expanduser("~/.lockfile.vestibular.lock"), "r")
    pidfile.seek(0)
    old_pid = pidfile.readline()
    # Now we check the PID from lock file matches to the current
    # process PID
    if os.path.exists("/proc/%s" % old_pid):
        print "You already have an instance of the program running"
        print "It is running as process %s," % old_pid
        sys.exit(1)
    else:
        print "File is there but the program is not running"
        print "Removing lock file for the: %s as it can be there because of the program last time it was run" % old_pid
        os.remove(os.path.expanduser("~/.lockfile.vestibular.lock"))
This is the part of the code where we write the PID into the lock file:
pidfile = open(os.path.expanduser("~/.lockfile.vestibular.lock"), "w")
pidfile.write("%s" % os.getpid())
pidfile.close()
This code checks the value of the pid against the existing running processes, avoiding double execution.
I hope it helps.
There are very good packages for restarting processes on UNIX. One that has a great tutorial about building and configuring it is monit. With some tweaking you can have rock-solid, proven technology keeping your daemon up.
Came across this old question looking for a solution myself.
Use psutil:
import psutil
import sys
from subprocess import Popen

for process in psutil.process_iter():
    if process.cmdline() == ['python', 'your_script.py']:
        sys.exit('Process found: exiting.')

print('Process not found: starting it.')
Popen(['python', 'your_script.py'])
There are a myriad of options. One method is using system calls, or Python libraries that perform such calls for you. The other is simply to spawn a process like:
ps ax | grep processName
and parse the output, as sketched below. Many people choose this approach, and it isn't necessarily a bad approach in my view.
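A sketch of that approach from Python; the grep -v grep filters out the grep process itself, and any command line that merely contains the name will still match, so treat the result as approximate:
import subprocess

def process_running(name):
    try:
        out = subprocess.check_output(
            "ps ax | grep '%s' | grep -v grep" % name, shell=True)
    except subprocess.CalledProcessError:
        # grep exits with status 1 when nothing matches
        return False
    return bool(out.strip())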
I'm a big fan of Supervisor for managing daemons. It's written in Python, so there are plenty of examples of how to interact with or extend it from Python. For your purposes the XML-RPC process control API should work nicely.
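For example, a sketch of querying supervisord over XML-RPC (this assumes the default inet_http_server on port 9001 is enabled in supervisord.conf, and that your daemon is configured as a program named 'mydaemon'):
import xmlrpclib  # xmlrpc.client on Python 3

server = xmlrpclib.ServerProxy('http://localhost:9001/RPC2')
info = server.supervisor.getProcessInfo('mydaemon')
if info['statename'] != 'RUNNING':
    server.supervisor.startProcess('mydaemon')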
Try this other version
def checkPidRunning(pid):
    '''Check for the existence of a unix pid.'''
    try:
        os.kill(pid, 0)
    except OSError:
        return False
    else:
        return True

# Entry point
if __name__ == '__main__':
    pid = str(os.getpid())
    pidfile = os.path.join("/", "tmp", __program__ + ".pid")

    if os.path.isfile(pidfile) and checkPidRunning(int(file(pidfile, 'r').readlines()[0])):
        print "%s already exists, exiting" % pidfile
        sys.exit()
    else:
        file(pidfile, 'w').write(pid)

    # Do some actual work here
    main()

    os.unlink(pidfile)
Rather than developing your own PID file solution (which has more subtleties and corner cases than you might think), have a look at supervisord -- this is a process control system that makes it easy to wrap job control and daemon behaviors around an existing Python script.
The other answers are great for things like cron jobs, but if you're running a daemon you should monitor it with something like daemontools.
ps ax | grep processName
Note: if you debug your script in PyCharm, the check will always exit, because the command line of the process looks like this:
pydevd.py --multiproc --client 127.0.0.1 --port 33882 --file processName
Try this:
#!/usr/bin/env python
import os, sys, atexit

try:
    # Set PID file
    def set_pid_file():
        pid = str(os.getpid())
        f = open('myCode.pid', 'w')
        f.write(pid)
        f.close()

    def goodby():
        os.remove('myCode.pid')

    atexit.register(goodby)
    set_pid_file()

    # Place your code here

except KeyboardInterrupt:
    sys.exit(0)
Here is more useful code (which also checks whether it is exactly Python executing the script):
#!/usr/bin/env python
import os
from sys import exit

def checkPidRunning(pid):
    global script_name
    if pid < 1:
        print "Incorrect pid number!"
        exit()
    try:
        os.kill(pid, 0)
    except OSError:
        print "Abnormal termination of previous process."
        return False
    else:
        ps_command = "ps -o command= %s | grep -Eq 'python .*/%s'" % (pid, script_name)
        process_exist = os.system(ps_command)
        if process_exist == 0:
            return True
        else:
            print "Process with pid %s is not a Python process. Continue..." % pid
            return False

if __name__ == '__main__':
    script_name = os.path.basename(__file__)
    pid = str(os.getpid())
    pidfile = os.path.join("/", "tmp/", script_name + ".pid")
    if os.path.isfile(pidfile):
        print "Warning! Pid file %s existing. Checking for process..." % pidfile
        r_pid = int(file(pidfile, 'r').readlines()[0])
        if checkPidRunning(r_pid):
            print "Python process with pid = %s is already running. Exit!" % r_pid
            exit()
        else:
            file(pidfile, 'w').write(pid)
    else:
        file(pidfile, 'w').write(pid)

    # main program
    ....
    ....

    os.unlink(pidfile)
Here, the line:
ps_command = "ps -o command= %s | grep -Eq 'python .*/%s'" % (pid, script_name)
returns 0 if grep is successful and a "python" process is currently running with the name of your script as a parameter.
A simple example, if you are only checking whether a process name exists or not:
import os

def pname_exists(inp):
    os.system('ps -ef > /tmp/psef')
    lines = open('/tmp/psef', 'r').read().split('\n')
    res = [i for i in lines if inp in i]
    return True if res else False
Result:
In [21]: pname_exists('syslog')
Out[21]: True
In [22]: pname_exists('syslog_')
Out[22]: False
I was looking for an answer to this myself and came up with a very simple and, in my opinion, very good solution (a false positive is not possible here, I guess: how can the timestamp in the file be updated if the program doesn't do it?):
just keep writing the current timestamp to a text file at some interval, depending on your needs (here, every half hour was perfect).
If, when you check, the timestamp in the file is outdated relative to the current time, then there was a problem with the program and it should be restarted, or whatever else you prefer to do.
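A minimal sketch of that heartbeat idea (the file path and interval are arbitrary choices):
import time

HEARTBEAT = '/tmp/mydaemon.heartbeat'

def beat():
    # called periodically from the daemon's main loop
    with open(HEARTBEAT, 'w') as f:
        f.write(str(time.time()))

def is_stale(max_age=1800):  # 30 minutes, as in the answer above
    # called by the watchdog that decides whether to restart the daemon
    try:
        with open(HEARTBEAT) as f:
            return time.time() - float(f.read()) > max_age
    except (IOError, ValueError):
        return True  # a missing or corrupt file counts as stale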
A portable solution that relies on multiprocessing.shared_memory:
import atexit
from multiprocessing import shared_memory

_ensure_single_process_store = {}

def ensure_single_process(name: str):
    if name in _ensure_single_process_store:
        return
    try:
        shm = shared_memory.SharedMemory(name='ensure_single_process__' + name,
                                         create=True,
                                         size=1)
    except FileExistsError:
        print(f"{name} is already running!")
        raise
    _ensure_single_process_store[name] = shm
    atexit.register(shm.unlink)
Usually you wouldn't have to use atexit, but sometimes it helps to clean up upon abnormal exit.
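Usage is a single call near the top of the script (the name 'mydaemon' is an arbitrary example):
ensure_single_process('mydaemon')
# ... the rest of the script runs only in the first instance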
Consider the following example to solve your problem:
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys, time, signal

def termination_handler(signum, frame):
    global running
    global pidfile
    print 'You have requested to terminate the application...'
    sys.stdout.flush()
    running = 0
    os.unlink(pidfile)

running = 1
signal.signal(signal.SIGINT, termination_handler)

pid = str(os.getpid())
pidfile = '/tmp/' + os.path.basename(__file__).split('.')[0] + '.pid'

if os.path.isfile(pidfile):
    print "%s already exists, exiting" % pidfile
    sys.exit()
else:
    file(pidfile, 'w').write(pid)

# Do some actual work here
while running:
    time.sleep(10)
I suggest this script because it ensures the program can only be executed once at a time.
Using bash to look for a process with the current script's name. No extra file.
import commands
import os
import time
import sys

def stop_if_already_running():
    script_name = os.path.basename(__file__)
    l = commands.getstatusoutput("ps aux | grep -e '%s' | grep -v grep | awk '{print $2}'" % script_name)
    if l[1]:
        sys.exit(0)
To test, add
stop_if_already_running()
print "running normally"
while True:
    time.sleep(3)
This is what I use in Linux to avoid starting a script if already running:
import os
import sys

script_name = os.path.basename(__file__)
pidfile = os.path.join("/tmp", os.path.splitext(script_name)[0]) + ".pid"

def create_pidfile():
    if os.path.exists(pidfile):
        with open(pidfile, "r") as _file:
            last_pid = int(_file.read())

        # Checking if process is still running
        last_process_cmdline = "/proc/%d/cmdline" % last_pid
        if os.path.exists(last_process_cmdline):
            with open(last_process_cmdline, "r") as _file:
                cmdline = _file.read()
            if script_name in cmdline:
                raise Exception("Script already running...")

    with open(pidfile, "w") as _file:
        pid = str(os.getpid())
        _file.write(pid)

def main():
    """Your application logic goes here"""

if __name__ == "__main__":
    create_pidfile()
    main()
This approach works well without any dependency on an external module.
