I have a script that gets a list of nodes as an argument (could be 10 or even 50), and connects to each by SSH to run a service restart command.
At the moment I'm using multiprocessing to parallelize the script (the batch size is passed as an argument as well). However, I've heard that the threading module could help me perform my tasks in a quicker and easier-to-manage way (I'm using try..except KeyboardInterrupt with sys.exit() and pool.terminate(), but it won't stop the entire script because it's a different process).
Since I understand that multithreading is more lightweight and easier to manage for my case, I'm trying to convert the script to use threading instead of multiprocessing, but it doesn't work properly.
The current code in multiprocessing (works):
def restart_service(node, initd_tup):
    """
    Get a node name as an argument, connect to it via SSH and run the service restart command.
    """
    command = 'service {0} restart'.format(initd_tup[node])
    logger.info('[{0}] Connecting to {0} in order to restart {1} service...'.format(node, initd_tup[node]))
    try:
        ssh.connect(node)
        stdin, stdout, stderr = ssh.exec_command(command)
        result = stdout.read()
        if not result:
            result_err = stderr.read()
            print '{0}{1}[{2}] ERROR: {3}{4}'.format(Color.BOLD, Color.RED, node, result_err, Color.END)
            logger.error('[{0}] Result of command {1} output: {2}'.format(node, command, result_err))
        else:
            print '{0}{1}{2}[{3}]{4}\n{5}'.format(Color.BOLD, Color.UNDERLINE, Color.GREEN, node, Color.END, result)
            logger.info('[{0}] Result of command {1} output: {2}'.format(node, command, result.replace("\n", "... ")))
        ssh.close()
    except paramiko.AuthenticationException:
        print "{0}{1}ERROR! SSH failed with Authentication Error. Make sure you run the script as root and try again..{2}".format(Color.BOLD, Color.RED, Color.END)
        logger.error('SSH Authentication failed, thrown error message to the user to make sure script is run with root permissions')
        pool.terminate()
    except socket.error as error:
        print("[{0}]{1}{2} ERROR! SSH failed with error: {3}{4}\n".format(node, Color.RED, Color.BOLD, error, Color.END))
        logger.error("[{0}] SSH failed with error: {1}".format(node, error))
    except KeyboardInterrupt:
        pool.terminate()
        general_utils.terminate(logger)


def convert_to_tuple(a_b):
    """Convert 'f([1,2])' to 'f(1,2)' call."""
    return restart_service(*a_b)


def iterate_nodes_and_call_exec_func(nodes_list):
    """
    Iterate over the list of nodes to process,
    create a list of nodes that shouldn't exceed the batch size provided (or 1 if not provided).
    Then using the multiprocessing module, call the restart_service func on x nodes in parallel (where x is the batch size).
    If batch_sleep arg was provided, call the sleep func and provide the batch_sleep argument between each batch.
    """
    global pool
    general_utils.banner('Initiating service restart')
    pool = multiprocessing.Pool(10)
    manager = multiprocessing.Manager()
    work = manager.dict()
    for line in nodes_list:
        work[line] = general_utils.get_initd(logger, args, line)
        if len(work) >= int(args.batch):
            pool.map(convert_to_tuple, itertools.izip(work.keys(), itertools.repeat(work)))
            work = {}
            if int(args.batch_sleep) > 0:
                logger.info('*** Sleeping for %d seconds before moving on to next batch ***', int(args.batch_sleep))
                general_utils.sleep_func(int(args.batch_sleep))
    if len(work) > 0:
        try:
            pool.map(convert_to_tuple, itertools.izip(work.keys(), itertools.repeat(work)))
        except KeyboardInterrupt:
            pool.terminate()
            general_utils.terminate(logger)
And here's what I've tried to do with threading, which doesn't work (when I assign a batch size larger than 1, the script simply gets stuck and I have to kill it forcefully):
def parse_args():
    """Define the argument parser, and the arguments to accept."""
    global args, parser
    parser = MyParser(description=__doc__)
    parser.add_argument('-H', '--host', help='List of hosts to process, separated by "," and NO SPACES!')
    parser.add_argument('--batch', help='Do requests in batches', default=1)
    args = parser.parse_args()
    # If no arguments were passed, print the help file and exit with ERROR..
    if len(sys.argv) == 1:
        parser.print_help()
        print '\n\nERROR: No arguments passed!\n'
        sys.exit(3)


def do_work(node):
    logger.info('[{0}]'.format(node))
    try:
        ssh.connect(node)
        stdin, stdout, stderr = ssh.exec_command('hostname ; date')
        print stdout.read()
        ssh.close()
    except:
        print 'ERROR!'
        sys.exit(2)


def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()


def iterate():
    for item in args.host.split(","):
        q.put(item)
    for i in range(int(args.batch)):
        t = Thread(target=worker)
        t.daemon = True
        t.start()
    q.join()


def main():
    parse_args()
    try:
        iterate()
    except KeyboardInterrupt:
        exit(1)
In the script log I see a WARNING generated by Paramiko as below:
2016-01-04 22:51:37,613 WARNING: Oops, unhandled type 3
I tried to Google this unhandled type 3 error, but didn't find anything related to my issue; the results talk about two-factor authentication or connecting with both a password and an SSH key at the same time, while I'm only loading the host keys without providing any password to the SSH client.
I would appreciate any help on this matter..
Managed to solve my problem using the parallel-ssh module.
Here's the code, fixed to perform my desired actions:
def iterate_nodes_and_call_exec_func(nodes):
    """
    Get a dict as an argument, containing linux services (initd) as the keys,
    and a list of nodes on which the linux service needs to be checked.
    Iterate over the list of nodes to process,
    create a list of nodes that shouldn't exceed the batch size provided (or 1 if not provided).
    Then using the parallel-ssh module, call the restart_service func on x nodes in parallel (where x is the batch size)
    and provide the linux service (initd) to process.
    If batch_sleep arg was provided, call the sleep func and provide the batch_sleep argument between each batch.
    """
    for initd in nodes.keys():
        work = dict()
        work[initd] = []
        count = 0
        for node in nodes[initd]:
            count += 1
            work[initd].append(node)
            if len(work[initd]) == args.batch:
                restart_service(work[initd], initd)
                work[initd] = []
                if args.batch_sleep > 0 and count < len(nodes[initd]):
                    logger.info('*** Sleeping for %d seconds before moving on to next batch ***', args.batch_sleep)
                    general_utils.sleep_func(int(args.batch_sleep))
        if len(work[initd]) > 0:
            restart_service(work[initd], initd)
def restart_service(nodes, initd):
    """
    Get a list of nodes and linux service as an argument,
    then connect by Parallel SSH module to the nodes and run the service restart command.
    """
    command = 'service {0} restart'.format(initd)
    logger.info('Connecting to {0} to restart the {1} service...'.format(nodes, initd))
    try:
        client = pssh.ParallelSSHClient(nodes, pool_size=args.batch, timeout=10, num_retries=1)
        output = client.run_command(command, sudo=True)
        for node in output:
            for line in output[node]['stdout']:
                if client.get_exit_code(output[node]) == 0:
                    print '[{0}]{1}{2} {3}{4}'.format(node, Color.BOLD, Color.GREEN, line, Color.END)
                else:
                    print '[{0}]{1}{2} ERROR! {3}{4}'.format(node, Color.BOLD, Color.RED, line, Color.END)
                    logger.error('[{0}] Result of command {1} output: {2}'.format(node, command, line))
    except pssh.AuthenticationException:
        print "{0}{1}ERROR! SSH failed with Authentication Error. Make sure you run the script as root and try again..{2}".format(Color.BOLD, Color.RED, Color.END)
        logger.error('SSH Authentication failed, thrown error message to the user to make sure script is run with root permissions')
        sys.exit(2)
    except pssh.ConnectionErrorException as error:
        print("[{0}]{1}{2} ERROR! SSH failed with error: {3}{4}\n".format(error[1], Color.RED, Color.BOLD, error[3], Color.END))
        logger.error("[{0}] SSH Failed with error: {1}".format(error[1], error[3]))
        restart_service(nodes[nodes.index(error[1])+1:], initd)
    except KeyboardInterrupt:
        general_utils.terminate(logger)
def generate_nodes_by_initd_dict(nodes_list):
    """
    Get a list of nodes as an argument.
    Then by calling the get_initd function for each of the nodes,
    build a dict based on linux services (initd) as keys and a list of nodes on which the initd
    needs to be processed as values. Then call the iterate_nodes_and_call_exec_func and provide the generated dict
    as its argument.
    """
    nodes = {}
    for node in nodes_list:
        initd = general_utils.get_initd(logger, args, node)
        if initd in nodes.keys():
            nodes[initd].append(node)
        else:
            nodes[initd] = [node, ]
    return iterate_nodes_and_call_exec_func(nodes)


def main():
    parse_args()
    try:
        general_utils.init_script('Service Restart', logger, log)
        log_args(logger, args)
        generate_nodes_by_initd_dict(general_utils.generate_nodes_list(args, logger, ['service', 'datacenter', 'lob']))
    except KeyboardInterrupt:
        general_utils.terminate(logger)
    finally:
        general_utils.wrap_up(logger)


if __name__ == '__main__':
    main()
In addition to using the pssh module, after a more thorough troubleshooting effort I was able to fix the original code that I posted in the question using the native threading module, by creating a new paramiko client for every thread rather than using the same client for all threads.
So basically (only updating the do_work function from the original question), here's the change:
def do_work(node):
    logger.info('[{0}]'.format(node))
    try:
        ssh = paramiko.SSHClient()
        ssh.connect(node)
        stdin, stdout, stderr = ssh.exec_command('hostname ; date')
        print stdout.read()
        ssh.close()
    except:
        print 'ERROR!'
        sys.exit(2)
When done this way, the native Threading module works perfectly!
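For completeness, the same "one client per thread" idea can also be expressed with thread-local storage, so a long-lived worker thread builds and reuses its own client instead of sharing a global one. This is only a rough sketch under the assumption that host keys are loaded the way the original script does (no password), so adapt it to your setup:
import threading
import paramiko

thread_data = threading.local()  # every thread sees its own 'ssh' attribute

def get_ssh_client():
    # Lazily build one SSHClient per worker thread instead of sharing a global client
    if not hasattr(thread_data, 'ssh'):
        client = paramiko.SSHClient()
        client.load_system_host_keys()  # assumption: host keys only, no password, as in the question
        thread_data.ssh = client
    return thread_data.ssh

def do_work(node):
    ssh = get_ssh_client()
    ssh.connect(node)
    stdin, stdout, stderr = ssh.exec_command('hostname ; date')
    print stdout.read()
    ssh.close()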
Related
I am using Windows and am looking for a handler or wrapper in Python for a Minecraft server, so that I can automatically enter commands without user input. I have searched through many questions on the site and only found half answers (in my case at least). I believe I will need the subprocess module, but I can't decide exactly how to use it; at the moment I am experimenting with the Popen functions. I found an answer which I modified for my case:
server = Popen("java -jar minecraft_server.jar nogui", stdin=PIPE, stdout=PIPE, stderr=STDOUT)
while True:
    print(server.stdout.readline())
    server.stdout.flush()
    command = input("> ")
    if command:
        server.stdin.write(bytes(command + "\r\n", "ascii"))
        server.stdin.flush()
This does work in some way, but it only prints a line each time you enter a command, which can't work for me, and all my attempts to change this end with the program unable to execute anything else and instead just reading. This is not a duplicate question, because none of the answers to similar questions helped me enough.
As you already know, your server.stdout.readline() and input("> ") calls are blocking your code execution.
You need to make the code non-blocking: instead of waiting for data to be returned, check whether there is anything to read, skip the read if there isn't, and carry on doing other things.
On Linux systems you might be able to use the select module, but on Windows it only works on sockets.
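For illustration, a select-based version of the loop from the question could look roughly like this on Linux (a sketch only; it reuses the server object from the question and will not work on Windows):
import select
import sys

# Poll both the server's stdout and the terminal with a short timeout,
# so neither readline() nor input() can block the loop indefinitely.
while server.poll() is None:
    readable, _, _ = select.select([server.stdout, sys.stdin], [], [], 0.1)
    if server.stdout in readable:
        print(server.stdout.readline())
    if sys.stdin in readable:
        command = sys.stdin.readline().strip()
        if command:
            server.stdin.write(bytes(command + "\r\n", "ascii"))
            server.stdin.flush()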
I was able to make it work on Windows by using threads and queues. (note: it's Python 2 code)
import subprocess, sys
from Queue import Queue, Empty
from threading import Thread


def process_line(line):
    if line == "stop\n":  # lines have trailing new line characters
        print "SERVER SHUTDOWN PREVENTED"
        return None
    elif line == "quit\n":
        return "stop\n"
    elif line == "l\n":
        return "list\n"
    return line


s = subprocess.Popen("java -jar minecraft_server.jar nogui", stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)


def read_lines(stream, queue):
    while True:
        queue.put(stream.readline())

# terminal reading thread
q = Queue()
t = Thread(target=read_lines, args=(sys.stdin, q))
t.daemon = True
t.start()

# server reading thread
qs = Queue()
ts = Thread(target=read_lines, args=(s.stdout, qs))
ts.daemon = True
ts.start()

while s.poll() == None:  # loop while the server process is running
    # get a user entered line and send it to the server
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        line = process_line(line)  # do something with the user entered line
        if line != None:
            s.stdin.write(line)
            s.stdin.flush()

    # just pass-through data from the server to the terminal output
    try:
        line = qs.get_nowait()
    except Empty:
        pass
    else:
        sys.stdout.write(line)
        sys.stdout.flush()
I am trying to run a Python script via NRPE to monitor RabbitMQ. Inside the script is the command 'sudo rabbitmqctl list_queues', which gives me a message count for each queue. However, this results in Nagios giving this message:
CRITICAL - Command '['sudo', 'rabbitmqctl', 'list_queues']' returned non-zero exit status 1
I thought this might be a permissions issue, so I proceeded in the following manner.
/etc/group:
ec2-user:x:500:
rabbitmq:x:498:nrpe,nagios,ec2-user
nagios:x:497:
nrpe:x:496:
rpc:x:32:
/etc/sudoers:
%rabbitmq ALL=NOPASSWD: /usr/sbin/rabbitmqctl
nagios configuration:
command[check_rabbitmq_queuecount_prod]=/usr/bin/python27 /etc/nagios/check_rabbitmq_prod -a queues_count -C 3000 -W 1500
check_rabbitmq_prod:
#!/usr/bin/env python
from optparse import OptionParser
import shlex
import subprocess
import sys


class RabbitCmdWrapper(object):
    """So basically this just runs rabbitmqctl commands and returns parsed output.
    Typically this means you need root privs for this to work.
    Made this its own class so it could be used in other monitoring tools
    if desired."""

    @classmethod
    def list_queues(cls):
        args = shlex.split('sudo rabbitmqctl list_queues')
        cmd_result = subprocess.check_output(args).strip()
        results = cls._parse_list_results(cmd_result)
        return results

    @classmethod
    def _parse_list_results(cls, result_string):
        results = result_string.strip().split('\n')
        # remove text fluff
        results.remove(results[-1])
        results.remove(results[0])
        return_data = []
        for row in results:
            return_data.append(row.split('\t'))
        return return_data


def check_queues_count(critical=1000, warning=1000):
    """
    A blanket check to make sure all queues are within count parameters.
    TODO: Possibly break this out so test can be done on individual queues.
    """
    try:
        critical_q = []
        warning_q = []
        ok_q = []
        results = RabbitCmdWrapper.list_queues()
        for queue in results:
            if queue[0] == 'SFS_Production_Queue':
                count = int(queue[1])
                if count >= critical:
                    critical_q.append("%s: %s" % (queue[0], count))
                elif count >= warning:
                    warning_q.append("%s: %s" % (queue[0], count))
                else:
                    ok_q.append("%s: %s" % (queue[0], count))
        if critical_q:
            print "CRITICAL - %s" % ", ".join(critical_q)
            sys.exit(2)
        elif warning_q:
            print "WARNING - %s" % ", ".join(warning_q)
            sys.exit(1)
        else:
            print "OK - %s" % ", ".join(ok_q)
            sys.exit(0)
    except Exception, err:
        print "CRITICAL - %s" % err
        sys.exit(2)


USAGE = """Usage: ./check_rabbitmq -a [action] -C [critical] -W [warning]
Actions:
  - queues_count
    checks the count in each of the queues in rabbitmq's list_queues"""

if __name__ == "__main__":
    parser = OptionParser(USAGE)
    parser.add_option("-a", "--action", dest="action",
                      help="Action to Check")
    parser.add_option("-C", "--critical", dest="critical",
                      type="int", help="Critical Threshold")
    parser.add_option("-W", "--warning", dest="warning",
                      type="int", help="Warning Threshold")
    (options, args) = parser.parse_args()

    if options.action == "queues_count":
        check_queues_count(options.critical, options.warning)
    else:
        print "Invalid action: %s" % options.action
        print USAGE
At this point I'm not sure what is preventing the script from running. It runs fine via the command-line. Any help is appreciated.
The "non-zero exit code" error is often associated with requiretty being applied to all users by default in your sudoers file.
Disabling "requiretty" in your sudoers file for the user that runs the check is safe, and may potentially fix the issue.
E.g. (assuming nagios/nrpe are the users)
# /etc/sudoers
Defaults:nagios !requiretty
Defaults:nrpe !requiretty
I guess what @EE1213 mentions is right. If you have permission to read /var/log/secure, the log probably contains error messages regarding sudoers, like:
"sorry, you must have a tty to run sudo"
I'm trying to read command output from hcitool in Linux (it scans for Bluetooth devices).
I just need to read the first line that it returns, as sometimes this tool has an error. The issue is that this tool continues to run in an infinite loop, which locks up the rest of my Python script. The script is run with sudo so that it has root privileges to use the hcitool command.
I have created a class to try to pipe the data in asynchronously:
class ASyncThread(threading.Thread):  # Popen read and readline are blocking, so we must use an async thread to read from hcitool
    def __init__(self, command, parameters=[]):
        self.stdout = None
        self.stderr = None
        self.command = command
        self.parameters = parameters
        self.process = None
        threading.Thread.__init__(self)

    def run(self):
        if len(self.command) >= 1:
            self.process = subprocess.Popen([self.command] + self.parameters, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            self.stdout, self.stderr = self.process.communicate()
        else:
            print "[ASyncThread::run()] Error: Empty command given."

    def terminate(self):
        try:
            self.process.terminate()
        except Exception, ex:
            print "[ASyncThread::terminate()] Error: ", ex
And I'm calling it with:
print "Checking HCI Tool Status..."
hciThread = ASyncThread("/usr/local/bin/hciconfig", ["lescan"])
hciThread.start()
time.sleep(1) #Give the program time to run.
hciThread.terminate() #If terminate is not placed here, it locks up my Python script when the thread is joined.
hciThread.join()
outputText = hciThread.stdout + " | " + hciThread.stderr
When this is run, the output is just " | ".
If I run this command:
sudo /usr/local/bin/hcitool lescan
It starts working immediately:
slyke#ubuntu ~ $ sudo hcitool lescan
Set scan parameters failed: Input/output error
I've been working on this for a few hours now. I originally tried to do this with Popen directly, but read() and readline() are both blocking. That is not normally a problem, except that there may not be an error, or any data at all, produced by this command, so my Python script hangs. This is why I moved to threading: it can wait for a second before stopping the command and continuing on.
It seems to me that you cannot possibly join a thread right after you have terminated it on the line above.
Your particular issue about doing an lescan is probably better solved with the solution from mikerr/btle-scan.py - https://gist.github.com/mikerr/372911c955e2a94b96089fbc300c2b5d
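If you do want to keep reading the output yourself rather than scanning over BLE directly, the usual pattern is the same one used in the Minecraft answer above: a daemon thread does the blocking readline() and pushes lines onto a Queue, which the main script polls. A rough sketch (the hcitool command, the root assumption and the one-second window are taken from the question, not verified here):
import subprocess
import time
from Queue import Queue, Empty
from threading import Thread

def reader(stream, queue):
    # Runs in a daemon thread, so the blocking readline() cannot hang the main script
    for line in iter(stream.readline, ''):
        queue.put(line)

proc = subprocess.Popen(['hcitool', 'lescan'],  # assumes the script already runs as root, as in the question
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
q = Queue()
t = Thread(target=reader, args=(proc.stdout, q))
t.daemon = True
t.start()

time.sleep(1)  # give hcitool a moment to produce its first line (or its error)
try:
    first_line = q.get_nowait()
except Empty:
    first_line = ''
proc.terminate()
print "First line of output:", first_line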
The code I am trying is:
def update_vm(si, vm):
    env.host_string = vm
    with settings(user=VM_USER, key_filename=inputs['ssh_key_path']):
        put(local_file, remote_zip_file)
        run('tar -zxpf %s' % remote_zip_file)
        run('sudo sh %s' % REMOTE_UPDATE_SCRIPT)
        response_msg = run('cat %s' % REMOTE_RESPONSE_FILE)
        if 'success' in response_msg:
            pass  # do stuff
        else:
            pass  # do stuff


def update_vm_wrapper(args):
    return update_vm(*args)


def main():
    try:
        si = get_connection()
        vms = [vm1, vm2, vm3...]
        update_jobs = [(si, vm) for vm in vms]
        pool = Pool(30)
        pool.map(update_vm_wrapper, update_jobs)
        pool.close()
        pool.join()
    except Exception as e:
        print e


if __name__ == "__main__":
    main()
Now the problem is that I saw it trying to put the zip file onto the same VM (say vm1) 3 times (I guess the length of vms), and trying to execute the other SSH commands 3 times as well.
Using a lock around the update_vm() method solves the issue, but then it no longer looks like a multiprocessing solution; it's more like iterating over a loop.
What am I doing wrong here?
Fabric has its own facilities for parallel execution of tasks - you should use those, rather than just trying to execute Fabric tasks in multiprocessing pools. The problem is that the env object is mutated when executing the tasks, so the different workers are stepping on each other (unless you put locking in).
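A rough sketch of what that looks like with Fabric 1.x (reusing names from the question; the @parallel decorator and execute() are standard Fabric 1.x APIs, and the vCenter/si handling is left out):
from fabric.api import env, execute, parallel, put, run

env.user = VM_USER
env.key_filename = inputs['ssh_key_path']

@parallel(pool_size=30)  # Fabric forks its own workers, so no multiprocessing.Pool is needed
def update_vm():
    put(local_file, remote_zip_file)
    run('tar -zxpf %s' % remote_zip_file)
    run('sudo sh %s' % REMOTE_UPDATE_SCRIPT)
    return run('cat %s' % REMOTE_RESPONSE_FILE)

def main():
    vms = [vm1, vm2, vm3]  # hypothetical host list, as in the question
    results = execute(update_vm, hosts=vms)  # maps each host to its task's return value
    for vm, response_msg in results.items():
        if 'success' in response_msg:
            pass  # do stuff
        else:
            pass  # do stuff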
I'm having some strange issues using subprocess.check_output(). At first I was just using subprocess.call() and everything was working fine. However when I simply switch out call() for check_output(), I receive a strange error.
Before code (works fine):
def execute(hosts):
    ''' Using psexec, execute the batch script on the list of hosts '''
    successes = []
    wd = r'c:\\'
    file = r'c:\\script.exe'
    for host in hosts:
        res = subprocess.call(shlex.split(r'psexec \\\\%s -e -s -d -w %s %s' % (host, wd, file)))
        if res.... # Want to check the output here
            successes.append(host)
    return successes
After code (doesn't work):
def execute(hosts):
    ''' Using psexec, execute the batch script on the list of hosts '''
    successes = []
    wd = r'c:\\'
    file = r'c:\\script.exe'
    for host in hosts:
        res = subprocess.check_output(shlex.split(r'psexec \\\\%s -e -s -d -w %s %s' % (host, wd, file)))
        if res.... # Want to check the output here
            successes.append(host)
    return successes
This gives the error:
I couldn't redirect this because the program hangs here and I can't Ctrl-C out of it. Any ideas why this is happening? What's the difference between subprocess.call() and check_output() that could be causing this?
Here is the additional code including the multiprocessing portion:
PROCESSES = 2
host_sublists_execute = [.... list of hosts ... ]
poolE = multiprocessing.Pool(processes=PROCESSES)
success_executions = poolE.map(execute,host_sublists_execute)
success_executions = [entry for sub in success_executions for entry in sub]
poolE.close()
poolE.join()
Thanks!
You are encountering Python Issue 9400.
There is a key distinction you have to understand about subprocess.call() vs subprocess.check_output(). subprocess.call() will execute the command you give it, then provide you with the return code. On the other hand, subprocess.check_output() returns the program's output to you in a string, but it tries to do you a favor and check the program's return code and will raise an exception (subprocess.CalledProcessError) if the program did not execute successfully (returned a non-zero return code).
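A minimal way to see the difference (using a harmless placeholder command rather than psexec):
import subprocess, sys

rc = subprocess.call([sys.executable, '-c', 'print(1)'])           # rc is the exit code; the child's output goes straight to the console
out = subprocess.check_output([sys.executable, '-c', 'print(1)'])  # out is the captured stdout ('1\n'); a non-zero exit raises CalledProcessError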
When you call pool.map() with a multiprocessing pool, it will try to propagate exceptions in the subprocesses back to main and raise an exception there. Apparently there is an issue with how the subprocess.CalledProcessError exception class is defined, so the pickling fails when the multiprocessing library tries to propagate the exception back to main.
The program you're calling is returning a non-zero return code, which makes subprocess.check_output() raise an exception, and pool.map() can't handle it properly, so you get the TypeError that results from the failed attempt to retrieve the exception.
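One way to sidestep that (a sketch of your execute() with an assumed success check, not a definitive fix) is to catch the exception inside the worker function itself, so only plain, picklable values ever travel back through pool.map():
import shlex
import subprocess

def execute(hosts):
    ''' Using psexec, execute the batch script on the list of hosts '''
    successes = []
    wd = r'c:\\'
    file = r'c:\\script.exe'
    for host in hosts:
        try:
            out = subprocess.check_output(shlex.split(r'psexec \\\\%s -e -s -d -w %s %s' % (host, wd, file)))
        except subprocess.CalledProcessError as err:
            # Handle the failure here; the exception never has to cross the process boundary
            print 'psexec failed on %s with exit code %s' % (host, err.returncode)
            continue
        if 'completed' in out:  # hypothetical success check on the captured output
            successes.append(host)
    return successes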
As a side note, the definition of subprocess.CalledProcessError must be really screwed up, because if I open my 2.7.6 terminal, import subprocess, and manually raise the error, I still get the TypeError, so I don't think it's merely a pickling problem.
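For what it's worth, the pickling part can be reproduced on 2.7 without multiprocessing at all, which is consistent with Issue 9400:
import pickle
import subprocess

err = subprocess.CalledProcessError(1, 'somecommand')
pickle.loads(pickle.dumps(err))  # TypeError: the exception is re-created with empty args on unpickling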