Weird behavior from Python's multiprocessing

The code I am trying is:
def update_vm(si, vm):
    env.host_string = vm
    with settings(user=VM_USER, key_filename=inputs['ssh_key_path']):
        put(local_file, remote_zip_file)
        run('tar -zxpf %s' % remote_zip_file)
        run('sudo sh %s' % REMOTE_UPDATE_SCRIPT)
        response_msg = run('cat %s' % REMOTE_RESPONSE_FILE)
        if 'success' in response_msg:
            pass  # do stuff
        else:
            pass  # do stuff

def update_vm_wrapper(args):
    return update_vm(*args)

def main():
    try:
        si = get_connection()
        vms = [vm1, vm2, vm3...]
        update_jobs = [(si, vm) for vm in vms]
        pool = Pool(30)
        pool.map(update_vm_wrapper, update_jobs)
        pool.close()
        pool.join()
    except Exception as e:
        print e

if __name__ == "__main__":
    main()
Now the problem is that I see it trying to put the zip file onto the same VM (say vm1) three times (I guess once per entry in vms), and trying to execute the other SSH commands three times as well.
Using a lock around the update_vm() method solves the issue, but then it's no longer a multiprocessing solution; it's more like iterating over a loop.
What am I doing wrong here?

Fabric has its own facilities for parallel execution of tasks - you should use those, rather than just trying to execute Fabric tasks in multiprocessing pools. The problem is that the env object is mutated when executing the tasks, so the different workers are stepping on each other (unless you put locking in).
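A minimal sketch of that approach, assuming Fabric 1.x (the version that provides env/settings/put/run). The task body reuses the names from the question (VM_USER, inputs, local_file, remote_zip_file, and so on, assumed to be defined as in the question), and @parallel lets Fabric fork its own worker processes instead of a multiprocessing.Pool:

from fabric.api import env, execute, parallel, put, run

env.user = VM_USER
env.key_filename = inputs['ssh_key_path']

@parallel(pool_size=30)               # Fabric forks up to 30 workers itself
def update_vm_task():
    # env.host_string is set per worker by Fabric, so nothing is mutated by hand
    put(local_file, remote_zip_file)
    run('tar -zxpf %s' % remote_zip_file)
    run('sudo sh %s' % REMOTE_UPDATE_SCRIPT)
    return run('cat %s' % REMOTE_RESPONSE_FILE)

def main():
    vms = [vm1, vm2, vm3]             # host strings, as in the question
    results = execute(update_vm_task, hosts=vms)   # {host: return value of the task}
    for vm, response_msg in results.items():
        if 'success' in response_msg:
            pass  # do stuff
        else:
            pass  # do stuff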

Related

Python update Database during multiprocessing

I am using multiprocessing to run jobs in parallel. My goal is to use multiple CPU cores, which is why I chose the multiprocessing module instead of the threading module.
I have a method that uses the subprocess module to execute a Linux shell command; I need to filter its output and update the results in a database.
The subprocess execution time differs per worker: for some inputs it may be 10 seconds, for others 15 seconds.
My concern is whether each worker will reliably get its own execution result, or whether I need a locking mechanism. If so, could you provide an example that suits my requirement?
Below is the example code:
#!/usr/bin/env python
import json
from subprocess import check_output
import multiprocessing

class Test:

    # Convert bytes to UTF-8 string
    @staticmethod
    def bytes_to_string(string_convert):
        if not isinstance(string_convert, bytes) and isinstance(string_convert, str):
            return string_convert, True
        elif isinstance(string_convert, bytes):
            string_convert = string_convert.decode("utf-8")
        else:
            print("Passed in non-byte type to convert to string: {0}".format(string_convert))
            return "", False
        return string_convert, True

    # Execute commands in Linux shell
    @staticmethod
    def command_output(command):
        try:
            output = check_output(command)
        except Exception as e:
            return e, False
        output, state = Test.bytes_to_string(output)
        return output, True

    @staticmethod
    def run_multi(num):
        test_result, success = Test.command_output(["curl", "-sb", "-H", "Accept: application/json", "http://127.0.0.1:5500/stores"])
        out = json.loads(test_result)
        # Is it safe to update the database here, or do I need to use any locks?

if __name__ == '__main__':
    test = Test()
    input_list = list(range(0, 1000))
    numberOfThreads = 100
    p = multiprocessing.Pool(numberOfThreads)
    p.map(test.run_multi, input_list)
    p.close()
    p.join()
Depends on what sort of updates you're doing in the database...
If it's a full database, it'll have its own locking mechanisms; you'll need to work with them, but other than that it's already designed to handle concurrent access.
For example, if the update involves inserting a row, you can just do that; the database will end up with all the rows, each exactly once.
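For instance, here is a minimal sketch of that (not the asker's code: the results table, the database file, and the endpoint handling are made up for illustration). Each worker opens its own connection and inserts a row, with no locking in the Python code:

import json
import sqlite3
import multiprocessing
from subprocess import check_output

def fetch_and_store(num):
    out = check_output(["curl", "-s", "http://127.0.0.1:5500/stores"])
    data = json.loads(out.decode("utf-8"))
    # One connection per worker process; never share a connection across processes.
    # SQLite serializes writers internally, so a generous timeout avoids
    # "database is locked" errors; assumes the results table already exists.
    conn = sqlite3.connect("results.db", timeout=30)
    with conn:                     # commits the transaction on success
        conn.execute("INSERT INTO results (job_id, payload) VALUES (?, ?)",
                     (num, json.dumps(data)))
    conn.close()

if __name__ == "__main__":
    with multiprocessing.Pool(10) as pool:
        pool.map(fetch_and_store, range(100))

With a client/server database (PostgreSQL, MySQL, and so on) the structure is the same, except each worker opens its own connection to the server, which handles the concurrent inserts with its internal locking.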

Multiprocessing pool doesn't close and join, terminating the script before all the processes finish

I have created a multiprocessing application that just loops over some files and compares them, but for some reason the pool never closes and waits to join all the process results.
from multiprocessing import Pool
from multiprocessing import Pool
import sqlite3
from datetime import datetime

def compare_from_database(row_id, connection_to_database):
    now = datetime.now()
    connection1 = sqlite3.connect(connection_to_database)
    cursor = connection1.cursor()
    grab_row_id_query = "SELECT * FROM MYTABLE WHERE rowid = {0};".format(row_id)
    grab_row_id = cursor.execute(grab_row_id_query)
    work_file_path = grab_row_id.fetchone()[1]
    all_remaining_files_query = "SELECT * FROM MYTABLE WHERE rowid > {0};".format(row_id)
    all_remaining_files = cursor.execute(all_remaining_files_query)
    for i in all_remaining_files:
        if i[1] == work_file_path:
            completed_query = "UPDATE MYTABLE SET REPEATED = 1 WHERE ROWID = {0};".format(row_id)
            work_file = cursor.execute(completed_query)
    connection1.commit()
    cursor.close()
    connection1.close()
    return "id {0} took: {1}".format(row_id, datetime.now()-now)
I have tried it with:
def apply_async(range_max, connection_to_database):
    pool = Pool()
    for i in range_of_ids:
        h = pool.apply_async(compare_from_database, args=(i, connection_to_database))
    pool.close()
    pool.join()
Also using a context and kind of force it:
from multiprocessing import Pool

with Pool() as pool:
    for i in range_of_ids:
        h = pool.apply_async(compare_from_database, args=(i, connection_to_database))
    pool.close()
    pool.join()
Even though, with the context manager, I shouldn't need the close/join.
The script just submits all the jobs; I can see all the Python instances running in Task Manager, and the print statements inside the function print to the console fine, but once the main script finishes submitting all the functions to the pool, it just ends. It doesn't respect the close/join:
Process finished with exit code 0
If I run the function by itself, it runs fine and returns the string:
compare_from_database(1, connection_to_database="my_path/sqlite.db")
or in a loop, it works fine as well:
for i in range(1, 4):
    compare_from_database(i, connection_to_database="my_path/sqlite.db")
I tried using Python 3.7 and 3.8 and wanted to validate it against the documentation:
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.join
Has anyone run into a similar issue, or any ideas on what it might be?
Since you want all the processes to finish before proceeding to the next part of the script, change apply_async to apply; that forces the pool to run each job and wait for its result.
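In code, a minimal sketch of that suggestion, reusing compare_from_database, range_of_ids and connection_to_database from the question. The second variant keeps apply_async but waits on the AsyncResult handles with .get(); the answer doesn't mention it, but it preserves the parallelism:

from multiprocessing import Pool

# Variant 1: pool.apply blocks until each job finishes (jobs run one at a time).
with Pool() as pool:
    for i in range_of_ids:
        print(pool.apply(compare_from_database, args=(i, connection_to_database)))

# Variant 2: submit everything with apply_async, then wait on every handle.
# .get() blocks until that job is done and re-raises any exception it hit,
# so the pool cannot be torn down before the work completes.
with Pool() as pool:
    handles = [pool.apply_async(compare_from_database, args=(i, connection_to_database))
               for i in range_of_ids]
    for h in handles:
        print(h.get())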

Concurrently downloading files in python using multi process

I have written the code below to download files using pySmartDL. I would like to download more than one file at a time. I tried to implement it using multiprocessing, but the second process starts only when the first finishes. The code is below:
import time
from multiprocessing import Process
from pySmartDL import SmartDL, HashFailedException

def down():
    dest = '/home/faheem/Downloads'
    obj = SmartDL(url_100mb_file, dest, progress_bar=False, fix_urls=True)
    obj.start(blocking=False)
    #cnt=1
    while not obj.isFinished():
        print("Speed: %s" % obj.get_speed(human=True))
        print("Already downloaded: %s" % obj.get_dl_size(human=True))
        print("Eta: %s" % obj.get_eta(human=True))
        print("Progress: %d%%" % (obj.get_progress()*100))
        print("Progress bar: %s" % obj.get_progress_bar())
        print("Status: %s" % obj.get_status())
        print("\n"*2+"="*50+"\n"*2)
        print("SIZE=%s" % obj.filesize)
        time.sleep(2)
    if obj.isSuccessful():
        print("downloaded file to '%s'" % obj.get_dest())
        print("download task took %ss" % obj.get_dl_time(human=True))
        print("File hashes:")
        print(" * MD5: %s" % obj.get_data_hash('md5'))
        print(" * SHA1: %s" % obj.get_data_hash('sha1'))
        print(" * SHA256: %s" % obj.get_data_hash('sha256'))
        data = obj.get_data()
    else:
        print("There were some errors:")
        for e in obj.get_errors():
            print(str(e))
    return

if __name__ == '__main__':
    #jobs=[]
    #for i in range(5):
    print 'Link1'
    url_100mb_file = ['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar']
    Process(target=down()).start()
    print 'link2'
    url_100mb_file = ['https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']
    Process(target=down()).start()
Here link2 starts downloading only when link1 finishes, but I need both downloads to run concurrently. I would like this to handle up to 10 downloads at a time. So is multiprocessing a good fit here?
Is there any better, more memory-efficient method?
I am a beginner with this kind of code, so please keep the answer simple.
Regards
You can also use Python's threading module. Here is a little snippet showing how it works:
import threading
import time

def func(i):
    time.sleep(i)
    print i

for i in range(1, 11):
    thread = threading.Thread(target=func, args=(i,))
    thread.start()
    print "Launched thread " + str(i)
print "Done"
Run this snippet and you will get a good idea of how it works.
Knowing that, you can run your own code the same way, passing the URL to use as an argument to the function in each thread (see the sketch below).
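As a rough sketch of that idea applied to the question's down() function (note that target must be the function object, not the result of calling it, and the URL goes in args; the URLs are the two from the question):

import threading
from pySmartDL import SmartDL

def down(url):
    dest = '/home/faheem/Downloads'
    obj = SmartDL(url, dest, progress_bar=False, fix_urls=True)
    obj.start(blocking=True)          # blocking is fine here: each thread owns one download
    print("downloaded file to '%s'" % obj.get_dest())

urls = ['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar',
        'https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']

threads = [threading.Thread(target=down, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()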
Hope that helps
The particular library you're using appears to already support non-blocking downloads, so why not just do the following? Non-blocking means the download runs in the background while your code continues.
from time import sleep
from pySmartDL import SmartDL

links = [['https://softpedia-secure-download.com/dl/45b1fc44f6bfabeddeb7ce766c97a8f0/58b6eb0f/100255033/software/office/Text%20Comparator%20(v1.2).rar'],
         ['https://www.crystalidea.com/downloads/macsfancontrol_setup.exe']]
objs = [SmartDL(link, progress_bar=False) for link in links]

for obj in objs:
    obj.start(blocking=False)

while not all(obj.isFinished() for obj in objs):
    sleep(1)
Since your program is I/O-bound, you can use either multiprocessing or multithreading.
Just in case, I'd like to recall the classical pattern for problems like this: have a queue of URLs from which worker processes/threads pull URLs for processing, and a status queue where the workers push their progress reports or errors (a minimal sketch follows below).
A thread pool or a process pool greatly simplifies things compared to manual control.
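A minimal sketch of that pattern with threads (download() is a hypothetical per-URL function, e.g. a wrapper around SmartDL):

import threading
import queue

def worker(url_queue, status_queue):
    while True:
        url = url_queue.get()
        if url is None:                      # sentinel: no more work for this thread
            url_queue.task_done()
            break
        try:
            download(url)                    # hypothetical: e.g. SmartDL(url, ...).start()
            status_queue.put((url, 'ok'))
        except Exception as exc:
            status_queue.put((url, 'error: %s' % exc))
        url_queue.task_done()

def run_all(urls, num_workers=10):
    url_queue, status_queue = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(url_queue, status_queue))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        url_queue.put(url)
    for _ in threads:
        url_queue.put(None)                  # one sentinel per worker
    for t in threads:
        t.join()
    while not status_queue.empty():
        print(status_queue.get())            # progress report / error per URL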

Multiple SSH Connections in a Python 2.7 script- Multiprocessing Vs Threading

I have a script that gets a list of nodes as an argument (could be 10 or even 50), and connects to each by SSH to run a service restart command.
At the moment, I'm using multiprocessing in order to parallelize the script (it also gets the batch size as an argument); however, I've heard that the threading module could help me perform my tasks in a quicker and easier-to-manage way (I'm using try..except KeyboardInterrupt with sys.exit() and pool.terminate(), but it won't stop the entire script because it's a different process).
Since I understand that multithreading is more lightweight and easier to manage for my case, I am trying to convert my script to use threading instead of multiprocessing, but it doesn't work properly.
The current code in multiprocessing (works):
def restart_service(node, initd_tup):
    """
    Get a node name as an argument, connect to it via SSH and run the service restart command..
    """
    command = 'service {0} restart'.format(initd_tup[node])
    logger.info('[{0}] Connecting to {0} in order to restart {1} service...'.format(node, initd_tup[node]))
    try:
        ssh.connect(node)
        stdin, stdout, stderr = ssh.exec_command(command)
        result = stdout.read()
        if not result:
            result_err = stderr.read()
            print '{0}{1}[{2}] ERROR: {3}{4}'.format(Color.BOLD, Color.RED, node, result_err, Color.END)
            logger.error('[{0}] Result of command {1} output: {2}'.format(node, command, result_err))
        else:
            print '{0}{1}{2}[{3}]{4}\n{5}'.format(Color.BOLD, Color.UNDERLINE, Color.GREEN, node, Color.END, result)
            logger.info('[{0}] Result of command {1} output: {2}'.format(node, command, result.replace("\n", "... ")))
        ssh.close()
    except paramiko.AuthenticationException:
        print "{0}{1}ERROR! SSH failed with Authentication Error. Make sure you run the script as root and try again..{2}".format(Color.BOLD, Color.RED, Color.END)
        logger.error('SSH Authentication failed, thrown error message to the user to make sure script is run with root permissions')
        pool.terminate()
    except socket.error as error:
        print("[{0}]{1}{2} ERROR! SSH failed with error: {3}{4}\n".format(node, Color.RED, Color.BOLD, error, Color.END))
        logger.error("[{0}] SSH failed with error: {1}".format(node, error))
    except KeyboardInterrupt:
        pool.terminate()
        general_utils.terminate(logger)
def convert_to_tuple(a_b):
    """Convert 'f([1,2])' to 'f(1,2)' call."""
    return restart_service(*a_b)

def iterate_nodes_and_call_exec_func(nodes_list):
    """
    Iterate over the list of nodes to process,
    create a list of nodes that shouldn't exceed the batch size provided (or 1 if not provided).
    Then using the multiprocessing module, call the restart_service func on x nodes in parallel (where x is the batch size).
    If batch_sleep arg was provided, call the sleep func and provide the batch_sleep argument between each batch.
    """
    global pool
    general_utils.banner('Initiating service restart')
    pool = multiprocessing.Pool(10)
    manager = multiprocessing.Manager()
    work = manager.dict()
    for line in nodes_list:
        work[line] = general_utils.get_initd(logger, args, line)
        if len(work) >= int(args.batch):
            pool.map(convert_to_tuple, itertools.izip(work.keys(), itertools.repeat(work)))
            work = {}
            if int(args.batch_sleep) > 0:
                logger.info('*** Sleeping for %d seconds before moving on to next batch ***', int(args.batch_sleep))
                general_utils.sleep_func(int(args.batch_sleep))
    if len(work) > 0:
        try:
            pool.map(convert_to_tuple, itertools.izip(work.keys(), itertools.repeat(work)))
        except KeyboardInterrupt:
            pool.terminate()
            general_utils.terminate(logger)
And here's what I've tried to do with threading, which doesn't work (when I assign a batch size larger than 1, the script simply gets stuck and I have to kill it forcefully):
def parse_args():
    """Define the argument parser, and the arguments to accept.."""
    global args, parser
    parser = MyParser(description=__doc__)
    parser.add_argument('-H', '--host', help='List of hosts to process, separated by "," and NO SPACES!')
    parser.add_argument('--batch', help='Do requests in batches', default=1)
    args = parser.parse_args()
    # If no arguments were passed, print the help file and exit with ERROR..
    if len(sys.argv) == 1:
        parser.print_help()
        print '\n\nERROR: No arguments passed!\n'
        sys.exit(3)

def do_work(node):
    logger.info('[{0}]'.format(node))
    try:
        ssh.connect(node)
        stdin, stdout, stderr = ssh.exec_command('hostname ; date')
        print stdout.read()
        ssh.close()
    except:
        print 'ERROR!'
        sys.exit(2)

def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

def iterate():
    for item in args.host.split(","):
        q.put(item)
    for i in range(int(args.batch)):
        t = Thread(target=worker)
        t.daemon = True
        t.start()
    q.join()

def main():
    parse_args()
    try:
        iterate()
    except KeyboardInterrupt:
        exit(1)
In the script log I see a WARNING generated by Paramiko as below:
2016-01-04 22:51:37,613 WARNING: Oops, unhandled type 3
I tried to Google this "unhandled type 3" error but didn't find anything related to my issue; the results talk about two-factor authentication or connecting with both a password and an SSH key at the same time, whereas I'm only loading the host keys without providing any password to the SSH client.
I would appreciate any help on this matter..
I managed to solve my problem using the parallel-ssh module.
Here's the code, fixed to do what I wanted:
def iterate_nodes_and_call_exec_func(nodes):
    """
    Get a dict as an argument, containing linux services (initd) as the keys,
    and a list of nodes on which the linux service needs to be checked.
    Iterate over the list of nodes to process,
    create a list of nodes that shouldn't exceed the batch size provided (or 1 if not provided).
    Then using the parallel-ssh module, call the restart_service func on x nodes in parallel (where x is the batch size)
    and provide the linux service (initd) to process.
    If batch_sleep arg was provided, call the sleep func and provide the batch_sleep argument between each batch.
    """
    for initd in nodes.keys():
        work = dict()
        work[initd] = []
        count = 0
        for node in nodes[initd]:
            count += 1
            work[initd].append(node)
            if len(work[initd]) == args.batch:
                restart_service(work[initd], initd)
                work[initd] = []
                if args.batch_sleep > 0 and count < len(nodes[initd]):
                    logger.info('*** Sleeping for %d seconds before moving on to next batch ***', args.batch_sleep)
                    general_utils.sleep_func(int(args.batch_sleep))
        if len(work[initd]) > 0:
            restart_service(work[initd], initd)

def restart_service(nodes, initd):
    """
    Get a list of nodes and linux service as an argument,
    then connect by Parallel SSH module to the nodes and run the service restart command..
    """
    command = 'service {0} restart'.format(initd)
    logger.info('Connecting to {0} to restart the {1} service...'.format(nodes, initd))
    try:
        client = pssh.ParallelSSHClient(nodes, pool_size=args.batch, timeout=10, num_retries=1)
        output = client.run_command(command, sudo=True)
        for node in output:
            for line in output[node]['stdout']:
                if client.get_exit_code(output[node]) == 0:
                    print '[{0}]{1}{2} {3}{4}'.format(node, Color.BOLD, Color.GREEN, line, Color.END)
                else:
                    print '[{0}]{1}{2} ERROR! {3}{4}'.format(node, Color.BOLD, Color.RED, line, Color.END)
                    logger.error('[{0}] Result of command {1} output: {2}'.format(node, command, line))
    except pssh.AuthenticationException:
        print "{0}{1}ERROR! SSH failed with Authentication Error. Make sure you run the script as root and try again..{2}".format(Color.BOLD, Color.RED, Color.END)
        logger.error('SSH Authentication failed, thrown error message to the user to make sure script is run with root permissions')
        sys.exit(2)
    except pssh.ConnectionErrorException as error:
        print("[{0}]{1}{2} ERROR! SSH failed with error: {3}{4}\n".format(error[1], Color.RED, Color.BOLD, error[3], Color.END))
        logger.error("[{0}] SSH Failed with error: {1}".format(error[1], error[3]))
        restart_service(nodes[nodes.index(error[1])+1:], initd)
    except KeyboardInterrupt:
        general_utils.terminate(logger)

def generate_nodes_by_initd_dict(nodes_list):
    """
    Get a list of nodes as an argument.
    Then by calling the get_initd function for each of the nodes,
    Build a dict based on linux services (initd) as keys and a list of nodes on which the initd
    needs to be processed as values. Then call the iterate_nodes_and_call_exec_func and provide the generated dict
    as its argument.
    """
    nodes = {}
    for node in nodes_list:
        initd = general_utils.get_initd(logger, args, node)
        if initd in nodes.keys():
            nodes[initd].append(node)
        else:
            nodes[initd] = [node, ]
    return iterate_nodes_and_call_exec_func(nodes)

def main():
    parse_args()
    try:
        general_utils.init_script('Service Restart', logger, log)
        log_args(logger, args)
        generate_nodes_by_initd_dict(general_utils.generate_nodes_list(args, logger, ['service', 'datacenter', 'lob']))
    except KeyboardInterrupt:
        general_utils.terminate(logger)
    finally:
        general_utils.wrap_up(logger)

if __name__ == '__main__':
    main()
In addition to using the pssh module, after a more thorough troubleshooting effort I was able to fix the original code that I posted in the question using the native threading module, by creating a new Paramiko client for every thread rather than sharing the same client across all threads.
So basically (only updating the do_work function from the original question), here's the change:
def do_work(node):
    logger.info('[{0}]'.format(node))
    try:
        ssh = paramiko.SSHClient()
        ssh.connect(node)
        stdin, stdout, stderr = ssh.exec_command('hostname ; date')
        print stdout.read()
        ssh.close()
    except:
        print 'ERROR!'
        sys.exit(2)
When done this way, the native Threading module works perfectly!

Python, How to break out of multiple threads

I am following one of the examples in a book I am reading ("Violent Python"): a zip-file password cracker that works from a dictionary. I have two questions about it. First, the book says to thread it, as I have done in the code, to increase performance, but when I timed it (I know time.time() is not great for timing) there was about a twelve-second difference in favor of not threading. Is this because it takes longer to start the threads? Second, if I do it without the threads, I can stop as soon as the correct value is found by printing the result and then calling exit(0). Is there a way to get the same result using threading, so that if I find the value I am looking for I can end all the other threads simultaneously?
import zipfile
from threading import Thread
import time

def extractFile(z, password, starttime):
    try:
        z.extractall(pwd=password)
    except:
        pass
    else:
        z.close()
        print('PWD IS ' + password)
        print(str(time.time()-starttime))

def main():
    start = time.time()
    z = zipfile.ZipFile('test.zip')
    pwdfile = open('words.txt')
    pwds = pwdfile.read()
    pwdfile.close()
    for pwd in pwds.splitlines():
        t = Thread(target=extractFile, args=(z, pwd, start))
        t.start()
        #extractFile(z, pwd, start)
    print(str(time.time()-start))

if __name__ == '__main__':
    main()
In CPython, the Global Interpreter Lock ("GIL") enforces the restriction that only one thread at a time can execute Python bytecode.
So in this application, it is probably better to use the map method of a multiprocessing.Pool, since every try is independent of the others:
import multiprocessing
import zipfile

def tryfile(password):
    rv = password
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password)
        except:
            rv = None
    return rv

with open('words.txt') as pwdfile:
    data = pwdfile.read()
pwds = data.split()

p = multiprocessing.Pool()
results = p.map(tryfile, pwds)
results = [r for r in results if r is not None]
This will start (by default) as many processes as your computer has cores. It will keep running tryfile() with a different password in these processes until the list pwds is exhausted, then gather the results and return them. The final list comprehension discards the None results.
Note that this code could be improved to shut down the map once the password is found; you'd probably have to use map_async and a shared variable in that case (a rough sketch follows below). It would also be nice to load the zipfile only once and share it.
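Here is one way that map_async-plus-shared-variable idea could look, as a rough, untested sketch (not part of the original answer): a multiprocessing.Value is handed to every worker through the pool initializer, workers that see the flag already set return immediately, and the parent still collects whatever results came back.

import multiprocessing
import zipfile

def init_worker(flag):
    global found                      # each worker keeps a handle to the shared flag
    found = flag

def tryfile(password):
    if found.value:                   # another worker already cracked it; skip the work
        return None
    rv = password
    with zipfile.ZipFile('test.zip') as z:
        try:
            z.extractall(pwd=password.encode())   # pwd must be bytes on Python 3
        except Exception:
            rv = None
        else:
            found.value = True
    return rv

if __name__ == '__main__':
    with open('words.txt') as pwdfile:
        pwds = pwdfile.read().split()
    found_flag = multiprocessing.Value('b', False)
    pool = multiprocessing.Pool(initializer=init_worker, initargs=(found_flag,))
    async_result = pool.map_async(tryfile, pwds)
    pool.close()
    pool.join()
    hits = [r for r in async_result.get() if r is not None]
    print(hits)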
This code is slow because Python has a Global Interpreter Lock, which means only one thread can execute Python bytecode at a time. That tends to make CPU-bound multithreaded code run no faster (often slower) than serial code in Python. If you want true parallelism, you have to use the multiprocessing module.
To break out of the threads and get the result, you can use os._exit(1). First, import the os module at the top of your file:
import os
Then, change your extractFile function to use os._exit(1):
def extractFile(z, password, starttime):
try:
z.extractall(pwd=password)
except:
pass
else:
z.close()
print('PWD IS ' + password)
print(str(time.time()-starttime))
os._exit(1)
