Capture check_output value - python

I am trying to capture the return value of check_output instead of having it automatically print to the command line. Unfortunately, my solution is not working and I'm not sure why. I've included my code and its output:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
from multiprocessing import Pool
from subprocess import check_output, CalledProcessError

def job(cmd):
    result = ""
    try:
        result = check_output(cmd.split())  # Split string into list.
        print("job result length = {0}".format(len(result)), file=sys.stdout)
    except CalledProcessError as error:
        raise Exception("Exit status of the child process: {0}\
            Command used to spawn child process: {1}\
            Output of the child process: {2}".format(error.returncode, error.cmd, error.output))

def main():
    # Sets up a process pool. Defaults to number of cores.
    # Each input gets passed to job and processed in a separate process.
    p = Pool()
    result = []
    try:
        # cmd_list is just a list of system commands which have been verified to work.
        result = list(p.imap_unordered(job, cmd_list))
        print("main result length = {0}".format(len(result)), file=sys.stdout)
        print("{0}".format(result), file=sys.stdout)
    except Exception as error:
        print("Error: {0}. Aborting...".format(error), file=sys.stderr)
        p.close()
        p.terminate()
    else:
        p.close()
        p.join()

if __name__ == '__main__':
    main()
Output
In addition to the output of each command executed by check_output, my print statements reveal some unexpected results:
job result length = 0
job result length = 0
main result length = 2
[None, None]
I would expect job result length to equal 2 and result to contain the return values of the child processes.

result is a local variable in job. Either return it:

def job(cmd):
    # something goes here
    return result

Or make it global:

result = ""

def job(cmd):
    global result
    # something goes here
    result = ...  # whatever it shall be

Or parameterize it:

def job(cmd, result):
    result = ...  # whatever it shall be
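Since each job call runs in a separate worker process, returning the value and letting the pool carry it back (as main already does via imap_unordered) is the most direct fix; a minimal sketch, assuming cmd_list is defined as in the question:

from multiprocessing import Pool
from subprocess import check_output, CalledProcessError

def job(cmd):
    # The returned bytes travel back to the parent through the pool.
    return check_output(cmd.split())

def main():
    with Pool() as p:
        # result now holds the captured output of each command in cmd_list.
        result = list(p.imap_unordered(job, cmd_list))
        print("main result length = {0}".format(len(result)))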

Related

how to "poll" python multiprocess pool apply_async

I have a task function like this:
def task(s):
    # doing some thing
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
# using pickle to save res every 30s
I need to process a lot of data and I don't care about the output order of the results. Due to the long running time, I need to save the current progress regularly. Now I'll change it to multiprocessing:
pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
# using pickle to save res every 30s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0):
1. Will the main process be blocked at the first res.append(i.get())?
2. If p1 finishes task(1) while p0 is still dealing with task(0), will p1 move on to task(4) or later?
3. If the answer to the first question is yes, how can I get the other results in advance and fetch the result of task(0) last?
I updated my code, but the main process got blocked somewhere while other processes were still working on tasks. What's wrong? Here is the core of the code:
futuresList = []
with concurrent.futures.ProcessPoolExecutor(4) as ex:
    for i in self.inBuffer:
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList):
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30:
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first the code ran normally: both the log file and the output printed by warpper were updated. But at some point I found that the log file had not been updated for a long time (several hours, while a single warpper call should take no more than 120s), even though warpper was still printing information (until I killed the process it printed about 100 messages without any update to the log file). Also, the timestamp of the output .pkl file didn't change. Here is the implementation of outputPickle():
def outputPickle(self):
    if os.path.exists(os.path.join(self.wordDir, self.outFile)):
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The log "message by warpper" can be printed at most once every time the warpper is called
Yes
Yes, since the tasks are submitted asynchronously. Also, p1 (or another worker) will take another chunk of data if the size of the input iterable is larger than the maximum number of processes/workers.
"... how to get other results in advance"
One of the convenient options is to rely on concurrent.futures.as_completed which will return the results as they are completed:
import time
import concurrent.futures

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # processing the earlier results: as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to use a callback on the apply_async(func[, args[, kwds[, callback[, error_callback]]]]) call; the callback accepts a single argument, the returned result of the function. In that callback you can process the result in a minimal way (considering that it's tied to a single argument/result from a concrete function). The general scheme looks as follows:
from multiprocessing import Pool

def res_callback(v):
    # ... processing result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        # func as defined in the previous example
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # await for tasks finished
But that scheme would still require you to somehow await (get() the results of) the submitted tasks.
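One way to do that wait, sketched under the assumption that func and res_callback are the ones defined above and that the callback has already handled each result, is to call wait() on every AsyncResult before leaving the pool block:

from multiprocessing import Pool

if __name__ == '__main__':
    data = range(1, 5)
    with Pool(4) as pool:
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # Block until every submitted task has finished; the results themselves
        # were already handled inside res_callback.
        for t in tasks:
            t.wait()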

Python: store results of ProcessPoolExecutor

I'm very new to parallel processing with concurrent.futures. The code seems to work, but I am not sure how to store the result of each process so that I can mark the build as failed at the end if any process's return value is non-zero.
I tried to create a list (exit_status) and append the results to it, but that raises an IndexError. What can I do to fix this?
#!/usr/bin/env python3
import concurrent.futures
import sys
import shutil
import os
import glob
import multiprocessing as mp
import json
from os import path

def slave(path1, path2, target):
    os.makedirs(target)
    shutil.copy(path1, target)
    shutil.copy(path2, target)
    os.system(<Login command>)
    os.system(<Image creation command>)
    os.system(<Copy to Other slaves or NFS>)
    # If any one of the above operations or commands fails for any of the processes,
    # the script should return 1 at the end of the execution or fail the build at last.

def main():
    processed = {}
    exit_status = []
    with open('example.json', 'r') as f:
        data = json.load(f)
    for value in data.items():
        for line in value[1]:
            if line.endswith('.zip'):
                targz = line
            elif line.endswith('.yaml'):
                yaml = line
        processed[targz] = yaml
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for id, (path2, path1) in enumerate(processed.items(), 1):
            target = path.join("/tmp", "dir" + str(id))
            ret = executor.submit(slave, path1, path2, target)
            exit_status.append(ret.result())
    for i in exit_status:
        print("##########Result status: ", i)

if __name__ == "__main__":
    mp.set_start_method('spawn')
    main()
exit_status list's output:
##########Result status: None
##########Result status: None
Re: comments
If you want to get the result of a system call in order to act on the results of it, using subprocess.run is much more flexible and powerful than os.system. Additionally, if you actually want to perform the operations in parallel, you can't wait on result() after each task. Otherwise you're only ever doing one thing at a time. Better to submit all the tasks, and collect the Future objects. Then you can iterate over those and wait on each result() now that you've submitted all the work you want the executor to do.
def target_func(path1, path2, target):
    # ...
    # instead of os.system, use subprocess.run
    # you can inspect the stdout from the process
    complete_process = subprocess.run(<Login command>, text=True, capture_output=True)
    if "success" not in complete_process.stdout:
        return "uh-oh"
    # you can also just check the return value (0 typically means clean exit)
    if subprocess.run(<Image creation command>).returncode != 0:
        return "uh-oh"
    # or you can tell `run` to generate an error if the returncode is non-zero
    try:
        subprocess.run(<Copy to Other slaves or NFS>, check=True)
    except subprocess.CalledProcessError:
        return "uh-oh"
    return "we did it!"

def main():
    # ...
    # ...
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for id, (path2, path1) in enumerate(processed.items(), 1):
            target = path.join("/tmp", "dir" + str(id))
            ret = executor.submit(target_func, path1, path2, target)
            exit_status.append(ret)
    for i in exit_status:
        print("##########Result status: ", i.result())

TypeError: 'MapResult' object is not iterable using pathos.multiprocessing

I'm running a spell correction function on a dataset I have. I used from pathos.multiprocessing import ProcessingPool as Pool to do the job. Once the processing is done, I'd like to actually access the results. Here is my code:
import codecs
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize
from pathos.multiprocessing import ProcessingPool as Pool

class SpellCorrect():
    def load_data(self, path_1):
        with codecs.open(path_1, "r", "utf-8") as file:
            data = file.read()
        return sent_tokenize(data)

    def correct_spelling(self, data):
        data = TextBlob(data)
        return str(data.correct())

    def run_clean(self, path_1):
        pool = Pool()
        data = self.load_data(path_1)
        return pool.amap(self.correct_spelling, data)

if __name__ == "__main__":
    path_1 = "../Data/training_data/training_corpus.txt"
    SpellCorrect = SpellCorrect()
    result = SpellCorrect.run_clean(path_1)
    print(result)
    result = " ".join(temp for temp in result)
    with codecs.open("../Data/training_data/training_data_spell_corrected.txt", "a", "utf-8") as file:
        file.write(result)
If you look at the main block, when I do print(result) I get an object of type <multiprocess.pool.MapResult object at 0x1a25519f28>.
I tried to access the results with result = " ".join(temp for temp in result), but then I get the following error: TypeError: 'MapResult' object is not iterable. I've tried casting it to a list with list(result), but I still get the same error. What can I do to fix this?
The multiprocess.pool.MapResult object is not iterable, as it inherits from AsyncResult and has only the following methods:

wait([timeout])
    Wait until the result is available or until timeout seconds pass. This method always returns None.

ready()
    Return whether the call has completed.

successful()
    Return whether the call completed without raising an exception. Will raise AssertionError if the result is not ready.

get([timeout])
    Return the result when it arrives. If timeout is not None and the result does not arrive within timeout seconds then TimeoutError is raised. If the remote call raised an exception then that exception will be reraised as a RemoteError by get().
You can check the examples of how to use the get() function here:
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
from multiprocessing import Pool, TimeoutError
import time
import os

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)               # start 4 worker processes

    # print "[0, 1, 4,..., 81]"
    print(pool.map(f, range(10)))

    # print same numbers in arbitrary order
    for i in pool.imap_unordered(f, range(10)):
        print(i)

    # evaluate "f(20)" asynchronously
    res = pool.apply_async(f, (20,))       # runs in *only* one process
    print(res.get(timeout=1))              # prints "400"

    # evaluate "os.getpid()" asynchronously
    res = pool.apply_async(os.getpid, ())  # runs in *only* one process
    print(res.get(timeout=1))              # prints the PID of that process

    # launching multiple evaluations asynchronously *may* use more processes
    multiple_results = [pool.apply_async(os.getpid, ()) for i in range(4)]
    print([res.get(timeout=1) for res in multiple_results])

    # make a single worker sleep for 10 secs
    res = pool.apply_async(time.sleep, (10,))
    try:
        print(res.get(timeout=1))
    except TimeoutError:
        print("We lacked patience and got a multiprocessing.TimeoutError")

Catching Errors in MultiProcessing Pool Map

I have a Python code which uses multiprocessing Pool map. I am spawning multiple children from map, each of them reads a separate file, and I collect them in the end. My goal is to have a pandas dataframe at the end that is a concatenation of all the output from the children, with duplicates dropped. I use this dataframe to do more processing (the rest of the code seems unrelated to the question I ask here, so I am omitting that part for brevity). This code runs periodically at the end of the week with new input files to read every time. Sometimes there are errors in the files the children read, like null values in integer columns, or missing files. If any of these errors occur, I want the main script to die, ideally as soon as possible. I do not know how to make this happen in the most efficient way.
I have tried, in turn:
1- Making the child die by raising SystemExit(1) if it encounters an error. I couldn't make the parent die.
2- Making the child return an empty value or pandas dataframe in case of an error using try/except blocks. I couldn't detect it properly in the parent.
3- Using map_async with callback functions instead of map.
The last one seems to work. However, I am not sure if this is the correct and most efficient way of doing this, as I do not use any output from the error callback function. Any comments and suggestions are appreciated.
Edit:
Sample input file: a.txt:
shipmentId,processing_time_epoch
4001,1455408024132
4231,1455408024373
b.txt:
shipmentId,processing_time_epoch
5001,1455408024132
4231,1455408024373
Desired final processing_time pandas dataframe:
shipmentId,processing_time_epoch
4001,1455408024132
4231,1455408024373
5001,1455408024132
My code:
import pandas as pd
import csv,glob,datetime,sys,pdb,subprocess,multiprocessing,io,os,shlex
from itertools import repeat

def myerrorcallback(x):
    print('There seems to be an error in the child. Parent: Please die.')
    return

def mycallback(x):
    print('Returned successfully.')
    return

def PrintException():
    exc_type, exc_obj, tb = sys.exc_info()
    f = tb.tb_frame
    lineno = tb.tb_lineno
    filename = f.f_code.co_filename
    print('EXCEPTION IN ({}, LINE {} ): {} ({})'.format(filename, lineno, exc_obj, exc_type))
    return

# ===================================================================
def Read_Processing_Times_v1(full_path_name):
    try:
        df = pd.read_csv(full_path_name, dtype={'shipmentId': pd.np.int64, 'processing_time_epoch': pd.np.int64}, usecols=['shipmentId', 'processing_time_epoch'])
        return df.drop_duplicates()
    except:
        print("exception in file " + full_path_name)
        PrintException()
        raise(SystemExit(1))

# ===================================================================
def Read_Processing_Times_v2(full_path_name):
    try:
        df = pd.read_csv(full_path_name, dtype={'shipmentId': pd.np.int64, 'processing_time_epoch': pd.np.int64}, usecols=['shipmentId', 'processing_time_epoch'])
        return df.drop_duplicates()
    except:
        print("exception in file " + full_path_name)
        PrintException()
        return pd.DataFrame()

# ===================================================================
def Read_Processing_Times_v3(full_path_name):
    df = pd.read_csv(full_path_name, dtype={'shipmentId': pd.np.int64, 'processing_time_epoch': pd.np.int64}, usecols=['shipmentId', 'processing_time_epoch'])
    return df.drop_duplicates()

# ===========================================================================================================================
# Top-level
if __name__ == '__main__':
    mycols = ['shipmentId', 'processing_time_epoch']
    mydtypes = {'shipmentId': pd.np.int64, 'processing_time_epoch': pd.np.int64}

    # The following two files should not give an error:
    # files_to_read=["a.txt","b.txt"]
    # The following two files should give an error, as a2.txt does not exist:
    files_to_read = ["a2.txt", "b.txt"]

    # version 1: Works with the correct files. Does not work if one of the children has an error:
    # the child dies, the parent does not and waits forever.
    # print("version 1")
    # pool = multiprocessing.Pool(15)
    # processing_times = pool.map(Read_Processing_Times_v1, files_to_read)
    # pool.close()
    # pool.join()
    # processing_times = pd.concat(processing_times, ignore_index=True).drop_duplicates()
    # print(processing_times)

    # version 2: Does not work. Don't know how to fix it. The idea is to make the child return something,
    # and catch the error in the parent.
    # print("version 2")
    # pool = multiprocessing.Pool(15)
    # processing_times = pool.map(Read_Processing_Times_v2, files_to_read)
    # pool.close()
    # pool.join()
    # if(processing_times.count(pd.DataFrame()) > 0):
    #     print("SLAM times are not read properly.")
    #     raise SystemExit(1)

    # version 3:
    print("version 3")
    pool = multiprocessing.Pool(15)
    processing_times = pool.map_async(Read_Processing_Times_v3, files_to_read, callback=mycallback, error_callback=myerrorcallback)
    pool.close()
    pool.join()
    processing_times = processing_times.get()
    processing_times = pd.concat(processing_times, ignore_index=True).drop_duplicates()
    print("success!")
    # Do more processing with processing_times after this line...
I think you could accomplish what you want by using the concurrent.futures module (https://docs.python.org/3/library/concurrent.futures.html). Below is an example from the doc page that I modified to be closer to your problem. In the example, if work_func returns False, that is considered an error and the program will terminate.
import sys
import concurrent.futures
import random
import time

def work_func(input_val):
    """
    Do some work. Here a False value would mean there is an error
    """
    time.sleep(0.5)
    return random.choice([True, True, True, True, False])

if __name__ == "__main__":
    # We can use a with statement to ensure processes are cleaned up promptly
    with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its input value
        future_to_result = {executor.submit(work_func, val): val for val in range(30)}
        # iterate over the futures as they become available
        for future in concurrent.futures.as_completed(future_to_result):
            # get the input value from the dict
            input_val = future_to_result[future]
            # now retrieve the result from the future
            try:
                data = future.result()
            except Exception as exc:
                print(input_val, exc)
                print('Something exceptional happened')
            else:
                print(input_val, data)
                if not data:
                    print('Error - exiting')
                    sys.exit(1)
Sample output:
0 True
1 True
2 True
3 False
Error - exiting
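One caveat with sys.exit(1) inside the with block is that the executor's exit handler still waits for work that has already been submitted; on Python 3.9+ the not-yet-started futures can be cancelled first so the program stops as soon as the currently running tasks finish. A possible tweak to the error branch of the example above:

if not data:
    print('Error - exiting')
    # Python 3.9+: drop futures that have not started yet, then exit;
    # already-running workers still finish before the process terminates.
    executor.shutdown(wait=False, cancel_futures=True)
    sys.exit(1)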

Pythonic way of passing values between processes

I need a simple way to pass the stdout of a subprocess as a list to another function using multiprocessing:
The first function, which invokes the subprocess:
def beginRecvTest():
    command = ["receivetest", "-f=/dev/pcan33"]
    incoming = Popen(command, stdout=PIPE)
    processing = iter(incoming.stdout.readline, "")
    lines = list(processing)
    return lines
The function that should receive lines:
def readByLine(lines):
    i = 0
    while (i < len(lines)):
        system("clear")
        if(lines[i][0].isdigit()):
            line = lines[i].split()
            dictAdd(line)
        else:
            next
        print ; print "-" *80
        for _i in mydict.keys():
            printMsg(mydict, _i)
        print "Keys: ", ; print mydict.keys()
        print ; print "-" *80
        sleep(0.3)
        i += 1
and the main from my program:
if __name__ == "__main__":
dataStream = beginRecvTest()
p = Process(target=dataStream)
reader = Process(target=readByLine, args=(dataStream,))
p.start()
reader.start()
I've read up on using queues, but I don't think that's exactly what I need.
The subprocess returns infinite data, so some people have suggested using tempfile, but I am totally confused about how to do this.
At the moment the script only returns the first line read, and all my attempts at looping the beginRecvTest() function have ended in errors.
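The asker suspects queues are not quite what they need, but for reference, a minimal Python 3 style sketch of the usual pattern for an effectively infinite stream is to forward each stdout line through a multiprocessing.Queue instead of building a full list first:

from multiprocessing import Process, Queue
from subprocess import Popen, PIPE

def producer(q):
    # Run the subprocess and push each stdout line onto the queue as it arrives.
    incoming = Popen(["receivetest", "-f=/dev/pcan33"], stdout=PIPE)
    for line in iter(incoming.stdout.readline, b""):
        q.put(line)
    q.put(None)  # sentinel: no more data

def consumer(q):
    # Pull lines off the queue until the sentinel arrives.
    while True:
        line = q.get()
        if line is None:
            break
        # process the line here, e.g. the dictAdd/printMsg logic from readByLine
        print(line)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=producer, args=(q,))
    reader = Process(target=consumer, args=(q,))
    p.start()
    reader.start()
    p.join()
    reader.join()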
