Multiprocessing a function with parameters that are iterated through - Python

I'm trying to improve the speed of my program, and I decided to use multiprocessing!
The problem is I can't seem to find any way to use Pool (I think this is what I need) with my function.
Here is the code I am dealing with:
def dataLoading(output):
    name = ""
    link = ""
    upCheck = ""
    isSuccess = ""
    for i in os.listdir():
        with open(i) as currentFile:
            data = json.loads(currentFile.read())
            try:
                name = data["name"]
                link = data["link"]
                upCheck = data["upCheck"]
                isSuccess = data["isSuccess"]
            except:
                print("error in loading data from config: improper naming or formating used")
            output[name] = [link, upCheck, isSuccess]
#working
def userCheck(link, user, isSuccess):
    link = link.replace("<USERNAME>", user)
    isSuccess = isSuccess.replace("<USERNAME>", user)
    html = requests.get(link, headers=headers)
    page_source = html.text
    count = page_source.count(isSuccess)
    if count > 0:
        return True
    else:
        return False
I have a parent function that runs these two together, but I don't think I need to show the whole thing, just the part that iterates over the data:
for i in configData:
    data = configData[i]
    link = data[0]
    print(link)
    upCheck = data[1]  # just for future use
    isSuccess = data[2]
    if userCheck(link, username, isSuccess) == True:
        good.append(i)
You can see how I pass all of the data in. How would I be able to use multiprocessing to do this when I am iterating through the dictionary to collect multiple parameters?

I like to use mp.Pool().map. I think it is the easiest and most straightforward approach, and it handles most multiprocessing cases. So how does map work? For starters, we have to keep in mind that mp creates workers; each worker receives a copy of the namespace (yes, the whole thing), then each worker works on what it is assigned and returns. Hence, doing something like "updating a global variable" while they work doesn't work, since each worker receives its own copy of the global variable and none of the workers communicate with each other. (If you want communicating workers you need to use mp.Queue's and such; it gets complicated.) Anyway, here is map in action:
from multiprocessing import Pool

t = 'abcd'

def func(s):
    return t[int(s)]

results = Pool().map(func, range(4))
Each worker received a copy of t, func, and the portion of range(4) it was assigned. The workers are then automatically tracked, and everything is cleaned up in the end by Pool.
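One caveat worth knowing (a general multiprocessing fact, not specific to this question): on platforms that spawn fresh workers instead of forking (Windows, and macOS by default on recent Python versions), the Pool call must sit under an import guard, or every worker will re-execute it when it re-imports the module:

from multiprocessing import Pool

t = 'abcd'

def func(s):
    return t[int(s)]

# the guard keeps spawned workers, which re-import this module,
# from recursively creating pools of their own
if __name__ == '__main__':
    results = Pool().map(func, range(4))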
Something like your dataLoading won't work very well as-is; we need to modify it. I also cleaned up the code a little.
def loadfromfile(file):
    with open(file) as f:
        data = json.loads(f.read())
    items = [data.get(k, "") for k in ['name', 'link', 'upCheck', 'isSuccess']]
    return items[0], items[1:]

output = dict(Pool().map(loadfromfile, os.listdir()))
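For the userCheck half of the question, where each call needs several parameters, Pool.starmap (Python 3.3+) unpacks argument tuples. Here is a rough sketch reusing the question's configData, username, and userCheck names (assumed from the code above, not tested against the real setup):

from multiprocessing import Pool

def checkOne(name, link, isSuccess):
    # return (name, bool) so each result can be matched back to its config entry
    return name, userCheck(link, username, isSuccess)

if __name__ == '__main__':
    args = [(name, vals[0], vals[2]) for name, vals in configData.items()]
    with Pool() as pool:
        results = pool.starmap(checkOne, args)
    good = [name for name, ok in results if ok]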

Related

Best way to output concurrent task results in shell on the go

This is maybe a very basic question, and I can think of solutions myself, but I wondered if there is a more elegant one I don't know of (quick googling didn't bring up anything useful).
I wrote a script to communicate with a remote device. However, now that I have more than one device of that type, I thought I would just run the communication with concurrent futures and handle them all simultaneously:
import concurrent.futures
from itertools import repeat

with concurrent.futures.ThreadPoolExecutor(20) as executor:
    executor.map(device_ctl, ids, repeat(args))
So it just starts up to 20 threads of device_ctl, each with its respective ID and the same args. device_ctl prints some results, but since the calls all run in parallel, the output gets mixed up and looks messy. Ideally I could have one line per ID that shows the current state of the communication and gets updated once it changes, e.g.:
Dev1 Connecting...
Dev2 Connected! Status: Idle
Dev3 Connected! Status: Updating
However, I don't really know how to solve this nicely. I can think of a status list that gets assembled outside of the threads into one status string, which gets updated frequently, but it feels like there could be a simpler method. Ideas?
Since there was no good answer, I made my own solution, which is quite compact but effective. I define a class that I call globally; each thread populates it or updates a value based on its ID, where the ID is meant to be the same list entry as the one given to the thread. Here is a simple example of how to use it:
class collect:
    ids = []
    outs = []
    LINE_UP = "\033[1A"
    LINE_CLEAR = "\x1b[2K"
    printed = 0

    def init(list):
        collect.ids = [i for i in list]
        collect.outs = ["" for i in list]
        collect.printall()

    def write(id, out):
        if id not in collect.ids:
            collect.ids.append(id)
            collect.outs.append(out)
        else:
            collect.outs[collect.ids.index(id)] = out

    def writeout(id, out):
        if id not in collect.ids:
            collect.ids.append(str(id))
            collect.outs.append(str(out))
        else:
            collect.outs[collect.ids.index(id)] = str(out)
        collect.printall()

    def append(id, out):
        if id not in collect.ids:
            collect.ids.append(str(id))
            collect.outs.append(str(out))
        else:
            collect.outs[collect.ids.index(id)] += str(out)

    def appendout(id, out):
        if id not in collect.ids:
            collect.ids.append(id)
            collect.outs.append(out)
        else:
            collect.outs[collect.ids.index(id)] += str(out)
        collect.printall()

    def read(id):
        return collect.outs[collect.ids.index(str(id))]

    def readall():
        return collect.outs, "\n".join(collect.outs)

    def printall(filter=""):
        if collect.printed > 0:
            print(collect.LINE_CLEAR + collect.LINE_UP * len(collect.ids), end="")
        print(
            "\n".join(
                [
                    collect.ids[i] + "\t" + collect.outs[i] + " " * 30
                    for i in range(len(collect.outs))
                    if filter in collect.ids[i]
                ]
            )
        )
        collect.printed = len(collect.ids)
def device_ctl(id, args):
    collect.writeout(id, "Connecting...")
    if args.connected:
        collect.writeout(id, "Connected")

collect.init(ids)
with concurrent.futures.ThreadPoolExecutor(20) as executor:
    executor.map(device_ctl, ids, repeat(args))
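If the threads update very frequently, the list mutations and the redraw can still interleave. A hypothetical wrapper (not part of the class above) can serialize them with a lock:

import threading

_print_lock = threading.Lock()

def writeout_locked(id, out):
    # hypothetical wrapper: only one thread at a time touches the
    # status lists and redraws the block
    with _print_lock:
        collect.writeout(id, out)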

Python Multiprocessing.Pool in a class

I am trying to change my code to a more object-oriented format. In doing so, I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions; on the other, I believe multiprocessing creates a copy of the code which the original instance would not have access to. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all the manipulated values in the nested classes.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[today][report] = {}
            tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to process the list of reports in parallel AND, as processWorker manipulates each report object, store those results. As I am fairly new to this, I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really be returning anything; I just need the objects to finish being processed, and they should then hold the relevant information within them.
Thanks!
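A minimal sketch of one way to get there (assuming the goal is simply for the parent to end up with the updated Report objects): since each worker mutates its own pickled copy, the worker has to return the object, and the parent rebinds the results. Note that processWorker moves to module level here so the pool can pickle it, and the 'today' key is just a stand-in for <todays_date>:

import multiprocessing

def processWorker(report):
    # runs in a child process: this mutates a *copy* of the report...
    report.name = 'foo'
    report.path = '/path/to/report'
    return report  # ...so the copy must travel back as the return value

class Transfer(object):
    def __init__(self, reports):
        self.masterReportDictionary = {'today': reports}

    def processReports(self):
        pool = multiprocessing.Pool()
        # map() sends each Report to a worker and collects the mutated
        # copies; rebinding the list is what keeps the changes
        self.masterReportDictionary['today'] = pool.map(
            processWorker, self.masterReportDictionary['today'])
        pool.close()
        pool.join()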

Accessible variables at the root of a python script

I've declared a number of variables at the start of my script, as I'm using them in a number of different methods ("functions" in Python?). When I try to access them, I can't seem to get their value, or set them to another value for that matter. For example:
baseFile = open('C:/Users/<redacted>/Documents/python dev/ATM/Data.ICSF', 'a+')
secFile = open('C:/Users/<redacted>/Documents/python dev/ATM/security.ICSF', 'a+')

def usrInput(raw_input):
    if raw_input == "99999":
        self.close(True)
    else:
        identity = raw_input

def splitValues(source, string):
    if source == "ident":
        usrTitle = string.split('>')[1]
        usrFN = string.split('>')[2]
        usrLN = string.split('>')[3]
        x = string.split('>')[4]
        usrBal = Decimal(x)
        usrBalDisplay = str(locale.currency(usrBal))
    elif source == "sec":
        usrPIN = string.split('>')[1]
        pinAttempts = string.split('>')[2]

def openAccount(identity):
    #read all the file first. it's f***ing heavy but it'll do here.
    plString = baseFile.read()
    xList = plString.split('|')
    parm = str(identity)
    for i in xList:
        substr = i[0:4]
        if parm == substr:
            print "success"
            usrString = str(i)
        else:
            lNumFunds = lNumFunds + 1
    splitValues("ident", usrString)
When I place baseFile and secFile in the openAccount method, I can access the respective files as normal. However, when I place them at the root of the script, as in the example above, I can no longer access the files, although I can still "see" the variables.
Is there a reason for this? For reference, I am using Python 2.7.
methods ("Functions" in python?)
"function" when they "stand free"; "methods" when they are members of a class. So, functions in your case.
What you describe definitely works in Python. Hence, my diagnosis is that you already read something from the file elsewhere before you call openAccount, so the read pointer is no longer at the beginning of the file.
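A quick demonstration of the shared read pointer (a minimal demo, not the asker's code; note too that with 'a+' some platforms start the read position at the end of the file, so an explicit seek is the safe fix):

f = open('Data.ICSF', 'a+')
print f.read()   # whole file, if the pointer happened to be at the start
print f.read()   # empty string: the pointer is now at end-of-file
f.seek(0)        # rewind before re-reading
print f.read()   # whole file again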

PyQt and general Python: can this be considered a correct approach for coding?

I have a dialog window containing check-boxes; when each of them is checked, a particular class needs to be instantiated and a task run on a separate thread (one for each check-box). I have 14 check-boxes whose .isChecked() property I would have to query, and understandably, checking the returned boolean for each of them is not efficient and requires a lot more code.
Hence I decided to get all the child items corresponding to check-box elements, take just those that are checked, append their names to a list, and loop through the list, matching each name against a dictionary whose keys are the check-box names and whose values are the corresponding classes to instantiate.
EXAMPLE:
# class dictionary
self.summary_runnables = {'dupStreetCheckBox': [DupStreetDesc(), 0],
                          'notStreetEsuCheckBox': [StreetsNoEsuDesc(), 1],
                          'notType3CheckBox': [Type3Desc(False), 2],
                          'incFootPathCheckBox': [Type3Desc(True), 2],
                          'dupEsuRefCheckBox': [DupEsuRef(True), 3],
                          'notEsuStreetCheckBox': [NoLinkEsuStreets(), 4],
                          'invCrossRefCheckBox': [InvalidCrossReferences()],
                          'startEndCheckBox': [CheckStartEnd(tol=10), 8],
                          'tinyEsuCheckBox': [CheckTinyEsus("esu", 1)],
                          'notMaintReinsCheckBox': [CheckMaintReins()],
                          'asdStartEndCheckBox': [CheckAsdCoords()],
                          'notMaintPolysCheckBox': [MaintNoPoly(), 16],
                          'notPolysMaintCheckBox': [PolyNoMaint()],
                          'tinyPolysCheckBox': [CheckTinyEsus("rd_poly", 1)]}
# looping through list
self.long_task = QThreadPool(None).globalInstance()
self.long_task.setMaxThreadCount(1)
start_report = StartReport(val_file_path)
end_report = EndReport()
# start_report.setAutoDelete(False)
# end_report.setAutoDelete(False)
end_report.signals.result.connect(self.log_progress)
end_report.signals.finished.connect(self.show_finished)
# end_report.setAutoDelete(False)
start_report.signals.result.connect(self.log_progress)
self.long_task.start(start_report)
# print str(self.check_boxes_names)
for check_box_name in self.check_boxes_names:
    run_class = self.summary_runnables[check_box_name]
    if run_class[0].__class__.__name__ == 'CheckStartEnd':
        run_class[0].tolerance = tolerance
    runnable = run_class[0]
    runnable.signals.result.connect(self.log_progress)
    self.long_task.start(runnable)
self.long_task.start(end_report)
Here is an example of a runnable (even if some of them use different global functions).
I can't post the global functions that write content to the file, as there are too many, and not all 14 tasks execute the same type of function; the arguments of these functions are int keys into other dictionaries that contain the report's static content and the SQL queries that return the report's main dynamic content.
class StartReport(QRunnable):
    def __init__(self, file_path):
        super(StartReport, self).__init__()
        # open the db connection in thread
        db.open()
        self.signals = GeneralSignals()
        # self.simple_signal = SimpleSignal()
        # print self.signals.result
        self.file_path = file_path
        self.task = "Starting Report"
        self.progress = 1
        self.org_name = org_name
        self.user = user
        self.report_title = "Validation Report"
        print "instantiation of start report "

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if self.file_path is None:
            print "I started and found file none "
            return
        else:
            global report_file
            # create the file and print the header
            report_file = open(self.file_path, 'wb')
            report_file.write(str(self.report_title) + ' for {0} \n'.format(self.org_name))
            report_file.write('Created on : {0} at {1} By : {2} \n'.format(datetime.today().strftime("%d/%m/%Y"),
                                                                           datetime.now().strftime("%H:%M"),
                                                                           str(self.user)))
            report_file.write(
                "------------------------------------------------------------------------------------------ \n \n \n \n")
            report_file.flush()
            os.fsync(report_file.fileno())
class EndReport(QRunnable):
    def __init__(self):
        super(EndReport, self).__init__()
        self.signals = GeneralSignals()
        self.task = "Finishing report"
        self.progress = 100

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if report_file is not None:
            # write footer and close file
            report_file.write("\n \n \n")
            report_file.write("---------- End of Report -----------")
            report_file.flush()
            os.fsync(report_file.fileno())
            report_file.close()
            self.signals.finished.emit()
            # TODO: checking whether opening a db connection in thread might affect the db on the GUI
            # if db.isOpen():
            #     db.close()
        else:
            return
class DupStreetDesc(QRunnable):
    """
    duplicate street description report section creation
    :return: void if the report is to text
             list[string] if the report is to screen
    """
    def __init__(self):
        super(DupStreetDesc, self).__init__()
        self.signals = GeneralSignals()
        self.task = "Checking duplicate street descriptions..."
        self.progress = 16.6

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if report_file is None:
            print "report file is none "
            # items_list = write_content(0, 0, 0, 0)
            # for item in items_list:
            #     self.signals.list.emit(item)
        else:
            write_content(0, 0, 0, 0)
Now, I have used this approach before and it has always worked fine without multiprocessing. In this case it works to some extent: I can run the tasks the first time, but if I try to run them a second time I get the following Python error:
self.long_task.start(run_class[0])
RuntimeError: wrapped C/C++ object of type DupStreetDesc has been deleted
I tried to use run_class[0].setAutoDelete(False) before running them in the loop, but PyQt crashes with a minidump error (I am running the code in QGIS) and the program exits, with little chance to understand what has happened.
On the other hand, if I run my classes separately, checking each check-box with an if/else statement, then it works fine: I can run the tasks again and the C++ classes are not deleted. But that isn't a nice coding approach, at least from my very little experience.
Is there anyone out there who can advise a different approach to make this run smoothly without too many lines of code? Or who knows a more efficient pattern for handling this problem, which I think must be quite common?
It seems that you should create a new instance of each runnable, and allow Qt to automatically delete it. So your dictionary entries could look like this:
'dupStreetCheckBox': [lambda: DupStreetDesc(), 0],
and then you can do:
for check_box_name in self.check_boxes_names:
    run_class = self.summary_runnables[check_box_name]
    runnable = run_class[0]()
    runnable.signals.result.connect(self.log_progress)
    self.long_task.start(runnable)
I don't know why setAutoDelete does not work (assuming you are calling it before starting the threadpool). I suppose there might be a bug, but it's impossible to be sure without having a fully-working example to test.
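If the lambdas feel noisy, the same factory idea can be spelled with functools.partial (a variant sketch, not the original answer's code): each entry stores a callable that builds a fresh runnable, so Qt is free to delete each instance after it runs.

from functools import partial

summary_runnables = {
    'dupStreetCheckBox': [DupStreetDesc, 0],                  # a class is itself a factory
    'startEndCheckBox': [partial(CheckStartEnd, tol=10), 8],  # factory with bound arguments
    'tinyEsuCheckBox': [partial(CheckTinyEsus, "esu", 1)],
}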

Making a python program wait until Twisted deferred returns a value

I have a program that fetches info from other pages and parses them using BeautifulSoup and Twisted's getPage. Later on, the program prints info that the deferred process creates. Currently my program tries to print it before the deferred returns the info. How can I make it wait?
def twisAmaz(contents):  # This parses the page (amazon api xml file)
    stonesoup = BeautifulStoneSoup(contents)
    if stonesoup.find("mediumimage") == None:
        imageurl.append("/images/notfound.png")
    else:
        imageurl.append(stonesoup.find("mediumimage").url.contents[0])
    usedPdata = stonesoup.find("lowestusedprice")
    newPdata = stonesoup.find("lowestnewprice")
    titledata = stonesoup.find("title")
    reviewdata = stonesoup.find("editorialreview")
    if stonesoup.find("asin") != None:
        asin.append(stonesoup.find("asin").contents[0])
    else:
        asin.append("None")
    reactor.stop()

deferred = dict()
for tmpISBN in isbn:  # Go through ISBN numbers and get Amazon API information for each
    deferred[(tmpISBN)] = getPage(fetchInfo(tmpISBN))
    deferred[(tmpISBN)].addCallback(twisAmaz)
reactor.run()
.....print info on each ISBN
It seems like you're trying to make/run multiple reactors. Everything gets attached to the same reactor. Here's how to use a DeferredList to wait for all of your callbacks to finish.
Also note that twisAmazon now returns a value. That value is passed through the DeferredList's callback and comes out in value. Since a DeferredList keeps the order of the Deferreds that are put into it, you can cross-reference the index of each result with the index of your ISBNs.
from twisted.internet import defer, reactor

def twisAmazon(contents):
    stonesoup = BeautifulStoneSoup(contents)
    ret = {}
    if stonesoup.find("mediumimage") is None:
        ret['imageurl'] = "/images/notfound.png"
    else:
        ret['imageurl'] = stonesoup.find("mediumimage").url.contents[0]
    ret['usedPdata'] = stonesoup.find("lowestusedprice")
    ret['newPdata'] = stonesoup.find("lowestnewprice")
    ret['titledata'] = stonesoup.find("title")
    ret['reviewdata'] = stonesoup.find("editorialreview")
    if stonesoup.find("asin") is not None:
        ret['asin'] = stonesoup.find("asin").contents[0]
    else:
        ret['asin'] = 'None'
    return ret

callbacks = []
for tmpISBN in isbn:  # Go through ISBN numbers and get Amazon API information for each
    callbacks.append(getPage(fetchInfo(tmpISBN)).addCallback(twisAmazon))

def printResult(result):
    for e, (success, value) in enumerate(result):
        print ('[%r]:' % isbn[e]),
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

callbacks = defer.DeferredList(callbacks)
callbacks.addCallback(printResult)
reactor.run()
Another cool way to do this is with @defer.inlineCallbacks. It lets you write asynchronous code like a regular sequential function: http://twistedmatrix.com/documents/8.1.0/api/twisted.internet.defer.html#inlineCallbacks
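For example, here is a rough sketch of that style, adapted to this question's names (it assumes the twisAmazon parser from above):

from twisted.internet import defer

@defer.inlineCallbacks
def fetchAll(isbns):
    results = {}
    for tmpISBN in isbns:
        # execution suspends here until getPage's Deferred fires
        contents = yield getPage(fetchInfo(tmpISBN))
        results[tmpISBN] = twisAmazon(contents)
    defer.returnValue(results)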
First, you shouldn't put a reactor.stop() in your deferred method, as it kills everything.
Now, in Twisted, "waiting" is not allowed. To print the results of your callback, just add another callback after the first one.
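For instance, assuming twisAmaz is changed to return what it parses, the chaining looks like this (a small sketch):

def printInfo(parsed):
    print parsed   # runs only after twisAmaz has produced its value
    return parsed  # pass the value along to any later callbacks

d = getPage(fetchInfo(tmpISBN))
d.addCallback(twisAmaz)
d.addCallback(printInfo)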
