Python Multiprocessing with shared data source and multiple class instances

My program needs to spawn multiple instances of a class, each processing data that is coming from a streaming data source.
For example:
parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42
        pass

class DoStuff:
    def __init__(self, parameter):
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print output

streaming_api = FakeStreamingApi()

# Here's how this would work with no multiprocessing
instance_1 = DoStuff(parameters[0])
instance_1.run()
Once the instances are running they don't need to interact with each other; they just have to get the data as it comes in (and print error messages, etc.).
I am totally at a loss as to how to make this work with multiprocessing, since I first have to create a new instance of the class DoStuff, and then have it run.
This is definitely not the way to do it:
# Let's try multiprocessing
import multiprocessing

for parameter in parameters:
    processes = [ multiprocessing.Process(target = DoStuff, args = (parameter)) ]
    # Hmm, this doesn't work...
We could try defining a function to spawn classes, but that seems ugly:
import multiprocessing

def spawn_classes(parameter):
    instance = DoStuff(parameter)
    instance.run()

for parameter in parameters:
    processes = [ multiprocessing.Process(target = spawn_classes, args = (parameter,)) ]
    # Can't tell if it works -- no output on screen?
Plus, I don't want to have 3 different copies of the API interface class running; I want that data to be shared between all the processes... and as far as I can tell, multiprocessing creates copies of everything for each new process.
Ideas?
Edit:
I think I may have got it... is there anything wrong with this?
import multiprocessing

parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42
        pass

class Worker(multiprocessing.Process):
    def __init__(self, parameter):
        super(Worker, self).__init__()
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print output

streaming_api = FakeStreamingApi()

if __name__ == '__main__':
    jobs = []
    for parameter in parameters:
        p = Worker(parameter)
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()

I came to the conclusion that it would be necessary to use multiprocessing.Queues to solve this. The data source (the streaming API) needs to pass copies of the data to all the different processes, so they can consume it.
There's another way to solve this using multiprocessing.Manager to create a shared dict, but I didn't explore it further, as it looks fairly inefficient and cannot propagate changes to inner values (e.g. if you have a dict of lists, changes to the inner lists will not propagate).
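For illustration, here is a minimal sketch of that Queue-based fan-out. It is not from the original post: the feed_workers helper, the n_items cut-off and the None sentinel are assumptions made for the example. A single feeder in the parent process reads the streaming source and puts a copy of each item on every worker's queue:

import multiprocessing

class FakeStreamingApi:
    def data(self):
        return 42

def worker(parameter, queue):
    # Each worker consumes its own copy of every data item.
    while True:
        data = queue.get()
        if data is None:  # sentinel: no more data
            break
        output = parameter ** 2 + data  # Some CPU intensive task
        print(output)

def feed_workers(streaming_api, queues, n_items):
    # Single reader of the streaming source; fan each item out to every queue.
    for _ in range(n_items):
        item = streaming_api.data()
        for q in queues:
            q.put(item)
    for q in queues:
        q.put(None)  # tell every worker to stop

if __name__ == '__main__':
    parameters = [1, 2, 3]
    streaming_api = FakeStreamingApi()
    queues = [multiprocessing.Queue() for _ in parameters]
    workers = [multiprocessing.Process(target=worker, args=(p, q))
               for p, q in zip(parameters, queues)]
    for w in workers:
        w.start()
    feed_workers(streaming_api, queues, n_items=10)
    for w in workers:
        w.join()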

Related

What's the best way to parallelize this process

I've been trying to parallelize a process inside a class method. When I try using Pool() from multiprocessing I get pickling errors. When I use Pool() from multiprocessing.dummy my execution is slower than serial execution.
I've attempted several variations of my code below, using Stack Overflow posts as a guide, but none of them were a successful workaround for the problem outlined above.
One example: if I move process_function above the class definition (making it global), it doesn't work because I can't access my object's attributes.
Anyway, my code is similar to:
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
from my_other_module import other_module_class

class myClass:
    def __init__(self, some_list, number_iterations):
        self.my_interface = other_module_class
        self.relevant_list = []
        self.some_list = some_list
        self.number_iterations = number_iterations
        # self.other_attributes = stuff from import statements

    def load_relevant_data(self):
        self.relevant_list = self.my_interface.other_function

    def compute_foo(self, relevant_list_member_value):
        # math involving class attributes
        return foo_scalar

    def higher_function(self):
        self.load_relevant_data()
        np.random.seed(0)
        pool = ThreadPool()  # I've tried different args here, no help
        pool.map(self.process_function, self.relevant_list)

    def process_function(self, dict_from_relevant_list):
        foo_bar = self.compute_foo(dict_from_relevant_list['key'])
        a = 0
        for i in some_other_list:
            # do other stuff involving class attributes and foo_bar
            # a = some of that
            pass
        dict_from_relevant_list['other_key'] = a

if __name__ == '__main__':
    import time
    import pprint as pp

    some_list = blah
    number_of_iterations = 10**4
    my_obj = myClass(some_list, number_of_iterations)
    my_obj.load_third_parties()

    start = time.time()
    my_obj.higher_function()
    execution_time = time.time() - start

    print()
    print("Execution time for %s simulation runs: %s" % (number_of_iterations, execution_time))
    print()
    pp.pprint(my_obj.relevant_list[0:5])
I have a few hundred dictionaries inside relevant_list. I just want to populate each of those dictionaries' 'other_key' field from a computationally expensive simulation on my innermost loop, which yields a scalar value, like a above. It seems like there should be a simple way to do this, since in Matlab I could just write parfor and it's done automatically. Maybe that instinct is wrong for Python.
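Not part of the original question, but one common pattern is sketched below with hypothetical names (MyClass, process_function's copy step, the range(10) data). It assumes Python 3, where a bound method such as self.process_function can be pickled as long as the instance's attributes are picklable. Because each worker operates on a copy of the object, the worker has to return the updated dictionary and the parent has to reassign the mapped results:

import multiprocessing

class MyClass:
    def __init__(self, relevant_list):
        self.relevant_list = relevant_list

    def compute_foo(self, value):
        return value ** 2  # stand-in for the expensive simulation

    def process_function(self, d):
        d = dict(d)  # work on a copy; the parent never sees in-place edits anyway
        d['other_key'] = self.compute_foo(d['key'])
        return d  # returning the dict is what gets the result back to the parent

    def higher_function(self):
        with multiprocessing.Pool() as pool:
            # Reassign: the returned dicts replace the stale parent-side copies.
            self.relevant_list = pool.map(self.process_function, self.relevant_list)

if __name__ == '__main__':
    obj = MyClass([{'key': i} for i in range(10)])
    obj.higher_function()
    print(obj.relevant_list[:3])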

Threading problem with threadpool and objects

I have a problem when using objects and threads.
Below follows a simplified example of the code.
I am using a threadpool to loop over a list of jobs.
class File(object):
    def __init__(self, name, streams = []):
        self.name = name
        self.streams = streams

    def appendStream(self, stream):
        self.streams.append(stream)

class Job(object):
    def __init__(self, file):
        self.file = file

def main():
    ...
    jobs = []
    for f in input_files:
        f_obj = File(f)
        jobs.append(Job(f_obj))
    with ThreadPool(processes = 2, initializer = init, initargs = (log, p_lock)) as pool:
        pool.map(func = process_job, iterable = jobs, chunksize = 1)
    ...
The function (process_job) used by the thread pool resides in the same .py file.
def process_job(job):
    ...
    get_info(job.file)
    ...
This function in turn uses a function (get_info) from a self-defined package.
This function creates an argument list and then calls subprocess.check_output().
The subprocess returns a json struct which is looped over to update the contents of the input object.
def get_info(file):
    ...
    args = ["ffprobe", ..., "-i", file.name]
    try:
        output = subprocess.check_output(args)
    except Exception as e:
        print(e)
    data = output.decode('utf8')
    json_data = json.loads(data)
    for item in json_data:
        file.appendStream(item["stream"])
    ...
The problem is that when running this code, the threads spawned by the pool are updating each other's file objects.
For example, when running this with 5 input files, the 5th job.file.streams will contain 5 streams, i.e. the 4 previous streams that belong to the other files plus its own.
Why is this happening, and how can I solve it?
Best regards!
As @torek spotted, it seems to be a case of the "Mutable Default Argument".
“Least Astonishment” and the Mutable Default Argument
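For reference, a minimal sketch of the usual fix (not from the original answer): default to None and create a fresh list per instance, so the File objects no longer share a single list object:

class File(object):
    def __init__(self, name, streams=None):
        self.name = name
        # A new list is created for every File; with `streams=[]` as the default,
        # all instances created without an explicit argument share one list object.
        self.streams = streams if streams is not None else []

    def appendStream(self, stream):
        self.streams.append(stream)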

pickle.PicklingError when using joblib and passing class's methods as parameter

I am pretty new to parallelism and am looking for a way to parallelise a text tokenisation task. The task consists of millions of records and can be tokenised using different strategies.
I wrote the code as follows, and bumped into this error: pickle.PicklingError: Could not pickle the task to send it to the workers. I've checked out the following, but had no luck finding a similar solution. Could anyone suggest a solution around this code or suggest a better way of implementation?
[1] Can't pickle Function
[2] Python multiprocessing pickling error
[3] Can't pickle <type 'instancemethod'> when using multiprocessing Pool.map()
from joblib import Parallel, delayed
from tokenizer import Predicates

class test():
    def __init__(self):
        # millions of records
        self.data = [
            ("user name", "abc#gmail.com"),
            ("user1 abc", "abc#gmail.com"),
            ("abc user1 ", "abcd#gmail.com")
        ]

    def proc(self, data, strategy):
        # unwrap different strategy for different column
        func_username, func_email = strategy
        res = []
        for uname, email in data:
            # func call to tokenise column data
            username_tok = func_username(uname)
            email_tok = func_email(email)
            res.append((username_tok, email_tok))
        return res

    def run(self):
        # define different tokenisation strategy
        strategy = [(Predicates().tokenFingerprint, Predicates().tokenFingerprint),
                    (Predicates().otherMethod, Predicates().otherMethod)]
        # assign tokenisation jobs to multiple threads.
        # NOTE that to simplify the example, self.data is not split here,
        # so both threads work on the same dataset.
        lsres_strategy = []
        for func_username, func_email in strategy:
            lsres = Parallel(n_jobs=2)(delayed(self.proc)(self.data, (func_username, func_email)))
            lsres_strategy.append(lsres)

t = test()
t.run()
The tokenisation strategies are defined as
class Predicates:
    def __init__(self):
        pass

    def tokenFingerprint(self, field):
        return (u''.join(sorted(field.split())))

    def otherMethod(self, field):
        return (u''.join(field.split()))
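Not an answer from the thread, but a minimal sketch of one common workaround, assuming the tokenisers can be written as module-level functions (which pickle serialises by reference), so nothing bound to an instance has to cross the process boundary:

from joblib import Parallel, delayed

# Module-level functions are picklable by the standard pickler.
def token_fingerprint(field):
    return u''.join(sorted(field.split()))

def other_method(field):
    return u''.join(field.split())

def proc(data, strategy):
    func_username, func_email = strategy
    return [(func_username(uname), func_email(email)) for uname, email in data]

if __name__ == '__main__':
    data = [("user name", "abc#gmail.com"),
            ("user1 abc", "abc#gmail.com")]
    strategies = [(token_fingerprint, token_fingerprint),
                  (other_method, other_method)]
    # One parallel job per strategy; each job gets the whole dataset,
    # mirroring the simplified example in the question.
    results = Parallel(n_jobs=2)(delayed(proc)(data, s) for s in strategies)
    print(results)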

Python Multiprocessing.Pool in a class

I am trying to change my code to a more object-oriented format. In doing so I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions, but on the other hand, I believe multiprocessing creates a copy of the code which the original instance would not have access to. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all manipulated values in the nested classes.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[today][report] = {}
            tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to process the list of reports in parallel AND, as processWorker manipulates each report class object, store those results. As I am fairly new to this, I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really need to return anything; I just need the objects to finish being processed, and they should have the relevant information within them.
Thanks!
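Not from the thread, but a sketch of the usual pattern, with hypothetical names and trimmed-down classes. Because each worker process gets a copy of the Report object, in-place changes made in the child are lost; returning the modified object from the worker and reassigning the mapped results keeps them in the parent:

import multiprocessing

class Report:
    def __init__(self, name=''):
        self.name = name
        self.path = ''

def process_worker(report):
    # This runs in a child process on a *copy* of the report...
    report.name = 'foo'
    report.path = '/path/to/report'
    return report  # ...so the copy must be sent back explicitly.

class Transfer:
    def __init__(self, reports):
        self.reports = reports

    def process_reports(self):
        with multiprocessing.Pool() as pool:
            # Reassign: the returned objects replace the stale parent-side copies.
            self.reports = pool.map(process_worker, self.reports)

if __name__ == '__main__':
    t = Transfer([Report('a'), Report('b')])
    t.process_reports()
    print([(r.name, r.path) for r in t.reports])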

Working with parallel python and classes

I was shocked to learn how few tutorials and guides there are to be found on the internet regarding Parallel Python (PP) and handling classes. I've run into a problem where I want to initiate a couple of instances of the same class and after that retrieve some variables (for instance, reading 5 data files in parallel and then retrieving their data). Here's a simple piece of code to illustrate my problem:
import pp

class TestClass:
    def __init__(self, i):
        self.i = i

    def doSomething(self):
        print "\nI'm being executed!, i = "+str(self.i)
        self.j = 2*self.i
        print "self.j is supposed to be "+str(self.j)
        return self.i

class parallelClass:
    def __init__(self):
        job_server = pp.Server()
        job_list = []
        self.instances = [] # for storage of the class objects
        for i in xrange(3):
            TC = TestClass(i) # initiate a new instance of the TestClass
            self.instances.append(TC) # store the instance
            job_list.append(job_server.submit(TC.doSomething, (), ())) # add some jobs to the job_list
        results = [job() for job in job_list] # execute order 66...
        print "\nIf all went well there's a nice bunch of objects in here:"
        print self.instances
        print "\nAccessing an object's i works ok, but accessing j does not"
        print "i = "+str(self.instances[2].i)
        print "j = "+str(self.instances[2].j)

if __name__ == '__main__' :
    parallelClass() # initiate the program
I've added comments for your convenience. What am I doing wrong here?
You should use callbacks
A callback is a function that you pass to the submit call. That function will be called with the result of the job as its argument (have a look at the API for more arcane usage).
In your case
Set up a callback:
class TestClass:
    def doSomething(self):
        j = 2 * self.i
        return j  # It's REQUIRED that you return j here.

    def set_j(self, j):
        self.j = j
Add the callback to the job submit call
class parallelClass:
    def __init__(self):
        # your code...
        job_list.append(job_server.submit(TC.doSomething, callback=TC.set_j))
And you're done.
I made some improvements to the code to avoid using self.j in the doSomething call, and only use a local j variable.
As mentioned in the comments, in pp you only communicate the result of your job. That's why you have to return this variable; it will be passed to the callback.
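Putting the answer's fragments back into the asker's structure, the whole flow would look roughly like this (a sketch only; the submit call with the callback keyword is taken from the answer above):

import pp

class TestClass:
    def __init__(self, i):
        self.i = i

    def doSomething(self):
        j = 2 * self.i
        return j  # the returned value is what pp passes to the callback

    def set_j(self, j):
        self.j = j  # runs in the parent process with the job's result

class parallelClass:
    def __init__(self):
        job_server = pp.Server()
        job_list = []
        self.instances = []
        for i in xrange(3):
            TC = TestClass(i)
            self.instances.append(TC)
            # the callback stores the result back on the local instance
            job_list.append(job_server.submit(TC.doSomething, callback=TC.set_j))
        results = [job() for job in job_list]  # wait for all jobs to finish
        print "j = " + str(self.instances[2].j)  # now accessible

if __name__ == '__main__':
    parallelClass()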
