What's the best way to parallelize this process - python

I've been trying to parallelize a process inside a class method. When I try using Pool() from multiprocessing I get pickling errors. When I use Pool() from multiprocessing.dummy my execution is slower than serial execution.
I've attempted several variations of my code below, using Stack Overflow posts as a guide, but none of them worked around the problem outlined above.
One example: if I move process_function above the class definition (making it global), it doesn't work because I can't access my object's attributes.
Anyway, my code is similar to:
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
from my_other_module import other_module_class

class myClass:
    def __init__(self, some_list, number_iterations):
        self.my_interface = other_module_class
        self.relevant_list = []
        self.some_list = some_list
        self.number_iterations = number_iterations
        # self.other_attributes = stuff from import statements

    def load_relevant_data(self):
        self.relevant_list = self.my_interface.other_function

    def compute_foo(self, relevant_list_member_value):
        # math involving class attributes
        return foo_scalar

    def higher_function(self):
        self.relevant_list = self.load_relevant_data
        np.random.seed(0)
        pool = ThreadPool()  # I've tried different args here, no help
        pool.map(self.process_function, self.relevant_list)

    def process_function(self, dict_from_relevant_list):
        foo_bar = self.compute_foo(dict_from_relevant_list['key'])
        a = 0
        for i in some_other_list:
            # do other stuff involving class attributes and foo_bar
            # a = some of that
            pass
        dict_from_relevant_list['other_key'] = a

if __name__ == '__main__':
    import time
    import pprint as pp

    some_list = blah
    number_of_iterations = 10**4
    my_obj = myClass(some_list, number_of_iterations)
    my_obj.load_third_parties()
    start = time.time()
    my_obj.higher_function()
    execution_time = time.time() - start
    print()
    print("Execution time for %s simulation runs: %s" % (number_of_iterations, execution_time))
    print()
    pp.pprint(my_obj.relevant_list[0:5])
I have a few hundred dictionaries inside relevant_list. I just want to populate each of those dictionaries' 'other_key' field from a computationally expensive simulation in my innermost loop, which yields a scalar value, like a above. It seems like there should be a simple way to do this, since in MATLAB I could just write parfor and it's done automatically. Maybe that instinct is wrong for Python.
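For what it's worth, one common workaround is to keep the expensive worker at module level, hand it only picklable data plus whatever attributes it actually needs, and reassign the returned results, since in-place mutation of the dicts is lost across processes. A minimal, self-contained sketch with illustrative names (process_item, MyClass, and the toy math), not the question's actual modules:

import numpy as np
from functools import partial
from multiprocessing import Pool

def process_item(item, number_iterations):
    # Stand-in for the expensive simulation; it only uses what was passed in.
    np.random.seed(0)
    item['other_key'] = item['key'] ** 2 + number_iterations
    return item  # return the updated dict; in-place changes are lost across processes

class MyClass:
    def __init__(self, relevant_list, number_iterations):
        self.relevant_list = relevant_list
        self.number_iterations = number_iterations

    def higher_function(self):
        worker = partial(process_item, number_iterations=self.number_iterations)
        with Pool() as pool:
            self.relevant_list = pool.map(worker, self.relevant_list)

if __name__ == '__main__':
    obj = MyClass([{'key': k} for k in range(500)], 10**4)
    obj.higher_function()
    print(obj.relevant_list[:3])

multiprocessing.dummy uses threads, so CPU-bound math gains little because of the GIL; a real process pool avoids that, but everything sent to the workers has to be picklable, which is what the module-level function plus functools.partial arrangement provides.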

Related

Why might my multiprocessing queue be "losing" items?

I've got some code where I want to share objects between processes using queues. I've got a parent:
import multiprocessing as mp

processing_manager = mp.Manager()
to_cacher = processing_manager.Queue()
fetchers = get_fetchers()
fetcher_process = mp.Process(target=fetch_news, args=(to_cacher, fetchers))
fetcher_process.start()
while 1:
    print(to_cacher.get())
And a child:
def fetch_news(pass_to: Queue, fetchers: List[Fetcher]):
    def put_news_to_query(pass_to: Queue, fetchers: List[Fetcher]):
        for fet in fetchers:
            for news in fet.latest_news():
                print(news)
                pass_to.put(news)
        print("----------------")
    put_news_to_query(pass_to, fetchers)
I'm expecting to see N objects printed by put_news_to_query, then a line, and then the same objects printed by the while loop in the parent. The problem is, objects appear to go missing: if I get, say, 8 objects printed in put_news_to_query, I only get 2-3 objects printed in the while loop. What am I doing wrong here?
This is not the answer, unless the answer is that the code is already working. I've just modified the code to make it a running example of the same technique. The data gets from child to parent without data loss.
import multiprocessing as mp
import time
import random

def worker(pass_to):
    for i in range(10):
        time.sleep(random.randint(0, 10) / 1000)
        print('child', i)
        pass_to.put(i)
    print("---------------------")
    pass_to.put(None)

def main():
    manager = mp.Manager()
    to_cacher = manager.Queue()
    fetcher = mp.Process(target=worker, args=(to_cacher,))
    fetcher.start()
    while 1:
        msg = to_cacher.get()
        if msg is None:
            break
        print(msg)

if __name__ == "__main__":
    main()
So, apparently, it was related to the order in which the put and get statements were executed. Basically, some of the objects printed by the parent appeared before the separator line, so nothing was actually lost. If you struggle with something like this, I'd recommend adding something to distinguish the prints, like this:
print(f"Worker: {news}")
print(f"Main: {to_cacher.get()}")

Multiprocessing Pool creating and killing processes indefinitely

EDIT
The short code below triggers the same issue.
# top_level.py
import to_import

if __name__ == '__main__':
    # This does not work
    t = to_import.Test()
    from pprint import pprint
    pprint(t.test())
# to_import.py
import multiprocessing as mp

def test_func(a, b):
    return a * b

class Test:
    def __init__(self):
        self.pairs = list()
        for i in range(10):
            for j in range(10):
                self.pairs.append((i, j))

    def test(self):
        pairs = tuple(self.pairs)
        with mp.Pool() as pool:
            results = pool.starmap(test_func, pairs)
        return results

if __name__ == '__main__':
    # This works fine
    t = Test()
    from pprint import pprint
    pprint(t.test())
END EDIT
DOUBLE EDIT
Interestingly, this code works correctly when run from my command prompt, as opposed to running it from within Spyder as I had been doing previously.
EDIT END
I have a class Tin which stores a 3d surface as a series of points and triangles, and can generate a regular grid of points on that surface. The process of creating these points works fine when the multiprocessing flag is False.
However for very dense grids on large surfaces this process can be quite slow, so I implemented multiprocessing to speed it up.
# tin.py
from time import time
import multiprocessing as mp

def _points_from_face(points, grid_size):
    """Create 3D points within a triangle on the grid; uses other functions within this module."""
    ...

def _multiprocess_function(function, vals_gen, pool_size):
    with mp.Pool(processes=pool_size) as pool:
        results = pool.starmap(func=function,
                               iterable=vals_gen)
    return results

class Tin:
    def __init__(self, name, surface_dict):
        self.name = name
        self.points = surface_dict['Points']
        self.faces = dict(enumerate(surface_dict['Faces']))

    def generate_regular_grid(self, grid_size,
                              multiprocess=False,
                              pool_size=(mp.cpu_count() // 2)):
        return_grid = dict()
        if pool_size < 1:
            multiprocess = False
        if multiprocess:
            faces_tuple = tuple(self.faces.values())
            vals_tuple = tuple((tuple(self.points[pid] for pid in face), grid_size)
                               for face in faces_tuple)
            results = _multiprocess_function(_points_from_face,
                                             vals_tuple,
                                             pool_size)
            for result in results:
                return_grid.update(result)
        else:
            for face in self.faces.values():
                points = tuple(self.points[pid] for pid in face)
                return_grid.update(_points_from_face(points, grid_size))
        return return_grid
When the Tin class and associated functions are in the same Python file as the code calling them, the script works fine: the processes spin up, do their thing, and then close.
But when I import tin.py into another script and try to use multiprocessing, the program gets stuck creating and killing processes over and over without returning anything.
e.g.
# landxml.py
from time import time
from tin import Tin

def parse_landxml(xml_path: str, print_times=False) -> Tin:
    """Read an xml file and return the Tin contained within."""
    ...

if __name__ == '__main__':
    st = time()
    surface = parse_landxml('some_tin.xml',
                            print_times=True)
    grid = surface.generate_regular_grid(grid_size=2,
                                         print_times=True,
                                         multiprocess=True)
Do I need to keep everything in one long script, or is there a way I can still use multiprocessing inside an imported script?
In addition, landxml.py will itself be imported into another file; is this likely to cause the same problem again?
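For what it's worth, multiprocessing generally works fine from an imported module as long as two things hold: the functions handed to the pool live at module top level, and the script that is actually executed guards process creation behind if __name__ == '__main__':. The endless create-and-kill behaviour is typical of the spawn start method re-importing an unguarded entry module, and it can also depend on how an IDE such as Spyder launches the code, which would explain why the command prompt behaves differently. A minimal two-file sketch under those assumptions (file names are illustrative):

# worker_mod.py  (illustrative file name)
import multiprocessing as mp

def square(x):
    # stand-in for the real per-face work; must live at module top level
    return x * x

def run_parallel(values, pool_size=2):
    with mp.Pool(processes=pool_size) as pool:
        return pool.map(square, values)

# main.py  (illustrative file name)
# The script that is actually executed guards anything that creates processes,
# because the spawn start method re-imports the __main__ module in every child;
# without the guard the children keep spawning children.
import worker_mod

if __name__ == '__main__':
    print(worker_mod.run_parallel(range(10)))

The same reasoning applies when landxml.py is itself imported somewhere else: only the file that is run as the entry point needs the guard.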

pickle.PicklingError when using joblib and passing class's methods as parameter

I am pretty new to parallelism and am looking for a way to parallelise a text tokenisation task. The task consists of millions of records that can be tokenised using different strategies.
I wrote the code as follows and bumped into this error: pickle.PicklingError: Could not pickle the task to send it to the workers. I've checked out the following posts but had no luck finding a similar solution. Could anyone suggest a solution for this code or a better way of implementing it?
[1] Can't pickle Function
[2] Python multiprocessing pickling error
[3] Can't pickle <type 'instancemethod'> when using multiprocessing Pool.map()
from joblib import Parallel, delayed
from tokenizer import Predicates

class test():
    def __init__(self):
        # millions of records
        self.data = [
            ("user name", "abc#gmail.com"),
            ("user1 abc", "abc#gmail.com"),
            ("abc user1 ", "abcd#gmail.com")
        ]

    def proc(self, data, strategy):
        # unwrap the different strategies for the different columns
        func_username, func_email = strategy
        res = []
        for uname, email in data:
            # func call to tokenise column data
            username_tok = func_username(uname)
            email_tok = func_email(email)
            res.append((username_tok, email_tok))
        return res

    def run(self):
        # define different tokenisation strategies
        strategy = [(Predicates().tokenFingerprint, Predicates().tokenFingerprint),
                    (Predicates().otherMethod, Predicates().otherMethod)]
        # assign tokenisation jobs to multiple workers.
        # NOTE: to simplify the example, self.data is not split here,
        # so both jobs work on the same dataset.
        lsres_strategy = []
        for func_username, func_email in strategy:
            lsres = Parallel(n_jobs=2)(delayed(self.proc)(self.data, (func_username, func_email)))
            lsres_strategy.append(lsres)

t = test()
t.run()
The tokenisation strategies are defined as
class Predicates:
    def __init__(self):
        pass

    def tokenFingerprint(self, field):
        return (u''.join(sorted(field.split())))

    def otherMethod(self, field):
        return (u''.join(field.split()))
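The error typically means that something in the submitted task (the bound method, the object it is bound to, or the strategy callables) could not be serialised for the workers. One common workaround is to keep the tokenisers as plain module-level functions and pass only plain data to the jobs, so there is nothing unpicklable to ship. A minimal sketch with illustrative names (token_fingerprint, proc_chunk), not the question's real Predicates API:

from joblib import Parallel, delayed

def token_fingerprint(field):
    # same behaviour as Predicates.tokenFingerprint, but as a plain function
    return u''.join(sorted(field.split()))

def proc_chunk(chunk, func_username, func_email):
    # tokenise one chunk of (username, email) records
    return [(func_username(uname), func_email(email)) for uname, email in chunk]

def run(data, strategies, n_jobs=2):
    # split the data so each job works on a different slice
    chunks = [data[i::n_jobs] for i in range(n_jobs)]
    results = []
    for func_username, func_email in strategies:
        res = Parallel(n_jobs=n_jobs)(
            delayed(proc_chunk)(chunk, func_username, func_email) for chunk in chunks)
        results.append(res)
    return results

if __name__ == '__main__':
    data = [("user name", "abc#gmail.com"),
            ("user1 abc", "abc#gmail.com"),
            ("abc user1 ", "abcd#gmail.com")]
    print(run(data, [(token_fingerprint, token_fingerprint)]))

Note that Parallel expects an iterable of delayed(...) calls, one per task; here it receives one per chunk.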

Python Multiprocessing with shared data source and multiple class instances

My program needs to spawn multiple instances of a class, each processing data that is coming from a streaming data source.
For example:
parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42
    pass

class DoStuff:
    def __init__(self, parameter):
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print(output)

streaming_api = FakeStreamingApi()

# Here's how this would work with no multiprocessing
instance_1 = DoStuff(parameters[0])
instance_1.run()
Once the instances are running, they don't need to interact with each other; they just have to get the data as it comes in (and print error messages, etc.).
I am totally at a loss as to how to make this work with multiprocessing, since I first have to create a new instance of the class DoStuff and then have it run.
This is definitely not the way to do it:
# Let's try multiprocessing
import multiprocessing

for parameter in parameters:
    processes = [ multiprocessing.Process(target = DoStuff, args = (parameter)) ]
    # Hmm, this doesn't work...
We could try defining a function to spawn classes, but that seems ugly:
import multiprocessing

def spawn_classes(parameter):
    instance = DoStuff(parameter)
    instance.run()

for parameter in parameters:
    processes = [ multiprocessing.Process(target = spawn_classes, args = (parameter,)) ]
    # Can't tell if it works -- no output on screen?
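As an aside, the likely reason there is no output here is that the Process objects are created but never started or joined, and the processes list is rebuilt on every loop iteration. A sketch of the same attempt with the processes actually run (it reuses spawn_classes, DoStuff and parameters from the snippets above, so it is not standalone):

if __name__ == '__main__':
    # build all the workers, then start them and wait for them to finish
    processes = [multiprocessing.Process(target=spawn_classes, args=(parameter,))
                 for parameter in parameters]
    for p in processes:
        p.start()
    for p in processes:
        p.join()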
Plus, I don't want to have three different copies of the API interface class running; I want that data to be shared between all the processes... and as far as I can tell, multiprocessing creates copies of everything for each new process.
Ideas?
Edit:
I think I may have got it... is there anything wrong with this?
import multiprocessing

parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42
    pass

class Worker(multiprocessing.Process):
    def __init__(self, parameter):
        super(Worker, self).__init__()
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print(output)

streaming_api = FakeStreamingApi()

if __name__ == '__main__':
    jobs = []
    for parameter in parameters:
        p = Worker(parameter)
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I came to the conclusion that it would be necessary to use multiprocessing queues to solve this. The data source (the streaming API) needs to pass copies of the data to all the different processes so they can consume it.
There's another way to solve this using a multiprocessing.Manager to create a shared dict, but I didn't explore it further, as it looks fairly inefficient and cannot propagate changes to inner values (e.g., if you have a dict of lists, changes to the inner lists will not propagate).
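A minimal, self-contained sketch of the queue-based fan-out described in that conclusion: a single reader copies each incoming item onto one queue per worker, and each worker consumes only its own queue. The streaming source is faked with a counter, and all names are illustrative.

import multiprocessing

def reader(queues, n_items=5):
    for i in range(n_items):
        item = 42 + i               # stand-in for streaming_api.data()
        for q in queues:            # every worker gets its own copy
            q.put(item)
    for q in queues:
        q.put(None)                 # sentinel: no more data

def worker(parameter, queue):
    while True:
        data = queue.get()
        if data is None:
            break
        print(parameter ** 2 + data)  # the CPU intensive task goes here

if __name__ == '__main__':
    parameters = [1, 2, 3]
    queues = [multiprocessing.Queue() for _ in parameters]
    workers = [multiprocessing.Process(target=worker, args=(p, q))
               for p, q in zip(parameters, queues)]
    for w in workers:
        w.start()
    reader(queues)
    for w in workers:
        w.join()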

Python Multiprocessing.Pool in a class

I am trying to change my code to a more object-oriented format. In doing so I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions, but on the other, I believe multiprocessing creates a copy of the code which the original instance would not have access to. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all of the manipulated values in the nested classes.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[today][report] = {}
            tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to process the list of reports in parallel AND, as processWorker manipulates each report class object, store those results. As I am fairly new to this, I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really be returning anything; I just need the classes to finish being processed, and they should then have the relevant information stored within them.
Thanks!
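A minimal, self-contained sketch of the pattern this seems to call for, under the assumption that the goal is just "process each Report in parallel and keep the modified objects": the worker must live at module level (so it is picklable) and must return the report it modified, because each process only ever touches its own copy; the parent then replaces its list with pool.map's results. All names are illustrative simplifications of the question's classes.

import multiprocessing

class Report:
    def __init__(self, name):
        self.name = name
        self.path = ''

def process_worker(report):
    # stand-in for the real manipulation of the report object
    report.name = report.name + '_processed'
    report.path = '/path/to/' + report.name
    return report                      # hand the modified copy back to the parent

class Transfer:
    def __init__(self, reports):
        self.reports = reports

    def process_reports(self):
        with multiprocessing.Pool() as pool:
            # replace the parent's objects with the modified copies
            self.reports = pool.map(process_worker, self.reports)

if __name__ == '__main__':
    t = Transfer([Report('reportName1.txt'), Report('reportName2.txt')])
    t.process_reports()
    print([(r.name, r.path) for r in t.reports])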
