Sharing large arbitrary data structures during multiprocessing - python

I would like to parallelize a process in python which needs read access to several large, non-array data structures. What would be a recommended way to do this without copying all of the large data structures into every new process?
Thank you

The multiprocessing package provides two ways of sharing state between processes: shared memory objects (Value/Array, which only hold simple ctypes data) and server process managers, which support arbitrary object types. Since your data structures are not arrays, a server process manager is the better fit.
The following program makes use of a server process manager:
#!/usr/bin/env python3
from multiprocessing import Process, Manager

# Simple data structure
class DataStruct:
    data_id = None
    data_str = None

    def __init__(self, data_id, data_str):
        self.data_id = data_id
        self.data_str = data_str

    def __str__(self):
        return f"{self.data_str} has ID {self.data_id}"

    def __repr__(self):
        return f"({self.data_id}, {self.data_str})"

    def set_data_id(self, data_id):
        self.data_id = data_id

    def set_data_str(self, data_str):
        self.data_str = data_str

    def get_data_id(self):
        return self.data_id

    def get_data_str(self):
        return self.data_str

# Create function to manipulate data
def manipulate_data_structs(data_structs, find_str):
    for ds in data_structs:
        if ds.get_data_str() == find_str:
            print(ds)

# Guard the process-spawning code so the module can be imported safely
# on spawn-based platforms (Windows, macOS)
if __name__ == "__main__":
    # Create manager context, share the data
    with Manager() as manager:
        # List of DataStruct objects
        l = manager.list([
            DataStruct(32, "Andrea"),
            DataStruct(45, "Bill"),
            DataStruct(21, "Claire"),
        ])
        # Processes that look for DataStructs with a given string
        procs = [
            Process(target=manipulate_data_structs, args=(l, "Andrea")),
            Process(target=manipulate_data_structs, args=(l, "Claire")),
            Process(target=manipulate_data_structs, args=(l, "David")),
        ]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()
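One caveat worth noting (see the documentation linked below): an item fetched from the managed list is a copy, so mutating it in place is not written back to the shared list; reassign the slot to record the change. A quick sketch, inside the with Manager() block:

# l[0] returns a local copy of the first DataStruct
ds = l[0]
ds.set_data_id(99)   # mutating the copy does not touch the managed list
l[0] = ds            # reassigning the slot sends the updated object back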
For more information, see Sharing state between processes in the documentation.

Related

Why am I getting an error during parallel processing

I am passing the keys and values of a dictionary for parallel processing:
if __name__ == "__main__":
DATASETS = {
"Dataset_1": data_preprocess.dataset_1,
"Dataset_2": data_preprocess.dataset_2,}
pool = mp.Pool(8)
pool.starmap(main, zip(DATASETS.keys(), DATASETS.values()))
pool.close()
# As I am not joining any result and I am directly saving the output
# in CSV file from (main function) I did not used pool.join()
The main function
def main(dataset_name, generate_dataset):
    REGRESSORS = {
        "LinReg": LinearRegression(),
        "Lasso": Lasso(),}
    ROOT = Path(__file__).resolve().parent
    dataset_name = dataset_name
    generate_dataset = generate_dataset
    dfs = []
    for reg_name, regressor in REGRESSORS.items():
        df = function_calling(
            generate_dataset=generate_dataset,
            regressor=regressor,
            reg_name=reg_name,)
        print(df)
        dfs.append(df)
    df = pd.concat(dfs, axis=0, ignore_index=True)
    filename = dataset_name + "_result.csv"
    outfile = str(PATH) + "/" + filename
    df.to_csv(outfile)
I am getting an error AssertionError: daemonic processes are not allowed to have children.
Could you tell me why I am getting the error? How can I resolve this?
Pool worker processes are daemonic, and a daemonic process is not allowed to create children of its own; presumably something called from main (perhaps function_calling) starts its own processes, which is what raises the AssertionError. The simplest workaround is to just create your own Process instances:
import multiprocessing as mp

def main(dataset_name, generate_dataset):
    print(dataset_name, generate_dataset, flush=True)
    ...  # etc.

if __name__ == "__main__":
    DATASETS = {
        "Dataset_1": 1,
        "Dataset_2": 2,}
    processes = [mp.Process(target=main, args=(k, v)) for k, v in DATASETS.items()]
    for process in processes:
        process.start()
    # wait for termination:
    for process in processes:
        process.join()
Prints:
Dataset_1 1
Dataset_2 2
The caveat: suppose you have 8 CPU cores and DATASETS had 100 key/value pairs. You would be creating 100 processes, and if they are CPU-intensive, no more than 8 of them could be doing productive work at any moment, yet you would still pay the CPU and memory overhead of creating all of them. But as long as the number of processes you create is not excessively greater than the number of CPU cores you have, and your function main does not need to return a value back to the main process, this approach is fine.
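If main did need to hand a result back to the parent, one simple option (a sketch, not part of the original answer) is to pass each process a multiprocessing.Queue and collect the results from it before joining:

import multiprocessing as mp

def main(dataset_name, generate_dataset, result_queue):
    # ... compute something; here we just echo the inputs back
    result_queue.put((dataset_name, generate_dataset))

if __name__ == "__main__":
    DATASETS = {"Dataset_1": 1, "Dataset_2": 2}
    result_queue = mp.Queue()
    processes = [mp.Process(target=main, args=(k, v, result_queue))
                 for k, v in DATASETS.items()]
    for p in processes:
        p.start()
    # drain the queue before joining so the queue's feeder threads can flush
    results = [result_queue.get() for _ in processes]
    for p in processes:
        p.join()
    print(results)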
There is also a way of implementing your own multiprocessing pool with these Process instances and a Queue instance, but that's a bit more complicated:
import multiprocessing as mp

def main(dataset_name, generate_dataset):
    print(dataset_name, generate_dataset, flush=True)
    ...  # etc.

def worker(queue):
    while True:
        arg = queue.get()
        if arg is None:
            # signal to terminate
            break
        # unpack
        dataset_name, generate_dataset = arg
        main(dataset_name, generate_dataset)

if __name__ == "__main__":
    DATASETS = {
        "Dataset_1": 1,
        "Dataset_2": 2,}
    queue = mp.Queue()
    items = list(DATASETS.items())
    for k, v in items:
        # put the arguments on the queue
        queue.put((k, v))
    # number of worker processes we will be using:
    n_processors = min(mp.cpu_count(), len(items))
    for _ in range(n_processors):
        # special value telling a worker there is no more work: one per worker
        queue.put(None)
    processes = [mp.Process(target=worker, args=(queue,)) for _ in range(n_processors)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()

Python Multi-Processing List Issue

I am attempting to dynamically open and parse several text files (~10) to extract a particular value by key, and I am using multiprocessing in Python to do this. My issue is that the function I am calling writes data to a class-level list, which I can see populated inside the method, but outside the method that list is empty. Refer to the following:
class:
class MyClass(object):
    __id_list = []

    def __init__(self):
        self.process_wrapper()
Caller Method:
def process_wrapper(self):
    from multiprocessing import Pool
    import multiprocessing
    info_file = 'info*'
    file_list = []
    p = Pool(processes = multiprocessing.cpu_count() - 1)
    for file_name in Path('c:/').glob('**/*/' + info_file):
        file_list.append(str(os.path.join('c:/', file_name)))
    p.map_async(self.get_ids, file_list)
    p.close()
    p.join()
    print(self.__id_list)  # this is showing as empty
Worker method:
def get_ids(self, file_name):
    try:
        with open(file_name) as data:
            for line in data:
                temp_split = line.split()
                for item in temp_split:
                    value_split = str(item).split('=')
                    if 'id' == value_split[0].lower():
                        if int(value_split[1]) not in self._id_list:
                            self.__id_list.append(int(value_split[1]))
    except:
        raise FileReadError(f'There was an issue parsing "{file_name}".')
    print(self.__id_list)  # here the list prints fine
The map_async call returns an AsyncResult object; you should use it to wait for the processing to finish before checking self.__id_list. Also, you might consider returning a local list from each worker, collecting those lists, and aggregating them into the final list.
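A sketch of that "return a local list and aggregate" approach, written as standalone functions rather than methods (get_ids_local and collect_ids are illustrative names, not from the original code):

from multiprocessing import Pool
import multiprocessing

def get_ids_local(file_name):
    # return the IDs found in one file instead of mutating shared state
    ids = []
    with open(file_name) as data:
        for line in data:
            for item in line.split():
                key, _, value = str(item).partition('=')
                if key.lower() == 'id':
                    ids.append(int(value))
    return ids

def collect_ids(file_list):
    with Pool(processes=multiprocessing.cpu_count() - 1) as p:
        # map() blocks until every worker has returned its list
        per_file_ids = p.map(get_ids_local, file_list)
    # aggregate in the parent process, de-duplicating the IDs
    return sorted({i for ids in per_file_ids for i in ids})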
1. It looks like you have a typo in your get_ids method (self._id_list instead of self.__id_list). You can see it if you wait for the result:
result = p.map_async(self.get_ids, file_list)
result.get()
2. When a new child process is created, it gets a copy of the parent's address space; however, any subsequent changes (by either the parent or the child) are not reflected in the memory of the other process. They each have their own private address space.
Example:
$ cat fork.py
import os

l = []
l.append('global')

# Return 0 in the child and the child’s process id in the parent
pid = os.fork()

if pid == 0:
    l.append('child')
    print(f'Child PID: {os.getpid()}, {l}')
else:
    l.append('parent')
    print(f'Parent PID: {os.getpid()}, {l}')

print(l)

$ python3 fork.py
Parent PID: 9933, ['global', 'parent']
['global', 'parent']
Child PID: 9934, ['global', 'child']
['global', 'child']
Now back to your problem, you can use multiprocessing.Manager.list to create an object that is shared between processes:
from multiprocessing import Manager, Pool
m = Manager()
self.__id_list = m.list()
Docs: Sharing state between processes
or use threads as your workload seems to be I/O bound anyway:
from multiprocessing.dummy import Pool as ThreadPool
p = ThreadPool(processes = multiprocessing.cpu_count() - 1)
Alternatively, check out concurrent.futures.
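A minimal concurrent.futures sketch of the same I/O-bound workload (parse_file and collect_ids_threaded are illustrative names, not from the original code):

from concurrent.futures import ThreadPoolExecutor
import multiprocessing
import re

def parse_file(file_name):
    # illustrative helper: pull every "id=<number>" token out of one file
    with open(file_name) as fh:
        return [int(m) for m in re.findall(r'\bid=(\d+)', fh.read(), re.IGNORECASE)]

def collect_ids_threaded(file_list):
    # threads fit this I/O-bound parsing; each task returns its own list,
    # so nothing is shared between workers
    with ThreadPoolExecutor(max_workers=multiprocessing.cpu_count() - 1) as executor:
        per_file_ids = list(executor.map(parse_file, file_list))
    # de-duplicate in the calling thread
    return sorted({i for ids in per_file_ids for i in ids})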

python multiprocessing to create an excel file with multiple sheets [duplicate]

I am new to Python and I am trying to save the results of five different processes to one excel file (each process write to a different sheet). I have read different posts here, but still can't get it done as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

def init(l):
    global lock
    lock = l

def f(k):
    global results
    *** DO SOME STUFF HERE***
    results = results[ *** finished pandas dataframe *** ]
    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in excel (I assume it is the process finishing last). Some questions about this code:
How to avoid defining global variables?
Is it even possible to pass around dataframes?
Should I move the locking to main instead?
I would really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?
In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, which matters because q.get_nowait() is used.
In w(q), it only simulated a long run of writer.to_excel(...) for testing, so I removed it.
2) What is the difference between pool.map and pool = [mp.Process( . )]?
pool.map needs no Queue and no parameter passing, and the code is shorter; each worker process returns its result immediately and terminates. pool.map keeps starting new processes until all iterations are done, and the results have to be processed afterwards.
pool = [mp.Process( . )] starts n processes, and each process terminates on queue.Empty.
Can you think of a situation where you would prefer one method over the other?
Method 1: quick setup, serialized, when you are only interested in the result to continue.
Method 2: when you want to do all of the workload in parallel.
You can't use a global writer across processes; the writer instance has to belong to one process.
Usage of mp.Pool, for instance:
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))
    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    pool.close()
This leads to .to_excel(...) being called sequentially in the __main__ process. If you want .to_excel(...) to run in parallel, you have to use mp.Queue().
For instance:
The worker process:
# the Empty exception used with mp.Queue has to be imported from the queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)
        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)
The writer process:
def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # unpacking the 'STOP' sentinel raises ValueError, which ends the writer
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)
        result.to_excel(writer, sheet_name=titel)
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)

    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)

    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
    time.sleep(0.1)
    for p in pool:
        p.join()

    w_q.put('STOP')
    w_p.join()
Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6

python share thread-objects in manager.dict() between 2 processes

I want to share a dict of thread objects between two processes. I also have another dict of objects, which seems to work at the moment.
The problem is that an exception is raised when I try to add key/value pairs to the dict (the key is an integer and the value is the thread object):
Exception with manager.dict()
TypeError: can't pickle _thread.lock objects
I tried switching from manager.dict() to manager.list(), but it does not work either:
Exception with manager.list()
TypeError: can't pickle _thread.lock objects
The readFiles() function is working correctly.
I use python 3.5.1 (Anaconda)
def startAlgorithm(fNameGraph, fNameEnergyDistribution, fNameRouteTables):
    global _manager, _allTiesets, _allNodes, _stopDistribution
    _manager = Manager()
    _allTiesets = _manager.dict()
    _allNodes = _manager.dict()
    _stopDistribution = Value(c_bool, False)

    readFiles(fNameGraph, fNameEnergyDistribution, fNameRouteTables)
    initializeAlgorithm()

    procTADiC = Process(target=TADiC, args=(_stopDistribution, _allNodes))
    procTA = Process(target=TIESET_AGENT, args=(_stopDistribution, _allNodes, _allTiesets))
    procTADiC.start()
    procTA.start()
    procTADiC.join()
    procTA.join()

def initializeAlgorithm():
    global _graphNX, _routingTable, _energyDistribution, _energyMeanValue

    # Init all Nodes
    allNodeIDs = _graphNX.nodes()
    energySum = 0
    for node in allNodeIDs:
        nodeEnergyLoad = float(_energyDistribution.get(str(node)))
        nodeObj = Node(node, nodeEnergyLoad)
        _allNodes[node] = nodeObj
        energySum = energySum + nodeEnergyLoad

    # Calculate the mean value from the whole energy in the graph
    _energyMeanValue = energySum / len(allNodeIDs)

    # Init all Tieset-Threads
    for tieset in _routingTable:
        tiesetID = int(tieset['TiesetID'])
        connNodes = list(tieset['Nodes'])
        connEdges = list(tieset['Edges'])
        adjTiesets = list(tieset['AdjTiesets'])
        tiesetThread = Tieset(tiesetID, connNodes, connEdges, adjTiesets)
        _allTiesets[tiesetID] = tiesetThread  # Raise Exception!!!!!!!!!!

class Node:
    'Node-Class that hold information about a node in a tieset'
    def __init__(self, nodeID, energyLoad):
        self.nodeID = nodeID
        self.energyLoad = energyLoad
        self.tiesetFlag = False

class Tieset(threading.Thread):
    'Tieset-Class as Thread to distribute the load within the tieset'
    def __init__(self, tiesetID, connectedNodes, connectedEdges, adjTiesets):
        threading.Thread.__init__(self)
        self.tiesetID = tiesetID
        self.connectedNodes = connectedNodes
        self.connectedEdges = connectedEdges
        self.adjTiesets = adjTiesets
        self.leaderNodeID = min(int(n) for n in connectedNodes)
        self.measureCnt = 0

    def run(self):
        print('start Thread')
What I can say is that you can't share threads between processes. You can share the arguments for those threads if you want to start them in different processes, or you can share some of the results. The problem you are seeing is caused by the nature of process creation: in Python, all the parameters are serialized (pickled) in your current process, passed to the new one, and deserialized there to run the "target". A thread object is not serializable (you can check this interesting thread to understand the serialization problem: debugging pickle).
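A minimal sketch of the "share the arguments, not the thread" idea, using the Tieset fields from the question (TiesetSpec and tieset_worker are illustrative names, not from the original code):

from multiprocessing import Manager, Process
import threading

class TiesetSpec:
    'Plain, picklable data describing a tieset: no locks, no Thread state'
    def __init__(self, tiesetID, connectedNodes, connectedEdges, adjTiesets):
        self.tiesetID = tiesetID
        self.connectedNodes = connectedNodes
        self.connectedEdges = connectedEdges
        self.adjTiesets = adjTiesets

def tieset_worker(specs):
    # each process builds its own (local, unpicklable) Thread objects
    # from the shared plain-data specs
    threads = [threading.Thread(target=print, args=('tieset {}'.format(s.tiesetID),))
               for s in specs.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    with Manager() as manager:
        specs = manager.dict()
        specs[1] = TiesetSpec(1, ['a', 'b'], [('a', 'b')], [2])  # picklable, so no TypeError
        p = Process(target=tieset_worker, args=(specs,))
        p.start()
        p.join()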

Implementing multiprocessing in a loop scraper and appending the data

I am making a web scraper to build a database. The site I plan to use has index pages each containing 50 links. The amount of pages to be parsed is estimated to be around 60K and up, this is why I want to implement multiprocessing.
Here is some pseudo-code of what I want to do:
def harvester(index):
    main = dict()
    ....
    links = foo.findAll('a')
    for link in links:
        main.append(worker(link))
        # or maybe something like: map_async(worker(link))

def worker(url):
    ''' this function gathers the data from the given url'''
    return dictionary
Now what I want to do is have a certain number of worker functions gathering data in parallel from different pages. This data would then be appended to a big dictionary located in harvester, or written directly to a CSV file by the worker function.
I'm wondering how I can implement the parallelism. I have done a fair amount of research on gevent, threading and multiprocessing, but I am not sure how to implement it.
I am also not sure whether appending data to a large dictionary, or writing directly to a CSV with DictWriter, will be stable with that many inputs arriving at the same time.
Thanks
I propose splitting your work into separate workers that communicate via queues.
Here you mostly have I/O wait time (crawling, CSV writing),
so you can do the following (not tested, just to show the idea):
import threading
import Queue
import csv

class CsvWriter(threading.Thread):
    def __init__(self, resultq):
        super(CsvWriter, self).__init__()
        self.resultq = resultq
        # DictWriter also needs a fieldnames argument matching your row keys
        self.writer = csv.DictWriter(open('results.csv', 'wb'))

    def run(self):
        done = False
        while not done:
            row = self.resultq.get()
            if row != -1:
                self.writer.writerow(row)
            else:
                done = True

class Crawler(threading.Thread):
    def __init__(self, inputq, resultq):
        super(Crawler, self).__init__()
        self.iq = inputq
        self.oq = resultq

    def run(self):
        done = False
        while not done:
            link = self.iq.get()
            if link != -1:
                result = self.extract_data(link)
                self.oq.put(result)
            else:
                done = True

    def extract_data(self, link):
        # crawl and extract what you need and return a dict
        pass

def main():
    linkq = Queue.Queue()
    for url in your_urls:
        linkq.put(url)

    resultq = Queue.Queue()
    writer = CsvWriter(resultq)
    writer.start()

    crawlers = [Crawler(linkq, resultq) for _ in xrange(10)]
    [c.start() for c in crawlers]
    [linkq.put(-1) for _ in crawlers]
    [c.join() for c in crawlers]

    resultq.put(-1)
    writer.join()
This code should work (after fixing any remaining typos), and it exits once all the URLs have been processed.
