I am attempting to dynamically open and parse several text files (~10) to extract a particular value from a key, and I am using multiprocessing within Python to do this. My issue is that the function I am calling writes data to a class list, which I can see inside the method, but outside the method that list is empty. Refer to the following:
Class:

class MyClass(object):
    __id_list = []

    def __init__(self):
        self.process_wrapper()
Caller Method:
def process_wrapper(self):
    from multiprocessing import Pool
    import multiprocessing

    info_file = 'info*'
    file_list = []

    p = Pool(processes = multiprocessing.cpu_count() - 1)
    for file_name in Path('c:/').glob('**/*/' + info_file):
        file_list.append(str(os.path.join('c:/', file_name)))

    p.map_async(self.get_ids, file_list)
    p.close()
    p.join()

    print(self.__id_list)  # this is showing as empty
Worker method:
def get_ids(self, file_name):
    try:
        with open(file_name) as data:
            for line in data:
                temp_split = line.split()
                for item in temp_split:
                    value_split = str(item).split('=')
                    if 'id' == value_split[0].lower():
                        if int(value_split[1]) not in self._id_list:
                            self.__id_list.append(int(value_split[1]))
    except:
        raise FileReadError(f'There was an issue parsing "{file_name}".')

    print(self.__id_list)  # here the list prints fine
The map_async call returns an AsyncResult object. You should use it to wait for the processing to finish before checking self.__id_list. You might also consider having each worker return a local list, then collecting those lists and aggregating them into the final list, as sketched below.
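A minimal sketch of that approach (the file names are illustrative and the parsing mirrors the question's logic):

from multiprocessing import Pool
import multiprocessing

def get_ids(file_name):
    """Worker: parse one file and return the ids it found as a local list."""
    ids = []
    with open(file_name) as data:
        for line in data:
            for item in line.split():
                key, _, value = item.partition('=')
                if key.lower() == 'id' and value:
                    ids.append(int(value))
    return ids

if __name__ == '__main__':
    file_list = ['info1.txt', 'info2.txt']  # illustrative file names
    with Pool(processes=multiprocessing.cpu_count() - 1) as p:
        result = p.map_async(get_ids, file_list)
        per_file_ids = result.get()  # blocks until all workers finish (and re-raises worker errors)
    # Aggregate the per-file lists into one de-duplicated list in the parent
    id_list = sorted({i for ids in per_file_ids for i in ids})
    print(id_list)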
1. It looks like you have a typo in your get_ids method (self._id_list instead of self.__id_list). You can see it if you wait for the result:
result = p.map_async(self.get_ids, file_list)
result.get()
2. When a new child process is created, it gets a copy of the parent's address space; however, any subsequent changes (by either parent or child) are not reflected in the memory of the other process. Each has its own private address space.
Example:
$ cat fork.py
import os
l = []
l.append('global')
# Return 0 in the child and the child’s process id in the parent
pid = os.fork()
if pid == 0:
    l.append('child')
    print(f'Child PID: {os.getpid()}, {l}')
else:
    l.append('parent')
    print(f'Parent PID: {os.getpid()}, {l}')

print(l)
$ python3 fork.py
Parent PID: 9933, ['global', 'parent']
['global', 'parent']
Child PID: 9934, ['global', 'child']
['global', 'child']
Now back to your problem, you can use multiprocessing.Manager.list to create an object that is shared between processes:
from multiprocessing import Manager, Pool
m = Manager()
self.__id_list = m.list()
Docs: Sharing state between processes
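For illustration, a minimal standalone sketch of the Manager approach (file names are illustrative; the parsing mirrors the question's logic):

from multiprocessing import Manager, Pool
import multiprocessing

def get_ids(args):
    """Worker: append ids found in one file to the shared list."""
    file_name, shared_ids = args
    with open(file_name) as data:
        for line in data:
            for item in line.split():
                key, _, value = item.partition('=')
                # Note: this check-then-append is not atomic across workers
                if key.lower() == 'id' and value and int(value) not in shared_ids:
                    shared_ids.append(int(value))

if __name__ == '__main__':
    file_list = ['info1.txt', 'info2.txt']  # illustrative file names
    with Manager() as m:
        shared_ids = m.list()  # proxy object shared between processes
        with Pool(processes=multiprocessing.cpu_count() - 1) as p:
            p.map(get_ids, [(f, shared_ids) for f in file_list])
        print(list(shared_ids))  # changes made by the workers are visible here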
Or use threads, since your workload seems to be I/O-bound anyway:
from multiprocessing.dummy import Pool as ThreadPool
p = ThreadPool(processes = multiprocessing.cpu_count() - 1)
Alternatively, check out concurrent.futures, for instance:
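A minimal sketch using concurrent.futures (ThreadPoolExecutor here, since the work is I/O-bound; the worker function and file names are illustrative):

from concurrent.futures import ThreadPoolExecutor

def count_lines(file_name):
    """Illustrative worker: any per-file function works the same way."""
    with open(file_name) as data:
        return sum(1 for _ in data)

if __name__ == '__main__':
    file_list = ['info1.txt', 'info2.txt']  # illustrative file names
    with ThreadPoolExecutor() as executor:
        # map() preserves input order and collects the return values
        results = list(executor.map(count_lines, file_list))
    print(results)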
Related
I'm using the following code to get a list of filings from AWS. I'm not sure where it went wrong.
import time
import datetime
from collections import deque
from typing import List, Deque, Iterable, Dict
import logging

import boto3

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed, Future

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

BUCKET: str = "irs-form-990"
EARLIEST_YEAR: int = 2009
cur_year: int = datetime.datetime.now().year
first_prefix: int = EARLIEST_YEAR * 100
last_prefix: int = (cur_year + 1) * 100

def get_keys_for_prefix(prefix: str) -> Iterable[str]:
    """Return a collection of all key names starting with the specified prefix."""
    client = boto3.client('s3')

    # See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=BUCKET, Prefix=prefix)

    # A deque is a collection with O(1) appends and O(n) iteration
    results: Deque[str] = deque()
    i = 0
    for i, page in enumerate(page_iterator):
        if "Contents" not in page:
            continue
        # You could also capture, e.g., the timestamp or checksum here
        page_keys: Iterable = (element["Key"] for element in page["Contents"])
        results.extend(page_keys)

    logging.info("Scanned {} page(s) with prefix {}.".format(i + 1, prefix))
    return results

start: float = time.time()

# ProcessPoolExecutor starts a completely separate copy of Python for each worker
with ProcessPoolExecutor() as executor:
    futures: Deque[Future] = deque()
    for prefix in range(first_prefix, last_prefix):
        future: Future = executor.submit(get_keys_for_prefix, str(prefix))
        futures.append(future)

    n = 0
    # as_completed ignores submission order to prevent unnecessary waiting
    for future in as_completed(futures):
        keys: Iterable = future.result()
        for key in keys:
            # Do your analysis here
            n += 1

elapsed: float = time.time() - start
logging.info("Discovered {:,} keys in {:,.1f} seconds.".format(n, elapsed))
I'm getting the following errors:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
and
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Since this is third-party code, I can't fix it myself (also, I'm a novice in Python). Any help is appreciated.
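For reference, a minimal, self-contained sketch of the if __name__ == '__main__': idiom the traceback refers to (the work function and prefix range below are stand-ins, not part of the question's code):

from concurrent.futures import ProcessPoolExecutor, as_completed

def work(prefix: str) -> int:
    # Stand-in for per-prefix work such as listing S3 keys
    return len(prefix)

def main() -> None:
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(work, str(p)) for p in range(200900, 200910)]
        total = sum(f.result() for f in as_completed(futures))
    print(total)

# Without this guard, platforms that spawn (rather than fork) worker processes
# re-execute the module's top-level code in every worker, which is what the
# RuntimeError above complains about.
if __name__ == '__main__':
    main()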
I would like to parallelize a process in Python that needs read access to several large, non-array data structures. What would be a recommended way to do this without copying all of the large data structures into every new process?
Thank you
The multiprocessing package provides two ways of sharing state: shared memory objects and server process managers. For this case, use a server process manager, as managers support arbitrary object types.
The following program makes use of a server process manager:
#!/usr/bin/env python3

from multiprocessing import Process, Manager

# Simple data structure
class DataStruct:
    data_id = None
    data_str = None

    def __init__(self, data_id, data_str):
        self.data_id = data_id
        self.data_str = data_str

    def __str__(self):
        return f"{self.data_str} has ID {self.data_id}"

    def __repr__(self):
        return f"({self.data_id}, {self.data_str})"

    def set_data_id(self, data_id):
        self.data_id = data_id

    def set_data_str(self, data_str):
        self.data_str = data_str

    def get_data_id(self):
        return self.data_id

    def get_data_str(self):
        return self.data_str

# Create function to manipulate data
def manipulate_data_structs(data_structs, find_str):
    for ds in data_structs:
        if ds.get_data_str() == find_str:
            print(ds)

# Create manager context, modify the data
with Manager() as manager:
    # List of DataStruct objects
    l = manager.list([
        DataStruct(32, "Andrea"),
        DataStruct(45, "Bill"),
        DataStruct(21, "Claire"),
    ])

    # Processes that look for DataStructs with a given String
    procs = [
        Process(target = manipulate_data_structs, args = (l, "Andrea")),
        Process(target = manipulate_data_structs, args = (l, "Claire")),
        Process(target = manipulate_data_structs, args = (l, "David")),
    ]

    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
For more information, see Sharing state between processes in the documentation.
I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read different posts here, but still can't get it done, as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()

    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

def init(l):
    global lock
    lock = l

def f(k):
    global results

    *** DO SOME STUFF HERE***

    results = results[ *** finished pandas dataframe *** ]

    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in Excel (I assume it is the one from the process that finishes last). Some questions about this code:
How to avoid defining global variables?
Is it even possible to pass around dataframes?
Should I move the locking to main instead?
I'd really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?

In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, since q.get_nowait() is used.
In w(q), it was only there for testing, to simulate a long run of writer.to_excel(...); I removed that one.

2) What is the difference between pool.map and pool = [mp.Process( . )]?

Using pool.map needs no Queue and no parameters passed by hand, and the code is shorter. The worker process returns its result immediately and terminates. pool.map keeps starting tasks until all iterations are done; the results have to be processed after that.
Using pool = [mp.Process( . )] starts n processes. Each process terminates on queue.Empty.

Can you think of a situation where you would prefer one method over the other?

Method 1: Quick setup, serialized writing, when you are only interested in the results before continuing.
Method 2: When you want to do all of the workload in parallel.

You can't use a global writer across processes. The writer instance has to belong to a single process.
Usage of mp.Pool, for instance:
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))

    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    pool.close()
This means .to_excel(...) is called sequentially in the __main__ process.
If you want the .to_excel(...) calls to run in parallel, you have to use mp.Queue().
For instance:
The worker process:
# mp.Queue exceptions have to be imported from the queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)

        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)

        q.put( (list_of_days[k], results) )
        time.sleep(0.1)
The writer process:
def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # Unpacking the 'STOP' sentinel raises ValueError and ends the loop
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)

        result.to_excel(writer, sheet_name=titel)
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)

    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)

    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
    time.sleep(0.1)

    for p in pool:
        p.join()

    w_q.put('STOP')
    w_p.join()
Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6
I'm running a spell correction function on a dataset I have. I used from pathos.multiprocessing import ProcessingPool as Pool to do the job. Once the processing is done, I'd like to actually access the results. Here is my code:
import codecs
import nltk
from textblob import TextBlob
from nltk.tokenize import sent_tokenize
from pathos.multiprocessing import ProcessingPool as Pool

class SpellCorrect():

    def load_data(self, path_1):
        with codecs.open(path_1, "r", "utf-8") as file:
            data = file.read()
        return sent_tokenize(data)

    def correct_spelling(self, data):
        data = TextBlob(data)
        return str(data.correct())

    def run_clean(self, path_1):
        pool = Pool()
        data = self.load_data(path_1)
        return pool.amap(self.correct_spelling, data)

if __name__ == "__main__":
    path_1 = "../Data/training_data/training_corpus.txt"
    SpellCorrect = SpellCorrect()
    result = SpellCorrect.run_clean(path_1)
    print(result)
    result = " ".join(temp for temp in result)
    with codecs.open("../Data/training_data/training_data_spell_corrected.txt", "a", "utf-8") as file:
        file.write(result)
If you look at the main block, when I do print(result) I get an object of type <multiprocess.pool.MapResult object at 0x1a25519f28>.
I try to access the results with result = " ".join(temp for temp in result), but then I get the following error: TypeError: 'MapResult' object is not iterable. I've tried casting it to a list with list(result), but I still get the same error. What can I do to fix this?
The multiprocess.pool.MapResult object is not iterable because it inherits from AsyncResult and has only the following methods:
wait([timeout]): Wait until the result is available or until timeout seconds pass. This method always returns None.
ready(): Return whether the call has completed.
successful(): Return whether the call completed without raising an exception. Will raise AssertionError if the result is not ready.
get([timeout]): Return the result when it arrives. If timeout is not None and the result does not arrive within timeout seconds, then TimeoutError is raised. If the remote call raised an exception, then that exception will be reraised as a RemoteError by get().
You can check the examples of how to use the get() function here:
https://docs.python.org/3/library/multiprocessing.html#using-a-pool-of-workers
from multiprocessing import Pool, TimeoutError
import time
import os

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)               # start 4 worker processes

    # print "[0, 1, 4,..., 81]"
    print(pool.map(f, range(10)))

    # print same numbers in arbitrary order
    for i in pool.imap_unordered(f, range(10)):
        print(i)

    # evaluate "f(20)" asynchronously
    res = pool.apply_async(f, (20,))       # runs in *only* one process
    print(res.get(timeout=1))              # prints "400"

    # evaluate "os.getpid()" asynchronously
    res = pool.apply_async(os.getpid, ())  # runs in *only* one process
    print(res.get(timeout=1))              # prints the PID of that process

    # launching multiple evaluations asynchronously *may* use more processes
    multiple_results = [pool.apply_async(os.getpid, ()) for i in range(4)]
    print([res.get(timeout=1) for res in multiple_results])

    # make a single worker sleep for 10 secs
    res = pool.apply_async(time.sleep, (10,))
    try:
        print(res.get(timeout=1))
    except TimeoutError:
        print("We lacked patience and got a multiprocessing.TimeoutError")
I want to share a dict of thread-objects between 2 processes. I have also another dict of objects which seems to work at the moment.
The problem is that it raises an exception when I try to add key/value pairs to the dict (key is an integer and value is the thread-object):
Exception with manager.dict()
TypeError: can't pickle _thread.lock objects
I tried to switch from manager.dict() to manager.list(), but it does not work either:
Exception with manager.list()
TypeError: can't pickle _thread.lock objects
The readFiles() function works correctly. I'm using Python 3.5.1 (Anaconda).
def startAlgorithm(fNameGraph, fNameEnergyDistribution, fNameRouteTables):
    global _manager, _allTiesets, _allNodes, _stopDistribution
    _manager = Manager()
    _allTiesets = _manager.dict()
    _allNodes = _manager.dict()
    _stopDistribution = Value(c_bool, False)

    readFiles(fNameGraph, fNameEnergyDistribution, fNameRouteTables)
    initializeAlgorithm()

    procTADiC = Process(target=TADiC, args=(_stopDistribution, _allNodes))
    procTA = Process(target=TIESET_AGENT, args=(_stopDistribution, _allNodes, _allTiesets))
    procTADiC.start()
    procTA.start()
    procTADiC.join()
    procTA.join()

def initializeAlgorithm():
    global _graphNX, _routingTable, _energyDistribution, _energyMeanValue

    # Init all Nodes
    allNodeIDs = _graphNX.nodes()
    energySum = 0
    for node in allNodeIDs:
        nodeEnergyLoad = float(_energyDistribution.get(str(node)))
        nodeObj = Node(node, nodeEnergyLoad)
        _allNodes[node] = nodeObj
        energySum = energySum + nodeEnergyLoad

    # Calculate the mean value from the whole energy in the graph
    _energyMeanValue = energySum / len(allNodeIDs)

    # Init all Tieset-Threads
    for tieset in _routingTable:
        tiesetID = int(tieset['TiesetID'])
        connNodes = list(tieset['Nodes'])
        connEdges = list(tieset['Edges'])
        adjTiesets = list(tieset['AdjTiesets'])
        tiesetThread = Tieset(tiesetID, connNodes, connEdges, adjTiesets)
        _allTiesets[tiesetID] = tiesetThread    # Raises the exception!!!!!!!!!!

class Node:
    'Node-Class that holds information about a node in a tieset'

    def __init__(self, nodeID, energyLoad):
        self.nodeID = nodeID
        self.energyLoad = energyLoad
        self.tiesetFlag = False

class Tieset(threading.Thread):
    'Tieset-Class as Thread to distribute the load within the tieset'

    def __init__(self, tiesetID, connectedNodes, connectedEdges, adjTiesets):
        threading.Thread.__init__(self)
        self.tiesetID = tiesetID
        self.connectedNodes = connectedNodes
        self.connectedEdges = connectedEdges
        self.adjTiesets = adjTiesets
        self.leaderNodeID = min(int(n) for n in connectedNodes)
        self.measureCnt = 0

    def run(self):
        print('start Thread')
What I can say is that you can't share threads between processes. You can share the arguments used to build those threads if you want to start them in different processes, or you can share some of their results. The problem you are seeing is caused by the nature of process creation: in Python, all the parameters are serialized in the current process, passed to the new one, and then deserialized there to run the "target". Apparently, a thread object is not serializable (you can check this interesting thread on debugging pickle to understand the serialization problem).
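A minimal sketch of that idea with stand-in names: keep only picklable data (for example the constructor arguments) in the manager dict, and build the Thread objects inside the process that runs them:

from multiprocessing import Manager, Process
import threading

class Worker(threading.Thread):
    """Stand-in for the question's Tieset thread class."""
    def __init__(self, worker_id, payload):
        threading.Thread.__init__(self)
        self.worker_id = worker_id
        self.payload = payload

    def run(self):
        print(f'worker {self.worker_id} handling {self.payload}')

def run_workers(shared_args):
    # Build the (unpicklable) Thread objects here, inside the child process,
    # from the plain picklable data shared through the manager dict.
    threads = [Worker(wid, payload) for wid, payload in shared_args.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    with Manager() as m:
        shared_args = m.dict({1: 'nodes A,B', 2: 'nodes C,D'})  # picklable values only
        p = Process(target=run_workers, args=(shared_args,))
        p.start()
        p.join()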