Multiprocessing with pandas dataframe - python

I would like to use multiprocessing on a large pandas data frame. I want to set a column entry of that data frame based on another column's value. It is simple labeling done with some if-statements.
This is the minimal example I have tried:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
def worker(data):
    '''worker function'''
    try:
        assert type(data) == pd.core.frame.DataFrame
        for i in data.index:
            if 0 < data['Value'].iloc[i] <= 2:
                data['Label'].iloc[i] = 'low'
            elif 2 < data['Value'].iloc[i] <= 4:
                data['Label'].iloc[i] = 'medium'
            elif 4 < data['Value'].iloc[i] <= 6:
                data['Label'].iloc[i] = 'high'
            else:
                data['Label'].iloc[i] = 'very high'
    except AssertionError:
        print('Data has to be pandas df!')
if __name__ == '__main__':
    # dummy data set
    df = pd.DataFrame(np.random.randint(0, 10, 1001), columns=['Value'])
    df['Labels'] = 0
    num_cores = multiprocessing.cpu_count()
    splits = np.linspace(0, len(df), num_cores + 1, dtype=int)
    jobs = []
    for i in range(num_cores):
        lower_bound = splits[i]
        upper_bound = splits[i + 1]
        p = multiprocessing.Process(target=worker, args=(df.iloc[lower_bound:upper_bound]))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(jobs)
However, when I run it I get the following error output:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-5:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-6:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-7:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
Process Process-8:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
TypeError: worker() takes 1 positional argument but 2 were given
[<Process(Process-1, stopped[1])>, <Process(Process-2, stopped[1])>, <Process(Process-3, stopped[1])>, <Process(Process-4, stopped[1])>, <Process(Process-5, stopped[1])>, <Process(Process-6, stopped[1])>, <Process(Process-7, stopped[1])>, <Process(Process-8, stopped[1])>]
I'm not sure what's going wrong with the multiprocessing...?
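As an aside, threshold labeling like this does not need a per-row loop at all. A minimal vectorized sketch using pd.cut, with bin edges taken from the if-statements above (note the original else branch also catches values <= 0, which would need separate handling):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, 1001), columns=['Value'])
# bins mirror the if/elif thresholds: (0, 2], (2, 4], (4, 6], (6, inf)
df['Label'] = pd.cut(df['Value'],
                     bins=[0, 2, 4, 6, np.inf],
                     labels=['low', 'medium', 'high', 'very high'])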

Related

How to stop blocking call and exit thread gracefully in python

I have a working script which invokes multiple blocking read methods in different threads; whenever a blocking method receives data, it puts the data on a queue to be read by a task.
import asyncio
import atexit
import concurrent.futures
import threading
import time

def blocking_read(x, loop):
    while True:
        print(f"blocking_read {x}")
        time.sleep(1)
        # implementation: wait for data and put it into the queue

class Recorder():
    def __init__(self):
        self.count = 0

    def run_reader_threads(self, thread_pool, event):
        loop = asyncio.get_running_loop()
        threads = []
        for rx in range(2):
            threads.append(loop.run_in_executor(thread_pool, blocking_read, rx, loop))
        return threads

def wiatforever():
    while True:
        time.sleep(1)

reader_futures = []
executor = concurrent.futures.ThreadPoolExecutor()
event = threading.Event()

def main():
    async def doit():
        recorder_app = Recorder()
        global reader_futures
        reader_futures = recorder_app.run_reader_threads(executor, event)
        # reader_handlers = recorder_app.create_handlers()
        await wiatforever()
    try:
        print("RUN DO IT")
        asyncio.run(doit())
    except KeyboardInterrupt:
        # cancel all futures, but getting exception on cancelling 1st future
        reader_futures[0].cancel()
        pass

# --------------------------------------------------------------------------------------------
if __name__ == '__main__':
    main()
The problem is that whenever I try to stop the script gracefully by cancelling all futures, I get the exception RuntimeError: Event loop is closed.
OUTPUT
RUN DO IT
blocking_read 0
blocking_read 1
blocking_read 1
blocking_read 0
blocking_read 0
blocking_read 1
^CTraceback (most recent call last):
File "/home/devuser/Desktop/rnr_crash/2trial.py", line 45, in main
asyncio.run(doit())
File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/local/lib/python3.9/asyncio/base_events.py", line 629, in run_until_complete
self.run_forever()
File "/usr/local/lib/python3.9/asyncio/base_events.py", line 596, in run_forever
self._run_once()
File "/usr/local/lib/python3.9/asyncio/base_events.py", line 1890, in _run_once
handle._run()
File "/usr/local/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/devuser/Desktop/rnr_crash/2trial.py", line 41, in doit
await wiatforever()
File "/home/devuser/Desktop/rnr_crash/2trial.py", line 27, in wiatforever
time.sleep(1)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/devuser/Desktop/rnr_crash/2trial.py", line 53, in <module>
main()
File "/home/devuser/Desktop/rnr_crash/2trial.py", line 48, in main
reader_futures[0].cancel()
File "/usr/local/lib/python3.9/asyncio/base_events.py", line 746, in call_soon
self._check_closed()
File "/usr/local/lib/python3.9/asyncio/base_events.py", line 510, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
blocking_read 0
blocking_read 1
blocking_read 0
blocking_read 1
blocking_read 0
blocking_read 1
^CException ignored in: <module 'threading' from '/usr/local/lib/python3.9/threading.py'>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/threading.py", line 1411, in _shutdown
atexit_call()
File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 31, in _python_exit
t.join()
File "/usr/local/lib/python3.9/threading.py", line 1029, in join
self._wait_for_tstate_lock()
File "/usr/local/lib/python3.9/threading.py", line 1045, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt:
Can anyone help me find the issue here and exit the script correctly without any exception?
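For reference, one way to avoid the RuntimeError is to keep the loop open until the worker threads have actually stopped. A minimal sketch, assuming blocking_read can be changed to poll a threading.Event (the original version loops forever and cannot be interrupted):
import asyncio
import concurrent.futures
import threading
import time

stop = threading.Event()

def blocking_read(x):
    while not stop.is_set():  # poll the event so the thread can exit cleanly
        print(f"blocking_read {x}")
        time.sleep(1)

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    futures = [loop.run_in_executor(executor, blocking_read, rx) for rx in range(2)]
    try:
        loop.run_forever()
    except KeyboardInterrupt:
        stop.set()  # ask the reader threads to finish their current iteration
        loop.run_until_complete(asyncio.gather(*futures))  # drain while the loop is still open
    finally:
        loop.close()

if __name__ == '__main__':
    main()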

Multiple errors when running example pyqtgraph program (mainly sip type errors)

I have been trying to make some graphs in PyQt5 that can update more quickly and efficiently than my currently embedded matplotlib ones.
I keep running into the same problem whenever I run any example code involving pyqtgraph, which always throws the following error:
"TypeError: isdeleted() argument 1 must be sip.simplewrapper, not PlotWidget"
Environment:
Spyder 3.3.2 | Python 3.7.1 64-bit | Qt 5.9.6 | PyQt5 5.9.2 | Windows 10.
After running pip freeze I learned my versions are numpy==1.20.1, PyQt5==5.15.2, PyQt5-sip==12.8.1, pyqtgraph==0.11.1.
I'm using a very simple test graph from a tutorial (Link).
from PyQt5 import QtWidgets
from pyqtgraph import PlotWidget, plot
import pyqtgraph as pg
import sys  # We need sys so that we can pass argv to QApplication
import os

class MainWindow(QtWidgets.QMainWindow):
    def __init__(self, *args, **kwargs):
        super(MainWindow, self).__init__(*args, **kwargs)
        self.graphWidget = pg.PlotWidget()
        self.setCentralWidget(self.graphWidget)
        hour = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        temperature = [30, 32, 34, 32, 33, 31, 29, 32, 35, 45]
        # plot data: x, y values
        self.graphWidget.plot(hour, temperature)

def main():
    app = QtWidgets.QApplication(sys.argv)
    main = MainWindow()
    main.show()
    sys.exit(app.exec_())

if __name__ == '__main__':
    main()
The code closest to the error in the traceback is in pyqtgraph's Qt.py, in the following block:
# Common to PyQt4 and 5
if QT_LIB in [PYQT4, PYQT5]:
    QtVersion = QtCore.QT_VERSION_STR
    try:
        from PyQt5 import sip
    except ImportError:
        import sip

    def isQObjectAlive(obj):
        return not sip.isdeleted(obj)

    loadUiType = uic.loadUiType
    QtCore.Signal = QtCore.pyqtSignal
The full traceback is much longer, and is as follows:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsObject.py", line 40, in itemChange
ret = sip.cast(ret, QtGui.QGraphicsItem)
TypeError: cast() argument 1 must be sip.simplewrapper, not PlotItem
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\ViewBox\ViewBox.py", line 438, in resizeEvent
self.updateAutoRange()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\ViewBox\ViewBox.py", line 890, in updateAutoRange
childRange = self.childrenBounds(frac=fractionVisible)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\ViewBox\ViewBox.py", line 1355, in childrenBounds
px, py = [v.length() if v is not None else 0 for v in self.childGroup.pixelVectors()]
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 189, in pixelVectors
dt = self.deviceTransform()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 108, in deviceTransform
view = self.getViewWidget()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 65, in getViewWidget
if v is not None and not isQObjectAlive(v):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\Qt.py", line 328, in isQObjectAlive
return not sip.isdeleted(obj)
TypeError: isdeleted() argument 1 must be sip.simplewrapper, not PlotWidget
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsObject.py", line 40, in itemChange
ret = sip.cast(ret, QtGui.QGraphicsItem)
TypeError: cast() argument 1 must be sip.simplewrapper, not PlotDataItem
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsObject.py", line 40, in itemChange
ret = sip.cast(ret, QtGui.QGraphicsItem)
TypeError: cast() argument 1 must be sip.simplewrapper, not PlotDataItem
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsObject.py", line 26, in itemChange
self.parentChanged()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 463, in parentChanged
self._updateView()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 480, in _updateView
view = self.getViewBox()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 88, in getViewBox
vb = self.getViewWidget()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 65, in getViewWidget
if v is not None and not isQObjectAlive(v):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\Qt.py", line 328, in isQObjectAlive
return not sip.isdeleted(obj)
TypeError: isdeleted() argument 1 must be sip.simplewrapper, not PlotWidget
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsObject.py", line 40, in itemChange
ret = sip.cast(ret, QtGui.QGraphicsItem)
TypeError: cast() argument 1 must be sip.simplewrapper, not ChildGroup
Traceback (most recent call last):
File "<ipython-input-2-5f5dea77ec5e>", line 1, in <module>
runfile('C:/Users/dowdt/GoogleDrive/Documents/Purdue/GraduateSchool/Homologation/Software/Python Test Code/Python GUI practice/pyqt5LiveGraphExample.py', wdir='C:/Users/dowdt/GoogleDrive/Documents/Purdue/GraduateSchool/Homologation/Software/Python Test Code/Python GUI practice')
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/dowdt/GoogleDrive/Documents/Purdue/GraduateSchool/Homologation/Software/Python Test Code/Python GUI practice/pyqt5LiveGraphExample.py", line 30, in <module>
main()
File "C:/Users/dowdt/GoogleDrive/Documents/Purdue/GraduateSchool/Homologation/Software/Python Test Code/Python GUI practice/pyqt5LiveGraphExample.py", line 24, in main
main = MainWindow()
File "C:/Users/dowdt/GoogleDrive/Documents/Purdue/GraduateSchool/Homologation/Software/Python Test Code/Python GUI practice/pyqt5LiveGraphExample.py", line 19, in __init__
self.graphWidget.plot(hour, temperature)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\PlotItem\PlotItem.py", line 653, in plot
self.addItem(item, params=params)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\PlotItem\PlotItem.py", line 530, in addItem
self.vb.addItem(item, *args, **vbargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\ViewBox\ViewBox.py", line 409, in addItem
self.updateAutoRange()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\ViewBox\ViewBox.py", line 890, in updateAutoRange
childRange = self.childrenBounds(frac=fractionVisible)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\ViewBox\ViewBox.py", line 1355, in childrenBounds
px, py = [v.length() if v is not None else 0 for v in self.childGroup.pixelVectors()]
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 189, in pixelVectors
dt = self.deviceTransform()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 108, in deviceTransform
view = self.getViewWidget()
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\graphicsItems\GraphicsItem.py", line 65, in getViewWidget
if v is not None and not isQObjectAlive(v):
File "C:\ProgramData\Anaconda3\lib\site-packages\pyqtgraph\Qt.py", line 328, in isQObjectAlive
return not sip.isdeleted(obj)
TypeError: isdeleted() argument 1 must be sip.simplewrapper, not PlotWidget
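This error pattern, where sip rejects pyqtgraph's own widget types, typically means two different sip/PyQt5 builds have been loaded in the same session, and the mismatch between the Spyder-reported PyQt5 5.9.2 and the pip-installed PyQt5==5.15.2 above points the same way. A hedged diagnostic sketch to confirm which builds are actually being imported (mirroring the Qt.py import fallback):
import PyQt5
from PyQt5 import QtCore
try:
    from PyQt5 import sip  # bundled sip on newer PyQt5 builds
except ImportError:
    import sip  # standalone sip used by older builds
import pyqtgraph

print("PyQt5:", QtCore.PYQT_VERSION_STR, PyQt5.__file__)
print("Qt:", QtCore.QT_VERSION_STR)
print("sip:", sip.SIP_VERSION_STR, getattr(sip, "__file__", "(builtin)"))
print("pyqtgraph:", pyqtgraph.__version__, pyqtgraph.__file__)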

Empty list when using processes in Python

I am trying to use the multiprocessing library to speed up reading CSV files. I've done so using Pool, and now I'm trying to do it with Process(). However, when concatenating the list to create a dataframe, I get the following error:
ValueError: No objects to concatenate
To me it looks like the processes are overwriting the uber_data list. What am I missing here?
import glob
import pandas as pd
from multiprocessing import Process
import matplotlib.pyplot as plt
import os

location = "/home/data/csv/"
uber_data = []

def read_csv(filename):
    return uber_data.append(pd.read_csv(filename))

def data_wrangling(uber_data):
    uber_data['Date/Time'] = pd.to_datetime(uber_data['Date/Time'], format="%m/%d/%Y %H:%M:%S")
    uber_data['Dia Setmana'] = uber_data['Date/Time'].dt.weekday_name
    uber_data['Num dia'] = uber_data['Date/Time'].dt.dayofweek
    return uber_data

def plotting(uber_data):
    weekdays = uber_data.pivot_table(index=['Num dia', 'Dia Setmana'], values='Base', aggfunc='count')
    weekdays.plot(kind='bar', figsize=(8, 6))
    plt.ylabel('Total Journeys')
    plt.title('Journey on Week Day')

def main():
    processes = []
    files = list(glob.glob(os.path.join(location, '*.csv*')))
    for file in files:
        print(file)
        p = Process(target=read_csv, args=[file])
        processes.append(p)
        p.start()
    for i, process in enumerate(processes):
        process.join()
    print(uber_data)
    combined_df = pd.concat(uber_data, ignore_index=True)
    dades_mod = data_wrangling(combined_df)
    plotting(dades_mod)

main()
Traceback is:
Process Process-223:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "<timed exec>", line 17, in read_csv
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 301, in __init__
objs = list(objs)
TypeError: 'NoneType' object is not iterable
Process Process-224:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "<timed exec>", line 17, in read_csv
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 301, in __init__
objs = list(objs)
TypeError: 'NoneType' object is not iterable
Process Process-221:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "<timed exec>", line 17, in read_csv
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 301, in __init__
objs = list(objs)
TypeError: 'NoneType' object is not iterable
Process Process-222:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
Process Process-225:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "<timed exec>", line 17, in read_csv
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "<timed exec>", line 17, in read_csv
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 301, in __init__
objs = list(objs)
TypeError: 'NoneType' object is not iterable
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 301, in __init__
objs = list(objs)
TypeError: 'NoneType' object is not iterable
Process Process-220:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "<timed exec>", line 17, in read_csv
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 255, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 301, in __init__
objs = list(objs)
TypeError: 'NoneType' object is not iterable
[]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<timed eval> in <module>
<timed exec> in main()
/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
253 verify_integrity=verify_integrity,
254 copy=copy,
--> 255 sort=sort,
256 )
257
/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
302
303 if len(objs) == 0:
--> 304 raise ValueError("No objects to concatenate")
305
306 if keys is None:
ValueError: No objects to concatenate
Thank you
The uber_data list in each process is not the same object as the uber_data in the main process; you can't really share data between processes this way.
import pandas as pd
from multiprocessing import Process

def read_csv(filename=r'c:\pyProjects\data.csv'):
    print(id(uber_data))
    uber_data.append(pd.read_csv(filename))

def main():
    processes = []
    for file in range(4):
        p = Process(target=read_csv)
        processes.append(p)
        p.start()
    for i, process in enumerate(processes):
        process.join()
    return processes

if __name__ == '__main__':
    uber_data = []
    print(id(uber_data))
    ps = main()
Prints
PS C:\pyProjects> py -m tmp
2632505050432
1932359777344
2230288136512
2039196563648
2479121315968
You could use a Queue to send the data back to the main process.
import pandas as pd
from multiprocessing import Process, Queue

def read_csv(filename=r'c:\pyProjects\data.csv', q=None):
    q.put(pd.read_csv(filename))

def main(q):
    processes = []
    for file in range(4):
        p = Process(target=read_csv, kwargs={'q': q})
        processes.append(p)
        p.start()
    for i, process in enumerate(processes):
        process.join()
    while not q.empty():
        print('.')
        uber_data.append(q.get(block=True))
    return processes

if __name__ == '__main__':
    uber_data = []
    q = Queue()
    ps = main(q)
    for thing in uber_data:
        print(thing.head().to_string())
        print('**')
Or you could use threads.
import pandas as pd
from threading import Thread

def g(filename):
    uber_data.append(pd.read_csv(filename))

if __name__ == '__main__':
    uber_data = []
    threads = []
    for _ in range(4):
        threads.append(Thread(target=g, args=(r'c:\pyProjects\data.csv',)))
    for t in threads:
        t.start()
    while any(t.is_alive() for t in threads):
        pass
    for thing in uber_data:
        print(thing.head().to_string())
        print('**')
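For completeness, the Pool approach the question says already worked can be written so that each worker returns its frame instead of appending to a global. A sketch, reusing the location glob from the question:
import glob
import os
import pandas as pd
from multiprocessing import Pool

location = "/home/data/csv/"

def read_csv(filename):
    return pd.read_csv(filename)  # return the frame instead of touching a global list

if __name__ == '__main__':
    files = glob.glob(os.path.join(location, '*.csv*'))
    with Pool() as pool:
        frames = pool.map(read_csv, files)  # results are pickled back to the parent
    combined_df = pd.concat(frames, ignore_index=True)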

Multiprocessing deadlocks during large computation using Pool().apply_async

I have an issue in Python 3.7.3 where my multiprocessing operation (using Queue, Pool, and apply_async) deadlocks when handling large computational tasks.
For small computations, this multiprocessing task works just fine. However, when dealing with larger processes, the multiprocessing task stops, or deadlocks, altogether without exiting the process! I read that this will happen if you "grow your queue without bounds, and you are joining up to a subprocess that is waiting for room in the queue [...] your main process is stalled waiting for that one to complete, and it never will." (Process.join() and queue don't work with large numbers)
I am having trouble converting this concept into code. I would greatly appreciate guidance on refactoring the code I have written below:
import multiprocessing as mp

def listener(q, d):  # task to queue information into a manager dictionary
    while True:
        item_to_write = q.get()
        if item_to_write == 'kill':
            break
        foo = d['region']
        foo.add(item_to_write)
        d['region'] = foo  # add items and set to manager dictionary

def main():
    manager = mp.Manager()
    q = manager.Queue()
    d = manager.dict()
    d['region'] = set()
    pool = mp.Pool(mp.cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, d))
    jobs = []
    for i in range(24):
        job = pool.apply_async(execute_search, (q, d))  # task for multiprocessing
        jobs.append(job)
    for job in jobs:
        job.get()  # wait for each multiprocessing task to complete
    q.put('kill')  # stop the listener (see listener function)
    pool.close()
    pool.join()
    print('process complete')

if __name__ == '__main__':
    main()
Ultimately, I would like to prevent deadlocking altogether so that the multiprocessing task can run for as long as needed until completion.
Below is the traceback produced when breaking out of the deadlock in bash:
^CTraceback (most recent call last):
File "multithread_search_cl_gamma.py", line 260, in <module>
main(GEOTAG)
File "multithread_search_cl_gamma.py", line 248, in main
job.get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 651, in get
Process ForkPoolWorker-28:
Process ForkPoolWorker-31:
Process ForkPoolWorker-30:
Process ForkPoolWorker-27:
Process ForkPoolWorker-29:
Process ForkPoolWorker-26:
self.wait(timeout)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 648, in wait
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
with self._rlock:
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
self._event.wait(timeout)
File "/Users/Ira/anaconda3/lib/python3.7/threading.py", line 552, in wait
Traceback (most recent call last):
Traceback (most recent call last):
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 352, in get
res = self._reader.recv_bytes()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
with self._rlock:
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
signaled = self._cond.wait(timeout)
File "/Users/Ira/anaconda3/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
KeyboardInterrupt
with self._rlock:
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
with self._rlock:
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
task = get()
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
with self._rlock:
File "/Users/Ira/anaconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Below is the updated script:
import multiprocessing as mp
import queue
import time

def listener(q, d, stop_event):
    while not stop_event.is_set():
        try:
            while True:
                item_to_write = q.get(False)
                if item_to_write == 'kill':
                    break
                foo = d['region']
                foo.add(item_to_write)
                d['region'] = foo
        except queue.Empty:
            pass
        time.sleep(0.5)
        if not q.empty():
            continue

def main():
    manager = mp.Manager()
    stop_event = manager.Event()
    q = manager.Queue()
    d = manager.dict()
    d['region'] = set()
    pool = mp.get_context("spawn").Pool(mp.cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, d, stop_event))
    jobs = []
    for i in range(24):
        job = pool.apply_async(execute_search, (q, d))
        jobs.append(job)
    for job in jobs:
        job.get()
    q.put('kill')
    stop_event.set()  # signal the listener only after all jobs are done
    pool.close()
    pool.join()
    print('process complete')

if __name__ == '__main__':
    main()
UPDATE:
execute_search executes several processes necessary for the search, so I have included the code showing where q.put() lies.
On its own, the script takes > 72 hrs to finish. No single process completes the entire task; rather, they work individually and reference a manager.dict() to avoid repeating work. These tasks run until every tuple in the manager.dict() has been processed.
def area(self, tup, housing_dict, q):
    state, reg, sub_reg = tup[0], tup[1], tup[2]
    for cat in housing_dict:
        """
        computationally expensive, takes > 72 hours
        for a list of 512 tup(s)
        """
        result = self.search_geotag(state, reg, cat, area=sub_reg)
    q.put(tup)
The q.put(tup) is ultimately consumed by the listener function, which adds tup to the manager.dict().
Since listener and execute_search share the same queue object, there could be a race where execute_search gets 'kill' from the queue before listener does; listener would then be stuck in a blocking get() forever, since no new items arrive.
For that case you can use an Event object to signal all processes to stop:
import multiprocessing as mp
import queue

def listener(q, d, stop_event):
    while not stop_event.is_set():
        try:
            item_to_write = q.get(timeout=0.1)
            foo = d['region']
            foo.add(item_to_write)
            d['region'] = foo
        except queue.Empty:
            pass
    print("Listener process stopped")

def main():
    manager = mp.Manager()
    stop_event = manager.Event()
    q = manager.Queue()
    d = manager.dict()
    d['region'] = set()
    pool = mp.get_context("spawn").Pool(mp.cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, d, stop_event))
    jobs = []
    for i in range(24):
        job = pool.apply_async(execute_search, (q, d))
        jobs.append(job)
    try:
        for job in jobs:
            job.get(300)  # get the result, or raise a timeout exception after 300 seconds
    except mp.TimeoutError:
        pool.terminate()
    stop_event.set()  # stop the listener process
    print('process complete')

if __name__ == '__main__':
    main()
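As for the original worry about growing the queue without bounds: a manager queue accepts a maxsize, so producers block when it fills instead of letting it grow indefinitely (a one-line sketch; the limit of 10000 is an arbitrary illustration):
q = manager.Queue(maxsize=10000)  # q.put() now blocks while the queue is full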

How to return a counter dictionary from a function passed to multiprocessing?

I have a list of CSV files. I want to do a set of operations on each of them, produce a counter dict, and create a master list containing the individual counter dicts from all the CSV files. I want to parallelize processing each CSV file and then return the counter dict from each file. I found a similar solution here: How can I recover the return value of a function passed to multiprocessing.Process?
I used the solution suggested by David Cullen. That solution works perfectly for strings, but not when I try to return a counter dict or a normal dict: all the CSV files are processed up to send_end.send(result), then execution hangs there forever and finally throws a memory error. I am running this on a Linux server with more than sufficient memory for creating the list of counter dicts.
I used the following code:
import os
import multiprocessing
from collections import Counter

# get current working directory
cwd = os.getcwd()
# take a list of all files in cwd
files = os.listdir(cwd)

# defining the function that needs to be run on all csv files
def worker(f, send_end):
    infile = open(f)
    # read lines in csv file
    lines = infile.readlines()
    # split the lines by "," and store it in a list of lists
    master_lst = [line.strip().split(",") for line in lines]
    # extract the second field in each sublist
    counter_lst = [element[1] for element in master_lst]
    print "Total elements in the list: " + str(len(counter_lst))
    # create a dictionary of element counts
    a = Counter(counter_lst)
    # return the counter dict
    send_end.send(a)

def main():
    jobs = []
    pipe_list = []
    for f in files:
        if f.endswith('.csv'):
            recv_end, send_end = multiprocessing.Pipe(duplex=False)
            p = multiprocessing.Process(target=worker, args=(f, send_end))
            jobs.append(p)
            pipe_list.append(recv_end)
            p.start()
    for proc in jobs:
        proc.join()
    result_list = [x.recv() for x in pipe_list]
    print len(result_list)

if __name__ == '__main__':
    main()
The error that i get is the following:
Process Process-42:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in
_bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/amm/python/collapse_multiprocessing_return.py", line 32, in
worker
a = Counter(counter_lst)
File "/usr/lib64/python2.7/collections.py", line 444, in __init__
self.update(iterable, **kwds)
File "/usr/lib64/python2.7/collections.py", line 526, in update
self[elem] = self_get(elem, 0) + 1
MemoryError
Process Process-17:
Traceback (most recent call last):
Process Process-6:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in
_bootstrap
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in
_bootstrap
Process Process-8:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in
_bootstrap
self.run()
self.run()
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
File "/home/amm/python/collapse_multiprocessing_return.py", line 32, in
worker
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/home/amm/python/collapse_multiprocessing_return.py", line 32, in
worker
File "/home/amm/python/collapse_multiprocessing_return.py", line 32, in
worker
a = Counter(counter_lst_lst)
a = Counter(counter_lst_lst)
a = Counter(counter_lst_lst)
File "/usr/lib64/python2.7/collections.py", line 444, in __init__
File "/usr/lib64/python2.7/collections.py", line 444, in __init__
File "/usr/lib64/python2.7/collections.py", line 444, in __init__
self.update(iterable, **kwds)
File "/usr/lib64/python2.7/collections.py", line 526, in update
self[elem] = self_get(elem, 0) + 1
MemoryError
self.update(iterable, **kwds)
self.update(iterable, **kwds)
File "/usr/lib64/python2.7/collections.py", line 526, in update
File "/usr/lib64/python2.7/collections.py", line 526, in update
self[elem] = self_get(elem, 0) + 1
self[elem] = self_get(elem, 0) + 1
MemoryError
MemoryError
Process Process-10:
Traceback (most recent call last):
File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in
_bootstrap
self.run()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/amm/python/collapse_multiprocessing_return.py", line 32, in
worker
a = Counter(counter_lst)
File "/usr/lib64/python2.7/collections.py", line 444, in __init__
self.update(iterable, **kwds)
File "/usr/lib64/python2.7/collections.py", line 526, in update
self[elem] = self_get(elem, 0) + 1
MemoryError
^Z
[18]+ Stopped collapse_multiprocessing_return.py
Now instead of "a" in send_end.send(a) if i replace f, the filename. It prints the number of csv files in the directory (which is what len(result_list) does in this case). But when the counter dict "a" is returned it gets stuck forever, throwing the above error.
I would like to have the code pass the counter dict to receive end without any error/problems. Is there a work around? Could someone please suggest a possible solution?
p.s: I am new to multiprocessing module, sorry if this question sounds naive. Also, i tried the multiprocessing.Manager(), but got a similar error
Your traceback mentions Process Process-42:, so there are at least 42 processes being created. You're creating a process for every CSV file, which is not useful and is probably causing the memory error.
Your problem can be solved much more simply using multiprocessing.Pool.map. The worker function can also be shortened greatly:
def worker(f):
    with open(f) as infile:
        return Counter(line.strip().split(",")[1]
                       for line in infile)

def main():
    pool = multiprocessing.Pool()
    result_list = pool.map(worker, [f for f in files if f.endswith('.csv')])
Passing no arguments to the pool means it'll create as many processes as you have CPU cores. Using more may or may not increase performance.
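If a single combined count is wanted afterwards, the per-file counters can be merged; an illustrative addition, not part of the original answer:
from collections import Counter

total = Counter()
for c in result_list:
    total.update(c)  # sums counts key-by-key across all CSV files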
