I am writing a simple Python script that I need to scale to many threads. For simplicity, I have replaced the actual function I need with a matrix-matrix multiply. I am having trouble getting my code to scale with the number of processors. Any advice on getting the expected speedup would be helpful! My code and results are as follows:
import numpy as np
import time
import math
from multiprocessing.dummy import Pool
res = 4
#we must iterate over all of these values
wavektests = np.linspace(.1,2.5,res)
omegaratios = np.linspace(.1,2.5,res)
wavekmat,omegamat = np.meshgrid(wavektests,omegaratios)
def solve_for_omegaratio( ind ):
    #obtain the indices for this run
    x_ind = ind % res
    y_ind = math.floor(ind / res)
    #obtain the value for this run
    wavek = wavektests[x_ind]
    omega = omegaratios[y_ind]
    #do some work ( I have replaced the real function with this)
    randmat = np.random.rand(4000,4000)
    nop = np.linalg.matrix_power(randmat,3)
    #obtain a scalar value
    value = x_ind + y_ind**2.0
    return value
list_ind = range(res**2)
#Serial code execution
t0_proc = time.process_time()  # time.clock() was removed in Python 3.8
t0_wall = time.time()
threads = 0
dispersion = map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.process_time()
t1_wall = time.time()
print('serial execution')
print('wall clock time = ',t1_wall-t0_wall)
print('processor clock time = ',t1_proc-t0_proc)
print('------------------------------------------------')
#Using pool defaults
t0_proc = time.process_time()
t0_wall = time.time()
if __name__ == '__main__':
    pool = Pool()
    dispersion = pool.map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.process_time()
t1_wall = time.time()
pool.close()  # the parentheses are required; a bare pool.close does nothing
print('num of threads = default')
print('wall clock time = ',t1_wall-t0_wall)
print('processor clock time = ',t1_proc-t0_proc)
print('------------------------------------------------')
# Using 4 threads
t0_proc = time.process_time()
t0_wall = time.time()
threads = 4
if __name__ == '__main__':
    pool = Pool(threads)
    dispersion = pool.map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.process_time()
t1_wall = time.time()
pool.close()
print('num of threads = ' + str(threads))
print('wall clock time = ',t1_wall-t0_wall)
print('processor clock time = ',t1_proc-t0_proc)
print('------------------------------------------------')
Results:
serial execution
wall clock time = 66.1561758518219
processor clock time = 129.16376499999998
------------------------------------------------
num of threads = default
wall clock time = 81.86436200141907
processor clock time = 263.45369
------------------------------------------------
num of threads = 4
wall clock time = 77.63390111923218
processor clock time = 260.66285300000004
------------------------------------------------
Because Python has a GIL (https://wiki.python.org/moin/GlobalInterpreterLock), "Python-native" threads can't execute Python code truly concurrently and thus can't improve the performance of CPU-bound tasks like math. They can parallelize I/O-bound tasks effectively (e.g. API calls, which spend almost all their time waiting on the network). Using multiprocessing.Pool instead of the thread-backed multiprocessing.dummy.Pool creates multiple processes, not threads, which can run concurrently (at the cost of significant memory overhead).
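As a minimal sketch of the swap described above, the following uses the process-backed multiprocessing.Pool for a pure-Python CPU-bound function. The names cpu_bound and run_in_processes, the input sizes, and the worker count are illustrative assumptions, not part of the question's code.

```python
import time
from multiprocessing import Pool


def cpu_bound(n):
    """Pure-Python CPU-bound work: sum of squares of 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_processes(inputs, workers=2):
    """Map cpu_bound over inputs using separate worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(cpu_bound, inputs)


if __name__ == "__main__":
    inputs = [200_000] * 4  # hypothetical workload
    t0 = time.perf_counter()
    results = run_in_processes(inputs)
    print("wall clock time =", time.perf_counter() - t0)
    print("results match serial run:", results == [cpu_bound(n) for n in inputs])
```

One caveat worth checking before switching: in the serial results above, processor time is already about twice the wall time, which suggests the NumPy call may be running on more than one core on its own (for example through a multithreaded BLAS build). If so, adding processes on top of it may give less speedup than expected.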
Related
I am trying to execute methods from two objects concurrently for a computer vision task. My idea is to use two different feature detectors to compute their respective feature descriptions inside a base class.
To this end, I built the following toy example to understand Python's concurrent.futures.ProcessPoolExecutor class.
When executed, the first part of the code runs as expected: 20 Heartbeat strings are printed (each of the two methods runs 10 times), and the sums for the two objects come out correctly as 100 and -100.
But in the second half of the code, it appears that the ProcessPoolExecutor is not running the do_math(self, numx) method at all. What am I doing wrong here?
With best,
Azmyin
import numpy as np
import concurrent.futures as cf
import time
def current_milli_time():
    # CORE FUNCTION
    # Function that returns a time tick in milliseconds
    return round(time.time() * 1000)
class masterClass(object):
    super_multiplier = 1 # Class variable
    def __init__(self, ls):
        # Attributes of masterClass
        self.var1 = ls[0]
        self.sumx = ls[1]
    def __rep__(self):
        print(f"sumx value -- {self.sumx}")
    def apply_sup_mult(self, var_in):
        self.sumx = self.sumx + (var_in * masterClass.super_multiplier)
        time.sleep(0.025)
        print(f"Heartbeat!!")
    # This is a regular method
    def do_math(self, numx):
        self.apply_sup_mult(numx)
ls = [10,0]
ls2 = [-10,0]
numx = 10
obj1 = masterClass(ls)
obj2 = masterClass(ls2)
t1 = current_milli_time()
# Run methods one by one
for _ in range(numx):
    obj1.do_math(ls[0])
    obj2.do_math(ls2[0])
obj1.__rep__()
obj2.__rep__()
t2 = current_milli_time()
print(f"Time taken -- {t2 - t1} ms")
print()
## Using multiprocessing to concurrently run two methods
# Intentionally reinitialize objects
obj1 = masterClass(ls)
obj2 = masterClass(ls2)
t1 = current_milli_time()
resx = []
with cf.ProcessPoolExecutor() as executor:
    for i in range(numx):
        #fs = [executor.submit(obj3.do_math, ls[0]), executor.submit(obj4.do_math, ls2[0])]
        f1 = executor.submit(obj1.do_math, ls[0])
        f2 = executor.submit(obj2.do_math, ls2[0])
        # for i,f in enumerate(cf.as_completed(fs)):
        #     print(f"Done with {f}")
# State of sumx
obj1.__rep__()
obj2.__rep__()
t2 = current_milli_time()
print(f"Time taken -- {t2 - t1} ms")
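A likely explanation, offered as an assumption since no answer is included here: executor.submit(obj1.do_math, ...) pickles the bound method together with its object and runs it on a copy in a worker process, so changes to self.sumx never reach the objects in the parent process. The futures are also never inspected, so any exception raised in a worker goes unnoticed. A minimal sketch of one way around this is to have the worker return its result and collect it through the future. Accumulator and run_steps are hypothetical stand-ins for masterClass and do_math.

```python
import concurrent.futures as cf


class Accumulator:
    """Hypothetical stand-in for masterClass: adds a fixed step to a total."""

    def __init__(self, step):
        self.step = step
        self.total = 0

    def add_step(self):
        self.total += self.step
        return self.total


def run_steps(step, count):
    """Build the object inside the worker and return its final total."""
    acc = Accumulator(step)
    for _ in range(count):
        acc.add_step()
    return acc.total


if __name__ == "__main__":
    with cf.ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(run_steps, step, 10) for step in (10, -10)]
        # result() re-raises any exception from the worker, so failures are visible
        totals = [f.result() for f in futures]
    print(totals)  # [100, -100]
```

The design choice here is to treat each worker call as a function from inputs to a returned value, since state mutated inside a worker process is not shared with the parent.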
I need help completing the code for the Round Robin algorithm for CPU scheduling.
Each process takes an equal share of CPU time, equal to a time quantum of 2 units.
After being processed for one time quantum (2 units), if the process still requires more computation,
it is passed to a waiting queue.
The code should do the following:
Report the time at which each process completes
Report the wait time of each process in the queue
The #CODE marker indicates where the code is missing.
from collections import deque
time_quantum = 2
class Process:
    def __init__(self, name, arrival_time, required_time):
        self.name = name
        self.arrival_time = arrival_time
        self.required_time = required_time
        self.time_processed = 0
    def __repr__(self):
        return self.name
p0 = Process('P1', 0, 4)
p1 = Process('P2', 1, 3)
p2 = Process('P3', 2, 2)
p3 = Process('P4', 3, 1)
processes = [p0, p1, p2, p3]
end_times = {process.name:0 for process in processes}
wait_times = {process.name:0 for process in processes}
queue = deque()
running_proc = None # Tracks running process in the CPU
running_proc_time = 0 # Tracks the time running process spent in the CPU
for t in range(11):
    #CODE
print(end_times) # End times for each process
print(wait_times) # Wait times for each process in the queue
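The scheduling rules described above can be sketched as a standalone simulation. This is one possible reading, not the intended solution for the #CODE gap: it uses plain (name, arrival_time, required_time) tuples instead of the Process class, it assumes that processes arriving during a time slice are queued ahead of the process that was just preempted, and it defines wait time as completion time minus arrival time minus required time. The function name round_robin is hypothetical.

```python
from collections import deque


def round_robin(processes, quantum):
    """Simulate round robin scheduling.

    processes: list of (name, arrival_time, required_time), sorted by arrival.
    Returns (end_times, wait_times) as dicts keyed by process name.
    """
    arrival = {name: arr for name, arr, _ in processes}
    required = {name: req for name, _, req in processes}
    remaining = dict(required)
    end_times = {}
    pending = deque(processes)  # processes that have not arrived yet
    ready = deque()             # names of processes waiting for the CPU
    t = 0
    while pending or ready:
        # Admit everything that has arrived by time t
        while pending and pending[0][1] <= t:
            ready.append(pending.popleft()[0])
        if not ready:
            t = pending[0][1]  # CPU idles until the next arrival
            continue
        name = ready.popleft()
        run = min(quantum, remaining[name])
        t += run
        remaining[name] -= run
        # Assumption: arrivals during this slice go ahead of the preempted process
        while pending and pending[0][1] <= t:
            ready.append(pending.popleft()[0])
        if remaining[name] > 0:
            ready.append(name)
        else:
            end_times[name] = t
    wait_times = {n: end_times[n] - arrival[n] - required[n] for n in end_times}
    return end_times, wait_times


if __name__ == "__main__":
    procs = [("P1", 0, 4), ("P2", 1, 3), ("P3", 2, 2), ("P4", 3, 1)]
    ends, waits = round_robin(procs, quantum=2)
    print(ends)
    print(waits)
```

With a different tie-breaking rule (for example, requeueing the preempted process before new arrivals), the completion order and wait times would change, so the convention should be checked against the assignment.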
I am making a PyQt5 real-time application that streams images and their segmentation results to the interface. It has 3 components that run as a cycle:
(1) pull out a frame
(2) feed the frame into the segmentation module
(3) print out the original frame and the segmentation result on the interface
In particular, (2) is a bottleneck. I tried running (2) in the main thread and compared its speed and CPU consumption against running (2) in the background (using QThread). Strangely, I found that running it in a QThread is ~20 ms slower and more CPU intensive.
Method 1: Wrapping (1), (2) in a QObject Worker and Running in the Main Thread
I don't actually want to run all of this in the main thread because it locks the window, but I do it here for speed testing. Below is my implementation of the QObject worker (FrameStore is an object that holds the key results from the QObject and passes them to the GUI):
class FrameObject(QObject):
    frame_signal = pyqtSignal(FrameStore)
    def __init__(self, parent = None):
        super().__init__()
        self.load_img()
        self.IS_SILENT = True
        self.model_config = ModelMetaConfig()
        self.frame_store = FrameStore()
    def load_img(self):
        self.PATH = os.path.join(os.getcwd(), 'test_cases', 'test.jpg')
        rgb_img = cv2.imread(self.PATH)
        self.rgb_img = rgb_img[:, :, ::-1]
    @pyqtSlot()
    def run(self):
        end_loop = time.time()
        while True:
            crt_t = time.time()
            print('time = {:10.4f} s'.format(time.time()))
            print('unexplain = {:10.4f} s'.format(crt_t - end_loop))
            torch.cuda.synchronize()
            start = time.time()
            seg_out = self.model_config.raw_predict(self.rgb_img, self.IS_SILENT)
            seg_out = self.model_config.process_predict(seg_out, self.IS_SILENT)
            torch.cuda.synchronize()
            end = time.time()
            print('model run = {:10.4f} s'.format(end - start))
            start = time.time()
            # store key data at a snapshot
            self.frame_store.rgb_img = self.rgb_img
            self.frame_store.seg_out = seg_out
            #self.frame_signal.emit(self.frame_store)
            end_loop = time.time()
            self.frame_signal.emit(self.frame_store)
            print('emit = {:10.4f} s'.format(end_loop - start))
The screenshot below shows that CPU usage is ~60% at ~120 ms per frame:
Screenshot: ~60% CPU Usage + ~120 ms Frame Rate
Method 2: Wrapping (1), (2) in a QThread
My implementation of QThread is similar to the QObject above:
class FrameThread(QThread):
    frame_signal = pyqtSignal(FrameStore)
    def __init__(self, parent = None):
        super().__init__()
        self.load_img()
        self.IS_SILENT = True
        self.model_config = ModelMetaConfig()
        self.frame_store = FrameStore()
    def load_img(self):
        self.PATH = os.path.join(os.getcwd(), 'test_cases', 'test.jpg')
        rgb_img = cv2.imread(self.PATH)
        self.rgb_img = rgb_img[:, :, ::-1]
    def run(self):
        end_loop = time.time()
        while True:
            crt_t = time.time()
            print('time = {:10.4f} s'.format(time.time()))
            print('unexplain = {:10.4f} s'.format(crt_t - end_loop))
            torch.cuda.synchronize()
            start = time.time()
            seg_out = self.model_config.raw_predict(self.rgb_img, self.IS_SILENT)
            seg_out = self.model_config.process_predict(seg_out, self.IS_SILENT)
            torch.cuda.synchronize()
            end = time.time()
            print('model run = {:10.4f} s'.format(end - start))
            start = time.time()
            # store key data at a snapshot
            self.frame_store.rgb_img = self.rgb_img
            self.frame_store.rgb2_img = self.rgb_img
            self.frame_store.rgb3_img = self.rgb_img
            self.frame_store.rgb4_img = self.rgb_img
            self.frame_store.seg_out = seg_out
            self.frame_signal.emit(self.frame_store)
            end_loop = time.time()
            print('emit = {:10.4f} s'.format(end_loop - start))
But it turns out that each frame is ~20-30 ms slower, with 100% CPU usage, as shown below:
Screenshot: 100% CPU Usage + ~145 ms Frame Rate
Reproducing My Result
I have uploaded a simplified version of my application to my repo. You can reproduce my result by running $ python main.py. Note that the application requires a GPU and an installation of PyTorch.
To switch between method 1 and method 2, just toggle the comments on the following lines in main.py:
def __init__(self):
    super().__init__()
    # set up window
    self.title = 'Simply Reproducing the Slow Speed'
    self.top = 100
    self.left = 100
    self.width = 1280
    self.height = 1280
    self.init_window()
    #self.init_qobject() #<- uncomment this to run QObject in main thread
    #self.init_thread() #<- uncomment this to run QThread
A brief outline of key scripts:
model_utils.py: wrapping all segmentation procedures into a class
object_utils.py: a QObject worker class computing (1) and (2)
thread_utils.py: a QThread class computing (1) and (2)
main.py: main interface application
Question
The slowdown and extra CPU usage are quite annoying. Could anyone suggest an alternative implementation that delegates the work to the background more efficiently?
I was given some very good hints in this forum about how to code a clock object in Python 2. I've got some code working now. It's a clock that 'ticks' at 60 FPS:
import sys
import time
class Clock(object):
    def __init__(self):
        self.init_os()
        self.fps = 60.0
        self._tick = 1.0 / self.fps
        print "TICK", self._tick
        self.check_min_sleep()
        self.t = self.timestamp()
    def init_os(self):
        if sys.platform == "win32":
            self.timestamp = time.clock
            self.wait = time.sleep
    def timeit(self, f, args):
        t1 = self.timestamp()
        f(*args)
        t2 = self.timestamp()
        return t2 - t1
    def check_min_sleep(self):
        """checks the min sleep time on the system"""
        runs = 1000
        times = [self.timeit(self.wait, (0.001, )) for n in xrange(runs)]
        average = sum(times) / runs
        print "average min sleep time:", round(average, 6)
        sort = sorted(times)
        print "fastest, slowest", sort[0], sort[-1]
    def tick(self):
        next_tick = self.t + self._tick
        t = self.timestamp()
        while t < next_tick:
            t = self.timestamp()
        self.t = t
if __name__ == "__main__":
    clock = Clock()
The clock does not do too badly, but in order to avoid a busy loop I'd like Windows to sleep for less than the usual ~15 milliseconds. On my system (64-bit Windows 10), the minimum sleep averages about 15-16 ms when starting the clock, even if Python is the only application running. That's far too long for a minimum sleep to avoid a busy loop.
Does anybody know how I can get Windows to sleep less than that value?
You can temporarily lower the timer period to the wPeriodMin value returned by timeGetDevCaps. The following defines a timer_resolution context manager that calls the timeBeginPeriod and timeEndPeriod functions.
import timeit
import contextlib
import ctypes
from ctypes import wintypes
winmm = ctypes.WinDLL('winmm')
class TIMECAPS(ctypes.Structure):
    _fields_ = (('wPeriodMin', wintypes.UINT),
                ('wPeriodMax', wintypes.UINT))
def _check_time_err(err, func, args):
    if err:
        raise WindowsError('%s error %d' % (func.__name__, err))
    return args
winmm.timeGetDevCaps.errcheck = _check_time_err
winmm.timeBeginPeriod.errcheck = _check_time_err
winmm.timeEndPeriod.errcheck = _check_time_err
@contextlib.contextmanager
def timer_resolution(msecs=0):
    caps = TIMECAPS()
    winmm.timeGetDevCaps(ctypes.byref(caps), ctypes.sizeof(caps))
    # clamp the requested period to the range the system supports
    msecs = min(max(msecs, caps.wPeriodMin), caps.wPeriodMax)
    winmm.timeBeginPeriod(msecs)
    try:
        yield
    finally:
        # restore the previous resolution even if the with block raises
        winmm.timeEndPeriod(msecs)
def min_sleep():
    setup = 'import time'
    stmt = 'time.sleep(0.001)'
    return timeit.timeit(stmt, setup, number=1000)
Example
>>> min_sleep()
15.6137827
>>> with timer_resolution(msecs=1): min_sleep()
...
1.2827173000000016
The original timer resolution is restored after the with block:
>>> min_sleep()
15.6229814
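Even with a finer timer resolution, a sleep can overshoot slightly, so a tick loop may want to sleep for most of the wait and spin only for the last stretch. The following is a minimal, platform-independent sketch of that idea in Python 3. The function wait_until and its spin_margin parameter are hypothetical names, and the 2 ms default is an assumption that should be tuned to the sleep granularity actually measured on the target system.

```python
import time


def wait_until(deadline, spin_margin=0.002,
               clock=time.perf_counter, sleep=time.sleep):
    """Block until clock() reaches deadline.

    Sleeps while more than spin_margin seconds remain, then busy-waits
    for the final stretch to hit the deadline more precisely.
    """
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return
        if remaining > spin_margin:
            sleep(remaining - spin_margin)
        # otherwise fall through and re-check the clock (short busy-wait)


if __name__ == "__main__":
    tick = 1.0 / 60.0
    next_tick = time.perf_counter() + tick
    for _ in range(5):
        wait_until(next_tick)
        print("late by %.6f s" % (time.perf_counter() - next_tick))
        next_tick += tick
```

On Windows this only helps if the sleep granularity is below the tick length, which is what the timer_resolution context manager above is meant to provide.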
I use minimize from the Scipy module on Python 3.4, specifically:
resultats=minimize(margin_rate, iniprices, method='SLSQP',
jac=margin_rate_deriv, bounds=pricebounds, options={'disp': True,
'maxiter':2000}, callback=iter_report_margin_rate)
The maximum number of iterations can be set (as above), but is there a way to tell minimize to stop searching for a solution after a given set time? I looked at the general options of minimize as well as the specific options of the SLSQP solver, but could not work it out.
Thanks
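You can use the callback argument to raise a warning or exception if the execution time exceeds some threshold. Note that a warning by itself only reports the overrun and lets the optimizer continue, unless warnings are configured to be raised as errors; raising an exception from the callback is what actually aborts the run: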
import numpy as np
from scipy.optimize import minimize, rosen
import time
import warnings
class TookTooLong(Warning):
    pass
class MinimizeStopper(object):
    def __init__(self, max_sec=60):
        self.max_sec = max_sec
        self.start = time.time()
    def __call__(self, xk=None):
        elapsed = time.time() - self.start
        if elapsed > self.max_sec:
            warnings.warn("Terminating optimization: time limit reached",
                          TookTooLong)
        else:
            # you might want to report other stuff here
            print("Elapsed: %.3f sec" % elapsed)
# example usage
x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
res = minimize(rosen, x0, method='Nelder-Mead', callback=MinimizeStopper(1E-3))
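As a variation on the callback approach, here is a minimal sketch that raises an exception from the callback and keeps the last iterate, so the caller gets a usable point when the time limit is hit. To stay self-contained it does not call SciPy: toy_descent is a hypothetical stand-in for an optimizer that invokes callback(xk) once per iteration, and TimeLimitReached and TimedStopper are illustrative names.

```python
import time


class TimeLimitReached(Exception):
    """Raised by the callback to abort the optimizer (hypothetical name)."""


class TimedStopper:
    """Callback that remembers the last iterate and aborts after max_sec."""

    def __init__(self, max_sec, clock=time.monotonic):
        self.max_sec = max_sec
        self.clock = clock
        self.start = clock()
        self.last_x = None

    def __call__(self, xk=None):
        self.last_x = xk
        if self.clock() - self.start > self.max_sec:
            raise TimeLimitReached("time limit of %s s reached" % self.max_sec)


def toy_descent(x0, callback, lr=0.1):
    """Stand-in optimizer: gradient descent on x**2 that never stops by itself."""
    x = x0
    while True:
        x -= lr * 2 * x  # gradient of x**2 is 2*x
        callback(x)


if __name__ == "__main__":
    stopper = TimedStopper(max_sec=0.05)
    try:
        toy_descent(5.0, stopper)
    except TimeLimitReached:
        # fall back to the most recent iterate seen by the callback
        print("stopped early, last x =", stopper.last_x)
```

The same pattern should carry over to minimize by passing the stopper as callback= and wrapping the call in try/except, since an exception raised inside the callback propagates out of the optimizer call.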
No. What you can do is start the optimizer in a separate process, keep track of how long it has been running and terminate it if necessary:
from __future__ import print_function  # must be the first statement in the file
from multiprocessing import Process, Queue
import time
import random
def f(param, queue):
    #do the minimization and add result to queue
    #res = minimize(param)
    #queue.put(res)
    #to make this a working example I'll just sleep
    #a random amount of time
    sleep_amount = random.randint(1, 10)
    time.sleep(sleep_amount)
    res = param*sleep_amount
    queue.put(res)
if __name__ == '__main__':  # needed where child processes are spawned (e.g. Windows)
    q = Queue()
    p = Process(target=f, args=(2.2, q))
    max_time = 3
    t0 = time.time()
    p.start()
    while time.time() - t0 < max_time:
        p.join(timeout=1)
        if not p.is_alive():
            break
    if p.is_alive():
        #process didn't finish in time so we terminate it
        p.terminate()
        result = None
    else:
        result = q.get()
    print(result)