I am trying to concurrently execute methods from two objects concurrently for a computer vision task. My idea is to use two different feature detectors to compute their respective feature descriptions inside a base class.
In this regard, I built the following toy example to understand python concurrent.futures.ProcessPoolExecutor class.
When executed, the first part of the code runs as expected with 20 Heartbeat (10 from each method executed 10 times in total) strings printed out with the sum for two objects coming out correctly as 100, -100.
But in the second half of the code, it appears the ProcessPoolExecutor is not running the do_math(self, numx) method at all. What am I doing wrong here?
With best,
Azmyin
import numpy as np
import concurrent.futures as cf
import time
def current_milli_time():
# CORE FUNCTION
# Function that returns a time tick in milliseconds
return round(time.time() * 1000)
class masterClass(object):
super_multiplier = 1 # Class variable
def __init__(self, ls):
# Attributes of masterClass
self.var1 = ls[0]
self.sumx = ls[1]
def __rep__(self):
print(f"sumx value -- {self.sumx}")
def apply_sup_mult(self, var_in):
self.sumx = self.sumx + (var_in * masterClass.super_multiplier)
time.sleep(0.025)
print(f"Hearbeat!!")
# This is a regular method
def do_math(self, numx):
self.apply_sup_mult(numx)
ls = [10,0]
ls2 = [-10,0]
numx = 10
obj1 = masterClass(ls)
obj2 = masterClass(ls2)
t1 = current_milli_time()
# Run methods one by one
for _ in range(numx):
obj1.do_math(ls[0])
obj2.do_math(ls2[0])
obj1.__rep__()
obj2.__rep__()
t2 = current_milli_time()
print(f"Time taken -- {t2 - t1} ms")
print()
## Using multiprocessing to concurrently run two methods
# Intentionally reinitialize objects
obj1 = masterClass(ls)
obj1 = masterClass(ls2)
t1 = current_milli_time()
resx = []
with cf.ProcessPoolExecutor() as executor:
for i in range(numx):
#fs = [executor.submit(obj3.do_math, ls[0]), executor.submit(obj4.do_math, ls2[0])]
f1 = executor.submit(obj1.do_math, ls[0])
f2 = executor.submit(obj2.do_math, ls2[0])
# for i,f in enumerate(cf.as_completed(fs)):
# print(f"Done with {f}")
# # State of sumx
obj1.__rep__()
obj2.__rep__()
t2 = current_milli_time()
print(f"Time taken -- {t2 - t1} ms")
Related
I have a app with pyqt5 and receive data from udp, pack them and show.
I use threading(python built-in thread) for my app(one thread for dl_data, one thread for save_log and one thread for ui) but in real time I have a lots of lag in showing final image.
Part of my code for pack data is:
self.data_log = [[0 for i in range(1000)] for j in range(1600)]
def unpack(self,pack):
tmp = bytearray()
for i in range(3):
tmp.extend(pack[i])
return np.asarray([newdiv(struct.unpack('<H',tmp[x+1:x+3])[0],256) for x in range(0,(self.NP) * 4,4)])
def merge(self,data):
data = self.unpack(data)
data = np.roll(data,200,0)
self.data_log.extend([data])
def dl_data(self):
while 1:
for i in range(self.NPBN * 3):
tmp = self.recv(self.bytes)
self.merge(tmp)
def process_feed(self):
while True:
data = self.tcpsocket.recv(1024)
if b'pressed' in data :
if not self.get_data.is_alive():
self.get_data.start()
def show(self):
.....
if self.Mode == 'run':
pressed= threading.Thread(target=self.process_feed)
pressed.start()
get_data = threading.Thread(target=self.dl_data)
newdiv function is "Making division in Python faster" link
I want to design a python block to remove the white noise of my received signal
As shown in the red circle:
So I first tested on Spyder whether my code can remove noise normally
I got the result as pictured:
Just when I finished the test, I wanted to transfer it to the python block, but it couldn't execute normally, causing a crash
Below is the test program on my python block and spyder:
i = 0
j = 0
t = 0
data=[]
buf=[]
wav = numpy.fromfile(open('C:/Users/user/Desktop/datas'), dtype=numpy.uint8)
for i in range(int(len(wav)/10)):
for j in range(10):
buf.append(wav[(i*10)+j])
if (buf[j]<=180)and(buf[j]>=90):
t = t+1
if t < 6:
data = numpy.append(data,buf)
# else:
# data = numpy.append(data,numpy.zeros(10))
t= 0
j = 0
buf.clear()
"""
Embedded Python Blocks:
Each time this file is saved, GRC will instantiate the first class it finds
to get ports and parameters of your block. The arguments to __init__ will
be the parameters. All of them are required to have default values!
"""
import numpy as np
from gnuradio import gr
i = 0
j = 0
t = 0
data=[]
buf=[]
class blk(gr.sync_block): # other base classes are basic_block, decim_block, interp_block
"""Embedded Python Block example - a simple multiply const"""
def __init__(self): # only default arguments here
"""arguments to this function show up as parameters in GRC"""
gr.sync_block.__init__(
self,
name='noise out', # will show up in GRC
in_sig=[np.float32],
out_sig=[np.float32]
)
# if an attribute with the same name as a parameter is found,
# a callback is registered (properties work, too).
def work(self, input_items, output_items):
"""example: multiply with constant"""
np.frombuffer(input_items, dtype=np.uint8)
for i in range(int((len(input_items[0]))/10)):
for j in range(10):
buf.append(input_items[0][(i*10)+j])
if (buf[j]<=180)and(buf[j]>=90):
t = t+1
if t < 6:
data = numpy.append(data,buf)
else:
data = numpy.append(data,numpy.zeros(10))
t= 0
j = 0
buf.clear()
for i in range(len(output_items[0])):
output_items[0][i]=data[i]
return len(output_items[0])
What should I do to modify it if I want it to run normally?
I am making a PyQt5 real-time application that streamline images and its segmentation result on interface. It has 3 components as a cycle:
pull out frame
feed the frame into segmentation module
print out original frame and segmentation result on interface
In particularly, (2) is a bottleneck. I tried to run (2) in main thread, and compared its speed and CPU consumption rate against running (2) in background (using QThread). I strangely found that running by QThread costs ~20 ms slower and is more CPU intensive.
Method 1: Wrapping (1), (2) in a QObject Worker and Run in Main Thread
I don't intentionally want to run all these in main thread because it causes window being locked, but I do this for speed testing. Below is my implementation of the QObject worker (FrameStore is an object retrieving key results in QObject and pass them to GUI):
class FrameObject(QObject):
frame_signal = pyqtSignal(FrameStore)
def __init__(self, parent = None):
super().__init__()
self.load_img()
self.IS_SILENT = True
self.model_config = ModelMetaConfig()
self.frame_store = FrameStore()
def load_img(self):
self.PATH = os.path.join(os.getcwd(), 'test_cases', 'test.jpg')
rgb_img = cv2.imread(self.PATH)
self.rgb_img = rgb_img[:, :, ::-1]
#pyqtSlot()
def run(self):
end_loop = time.time()
while True:
crt_t = time.time()
print('time = {:10.4f} s'.format(time.time()))
print('unexplain = {:10.4f} s'.format(crt_t - end_loop))
torch.cuda.synchronize()
start = time.time()
seg_out = self.model_config.raw_predict(self.rgb_img, self.IS_SILENT)
seg_out = self.model_config.process_predict(seg_out, self.IS_SILENT)
torch.cuda.synchronize()
end = time.time()
print('model run = {:10.4f} s'.format(end - start))
start = time.time()
# store key data at a snapsho
self.frame_store.rgb_img = self.rgb_img
self.frame_store.seg_out = seg_out
#self.frame_signal.emit(self.frame_store)
end_loop = time.time()
self.frame_signal.emit(self.frame_store)
print('emit = {:10.4f} s'.format(end_loop - start))
Below screenshot tells that CPU usage is 60% with frame rate ~120 ms:
Screenshot: ~60% CPU Usage + ~120 ms Frame Rate
Method 2: Wrapping (1), (2) in a QThread
My implementation of QThread is similar to the QObject above:
class FrameThread(QThread):
frame_signal = pyqtSignal(FrameStore)
def __init__(self, parent = None):
super().__init__()
self.load_img()
self.IS_SILENT = True
self.model_config = ModelMetaConfig()
self.frame_store = FrameStore()
def load_img(self):
self.PATH = os.path.join(os.getcwd(), 'test_cases', 'test.jpg')
rgb_img = cv2.imread(self.PATH)
self.rgb_img = rgb_img[:, :, ::-1]
def run(self):
end_loop = time.time()
while True:
crt_t = time.time()
print('time = {:10.4f} s'.format(time.time()))
print('unexplain = {:10.4f} s'.format(crt_t - end_loop))
torch.cuda.synchronize()
start = time.time()
seg_out = self.model_config.raw_predict(self.rgb_img, self.IS_SILENT)
seg_out = self.model_config.process_predict(seg_out, self.IS_SILENT)
torch.cuda.synchronize()
end = time.time()
print('model run = {:10.4f} s'.format(end - start))
start = time.time()
# store key data at a snapsho
self.frame_store.rgb_img = self.rgb_img
self.frame_store.rgb2_img = self.rgb_img
self.frame_store.rgb3_img = self.rgb_img
self.frame_store.rgb4_img = self.rgb_img
self.frame_store.seg_out = seg_out
self.frame_signal.emit(self.frame_store)
end_loop = time.time()
print('emit = {:10.4f} s'.format(end_loop - start))
But it turns out the frame rate is ~20-30 ms slower with 100% CPU usage, as shown below:
Screenshot: 100% CPU Usage + ~145 ms Frame Rate
Reproducing My Result
I have uploaded a simplified version of my application on my repo. You can reproduce my result by running $ python main.py. Note the application requires a GPU and installation of pytorch.
To switch between method 1 and method 2, just toggle comment in the following lines in main.py:
def __init__(self):
super().__init__()
# set up window
self.title = 'Simply Reproducing the Slow Speed'
self.top = 100
self.left = 100
self.width = 1280
self.height = 1280
self.init_window()
#self.init_qobject() #<- uncomment this to run QObject in main thread
#self.init_thread() #<- uncomment this to run QThread
A brief outline of key scripts:
model_utils.py: wrapping all segmentation procedures into a class
object_utils.py: a QObject worker class computing (1) and (2)
thread_utils.py: a QThread class computing (1) and (2)
main.py: main interface application
Question
The slow down and extra CPU usage is so annoying to me. Could anyone suggests me an alternative implementation that could delegate the work to background more efficiently?
I am writing a simple python script that I need to scale to many threads. For simplicity, I have replaced the actual function I need to use with a matrix matrix multiply. I am having trouble getting my code to scale with the number of processors. Any advice to help me get the correct speedup would be helpful! My code and results are as follows:
import numpy as np
import time
import math
from multiprocessing.dummy import Pool
res = 4
#we must iterate over all of these values
wavektests = np.linspace(.1,2.5,res)
omegaratios = np.linspace(.1,2.5,res)
wavekmat,omegamat = np.meshgrid(wavektests,omegaratios)
def solve_for_omegaratio( ind ):
#obtain the indices for this run
x_ind = ind % res
y_ind = math.floor(ind / res)
#obtain the value for this run
wavek = wavektests[x_ind]
omega = omegaratios[y_ind]
#do some work ( I have replaced the real function with this)
randmat = np.random.rand(4000,4000)
nop = np.linalg.matrix_power(randmat,3)
#obtain a scalar value
value = x_ind + y_ind**2.0
return value
list_ind = range(res**2)
#Serial code execution
t0_proc = time.clock()
t0_wall = time.time()
threads = 0
dispersion = map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
print('serial execution')
print('wall clock time = ',t1_wall-t0_wall)
print('processor clock time = ',t1_proc-t0_proc)
print('------------------------------------------------')
#Using pool defaults
t0_proc = time.clock()
t0_wall = time.time()
if __name__ == '__main__':
pool = Pool()
dispersion = pool.map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
pool.close
print('num of threads = default')
print('wall clock time = ',t1_wall-t0_wall)
print('processor clock time = ',t1_proc-t0_proc)
print('------------------------------------------------')
# Using 4 threads
t0_proc = time.clock()
t0_wall = time.time()
threads = 4
if __name__ == '__main__':
pool = Pool(threads)
dispersion = pool.map( solve_for_omegaratio , list_ind)
displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
pool.close
print('num of threads = ' + str(threads))
print('wall clock time = ',t1_wall-t0_wall)
print('processor clock time = ',t1_proc-t0_proc)
print('------------------------------------------------')
Results:
serial execution
wall clock time = 66.1561758518219
processor clock time = 129.16376499999998
------------------------------------------------
num of threads = default
wall clock time = 81.86436200141907
processor clock time = 263.45369
------------------------------------------------
num of threads = 4
wall clock time = 77.63390111923218
processor clock time = 260.66285300000004
------------------------------------------------
Because python has a GIL https://wiki.python.org/moin/GlobalInterpreterLock , "python-native" threads can't run execute truly concurrently and thus can't improve the performance of CPU-bound tasks like math. They can be used to parallelize IO bound tasks effectively (eg API calls which spend almost all their time waiting for network I/O). Forking separate processes with multiprocessing rather than dummy's thread-backed implementation will create multiple processes, not threads, which will be able to run concurrently ( at cost of significant memory overhead).
Hi I am working on a script that will solve and plot an ODE using the Runge Kutta method. I want to have the script use different functions so I can expand upon it later. If I write it with out the function definition it works fine, but with the def() it will not open a plot window or print the results. Than you!
from numpy import *
import matplotlib.pyplot as plt
#H=p^2/2-cosq
#p=dp=-dH/dq
#q=dq=dH/dp
t = 0
h = 0.5
pfa = [] #Create arrays that will hold pf,qf values
qfa = []
while t < 10:
q = 1*t
p = -sin(q*t)
p1 = p
q1 = q
p2 = p + h/2*q1
q2 = q + h/2*p1
p3 = p+ h/2*q2
q3 = q+ h/2*p2
p4 = p+ h/2*q3
q4 = q+ h/2*p4
pf = (p +(h/6.0)*(p1+2*p2+3*p3+p4))
qf = (q +(h/6.0)*(q1+2*q2+3*q3+q4))
pfa.append(pf) #append arrays
qfa.append(qf)
t += h #increase time step
print("test")
plt.plot(pfa,qfa)
print("test1")
plt.show()
print("tes2t")
If you have a function declared, you'll need to call it at some point, e.g.:
def rk(p,q,h):
pass # your code here
if __name__ == '__main__':
rk(1,2,1)
Putting the function call within the if __name__ == '__main__' block will ensure that the function is called only when you run the script directly, not when you import it from another script. ( More on that here in case you're interested: What does if __name__ == "__main__": do? )
And here's an even better option; to avoid hard-coding fn args (your real code should have some error-handling for unexpected command-line input):
def rk(p,q,h):
pass # your code here
if __name__ == '__main__':
import argparse
the_parser = argparse.ArgumentParser()
the_parser.add_argument('integers', type=int, nargs=3)
args = the_parser.parse_args()
p,q,h = args.integers
rk(p,q,h)