PyQt5 - Worker Runs Slower with More CPU Usage in QThread

I am building a PyQt5 real-time application that streams images and their segmentation results to the interface. Each cycle has three components:
(1) pull a frame
(2) feed the frame into the segmentation module
(3) draw the original frame and the segmentation result on the interface
In particular, (2) is the bottleneck. I tried running (2) in the main thread and compared its speed and CPU consumption against running (2) in the background (using QThread). Strangely, I found that the QThread version is ~20 ms slower per frame and more CPU intensive.
Method 1: Wrapping (1), (2) in a QObject Worker and Running It in the Main Thread
I don't actually want to run all of this in the main thread, since it locks up the window, but I do so here for speed testing. Below is my implementation of the QObject worker (FrameStore is an object that collects key results in the QObject and passes them to the GUI):
class FrameObject(QObject):
    frame_signal = pyqtSignal(FrameStore)

    def __init__(self, parent=None):
        super().__init__()
        self.load_img()
        self.IS_SILENT = True
        self.model_config = ModelMetaConfig()
        self.frame_store = FrameStore()

    def load_img(self):
        self.PATH = os.path.join(os.getcwd(), 'test_cases', 'test.jpg')
        rgb_img = cv2.imread(self.PATH)
        self.rgb_img = rgb_img[:, :, ::-1]

    @pyqtSlot()
    def run(self):
        end_loop = time.time()
        while True:
            crt_t = time.time()
            print('time = {:10.4f} s'.format(time.time()))
            print('unexplain = {:10.4f} s'.format(crt_t - end_loop))
            torch.cuda.synchronize()
            start = time.time()
            seg_out = self.model_config.raw_predict(self.rgb_img, self.IS_SILENT)
            seg_out = self.model_config.process_predict(seg_out, self.IS_SILENT)
            torch.cuda.synchronize()
            end = time.time()
            print('model run = {:10.4f} s'.format(end - start))
            start = time.time()
            # store key data at a snapshot
            self.frame_store.rgb_img = self.rgb_img
            self.frame_store.seg_out = seg_out
            #self.frame_signal.emit(self.frame_store)
            end_loop = time.time()
            self.frame_signal.emit(self.frame_store)
            print('emit = {:10.4f} s'.format(end_loop - start))
The screenshot below shows ~60% CPU usage at ~120 ms per frame:
Screenshot: ~60% CPU Usage + ~120 ms Frame Rate
Method 2: Wrapping (1), (2) in a QThread
My QThread implementation is similar to the QObject worker above:
class FrameThread(QThread):
    frame_signal = pyqtSignal(FrameStore)

    def __init__(self, parent=None):
        super().__init__()
        self.load_img()
        self.IS_SILENT = True
        self.model_config = ModelMetaConfig()
        self.frame_store = FrameStore()

    def load_img(self):
        self.PATH = os.path.join(os.getcwd(), 'test_cases', 'test.jpg')
        rgb_img = cv2.imread(self.PATH)
        self.rgb_img = rgb_img[:, :, ::-1]

    def run(self):
        end_loop = time.time()
        while True:
            crt_t = time.time()
            print('time = {:10.4f} s'.format(time.time()))
            print('unexplain = {:10.4f} s'.format(crt_t - end_loop))
            torch.cuda.synchronize()
            start = time.time()
            seg_out = self.model_config.raw_predict(self.rgb_img, self.IS_SILENT)
            seg_out = self.model_config.process_predict(seg_out, self.IS_SILENT)
            torch.cuda.synchronize()
            end = time.time()
            print('model run = {:10.4f} s'.format(end - start))
            start = time.time()
            # store key data at a snapshot
            self.frame_store.rgb_img = self.rgb_img
            self.frame_store.rgb2_img = self.rgb_img
            self.frame_store.rgb3_img = self.rgb_img
            self.frame_store.rgb4_img = self.rgb_img
            self.frame_store.seg_out = seg_out
            self.frame_signal.emit(self.frame_store)
            end_loop = time.time()
            print('emit = {:10.4f} s'.format(end_loop - start))
But it turns out each frame takes ~20-30 ms longer, with 100% CPU usage, as shown below:
Screenshot: 100% CPU Usage + ~145 ms Frame Rate
Reproducing My Result
I have uploaded a simplified version of my application to my repo. You can reproduce my result by running $ python main.py. Note that the application requires a GPU and an installation of PyTorch.
To switch between method 1 and method 2, just toggle comment in the following lines in main.py:
def __init__(self):
    super().__init__()
    # set up window
    self.title = 'Simply Reproducing the Slow Speed'
    self.top = 100
    self.left = 100
    self.width = 1280
    self.height = 1280
    self.init_window()
    #self.init_qobject()  # <- uncomment this to run the QObject in the main thread
    #self.init_thread()   # <- uncomment this to run the QThread
A brief outline of key scripts:
model_utils.py: wraps all segmentation procedures in a class
object_utils.py: a QObject worker class computing (1) and (2)
thread_utils.py: a QThread class computing (1) and (2)
main.py: main interface application
Question
The slowdown and extra CPU usage are frustrating. Could anyone suggest an alternative implementation that delegates the work to the background more efficiently?
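For what it's worth, here is a minimal sketch (names like Worker and step are illustrative, not from the repo above) of the pattern usually recommended for this: keep the worker a plain QObject, move it to a QThread with moveToThread, and schedule one iteration at a time through queued signals instead of a blocking while True loop, so both event loops stay responsive:

import sys
import time

from PyQt5.QtCore import QObject, QThread, pyqtSignal, pyqtSlot
from PyQt5.QtWidgets import QApplication, QLabel

class Worker(QObject):
    result_ready = pyqtSignal(float)  # would carry a FrameStore in the real app

    @pyqtSlot()
    def step(self):
        time.sleep(0.1)  # stand-in for raw_predict/process_predict
        self.result_ready.emit(time.time())

class Window(QLabel):
    request_step = pyqtSignal()

    def __init__(self):
        super().__init__('waiting...')
        self.worker_thread = QThread(self)
        self.worker = Worker()
        self.worker.moveToThread(self.worker_thread)
        self.request_step.connect(self.worker.step)       # queued: runs in the worker thread
        self.worker.result_ready.connect(self.on_result)  # queued: runs in the GUI thread
        self.worker_thread.start()
        self.request_step.emit()

    def on_result(self, t):
        self.setText('frame at {:.3f}'.format(t))
        self.request_step.emit()  # schedule the next iteration instead of looping

if __name__ == '__main__':
    app = QApplication(sys.argv)
    w = Window()
    w.show()
    sys.exit(app.exec_())

Whether this removes the ~20 ms gap is hardware dependent; the GIL still serializes the Python parts, but it avoids a tight loop that keeps emitting while the GUI thread is busy painting.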

Related

How to optimize my code for better performance and no lag in realtime

I have an app with PyQt5 that receives data over UDP, packs it, and displays it.
I use threading (Python's built-in threading module) in my app (one thread for dl_data, one thread for save_log and one thread for the UI), but in real time I get a lot of lag when showing the final image.
Part of my code for packing the data is:
self.data_log = [[0 for i in range(1000)] for j in range(1600)]

def unpack(self, pack):
    tmp = bytearray()
    for i in range(3):
        tmp.extend(pack[i])
    return np.asarray([newdiv(struct.unpack('<H', tmp[x+1:x+3])[0], 256) for x in range(0, (self.NP) * 4, 4)])

def merge(self, data):
    data = self.unpack(data)
    data = np.roll(data, 200, 0)
    self.data_log.extend([data])

def dl_data(self):
    while 1:
        for i in range(self.NPBN * 3):
            tmp = self.recv(self.bytes)
            self.merge(tmp)

def process_feed(self):
    while True:
        data = self.tcpsocket.recv(1024)
        if b'pressed' in data:
            if not self.get_data.is_alive():
                self.get_data.start()

def show(self):
    .....
    if self.Mode == 'run':
        pressed = threading.Thread(target=self.process_feed)
        pressed.start()
        get_data = threading.Thread(target=self.dl_data)
The newdiv function is from "Making division in Python faster" (link).
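As an aside (my sketch, not from the question): one likely contributor to the lag is that unpack calls struct.unpack once per sample inside a list comprehension. Assuming the byte layout really is a little-endian uint16 at offset x+1 of every 4-byte group, and that newdiv amounts to a plain division by 256, the same conversion can be vectorized with NumPy:

import numpy as np

def unpack_vectorized(tmp, sample_count):
    # tmp: bytearray holding sample_count 4-byte groups (as built in unpack above)
    raw = np.frombuffer(bytes(tmp), dtype=np.uint8)[:sample_count * 4]
    groups = raw.reshape(-1, 4)
    # little-endian uint16 from bytes 1 and 2 of each 4-byte group
    vals = groups[:, 1].astype(np.uint16) | (groups[:, 2].astype(np.uint16) << 8)
    return vals / 256.0  # stands in for newdiv(value, 256) on every element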

Python concurrent.futures.ProcessPoolExecutor() not executing methods inside objects

I am trying to execute methods from two objects concurrently for a computer vision task. My idea is to use two different feature detectors to compute their respective feature descriptions inside a base class.
To that end, I built the following toy example to understand the Python concurrent.futures.ProcessPoolExecutor class.
When executed, the first part of the code runs as expected: 20 Heartbeat strings are printed (10 from each object, executed 10 times in total) and the sums for the two objects come out correctly as 100 and -100.
But in the second half of the code, it appears the ProcessPoolExecutor is not running the do_math(self, numx) method at all. What am I doing wrong here?
With best,
Azmyin
import numpy as np
import concurrent.futures as cf
import time

def current_milli_time():
    # CORE FUNCTION
    # Function that returns a time tick in milliseconds
    return round(time.time() * 1000)

class masterClass(object):
    super_multiplier = 1  # Class variable

    def __init__(self, ls):
        # Attributes of masterClass
        self.var1 = ls[0]
        self.sumx = ls[1]

    def __rep__(self):
        print(f"sumx value -- {self.sumx}")

    def apply_sup_mult(self, var_in):
        self.sumx = self.sumx + (var_in * masterClass.super_multiplier)
        time.sleep(0.025)
        print(f"Heartbeat!!")

    # This is a regular method
    def do_math(self, numx):
        self.apply_sup_mult(numx)

ls = [10, 0]
ls2 = [-10, 0]
numx = 10

obj1 = masterClass(ls)
obj2 = masterClass(ls2)

t1 = current_milli_time()
# Run methods one by one
for _ in range(numx):
    obj1.do_math(ls[0])
    obj2.do_math(ls2[0])
obj1.__rep__()
obj2.__rep__()
t2 = current_milli_time()
print(f"Time taken -- {t2 - t1} ms")
print()

## Using multiprocessing to concurrently run two methods
# Intentionally reinitialize objects
obj1 = masterClass(ls)
obj2 = masterClass(ls2)
t1 = current_milli_time()
resx = []
with cf.ProcessPoolExecutor() as executor:
    for i in range(numx):
        #fs = [executor.submit(obj3.do_math, ls[0]), executor.submit(obj4.do_math, ls2[0])]
        f1 = executor.submit(obj1.do_math, ls[0])
        f2 = executor.submit(obj2.do_math, ls2[0])
        # for i,f in enumerate(cf.as_completed(fs)):
        #     print(f"Done with {f}")

# State of sumx
obj1.__rep__()
obj2.__rep__()
t2 = current_milli_time()
print(f"Time taken -- {t2 - t1} ms")

Inferencing slower on detectron2 when using multithreading with cv2.VideoCapture

To start, this is a continuation of this question: Multithreading degrades GPU performance. However, since that question never got resolved (no one was able to reproduce the results), I have created a new question with code that reproduces the slower results outlined there.
To recap: when using cv2.VideoCapture with multi-threading, the inference time for Detectron2 is much slower than when multi-threading is disabled.
Some additional information: I am on Windows and using an RTX 3070, so inference times may be slightly different for anyone rerunning this.
Here is the code:
import time
import cv2
from queue import Queue
from threading import Thread

from detectron2.config import get_cfg
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor

class FileVideoStream:
    def __init__(self, path, queueSize=15):
        self.stream = cv2.VideoCapture(path)
        self.stopped = False
        self.Q = Queue(maxsize=queueSize)

    def start(self):
        t = Thread(target=self.update, args=())
        t.daemon = True
        t.start()
        return self

    def stop(self):
        # restored here for completeness; it is implied by the self.stop() call below
        self.stopped = True

    def update(self):
        while True:
            if self.stopped:
                self.stream.release()
                return
            if not self.Q.full():
                (grabbed, frame) = self.stream.read()
                if not grabbed:
                    self.stop()
                    return
                self.Q.put(frame)

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)
cfg.MODEL.DEVICE = "cuda"

predictor = DefaultPredictor(cfg)

def threading_example():
    print("Threading Example:")
    fvs = FileVideoStream(r"DemoVideo.mp4")
    fvs.start()
    # allow time for thread to fill the queue
    time.sleep(1)
    for i in range(5):
        img = fvs.Q.get()
        start = time.time()
        p = predictor(img)
        end = time.time()
        print(f"Frame {i} Prediction: {(end - start):.2f}s")
    fvs.stopped = True

def non_threading_example():
    print("Non-Threading Example:")
    video = cv2.VideoCapture(r"DemoVideo.mp4")
    for i in range(5):
        _, img = video.read()
        start = time.time()
        p = predictor(img)
        end = time.time()
        print(f"Frame {i} Prediction: {(end - start):.2f}s")

non_threading_example()
threading_example()
This produces the following output:
Non-Threading Example:
Frame 0 Prediction: 1.41s
Frame 1 Prediction: 0.14s
Frame 2 Prediction: 0.14s
Frame 3 Prediction: 0.14s
Frame 4 Prediction: 0.14s
Threading Example:
Frame 0 Prediction: 10.55s
Frame 1 Prediction: 10.41s
Frame 2 Prediction: 10.77s
Frame 3 Prediction: 10.64s
Frame 4 Prediction: 10.27s
EDIT: In response to a comment, I've added code to test whether inference itself slows down when run inside a thread, which does not appear to be the case.
def infer_5(img):
    for i in range(5):
        start = time.time()
        p = predictor(img)
        end = time.time()
        print(f"Frame {i}: {(end - start):.2f}s")

def system_load():
    img = cv2.imread(r"Image.jpg")
    t = Thread(target=infer_5, args=(img,))
    t.start()

Output:

Frame 0: 7.51s
Frame 1: 0.39s
Frame 2: 0.15s
Frame 3: 0.15s
Frame 4: 0.15s
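As an aside (my sketch, not from the original post): CUDA launches are asynchronous, so wrapping predictor(img) in time.time() alone can attribute queued GPU work to the wrong frame. A synchronized timing helper, assuming a CUDA device is available, rules that out:

import time
import torch

def timed_predict(predictor, img):
    torch.cuda.synchronize()  # drain previously queued GPU work first
    start = time.time()
    out = predictor(img)
    torch.cuda.synchronize()  # wait for this frame's kernels to finish
    return out, time.time() - start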

How to listen to the user while playing a chunk of audio with Qt and simpleaudio

I am trying to play an audio signal with simpleaudio inside a GUI application, where the user should react to the content of the chunk being played and push a button. The moment the user pushes the button, I would like to change to the next track. This is all done using Qt's signals and slots in Python 3.x with PyQt5. Even though my GUI does not freeze, I do not understand why I cannot read the button action during (between) the chunks of audio being played; instead, all actions are read after all tracks finish.
My code looks like this:
Module to handle the tracks and chunks
import simpleaudio as sa
import numpy as np

class MusicReactor:
    def __init__(self):
        self.timeProTone = ...
        self.deltaT = ...
        self.maxVolSkal = ...
        self.minVolSkal = ...
        self.frequencySample = ...
        self.currentTestedEar = ...

    def test_function(self, frequency):
        # array of time values
        times = np.arange(0, self.timeProTone, self.deltaT)
        # generator to create a callable object with a new volume each time
        for time in times:
            # get the volume and set the volume to the starting sequence
            currentVolume = (self.maxVolSkal - self.minVolSkal) / self.timeProTone * time + self.minVolSkal
            self.setVolumeScalar(currentVolume)
            # create the tone chunk as a numpy array
            audio = createTone(frequency, self.deltaT, self.frequencySample, self.currentTestedEar)
            yield audio, currentVolume

def createTone(frequency, duration, frequencySampled, currentTestedEar=TestEar.Both):
    # Generate array with seconds*sample_rate steps, ranging between 0 and seconds
    tt = np.linspace((0, 0), (duration, duration), int(duration * frequencySampled), False)
    # populate the other ear with zeros
    if currentTestedEar is not TestEar.Both:
        tt[:, 1 - currentTestedEar.value] = 0  # This strategy works only if the note is a sinusoid: sin(0) = 0
    # Generate a sine wave at the given frequency
    note = np.sin(frequency * tt * 2 * np.pi)
    # normalize to 16-bit range
    note *= 32767 / np.max(np.abs(note))
    # Ensure that highest value is in 16-bit range
    audio = note * (2 ** 15 - 1) / np.max(np.abs(note))
    # Convert to 16-bit data
    audio = audio.astype(np.int16)
    return audio

def playTone(audio, frequencySample, num_channels=1, bytes_per_sample=2):
    # Start playback
    play_obj = sa.play_buffer(audio, num_channels, bytes_per_sample, frequencySample)
    # Wait for playback to finish before exiting
    play_obj.wait_done()

def generateRndFreq(minF, maxF):
    freq = np.random.uniform(low=minF, high=maxF)
    return freq
Now the GUI class and its corresponding worker class
class HearingTest_ui(QWidget):
    # Send info through signals to subthreads
    sig_int_sender = pyqtSignal(int)
    hearingObjSender = pyqtSignal(Hearing.HearingTest)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        uic.loadUi("testForm.ui", self)
        self.Both_rB.toggled.connect(self.onTogle_earTested)
        self.Links_rB.toggled.connect(self.onTogle_earTested)
        self.Recht_rB.toggled.connect(self.onTogle_earTested)
        # Method 2 Test
        self.ML_startButton.clicked.connect(self.runMethod2Test)
        self.setMaxMLProgressBar()
        self.ml_nTests = self.ML_spinBox.value()
        self.ML_spinBox.valueChanged.connect(self.setNTests)
        self.ML_spinBox.valueChanged.connect(self.setMaxMLProgressBar)
        # Hearing Test Object
        self.HT = Hearing.MusicReactor()

    def runMethod2Test(self):
        # Preprocessing
        self.HT.choose_ear(self.testedEarTuple)  # reads a toggle to assign a channel for the chunk of music
        # thread and worker configuration
        # Step 2: Create a QThread object
        self.ml_thread = QThread(parent=self)
        # Step 3: Create a worker object
        self.ml_worker = ML_Worker(self.ml_nTests)
        # Step 4: Move worker to the thread
        self.ml_worker.moveToThread(self.ml_thread)
        # Step 5: Connect signals and slots
        #self.ml_thread.started.connect(partial(self.ml_worker.actualLongTaskFromHearingTest, self.HT))
        self.hearingObjSender.connect(self.ml_worker.actualLongTaskFromHearingTest)
        self.ml_worker.progress.connect(self.updateProgressbar)
        self.ML_spinBox.valueChanged.connect(self.ml_worker.set_maxTests)
        self.sig_int_sender.connect(self.ml_worker.set_maxTests)
        self.ML_yesButton.clicked.connect(self.ml_worker.change_Flag)
        self.ml_worker.request_playchunk.connect(self.ml_worker.sendAudio2queue)
        self.ml_worker.finished.connect(self.ml_thread.quit)
        self.ml_worker.finished.connect(self.ml_worker.deleteLater)
        self.ml_thread.finished.connect(self.ml_thread.deleteLater)
        # Final resets
        self.ml_worker.changeButtonStatus.connect(self.ML_startButton.setEnabled)
        # start thread
        print("clicked runMethodOfLimits")
        self.ml_thread.start()
        self.hearingObjSender.emit(self.HT)

class ML_Worker(QObject):
    finished = pyqtSignal()
    progress = pyqtSignal(int)
    retrieve = pyqtSignal()
    changeButtonStatus = pyqtSignal(bool)
    request_playchunk = pyqtSignal(np.ndarray, int, Hearing.MusicReactor)

    def __init__(self, nTest):
        super().__init__()
        self.__abort = False
        self.nTests = nTest
        self.MoLFlag = False

    def abort(self):
        self.__abort = True

    @pyqtSlot(int)
    def set_maxTests(self, val):
        print(type(val))
        logging.info(f"set_maxTests.... {val}")
        self.nTests = val

    @pyqtSlot()
    def change_Flag(self):
        print("clicked")
        self.MoLFlag = True

    # the long-running task
    @pyqtSlot(Hearing.MusicReactor)
    def actualLongTaskFromHearingTest(self, HTObj):
        self.changeButtonStatus.emit(False)
        self.progress.emit(0)
        self.retrieve.emit()
        print(self.nTests)
        start = 0
        for i in range(self.nTests):
            self.MoLFlag = False
            j = i + 1
            print("start", i)
            # create the frequency for the test
            chunk_freq = Hearing.generateRndFreq(0, 10000)
            # create chunks as a generator
            for chunk, volume in HTObj.test_function(chunk_freq):
                # play a chunk of the audio
                self.request_playchunk.emit(chunk, 2, HTObj)  # this is my current method, using signals and slots
                # Hearing.playTone(chunk, HTObj.frequencySample, num_channels=2)  # previously I tried something like this, which resulted in the same behavior
                print(volume)
                if self.MoLFlag:
                    print(self.MoLFlag)
                    break
            self.progress.emit(j)
        self.changeButtonStatus.emit(True)
        self.finished.emit()

    @pyqtSlot(np.ndarray, int, Hearing.MusicReactor)
    def sendAudio2queue(self, chunk, channels, HTObj):
        Hearing.playTone(chunk, HTObj.frequencySample, num_channels=channels)
If somebody could take a look I would be very grateful. I would really like to understand why this is happening. I believe it has something to do with the thread's event queue; I would probably need to open a new thread in charge of the music while the other one takes care of the GUI reactions, but I still do not understand why it does not break out of the loop (with the generator) when I click the "ML_yesButton".
It is not necessary to use threads in this case. The wait_done() method blocks the thread it runs in so that the application does not terminate before the audio finishes playing.
In this case a QTimer can be used to check whether the audio has finished playing.
import simpleaudio as sa
import numpy as np

from PyQt5.QtCore import pyqtSignal, QObject, Qt, QTimer
from PyQt5.QtWidgets import QApplication, QPushButton, QVBoxLayout, QWidget

class AudioManager(QObject):
    started = pyqtSignal()
    finished = pyqtSignal()

    def __init__(self, parent=None):
        super().__init__(parent)
        self._play_obj = None
        self._timer = QTimer(interval=10)
        self._timer.timeout.connect(self._handle_timeout)

    def start(self, audio_data, num_channels, bytes_per_sample, sample_rate):
        self._play_obj = sa.play_buffer(
            audio_data, num_channels, bytes_per_sample, sample_rate
        )
        self._timer.start()
        self.started.emit()

    def stop(self):
        if self._play_obj is None:
            return
        self._play_obj.stop()
        self._play_obj = None
        self.finished.emit()

    def _handle_timeout(self):
        if self._play_obj is None:
            return
        if not self.running():
            self.stop()

    def running(self):
        return self._play_obj.is_playing()

def create_tone(duration, fs, f):
    tt = np.linspace((0, 0), (duration, duration), int(duration * fs), False)
    note = np.sin(f * tt * 2 * np.pi)
    note *= 32767 / np.max(np.abs(note))
    audio = note * (2 ** 15 - 1) / np.max(np.abs(note))
    audio = audio.astype(np.int16)
    return audio

class Widget(QWidget):
    def __init__(self, parent=None):
        super().__init__(parent)
        self.audio_manager = AudioManager()
        self.audio_manager.started.connect(self.handle_started)
        self.audio_manager.finished.connect(self.handle_finished)

        self.button = QPushButton("Start", checkable=True)
        self.button.toggled.connect(self.handle_toggled)

        lay = QVBoxLayout(self)
        lay.addWidget(self.button, alignment=Qt.AlignCenter)

    def handle_toggled(self, state):
        if state:
            frequency = np.random.uniform(low=0, high=10000)
            tone = create_tone(60, 1000, frequency)
            self.audio_manager.start(tone, 1, 2, 16000)
        else:
            self.audio_manager.stop()
            self.button.setText("Start")

    def handle_started(self):
        self.button.setChecked(True)
        self.button.setText("Stop")

    def handle_finished(self):
        self.button.setChecked(False)
        self.button.setText("Start")

def main():
    import sys

    app = QApplication(sys.argv)
    widget = Widget()
    widget.resize(640, 480)
    widget.show()
    sys.exit(app.exec_())

if __name__ == "__main__":
    main()
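To adapt this pattern to the chunk sequence from the question, one option (my sketch, not part of the answer above; it reuses the AudioManager class and assumes a generator like HTObj.test_function(freq) yielding (chunk, volume) pairs) is to drive the generator from the finished signal, so each chunk is scheduled on the event loop and button clicks are processed between chunks:

from PyQt5.QtCore import QObject

class ChunkPlayer(QObject):
    def __init__(self, manager, sample_rate, parent=None):
        super().__init__(parent)
        self.manager = manager          # an AudioManager as defined above
        self.sample_rate = sample_rate
        self.chunks = None
        # when one chunk finishes (naturally or via stop()), queue the next one
        self.manager.finished.connect(self.play_next)

    def play(self, chunks):
        self.chunks = iter(chunks)
        self.play_next()

    def play_next(self):
        if self.chunks is None:
            return
        try:
            chunk, volume = next(self.chunks)
        except StopIteration:
            self.chunks = None
            return
        # 2 channels, 2 bytes per sample, matching playTone's stereo usage
        self.manager.start(chunk, 2, 2, self.sample_rate)

    def skip(self):
        # connect ML_yesButton.clicked here; stop() emits finished,
        # which advances to the next chunk on the event loop
        self.manager.stop()

This keeps all playback on the GUI thread, so the button's clicked signal is handled between chunks instead of after the whole loop has run.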

Lack of scaling for python's multiprocessing pool

I am writing a simple Python script that I need to scale to many threads. For simplicity, I have replaced the actual function I need with a matrix-matrix multiply. I am having trouble getting my code to scale with the number of processors. Any advice to help me get the correct speedup would be appreciated! My code and results are as follows:
import numpy as np
import time
import math
from multiprocessing.dummy import Pool

res = 4

# we must iterate over all of these values
wavektests = np.linspace(.1, 2.5, res)
omegaratios = np.linspace(.1, 2.5, res)
wavekmat, omegamat = np.meshgrid(wavektests, omegaratios)

def solve_for_omegaratio(ind):
    # obtain the indices for this run
    x_ind = ind % res
    y_ind = math.floor(ind / res)
    # obtain the value for this run
    wavek = wavektests[x_ind]
    omega = omegaratios[y_ind]
    # do some work (I have replaced the real function with this)
    randmat = np.random.rand(4000, 4000)
    nop = np.linalg.matrix_power(randmat, 3)
    # obtain a scalar value
    value = x_ind + y_ind**2.0
    return value

list_ind = range(res**2)

# Serial code execution
t0_proc = time.clock()
t0_wall = time.time()
threads = 0
dispersion = map(solve_for_omegaratio, list_ind)
displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
print('serial execution')
print('wall clock time = ', t1_wall - t0_wall)
print('processor clock time = ', t1_proc - t0_proc)
print('------------------------------------------------')

# Using pool defaults
t0_proc = time.clock()
t0_wall = time.time()
if __name__ == '__main__':
    pool = Pool()
    dispersion = pool.map(solve_for_omegaratio, list_ind)
    displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
pool.close()
print('num of threads = default')
print('wall clock time = ', t1_wall - t0_wall)
print('processor clock time = ', t1_proc - t0_proc)
print('------------------------------------------------')

# Using 4 threads
t0_proc = time.clock()
t0_wall = time.time()
threads = 4
if __name__ == '__main__':
    pool = Pool(threads)
    dispersion = pool.map(solve_for_omegaratio, list_ind)
    displist = list(dispersion)
t1_proc = time.clock()
t1_wall = time.time()
pool.close()
print('num of threads = ' + str(threads))
print('wall clock time = ', t1_wall - t0_wall)
print('processor clock time = ', t1_proc - t0_proc)
print('------------------------------------------------')
Results:
serial execution
wall clock time = 66.1561758518219
processor clock time = 129.16376499999998
------------------------------------------------
num of threads = default
wall clock time = 81.86436200141907
processor clock time = 263.45369
------------------------------------------------
num of threads = 4
wall clock time = 77.63390111923218
processor clock time = 260.66285300000004
------------------------------------------------
Because Python has a GIL (https://wiki.python.org/moin/GlobalInterpreterLock), "Python-native" threads can't execute truly concurrently and thus can't improve the performance of CPU-bound tasks like math. They can be used to parallelize IO-bound tasks effectively (e.g. API calls which spend almost all their time waiting for network I/O). Using multiprocessing rather than dummy's thread-backed implementation will create multiple processes, not threads, which will be able to run concurrently (at the cost of significant memory overhead).
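For illustration, a minimal sketch of that suggestion applied to the question's benchmark (same stand-in workload; only the import and the __main__ guard change, since each worker process re-imports the module):

import math
import time
import numpy as np
from multiprocessing import Pool  # process-backed, unlike multiprocessing.dummy

res = 4

def solve_for_omegaratio(ind):
    # same stand-in workload as in the question
    randmat = np.random.rand(4000, 4000)
    np.linalg.matrix_power(randmat, 3)
    return (ind % res) + math.floor(ind / res) ** 2.0

if __name__ == '__main__':  # required for process pools: workers re-import this module
    t0 = time.time()
    with Pool(4) as pool:
        displist = pool.map(solve_for_omegaratio, range(res ** 2))
    print('wall clock time = ', time.time() - t0)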
