Architecture for data acquisition and processing - python

I am sharpening up my Python skills and have started learning about websockets as an educational tool.
Therefore, I'm working with real-time data received every millisecond via a websocket. I would like to separate its acquisition, processing, and plotting in a clean and comprehensible way. Acquisition and processing are critical, whereas plotting only needs to be updated every ~100 ms.
A) I am assuming that the raw data arrives at a constant rate, every ms.
B) If processing isn't quick enough (>1ms), skip the data that arrived while busy and stay synced with A)
C) Every ~100ms or so, get the last processed data and plot it.
I guess that a Minimal Working Example would start like this:
import threading

class ReceiveData(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def receive(self):
        pass

class ProcessData(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def process(self):
        pass

class PlotData(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def plot(self):
        pass
Starting with this (is it even the right way to go?), how can I pass the raw data from ReceiveData to ProcessData, and periodically to PlotData? How can I keep the executions synced, and repeat the calls every 1 ms or 100 ms?
Thank you.

I think your general approach with threads for receiving and processing the data is fine. For communication between the threads, I would suggest a producer-consumer approach. Here is a complete example using a Queue as the data structure.
In your case, you want to skip unprocessed data and use only the most recent element. To achieve this, collections.deque (see the documentation) might be a better choice for you - see also this discussion.
d = collections.deque(maxlen=1)
The producer side would then append data to the deque like this:
d.append(item)
And the main loop on the consumer side might look like this:
while True:
    try:
        item = d.pop()
        print('Getting item ' + str(item))
    except IndexError:
        print('Deque is empty')
    # time.sleep(s) if you want to poll the latest data every s seconds
Possibly, you can merge the ReceiveData and ProcessData functionalities into just one class / thread and use only one deque between this class and PlotData.
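To make this concrete, here is a minimal sketch of that merged layout, assuming placeholder receive/process steps in place of the real websocket and computation (the class names and timings are illustrative):

import collections
import threading
import time

latest = collections.deque(maxlen=1)   # holds only the most recent processed sample

class ReceiveAndProcess(threading.Thread):
    # Producer: receives a sample every ~1 ms, processes it, overwrites the deque.
    def run(self):
        while True:
            raw = self.receive()
            latest.append(self.process(raw))

    def receive(self):
        time.sleep(0.001)              # stand-in for blocking on the websocket
        return time.time()

    def process(self, raw):
        return raw                     # stand-in for the real processing step

class PlotData(threading.Thread):
    # Consumer: polls the newest value every ~100 ms.
    def run(self):
        while True:
            try:
                item = latest.pop()
                print('plotting', item)   # replace with the actual plot update
            except IndexError:
                pass                      # nothing new since the last poll
            time.sleep(0.1)

if __name__ == '__main__':
    ReceiveAndProcess(daemon=True).start()
    PlotData(daemon=True).start()
    time.sleep(1)                         # let the demo run for a second

Because the deque holds at most one element, the plotter always sees the latest processed value and anything it missed is silently discarded, which matches requirements B) and C).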

Related

jack with multiprocessing skips audio frames

It will be hard without having the whole code, but I will try my best to explain it in detail. If you need more information, please let me know.
So I have a Python program with 3 processes (multiprocessing) running in parallel. The first one is a video-preprocessing task, the second is an audio-preprocessing task, and the last is a DNN model call. All processes are written kinda like this:
from multiprocessing import Process

class NameOfTheProcess(Process):
    def __init__(self, queue, videoBuffer, audioBuffer):
        super().__init__()
        # ....

    def run(self):
        while True:  # so that the processes run till I stop the program
            # ....
The video-preprocessing is simple face tracking plus filling a queue (which is used so I can share the data between the processes).
The audio-preprocessing is a process where I get an audio frame using the jack library. There I downsample the audio and put it in a buffer. After a specific delay of 20 jack callbacks, I start the DNN model process.
In the DNN model process I currently have only 4 simple steps. First I check whether the audio queue is empty; if not, I get an element from the queue and then go through a "dummy" for loop over a range of 1000. After that, I take the last x elements of the audio queue and put them in another queue to use later.
The video-preprocessing and audio-preprocessing work fine, I have no issues there, but when I also start the DNN process I lose a lot of audio, and in the jack client I get many messages like 16:00:12.435 XRUN callback (7 skipped). When I start just the audio-preprocessing and the DNN process, I have the same issue, so in my mind there is no problem with the video-preprocessing.
After a while, I figured out that when I remove the line audioBufferIn = self.audioBuffer.get() in the code below, I no longer lose audio - but I need to get the audio queue there somehow so I can work with it.
from multiprocessing import Process

class DnnModelCall(Process):
    def __init__(self, queue, audioBuffer):
        super().__init__()
        print("DnnModelCall: init")
        self.queue = queue
        self.audioBuffer = audioBuffer

    def run(self):
        print("DnnModelCall: run")
        while True:
            if not self.audioBuffer.empty():
                k = 0
                audioBufferIn = self.audioBuffer.get()
                # audioBufferIn = self.audioBuffer.get(block=False)
                for i in range(0, 1000):
                    k += 1
                outputDnnBackPart = audioBufferIn[-2560:]
                outputQueue = []
                outputQueue.extend(outputDnnBackPart)
                self.queue.put(outputQueue)
I have also tried it with block=False but I get the same result.
Does anyone have an idea?
And if you need more information let me know.
Thanks in advance.

Data structure to control the next step

I am learning about coroutines.

from queue import Queue

class Scheduler:
    def __init__(self):
        self.ready = Queue()   # a queue of tasks that are ready to run
        self.taskmap = {}      # dictionary that keeps track of all active tasks (each task has a unique integer task ID)

    def new(self, target):     # introduce a new task to the scheduler
        newtask = Task(target) # Task is defined elsewhere
        self.taskmap[newtask.tid] = newtask
        self.schedule(newtask)
        return newtask.tid

    def schedule(self, task):
        self.ready.put(task)

    def mainloop(self):
        while self.taskmap:    # does not remove elements from taskmap
            task = self.ready.get()
            result = task.run()
            self.schedule(task)
When reading task = self.ready.get() in mainloop, I suddenly realized that the nature of a data structure is about control, to control the next step, while the nature of an algorithm is also about control, to control all the steps.
Does the understanding make sense?
The Queue object defines control of what step is next, yes. It's FIFO, as described here.
Here, it looks like you're just trying to keep track of tasks, whether there are any remaining, which are executing, and so on. This is "controlling all the steps." Yes.
What's unclear is the purpose. The data structure and algorithm should be suited to your purpose. asyncio can help you implement parallelism and event-driven designs, for example. Sometimes the goal is to quickly and efficiently render data from a source into a data structure. What you're getting at is more meaningful (to me, at least) in the context of an end goal.
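As a minimal illustration of the FIFO control mentioned above (a sketch, not the scheduler itself):

from queue import Queue

ready = Queue()
for name in ('task-1', 'task-2', 'task-3'):
    ready.put(name)        # schedule in this order

while not ready.empty():
    print(ready.get())     # comes back out in the same order: task-1, task-2, task-3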

How to share a value between threads and inform a consuming thread that a new value is set

A fairly common case for me is to have a periodic update of a value, say every 30 seconds. This value is available on, for instance, a website.
I want to take this value (using a reader), transform it (with a transformer) and publish the result, say on another website (with a publisher).
Both source and destination can be unavailable from time to time, and I'm only interested in new values and timeouts.
My current method is to use a queue for my values and another queue for my results. The reader, the transformer and the publisher are all separate 'threads' using multiprocessing.
This has the advantage that every step can be allowed to 'hang' for some time and the next step can use a get with a timeout to implement some default action in case there is no valid message in the queue.
The drawback of this method is that I'm left with all previous values and results in my queue once the transformer or publisher stalls. In the worst case the publisher has an unrecoverable error and the entire tool runs out of memory.
A possible resolution to this problem is to limit the queue size to 1, use a non-blocking put and handle a queue full exception by throwing away the current value and re-putting the new. This is quite a lot of code for such a simple action and a clear indication that a queue is not the right tool for the job.
I can write my own class to get the behavior I want using multiprocessing primitives, but this is a very common situation for me, so I assume it also is for others and I feel there should be a 'right' solution out there somewhere.
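For reference, the maxsize-1 approach described above might look like the sketch below (the class name LatestValue is illustrative; note how much code the overwrite logic takes):

import queue

class LatestValue:
    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def put(self, value):
        while True:
            try:
                self._q.put_nowait(value)    # succeeds if the slot is free
                return
            except queue.Full:
                try:
                    self._q.get_nowait()     # throw away the stale value
                except queue.Empty:
                    pass                     # consumer emptied it first; retry the put

    def get(self, timeout=None):
        return self._q.get(timeout=timeout)  # blocks; raises queue.Empty on timeout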
In short: is there a standard thread-safe class with the following interface?
class Updatable():
    def put(value):
        # store value, overwriting the existing one

    def get(timeout):
        # blocking, raises an Exception when timeout is set and exceeded
        return value
edit: my current implementation using multiprocessing
import multiprocessing
from time import sleep

class Updatable():
    def __init__(self):
        self.manager = multiprocessing.Manager()
        self.ns = self.manager.Namespace()
        self.updated = self.manager.Event()

    def get(self, timeout=None):
        self.updated.wait(timeout)
        self.updated.clear()
        return self.ns.x

    def put(self, item):
        self.ns.x = item
        self.updated.set()

def consumer(updatable):
    print(updatable.get())  # expect 1
    sleep(1)
    print(updatable.get())  # expect "2"
    sleep(1)
    print(updatable.get())  # expect {3}, after 2 sec
    sleep(1)
    print(updatable.get())  # expect [4]
    sleep(2)
    print(updatable.get())  # expect 6
    sleep(1)

def producer():
    sleep(.5)  # make output more stable, by giving both sides 0.5 sec to process
    updatable.put(1)
    sleep(1)
    updatable.put("2")
    sleep(2)
    updatable.put({3})
    sleep(1)
    updatable.put([4])
    sleep(1)
    updatable.put(5)  # will never be consumed
    sleep(1)
    updatable.put(6)

if __name__ == '__main__':
    updatable = Updatable()
    p = multiprocessing.Process(target=consumer, args=(updatable,))
    p.start()
    producer()

Data exchange between two simultaneously running functions in Python

As someone exploring the new and mighty world of Python, I am running into an understanding problem with my code, and it would be great if someone could help me with this one.
To keep my problem simple, I have made an example.
Let's say I have two functions running simultaneously via multiprocessing. One is a permanent data listener and one prints its value out. In addition, I have one object which owns the data; the data is set and read via set/get methods. So the challenge is how both functions can access the data without making it global. I guess my lack of understanding lies somewhere in how to transfer the object between the two functions.
NOTE: the two functions do not need to be in sync, and the while is just for an endless loop. It is only about how to bring the data over.
This gives code like the following (I know it is not working; it is just to get the idea across):
import multiprocessing

# simply a data object
class data(object):
    def __init__(self):
        self.__value = 1

    def set_value(self, value):
        self.__value = value

    def get_value(self):
        return self.__value

# Data listener
def f1(count):
    zae = 0
    while True:
        zae += 1
        count.set_value = zae

def f2(count):
    while True:
        print(count.get_value)

# MainPart
if __name__ == '__main__':
    print('start')
    count = data()
    jobs = []
    p1 = multiprocessing.Process(target=f1(count))
    p2 = multiprocessing.Process(target=f2(count))
    jobs.append(p1)
    jobs.append(p2)
    p1.start()
    p2.start()
    print('end')
Please enlighten me.
Regards,
AdrianMonk
This looks like a neat case for using memory-mapped files.
When a process memory-maps a file (say F) and another process comes along and maps the same file (i.e. maps F.fileno() too), exactly the same block of memory is mapped into the second process's address space. This allows the two processes to exchange information extremely rapidly by writing into the shared memory.
Of course you have to manage proper access (read, write, etc.) in your mappings, and then it is just a matter of polling/writing the proper locations in the file to satisfy the logic of your application (see http://docs.python.org/2/library/mmap.html).
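As a rough illustration (a sketch only: the file name, the fixed 8-byte slot, and the polling loop are assumptions, not an established recipe):

import mmap
import struct
import time
from multiprocessing import Process

PATH = 'shared.dat'  # hypothetical scratch file backing the shared memory

def producer():
    with open(PATH, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 8)       # both processes map the same 8 bytes
        for i in range(1, 6):
            mm[:8] = struct.pack('q', i)    # overwrite the shared slot in place
            time.sleep(0.1)
        mm.close()

def consumer():
    with open(PATH, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 8)
        for _ in range(5):
            (value,) = struct.unpack('q', mm[:8])  # poll the shared slot
            print('consumer sees', value)
            time.sleep(0.1)
        mm.close()

if __name__ == '__main__':
    with open(PATH, 'wb') as f:             # pre-size the backing file
        f.write(b'\x00' * 8)
    p = Process(target=consumer)
    p.start()
    producer()
    p.join()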
The communication channels Pipe and Queue from multiprocessing are designed to solve exactly this kind of problem.
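For completeness, a minimal sketch of the Queue variant applied to the counter example above (the None sentinel and the fixed range are assumptions used to end the demo):

from multiprocessing import Process, Queue

def f1(q):
    for zae in range(5):
        q.put(zae)       # hand each new value to the other process
    q.put(None)          # sentinel: no more data

def f2(q):
    while True:
        value = q.get()  # blocks until f1 sends something
        if value is None:
            break
        print(value)

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=f1, args=(q,))
    p2 = Process(target=f2, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()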

"select" on multiple Python multiprocessing Queues?

What's the best way to wait (without spinning) until something is available in either one of two (multiprocessing) Queues, where both reside on the same system?
Actually, you can use multiprocessing.Queue objects in select.select, i.e.
import multiprocessing
import select

que = multiprocessing.Queue()
(input, [], []) = select.select([que._reader], [], [])
would select que only if it is ready to be read from.
There is no documentation about it, though. I read the source code of the multiprocessing queue implementation (on Linux it's usually something like /usr/lib/python2.6/multiprocessing/queues.py) to find this out.
With Queue.Queue I haven't found any smart way to do this (and I would really love to).
It doesn't look like there's an official way to handle this yet. Or at least, not based on this:
http://bugs.python.org/issue3831
You could try something like what this post is doing -- accessing the underlying pipe filehandles:
http://haltcondition.net/?p=2319
and then use select.
Not sure how well select on a multiprocessing queue works on Windows. As select on Windows listens for sockets and not file handles, I suspect there could be problems.
My answer is to make a thread to listen to each queue in a blocking fashion, and to put the results all into a single queue listened to by the main thread, essentially multiplexing the individual queues into a single one.
My code for doing this is:
"""
Allow multiple queues to be waited upon.
queue,value = multiq.select(list_of_queues)
"""
import queue
import threading
class queue_reader(threading.Thread):
def __init__(self,inq,sharedq):
threading.Thread.__init__(self)
self.inq = inq
self.sharedq = sharedq
def run(self):
while True:
data = self.inq.get()
print ("thread reads data=",data)
result = (self.inq,data)
self.sharedq.put(result)
class multi_queue(queue.Queue):
def __init__(self,list_of_queues):
queue.Queue.__init__(self)
for q in list_of_queues:
qr = queue_reader(q,self)
qr.start()
def select(list_of_queues):
outq = queue.Queue()
for q in list_of_queues:
qr = queue_reader(q,outq)
qr.start()
return outq.get()
The following test routine shows how to use it:
import multiq
import queue

q1 = queue.Queue()
q2 = queue.Queue()
q3 = multiq.multi_queue([q1, q2])

q1.put(1)
q2.put(2)
q1.put(3)
q1.put(4)

res = 0
while not res == 4:
    while not q3.empty():
        res = q3.get()[1]
        print("returning result =", res)
Hope this helps.
Tony Wallace
Seems like using threads which forward incoming items to a single Queue which you then wait on is a practical choice when using multiprocessing in a platform independent manner.
Avoiding the threads requires handling low-level pipes/FDs, which is both platform-specific and hard to do consistently with the higher-level API.
Or you would need Queues with the ability to set callbacks, which I think is the proper higher-level interface to go for. I.e. you would write something like:
singlequeue = Queue()
incoming_queue1.setcallback(singlequeue.put)
incoming_queue2.setcallback(singlequeue.put)
...
singlequeue.get()
Maybe the multiprocessing package could grow this API but it's not there yet. The concept works well with py.execnet which uses the term "channel" instead of "queues", see here http://tinyurl.com/nmtr4w
As of Python 3.3 you can use multiprocessing.connection.wait to wait on multiple Queue._reader objects at once.
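A small sketch of that approach (it still relies on the undocumented _reader attribute, so treat the internals as an assumption):

from multiprocessing import Queue
from multiprocessing.connection import wait

q1, q2 = Queue(), Queue()
q1.put('hello')

# Blocks until at least one of the underlying pipes is readable.
ready = wait([q1._reader, q2._reader])
for conn in ready:
    if conn is q1._reader:
        print(q1.get())
    else:
        print(q2.get())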
You could use something like the Observer pattern, wherein Queue subscribers are notified of state changes.
In this case, you could have your worker thread designated as a listener on each queue, and whenever it receives a ready signal, it can work on the new item, otherwise sleep.
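A minimal sketch of that idea using threads (NotifyingQueue and the shared Event are illustrative helpers, not an existing API):

import queue
import threading
import time

class NotifyingQueue(queue.Queue):
    # A queue that sets a shared Event on every put, waking the listener.
    def __init__(self, event):
        super().__init__()
        self._event = event

    def put(self, item, block=True, timeout=None):
        super().put(item, block, timeout)
        self._event.set()              # notify the listener

ready = threading.Event()
q1, q2 = NotifyingQueue(ready), NotifyingQueue(ready)

def worker():
    while True:
        ready.wait()                   # sleep until some queue signals
        ready.clear()
        for q in (q1, q2):             # a signal arrived: drain both queues
            try:
                print('got', q.get_nowait())
            except queue.Empty:
                pass

threading.Thread(target=worker, daemon=True).start()
q1.put('from q1')
q2.put('from q2')
time.sleep(0.1)                        # give the worker time to drain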
New version of above code...
My code for doing this is:
"""
Allow multiple queues to be waited upon.
An EndOfQueueMarker marks a queue as
"all data sent on this queue".
When this marker has been accessed on
all input threads, this marker is returned
by the multi_queue.
"""
import queue
import threading
class EndOfQueueMarker:
def __str___(self):
return "End of data marker"
pass
class queue_reader(threading.Thread):
def __init__(self,inq,sharedq):
threading.Thread.__init__(self)
self.inq = inq
self.sharedq = sharedq
def run(self):
q_run = True
while q_run:
data = self.inq.get()
result = (self.inq,data)
self.sharedq.put(result)
if data is EndOfQueueMarker:
q_run = False
class multi_queue(queue.Queue):
def __init__(self,list_of_queues):
queue.Queue.__init__(self)
self.qList = list_of_queues
self.qrList = []
for q in list_of_queues:
qr = queue_reader(q,self)
qr.start()
self.qrList.append(qr)
def get(self,blocking=True,timeout=None):
res = []
while len(res)==0:
if len(self.qList)==0:
res = (self,EndOfQueueMarker)
else:
res = queue.Queue.get(self,blocking,timeout)
if res[1] is EndOfQueueMarker:
self.qList.remove(res[0])
res = []
return res
def join(self):
for qr in self.qrList:
qr.join()
def select(list_of_queues):
outq = queue.Queue()
for q in list_of_queues:
qr = queue_reader(q,outq)
qr.start()
return outq.get()
The following code is my test routine and shows how it works:
import multiq
import queue

q1 = queue.Queue()
q2 = queue.Queue()
q3 = multiq.multi_queue([q1, q2])

q1.put(1)
q2.put(2)
q1.put(3)
q1.put(4)
q1.put(multiq.EndOfQueueMarker)
q2.put(multiq.EndOfQueueMarker)

res = 0
have_data = True
while have_data:
    res = q3.get()[1]
    print("returning result =", res)
    have_data = not (res == multiq.EndOfQueueMarker)
The one situation where I'm usually tempted to multiplex multiple queues is when each queue corresponds to a different type of message that requires a different handler. You can't just pull from one queue because if it isn't the type of message you want, you need to put it back.
However, in this case, each handler is essentially a separate consumer, which makes it a multi-producer, multi-consumer problem. Fortunately, even in this case you still don't need to block on multiple queues. You can create a different thread/process for each handler, with each handler having its own queue. Basically, you can just break it into multiple instances of a multi-producer, single-consumer problem.
The only situation I can think of where you would have to wait on multiple queues is if you were forced to put multiple handlers in the same thread/process. In that case, I would restructure it by creating a queue for my main thread, spawning a thread for each handler, and have the handlers communicate with the main thread using the main queue. Each handler could then have a separate queue for its unique type of message.
Don't do it.
Put a header on the messages and send them to a common queue. This simplifies the code and will be cleaner overall.
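A short sketch of that suggestion: one common queue, with a header tagging each message's type (the message kinds here are made up for illustration):

from multiprocessing import Process, Queue

def worker(q):
    while True:
        kind, payload = q.get()        # every message is (header, body)
        if kind == 'stop':
            break
        elif kind == 'audio':
            print('handling audio:', payload)
        elif kind == 'video':
            print('handling video:', payload)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    q.put(('audio', [1, 2, 3]))
    q.put(('video', 'frame-0'))
    q.put(('stop', None))
    p.join()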
