Struggling with multiprocessing Queue - python

My structure (massively simplified) is depicted below:
import multiprocessing

def creator():
    # creates files
    return

def relocator():
    # moves created files
    return

create = multiprocessing.Process(target=creator)
relocate = multiprocessing.Process(target=relocator)
create.start()
relocate.start()
What I am trying to do is have a bunch of files created by creator and, as soon as they are created, have them moved to another directory by relocator.
The reason I want to use multiprocessing here is:
I do not want creator to wait for the moving to finish first, because moving takes time I don't want to waste.
Creating all the files first before starting to copy is not an option either, because there is not enough space on the drive for all of them.
I want both the creator and relocator processes to be serial (one file at a time each) but run in parallel. A "log" of the actions should look like this:
# creating file 1
# creating file 2 and relocating file 1
# creating file 3 and relocating file 2
# ...
# relocating last file
Based on what I have read, Queue is the way to go here.
Strategy: (maybe not the best one?!)
After a file gets created, it enters the queue; once it has finished being relocated, it is removed from the queue.
I am, however, having issues coding it: multiple files being created at the same time (multiple instances of creator running in parallel), and other problems...
I would be very grateful for any ideas, hints, explanations, etc.

Let's take your idea and split it into these features:
Creator should create files (100 for example)
Relocator should move 1 file at a time till there are no more files to move
Creator may end before Relocator, so it can also transform itself into a Relocator
Both have to know when to finish
So, we have 2 main functionalities:
import os
import shutil

def create(i):
    # creates a file and returns its path
    return os.path.join("some/path/based/on/stuff", "{}.ext".format(i))

def relocate(src, dst):
    # moves the created file ("from" is a reserved word in Python, so use src/dst)
    shutil.move(src, dst)
Now let's create our processes:
from multiprocessing import Process, Queue

comm_queue = Queue()

# process that creates the files and pushes their paths into the queue
def creator(comm_q):
    for i in range(100):
        comm_q.put(create(i))
    comm_q.put("STOP_FLAG")  # tell the worker when to stop; we push just one flag since we only have one other worker

# the relocator works till it gets a stop flag
def relocator(comm_q):
    data = comm_q.get()
    while data != "STOP_FLAG":
        if data:
            relocate(data, to_path_you_may_want)
        data = comm_q.get()

creator_process = Process(target=creator, args=(comm_queue,))
relocator_process = Process(target=relocator, args=(comm_queue,))

creator_process.start()
relocator_process.start()
This way we now have a creator and a relocator. But let's say we want the creator to start relocating once its own creation job is done: we can just call relocator from it, but we would need to push one more "STOP_FLAG", since we would then have 2 processes relocating:
def creator(comm_q):
    for i in range(100):
        comm_q.put(create(i))
    for _ in range(2):
        comm_q.put("STOP_FLAG")
    relocator(comm_q)
Let's say we now want an arbitrary number of relocator processes. We need to adapt our code a bit to handle this: the creator function has to know how many stop flags to push so that every process is notified when to stop. The resulting code would look like this:
from multiprocessing import Process, Queue, cpu_count

comm_queue = Queue()

# process that creates the files and pushes their paths into the queue
def creator(comm_q, number_of_subprocesses):
    for i in range(100):
        comm_q.put(create(i))
    for _ in range(number_of_subprocesses + 1):  # we need to count ourselves
        comm_q.put("STOP_FLAG")
    relocator(comm_q)

# the relocator works till it gets a stop flag
def relocator(comm_q):
    data = comm_q.get()
    while data != "STOP_FLAG":
        if data:
            relocate(data, to_path_you_may_want)
        data = comm_q.get()

num_of_cpus = cpu_count()  # we will spawn as many relocator processes as we have CPU cores
creator_process = Process(target=creator, args=(comm_queue, num_of_cpus))
relocators = [Process(target=relocator, args=(comm_queue,)) for _ in range(num_of_cpus)]

creator_process.start()
for rp in relocators:
    rp.start()
Then you will have to WAIT for them to finish:
creator_process.join()
for rp in relocators:
    rp.join()
You may want to look at the multiprocessing.Queue documentation, especially the get method (it is a blocking call by default):
Remove and return an item from the queue. If optional args block is
True (the default) and timeout is None (the default), block if
necessary until an item is available.
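As a small aside (not from the original answer): get can also be given a timeout so a worker does not block forever; it then raises queue.Empty if nothing arrives in time. A minimal sketch:
from multiprocessing import Queue
from queue import Empty   # raised by get() when the timeout expires

q = Queue()

try:
    item = q.get(timeout=5)   # wait at most 5 seconds instead of blocking indefinitely
except Empty:
    item = None               # nothing arrived in time; retry, log, or give up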


Python multiprocessing progress approach

I've been busy writing my first multiprocessing code and it works, yay.
However, now I would like some feedback of the progress and I'm not sure what the best approach would be.
What my code (see below) does in short:
A target directory is scanned for mp4 files
Each file is analysed by a separate process; the process saves a result (an image)
What I'm looking for could be:
Simple
Each time a process finishes a file it sends a 'finished' message
The main code keeps count of how many files have finished
Fancy
Core 0 processing file 20 of 317 ||||||____ 60% completed
Core 1 processing file 21 of 317 |||||||||_ 90% completed
...
Core 7 processing file 18 of 317 ||________ 20% completed
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Thanks in advance!
EDIT: Changed my code that starts the processes as suggested by gsb22
My code:
# file operations
import os
import glob
# Multiprocessing
from multiprocessing import Process
# Motion detection
import cv2

# >>> Enter directory to scan as target directory
targetDirectory = "E:\\Projects\\Programming\\Python\\OpenCV\\videofiles"

def get_videofiles(target_directory):
    # Find all video files in directory and subdirectories and put them in a list
    videofiles = glob.glob(target_directory + '/**/*.mp4', recursive=True)
    # Return the list
    return videofiles

def process_file(videofile):
    '''
    What happens inside this function:
    - The video is processed and analysed using openCV
    - The result (an image) is saved to the results folder
    - Once this function receives the videofile it completes
      without the need to return anything to the main program
    '''
    # The processing code is more complex than this code below, this is just a test
    cap = cv2.VideoCapture(videofile)
    for i in range(10):
        success, frame = cap.read()
        if success:
            try:
                cv2.imwrite('{}/_Results/{}_result_{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
            except:
                print('something went wrong')

if __name__ == "__main__":
    # Create directory to save results if it doesn't exist
    if not os.path.exists(targetDirectory + '/_Results'):
        os.makedirs(targetDirectory + '/_Results')

    # Get a list of all video files in the target directory
    all_files = get_videofiles(targetDirectory)
    print(f'{len(all_files)} video files found')

    # Create list of jobs (processes)
    jobs = []

    # Create and start processes
    for file in all_files:
        proc = Process(target=process_file, args=(file,))
        jobs.append(proc)

    for job in jobs:
        job.start()

    for job in jobs:
        job.join()

    # TODO: Print some form of progress feedback
    print('Finished :)')
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Here's a very simple way to get progress indication at minimal cost:
from multiprocessing.pool import Pool
from random import randint
from time import sleep

from tqdm import tqdm

def process(fn) -> bool:
    sleep(randint(1, 3))
    return randint(0, 100) < 70

files = [f"file-{i}.mp4" for i in range(20)]

success = []
failed = []
NPROC = 5

pool = Pool(NPROC)
for status, fn in tqdm(zip(pool.imap(process, files), files), total=len(files)):
    if status:
        success.append(fn)
    else:
        failed.append(fn)

print(f"{len(success)} succeeded and {len(failed)} failed")
Some comments:
tqdm is a 3rd-party library which implements progressbars extremely well. There are others. pip install tqdm.
we use a pool (there's almost never a reason to manage processes yourself for simple things like this) of NPROC processes. We let the pool handle iterating our process function over the input data.
we signal state by having the function return a boolean (in this example we choose randomly, weighting in favour of success). We don't return the filename, although we could, because it would have to be serialised and sent from the subprocess, and that's unnecessary overhead.
we use Pool.imap, which returns an iterator that keeps the same order as the iterable we pass in, so we can use zip to iterate files directly. Since we use an iterator of unknown size, tqdm needs to be told how long it is. (We could have used pool.map, but there's no need to commit the RAM, although for one bool it probably makes no difference.)
I've deliberately written this as a kind of recipe. You can do a lot with multiprocessing just by using the high-level drop-in paradigms, and Pool.[i]map is one of the most useful.
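If you also want per-file feedback (as noted above, the filename could be returned), one variation is to return it alongside the status and consume results as they finish with imap_unordered. A sketch under the same assumptions as the recipe above:
from multiprocessing.pool import Pool
from random import randint
from time import sleep

from tqdm import tqdm

def process(fn):
    sleep(randint(1, 3))           # stand-in for the real video analysis
    ok = randint(0, 100) < 70      # pretend ~70% of files succeed
    return fn, ok                  # send the filename back with the status

files = [f"file-{i}.mp4" for i in range(20)]

if __name__ == "__main__":
    with Pool(5) as pool:
        # imap_unordered yields each result as soon as a worker finishes,
        # so the progress bar advances in completion order
        for fn, ok in tqdm(pool.imap_unordered(process, files), total=len(files)):
            if not ok:
                tqdm.write(f"failed: {fn}")   # print without breaking the bar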
References
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
https://tqdm.github.io/

How to wait for an input for a set amount of time in python

In the code that I am writing in Python, I check if there is an NFC tag at the reader with id, text = reader.read(). How do I make it skip and run a different function if an NFC tag is not read in a set amount of time?
Thanks.
This is the code, adapted from what you gave, that I used to test it:
from multiprocessing import Process, Manager

def NCFReader(info):
    info = input("Test: ")

def main():
    # Motion is detected, now we want to time the function NCFReader
    info = None
    nfc_proc = Process(target=NCFReader, args=(info,))
    nfc_proc.start()
    nfc_proc.join(timeout=5)
    if nfc_proc.is_alive():
        nfc_proc.terminate()
        print("It did not complete")
        # PROCESS DID NOT FINISH, DO SOMETHING
    else:
        # PROCESS DID FINISH, DO SOMETHING ELSE
        print("It did complete")

main()
From my experience, setting a time limit on a function is surprisingly difficult.
It might seem like overkill, but the way I would do it is like this:
Say the function you want to limit is called NFCReader, then:
from multiprocessing import Process

# Create a process of the function
func_proc = Process(target=NFCReader)
# Start the process
func_proc.start()
# Wait until the child process terminates or T seconds pass
func_proc.join(timeout=T)
# Check if the function has finished and if not kill it
if func_proc.is_alive():
    func_proc.terminate()
You can read more about python Processes here
Additionally, since you want to receive data from your Process, you need some way to read a variable in one process and have it available in another; for that you can use the Manager object.
In your case you could do the following:
def NCFReader(info):
    info['id'], info['text'] = reader.read()

def main():
    ... SOME LINES OF CODE ...

    # Motion is detected, now we want to time the function NCFReader
    info = Manager().dict()
    info['id'] = None
    info['text'] = None

    nfc_proc = Process(target=NCFReader, args=(info,))
    nfc_proc.start()
    nfc_proc.join(timeout=T)
    if nfc_proc.is_alive():
        nfc_proc.terminate()
        # PROCESS DID NOT FINISH, DO SOMETHING
    else:
        # PROCESS DID FINISH, DO SOMETHING ELSE
Notice that info is a dictionary shared among all processes, so if you want to use it again make sure you reset its values.
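A minimal sketch of that reset between attempts (the surrounding while loop, the value of T, and the handle_* functions are placeholders of mine, not part of the code above):
while True:
    info['id'] = None        # clear out the previous read
    info['text'] = None

    nfc_proc = Process(target=NCFReader, args=(info,))
    nfc_proc.start()
    nfc_proc.join(timeout=T)

    if nfc_proc.is_alive():
        nfc_proc.terminate()
        nfc_proc.join()                          # reap the terminated child
        handle_timeout()                         # hypothetical: no tag was read in time
    else:
        handle_tag(info['id'], info['text'])     # hypothetical: a tag was read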
Hope this helps

Python multiprocessing gradually increases memory until it runs out

I have a python program with multiple modules. They go like this:
Job class that is the entry point and manages the overall flow of the program
Task class that is the base class for the tasks to be run on given data. Many SubTask classes, created specifically for different types of calculations on different columns of data, are derived from the Task class. Think of 10 columns in the data, each having its own Task to do some processing; e.g. a 'price' column can be used by a CurrencyConverterTask to return local currency values, and so on.
Many other modules, like a connector for getting data, a utils module, etc., which I don't think are relevant to this question.
The general flow of program: get data from the db continuously -> process the data -> write back the updated data to the db.
I decided to do it in multiprocessing because the tasks are relatively simple. Most of them do some basic arithmetic or logic operations and running it in one process takes a long time, especially getting data from a large db and processing in sequence is very slow.
So the multiprocessing (mp) code looks something like this (I cannot expose the entire file, so I'm writing a simplified version; the parts not included are not relevant here. I've tested by commenting them out, so this is an accurate representation of the actual code):
class Job():
    def __init__(self):
        block_size = 100  # process 100 rows at a time
        some_query = "SELECT * IF A > B"  # some query to filter data from db

    def data_getter(self):
        # continuously get data from the db and put it into a queue in blocks
        cursor = Connector.get_data(some_query)
        block = []
        for item in cursor:
            block.append(item)
            if len(block) == block_size:
                data_queue.put(block)
                block = []
        data_queue.put(None)  # this will indicate to the worker processes when to stop

    def monitor(self):
        # continuously monitor the system stats
        timer = Timer()
        while True:
            if timer.time_taken >= 60:  # log some stats every 60 seconds
                print(utils.system_stats())
                timer.reset()

    def task_runner(self):
        while True:
            # get data from the queue; if there's no data, break out of the loop
            data = data_queue.get()
            if data is None:
                break
            # run tasks one by one
            for task in tasks:
                task.do_something(data)

    def run(self):
        # queue to put data in for processing
        data_queue = mp.Queue()

        # start a process for reading data from the db
        dg = mp.Process(target=self.data_getter)
        dg.start()

        # start a process for monitoring system stats
        mon = mp.Process(target=self.monitor)
        mon.start()

        # get a list of tasks to run
        tasks = [t for t in taskmodule.get_subtasks()]

        workers = []
        # start 4 processes to do the actual processing
        for _ in range(4):
            worker = mp.Process(target=self.task_runner)
            worker.start()
            workers.append(worker)

        for w in workers:
            w.join()

        mon.terminate()  # terminate the monitor process
        dg.terminate()   # end the data-getting process

if __name__ == "__main__":
    job = Job()
    job.run()
The whole program is run like: python3 runjob.py
Expected behaviour: a continuous stream of data goes into the data_queue, and each worker process gets the data and processes it until there's no more data from the cursor, at which point the workers finish and the entire program finishes.
This is working as expected, but what is not expected is that the system memory usage keeps creeping up continuously until the system crashes. The data I'm getting here is not copied anywhere (at least not intentionally). I expect the memory usage to be steady throughout the program. The length of the data_queue rarely exceeds 1 or 2, since the processes are fast enough to get the data when it is available, so it's not the queue holding too much data.
My guess is that all the processes initiated here are long-running ones and that has something to do with this. I can print the PIDs, and if I follow them in the top command, the data_getter and monitor processes don't exceed 2% of memory usage; the 4 worker processes also don't use a lot of memory, and neither does the main process the whole thing runs in. But there is an unaccounted-for process that takes up 20%+ of the RAM, and it bugs me so much that I can't figure out what it is.
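One way to pin down where the memory is going (a diagnostic suggestion of mine, not part of the original post) is to log the resident memory of the main process and every child from inside the monitor loop, using the third-party psutil package:
import os
import psutil   # third-party: pip install psutil

def log_memory():
    # report the resident set size (RSS) of this process and all of its children
    main = psutil.Process(os.getpid())
    print(f"main {main.pid}: {main.memory_info().rss / 1024**2:.1f} MiB")
    for child in main.children(recursive=True):
        print(f"child {child.pid} ({child.name()}): "
              f"{child.memory_info().rss / 1024**2:.1f} MiB")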

How to have several (python) processes watch a folder for items but take action one at a time?

Say I have a Python script which watches a folder for new files, and then processes the files (one at a time) based on certain criteria (in their names).
I need to run several of these "watchers" at the same time, so that they can process several files at once. (Rendering video.)
Once a watcher picks up a file for processing, it renames it (prepending rendering_)
What's the best way to make sure that 2 or more of the watchers don't pick up the same file at the same time and try to render the same job?
My only idea is to have each 'watcher' check only when the current time in seconds is x, so that process 1 checks when it's :01 past the minute, etc. But this seems silly, and we'd have to wait a whole minute for every check.
Just to clarify ... say I have 4 instances of watcher running. In the watch folder 7 items are added: job1..job7. I want 1 watcher to pick up 1 job.
When a watcher is done, it should grab the next job. So watcher1 might do job1, watcher2 does job2, etc.
When watcher1 is done with job1, it should pick up job5.
I hope that's clear.
Also, I want each 'watcher' running in its own Terminal window, where we can see its progress, as well as easily terminate, or launch more watchers.
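One sketch of how the rename-prefix idea could avoid races (just an illustration; try_claim and render are made-up names): on a local filesystem os.rename is atomic, so only one watcher can win the rename and the others simply skip that file.
import os, glob

def try_claim(path):
    # try to claim a job by renaming it; only one watcher can succeed
    claimed = os.path.join(os.path.dirname(path), "rendering_" + os.path.basename(path))
    try:
        os.rename(path, claimed)
        return claimed                 # we own this job now
    except OSError:
        return None                    # another watcher already grabbed it

# for path in glob.glob("watch_folder/*.mp4"):   # hypothetical watcher loop
#     job = try_claim(path)
#     if job:
#         render(job)                             # hypothetical render step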
You should be using something like multiprocessing I think.
What you can do is have 1 master program that watches for files constantly.
Then when it detects something, that master program sends it off to 1 slave and continues watching.
So instead of 5 scripts looking, have 1 looking and the rest processing when the one looking tells them to.
You asked how I would do this; I'm not experienced and this is probably not a great way to do it:
In order to do this you can have the main script store the data you want in a variable temporarily. Let's say the variable is called "Data".
Then you can use something like subprocess, if on Windows, to get it running from the master script:
subprocess.run(["python", "slave_file.py"])
Then you can have another python script (the slave scripts) which do:
from your_master_script import x
and then do things.
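A sketch of how the master could instead hand a detected file to a worker as a command-line argument; the render_worker.py name and the argument passing are my own assumptions, not part of the answer:
# master.py: when a new file is detected, hand it to a worker in its own process
import subprocess
import sys

def dispatch(filepath):
    # pass the path as a command-line argument to the worker script
    subprocess.Popen([sys.executable, "render_worker.py", filepath])

# render_worker.py would then read it back with:
#   import sys
#   filepath = sys.argv[1]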
To expand on my comment, you can try to rename the files and track each file type/name by each watcher like so:
watcher 1 -> check for .step0 files
rename to .step1 when finished
watcher 2 -> check for .step1 files
rename to .step2 when finished
...
watcher n -> check for .step{n-1} files
rename to .final_format when finished
To demonstrate, here's a sample using multiprocessing to instantiate 4 different watchers:
import time, glob
from multiprocessing import Process

path = 'Watcher Demo'

class Watcher(object):
    def __init__(self, num):
        self.num = num
        self.lifetime = 50.0

    def start(self):
        start = time.time()
        targets = '\\'.join((path, f'*.step{self.num-1}'))
        while time.time() - start <= self.lifetime:
            for filename in glob.glob(targets):
                time.sleep(2)  # artificial wait so we can see the effects
                with open(filename, 'a') as file:
                    file.write(f"I've been touched inappropriately by watcher {self.num}\n")
                newname = glob.os.path.splitext(filename)[0] + f'.step{self.num}'
                glob.os.rename(filename, newname)

def create_file():
    for i in range(7):
        filename = '\\'.join((path, f'job{i}.step0'))
        with open(filename, 'w') as file:
            file.write(f'new file {i}\n')
        time.sleep(5)

if __name__ == '__main__':
    if not glob.os.path.exists(path):
        glob.os.mkdir(path)

    watchers = [Watcher(i).start for i in range(1, 5)]
    processes = [Process(target=p) for p in [create_file] + watchers]

    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
Which will create and process files like so:
create_file() -> *newfile* -> job0.step0
Watcher(1).start() -> job0.step0 -> job0.step1
watcher2('job0.step1') -> job0.step1 -> job0.step2
watcher3('job0.step2') -> job0.step2 -> job0.step3
watcher4('job0.step3') -> job0.step3 -> job0.step4
And the files (e.g. job0.step4) will be done in order:
new file 0
I've been touched inappropriately by watcher 1
I've been touched inappropriately by watcher 2
I've been touched inappropriately by watcher 3
I've been touched inappropriately by watcher 4
I haven't renamed the file format to a final one as this is just a demo, but it's easily doable as your final code should have different watchers instead of generic ones anyhow.
With the multiprocessing module you won't be able to see separate terminals for each watcher, but this is just to demonstrate the concept... You can always switch to the subprocess module.
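If the separate-terminal requirement matters, a rough sketch of what that switch could look like on Windows (the watcher.py script and its step argument are hypothetical):
import subprocess
import sys

# launch each watcher as its own script in its own console window (Windows only)
for step in range(1, 5):
    subprocess.Popen(
        [sys.executable, "watcher.py", str(step)],
        creationflags=subprocess.CREATE_NEW_CONSOLE,
    )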
As a side note, I did notice a bit of a performance dip while testing this. I'm assuming it's because the program is continuously looping and watching. A better, more efficient way would be to schedule your watches to run as tasks at specific times. You could run watch1 every hour on the dot, watch2 every hour at the 15th minute, watch3 every hour at the 30th minute... etc. This is a far more efficient approach, as it only looks for files once and only processes them if found.

Python's multiprocessing is not creating tasks in parallel

I am learning about multithreading in Python using the multiprocessing library. For that purpose, I tried to create a program to divide a big file into several smaller chunks. So, first I read all the data from that file, and then create worker tasks that each take a segment of the data from that input file and write that segment into a file. I expect to have as many parallel threads running as the number of segments, but that does not happen. I see a maximum of two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
    for p in jobs:
        p.join
Process gives you one additional process (which, with your main process, gives you two processes). The call to join at the end of each loop will wait for that process to finish before starting the next loop. If you insist on using Process, you'll need to store the returned processes (probably in a list), and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
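A minimal sketch of what the Pool version could look like for this splitting task, reusing the question's helpers (getSegment, getFileName, writeToFile) as-is:
from multiprocessing import Pool

def worker(args):
    segment, x = args
    writeToFile(segment, getFileName(x))

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()

    work = [(getSegment(x, lines), x) for x in range(numberOfSegments)]

    # the pool keeps 4 workers busy at a time and waits for all of them
    with Pool(processes=4) as pool:
        pool.map(worker, work)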
