Python ThreadPoolExecutor with continuous unbounded input

I have a folder on a server which continuously receives files throughout the day. I need to watch the directory and, once a file arrives, start some processing on it. Sometimes the processing can take quite a while depending on file size, which can reach up to 20 GB.
I am using concurrent.futures.ThreadPoolExecutor to process multiple files at a time, but I need some help understanding how to handle the following scenario:
I receive 5 files (4 small and 1 huge) at once, and ThreadPoolExecutor picks up all 5 for processing. It takes a few seconds to process the 4 small files but 20 minutes to process the large one. Meanwhile, another 10 files are waiting in the folder while the large file is still being processed.
I have set max_workers=5, but now only one worker is busy with the large file and the next set of files is blocked. How can I start processing the other files while the 4 remaining workers are free?
import os
import time
import random
import concurrent.futures
import datetime
import functools

def process_file(file1, input_num):
    # Do some processing
    os.remove(os.path.join('C:\\temp\\abcd', file1))
    time.sleep(10)

def main():
    print("Start Time is ", datetime.datetime.now())
    # It will be a continuous loop which will watch a directory for incoming files
    while True:
        # Get the list of files in the directory
        file_list = os.listdir('C:\\temp\\abcd')
        print("file_list is", file_list)
        input_num = random.randint(1000000000, 9999999999)
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            process_file_arg = functools.partial(process_file, input_num=input_num)
            executor.map(process_file_arg, file_list)
        time.sleep(10)

if __name__ == '__main__':
    main()
The main() function continuously watches the directory and calls ThreadPoolExecutor on whatever files it finds.

I ran into the same problem; this answer may help you.
concurrent.futures.wait returns the futures as a named 2-tuple of sets, done and not_done. You can consume the done set and keep adding new tasks to the not_done list, so the parallel work stays continuous. Here is an example snippet:
thread_list = []
with open(input_filename, 'r') as fp_in:
    with concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_LIMIT) as executor:
        for para_list in fp_in:
            thread_list.append(executor.submit(your_thread_func, para_list))
            if len(thread_list) >= THREAD_LIMIT:
                done, not_done = concurrent.futures.wait(thread_list, timeout=1,
                                                         return_when=concurrent.futures.FIRST_COMPLETED)
                # consume finished
                done_res = [i.result() for i in done]
                # and keep unfinished
                thread_list = list(not_done)
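Applied to the directory watcher in the question, the same pattern might look roughly like the sketch below. This is an untested adaptation: the path comes from the question, process_file is reduced to a stand-in, and the submitted set is only there so a file is not handed to the executor twice before process_file deletes it.

import os
import time
import concurrent.futures

WATCH_DIR = 'C:\\temp\\abcd'   # directory from the question
MAX_WORKERS = 5

def process_file(name):
    # stand-in for the question's process_file (which also takes input_num)
    time.sleep(10)
    os.remove(os.path.join(WATCH_DIR, name))

def watch_and_process():
    pending = set()      # futures still running
    submitted = set()    # file names already handed to the executor
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        while True:
            for name in os.listdir(WATCH_DIR):
                if name not in submitted:
                    submitted.add(name)
                    pending.add(executor.submit(process_file, name))
            if pending:
                # returns as soon as any one future finishes, so newly arrived
                # files get submitted even while the 20 GB file is still running
                done, pending = concurrent.futures.wait(
                    pending, timeout=1,
                    return_when=concurrent.futures.FIRST_COMPLETED)
                for fut in done:
                    fut.result()   # surface any exception from process_file
            else:
                time.sleep(10)

if __name__ == '__main__':
    watch_and_process()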

Related

Python multiprocessing progress approach

I've been busy writing my first multiprocessing code and it works, yay.
However, now I would like some feedback on the progress and I'm not sure what the best approach would be.
What my code (see below) does in short:
A target directory is scanned for mp4 files
Each file is analysed by a separate process, the process saves a result (an image)
What I'm looking for could be:
Simple
Each time a process finishes a file it sends a 'finished' message
The main code keeps count of how many files have finished
Fancy
Core 0 processing file 20 of 317 ||||||____ 60% completed
Core 1 processing file 21 of 317 |||||||||_ 90% completed
...
Core 7 processing file 18 of 317 ||________ 20% completed
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Thanks in advance!
EDIT: Changed my code that starts the processes as suggested by gsb22
My code:
# file operations
import os
import glob
# Multiprocessing
from multiprocessing import Process
# Motion detection
import cv2

# >>> Enter directory to scan as target directory
targetDirectory = "E:\\Projects\\Programming\\Python\\OpenCV\\videofiles"

def get_videofiles(target_directory):
    # Find all video files in directory and subdirectories and put them in a list
    videofiles = glob.glob(target_directory + '/**/*.mp4', recursive=True)
    # Return the list
    return videofiles

def process_file(videofile):
    '''
    What happens inside this function:
    - The video is processed and analysed using openCV
    - The result (an image) is saved to the results folder
    - Once this function receives the videofile it completes
      without the need to return anything to the main program
    '''
    # The processing code is more complex than this code below, this is just a test
    cap = cv2.VideoCapture(videofile)
    for i in range(10):
        success, frame = cap.read()
        # cv2.imwrite('{}/_Results/{}_result{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
        if success:
            try:
                cv2.imwrite('{}/_Results/{}_result_{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
            except:
                print('something went wrong')

if __name__ == "__main__":
    # Create directory to save results if it doesn't exist
    if not os.path.exists(targetDirectory + '/_Results'):
        os.makedirs(targetDirectory + '/_Results')

    # Get a list of all video files in the target directory
    all_files = get_videofiles(targetDirectory)
    print(f'{len(all_files)} video files found')

    # Create list of jobs (processes)
    jobs = []

    # Create and start processes
    for file in all_files:
        proc = Process(target=process_file, args=(file,))
        jobs.append(proc)

    for job in jobs:
        job.start()

    for job in jobs:
        job.join()

    # TODO: Print some form of progress feedback
    print('Finished :)')
Here's a very simple way to get progress indication at minimal cost:
from multiprocessing.pool import Pool
from random import randint
from time import sleep

from tqdm import tqdm

def process(fn) -> bool:
    sleep(randint(1, 3))
    return randint(0, 100) < 70

files = [f"file-{i}.mp4" for i in range(20)]

success = []
failed = []
NPROC = 5

pool = Pool(NPROC)
for status, fn in tqdm(zip(pool.imap(process, files), files), total=len(files)):
    if status:
        success.append(fn)
    else:
        failed.append(fn)

print(f"{len(success)} succeeded and {len(failed)} failed")
Some comments:
tqdm is a third-party library which implements progress bars extremely well. There are others. pip install tqdm.
we use a pool (there's almost never a reason to manage processes yourself for simple things like this) of NPROC processes. We let the pool handle iterating our process function over the input data.
we signal state by having the function return a boolean (in this example we choose randomly, weighting in favour of success). We don't return the filename, although we could, because it would have to be serialised and sent from the subprocess, and that's unnecessary overhead.
we use Pool.imap, which returns an iterator that keeps the same order as the iterable we pass in. So we can use zip to iterate over files directly. Since we use an iterator of unknown size, tqdm needs to be told how long it is. (We could have used pool.map, but there's no need to commit the RAM, although for one bool it probably makes no difference.)
I've deliberately written this as a kind of recipe. You can do a lot with multiprocessing just by using the high-level drop-in paradigms, and Pool.[i]map is one of the most useful.
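For the question's video files specifically, the same recipe might be wired up roughly as in the sketch below; the file list and process_file here are stand-ins (the real ones are in the question, with process_file changed to return True/False):

from multiprocessing.pool import Pool
from tqdm import tqdm

def process_file(videofile) -> bool:
    # stand-in for the question's process_file, changed to report success
    return True

if __name__ == "__main__":
    # in the question this would be get_videofiles(targetDirectory)
    all_files = [f"video_{i}.mp4" for i in range(317)]

    with Pool(8) as pool:
        # tqdm wraps the imap iterator, so the bar advances as each file finishes
        results = list(tqdm(pool.imap(process_file, all_files), total=len(all_files)))

    print(f"{sum(results)} of {len(all_files)} files processed successfully")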
References
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
https://tqdm.github.io/

dask.distributed: wait for all tasks to finish before shutdown (without futures)

Tldr:
I'm using fire_and_forget to execute tasks on a dask.distributed cluster, so I don't maintain a future for each task. How can I wait until they are all done before the cluster gets shut down?
Details:
I have a workflow that creates an xarray dataset which is persisted on the cluster. Once the computations are done, I want to save the time slices individually and move on to the next dataset.
Until now, I've been using a delayed function and collected a list of delayed tasks which I then passed to client.compute; this way I was sure everything was done before I moved on to the next dataset. The downside is that everything blocks until the very last file has been written.
Now I'm looking into fire_and_forget to be able to start the computations on the next dataset while the files of the previous one are still being written.
I'm planning to wait for each dataset to be completed before I start the fire_and_forget tasks, so they should have plenty of time to complete.
The only issue I've encountered is that when processing the last dataset, there's no more waiting and the cluster gets shut down right after the last fire_and_forget call, even though the processes are still running.
So is there any way to tell the client it needs to block until all is completed?
Or am I maybe not properly understanding the use of fire_and_forget and should stay with my previous approach?
Here's an example code that simulates the workflow - it does 10 iterations (simulating the different datasets) and then writes the first 10 time slices to pickle files. So in the end I'm expecting 100 pickle files on disk, which is not the case.
import pickle
import random
from time import sleep

from dask import delayed
from dask.distributed import LocalCluster, Client, wait, fire_and_forget
import xarray as xr

@delayed
def dump_delayed(x, fn):
    with open(fn, "wb") as f:
        random.seed(42)
        sleep(random.randint(1, 2))
        pickle.dump(x, f)

TARGET = "/home/jovyan/"

def main():
    cluster = LocalCluster(n_workers=2, ip="0.0.0.0")
    client = Client(cluster)
    ds = xr.tutorial.open_dataset("rasm")

    for it in range(1, 10):
        print("Iteration %s" % it)

        # simulating the processing and persisting
        ds2 = (ds * it).chunk({"time": 1}).persist()
        _ = wait(ds2)

        for ii in range(10):
            fn = TARGET + f"temp{ii}_{it}.pkl"
            xx = ds2.isel(time=ii)
            f = client.persist(dump_delayed(xx, fn))
            fire_and_forget(f)

if __name__ == "__main__":
    main()
Not sure if this qualifies as a solution, but fire_and_forget is for a specific use case where you do not want to track the status of the task. If you are interested in the status of the tasks, it's better to use regular futures.
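For completeness, a minimal sketch of that futures-based alternative, with a hypothetical dump function and paths; the point is only that keeping the futures around lets you wait on them once before closing the client:

import pickle
from dask.distributed import LocalCluster, Client, wait

def dump(x, fn):
    # plain function, submitted as a future instead of fire_and_forget
    with open(fn, "wb") as f:
        pickle.dump(x, f)

def main():
    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)

    futures = []
    for it in range(3):                    # stand-in for the datasets loop
        for ii in range(10):               # stand-in for the time slices
            futures.append(client.submit(dump, (it, ii), f"/tmp/temp{ii}_{it}.pkl"))
        # no blocking here: the next iteration can start while writes continue

    wait(futures)                          # block only once, before shutdown
    client.close()
    cluster.close()

if __name__ == "__main__":
    main()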

Processing many files in a folder parallely using python

I have a folder with 100 Excel files, and my program has to process all of them. I want to do this in parallel using multithreading or multiprocessing in Python. I am planning to use 10 threads or processes, each of which will process 10 files: the first thread/process should handle files 1-10, the second files 11-20, and so on. I tried using multithreading in Python but I'm not sure how to assign specific files to each thread. Any suggestions are most welcome.
The simplest way to do multiprocessing is as follows:
import multiprocessing

files = [ ... list of file names generated somehow ... ]

def process_file(file_name):
    ... process file named file_name however you want ...

with multiprocessing.Pool(10) as pool:
    pool.map(process_file, files, 10)
The 10 passed to Pool is the number of worker processes. The second 10 is the chunksize: the files are handed to the workers in chunks of 10. Variants of map() exist that can take care of many of your needs.
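For example, if you want results back as each file finishes rather than after the whole batch, imap_unordered is one such variant; a small sketch with a hypothetical process_file and file list:

import multiprocessing

def process_file(file_name):
    # placeholder: do the real per-file work here and return something useful
    return file_name, len(file_name)

if __name__ == "__main__":
    files = [f"file_{i}.xlsx" for i in range(100)]   # hypothetical file list
    with multiprocessing.Pool(10) as pool:
        # results arrive in completion order, not submission order
        for name, result in pool.imap_unordered(process_file, files, chunksize=10):
            print(name, result)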
Python 3 has a built-in library, threading. Here is an example:
from threading import Thread
import time
import random

def slow_function(i):
    time.sleep(random.randint(1, 10))
    print(i)

def running_threads():
    threads = []
    for i in range(10):
        t = Thread(target=slow_function, args=(i,))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()  # making sure that all your threads are done before doing something else with the results

running_threads()
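If you specifically want the split described in the question (thread 1 handles files 1-10, thread 2 handles files 11-20, and so on), one way is to slice the file list into chunks and give each thread a chunk. A rough sketch, where the folder name and process_file are placeholders:

import os
from threading import Thread

FOLDER = "excel_files"          # hypothetical folder name
CHUNK_SIZE = 10

def process_file(path):
    # placeholder for the real per-file processing
    print("processing", path)

def process_chunk(paths):
    # each thread works through its own slice of the file list
    for path in paths:
        process_file(path)

if __name__ == "__main__":
    files = sorted(os.path.join(FOLDER, f) for f in os.listdir(FOLDER))
    chunks = [files[i:i + CHUNK_SIZE] for i in range(0, len(files), CHUNK_SIZE)]

    threads = [Thread(target=process_chunk, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()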

Python - multiprocessing and text file processing

BACKGROUND:
I have a huge file .txt which I have to process. It is a data mining project.
So I've split it to many .txt files each one 100MB size, saved them all in the same directory and managed to run them this way:
from multiprocessing.dummy import Pool

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
    else:
        continue
In process(), I parse the file into a list of objects, then I apply another function to it. This is SLOWER than running the whole file as is, but for big enough files I won't be able to run everything at once and will have to slice the input. So I want to use threads, so that I don't have to wait for each process(filename) call to finish.
How can I apply that? I've checked this but I didn't understand how to apply it to my code...
Any help would be appreciated.
I looked here to see how to do this. What I've tried:
pool = Pool(6)
for x in range(6):
    futures.append(pool.apply_async(process, filename))
Unfortunately I realized it will only handle the first 6 text files, or will it? How can I make it work so that as soon as a thread finishes, it is assigned another text file and starts running?
EDIT:
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
    else:
        continue
First, using multiprocessing.dummy will only give you a speed increase if your problem is IO bound (i.e. reading the files is the main bottleneck). For CPU-intensive tasks (processing the file is the bottleneck) it won't help, in which case you should use "real" multiprocessing.
The problem you describe seems more fit for the use of one of the map functions of Pool:
from multiprocessing import Pool

files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]

pool = Pool(6)
results = pool.map(process, files)
pool.close()
This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.
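If you prefer the apply_async style from your attempt, the corrected pattern would look roughly like this sketch (process and the directory are placeholders; note that each file is submitted exactly once and that apply_async takes an args tuple):

import os
from multiprocessing import Pool

def process(filename):
    # placeholder for the real parsing/processing
    return filename

if __name__ == "__main__":
    pathToFile = "."                      # hypothetical directory
    files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]

    with Pool(6) as pool:
        # one task per file; the pool keeps 6 workers busy and feeds them
        # the next file as soon as a worker is free
        futures = [pool.apply_async(process, (filename,)) for filename in files]
        results = [fut.get() for fut in futures]

    print(len(results), "files processed")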

Python's multiprocessing is not creating tasks in parallel

I am learning about parallel processing in Python using the multiprocessing library. To that end, I tried to create a program that divides a big file into several smaller chunks. First I read all the data from the file, then I create worker tasks that each take a segment of the data and write that segment to a file. I expect to have as many parallel processes running as there are segments, but that does not happen: I see at most two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()

    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()

    for p in jobs:
        p.join
Process gives you one additional process (which, with your main process, gives you two). A call to join at the end of each loop iteration will wait for that process to finish before starting the next loop. If you insist on using Process, you'll need to store the returned processes (probably in a list) and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
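A rough sketch of what that Pool-based version could look like for this splitting job; the input file name, segment count, and output naming are made up, and the question's getSegment/getFileName/writeToFile helpers are replaced by inline stand-ins:

import multiprocessing

def worker(args):
    # stand-in for the question's worker: write one segment to its own file
    segment, x = args
    with open(f"segment_{x}.txt", "w") as out:
        out.writelines(segment)

if __name__ == '__main__':
    numberOfSegments = 8
    with open("input.txt") as f:          # hypothetical input file
        lines = f.readlines()

    # split the lines into numberOfSegments roughly equal chunks
    chunk = max(1, len(lines) // numberOfSegments)
    segments = []
    for i in range(numberOfSegments):
        end = None if i == numberOfSegments - 1 else (i + 1) * chunk
        segments.append((lines[i * chunk:end], i))

    with multiprocessing.Pool() as pool:
        pool.map(worker, segments)        # blocks until every segment is written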
