implementing a basic queue/thread process within python - python

looking for some eyeballs to verifiy that the following chunk of psuedo python makes sense. i'm looking to spawn a number of threads to implement some inproc functions as fast as possible. the idea is to spawn the threads in the master loop, so the app will run the threads simultaneously in a parallel/concurrent manner
chunk of code
-get the filenames from a dir
-write each filename ot a queue
-spawn a thread for each filename, where each thread
waits/reads value/data from the queue
-the threadParse function then handles the actual processing
based on the file that's included via the "execfile" function...
# System modules
from Queue import Queue
from threading import Thread
import time
# Local modules
#import feedparser
# Set up some global variables
appqueue = Queue()
# more than the app will need
# this matches the number of files that will ever be in the
# urldir
#
num_fetch_threads = 200
def threadParse(q)
#decompose the packet to get the various elements
line = q.get()
college,level,packet=decompose (line)
#build name of included file
fname=college+"_"+level+"_Parse.py"
execfile(fname)
q.task_done()
#setup the master loop
while True
time.sleep(2)
# get the files from the dir
# setup threads
filelist="ls /urldir"
if filelist
foreach file_ in filelist:
worker = Thread(target=threadParse, args=(appqueue,))
worker.start()
# again, get the files from the dir
#setup the queue
filelist="ls /urldir"
foreach file_ in filelist:
#stuff the filename in the queue
appqueue.put(file_)
# Now wait for the queue to be empty, indicating that we have
# processed all of the downloads.
#don't care about this part
#print '*** Main thread waiting'
#appqueue.join()
#print '*** Done'
Thoughts/comments/pointers are appreciated...
thanks

If I understand this right: You spawn lots of threads to get things done faster.
This only works if the main part of the job done in each thread is done without holding the GIL. So if there is a lot of waiting for data from network, disk or something like that, it might be a good idea.
If each of the tasks are using a lot of CPU, this will run pretty much like on a single core 1-CPU machine and you might as well do them in sequence.
I should add that what I wrote is true for CPython, but not necessarily for Jython/IronPython.
Also, I should add that if you need to utilize more CPUs/cores, there's the multiprocessing module that might help.

Related

Python multiprocessing progress approach

I've been busy writing my first multiprocessing code and it works, yay.
However, now I would like some feedback of the progress and I'm not sure what the best approach would be.
What my code (see below) does in short:
A target directory is scanned for mp4 files
Each file is analysed by a separate process, the process saves a result (an image)
What I'm looking for could be:
Simple
Each time a process finishes a file it sends a 'finished' message
The main code keeps count of how many files have finished
Fancy
Core 0 processing file 20 of 317 ||||||____ 60% completed
Core 1 processing file 21 of 317 |||||||||_ 90% completed
...
Core 7 processing file 18 of 317 ||________ 20% completed
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Thanks in advance!
EDIT: Changed my code that starts the processes as suggested by gsb22
My code:
# file operations
import os
import glob
# Multiprocessing
from multiprocessing import Process
# Motion detection
import cv2
# >>> Enter directory to scan as target directory
targetDirectory = "E:\Projects\Programming\Python\OpenCV\\videofiles"
def get_videofiles(target_directory):
# Find all video files in directory and subdirectories and put them in a list
videofiles = glob.glob(target_directory + '/**/*.mp4', recursive=True)
# Return the list
return videofiles
def process_file(videofile):
'''
What happens inside this function:
- The video is processed and analysed using openCV
- The result (an image) is saved to the results folder
- Once this function receives the videofile it completes
without the need to return anything to the main program
'''
# The processing code is more complex than this code below, this is just a test
cap = cv2.VideoCapture(videofile)
for i in range(10):
succes, frame = cap.read()
# cv2.imwrite('{}/_Results/{}_result{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
if succes:
try:
cv2.imwrite('{}/_Results/{}_result_{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
except:
print('something went wrong')
if __name__ == "__main__":
# Create directory to save results if it doesn't exist
if not os.path.exists(targetDirectory + '/_Results'):
os.makedirs(targetDirectory + '/_Results')
# Get a list of all video files in the target directory
all_files = get_videofiles(targetDirectory)
print(f'{len(all_files)} video files found')
# Create list of jobs (processes)
jobs = []
# Create and start processes
for file in all_files:
proc = Process(target=process_file, args=(file,))
jobs.append(proc)
for job in jobs:
job.start()
for job in jobs:
job.join()
# TODO: Print some form of progress feedback
print('Finished :)')
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Here's a very simple way to get progress indication at minimal cost:
from multiprocessing.pool import Pool
from random import randint
from time import sleep
from tqdm import tqdm
def process(fn) -> bool:
sleep(randint(1, 3))
return randint(0, 100) < 70
files = [f"file-{i}.mp4" for i in range(20)]
success = []
failed = []
NPROC = 5
pool = Pool(NPROC)
for status, fn in tqdm(zip(pool.imap(process, files), files), total=len(files)):
if status:
success.append(fn)
else:
failed.append(fn)
print(f"{len(success)} succeeded and {len(failed)} failed")
Some comments:
tqdm is a 3rd-party library which implements progressbars extremely well. There are others. pip install tqdm.
we use a pool (there's almost never a reason to manage processes yourself for simple things like this) of NPROC processes. We let the pool handle iterating our process function over the input data.
we signal state by having the function return a boolean (in this example we choose randomly, weighting in favour of success). We don't return the filename, although we could, because it would have to be serialised and sent from the subprocess, and that's unnecessary overhead.
we use Pool.imap, which returns an iterator which keeps the same order as the iterable we pass in. So we can use zip to iterate files directly. Since we use an iterator with unknown size, tqdm needs to be told how long it is. (We could have used pool.map, but there's no need to commit the ram---although for one bool it probably makes no difference.)
I've deliberately written this as a kind of recipe. You can do a lot with multiprocessing just by using the high-level drop in paradigms, and Pool.[i]map is one of the most useful.
References
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
https://tqdm.github.io/

Using multithreading for a function with a while loop?

I have a pretty basic function that iters through a directory, reading files and collecting data, however it does this way too slow and only uses about a quarter of each core (quad-core i5 CPU) for processing power. How can I run the function 4 times simultaneously. Because it's going through a rather large directory, could I have the parameter use random.shuffle()? Here's the code I have now:
import multiprocessing
def function():
while True:
pass #do the code. variables are assigned inside the function.
with Pool(processes=4) as pool:
pool.map(function)
Because the function doesn't take any parameters, what can I do?
I didn't use map(), it is said map takes only one iterable argument, theoretically, you either modify your fuction() to function(one_arg) or try to use an empty list or tuple or other iterable structure but I didn't test it.
I suggest you put all files into queue(can be shared by processes), and share the queue to multiple processes(in your case it is 4). Use try-except to quit when finish reading a file. Creates 4 processes to consume the files queue and quit until all files are processed.
Queue is easy for you to tell whether there's more files need to be read or not based on Queue.Empty and TimeoutError
from multiprocessing import Process
import Queue
def function(files_queue):
try:
filename = files_queue.get(timeout=60) # set timeout
with open(filename) as inputs:
# process lines
# time consuming work is here
except (multiprocessing.TimeoutError, Queue.Empty) as toe:
# queue is empty or timeout
break
if __name__ == '__main__':
files_queue = ... # put all files into queue
processes = list()
# here you need a loop to create 4/multiple processes
p = Process(target=function, args=(files_queue,))
processes.add(p)
p.start()
for pro in processes:
pro.join()
This method pool.map(function) will create 4 threads, not actually 4 processes. All this "multiprocessing" will happen in the same process with 4 threads.
What I suggest is to use the multiprocessing.Process according the documentation here (Python 2) or here (Python 3).

Python multi-threaded processing with limited CPU/ports

I have a dictionary of folder names that I would like to process in parallel. Under each folder, there is an array of file names that I would like to process in series:
folder_file_dict = {
folder_name : {
file_names_key : [file_names_array]
}
}
Ultimately, I will be creating a folder named folder_name which contains the files with names len(folder_file_dict[folder_name][file_names_key]). I have a method like so:
def process_files_in_series(file_names_array, udp_port):
for file_name in file_names_array:
time_consuming_method(file_name, udp_port)
# create "file_name"
udp_ports = [123, 456, 789]
Note the time_consuming_method() above, which takes a long time due to calls over a UDP port. I am also limited to using the UDP ports in the array above. Thus, I have to wait for time_consuming_method to complete on a UDP port before I can use that UDP port again. This means that I can only have len(udp_ports) threads running at a time.
Thus, I will ultimately create len(folder_file_dict.keys()) threads, with len(folder_file_dict.keys()) calls to process_files_in_series. I also have a MAX_THREAD count. I am trying to use the Queue and Threading modules, but I am not sure what kind of design I need. How can I do this using Queues and Threads, and possibly Conditions as well? A solution that uses a thread pool may also be helpful.
NOTE
I am not trying to increase the read/write speed. I am trying to parallelize the calls to time_consuming_method under process_files_in_series. Creating these files is just part of the process, but not the rate limiting step.
Also, I am looking for a solution that uses Queue, Threading, and possible Condition modules, or anything relevant to those modules. A threadpool solution may also be helpful. I cannot use processes, only threads.
I am also looking for a solution in Python 2.7.
Using a thread pool:
#!/usr/bin/env python2
from multiprocessing.dummy import Pool, Queue # thread pool
folder_file_dict = {
folder_name: {
file_names_key: file_names_array
}
}
def process_files_in_series(file_names_array, udp_port):
for file_name in file_names_array:
time_consuming_method(file_name, udp_port)
# create "file_name"
...
def mp_process(filenames):
udp_port = free_udp_ports.get() # block until a free udp port is available
args = filenames, udp_port
try:
return args, process_files_in_series(*args), None
except Exception as e:
return args, None, str(e)
finally:
free_udp_ports.put_nowait(udp_port)
free_udp_ports = Queue() # in general, use initializer to pass it to children
for port in udp_ports:
free_udp_ports.put_nowait(port)
pool = Pool(number_of_concurrent_jobs) #
for args, result, error in pool.imap_unordered(mp_process, get_files_arrays()):
if error is not None:
print args, error
I don't think you need to bind number of threads to number of udp ports if the processing time may differ for different filenames arrays.
If I understand the structure of folder_file_dict correctly then to generate the filenames arrays:
def get_files_arrays(folder_file_dict=folder_file_dict):
for folder_name_dict in folder_file_dict.itervalues():
for filenames_array in folder_name_dict.itervalues():
yield filenames_array
Use the multiprocessing.pool.ThreadPool. It handles queue / thread management for you and can be easily changed to do multiprocessing instead.
EDIT: Added example
Here's an example... multiple threads may end up using the same udp port. I'm not sure if that's a problem for you.
import multithreading
import multithreading.pool
import itertools
def process_files_in_series(file_names_array, udp_port):
for file_name in file_names_array:
time_consuming_method(file_name, udp_port)
# create "file_name"
udp_ports = [123, 456, 789]
folder_file_dict = {
folder_name : {
file_names_key : [file_names_array]
}
}
def main(folder_file_dict, udp_ports):
# number of threads - here I'm limiting to the smaller of udp_ports,
# file lists to process and a cap I arbitrarily set to 4
num_threads = min(len(folder_file_dict), len(udp_ports), 4)
# the pool
pool = multithreading.pool.ThreadPool(num_threads)
# build files to be processed into list. You may want to do other
# Things like join folder_name...
file_arrays = [value['file_names_key'] for value in folder_file_dict.values()]
# do the work
pool.map(process_files_in_series, zip(file_arrays, itertools.cycle(udp_ports))
pool.close()
pool.join()
This is kind of a blue print to how you could use multiprocessing.Process
with JoinableQueue 's to deliver Jobs to Workers. You will
still be bound by I/O but with Process you do have true concurrency,
which may prove to be useful, since threading may even be slower than
a normal script processing the files.
(Be aware that this will also prevent you from doing anything else with your Laptop
if you dare to start too many processes at once :P).
I tried to explain the code
as much as possible with comments.
import traceback
from multiprocessing import Process, JoinableQueue, cpu_count
# Number if CPU's on your PC
cpus = cpu_count()
# The Worker Function. Could also be modelled as a class
def Worker(q_jobs):
while True:
# Try / Catch / finally may be necessary for error-prone tasks since the processes
# may hang forever if the task_done() method is not called.
try:
# Get an item from the Queue
item = q_jobs.get()
# At this point the data should somehow be processed
except:
traceback.print_exc()
else:
pass
finally:
# Inform the Queue that the Task has been done
# Without this. The processes can not be killed
# and will be left as Zombies afterwards
q_jobs.task_done()
# A Joinable Queue to end the process
q_jobs = JoinableQueue()
# Create process depending on the number of CPU's
for i in range(cpus):
# target function and arguments
# a list of multiple arguments should not end with ',' e.g.
# (q_jobs, 'bla')
p = Process(target=Worker,
args=(q_jobs,)
)
p.daemon = True
p.start()
# fill Queue with Jobs
q_jobs.put(['Do'])
q_jobs.put(['Something'])
# End Process
q_jobs.join()
Cheers
EDIT
I wrote this with Python 3 in mind.
Taking the parenthesis from the print function
print item
should make this work for 2.7.

Can I asynchronously delete a file in Python?

I have a long running python script which creates and deletes temporary files. I notice there is a non-trivial amount of time spent on file deletion, but the only purpose of deleting those files is to ensure that the program doesn't eventually fill up all the disk space during a long run. Is there a cross platform mechanism in Python to aschyronously delete a file so the main thread can continue to work while the OS takes care of the file delete?
You can try delegating deleting the files to another thread or process.
Using a newly spawned thread:
thread.start_new_thread(os.remove, filename)
Or, using a process:
# create the process pool once
process_pool = multiprocessing.Pool(1)
results = []
# later on removing a file in async fashion
# note: need to hold on to the async result till it has completed
results.append(process_pool.apply_async(os.remove, filename), callback=lambda result: results.remove(result))
The process version may allow for more parallelism because Python threads are not executing in parallel due to the notorious global interpreter lock. I would expect though that GIL is released when it calls any blocking kernel function, such as unlink(), so that Python lets another thread to make progress. In other words, a background worker thread that calls os.unlink() may be the best solution, see Tim Peters' answer.
Yet, multiprocessing is using Python threads underneath to asynchronously communicate with the processes in the pool, so some benchmarking is required to figure which version gives more parallelism.
An alternative method to avoid using Python threads but requires more coding is to spawn another process and send the filenames to its standard input through a pipe. This way you trade os.remove() to a synchronous os.write() (one write() syscall). It can be done using deprecated os.popen() and this usage of the function is perfectly safe because it only communicates in one direction to the child process. A working prototype:
#!/usr/bin/python
from __future__ import print_function
import os, sys
def remover():
for line in sys.stdin:
filename = line.strip()
try:
os.remove(filename)
except Exception: # ignore errors
pass
def main():
if len(sys.argv) == 2 and sys.argv[1] == '--remover-process':
return remover()
remover_process = os.popen(sys.argv[0] + ' --remover-process', 'w')
def remove_file(filename):
print(filename, file=remover_process)
remover_process.flush()
for file in sys.argv[1:]:
remove_file(file)
if __name__ == "__main__":
main()
You can create a thread to delete files, following a common producer-consumer pattern:
import threading, Queue
dead_files = Queue.Queue()
END_OF_DATA = object() # a unique sentinel value
def background_deleter():
import os
while True:
path = dead_files.get()
if path is END_OF_DATA:
return
try:
os.remove(path)
except: # add the exceptions you want to ignore here
pass # or log the error, or whatever
deleter = threading.Thread(target=background_deleter)
deleter.start()
# when you want to delete a file, do:
# dead_files.put(file_path)
# when you want to shut down cleanly,
dead_files.put(END_OF_DATA)
deleter.join()
CPython releases the GIL (global interpreter lock) around internal file deletion calls, so this should be effective.
Edit - new text
I would advise against spawning a new process per delete. On some platforms, process creation is quite expensive. Would also advise against spawning a new thread per delete: in a long-running program, you really never want the possibility of creating an unbounded number of threads at any point. Depending on how quickly file deletion requests pile up, that could happen here.
The "solution" above is wordier than the others, because it avoids all that. There's only one new thread total. Of course it could easily be generalized to use any fixed number of threads instead, all sharing the same dead_files queue. Start with 1, add more if needed ;-)
The OS-level file removal primitives are synchronous on both Unix and Windows, so I think you pretty much have to use a worker thread. You could have it pull files to delete off a Queue object, and then when the main thread is done with a file it can just post the file to the queue. If you're using NamedTemporaryFile objects, you probably want to set delete=False in the constructor and just post the name to the queue, not the file object, so you don't have object lifetime headaches.

python -> multiprocessing module

Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
Not using threads in Python as it essentially comes to running a single process (due to GIL).
Hence using multiprocessing module. i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good, now I need a shared object which all the sub-processes have access to. I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file. A potential place to use Locks I guess. With this setup when I run, I do not get any error (so the parent process seems fine), it just stalls. When I press ctrl-C I see a traceback (one for each sub-process). Also no output is written to the output file. Here's code (note that everything runs fine without multi-processes) -
import os
import glob
from multiprocessing import Process, Queue, Pool
data_file = open('out.txt', 'w+')
def worker(task_queue):
for file in iter(task_queue.get, 'STOP'):
data = mine_imdb_page(os.path.join(DATA_DIR, file))
if data:
data_file.write(repr(data)+'\n')
return
def main():
task_queue = Queue()
for file in glob.glob('*.csv'):
task_queue.put(file)
task_queue.put('STOP') # so that worker processes know when to stop
# this is the block of code that needs correction.
if multi_process:
# One way to spawn 4 processes
# pool = Pool(processes=4) #Start worker processes
# res = pool.apply_async(worker, [task_queue, data_file])
# But I chose to do it like this for now.
for i in range(4):
proc = Process(target=worker, args=[task_queue])
proc.start()
else: # single process mode is working fine!
worker(task_queue)
data_file.close()
return
what am I doing wrong? I also tried passing the open file_object to each of the processes at the time of spawning. But to no effect. e.g.- Process(target=worker, args=[task_queue, data_file]). But this did not change anything. I feel the subprocesses are not able to write to the file for some reason. Either the instance of the file_object is not getting replicated (at the time of spawn) or some other quirk... Anybody got an idea?
EXTRA: Also Is there any way to keep a persistent mysql_connection open & pass it across to the sub_processes? So I open a mysql connection in my parent process & the open connection should be accessible to all my sub-processes. Basically this is the equivalent of a shared_memory in python. Any ideas here?
Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a method called 'Pool' which is perfect for my needs.
It's optimizes itself to the number of cores my system has. i.e. only as many processes are spawned as the no. of cores. Of course this is customizable. So here's the code. Might help someone later-
from multiprocessing import Pool
def main():
po = Pool()
for file in glob.glob('*.csv'):
filepath = os.path.join(DATA_DIR, file)
po.apply_async(mine_page, (filepath,), callback=save_data)
po.close()
po.join()
file_ptr.close()
def mine_page(filepath):
#do whatever it is that you want to do in a separate process.
return data
def save_data(data):
#data is a object. Store it in a file, mysql or...
return
Still going through this huge module. Not sure if save_data() is executed by parent process or this function is used by spawned child processes. If it's the child which does the saving it might lead to concurrency issues in some situations. If anyone has anymore experience in using this module, you appreciate more knowledge here...
The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results and the main code gets the results and writes it to the file.

Categories

Resources