Processing many files in a folder in parallel using Python

I have a folder with 100 Excel files. My program has to process all of them, and I want to do this in parallel using multithreading or multiprocessing in Python. I am planning to use 10 threads or processes, where each of them processes 10 files: the first thread/process should handle files 1-10, the second files 11-20, and so on. I tried multithreading in Python but I am not sure how to assign a specific range of files to each thread. Any suggestions are most welcome.

The simplest way to do multiprocessing is as follows:
import multiprocessing

files = [...]  # list of file names, generated however you like

def process_file(file_name):
    ...  # process the file named file_name however you want

with multiprocessing.Pool(10) as pool:
    pool.map(process_file, files, 10)
The first 10 is the number of worker processes in the pool (Pool creates processes, not threads). The second 10 is map()'s chunksize: file names are dispatched to each worker in chunks of 10. Variants of map() exist that can take care of many of your needs.
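For completeness, here is a minimal, self-contained sketch of the same approach. The folder name and the body of process_file() below are placeholder assumptions, not part of the original answer; imap_unordered() is one of the map() variants mentioned above and yields results as soon as workers finish them.

import glob
import multiprocessing

def process_file(file_name):
    # Placeholder work: report each file's size.
    # Replace this body with the real Excel processing.
    with open(file_name, "rb") as fh:
        return file_name, len(fh.read())

if __name__ == "__main__":
    # Hypothetical location of the 100 Excel files.
    files = glob.glob("input_folder/*.xlsx")
    with multiprocessing.Pool(10) as pool:
        # chunksize=10 hands each worker batches of 10 file names.
        for name, size in pool.imap_unordered(process_file, files, chunksize=10):
            print(name, size)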

Python 3 has a built-in threading module. Here is an example:
from threading import Thread
import time
import random

def slow_function(i):
    time.sleep(random.randint(1, 10))
    print(i)

def running_threads():
    threads = []
    for i in range(10):
        t = Thread(target=slow_function, args=(i,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # make sure all threads are done before doing anything else with the results

running_threads()
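To get the exact 1-10 / 11-20 split asked about in the question, one option is to slice the file list into chunks of 10 and give each thread one chunk. This is a minimal sketch with made-up file names standing in for the real folder contents.

from threading import Thread

def process_chunk(file_names):
    for name in file_names:
        pass  # process a single file here

if __name__ == "__main__":
    files = ["file%03d.xlsx" % i for i in range(1, 101)]  # hypothetical names
    chunks = [files[i:i + 10] for i in range(0, len(files), 10)]  # 10 chunks of 10 files
    threads = [Thread(target=process_chunk, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()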

Related

Multiprocessing vs Threading in Python

I am learning multiprocessing and threading in Python to process and create a large number of files; the workflow is shown in a diagram linked in the original post.
Each output file depends on the analysis of all the input files.
Running the program in a single process takes quite a long time, so I tried the following code:
(a) multiprocessing
start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    p.apply_async(my_read_process_and_write_func, args=(i, w))
p.close()
p.join()
end = time.time()
(b) threading
start = time.time()
thread_count = cpu_count()
thread_list = []
for i in range(0, thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)
for t in thread_list:
    t.start()
for t in thread_list:
    t.join()
end = time.time()
I am running this code with Python 3.6 on a Windows PC with 8 cores. However, the multiprocessing version takes about the same time as the single-process version, and the threading version takes about 75% of the single-process time.
My questions are:
Is my code correct?
Is there a better way (or better code) to improve the efficiency?
Thanks!
Your processing is I/O bound, not CPU bound. As a result, having multiple processes helps little: each worker process is stuck waiting for input or output while the CPU does nothing. Increasing the Pool size should improve performance.
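As a rough illustration of that suggestion, the pool can be sized well beyond the core count when workers mostly wait on disk. The factor of 4 and the placeholder worker below are assumptions to tune, not part of the original code.

from multiprocessing import Pool, cpu_count

def my_read_process_and_write_func(i):
    pass  # stands in for the I/O-heavy read/process/write from the question

if __name__ == "__main__":
    workers = 4 * cpu_count()  # oversubscribe because workers block on I/O, not the CPU
    with Pool(workers) as p:
        p.map(my_read_process_and_write_func, range(workers))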
Following Tarik's answer: since my processing is I/O bound, I made several copies of the input files, so that each process reads and processes a different copy.
Now my code runs 8 times faster.
My processing diagram now looks like this:
my input files consist of one index file (about 400 MB) and 100 other files (each 330 MB), which can be considered a file pool.
To generate one output file, the index file and all files in the file pool need to be read (e.g. if the first line of the index file is 15, then line 15 of each file in the pool is read to generate output file 1).
Previously I tried multiprocessing and threading without making copies, and the code was very slow. I then optimized it by making a copy of only the index file for each process, so each process reads its own copy of the index file and then reads the file pool to generate the output files.
Currently, with 8 CPU cores, multiprocessing with a pool size of 8 takes the least time.
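A rough sketch of that layout, assuming hypothetical file names: each worker copies the index file once and then reads its own copy, so the workers do not contend for the same file.

import shutil
from multiprocessing import Pool, cpu_count

INDEX_FILE = "index.dat"  # hypothetical name for the ~400 MB index file

def my_read_process_and_write_func(i):
    my_index = "index_copy_%d.dat" % i
    shutil.copy(INDEX_FILE, my_index)  # each process works from its own copy
    with open(my_index) as idx:
        pass  # read the index, then pull the matching lines from the file pool

if __name__ == "__main__":
    with Pool(cpu_count()) as p:  # pool size 8 on the asker's machine
        p.map(my_read_process_and_write_func, range(cpu_count()))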

Python ThreadPoolExecutor with continuous unbounded input

I have a folder on a server which continuously receives files throughout the day. I need to watch the directory and, once a file is received, start some processing on it. Sometimes processing can take quite a while depending on file size, which can reach up to 20 GB.
I am using concurrent.futures.ThreadPoolExecutor to process multiple files at a time, but I need some help understanding how to handle the scenario below:
I receive 5 files (4 small and 1 huge) at once, and ThreadPoolExecutor picks up all 5 for processing. It takes a few seconds to process the 4 small files but 20 minutes to process the large one. Meanwhile another 10 files are waiting in the folder while the large file is being processed.
I have set max_workers=5, but only one worker is busy processing the large file, and that blocks the execution of the next set of files. How can I start processing the other files while the other 4 workers are free?
import os
import time
import random
import concurrent.futures
import datetime
import functools

def process_file(file1, input_num):
    # Do some processing
    os.remove(os.path.join('C:\\temp\\abcd', file1))
    time.sleep(10)

def main():
    print("Start Time is ", datetime.datetime.now())
    # It will be a continuous loop which will watch a directory for incoming files
    while True:
        # Get the list of files in the directory
        file_list = os.listdir('C:\\temp\\abcd')
        print("file_list is", file_list)
        input_num = random.randint(1000000000, 9999999999)
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            process_file_arg = functools.partial(process_file, input_num=input_num)
            executor.map(process_file_arg, file_list)
        time.sleep(10)

if __name__ == '__main__':
    main()
The main() function continuously watches the directory and calls ThreadPoolExecutor.
I ran into the same problem; this answer may help you.
concurrent.futures.wait returns the futures as a named 2-tuple of sets, done and not_done, so we can consume the done part and keep adding new tasks to the not_done list to keep the parallel job running continuously. Here is an example snippet:
thread_list = []
with open(input_filename, 'r') as fp_in:
    with concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_LIMIT) as executor:
        for para_list in fp_in:
            thread_list.append(executor.submit(your_thread_func, para_list))
            if len(thread_list) >= THREAD_LIMIT:
                done, not_done = concurrent.futures.wait(thread_list, timeout=1,
                                                         return_when=concurrent.futures.FIRST_COMPLETED)
                # consume finished futures
                done_res = [i.result() for i in done]
                # and keep the unfinished ones
                thread_list = list(not_done)
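Applied to the directory-watching case from the question, the same wait() idea might look roughly like the sketch below. The folder path, the seen set, and the sleep interval are assumptions of this sketch, not code from the question.

import concurrent.futures
import os
import time

def process_file(path):
    pass  # the real per-file processing goes here

def watch_and_process(folder, max_workers=5):
    pending = set()
    seen = set()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while True:
            # Submit any newly arrived files without waiting for the slow one.
            for name in os.listdir(folder):
                path = os.path.join(folder, name)
                if path not in seen:
                    seen.add(path)
                    pending.add(executor.submit(process_file, path))
            if pending:
                done, pending = concurrent.futures.wait(
                    pending, timeout=1,
                    return_when=concurrent.futures.FIRST_COMPLETED)
                for fut in done:
                    fut.result()  # surface any exception raised in the worker
            else:
                time.sleep(1)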

Python - multiprocessing and text file processing

BACKGROUND:
I have a huge .txt file which I have to process. It is a data mining project.
So I split it into many .txt files, each 100 MB in size, saved them all in the same directory, and managed to run them this way:
from multiprocessing.dummy import Pool

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
    else:
        continue
In process(), I parse the file into a list of objects and then apply another function. This is SLOWER than running the whole file as is. But for big enough files I won't be able to run it in one go and will have to slice it. So I want to use threads, so that I don't have to wait for each process(filename) call to finish.
How can I apply that? I've checked this, but I didn't understand how to apply it to my code...
Any help would be appreciated.
I looked here to see how to do this. What I've tried:
pool = Pool(6)
for x in range(6):
    futures.append(pool.apply_async(process, filename))
Unfortunately I realized it will only do the first 6 text files, or will it not? How can I make it work so that as soon as a worker finishes, it is assigned another text file and starts running?
EDIT:
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
    else:
        continue
First, using multiprocessing.dummy will only give you a speed increase if your problem is I/O bound (reading the files is the main bottleneck); for CPU-intensive tasks (where processing the file is the bottleneck) it won't help, and you should use "real" multiprocessing instead.
The problem you describe seems better suited to one of the map functions of Pool:
from multiprocessing import Pool
files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
pool = Pool(6)
results = pool.map(process, files)
pool.close()
This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.
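If the work does turn out to be I/O bound, the thread-backed pool mentioned above is a near one-line swap. This sketch assumes a placeholder process() and a hypothetical directory; the interface is the same as multiprocessing.Pool.

import os
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but backed by threads

def process(filename):
    pass  # placeholder for the parsing work described in the question

if __name__ == "__main__":
    pathToFile = "."  # hypothetical directory of .txt files
    files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
    with Pool(6) as pool:  # 6 worker threads instead of 6 processes
        results = pool.map(process, files)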

Using multithreading for a function with a while loop?

I have a pretty basic function that iterates through a directory, reading files and collecting data; however, it does this far too slowly and only uses about a quarter of each core (quad-core i5 CPU) for processing power. How can I run the function 4 times simultaneously? Because it is going through a rather large directory, could I have the parameter use random.shuffle()? Here is the code I have now:
import multiprocessing

def function():
    while True:
        pass  # do the work; variables are assigned inside the function

with Pool(processes=4) as pool:
    pool.map(function)
Because the function doesn't take any parameters, what can I do?
I didn't use map() here; map() is said to take only one iterable argument, so in theory you would either modify your function() to function(one_arg) or pass an empty list, tuple, or other iterable, but I haven't tested that.
I suggest you put all the files into a queue (which can be shared by processes) and share that queue with multiple processes (4 in your case). Use try/except so a worker quits when there is nothing left to read; create 4 processes to consume the files queue, and each quits once all files are processed.
The queue makes it easy to tell whether there are more files to read, based on the queue.Empty exception and a timeout:
from multiprocessing import Process, Queue
import queue  # for the queue.Empty exception

def function(files_queue):
    while True:
        try:
            filename = files_queue.get(timeout=60)  # set a timeout
        except queue.Empty:
            # queue is empty or the timeout expired
            break
        with open(filename) as inputs:
            pass  # process lines; the time-consuming work goes here

if __name__ == '__main__':
    files_queue = Queue()  # put all files into this queue
    processes = list()
    for _ in range(4):  # create 4 (or more) processes
        p = Process(target=function, args=(files_queue,))
        processes.append(p)
        p.start()
    for pro in processes:
        pro.join()
Note that pool.map(function) on a multiprocessing.dummy pool would create 4 threads, not 4 processes, and all of that "multiprocessing" would happen inside a single process. For real parallel processes, I suggest using multiprocessing.Process as described in the documentation here (Python 2) or here (Python 3).
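A minimal sketch of that suggestion, splitting the directory listing into 4 slices and giving each multiprocessing.Process one slice; the directory path and the body of function() are assumptions, not the asker's actual code.

import os
from multiprocessing import Process

def function(file_names):
    for name in file_names:
        pass  # read the file and collect data here

if __name__ == "__main__":
    directory = "data"  # hypothetical directory
    files = [os.path.join(directory, f) for f in os.listdir(directory)]
    slices = [files[i::4] for i in range(4)]  # 4 roughly equal slices
    workers = [Process(target=function, args=(s,)) for s in slices]
    for w in workers:
        w.start()
    for w in workers:
        w.join()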

Python's multiprocessing is not creating tasks in parallel

I am learning about multiprocessing in Python using the multiprocessing library. To that end, I tried to write a program that divides a big file into several smaller chunks. First I read all the data from the input file, then I create worker tasks that each take a segment of the data and write that segment to a file. I expect to have as many parallel tasks running as there are segments, but that does not happen: I see at most two tasks, and the program terminates after that. What mistake am I making? The code is given below.
import multiprocessing

def worker(segment, x):
    fname = getFileName(x)
    writeToFile(segment, fname)

if __name__ == '__main__':
    with open(fname) as f:
        lines = f.readlines()
    jobs = []
    for x in range(0, numberOfSegments):
        segment = getSegment(x, lines)
        jobs.append(multiprocessing.Process(target=worker, args=(segment, x)))
        jobs[len(jobs)-1].start()
    for p in jobs:
        p.join
Process gives you one additional process (which, with your main process, gives you two). The call to join at the end of each loop waits for that process to finish before starting the next one. If you insist on using Process, you'll need to store the created processes (probably in a list) and join every process in a loop after your current loop.
You want the Pool class from multiprocessing (https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool)
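A rough sketch of that Pool-based suggestion for the segment-splitting task; the input file name, segment size, and output naming below are assumptions, not the asker's getSegment()/getFileName() helpers.

from multiprocessing import Pool

def worker(indexed_segment):
    index, segment = indexed_segment
    with open("segment_%d.txt" % index, "w") as out:  # hypothetical output name
        out.writelines(segment)

if __name__ == "__main__":
    with open("input.txt") as f:  # hypothetical input file
        lines = f.readlines()
    size = 1000  # hypothetical number of lines per segment
    segments = [(i, lines[start:start + size])
                for i, start in enumerate(range(0, len(lines), size))]
    with Pool() as pool:
        pool.map(worker, segments)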
