I use the multiprocessing package to run the function run_performance, which loads zip files that each contain several csv files.
I am trying to display a progress bar properly, sized by the number of csv files in each zipfile.
With my code, the display is incoherent/wrong:
My code:
from alive_progress import alive_bar
from contextlib import closing
from multiprocessing import Process
from zipfile import ZipFile
import os
import zipfile

def get_filepaths(directory):
    file_paths = []  # List which will store all of the full filepaths.
    # Walk the tree.
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)  # Add it to the list.
    return file_paths  # Self-explanatory.
def count_files_7z(myarchive):
    cnt_files = []
    with closing(ZipFile(myarchive)) as archive:
        for csv in archive.namelist():
            cnt_files.append(csv)
    return cnt_files
def run_performance(zipobj):
    zf = zipfile.ZipFile(zipobj)
    cnt = count_files_7z(zipobj)
    with alive_bar(len(cnt)) as bar:
        for f in zf.namelist():
            bar()
            with zf.open(f) as myfile:
                print(myfile)  # and do other things
list_dir = "path_of_zipfiles"

for idx1, folder in enumerate(list_dir):
    get_all_zips = get_filepaths(folder)
    for idx2, zip_file in enumerate(get_all_zips):
        with zipfile.ZipFile(zip_file) as zipobj:
            p = Process(target=run_performance, args=(zipobj.filename,))
            p.start()
    p.join()
My display:
|████▌ | ▄▆█ 1/9 [11%] in 0s (3.3/s, eta: 0s)|████▌ | ▄▆█ 1/9 [11%] in 0s (3.3/s, eta: 0s)|████▌ | ▄▆█ 1/9 [11%] in 0s (3.3/s, eta: 0s
...
If I place the line p.join() at the same indentation as p.start(), the display is correct, but the multiprocessing does not work anymore.
So the script takes much more time:
1m18s vs 0m14s
Desired output:
|████████████████████████████████████████| 1/1 [100%] in 2.4s (0.41/s)
|████████████████████████████████████████| 2/2 [100%] in 4.7s (0.43/s)
|████████████████████ | ▄▂▂ 1/2 [50%] in 2s (0.6/s, eta: 0s)
First a few general comments concerning your code. In your main process you use a path to a file to open a zip archive just to retrieve the original file name back. That really does not make too much sense. Then in count_files_7z you iterate the return value of zf.namelist() to build a list of the files within the archive, but zf.namelist() is already a list of those files. That does not make too much sense either. You also use the context manager function closing to ensure that the archive is closed at the end of the block, but the with block itself is a context manager that serves the same purpose.
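For illustration, a minimal sketch of that point (the helper name here is made up and is not part of the code further below): the with block closes the archive on its own, and namelist() already gives you the list you want.

from zipfile import ZipFile

def count_zip_members(archive_path):
    # the with block already closes the archive, and namelist() is already a list
    with ZipFile(archive_path) as archive:
        return len(archive.namelist())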
I tried installing alive-progress and the progress bars were a mess. This is a task better suited to multithreading rather than multiprocessing. Actually, it is probably better suited to serial processing, since doing concurrent I/O operations to your disk, unless it is a solid state drive, is probably going to hurt performance. You will gain performance if there is heavy CPU-intensive processing involved for the files you read. If that is the case, I have passed to each thread a multiprocessing pool to which you can submit calls to apply, specifying functions in which you have placed CPU-intensive code. But the progress bars should work better when done under multithreading rather than multiprocessing. Even then I could not get any sort of decent display with alive-progress, which admittedly I did not spend too much time on. So I have switched to using the more common tqdm module available from the PyPI repository.
Even with tqdm there is a problem in that when a progress bar reaches 100%, tqdm must be writing something (a newline?) that relocates the other progress bars. Therefore, what I have done is specified leave=False, which causes the bar to disappear when it reaches 100%. But at least you can see all the progress bars without distortion as they are progressing.
from multiprocessing.pool import Pool, ThreadPool
from threading import Lock
import tqdm
from zipfile import ZipFile
import os
import heapq
import time

def get_filepaths(directory):
    file_paths = []  # List which will store all of the full filepaths.
    # Walk the tree.
    for root, directories, files in os.walk(directory):
        for filename in files:
            # Join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)  # Add it to the list.
    return file_paths  # Self-explanatory.

def get_free_position():
    """ Return the minimum possible position """
    with lock:
        free_position = heapq.heappop(free_positions)
    return free_position

def return_free_position(position):
    with lock:
        heapq.heappush(free_positions, position)

def run_performance(zip_file):
    position = get_free_position()
    with ZipFile(zip_file) as zf:
        file_list = zf.namelist()
        with tqdm.tqdm(total=len(file_list), position=position, leave=False) as bar:
            for f in file_list:
                with zf.open(f) as myfile:
                    ...  # do things with myfile (perhaps myfile.read())
                    # for CPU-intensive tasks: result = pool.apply(some_function, args=(arg1, arg2, ... argn))
                time.sleep(.005)  # simulate doing something
                bar.update()
    return_free_position(position)

def generate_zip_files():
    list_dir = ['path1', 'path2']
    for folder in list_dir:
        get_all_zips = get_filepaths(folder)
        for zip_file in get_all_zips:
            yield zip_file

# Required for Windows:
if __name__ == '__main__':
    N_THREADS = 5
    free_positions = list(range(N_THREADS))  # already a heap
    lock = Lock()
    pool = Pool()
    thread_pool = ThreadPool(N_THREADS)
    for result in thread_pool.imap_unordered(run_performance, generate_zip_files()):
        pass
    pool.close()
    pool.join()
    thread_pool.close()
    thread_pool.join()
The code above uses a multiprocessing thread pool arbitrarily limited in size to 5 just as a demo. You can increase or decrease N_THREADS to whatever value you want, but as I said, it may or may not help performance. If you want one thread per zip file then:
if __name__ == '__main__':
    zip_files = list(generate_zip_files())
    N_THREADS = len(zip_files)
    free_positions = list(range(N_THREADS))  # already a heap
    lock = Lock()
    pool = Pool()
    thread_pool = ThreadPool(N_THREADS)
    for result in thread_pool.imap_unordered(run_performance, zip_files):
        pass
    pool.close()
    pool.join()
    thread_pool.close()
    thread_pool.join()
In the Enlighten codebase there is an example of something similar. You would just substitute the process_files() function with your own.
It's a bit large to recreate here, but the idea is you should really only be doing console output in the main process and use some form of IPC to relay the information from subprocesses. The Enlighten example uses queues for IPC, which is pretty reasonable given it's only sending its current count.
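For illustration, a rough sketch of that queue-based idea (the worker function and item counts here are made up; this is not the Enlighten example itself): the workers only push counts onto a Queue and never print, while the main process owns the console.

from multiprocessing import Process, Queue
import time

N_WORKERS = 3
N_ITEMS = 10

def worker(worker_id, n_items, queue):
    for i in range(n_items):
        time.sleep(0.01)               # stand-in for the real per-file work
        queue.put((worker_id, i + 1))  # report progress only; no console output here

if __name__ == '__main__':
    queue = Queue()
    jobs = [Process(target=worker, args=(wid, N_ITEMS, queue)) for wid in range(N_WORKERS)]
    for p in jobs:
        p.start()
    done = {wid: 0 for wid in range(N_WORKERS)}
    while any(count < N_ITEMS for count in done.values()):
        wid, count = queue.get()       # only the main process touches the console
        done[wid] = count
        print('worker {}: {}/{}'.format(wid, count, N_ITEMS))  # or update an Enlighten/tqdm bar here
    for p in jobs:
        p.join()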
It seems that alive_bar remembers the position of the cursor when it was called, and starts drawing the bar from that point. When you start many processes, each one is not aware of the others and the output gets scrambled.
Indeed, there is an open issue in GitHub about this (see here). There are some hacky solutions for using multithreading, but I don't think that it will be easy to solve using multiprocessing, unless you implement some kind of interprocess communication that will slow things down.
Related
I've been busy writing my first multiprocessing code and it works, yay.
However, now I would like some feedback on the progress and I'm not sure what the best approach would be.
What my code (see below) does in short:
A target directory is scanned for mp4 files
Each file is analysed by a separate process, the process saves a result (an image)
What I'm looking for could be:
Simple
Each time a process finishes a file it sends a 'finished' message
The main code keeps count of how many files have finished
Fancy
Core 0 processing file 20 of 317 ||||||____ 60% completed
Core 1 processing file 21 of 317 |||||||||_ 90% completed
...
Core 7 processing file 18 of 317 ||________ 20% completed
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Thanks in advance!
EDIT: Changed my code that starts the processes as suggested by gsb22
My code:
# file operations
import os
import glob
# Multiprocessing
from multiprocessing import Process
# Motion detection
import cv2

# >>> Enter directory to scan as target directory
targetDirectory = "E:\Projects\Programming\Python\OpenCV\\videofiles"

def get_videofiles(target_directory):
    # Find all video files in directory and subdirectories and put them in a list
    videofiles = glob.glob(target_directory + '/**/*.mp4', recursive=True)
    # Return the list
    return videofiles

def process_file(videofile):
    '''
    What happens inside this function:
    - The video is processed and analysed using openCV
    - The result (an image) is saved to the results folder
    - Once this function receives the videofile it completes
      without the need to return anything to the main program
    '''
    # The processing code is more complex than this code below, this is just a test
    cap = cv2.VideoCapture(videofile)
    for i in range(10):
        succes, frame = cap.read()
        # cv2.imwrite('{}/_Results/{}_result{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
        if succes:
            try:
                cv2.imwrite('{}/_Results/{}_result_{}.jpg'.format(targetDirectory, os.path.basename(videofile), i), frame)
            except:
                print('something went wrong')

if __name__ == "__main__":
    # Create directory to save results if it doesn't exist
    if not os.path.exists(targetDirectory + '/_Results'):
        os.makedirs(targetDirectory + '/_Results')
    # Get a list of all video files in the target directory
    all_files = get_videofiles(targetDirectory)
    print(f'{len(all_files)} video files found')
    # Create list of jobs (processes)
    jobs = []
    # Create and start processes
    for file in all_files:
        proc = Process(target=process_file, args=(file,))
        jobs.append(proc)
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    # TODO: Print some form of progress feedback
    print('Finished :)')
I read all kinds of info about queues, pools, tqdm and I'm not sure which way to go. Could anyone point to an approach that would work in this case?
Here's a very simple way to get progress indication at minimal cost:
from multiprocessing.pool import Pool
from random import randint
from time import sleep
from tqdm import tqdm

def process(fn) -> bool:
    sleep(randint(1, 3))
    return randint(0, 100) < 70

files = [f"file-{i}.mp4" for i in range(20)]

success = []
failed = []
NPROC = 5
pool = Pool(NPROC)

for status, fn in tqdm(zip(pool.imap(process, files), files), total=len(files)):
    if status:
        success.append(fn)
    else:
        failed.append(fn)

print(f"{len(success)} succeeded and {len(failed)} failed")
Some comments:
tqdm is a 3rd-party library which implements progressbars extremely well. There are others. pip install tqdm.
we use a pool (there's almost never a reason to manage processes yourself for simple things like this) of NPROC processes. We let the pool handle iterating our process function over the input data.
we signal state by having the function return a boolean (in this example we choose randomly, weighting in favour of success). We don't return the filename, although we could, because it would have to be serialised and sent from the subprocess, and that's unnecessary overhead.
we use Pool.imap, which returns an iterator that keeps the same order as the iterable we pass in. So we can use zip to iterate files directly. Since we use an iterator with unknown size, tqdm needs to be told how long it is. (We could have used pool.map, but there's no need to commit the RAM, although for one bool it probably makes no difference.)
I've deliberately written this as a kind of recipe. You can do a lot with multiprocessing just by using the high-level drop-in paradigms, and Pool.[i]map is one of the most useful.
References
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
https://tqdm.github.io/
I had a look into many published issues without finding insights into my current issue.
I am dealing with multiprocessing runs of an external code. This external code eats input files. The file names are joined in a list that enables me to launch a pool for each file. A path is also needed.
for i in range(len(file2run)):
    pool.apply_async(runcase, args=(file2run[i], filepath))
The runcase function launches one process for a given input file and analyses and saves the results in some folder.
It works fine whatever the length of file2run is. The external code runs on several processes (as many as maxCPU), the pool being defined with:
pool = multiprocessing.Pool(processes=maxCPU)
My issue is that I'd like to take this a step further and integrate it in a for loop. In each loop iteration, several input files are created, and once all of the runs are finished a new set of input files is created and a pool is created again.
It works fine for two loops, but then I hit the error at "xxx", line 105, in spawn_main, exitcode = _main(fd), with a bunch of messages above it about a missing needed module. Same messages whether there are 2 or 1000 input files in each loop...
So I guess it's about the pool creation, but is there a way of clearing the variables between runs? I have tried to create the pool initialization (with the number of CPUs) at the very beginning of the main function, but the same issue arises... I have tried to make a sort of equivalent of Matlab's clear all function but always the same issue... and why does it work for two loops and not for the third one? Why does the 2nd one work?
Thanks in advance for any help (or for pointing out the right already-published issue).
Xavfa
Here is an attempt at an example that actually... works!
I copy/pasted my original script and made it much easier to share, for the sake of understanding the paradigm of my original attempt (the original one deals with objects of several kinds to build the input file and uses an embedded function of one of the objects to launch the external code with subprocess.check_all).
But the example keeps the overall paradigm of making input files in one folder and simulation results in another one with the multiprocessing package.
The original still doesn't work, still failing at the third round of the loop (the if __name__ == '__main__': part of multiproc_test.py).
here is one script (multiproc_test.py):
import os
import Simlauncher

def RunProcess(MainPath):
    file2run = Simlauncher.initiateprocess(MainPath)
    Simlauncher.RunMultiProc(file2run, MainPath, multi=True, maxcpu=0.7)

def LaunchProcess(nbcase):
    # example that builds the input files
    MainPath = os.getcwd()
    SimDir = os.path.join(os.getcwd(), 'SimFiles\\')
    if not os.path.exists(SimDir):
        os.mkdir(SimDir)
    for i in range(100):
        with open(SimDir + 'inputfile' + str(i) + '.mptest', 'w') as file:
            file.write('Hello World')
    RunProcess(MainPath)

if __name__ == '__main__':
    for i in range(1, 10):
        LaunchProcess(i)
        os.rename(os.path.join(os.getcwd(), 'SimFiles'), os.path.join(os.getcwd(), 'SimFiles' + str(i)))
here is the other one (Simlauncher.py) :
import multiprocessing as mp
import os

def initiateprocess(MainPath):
    filepath = MainPath + '\\SimFiles\\'
    listOfFiles = os.listdir(filepath)
    file2run = []
    for file in listOfFiles:
        if '.mptest' in file:
            file2run.append(file)
    return file2run

def runtestcase(file, filepath):
    filepath = filepath + '\\SimFiles'
    ResSimpath = filepath + '\\SimRes\\'
    if not os.path.exists(ResSimpath):
        os.mkdir(ResSimpath)
    with open(ResSimpath + 'Res_' + file, 'w') as res:
        res.write('I am done')
    print(file + ' is finished')

def RunMultiProc(file2run, filepath, multi, maxcpu):
    print('Launching cases :')
    nbcpu = mp.cpu_count()
    pool = mp.Pool(processes=int(nbcpu * maxcpu))
    for i in range(len(file2run)):
        pool.apply_async(runtestcase, args=(file2run[i], filepath))
    pool.close()
    pool.join()
    print('Done with this one !')
any help is still needed....
btw, the external code is energyplus (for building energy simulation)
Xavier
I apologise if this has already been asked, but I've read a heap of documentation and am still not sure how to do what I would like to do.
I would like to run a Python script over multiple cores simultaneously.
I have 1800 .h5 files in a directory, with names 'snapshots_s1.h5', 'snapshots_s2.h5', etc., each about 30MB in size. This Python script:
Reads in the h5py files one at a time from the directory.
Extracts and manipulates the data in the h5py file.
Creates plots of the extracted data.
Once this is done, the script then reads in the next h5py file from the directory and follows the same procedure. Hence, none of the processors need to communicate to any other whilst doing this work.
The script is as follows:
import h5py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import cmocean
import os

from mpi4py import MPI

de.logging_setup.rootlogger.setLevel('ERROR')

# Plot writes
count = 1
for filename in os.listdir('directory'):  ### [PERF] Applied to ~ 1800 .h5 files
    with h5py.File('directory/{}'.format(filename), 'r') as file:
        ### Manipulate 'filename' data. ### [PERF] Each fileI ~ 0.03 TB in size
        ...
        ### Plot 'filename' data. ### [PERF] Some fileO is output here
        ...
    count = count + 1
Ideally, I would like to use mpi4py to do this (for various reasons), though I am open to other options such as multiprocessing.Pool (which I couldn't actually get to work. I tried following the approach outlined here).
So, my question is: What commands do I need to put in the script to parallelise it using mpi4py? Or, if this option isn't possible, how else could I parallelise the script?
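For reference, a minimal mpi4py sketch of the usual rank-based split for this kind of independent, per-file work (the script name and the 'directory' placeholder are assumptions, and the processing itself is elided):

# run with e.g.: mpiexec -n 8 python plot_snapshots.py
import os
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

filenames = sorted(os.listdir('directory'))
for filename in filenames[rank::size]:   # rank r handles files r, r+size, r+2*size, ...
    with h5py.File(os.path.join('directory', filename), 'r') as file:
        ...   # extract, manipulate and plot exactly as in the serial script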
You should go with multiprocessing, and Javier's example should work, but I would like to break it down so you can understand the steps too.
In general, when working with pools you create a pool of processes that idle until you pass them some work. The ideal way to do it is to create a function that each process will execute separately.
def worker(fn):
    with h5py.File(fn, 'r') as f:
        # process data..
        return result
That simple. Each process will run this, and return the result to the parent process.
Now that you have the worker function that does the work, let's create the input data for it. It takes a filename, so we need a list of all files
full_fns = [os.path.join('directory', filename) for filename in os.listdir('directory')]
Next initialize the process pool.
import multiprocessing as mp

pool = mp.Pool(4)  # pass the amount of processes you want
results = pool.map(worker, full_fns)
# pool takes a worker function and input data

# you usually need to wait for all the subprocesses to finish their work before
# using the data, so you don't work on partial data
pool.close()
pool.join()
Now you can access your data through results.
for r in results:
    print(r)
Let me know in comments how this worked out for you
Multiprocessing should not be more complicated than this:
def process_one_file(fn):
    with h5py.File(fn, 'r') as f:
        ...
        return is_successful

fns = [os.path.join('directory', fn) for fn in os.listdir('directory')]
pool = multiprocessing.Pool()
for fn, is_successful in zip(fns, pool.imap(process_one_file, fns)):
    print(fn, "succeeded?", is_successful)
You should be able to implement multiprocessing easily using the multiprocessing library.
import glob
from multiprocessing.dummy import Pool

def processData(files):
    print(files)
    ...
    return result

allFiles = glob.glob("<file path/file mask>")
pool = Pool(6)  # for 6 threads for example
results = pool.map(processData, allFiles)
I have a program which copies large numbers of files from one location to another - I'm talking 100,000+ files (I'm copying 314g in image sequences at this moment). They're both on huge, VERY fast network storage RAID'd in the extreme. I'm using shutil to copy the files over sequentially and it is taking some time, so I'm trying to find the best way to optimize this. I've noticed some software I use effectively multi-threads reading files off of the network with huge gains in load times so I'd like to try doing this in python.
I have no experience with programming multithreading/multiprocessing - does this seem like the right area to proceed? If so what's the best way to do this? I've looked around a few other SO posts regarding threading file copying in python and they all seemed to say that you get no speed gain, but I do not think this will be the case considering my hardware. I'm nowhere near my IO cap at the moment and resources are sitting around 1% (I have 40 cores and 64g of RAM locally).
EDIT
Been getting some up-votes on this question (now a few years old) so I thought I'd point out one more thing to speed up file copies. In addition to the fact that you can easily 8x-10x copy speeds using some of the answers below (seriously!) I have also since found that shutil.copy2 is excruciatingly slow for no good reason. Yes, even in python 3+. It is beyond the scope of this question so I won't dive into it here (it's also highly OS and hardware/network dependent), beyond just mentioning that by tweaking the copy buffer size in the copy2 function you can increase copy speeds by yet another factor of 10! (however note that you will start running into bandwidth limits and the gains are not linear when multi-threading AND tweaking buffer sizes. At some point it does flat line).
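For illustration, a sketch of the kind of buffer tweak being described (the helper name is made up and this is not my exact change): copy the file data through shutil.copyfileobj with a much larger buffer than the default, then copy the metadata separately.

import shutil

def copy_with_big_buffer(src, dst, buffer_size=16 * 1024 * 1024):  # 16 MB; tune for your storage
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
        shutil.copyfileobj(fsrc, fdst, buffer_size)  # the stock copy uses a much smaller buffer
    shutil.copystat(src, dst)  # keep timestamps/permissions, roughly what copy2 adds over copy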
UPDATE:
I never did get Gevent working (first answer) because I couldn't install the module without an internet connection, which I don't have on my workstation. However I was able to decrease file copy times by a factor of 8 just using the built-in threading in Python (which I have since learned how to use) and I wanted to post it up as an additional answer for anyone interested! Here's my code below; it is probably important to note that my 8x speed-up will most likely differ from environment to environment due to your hardware/network set-up.
import Queue, threading, os, time
import shutil

fileQueue = Queue.Queue()
destPath = 'path/to/cop'

class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    lock = threading.Lock()

    def __init__(self):
        with open("filelist.txt", "r") as txt:  # txt with a file per line
            fileList = txt.read().splitlines()
        if not os.path.exists(destPath):
            os.mkdir(destPath)
        self.totalFiles = len(fileList)
        print str(self.totalFiles) + " files to copy."
        self.threadWorkerCopy(fileList)

    def CopyWorker(self):
        while True:
            fileName = fileQueue.get()
            shutil.copy(fileName, destPath)
            fileQueue.task_done()
            with self.lock:
                self.copyCount += 1
                percent = (self.copyCount * 100) / self.totalFiles
                print str(percent) + " percent copied."

    def threadWorkerCopy(self, fileNameList):
        for i in range(16):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()
        for fileName in fileNameList:
            fileQueue.put(fileName)
        fileQueue.join()

ThreadedCopy()
How about using a ThreadPool?
import os
import glob
import shutil
from functools import partial
from multiprocessing.pool import ThreadPool
DST_DIR = '../path/to/new/dir'
SRC_DIR = '../path/to/files/to/copy'
# copy_to_mydir will copy any file you give it to DST_DIR
copy_to_mydir = partial(shutil.copy, dst=DST_DIR)
# list of files we want to copy
to_copy = glob.glob(os.path.join(SRC_DIR, '*'))
with ThreadPool(4) as p:
    p.map(copy_to_mydir, to_copy)
This can be parallelized by using gevent in Python.
I would recommend the following logic to achieve speeding up 100k+ file copying:
Put names of all the 100K+ files, which need to be copied in a csv file, for eg: 'input.csv'.
Then create chunks from that csv file. The number of chunks should be decided based on the number of processors/cores in your machine.
Pass each of those chunks to separate threads.
Each thread sequentially reads filename in that chunk and copies it from one location to another.
Here goes the python code snippet:
import sys
import os
import multiprocessing

from gevent import monkey
monkey.patch_all()

from gevent.pool import Pool

def _copyFile(file):
    # over here, you can put your own logic of copying a file from source to destination
    pass

def _worker(csv_file, chunk):
    f = open(csv_file)
    f.seek(chunk[0])
    for file in f.read(chunk[1]).splitlines():
        _copyFile(file)

def _getChunks(file, size):
    f = open(file)
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            f.close()
            break

if __name__ == "__main__":
    if len(sys.argv) > 1:
        csv_file_name = sys.argv[1]
    else:
        print "Please provide a csv file as an argument."
        sys.exit()
    no_of_procs = multiprocessing.cpu_count() * 4
    file_size = os.stat(csv_file_name).st_size
    file_size_per_chunk = file_size / no_of_procs
    pool = Pool(no_of_procs)
    for chunk in _getChunks(csv_file_name, file_size_per_chunk):
        pool.apply_async(_worker, (csv_file_name, chunk))
    pool.join()
Save the file as file_copier.py.
Open terminal and run:
$ ./file_copier.py input.csv
While re-implementing the code posted by @Spencer, I ran into the same error as mentioned in the comments below the post (to be more specific: OSError: [Errno 24] Too many open files).
I solved this issue by moving away from the daemonic threads and using concurrent.futures.ThreadPoolExecutor instead. This seems to handle the opening and closing of the files to copy in a better way. All the code stayed the same besides the threadWorkerCopy(self, filename_list: List[str]) method, which now looks like this:
def threadWorkerCopy(self, filename_list: List[str]):
    """
    This function initializes the workers to enable the multi-threaded process. The workers are handled automatically with
    ThreadPoolExecutor. More infos about multi-threading can be found here: https://realpython.com/intro-to-python-threading/.
    A recurrent problem with the threading here was "OSError: [Errno 24] Too many open files". This was coming from the fact
    that daemon threads were not killed before the end of the script. Therefore, everything opened by them was never closed.

    Args:
        filename_list (List[str]): List containing the name of the files to copy.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=cores) as executor:
        executor.submit(self.CopyWorker)
        for filename in filename_list:
            self.file_queue.put(filename)
        self.file_queue.join()  # program waits for this process to be done.
If you just want to copy a directory tree from one path to another, here's my solution that's a little simpler than the previous solutions. It leverages multiprocessing.pool.ThreadPool and uses a custom copy function for shutil.copytree:
import shutil
from multiprocessing.pool import ThreadPool

class MultithreadedCopier:
    def __init__(self, max_threads):
        self.pool = ThreadPool(max_threads)

    def copy(self, source, dest):
        self.pool.apply_async(shutil.copy2, args=(source, dest))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.pool.close()
        self.pool.join()

src_dir = "/path/to/src/dir"
dest_dir = "/path/to/dest/dir"

with MultithreadedCopier(max_threads=16) as copier:
    shutil.copytree(src_dir, dest_dir, copy_function=copier.copy)
I am trying to parse many files found in a directory, however using multiprocessing slows my program.
# Calling my parsing function from Client.
L = getParsedFiles('/home/tony/Lab/slicedFiles')  # <--- 1000 .txt files found here, combined ~100MB
Following this example from python documentation:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
I've written this piece of code:
from multiprocessing import Pool
from api.ttypes import *
import gc
import os

def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y = [int(v) for v in s]
            obj = CoresetPoint(x, y)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return Points(myList)

def getParsedFiles(pathToFile):
    myList = []
    p = Pool(2)
    for filename in os.listdir(pathToFile):
        if filename.endswith(".txt"):
            myList.append(filename)
    return p.map(_parse, myList)
I followed the example: I put all the names of the files that end with .txt in a list, then created a Pool and mapped the list to my function. I then want to return a list of objects; each object holds the parsed data of a file. However, it amazes me that I got the following results:
#Pool 32 ---> ~162(s)
#Pool 16 ---> ~150(s)
#Pool 12 ---> ~142(s)
#Pool 2 ---> ~130(s)
Graph: (plot of the timings above, image not reproduced here)
Machine specification:
62.8 GiB RAM
Intel® Core™ i7-6850K CPU @ 3.60GHz × 12
What am I missing here ?
Thanks in advance!
Looks like you're I/O bound:
In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.
You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to using map.
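A rough sketch of that idea (parse_text is a made-up stand-in for the real parsing): the main process does all the reading and only hands the text to the pool with apply_async.

import os
from multiprocessing import Pool

def parse_text(text):
    # stand-in for the real parsing; only the file's text is sent to the worker
    return [tuple(int(v) for v in line.split()) for line in text.splitlines()]

if __name__ == '__main__':
    directory = '/home/tony/Lab/slicedFiles'
    with Pool(2) as pool:
        pending = []
        for filename in os.listdir(directory):
            if filename.endswith('.txt'):
                with open(os.path.join(directory, filename)) as f:
                    data = f.read()                      # single reader: the main process
                pending.append(pool.apply_async(parse_text, (data,)))
        parsed = [res.get() for res in pending]          # collect results as workers finish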
As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:
Passing one line at a time might be too slow, so we can ask map to pass chunks, and can adjust until we find a sweet-spot. Our function parses chunks of lines:
def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)
And our main function:
import fileinput

def getParsedFiles(directory):
    pool = Pool(2)
    txts = [filename for filename in os.listdir(directory)
            if filename.endswith(".txt")]
    return pool.imap(_parse_coreset_points, fileinput.input(txts), chunksize=100)
In general it is never a good idea to read from the same physical (spinning) hard disk from different threads simultaneously, because every switch causes an extra delay of around 10ms to position the read head of the hard disk (would be different on SSD).
As @peter-wood already said, it is better to have one thread reading in the data, and have other threads processing that data.
Also, to really test the difference, I think you should do the test with some bigger files. For example: current hard disks should be able to read around 100MB/sec. So reading the data of a 100kB file in one go would take 1ms, while positioning the read head to the beginning of that file would take 10ms.
On the other hand, looking at your numbers (assuming those are for a single loop) it is hard to believe that being I/O bound is the only problem here. Total data is 100MB, which should take 1 second to read from disk plus some overhead, but your program takes 130 seconds. I don't know if that number is with the files cold on disk, or an average of multiple tests where the data is already cached by the OS (with 62 GB of RAM all that data should be cached the second time) - it would be interesting to see both numbers.
So there has to be something else. Let's take a closer look at your loop:
for line in f:
    s = line.split()
    x, y = [int(v) for v in s]
    obj = CoresetPoint(x, y)
    gc.disable()
    myList.append(obj)
    gc.enable()
While I don't know Python, my guess would be that the gc calls are the problem here. They are called for every line read from disk. I don't know how expensive those calls are (or what happens if gc.enable() triggers a garbage collection, for example) and why they would be needed around append(obj) only, but there might be other problems because this is multithreading:
Assuming the gc object is global (i.e. not thread local) you could have something like this:
thread 1 : gc.disable()
# switch to thread 2
thread 2 : gc.disable()
thread 2 : myList.append(obj)
thread 2 : gc.enable()
# gc now enabled!
# switch back to thread 1 (or one of the other threads)
thread 1 : myList.append(obj)
thread 1 : gc.enable()
And if the number of threads <= number of cores, there wouldn't even be any switching, they would all be calling this at the same time.
Also, if the gc object is thread safe (it would be worse if it isn't) it would have to do some locking in order to safely alter its internal state, which would force all other threads to wait.
For example, gc.disable() would look something like this:
def disable():
    lock()  # all other threads are blocked for gc calls now
    alter internal data
    unlock()
And because gc.disable() and gc.enable() are called in a tight loop, this will hurt performance when using multiple threads.
So it would be better to remove those calls, or place them at the beginning and end of your program if they are really needed (or only disable gc at the beginning, no need to do gc right before quitting the program).
Depending on the way Python copies or moves objects, it might also be slightly better to use myList.append(CoresetPoint(x, y)).
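For reference, a minimal sketch of the question's _parse with the gc calls hoisted out of the tight loop and the object appended directly (CoresetPoint and Points are the question's own classes from api.ttypes):

import gc

def _parse(pathToFile):
    myList = []
    gc.disable()  # if it is needed at all, disable once here instead of once per line
    with open(pathToFile) as f:
        for line in f:
            x, y = [int(v) for v in line.split()]
            myList.append(CoresetPoint(x, y))  # append the object directly
    gc.enable()
    return Points(myList)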
So it would be interesting to test the same on one 100MB file with one thread and without the gc calls.
If the processing takes longer than the reading (i.e. not I/O bound), use one thread to read the data in a buffer (should take 1 or 2 seconds on one 100MB file if not already cached), and multiple threads to process the data (but still without those gc calls in that tight loop).
You don't have to split the data into multiple files in order to be able to use threads. Just let them process different parts of the same file (even with the 14GB file).
A copy-paste snippet, for people who come from Google and don't like reading
The example is for json reading; just replace __single_json_loader with a loader for another file type to work with that.
from multiprocessing import Pool
from typing import Callable, Any, Iterable
import os
import json

def parallel_file_read(existing_file_paths: Iterable[str], map_lambda: Callable[[str], Any]):
    result = {p: None for p in existing_file_paths}
    pool = Pool()
    for i, (temp_result, path) in enumerate(zip(pool.imap(map_lambda, existing_file_paths), result.keys())):
        result[path] = temp_result
    pool.close()
    pool.join()
    return result

def __single_json_loader(f_path: str):
    with open(f_path, "r") as f:
        return json.load(f)

def parallel_json_read(existing_file_paths: Iterable[str]):
    combined_result = parallel_file_read(existing_file_paths, __single_json_loader)
    return combined_result
And usage
if __name__ == "__main__":
    def main():
        directory_path = r"/path/to/my/file/directory"
        assert os.path.isdir(directory_path)

        d: os.DirEntry
        all_files_names = [f for f in os.listdir(directory_path)]
        all_files_paths = [os.path.join(directory_path, f_name) for f_name in all_files_names]
        assert(all(os.path.isfile(p) for p in all_files_paths))

        combined_result = parallel_json_read(all_files_paths)

    main()
It is very straightforward to replace the json reader with any other reader, and you're done.
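For instance, a small sketch of swapping in a plain-text loader (the loader name is made up), reusing parallel_file_read and the imports from the snippet above:

def __single_text_loader(f_path: str):
    with open(f_path, "r") as f:
        return f.read()

def parallel_text_read(existing_file_paths: Iterable[str]):
    return parallel_file_read(existing_file_paths, __single_text_loader)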