How to do multiprocessing of image augmentations for a large quantity of images? - python

I'm doing image augmentations on images gathered from a large number of folders. Doing this sequentially takes a lot of time, so I've been running the same script in several terminals at once, each handling a different slice of the folder list by hard-coding the start and end indices, as shown in the code below.
def do_augmentations(all_folders):
    total_count = len(all_folders)
    split_no = total_count // 2
    start = split_no          # hard-coded per terminal: this copy handles the second half
    if split_no == 0:
        split_no = 1
    end = total_count
    for folder in all_folders[start:end]:
        all_imgs = list(paths.list_images(folder))
        count = len(all_imgs)
        for img in all_imgs:
            # augmentations() is my own function; assumed to return the augmented image
            augmented = augmentations(img)
            cv2.imwrite(img, augmented)

def main():
    all_folders = os.walk(folderpath)
    do_augmentations(all_folders)
Rather than running several copies of the script by hand, I was wondering whether I could use multiple CPU cores in parallel with Python's multithreading or multiprocessing packages, because the sequential run takes a long time. Right now I set the start and end folder indices by hand in each terminal to finish faster. I tried the multiprocessing library to parallelize this, but it still runs in the same sequential manner as before. Below is my attempt:
from multiprocessing.dummy import Pool as ThreadPool

def do_augmentations(args):
    all_folders = args[0]
    bpath = args[1]
    for folder in all_folders:
        all_imgs = list(paths.list_images(folder))
        count = len(all_imgs)
        for img in all_imgs:
            augmented = augmentations(img)
            cv2.imwrite(img, augmented)

def main():
    bpath = 'img_foldr'
    all_folders = os.walk(folderpath)
    pool = ThreadPool(4)
    # the iterable has only one element, so only one worker ever gets a task
    pool.map(do_augmentations, [[all_folders, bpath]])
When running this, it processes one folder at a time in a loop instead of working on many folders in parallel. I don't understand what I'm doing wrong. Any help or suggestions would be appreciated.
Update:
I tried the answer given by Jan Wilamowski, as below:
from itertools import chain
from multiprocessing import Pool

def augment(image):
    augmented = do_augmentation(image)
    cv2.imwrite(image, augmented)

def main():
    all_folders = os.walk(imagefolder)
    # chain() over a generator expression yields generator objects as items
    all_images = chain(paths.list_images(folder) for folder in all_folders)
    pool = Pool(4)
    pool.map(augment, all_images)
and I get the error below:
Traceback (most recent call last):
  File "aug_img.py", line 424, in <module>
    main()
  File "aug_img.py", line 346, in main
    pool.map(augment, all_images)
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\pool.py", line 537, in _handle_tasks
    put(task)
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\mathew\anaconda3\envs\retinanet\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle 'generator' object

Have your function work on a single folder and pass the folder list to pool.map(). Also, use a process pool to avoid problems with the GIL (as pointed out by several commenters):
from multiprocessing import Pool

def do_augmentations(folder):
    all_imgs = list(paths.list_images(folder))
    count = len(all_imgs)
    for img in all_imgs:
        augmented = augmentations(img)
        cv2.imwrite(img, augmented)

def main():
    all_folders = os.walk(folderpath)
    pool = Pool(4)
    pool.map(do_augmentations, all_folders)
You could also break it down further and have your function work on a single image, giving more consistent performance:
from imutils import paths

def augment(image):
    augmented = augmentations(image)
    cv2.imwrite(image, augmented)

def main():
    all_images = paths.list_images(folderpath)
    pool = Pool(4)
    pool.map(augment, all_images)
Note however that disk I/O can be a bottleneck so don't expect a linear performance improvement.
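If you hit the TypeError: cannot pickle 'generator' object from the update above, it is because chain() over a generator expression yields generator objects as the items that pool.map then tries to pickle. A minimal sketch of a fix (assuming the augmentations() function and folderpath from the question) is to flatten everything into a plain list of path strings first:
import os
from itertools import chain
from multiprocessing import Pool

import cv2
from imutils import paths

def augment(img_path):
    # augmentations() is your own function; assumed here to return the
    # augmented image as an array that cv2.imwrite can save
    augmented = augmentations(img_path)
    cv2.imwrite(img_path, augmented)

def main():
    # os.walk() yields (dirpath, dirnames, filenames) tuples, so pull out the paths
    all_folders = [dirpath for dirpath, _, _ in os.walk(folderpath)]
    # materialise a flat list of image paths: plain strings pickle fine,
    # unlike the nested generators that caused the TypeError above
    all_images = list(chain.from_iterable(paths.list_images(f) for f in all_folders))
    with Pool(4) as pool:
        pool.map(augment, all_images)

if __name__ == '__main__':
    main()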

Let me give you my simple multiprocessing recipe for any task, not just augmentation.
This is a top-down view.
import os
import multiprocessing
from workers import batch_function

no_of_cpus = os.cpu_count()
all_folders = os.walk(folderpath)
input_list = get_all_the_files_from_folder(all_folders)  # this is specific to your file structure
mp_dict = split_input_list(process_number=no_of_cpus, input_list=input_list)
pool = multiprocessing.Pool()
results = pool.map(batch_function, mp_dict)  # call image augmentations
First I build a list of all the data that needs to be preprocessed, then split it into as many pieces as I want processes, using a function split_input_list.
If you don't need to return anything from the batch function you don't need the results variable; otherwise you get a list with one result per process, which you can iterate with for res in results:, no matter what you return from batch_function.
def split_input_list(process_number, input_list):
    # splits a dict of work items into `process_number` roughly equal chunks
    dict_list = []
    pn = process_number
    d = input_list
    finish = 0
    for i in range(pn - 1):
        start = len(d) // pn * i
        finish = len(d) // pn * (i + 1)
        split_dict = dict(list(d.items())[start:finish])
        print(len(split_dict))
        dict_list.append(split_dict)
    last_dict = dict(list(d.items())[finish:])
    dict_list.append(last_dict)
    print(len(last_dict))
    return dict_list
Then, in a separate workers.py file, I usually have several batch_function variants for different tasks. For augmentations I would do something like this:
def batch_function(local_split):
    # local_split is one of the dict chunks produced by split_input_list
    for k, _ in local_split.items():
        augmentations(k)
    ...
    # return what_you_need
Also, unless you have a generous amount of RAM (and a CPU with something like 32 cores), expect some crashes from running out of memory, since every chunk is built up front.
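If memory does become an issue, a variation worth trying (a sketch under the same assumptions, with augmentations() standing in for your per-item work and get_all_the_files_from_folder/folderpath as above) is to skip the manual splitting and stream items through imap_unordered with a chunksize, so only a bounded number of results is held at any time:
import os
import multiprocessing

def process_one(item):
    # augmentations() stands in for whatever batch_function does per item
    augmentations(item)
    return item

if __name__ == '__main__':
    input_list = get_all_the_files_from_folder(os.walk(folderpath))
    with multiprocessing.Pool(os.cpu_count()) as pool:
        # imap_unordered yields results as they complete instead of holding
        # one big list per split in memory, so memory use stays bounded
        for done in pool.imap_unordered(process_one, input_list, chunksize=64):
            pass  # collect or log results here if needed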

Related

Make use of multiple cores using multiprocessing converting lidar point cloud files to raster

I have many classified lidar point cloud files that I want to convert to GeoTIFF raster files. For that I wrote a function that creates the JSON pipeline file required by PDAL for the conversion and then executes that pipeline.
import glob
import json
import pdal

tiles = []
for file in glob.glob("*.las"):
    tiles.append(file)

def select_points_and_raster(file, class_nr, resolution):
    filename_out = file.split('.')[0] + '_' + str(do) + '.tif'
    config = json.dumps([file,
        {'type': 'filters.range', 'limits': classification[class_nr]},
        {'resolution': resolution, 'radius': resolution * 1.414,
         'gdaldriver': 'GTiff',
         'output_type': ['mean'],
         'filename': filename_out}
    ])
    pipeline = pdal.Pipeline(config)
    pipeline.execute()
    return filename_out

for i in range(len(tiles)):
    print(f'do file {tiles[i]}')
    filename_out = select_points_and_raster(tiles[i], class_nr, resolution)
    print(f'finished and wrote {filename_out}')
where classification is a dictionary holding the numbers that correspond to ground/buildings/vegetation, so I don't have to remember them.
This works fine serially, iterating over each file in tiles. However, as I have many files, I would like to use multiple cores. How do I split the task so it uses at least the four cores my machine has? I tried the following:
from multiprocess import Pool

ncores = 2
pool = Pool(processes=ncores)
pool.starmap(select_points_and_raster,
             [([file for file in tiles], classification[class_nr], resolution)])
pool.close()
pool.join()
but that does not work: I get AttributeError: 'list' object has no attribute 'split'.
But I'm not passing a list, or am I? Is this generally the right way to parallelize this?
from multiprocess import Pool

def select_points_and_raster(args):
    file, class_nr, resolution = args
    filename_out = file.split('.')[0] + '_' + str(do) + '.tif'
    config = json.dumps([file,
        {'type': 'filters.range', 'limits': classification[class_nr]},
        {'resolution': resolution, 'radius': resolution * 1.414,
         'gdaldriver': 'GTiff',
         'output_type': ['mean'],
         'filename': filename_out}
    ])
    pipeline = pdal.Pipeline(config)
    pipeline.execute()
    return filename_out

info = []
for i in range(len(tiles)):
    print(f'do file {tiles[i]}')
    info.append((tiles[i], class_nr, resolution))

ncores = 2  # ncores = cpu_count() - 1
pool = Pool(ncores)
pool.map(select_points_and_raster, info)  # map over the list of argument tuples
pool.close()
pool.join()
This is what works for me. You were passing a list inside your tuple list, [([...], ..., ...)], i.e. a single tuple whose first element is the whole list of file names. You were also passing a different input: classification[class_nr] instead of just class_nr.
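For completeness, if you prefer to keep the original three-argument function, a minimal sketch (assuming tiles, class_nr and resolution as in the question; the standard multiprocessing module is used here, and the multiprocess fork has the same API) would build one tuple per file and let starmap unpack it:
from multiprocessing import Pool

if __name__ == '__main__':
    # one (file, class_nr, resolution) tuple per tile; starmap unpacks each tuple
    # into the three positional arguments of the original function
    jobs = [(tile, class_nr, resolution) for tile in tiles]
    with Pool(processes=4) as pool:
        written = pool.starmap(select_points_and_raster, jobs)
    print(written)  # output filenames, in input order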

How to efficiently run parallel processes in Python when each needs to read from a big file

I have a large delimited protobuf file (between 1 GB and 30 GB).
Each message in the file has a certain format: the first attribute is a string, and the second is a repeated (list-like) object that contains 2 attributes.
It's similar to this text representation:
BIG FILE:
first 10:32, 12:1, ... ,100002:3
second 1:3, 15:5, ... ,548756:57
...
...
ten_million 4:7, 48:4, ... ,12357458:8
Currently my code looks something like this:
import itertools
from multiprocessing import Pool
from google.protobuf.internal.decoder import _DecodeVarint32
import proto_pb2

def read_message(buffer, n, message_type):
    message = message_type()
    msg_len, new_pos = _DecodeVarint32(buffer, n)
    n = new_pos
    msg_buf = buffer[n:n + msg_len]
    n += msg_len
    message.ParseFromString(msg_buf)
    return message

class A:
    def __init__(self, big_file):
        with open(big_file, 'rb') as fp:
            self.buf = fp.read()

    def get_line(self, n):
        return read_message(self.buf, n, proto_pb2.line_type)

def func(obj_a, lines):
    res = []
    for line in lines:
        res.append(obj_a.get_line(line))
    return res

if __name__ == '__main__':
    all_lines = [[54487, 78, 10025, 548967], [12, 3218], [45786, 5744, 567, 45648], [45156, 456, 75]]
    a = A(big_file)
    with Pool() as pool:
        result = pool.starmap(func, itertools.product([a], all_lines))
    print(result)
I open and read the file inside the class and hold it in memory when the object is created, to avoid opening and closing the file repeatedly.
It fits into memory, but I'd like to avoid that.
I then expect each of the processes created by the Pool to read the lines it needs from the file and return the wanted result.
All the sub-processes only read from the file (no writing) and print/return the results.
Currently it doesn't really run in parallel; it seems to wait on acquiring a lock for a long time before each process can run.
I assume this is happening because of the large file, which is copied for each process.
What would be a proper implementation?
This code is only an example; the actual file is a protobuf file (I hope I haven't made too many mistakes there), and I keep a mapping (dict) from line numbers to their locations in the file.
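One pattern that may help (a sketch only, assuming each worker can be handed byte offsets from that line-number mapping, and reusing read_message, proto_pb2 and big_file from above) is to mmap the file once per worker via a Pool initializer, so the parent never has to pickle the multi-gigabyte buffer into every task:
import mmap
from multiprocessing import Pool

_buf = None  # per-process, read-only view of the big file

def _init_worker(path):
    global _buf
    # mmap lets the OS share the read-only pages between processes,
    # instead of pickling a multi-GB bytes object into every task
    with open(path, 'rb') as fp:
        _buf = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)

def read_lines(offsets):
    # read_message() is the decoding helper from the question; each offset is
    # assumed to be a byte position taken from the line-number mapping
    return [read_message(_buf, off, proto_pb2.line_type) for off in offsets]

if __name__ == '__main__':
    offset_groups = [[54487, 78, 10025], [12, 3218]]  # byte offsets, not line numbers
    with Pool(initializer=_init_worker, initargs=(big_file,)) as pool:
        result = pool.map(read_lines, offset_groups)
    print(result)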

Function that multiprocesses another function

I'm analyzing time series of simulations. Basically, the same tasks are performed for every time step. As there is a very large number of time steps, and as the analysis of each of them is independent, I wanted to create a function that can multiprocess another function. The latter will have arguments and return a result.
Using a shared dictionary and the concurrent.futures library, I managed to write this:
import multiprocessing as mlp
import concurrent.futures as Cfut

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # function   : function that is run in parallel
    # param_list : list of items
    # group_size : size of the groups
    # Nworkers   : number of groups/items running at the same time
    # *args      : fixed parameters passed through to `function`
    manager = mlp.Manager()
    dic = manager.dict()
    executor = Cfut.ProcessPoolExecutor(Nworkers)
    futures = [executor.submit(function, param, dic, *args)
               for param in grouper(param_list, group_size)]
    Cfut.wait(futures)
    return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this:
def read_file(files, dictionnary):
    for file in files:
        i = int(file[4:9])
        #print(str(i))
        if 'bz2' in file:
            os.system('bunzip2 ' + file)
            file = file[:-4]
        dictionnary[i] = np.loadtxt(file)
        os.system('bzip2 ' + file)

Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this:
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

def find_lambda_finger(indexes, dic, Deviation):
    for i in indexes:
        #print(str(i))
        # Beach = Deviation[i,:] - np.mean(Deviation[i,:])
        dic[i] = Anls.find_first_max(autocorr(Deviation[i, :]), valmax=True)

args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it works, but not well. Sometimes it crashes. Sometimes it actually launches a number of Python processes equal to Nworkers, and sometimes only 2 or 3 of them run at a time even though I specified Nworkers = 15.
For example, a classic error I get is described in a question I raised earlier: Calling matplotlib AFTER multiprocessing sometimes results in error: main thread not in main loop
What is the more Pythonic way to achieve what I want? How can I improve control over this function? How can I better control the number of running Python processes?
One of the basic concepts of Python multiprocessing is using queues. It works quite well when you have an input list that can be iterated over and that does not need to be altered by the sub-processes. It also gives you good control over all the processes, because you spawn exactly the number you want, and you can keep them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much harder to set up correctly.
Queues can hold almost anything that can be pickled, so you can fill them with file-path strings for reading files, numbers for doing calculations, or even images for drawing.
In your case a layout could look like this:
import multiprocessing as mp
import numpy as np
import itertools as it

def worker1(in_queue, out_queue):
    # blocks when nothing is available, stops when 'STOP' is seen
    for a in iter(in_queue.get, 'STOP'):
        # do something
        out_queue.put({a: result})  # return your result linked to the input

def worker2(in_queue, out_queue):
    for a in iter(in_queue.get, 'STOP'):
        # do something differently
        out_queue.put({a: result})  # return your result linked to the input

def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
    # your final result
    result = {}

    in_queue = mp.Queue()
    out_queue = mp.Queue()

    # fill your input
    for a in param_list:
        in_queue.put(a)
    # stop command at end of input
    for n in range(Nworkers):
        in_queue.put('STOP')

    # set up your worker processes doing the task as specified
    process = [mp.Process(target=function,
                          args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]

    # run processes
    for p in process:
        p.start()

    # collect your results from the calculations before joining, so the
    # workers are never blocked on a full output queue
    for a in param_list:
        result.update(out_queue.get())

    # wait for processes to finish
    for p in process:
        p.join()

    return result

temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic if you are afraid that your queues will run out of memory. Then you need to fill and empty the queues while the processes are running. See this example here.
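A rough sketch of that more dynamic variant (same worker signature and 'STOP' sentinel as above) bounds the input queue and feeds it from a thread while the workers drain it:
import threading
import multiprocessing as mp

def multiprocess_loop_bounded(function, param_list, Nworkers, maxsize=100):
    in_queue = mp.Queue(maxsize=maxsize)   # put() blocks when the queue is full
    out_queue = mp.Queue()
    process = [mp.Process(target=function, args=(in_queue, out_queue), daemon=True)
               for _ in range(Nworkers)]
    for p in process:
        p.start()

    def feed():
        for a in param_list:
            in_queue.put(a)            # blocks instead of growing without bound
        for _ in range(Nworkers):
            in_queue.put('STOP')

    threading.Thread(target=feed, daemon=True).start()

    # drain results while the feeder thread is still filling the input queue
    result = {}
    for _ in param_list:
        result.update(out_queue.get())

    for p in process:
        p.join()
    return result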
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)

Handle multiple results in Python multiprocessing

I'm writing a piece of Python code to parse a lot of ASCII files using multiprocessing.
For each file I have to perform the operations of this function:
def parse_file(file_name):
    record = False
    path_include = []
    buffer_include = []
    include_file_filters = {}
    include_keylines = {}
    grids_lines = []
    mat_name_lines = []
    pids_name_lines = []
    pids_shell_lines = []
    pids_weld_lines = []
    shells_lines = []
    welds_lines = []
    with open(file_name, 'rb') as in_file:
        for lineID, line in enumerate(in_file):
            if record:
                path_include += line
            if record and re.search(r'[\'|\"]$', line.strip()):
                buffer_include.append(re_path_include.search(
                    path_include).group(1).replace('\n', ''))
                record = False
            if 'INCLUDE' in line and '$' not in line:
                if re_path_include.search(line):
                    buffer_include.append(
                        re_path_include.search(line).group(1))
                else:
                    path_include = line
                    record = True
            if line.startswith('GRID'):
                grids_lines += [lineID]
            if line.startswith('$HMNAME MAT'):
                mat_name_lines += [lineID]
            if line.startswith('$HMNAME PROP'):
                pids_name_lines += [lineID]
            if line.startswith('PSHELL'):
                pids_shell_lines += [lineID]
            if line.startswith('PWELD'):
                pids_weld_lines += [lineID]
            if line.startswith(('CTRIA3', 'CQUAD4')):
                shells_lines += [lineID]
            if line.startswith('CWELD'):
                welds_lines += [lineID]
    include_keylines = {'grid': grids_lines, 'mat_name': mat_name_lines, 'pid_name': pids_name_lines,
                        'pid_shell': pids_shell_lines, 'pid_weld': pids_weld_lines, 'shell': shells_lines, 'weld': welds_lines}
    include_file_filters = {file_name: include_keylines}
    return buffer_include, include_file_filters
This function is called in a loop over the list of files, in this way (each process parses one entire file):
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}

for include in grouper([list_of_file_path]):
    current = mp.current_process()
    print 'Running: ', current.name, current._identity
    results = p.map(parse_file, include)
    buffer_include += results[0]
    include_file_filters.update(results[1])

p.close()
The grouper function used above is defined as
def grouper(iterable, padvalue=None):
    return itertools.izip_longest(*[iter(iterable)] * mp.cpu_count(), fillvalue=padvalue)
I'm using Python 2.7.15 on a CPU with 4 cores (Intel Core i3-6006U).
When I run my code, I see all the CPUs engaged at 100% and the output Running: MainProcess () in the Python console, but nothing else happens. It seems that my code is blocked at the instruction results = p.map(parse_file, include) and can't go any further (the code works well when I parse the files one at a time without parallelization).
What is wrong?
How can I deal with the results given by the parse_file function during parallel execution? Is my approach correct or not?
Thanks in advance for your support.
EDIT
Thanks darc for your reply. I tried your suggestion, but the issue is the same. The problem seems to go away if I put the code under a guard like so:
if __name__ == '__main__':
Maybe this is due to the way Python IDLE handles the processes. I'm using the IDLE environment for development and debugging.
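For reference, the guard I mean looks roughly like this (a sketch with the pool code from above moved into a main() function):
import multiprocessing as mp

def main():
    p = mp.Pool(mp.cpu_count())
    results = p.map(parse_file, list_of_file_path)
    p.close()
    p.join()
    return results

if __name__ == '__main__':
    # without this guard, starting worker processes on Windows (or from IDLE)
    # re-imports the module and tries to create the pool again
    all_results = main()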
According to the Python docs:
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
Since map() blocks, your process waits until parse_file is done.
Since map() already chunks the iterable, you can send all of the includes together as one large iterable:
import multiprocessing as mp

p = mp.Pool(mp.cpu_count())
buffer_include = []
include_file_filters = {}

results = p.map(parse_file, list_of_file_path, 1)
# each element of `results` is the (buffer_include, include_file_filters)
# tuple returned by parse_file for one input file
for buf, filters in results:
    buffer_include += buf
    include_file_filters.update(filters)

p.close()
If you want to keep the original loop, use apply_async; or, if you are using Python 3, you can use ProcessPoolExecutor's submit() function and read the results.
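For example, on Python 3 a sketch with ProcessPoolExecutor (reusing parse_file and list_of_file_path from above) could submit each file and collect the results as they finish:
from concurrent.futures import ProcessPoolExecutor, as_completed

if __name__ == '__main__':
    buffer_include = []
    include_file_filters = {}
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(parse_file, path) for path in list_of_file_path]
        for future in as_completed(futures):
            buf, filters = future.result()   # parse_file returns a 2-tuple
            buffer_include += buf
            include_file_filters.update(filters)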

Calculations are not stored in the passed arguments when processes are executed in parallel

I have a function which I apply to different chunks of my data. Since each chunk is independent of the rest, I want to execute the function on all chunks in parallel.
I have a result dictionary which should hold the output of the calculations for each chunk.
Here is how I did it:
from joblib import Parallel, delayed
import multiprocessing

cpu_count = multiprocessing.cpu_count()

# I have 8 cores, so I divide the data into 8 chunks.
endIndeces = divideIndecesUniformly(myData.shape[0], cpu_count)  # e.g., [0, 125, 250, ..., 875, 1000]

# initialize result dictionary with empty lists.
result = dict()
for i in range(cpu_count):
    result[i] = []

# Parallel execution for 8 chunks
# (arguments: data, start index, end index, shared result dict, chunk number)
Parallel(n_jobs=cpu_count)(
    delayed(myFunction)(myData, endIndeces[i], endIndeces[i + 1] - 1, result, i)
    for i in range(cpu_count)
)
However, when the execution finishes, result still holds only the initial empty lists. I found that if I execute the function serially over each chunk of data, it works just fine. For example, if I replace the last line with the following, result does contain all the calculated values.
# Instead of parallel execution, call the function in a for-loop.
for i in range(cpu_count):
    myFunction(myData, endIndeces[i], endIndeces[i + 1] - 1, result, i)
In this case, the result values are updated.
It seems that when the function is executed in parallel, it cannot write to the given dictionary (result). So, I was wondering how I can obtain the output of the function for each chunk of data?
joblib by default uses the multiprocessing module in Python. According to this SO answer, when arguments are passed to new processes, a fork is created, which copies the memory space of the current process. This means that myFunction is essentially working on a copy of result and does not modify the original.
My suggestion is to have myFunction return the desired data as a list. The call to Parallel will then return a list of the lists generated by myFunction. From there, it is simple to add them to result. It could look something like this:
from joblib import Parallel, delayed
import multiprocessing

if __name__ == '__main__':
    cpu_count = multiprocessing.cpu_count()
    endIndeces = divideIndecesUniformly(myData.shape[0], cpu_count)

    # make sure myFunction returns the grouped results in a list
    # (the shared dict and chunk index are no longer passed in)
    r = Parallel(n_jobs=cpu_count)(
        delayed(myFunction)(myData, start_idx=endIndeces[i], end_idx=endIndeces[i + 1] - 1)
        for i in range(cpu_count)
    )

    result = dict()
    for i, data in enumerate(r):  # cycles through each resultant chunk, numbered and in the original order
        result[i] = data
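Concretely, myFunction would then return its chunk's values instead of writing into the shared dictionary; a sketch (compute() is just a placeholder for your per-row calculation):
def myFunction(myData, start_idx, end_idx):
    # return the chunk's results instead of mutating a dict that lives in the
    # parent process; joblib collects the return values for you
    chunk_result = []
    for row in range(start_idx, end_idx + 1):
        chunk_result.append(compute(myData[row]))  # compute() is a placeholder
    return chunk_result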
