I have a very large Python script. At its core is a function that takes a row of a DataFrame, applies some formulas, and saves the object it creates with joblib to my files. (I'm going to include a function that captures the essence of the script.)
import multiprocessing as multi
import joblib

def somefunct(DataFrame_row, some_parameter1, some_parameter2, sema):
    python_object = My_object(DataFrame_row['Column1'], DataFrame_row['Column2'])
    # For example, calculate an integral over My_object's data;
    # this takes roughly 50-60 seconds per row.
    python_object.some_complicate_method(some_parameter1, some_parameter2)
    joblib.dump(python_object, path_save)
    # Before trying a function that saves the object to disk, I tried a
    # function that stored the object back in the DataFrame.
    sema.release()

def apply_all_data_frame(df, n_processes):
    sema = multi.Semaphore(n_processes)
    process_list = []
    for index, row in df.iterrows():
        sema.acquire()
        p = multi.Process(target=somefunct,
                          args=(row, some_parameter1, some_parameter2, sema))
        process_list.append(p)
        p.start()
    for process in process_list:
        process.join()
So, the DataFrame contains 5000 rows and may contain more in the future. I tested the script on data with 100 rows on a computer with 16 cores and 32 logical processors. I chose 30 processes, and with 100 rows it used all 30 processes (100% CPU) and finished quickly. But when I tried again with all the data, the computer only used 3 or 4 processes (11% CPU), with each process using 2.0 GB of RAM. It takes too long.
My first attempt was with Pool and Pool.map, but in that case I had the same problem: it filled the RAM and broke everything, despite using fewer processes (16, I think).
As I noted in a comment in the script, my first version saved the object into the DataFrame, but when I saw RAM usage reach 100% I decided to save the object to disk instead. In that case I tried Pool and everything froze, because it created Python processes doing 0% work on the CPU.
I also tried the function without the Semaphore.
I apologize for my English and for the explanation; this is my first question online.
[screenshot of how the processes behave on my computer]
I am running Python 3.5.2, and it seems that my pool is not calling the target function for some worker jobs. I currently have a list of length 630, where each element is an np.array. A pool of workers then processes each element of this list using map_async(). As a somewhat condensed version of what is happening in my script, consider the following example code:
import os
import multiprocessing

def run(grid_point):
    index, geometry = grid_point  # unpack the tuple passed in
    parent = os.getcwd()
    # One file per process, for debugging purposes.
    checkfile = "%s/CHECK/Check_GP%s.chk" % (parent, index)
    # Write "Calculation started!" to the file for process N.
    setup.logging(checkfile, message='Calculation started!')
    ''' Do some stuff with geometries via an interface to external code. '''
    # Some other routines that do some calculation based on the input.
    if normal_termination:
        ''' Write some output to an array with memory shared between processes. '''
        # If the calculation finishes successfully, some output numbers
        # (always non-zero) are written to a .npy array.
    else:
        ''' Fill the array for that process with NaN. '''
        # If the calculation fails, those elements are NaN.

cpus = int(multiprocessing.cpu_count() / 2)  # 16 cpus here
pool = multiprocessing.Pool(processes=cpus)
# List of molecular geometries passed to the target function run();
# N.B. len(list_geom) = 630.
pool.map_async(run, [(k, v) for k, v in enumerate(list_geom)])
pool.close()
pool.join()
Here, if the calculation runs as expected for all 630 points, one would expect 630 *.chk files with the message 'Calculation started!' inside. In addition, since the numbers written to the output array are always non-zero (including when the external code's calculation fails, in which case they are set to NaN), I would also expect this array to have no zero values.
Instead, what I see is that there are only 580 *.chk files present once my Python script finishes running. Moreover, there are 50 elements in the output array that are zero (this array is initialised using np.zeros). This indicates that 50 of the 630 jobs submitted to the pool do not even start. In addition, some other routines I have written that are called within this run() function create a directory containing some output for each of the 630 jobs. Again, of the 630 output directories that should be created, 50 are missing.
If anyone has any idea as to why map_async() would appear to skip jobs in the pool, or what could lead to the target function run() not being called for 50 of the jobs, I would be keen to hear your ideas! I am rather new to this kind of programming and am really stumped with this one. Thank you!
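For reference, one way to check whether some of those jobs are failing silently is to keep the AsyncResult returned by map_async() and call get() on it, which re-raises any worker exception. This is only a minimal sketch reusing run and list_geom from the snippet above; the error_callback message is illustrative.

import multiprocessing

cpus = int(multiprocessing.cpu_count() / 2)
pool = multiprocessing.Pool(processes=cpus)
# Keep the AsyncResult instead of discarding it, and pass an error_callback
# so failures inside workers are reported rather than swallowed silently.
async_result = pool.map_async(
    run,                                        # target function from above
    [(k, v) for k, v in enumerate(list_geom)],  # same input list as above
    error_callback=lambda exc: print("worker failed:", exc),
)
pool.close()
pool.join()
results = async_result.get()  # re-raises the first worker exception, if any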
I am new here, but I wanted to ask something regarding multiprocessing.
I have some huge raster tiles that I process to extract information, and I found that writing lots of pickle files is faster than appending to a DataFrame. The point is that I loop over each of my tiles for processing, and I create pools inside a for loop.
# This creates a directory for my pickle files
if not os.path.exists('pkl_tmp'):
    os.mkdir('pkl_tmp')
Here I start looping over each one of my tiles and create a pool with the grid cells that I want to process, then I use the map function to apply all my nasty processing to each cell of my grid.
for GHSL_tile in ROI.iloc[4:].itertuples():
    ct += 1
    L18_cells = GHSL_query(GHSL_tile, L18_grid)
    vector_tile = poligonize_tile(GHSL_tile)
    print(datetime.today())
    subdir = './pkl_tmp/{}/'.format(ct)
    if not os.path.exists(subdir):
        os.mkdir(subdir)
    if vector_tile is not None:
        # assign how many cores will be used
        num_processes = int(multiprocessing.cpu_count() - 15)
        chunk_size = 1  # chunk size set to 1 to return cell-like outputs
        # break the dataframe into a list of single-row chunks
        chunks = [L18_cells.iloc[i:i + chunk_size, :]
                  for i in range(0, L18_cells.shape[0], chunk_size)]
        pool = multiprocessing.Pool(processes=num_processes)
        result = pool.map(process_cell, chunks)
        del result
    else:
        print('Tile # {} skipped'.format(ct))
print('GHSL database created')
In fact this has no errors; it takes around 2 days to execute due to the size of my data, and sometimes I have many idle cores (especially towards the end of a tile).
My question is:
I tried using map_async instead of map and it was creating files really fast, sometimes even processing multiple tiles at the same time, which is wonderful. The problem is that when it creates the directory for my last tile, the code falls out of the for loop and many tasks end up not being executed. What am I doing wrong? How can I make map_async work better, or how can I avoid idle cores (slow-downs) when I use the map function?
Thank you in advance
PC resources are definitely not a problem.
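For reference, a minimal sketch of how map_async() is usually paired with close()/join() (or get()) so the loop cannot move on, and the script cannot exit, while cells are still queued; process_cell, chunks, and num_processes are the names from the code above.

import multiprocessing

# Inside the tile loop, one pool per tile, as in the original code:
pool = multiprocessing.Pool(processes=num_processes)
async_result = pool.map_async(process_cell, chunks)
pool.close()            # no more work will be submitted to this pool
pool.join()             # block until every queued cell of this tile is done
_ = async_result.get()  # also re-raises any exception from a worker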
I have a function which I will run using multiprocessing. However, the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue, but I don't know how to implement it or whether that would even work.
import os
from multiprocessing import Process

cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
    core.start()
for core in cores:
    core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
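If it helps, here is a minimal, hypothetical helper for such trials (time_worker_counts is not from the answer, just a sketch): it times the same workload with a few different pool sizes so you can see which count is actually fastest on your machine.

import os
import time
from multiprocessing import Pool

def time_worker_counts(func, work, counts=None):
    # Try a few pool sizes and report wall-clock time for each.
    counts = counts or [1, 2, max(1, os.cpu_count() // 2), os.cpu_count()]
    for n in counts:
        start = time.perf_counter()
        with Pool(n) as p:
            p.map(func, work)
        print("%d workers: %.2f s" % (n, time.perf_counter() - start))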
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
import os
from multiprocessing import Pool

# Renaming this variable just for clarity of the example here
work_queue = datasets

# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()

# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)

# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)

# Do something with the result
print(processed_work)
You cannot return a variable from another process. The recommended way is to create a Queue (multiprocessing.Queue), have your subprocess put its results onto that queue, and read them back once it's done -- this works well if you have a lot of results.
If you just need a single number, using Value or Array could be easier.
Just remember, you cannot use a simple variable for that; it has to be wrapped with the above-mentioned classes from the multiprocessing lib.
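A minimal sketch of that Queue approach, reusing the processImages and dataSets names from the question (assumed to exist):

import os
from multiprocessing import Process, Queue

def worker(dataset, result_queue):
    # Run the real work and push its return value onto the shared queue.
    result_queue.put(processImages(dataset))

if __name__ == '__main__':
    result_queue = Queue()
    cores = []
    for i in range(os.cpu_count()):
        p = Process(target=worker, args=(dataSets[i], result_queue))
        cores.append(p)
        p.start()
    # Drain the queue (one result per process) before joining, so no
    # process blocks while trying to flush its results.
    results = [result_queue.get() for _ in cores]
    for core in cores:
        core.join()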
If you want to use the result object returned by multiprocessing, try this:
from multiprocessing.pool import ThreadPool

def fun(fun_argument1, ..., fun_argumentn):
    # ... do your actual work here ...
    return object_1, object_2

pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ..., fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.
I am using pool.map from multiprocessing to run my custom function:
import pandas as pd
from multiprocessing import pool

def my_func(data):  # This is just a dummy function.
    data = data.assign(new_col=data.apply(lambda x: f(x), axis=1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped data frame, so I am passing one data frame at a time to pool.map. I keep getting a memory error here.
As I can see from here, VIRT grows several-fold and leads to this error.
Two questions:
How do I solve this growing VIRT memory issue? Maybe there is a way to play with the chunk size here?
Second, although it launches as many Python subprocesses as I specify in Pool(processes=...), I can see that not all the CPUs hit 100%; it seems it doesn't use all the processes, and only one or two run at a time. Maybe that is because it applies the same chunk size to the data frames of different sizes I pass each time (some data frames will be small)? How do I utilise every CPU process here?
Just for anyone looking for an answer in the future: I solved this by using imap instead of map, because map builds a full list from the iterator, which is memory-intensive.
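A minimal sketch of that change, keeping the rest of the setup from the question (my_func and gpd are the names used above; the chunksize value is just illustrative):

from multiprocessing import pool
import pandas as pd

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    # imap consumes the generator lazily instead of materialising every
    # group up front; chunksize controls how many groups go to a worker at once.
    ret_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=4)
    result = pd.concat(ret_iter, axis=0)
    mypool.close()
    mypool.join()
    return result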
I am experiencing a strange thing: I wrote a program to simulate economies. Instead of running the simulations one by one on one CPU core, I want to use multiprocessing to make things faster. So I run my code (fine), and I want to get some stats from the simulations I am doing. Then comes a surprise: all the simulations run at the same time yield the very same result! Is there some strange relationship between Pool() and random.seed()?
To be much clearer, here is what the code can be summarized as:
from multiprocessing import Pool

class Economy(object):
    def __init__(self, i):
        self.run_number = i
        self.Statistics = Statistics()
        self.process()

def run_and_return(i):
    eco = Economy(i)
    return eco

collection = []

def get_result(x):
    collection.append(x)

if __name__ == '__main__':
    pool = Pool(processes=4)
    for i in range(NRUN):
        pool.apply_async(run_and_return, (i,), callback=get_result)
    pool.close()
    pool.join()
The process() method is the function that goes through every step of the simulation, over i steps. Basically I simulate NRUN Economies, from which I get the Statistics that I put in the list collection.
Now the strange thing is that the output is exactly the same for the first 4 runs: within the same "wave" of simulations, I get the very same output. Once I get to the second wave, I get a different output for the next 4 simulations!
All these simulations run well if I use the same program with processes=1: I get different results when I only work on one core, taking simulations one by one... I have tried a few things, but can't get my head around this, hence my post...
Thank you very much for taking the time to read this long post, do not hesitate to ask for more precisions!
All the best,
If you are on Linux then each pool process is made by forking the parent process. This means the process is literally duplicated - this includes the seed any random object may be using.
The random module selects the seed for its default functions on import, meaning the seed has already been selected before you create the Pool.
To get around this you must use an initialiser for each pool process that sets the random seed to something unique.
A decent way to seed random would be to use the process id and the current time. The process id is bound to be unique on a single run of your program. Whilst using the time will ensure uniqueness over multiple runs in case the same process id is produced. Passing process id and time through as a string will mean that the digest of the string is also used to seed the random number generator -- meaning two similar strings will produce substantially different seeds. Alternatively, you could use the uuid module to generate seeds.
import os, random, time

def proc_init():
    random.seed(str(os.getpid()) + str(time.time()))

pool = Pool(num_procs, initializer=proc_init)
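And the uuid alternative mentioned above, as a minimal sketch:

import random, uuid

def proc_init():
    # uuid4() gives each worker process its own effectively unique seed.
    random.seed(uuid.uuid4().int)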