unpacking a dask delayed object of list of tuples - python

I have a function that returns a tuple of two elements. The function is called with Pool.starmap to generate a list of tuples, which are unpacked into two lists.
import multiprocessing

def func():
    # ...some operations
    return (x, y)

def MP_a_func(func, iterable, proc, chunk):
    pool = multiprocessing.Pool(processes=proc)
    Result = pool.starmap(func, iterable, chunksize=chunk)
    pool.close()
    return Result

if __name__ == '__main__':
    results = MP_a_func(func, iterable, proc, chunk)
    a, b = zip(*results)
I now wish to use the dask delayed API as follows:
if __name__ == '__main__':
    results = delayed(MP_a_func(func, iterable, proc, chunk))
Is it possible to unpack the tuples in the delayed object without using results.compute()?
Thank you for your help.

It is possible for another delayed function to unpack the tuple. In the example below, the delayed value of return_tuple(1) is not computed, but passed along as a delayed object:
import dask

@dask.delayed
def return_tuple(x):
    return x + 1, x - 1

@dask.delayed
def process_first_item(some_tuple):
    return some_tuple[0] + 10

result = process_first_item(return_tuple(1))
dask.compute(result)
As per @mdurant's answer, it turns out the delayed function/decorator has an nout parameter; also see this answer.

If you know the number of outputs, the delayed function (or decorator) takes an optional nout argument, and this will split the single delayed into that many delayed outputs. This sounds like exactly what you need.
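A minimal sketch of what that looks like for a two-element tuple (assuming only that dask is installed; the names are illustrative):

import dask

@dask.delayed(nout=2)
def return_tuple(x):
    return x + 1, x - 1

# Unpacks into two separate Delayed objects; nothing is computed yet.
plus, minus = return_tuple(1)
print(dask.compute(plus, minus))  # (2, 0)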

Related

Python concurrent.futures

I have multiprocessing code, and each process has to analyse the same data differently.
I have implemented:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res
and the function:
def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
The problem is that goal_fcn is called only once, while it should be called multiple times.
In the debugger, I checked what the variable p looks like, and it has multiple columns and rows. Inside goal_fcn, the variable x has only the first row - that looks good.
But the function is called only once. There is no error; the code just executes the next steps.
Even if I modify the variable to p = [1, 3, 4, 5] (and of course the code), goal_fcn is executed only once.
I have to use map() because keeping the order between input and output is required.
map works like zip: it terminates once at least one input sequence is exhausted. Your [global_DataFrame] and [global_String] lists have one element each, so that is where map ends.
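A quick illustration with the built-in map, which follows the same truncation rule as Executor.map:

print(list(map(lambda a, b: (a, b), [1, 2, 3], ['only one'])))
# [(1, 'only one')]  (stops at the shortest iterable)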
There are two ways around this:
Use itertools.product. This is the equivalent of running "for all data frames, for all strings, for all p". Something like this:
import itertools

def goal_fcn(x_DataFrame_String):
    x, DataFrame, String = x_DataFrame_String
    ...

executor.map(goal_fcn, itertools.product(p, [global_DataFrame], [global_String]))
Bind the fixed arguments instead of abusing the sequence arguments.
import functools

def goal_fcn(x, DataFrame, String):
    pass

bound = functools.partial(goal_fcn, DataFrame=global_DataFrame, String=global_String)
executor.map(bound, p)

Run one function on different CPUs

I have one machine with two CPUs, and each CPU has a different number of cores. I have one function in my Python code. How can I run this function on each of the CPUs?
In this case, I need to run the function two times because I have two CPUs.
I want this because I want to compare the performance of the different CPUs.
This can be part of the code. Please let me know if the code is not written in the correct way.
import multiprocessing

def my_function():
    print("This Function needs high computation")
    # Add code of function

pool = multiprocessing.Pool()
jobs = []
for j in range(2):  # how can I run the function depending on the number of CPUs?
    p = multiprocessing.Process(target=my_function)
    jobs.append(p)
    p.start()
I have read many posts, but have not found a suitable answer for my problem.
The concurrent.futures package handles the allocation of resources in an easy way, so that you don't have to specify any particular process/thread IDs, something that is OS-specific anyway.
If you want to run a function using either multiple processes or multiple threads, you can have a class that does it for you:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Generator

class ConcurrentExecutor:

    @staticmethod
    def _concurrent_execution(executor, func, values):
        with executor() as ex:
            if isinstance(values, Generator):
                return list(ex.map(lambda args: func(*args), values))
            return list(ex.map(func, values))

    @staticmethod
    def concurrent_process_execution(func, values):
        return ConcurrentExecutor._concurrent_execution(
            ProcessPoolExecutor, func, values,
        )

    @staticmethod
    def concurrent_thread_execution(func, values):
        return ConcurrentExecutor._concurrent_execution(
            ThreadPoolExecutor, func, values,
        )
Then you can execute any function with it, even with arguments. If it's a single-argument function:
from concurrency import ConcurrentExecutor as concex

# Single argument function that prints the input
def single_arg_func(arg):
    print(arg)

# Dummy list of 5 different input values
n_values = 5
arg_values = [x for x in range(n_values)]

# We want to run the function concurrently for each value in values
concex.concurrent_thread_execution(single_arg_func, arg_values)
Or with multiple arguments:
from concurrency import ConcurrentExecutor as concex

# Multi argument function that prints the input
def multi_arg_func(arg1, arg2):
    print(arg1, arg2)

# Dummy list of 5 different input values per argument
n_values = 5
arg1_values = [x for x in range(n_values)]
arg2_values = [2*x for x in range(n_values)]

# Create a generator of combinations of values for the 2 arguments
args_values = ((arg1_values[i], arg2_values[i]) for i in range(n_values))

# We want to run the function concurrently for each value combination
concex.concurrent_thread_execution(multi_arg_func, args_values)
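For CPU-bound work like the question describes, the process-based variant is the more relevant one. A minimal sketch, assuming the class is saved in a module named concurrency as in the examples above; note that ProcessPoolExecutor defaults its worker count to os.cpu_count(), and that the generator branch of _concurrent_execution relies on a lambda, which worker processes cannot pickle, so a plain list of single arguments is used here:

from concurrency import ConcurrentExecutor as concex

# CPU-heavy placeholder; must live at module level so it can be pickled
def cpu_heavy(n):
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    results = concex.concurrent_process_execution(cpu_heavy, [10**6] * 4)
    print(results)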

Parallelizing comparisons between two dataframes with multiprocessing

I've got the following function that allows me to do some comparisons between the rows of two dataframes (data and ref) and return the index of both rows if there's a match.
def get_gene(row):
    m = (np.equal(row[0], ref.iloc[:, 0].values)
         & np.greater_equal(row[2], ref.iloc[:, 2].values)
         & np.less_equal(row[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None
Since this is a process that takes time (25 min for 1.6M rows in data versus 20K rows in ref), I tried to speed things up by parallelizing the computation. As pandas doesn't support multiprocessing natively, I used this piece of code that I found on SO, and it worked OK with my function get_gene.
def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    df = pd.concat(list(result))
    return df
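A hedged sketch of how this helper is typically invoked (the workers=4 count and the axis=1 keyword are assumptions based on the description in this question; every keyword except workers is forwarded to df.apply):

if __name__ == '__main__':
    matches = apply_by_multiprocessing(data, get_gene, axis=1, workers=4)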
It allowed me to go down to 9 min of computation. But, if I understood correctly, this code just breaks my dataframe data down into 4 pieces and sends each one to a core of the CPU. Hence, each core ends up doing comparisons between 400K rows (from data split in 4) and the 20K rows of ref.
What I would actually want to do is to split both dataframes based on a value in one of their columns, so that I only compute comparisons between dataframes of the same 'group':
data.get_group(['a']) versus ref.get_group(['a'])
data.get_group(['b']) versus ref.get_group(['b'])
data.get_group(['c']) versus ref.get_group(['c'])
etc...
which would reduce the amount of computation to do. Each row in data could then only be matched against ~3K rows in ref, instead of all 20K rows.
Therefore, I tried to modify the code above but I couldn't manage to make it work.
def apply_get_gene(df, func, **kwargs):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    reference = reference.groupby(['Chr'])
    df = df.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args_list = [(df.get_group(chrom), func, kwargs, reference.get_group(chrom)) for chrom in chromosome]
    results = pool.map(_apply_df, args_list)
    pool.close()
    pool.join()
    return pd.concat(results)

def _apply_df(args):
    df, func, kwarg1, kwarg2 = args
    return df.apply(func, **kwargs)

def get_gene(row, ref):
    m = (np.equal(row[0], ref.iloc[:, 0].values)
         & np.greater_equal(row[2], ref.iloc[:, 2].values)
         & np.less_equal(row[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None
I'm pretty sure it has to do with the way *args and **kwargs are passed through the different functions (because in this case I have to take into account that I want to pass my split ref dataframe along with the split data dataframe...).
I think the problem lies within the function _apply_df. I thought I understood what it really does, but the line df, func, kwargs = args is still bugging me, and I think I failed to modify it correctly.
Any advice is appreciated!
Take a look at starmap():
starmap(func, iterable[, chunksize])
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
Which seems to be exactly what you need.
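A minimal standalone illustration of that behaviour:

from multiprocessing import Pool

def add(a, b):
    return a + b

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.starmap(add, [(1, 2), (3, 4)]))  # [3, 7]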
I'm posting the answer I came up with for readers who might stumble upon this post:
As noted by @Michele Tonutti, I just had to use starmap() and do a bit of tweaking here and there. The tradeoff is that it only applies my custom function get_gene with the setting axis=1, but there's probably a way to make it more flexible if needed.
def Detect_gene(data):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    ref = reference.groupby(['Chr'])
    df = data.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args = [(df.get_group(chrom), ref.get_group(chrom))
            for chrom in chromosome]
    results = pool.starmap(apply_get_gene, args)
    pool.close()
    pool.join()
    return pd.concat(results)

def apply_get_gene(df, a):
    return df.apply(get_gene, axis=1, ref=a)

def get_gene(row, ref):
    m = (np.equal(row[0], ref.iloc[:, 0].values)
         & np.greater_equal(row[2], ref.iloc[:, 2].values)
         & np.less_equal(row[3], ref.iloc[:, 3].values))
    return ref.index[m] if m.any() else None
It now takes ~5min instead of ~9min with the former version of the code and ~25min without multiprocessing.

How to use a map with *args to unpack a tuple in a python function call

I am currently doing a merge over a set of variables that I'd like to parallelize. My code looks something like this:
mergelist = [
    ('leftfile1', 'rightfile1', 'leftvarname1', 'outputname1'),
    ('leftfile1', 'rightfile1', 'leftvarname2', 'outputname2'),
    ('leftfile2', 'rightfile2', 'leftvarname3', 'outputname3'),
]

def merger(leftfile, rightfile, leftvarname, outvarname):
    do_the_merge

for m in mergelist:
    merger(*m)
Ordinarily, to speed up long loops, I would replace the for m in mergelist with something like....
from multiprocessing import Pool

p = Pool(8)
p.map(merger(m), mergelist)
p.close()
But since I'm using the star to unpack the tuple, it's not clear to me how to map this correctly. How do I get the *m?
Use a lambda:
with Pool(8) as p:
    p.map(lambda m: merger(*m), mergelist)
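Note that the standard multiprocessing Pool pickles the callable it sends to worker processes, so a plain lambda will usually raise a pickling error there. Since the tuples only need to be unpacked, Pool.starmap (Python 3.3+) is a minimal alternative sketch:

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(8) as p:
        p.starmap(merger, mergelist)  # each tuple m is passed as merger(*m)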
You can unpack the tuple in your merge function:
def merger(args):
    if len(args) != 4:
        raise ValueError("expected a 4-tuple")  # error
    leftfile, rightfile, leftvarname, outvarname = args
    do_the_merge
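With this signature each element of mergelist arrives as one tuple, so the plain map call works directly:

from multiprocessing import Pool

with Pool(8) as p:
    p.map(merger, mergelist)  # merger receives each 4-tuple as its single argument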
The other option is to unpack in the argument list:
def merger((leftfile, rightfile, leftvarname, outvarname)):
    do_the_merge
(Note that tuple unpacking in the argument list is Python 2 syntax only; it was removed in Python 3.)
Edit: to address the OP's concerns:
def merger((l, r, v, o)):
    return l + r

for m in mergelist:
    print merger(m)
returns
leftfile1rightfile1
leftfile1rightfile1
leftfile2rightfile2
The simplest solution IMHO is to change the merger function, or add a wrapper:
def merger(leftfile, rightfile, leftvarname, outvarname):
    do_the_merge

def merger_wrapper(wrapper_tuple):
    merger(*wrapper_tuple)

p.map(merger_wrapper, mergelist)
I see @delnan actually also put this solution in the comments.
To add a little value to this :) You could also wrap it like this:
def unpack_wrapper(f):
    def unpack(arg):
        return f(*arg)
    return unpack
This should let you simplify this to
p.map(unpack_wrapper(merger), mergelist)

how to make my own mapping type in python

I have created a class MyClass that contains a lot of simulation data. The class groups simulation results for different simulations that have a similar structure. The results can be retrieved with a MyClass.get(foo) method. It returns a dictionary with simulationID/array pairs, the array being the value of foo for each simulation.
Now I want to implement a method in my class to apply any function to all the arrays for foo. It should return a dictionary with simulationID/function(foo) pairs.
For a function that does not need additional arguments, I found the following solution very satisfying (comments always welcome :-) ):
def apply(self, function, variable):
    result = {}
    for k, v in self.get(variable).items():
        result[k] = function(v)
    return result
However, for a function requiring additional arguments I don't see how to do it in an elegant way. A typical operation would be the integration of foo with bar as x-values, like np.trapz(foo, x=bar), where both foo and bar can be retrieved with MyClass.get(...).
I was thinking in this direction:
def apply(self, function_call):
    """
    function_call should be a string with the complete expression to evaluate
    eg: MyClass.apply('np.trapz(QHeat, time)')
    """
    result = {}
    for SID in self.simulations:
        result[SID] = eval(function_call, locals=...)
    return result
The problem is that I don't know how to pass the locals mapping object. Or maybe I'm looking in the wrong direction. Thanks in advance for your help.
Roel
You have two ways. The first is to use functools.partial:
foo = self.get('foo')
bar = self.get('bar')
callable = functools.partial(func, foo, x=bar)
self.apply(callable, variable)
The second approach is to use the same technique used by partial: you can define a function that accepts an arbitrary argument list:
def apply(self, function, variable, *args, **kwds):
    result = {}
    for k, v in self.get(variable).items():
        result[k] = function(v, *args, **kwds)
    return result
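For example, with the *args variant an extra constant argument is simply appended to every call; here sims is an illustrative instance name and 'QHeat' is the variable from the question:

import numpy as np

# Hypothetical MyClass instance: computes the 95th percentile of 'QHeat'
# for every simulation ID, i.e. np.percentile(v, 95) for each array v.
p95 = sims.apply(np.percentile, 'QHeat', 95)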
Note that in both cases the function signature remains unchanged. I don't know which one I'd choose, maybe the first one, but I don't know the context you are working in.
I tried to recreate (the relevant part of) the class structure the way I am guessing it is set up on your side (it's always handy if you can provide a simplified code example for people to play/test).
What I think you are trying to do is translate variable names into variables that are obtained from within the class, and then use those variables in a function that was passed in as well. In addition, since each variable is actually a dictionary of values keyed by SID, you want the result to be a dictionary of results, with the function applied to each of the arguments.
class test:
    def get(self, name):
        if name == "valA":
            return {"1": "valA1", "2": "valA2", "3": "valA3"}
        elif name == "valB":
            return {"1": "valB1", "2": "valB2", "3": "valB3"}

    def apply(self, function, **kwargs):
        # Translate each keyword's value (a variable name) into its SID/array dict
        arg_dict = {fun_arg: self.get(sim_args) for fun_arg, sim_args in kwargs.items()}
        result = {}
        # The first keyword argument decides which SIDs to iterate over
        for SID in arg_dict[next(iter(kwargs))]:
            fun_kwargs = {fun_arg: sim_dict[SID] for fun_arg, sim_dict in arg_dict.items()}
            result[SID] = function(**fun_kwargs)
        return result

def joinstrings(string_a, string_b):
    return string_a + string_b

my_test = test()
result = my_test.apply(joinstrings, string_a="valA", string_b="valB")
print(result)
So the apply method gets an argument dictionary, gets the class specific data for each of the arguments and creates a new argument dictionary with those (arg_dict).
The SID keys are obtained from this arg_dict and for each of those, a function result is calculated and added to the result dictionary.
The result is:
{'1': 'valA1valB1', '2': 'valA2valB2', '3': 'valA3valB3'}
The code can be altered in many ways, but I thought this would be the most readable. It is of course possible to join the dictionaries instead of using the SIDs from the first element, etc.
