Parallelizing comparisons between two dataframes with multiprocessing - python

I've got the following function that allows me to do some comparisons between the rows of two dataframes (data and ref) and return the index of both rows if there's a match.
def get_gene(row):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
Since this is a process that takes time (25 min for 1.6M rows in data versus 20K rows in ref), I tried to speed things up by parallelizing the computation. As pandas doesn't support multiprocessing natively, I used this piece of code that I found on SO, and it worked fine with my function get_gene.
def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    df = pd.concat(list(result))
    return df
It allowed me to go down to 9 min of computation. But, if I understood correctly, this code just breaks my dataframe data into 4 pieces and sends one to each core of the CPU. Hence, each core ends up doing comparisons between 400K rows (from data split in 4) and the full 20K rows of ref.
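For reference, a call along these lines is what I'm using (the worker count and the target column name are just illustrative):

# Illustrative call: apply get_gene to each row of `data` across 4 worker processes
# (ref has to be defined at module level so the worker processes can see it)
data['gene_idx'] = apply_by_multiprocessing(data, get_gene, axis=1, workers=4)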
What I would actually want to do is to split both dataframes based on a value in one of their columns, so that I only compute comparisons between dataframes of the same 'group':
data.get_group(['a']) versus ref.get_group(['a'])
data.get_group(['b']) versus ref.get_group(['b'])
data.get_group(['c']) versus ref.get_group(['c'])
etc...
which would reduce the amount of computation: each row in data would only need to be matched against ~3K rows of ref, instead of all 20K rows.
Therefore, I tried to modify the code above but I couldn't manage to make it work.
def apply_get_gene(df, func, **kwargs):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    reference = reference.groupby(['Chr'])
    df = df.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args_list = [(df.get_group(chrom), func, kwargs, reference.get_group(chrom)) for chrom in chromosome]
    results = pool.map(_apply_df, args_list)
    pool.close()
    pool.join()
    return pd.concat(results)

def _apply_df(args):
    df, func, kwarg1, kwarg2 = args
    return df.apply(func, **kwargs)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
I'm pretty sure it has to do with how *args and **kwargs are passed through the different functions (because in this case I have to take into account that I want to pass the split ref dataframe along with the split data dataframe).
I think the problem lies within the function _apply_df. I thought I understood what it really does, but the line df, func, kwargs = args is still bugging me and I think I failed to modify it correctly.
Any advice is appreciated!

Take a look at starmap():
starmap(func, iterable[, chunksize])
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
Which seems to be exactly what you need.
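A minimal sketch of that behaviour (the add function and the pool size are just for illustration):

import multiprocessing

def add(x, y):
    return x + y

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.starmap(add, [(1, 2), (3, 4)]))  # -> [3, 7]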

I'm posting the answer I came up with, for readers who might stumble upon this post:
As noted by @Michele Tonutti, I just had to use starmap() and do a bit of tweaking here and there. The tradeoff is that it only applies my custom function get_gene with the setting axis=1, but there's probably a way to make it more flexible if needed.
def Detect_gene(data):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    ref = reference.groupby(['Chr'])
    df = data.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args = [(df.get_group(chrom), ref.get_group(chrom))
            for chrom in chromosome]
    results = pool.starmap(apply_get_gene, args)
    pool.close()
    pool.join()
    return pd.concat(results)

def apply_get_gene(df, a):
    return df.apply(get_gene, axis=1, ref=a)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
It now takes ~5min instead of ~9min with the former version of the code and ~25min without multiprocessing.

Related

Spark: forcing each task on a separate executor

Suppose we have a Spark DataFrame of 20 rows. I'm applying a PySpark UDF to each row that performs some expensive calculation.
def expensive_python_function(df, a, b) -> pd.DataFrame:
    return ...

def create_udf(a: Broadcast, b: Broadcast, func: Broadcast) -> Callable:
    def my_udf(df: pd.DataFrame) -> pd.DataFrame:
        result = func.value(df, a.value, b.value)
        result["timestamp"] = datetime.datetime.now()
        return result
    return my_udf

broadcast_func = sparkContext.broadcast(expensive_python_function)
broadcast_a = sparkContext.broadcast(a)
broadcast_b = sparkContext.broadcast(b)

result = sdf.groupby(*groups).applyInPandas(
    create_udf(broadcast_a, broadcast_b, broadcast_func),
    schema=schema
)
result.show()
To clarify, each unique group in the groupby will result in a dataframe of one row.
The variables a and b are used by each executor and are the same for all of them. I am accessing the variables in my_udf using broadcast_a.value.
Problem
This operation results in 2 partitions and thus 2 tasks. Both tasks are executed on a single (the same) executor. Obviously that is not what I want; I would like to have each task run on a separate executor in parallel.
What I tried
I repartitioned the dataframe into 20 partitions and used persist() to cache it in memory.
sdf = sdf.repartition(20).persist()

result = sdf.groupby(*groups).applyInPandas(
    create_udf(broadcast_a, broadcast_b, broadcast_func),
    schema=schema
)
result.show()
This indeed gives me 20 partitions and 20 tasks to be completed. However, of the 10 executors, only 1 is active.
I tried:
setting spark.executor.cores explicitly to 1
setting spark.sql.shuffle.partitions to 20
I also noticed that each executor does contain an RDD block, which puzzles me as well.
Question
It seems to me like the Spark driver is deciding for me that all jobs can be run on one executor, which makes sense from a big-data point of view. I realize that Spark is not exactly intended for my use case; I'm testing if and what kind of speedup I can achieve as opposed to using something like Python multiprocessing.
Is it possible to force each task to be run on a separate executor, regardless of the size of the data or the nature of the task?
I'm using Python 3.9 and Spark 3.2.1
So, the solution lay in not using the DataFrame API. Working with RDDs seems to give you much more control.
params = [(1, 2), (3, 4), (5, 6)]

@dataclass
class Task:
    func: Callable
    a: int
    b: int

def run_task(task: Task):
    return task.func(task.a, task.b)

data = sparkContext.parallelize(
    [Task(expensive_python_function, a, b) for a, b in params],
    len(params)
)
result = data.map(run_task)
It will return an RDD, so you need to convert it to a DataFrame, or use collect() to get the result.
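For example, assuming each task returns a pandas DataFrame as in the question, collecting and combining the pieces on the driver could look roughly like this:

import pandas as pd

dfs = result.collect()        # one pandas DataFrame per task
combined = pd.concat(dfs)     # combine locally on the driver
# optionally turn it back into a Spark DataFrame:
# sdf_result = spark.createDataFrame(combined)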
To be sure, I also set spark.default.parallelism = str(len(params)) and spark.executor.instances = str(len(params)). I believe the parallelism setting should not be necessary, as you are basically passing that to parallelize as well.
Hope it helps someone!

Run one function on different CPUs

I have one machine with two CPUs, and each CPU has a different number of cores. I have one function in my Python code. How can I run this function on each of the CPUs?
In this case, I need to run the function two times because I have two CPUs.
I want this because I want to compare the performance of the different CPUs.
This can be part of the code. Please let me know if the code is not written in the correct way.
import multiprocessing

def my_function():
    print("This function needs high computation")
    # Add code of function

pool = multiprocessing.Pool()
jobs = []
for j in range(2):  # how can I make the number of runs depend on the number of CPUs?
    p = multiprocessing.Process(target=my_function)
    jobs.append(p)
    p.start()
I have read many posts, but have not found a suitable answer for my problem.
The concurrent.futures package handles the allocation of resources in an easy way, so that you don't have to specify any particular process/thread IDs, something that is OS-specific anyway.
If you want to run a function using either multiple processes or multiple threads, you can have a class that does it for you:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Generator

class ConcurrentExecutor:

    @staticmethod
    def _concurrent_execution(executor, func, values):
        with executor() as ex:
            if isinstance(values, Generator):
                return list(ex.map(lambda args: func(*args), values))
            return list(ex.map(func, values))

    @staticmethod
    def concurrent_process_execution(func, values):
        return ConcurrentExecutor._concurrent_execution(
            ProcessPoolExecutor, func, values,
        )

    @staticmethod
    def concurrent_thread_execution(func, values):
        return ConcurrentExecutor._concurrent_execution(
            ThreadPoolExecutor, func, values,
        )
Then you can execute any function with it, even with arguments. If it's a single-argument function:
from concurrency import ConcurrentExecutor as concex

# Single-argument function that prints the input
def single_arg_func(arg):
    print(arg)

# Dummy list of 5 different input values
n_values = 5
arg_values = [x for x in range(n_values)]

# We want to run the function concurrently for each value in values
concex.concurrent_thread_execution(single_arg_func, arg_values)
Or with multiple arguments:
from concurrency import ConcurrentExecutor as concex

# Multi-argument function that prints the input
def multi_arg_func(arg1, arg2):
    print(arg1, arg2)

# Dummy list of 5 different input values per argument
n_values = 5
arg1_values = [x for x in range(n_values)]
arg2_values = [2*x for x in range(n_values)]

# Create a generator of combinations of values for the 2 arguments
args_values = ((arg1_values[i], arg2_values[i]) for i in range(n_values))

# We want to run the function concurrently for each value combination
concex.concurrent_thread_execution(multi_arg_func, args_values)
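For the CPU-bound case in the question, the process-based variant of the same class could be used; a sketch, assuming the same concurrency module and noting the __main__ guard that multiprocessing needs:

from concurrency import ConcurrentExecutor as concex

# Module-level function so worker processes can import and pickle it
def square(x):
    return x * x

if __name__ == '__main__':
    # Runs the calls in separate processes instead of threads
    print(concex.concurrent_process_execution(square, [1, 2, 3, 4]))  # -> [1, 4, 9, 16]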

unpacking a dask delayed object of list of tuples

I have a function returning a tuple of two elements. The function is called with pool.starmap to generate a list of tuples, which are unpacked into two lists.
def func():
    # ...some operations
    return (x, y)

def MP_a_func(func, iterable, proc, chunk):
    pool = multiprocessing.Pool(processes=proc)
    Result = pool.starmap(func, iterable, chunksize=chunk)
    pool.close()
    return Result

if __name__ == '__main__':
    results = MP_a_func(func, iterable, proc, chunk)
    a, b = zip(*results)
I now wish to use the dask delayed API as follows:
if __name__ == '__main__':
    results = delayed(MP_a_func(func, iterable, proc, chunk))
Is it possible to unpack the tuples in the delayed object without using results.compute()?
Thank you for your help.
It is possible for another delayed function to unpack the tuple. In the example below, the delayed value of return_tuple(1) is not computed, but passed on as a delayed object:
import dask

@dask.delayed
def return_tuple(x):
    return x + 1, x - 1

@dask.delayed
def process_first_item(some_tuple):
    return some_tuple[0] + 10

result = process_first_item(return_tuple(1))
dask.compute(result)
As per @mdurant's answer, it turns out the delayed function/decorator has an nout parameter; also see this answer.
If you know the number of outputs, the delayed function (or decorator) takes an optional nout argument, and this will split the single delayed into that many delayed outputs. This sounds like exactly what you need.
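A minimal sketch of nout with the same toy function (the numbers are just illustrative):

import dask
from dask import delayed

@delayed(nout=2)
def return_tuple(x):
    return x + 1, x - 1

a, b = return_tuple(1)     # unpacks into two separate Delayed objects, nothing computed yet
print(dask.compute(a, b))  # -> (2, 0)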

Multiprocessing.pool with a function that has multiple args and kwargs

I would like to parallelise a calculation using the multiprocessing.Pool method. The problem is that the function I would like to use in the calculation takes two args and optional kwargs, the first argument being a dataframe, the second one a str, and any kwargs a dictionary.
Both the dataframe and the dictionary I want to use are the same for all the calculations I am trying to carry out; only the second arg keeps changing. I was therefore hoping to be able to pass it as a list of different strings, using the map method, to the function already packed with the df and dict.
from utils import *
import multiprocessing
from functools import partial

def sumifs(df, result_col, **kwargs):
    compare_cols = list(kwargs.keys())
    operators = {}
    for col in compare_cols:
        if type(kwargs[col]) == tuple:
            operators[col] = kwargs[col][0]
            kwargs[col] = list(kwargs[col][1])
        else:
            operators[col] = operator.eq
            kwargs[col] = list(kwargs[col])
    result = []
    cache = {}
    # Go through each value
    for i in range(len(kwargs[compare_cols[0]])):
        compare_values = [kwargs[col][i] for col in compare_cols]
        cache_key = ','.join([str(s) for s in compare_values])
        if (cache_key in cache):
            entry = cache[cache_key]
        else:
            df_copy = df.copy()
            for compare_col, compare_value in zip(compare_cols, compare_values):
                df_copy = df_copy.loc[operators[compare_col](df_copy[compare_col], compare_value)]
            entry = df_copy[result_col].sum()
            cache[cache_key] = entry
        result.append(entry)
    return pd.Series(result)

if __name__ == '__main__':
    ca = read_in_table('Tab1')
    total_consumer_ids = len(ca)

    base = pd.DataFrame()
    base['ID'] = range(1, total_consumer_ids + 1)

    result_col = ['A', 'B', 'C']
    keywords = {'Z': base['Consumer archetype ID']}

    max_number_processes = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=max_number_processes) as pool:
        results = pool.map(partial(sumifs, a=ca, kwargs=keywords), result_col)
    print(results)
However, when I run the code above I get the following error: TypeError: sumifs() missing 1 required positional argument: 'result_col'. How could I provide the function with the first arg and kwargs, while providing the second argument as a list of str so I can parallelise the calculation? I have read several similar questions in the forum but none of the solutions seem to work for this case...
Thank you and apologies if something is not clear, I just learnt of the multiprocessing package today!
Let's have a look at two parts of your code.
First the sumifs function declaration:
def sumifs(df, result_col, **kwargs):
Secondly, the call to this function with the relevant parameters.
# Those are the params
ca = read_in_table('Tab1')
keywords = {'Z': base['Consumer archetype ID']}
# This is the function call
results = pool.map(partial(sumifs, a=ca, kwargs=keywords), tasks)
Update 1:
After the original code has been edited, it looks like the problem is the positional argument assignment; try to discard it.
Replace the line:
results = pool.map(partial(sumifs, a=ca, kwargs=keywords), result_col)
with:
results = pool.map(partial(sumifs, ca, **keywords), result_col)
An example code:
import multiprocessing
from functools import partial

def test_func(arg1, arg2, **kwargs):
    print(arg1)
    print(arg2)
    print(kwargs)
    return arg2

if __name__ == '__main__':
    list_of_args2 = [1, 2, 3]
    just_a_dict = {'key1': 'Some value'}
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(partial(test_func, 'This is arg1', **just_a_dict), list_of_args2)
    print(results)
Will output (the print order from the worker processes may vary):
This is arg1
1
{'key1': 'Some value'}
This is arg1
2
{'key1': 'Some value'}
This is arg1
3
{'key1': 'Some value'}
[1, 2, 3]
More examples of how to use multiprocessing.Pool with a function that has multiple args and kwargs: Multiprocessing.pool with a function that has multiple args and kwargs
Update 2:
Extended example (due to comments):
I wonder however, in the same fashion, if my function had three args and kwargs, and I wanted to keep arg1, arg3 and kwargs constant, how could I pass arg2 as a list for multiprocessing? In essence, how will I indicate to multiprocessing that in map(partial(test_func, 'This is arg1', 'This would be arg3', **just_a_dict), arg2) the second value in partial corresponds to arg3 and not arg2?
The Update 1 code would change as follows:
# The function signature
def test_func(arg1, arg2, arg3, **kwargs):
# The map call
pool.map(partial(test_func, 'This is arg1', arg3='This is arg3', **just_a_dict), list_of_args2)
This can be done using Python's positional and keyword argument assignment.
Note that kwargs is left aside and not assigned using a keyword, despite the fact that it's located after a keyword-assigned value.
More information about argument assignment differences can be found here.
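Putting Update 2 together, a self-contained sketch of the three-argument case (names and values are illustrative):

import multiprocessing
from functools import partial

def test_func(arg1, arg2, arg3, **kwargs):
    return f'{arg1} | {arg2} | {arg3} | {kwargs}'

if __name__ == '__main__':
    list_of_args2 = [1, 2, 3]
    just_a_dict = {'key1': 'Some value'}
    with multiprocessing.Pool(processes=3) as pool:
        # arg1 is fixed positionally, arg3 and the extra kwargs are fixed by keyword,
        # so each mapped value lands in arg2
        results = pool.map(
            partial(test_func, 'This is arg1', arg3='This is arg3', **just_a_dict),
            list_of_args2,
        )
    print(results)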
If there is a piece of data that is constant/fixed across all workers/jobs, then it is better to "initialize" the processes in the pool with this fixed data during the creation of the pool and map over the varying data. This avoids resending the fixed data with every job request. In your case, I'd do something like the following:
df = None
kw = {}

def initialize(df_in, kw_in):
    global df, kw
    df, kw = df_in, kw_in

def worker(data):
    # computation involving df, kw, and data
    ...

...

with multiprocessing.Pool(max_number_processes, initialize, (base, keywords)) as pool:
    pool.map(worker, varying_data)
This gist contains a full blown example of using the initializer. This blog post explains the performance gains from using initializer.
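A minimal, self-contained version of that pattern (the toy fixed_df/fixed_kw stand in for the real dataframe and keyword dict):

import multiprocessing

df = None
kw = {}

def initialize(df_in, kw_in):
    # Runs once in every worker process; stores the fixed data as module globals
    global df, kw
    df, kw = df_in, kw_in

def worker(item):
    # Toy computation combining the fixed data with the varying item
    return sum(df) + kw.get('offset', 0) + item

if __name__ == '__main__':
    fixed_df = [1, 2, 3]
    fixed_kw = {'offset': 10}
    with multiprocessing.Pool(4, initializer=initialize, initargs=(fixed_df, fixed_kw)) as pool:
        print(pool.map(worker, [100, 200, 300]))  # -> [116, 216, 316]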

Repeatedly execute same code before/after statements/code blocks

I am filtering some data in a pandas.DataFrame and want to track the rows I lose. So basically, I want to
df = pandas.read_csv(...)
n1 = df.shape[0]
df = ... # some logic that might reduce the number of rows
print(f'Lost {n1 - df.shape[0]} rows')
Now there are multiple of these filter steps, and the code before/after each one is always the same. So I am looking for a way to abstract that away.
Of course the first thing that comes to mind is decorators; however, I don't like the idea of creating a bunch of functions with just one LOC.
What I came up with are context managers:
from contextlib import contextmanager

@contextmanager
def rows_lost(df):
    try:
        n1 = df.shape[0]
        yield df
    finally:
        print(f'Lost {n1 - df.shape[0]} rows')
And then:
with rows_lost(df) as df:
    df = ...
I am wondering whether there is a better solution to this?
Edit:
I just realized that the context manager approach does not work if a filter step returns a new object (which is the default for pandas DataFrames). It only works when the objects are modified in place.
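A minimal illustration of the issue, assuming a filter that just drops negative rows:

import pandas as pd

df = pd.DataFrame({'value': [1, -2, 3]})
with rows_lost(df) as df:
    df = df[df['value'] >= 0]  # rebinds the name `df` to a new object
# prints "Lost 0 rows": the context manager still holds the original frame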
You could write a "wrapper-function" that wraps the filter you specify:
def filter1(arg):
    return arg + 1

def filter2(arg):
    return arg * 2

def wrap_filter(arg, filter_func):
    print('calculating with argument', arg)
    result = filter_func(arg)
    print('result', result)
    return result

wrap_filter(5, filter1)
wrap_filter(5, filter2)
The only thing this improves over using a decorator is that you can choose to call the filter without the wrapper...
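Adapted to the original DataFrame use case, the same idea could look roughly like this (the drop_negatives filter is just a placeholder):

import pandas as pd

def drop_negatives(df):
    return df[df['value'] >= 0]

def wrap_filter(df, filter_func):
    n_before = df.shape[0]
    result = filter_func(df)
    print(f'Lost {n_before - result.shape[0]} rows')
    return result

df = pd.DataFrame({'value': [1, -2, 3]})
df = wrap_filter(df, drop_negatives)  # prints "Lost 1 rows"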
