I am filtering some data in a pandas.DataFrame and want to track the rows I lose. So basically, I want to do this:
df = pandas.read_csv(...)
n1 = df.shape[0]
df = ... # some logic that might reduce the number of rows
print(f'Lost {n1 - df.shape[0]} rows')
Now there are multiple of these filter steps, and the code before and after each one is always the same. So I am looking for a way to abstract that away.
Of course, the first thing that comes to mind is decorators. However, I don't like the idea of creating a bunch of functions with just one line of code each.
What I came up with are context managers:
from contextlib import contextmanager

@contextmanager
def rows_lost(df):
    try:
        n1 = df.shape[0]
        yield df
    finally:
        print(f'Lost {n1 - df.shape[0]} rows')
And then:
with rows_lost(df) as df:
    df = ...
I am wondering whether there is a better solution to this?
Edit:
I just realized that the context manager approach does not work if a filter step returns a new object (which is the default for pandas DataFrames). It only works when the object is modified in place.
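One possible way around the rebinding problem, as a minimal sketch: yield a small mutable holder and assign the filtered result back to it, so the `finally` block can see the new object. The names `RowHolder`, `holder.data` and the `report` parameter are hypothetical, and the sketch assumes only that the data supports `len()` (for a DataFrame, `len(df)` is the row count). A plain list stands in for the DataFrame here.

```python
from contextlib import contextmanager


class RowHolder:
    """Mutable box so the with-block can rebind the data it holds."""

    def __init__(self, data):
        self.data = data


@contextmanager
def rows_lost(data, report=print):
    holder = RowHolder(data)
    n_before = len(data)  # len(df) is the row count for a DataFrame
    try:
        yield holder
    finally:
        # Read the holder again: it may now point at a new object.
        report(f"Lost {n_before - len(holder.data)} rows")


rows = [1, 2, 3, 4, 5]  # stand-in for a DataFrame
messages = []
with rows_lost(rows, report=messages.append) as holder:
    holder.data = [r for r in holder.data if r > 2]  # returns a new list
rows = holder.data
```

The cost is one extra attribute access inside the block; in exchange, steps that return a new object are counted correctly.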
You could write a "wrapper-function" that wraps the filter you specify:
def filter1(arg):
    return arg+1

def filter2(arg):
    return arg*2

def wrap_filter(arg, filter_func):
    print('calculating with argument', arg)
    result = filter_func(arg)
    print('result', result)
    return result

wrap_filter(5, filter1)
wrap_filter(5, filter2)
The only advantage this has over a decorator is that you can still choose to call the filter without the wrapper.
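Applied to the row-counting question, the wrapper idea might look like the sketch below. `apply_step`, its `name` label and the `report` callback are hypothetical names, and lists stand in for DataFrames (the only assumption is that the data supports `len()`). Passing each filter as a lambda avoids defining a named one-line function per step.

```python
def apply_step(data, step, name, report=print):
    """Apply one filter step and report how many rows it removed."""
    n_before = len(data)
    result = step(data)
    report(f"{name}: lost {n_before - len(result)} rows")
    return result


log = []
rows = list(range(10))  # stand-in for a DataFrame
rows = apply_step(rows, lambda d: [r for r in d if r % 2 == 0], "even only", log.append)
rows = apply_step(rows, lambda d: [r for r in d if r > 3], "greater than 3", log.append)
```

Because the wrapper returns the result, it works whether the step modifies the data in place or returns a new object.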
Suppose we have a Spark DataFrame of 20 rows. I'm applying a PySpark UDF to each row that performs some expensive calculation.
def expensive_python_function(df, a, b) -> pd.DataFrame:
    return ...

def create_udf(a: Broadcast, b: Broadcast, func: Broadcast) -> Callable:
    def my_udf(df: pd.DataFrame) -> pd.DataFrame:
        result = func.value(df, a.value, b.value)
        result["timestamp"] = datetime.datetime.now()
        return result
    return my_udf

broadcast_func = sparkContext.broadcast(expensive_python_function)
broadcast_a = sparkContext.broadcast(a)
broadcast_b = sparkContext.broadcast(b)
result = sdf.groupby(*groups).applyInPandas(
    create_udf(broadcast_a, broadcast_b, broadcast_func),
    schema=schema
)
result.show()
To clarify, each unique group in the groupby will result in a dataframe of one row.
The variables a and b are used by each executor and are the same for all of them. I am accessing the variables in my_udf using broadcast_a.value.
Problem
This operation results in 2 partitions and thus 2 tasks. Both tasks are executed on a single (the same) executor. Obviously that is not what I want; I would like each task to run on a separate executor in parallel.
What I tried
I repartitioned the dataframe into 20 partitions and used persist to cache it in memory.
sdf = sdf.repartition(20).persist()
result = sdf.groupby(*groups).applyInPandas(
    create_udf(broadcast_a, broadcast_b, broadcast_func),
    schema=schema
)
result.show()
This indeed gives me 20 partitions and 20 tasks to be completed. However, of the 10 executors, still only 1 is active.
I tried:
setting spark.executor.cores explicitly to 1
setting spark.sql.shuffle.partitions to 20
I also noticed that each executor does contain an RDD block, which puzzles me as well.
Question
It seems to me like the Spark driver is deciding for me that all jobs can be run on one executor, which makes sense from a big data point of view. I realize that Spark is not exactly intended for my use case; I'm testing whether, and what kind of, speedup I can achieve as opposed to using something like Python multiprocessing.
Is it possible to force each task to be run on a separate executor, regardless of the size of the data or the nature of the task?
I'm using Python 3.9 and Spark 3.2.1
So, the solution lay in not using the DataFrame API. Working with RDDs seems to give you much more control.
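from dataclasses import dataclass
from typing import Callable

params = [(1,2), (3,4), (5,6)]

@dataclass
class Task:
    func: Callable
    a: int
    b: int

def run_task(task: Task):
    return task.func(task.a, task.b)

# `spark` must be a SparkContext here; a SparkSession exposes it as spark.sparkContext
data = spark.parallelize(
    [Task(expensive_python_function, a, b) for a, b in params],
    len(params)  # one partition per task
)
result = data.map(run_task)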
This returns an RDD, so you need to convert it to a DataFrame, or call collect() to get the results.
To be sure, I also set spark.default.parallelism = str(len(params)) and spark.executor.instances = str(len(params)). I believe the parallelism setting should not be necessary, as you are basically passing the same number to spark.parallelize as well.
Hope it helps someone!
I have multiprocessing code in which each process has to analyse the same data differently.
I have implemented:
with concurrent.futures.ProcessPoolExecutor() as executor:
    res = executor.map(goal_fcn, p, [global_DataFrame], [global_String])
    for f in concurrent.futures.as_completed(res):
        fp = res
and the function:
def goal_fcn(x, DataFrame, String):
    return heavy_calculation(x, DataFrame, String)
The problem is that goal_fcn is called only once, while it should be called multiple times.
In the debugger, I checked how the variable p looks, and it has multiple columns and rows. Inside goal_fcn, the variable x holds only the first row, which looks right.
But the function is called only once. There is no error; the code just moves on to the next steps.
Even if I change the variable to p = [1,3,4,5] (and adapt the code accordingly), goal_fcn is executed only once.
I have to use map() because the order of inputs and outputs must be preserved.
map works like zip: it stops as soon as the shortest input sequence is exhausted. Your [global_DataFrame] and [global_String] lists have one element each, so that is where map ends.
There are two ways around this:
Use itertools.product. This is the equivalent of running "for all data frames, for all strings, for all p". Something like this:
def goal_fcn(x_DataFrame_String):
    x, DataFrame, String = x_DataFrame_String
    ...
executor.map(goal_fcn, itertools.product(p, [global_DataFrame], [global_String]))
Bind the fixed arguments instead of abusing the sequence arguments.
def goal_fcn(x, DataFrame, String):
    pass
bound = functools.partial(goal_fcn, DataFrame=global_DataFrame, String=global_String)
executor.map(bound, p)
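The truncation, and one further alternative not mentioned above (itertools.repeat), can be reproduced in a small sketch. It uses ThreadPoolExecutor instead of ProcessPoolExecutor only so that it runs anywhere without pickling concerns; Executor.map consumes its iterables the same way in both. The `frame` and `label` strings are hypothetical stand-ins for the shared DataFrame and string.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def goal_fcn(x, frame, label):
    # Stand-in for the heavy calculation: just echo the arguments.
    return (x, frame, label)


p = [1, 3, 4, 5]
frame, label = "frame", "label"  # stand-ins for the shared objects

with ThreadPoolExecutor() as executor:
    # One-element lists end the mapping after the first call.
    truncated = list(executor.map(goal_fcn, p, [frame], [label]))
    # repeat() never runs out, so p alone decides the number of calls.
    full = list(
        executor.map(goal_fcn, p, itertools.repeat(frame), itertools.repeat(label))
    )
```

Results come back in input order in both cases, which is the property that made map() a requirement in the question.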
I've got the following function that allows me to do some comparison between the rows of two dataframes (data and ref) and return the index of both rows if there's a match.
def get_gene(row):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
Since this process takes time (25 min for 1.6M rows in data versus 20K rows in ref), I tried to speed things up by parallelizing the computation. As pandas doesn't support multiprocessing natively, I used this piece of code that I found on SO, and it worked fine with my function get_gene.
def _apply_df(args):
    df, func, kwargs = args
    return df.apply(func, **kwargs)

def apply_by_multiprocessing(df, func, **kwargs):
    workers = kwargs.pop('workers')
    pool = multiprocessing.Pool(processes=workers)
    result = pool.map(_apply_df, [(d, func, kwargs) for d in np.array_split(df, workers)])
    pool.close()
    df = pd.concat(list(result))
    return df
It brought the computation down to 9 min. But, if I understood correctly, this code just splits my dataframe data into 4 pieces and sends one to each core of the CPU. Hence, each core ends up comparing 400K rows (data split in 4) against 20K rows (ref).
What I actually want to do is split both dataframes based on a value in one of their columns, so that I only compute comparisons between dataframes of the same 'group':
data.get_group(['a']) versus ref.get_group(['a'])
data.get_group(['b']) versus ref.get_group(['b'])
data.get_group(['c']) versus ref.get_group(['c'])
etc...
which would reduce the amount of computation to do. Each row in data would only be able to be matched against ~3K rows in ref, instead of all 20K rows.
Therefore, I tried to modify the code above but I couldn't manage to make it work.
def apply_get_gene(df, func, **kwargs):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    reference = reference.groupby(['Chr'])
    df = df.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args_list = [(df.get_group(chrom), func, kwargs, reference.get_group(chrom)) for chrom in chromosome]
    results = pool.map(_apply_df, args_list)
    pool.close()
    pool.join()
    return pd.concat(results)

def _apply_df(args):
    df, func, kwarg1, kwarg2 = args
    return df.apply(func, **kwargs)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
I'm pretty sure it has to do with how *args and **kwargs are passed through the different functions (because in this case I have to take into account that I want to pass my split ref dataframe along with the split data dataframe).
I think the problem lies within the function _apply_df. I thought I understood what it really does, but the line df, func, kwargs = args is still bugging me and I think I failed to modify it correctly.
All advice is appreciated!
Take a look at starmap():
starmap(func, iterable[, chunksize])
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
Which seems to be exactly what you need.
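As a minimal sketch of that unpacking behaviour, the example below uses multiprocessing.pool.ThreadPool, which offers the same starmap() method as multiprocessing.Pool but runs in threads, so it works without a `__main__` guard. The function `count_matches` and the toy groups are hypothetical stand-ins for a per-group comparison.

```python
from multiprocessing.pool import ThreadPool


def count_matches(data_rows, ref_rows):
    # Stand-in for a per-group comparison: count data rows present in ref.
    return sum(1 for row in data_rows if row in ref_rows)


# Each tuple is unpacked into (data_rows, ref_rows) for one call.
args = [([1, 2, 3], [2, 3]), ([4, 5], [6])]

with ThreadPool(2) as pool:
    results = pool.starmap(count_matches, args)
```

With a process pool the call looks the same, but the function and its arguments must be picklable.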
I am posting the answer I came up with for readers who might stumble upon this post:
As noted by @Michele Tonutti, I just had to use starmap() and do a bit of tweaking here and there. The tradeoff is that it applies only my custom function get_gene with the setting axis=1, but there's probably a way to make it more flexible if needed.
def Detect_gene(data):
    reference = pd.read_csv('genomic_positions.csv', index_col=0)
    ref = reference.groupby(['Chr'])
    df = data.groupby(['Chr'])
    chromosome = df.groups.keys()
    workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=workers)
    args = [(df.get_group(chrom), ref.get_group(chrom))
            for chrom in chromosome]
    results = pool.starmap(apply_get_gene, args)
    pool.close()
    pool.join()
    return pd.concat(results)

def apply_get_gene(df, a):
    return df.apply(get_gene, axis=1, ref=a)

def get_gene(row, ref):
    m = np.equal(row[0], ref.iloc[:,0].values) & np.greater_equal(row[2], ref.iloc[:,2].values) & np.less_equal(row[3], ref.iloc[:,3].values)
    return ref.index[m] if m.any() else None
It now takes ~5min instead of ~9min with the former version of the code and ~25min without multiprocessing.
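Part of the gain comes from grouping alone, independent of multiprocessing: each row is only compared against reference entries of its own group. A pandas-free sketch of that idea follows. The dictionary keys `chr`, `start`, `end` and `gene` are hypothetical, and treating columns 2 and 3 of the original frames as interval start and end is an assumption.

```python
from collections import defaultdict


def group_by_chrom(records):
    """Bucket records by their chromosome so lookups stay within one group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["chr"]].append(rec)
    return groups


def match_within_groups(data, ref):
    ref_groups = group_by_chrom(ref)
    matches = []
    for row in data:
        # Only reference entries on the same chromosome are candidates.
        candidates = ref_groups.get(row["chr"], [])
        hits = [
            r["gene"]
            for r in candidates
            if row["start"] >= r["start"] and row["end"] <= r["end"]
        ]
        matches.append(hits or None)
    return matches


data = [
    {"chr": "a", "start": 10, "end": 20},
    {"chr": "b", "start": 5, "end": 8},
    {"chr": "c", "start": 1, "end": 2},
]
ref = [
    {"gene": "g1", "chr": "a", "start": 5, "end": 25},
    {"gene": "g2", "chr": "b", "start": 6, "end": 9},
    {"gene": "g3", "chr": "a", "start": 15, "end": 30},
]
matches = match_within_groups(data, ref)
```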
I have code that looks like this:
if(func_cliche_start(line)):
    a=func_cliche_start(line)
    #... do stuff with 'a' and line here
elif(func_test_start(line)):
    a=func_test_start(line)
    #... do stuff with a and line here
elif(func_macro_start(line)):
    a=func_macro_start(line)
    #... do stuff with a and line here
...
Each of the func_blah_start functions returns either None or a string (based on the input line). I don't like the redundant call to func_blah_start, as it seems like a waste (func_blah_start is "pure", so we can assume no side effects). Is there a better idiom for this type of thing, or is there a better way to do it?
Perhaps I'm wrong (my C is rusty), but I thought that you could do something like this in C:
int a;
if(a=myfunc(input)){ /*do something with a and input here*/ }
Is there a Python equivalent?
Why don't you assign the result of func_cliche_start(line) to the variable a before the if statement?
a = func_cliche_start(line)
if a:
    pass # do stuff with 'a' and line here
The body of the if statement is skipped if func_cliche_start(line) returns None.
You can create a wrapper function to make this work.
def assign(value, lst):
    lst[0] = value
    return value

a = [None]
if assign(func_cliche_start(line), a):
    pass  #... do stuff with 'a[0]' and line here
elif assign(func_test_start(line), a):
    pass  #...
You can just loop through your processing functions, which would be easier and take fewer lines. If you want to do something different in each case, wrap that in a function and call it, e.g.:
for func, proc in [(func_cliche_start, cliche_proc), (func_test_start, test_proc), (func_macro_start, macro_proc)]:
    a = func(line)
    if a:
        proc(a, line)
        break
I think you should put those blocks of code in functions. That way you can use a dispatcher-style approach. If you need to modify a lot of local state, use a class and methods. (If not, just use functions; but I'll assume the class case here.) So something like this:
from itertools import dropwhile

class LineHandler(object):
    def __init__(self, state):
        self.state = state

    def handle_cliche_start(self, line):
        pass  # modify state

    def handle_test_start(self, line):
        pass  # modify state

    def handle_macro_start(self, line):
        pass  # modify state

line_handler = LineHandler(initial_state)
handlers = [line_handler.handle_cliche_start,
            line_handler.handle_test_start,
            line_handler.handle_macro_start]
tests = [func_cliche_start,
         func_test_start,
         func_macro_start]
# list() so the pairs can be reused for every line (zip is a one-shot iterator in Python 3)
handlers_tests = list(zip(handlers, tests))

for line in lines:
    handler_iter = ((h, t(line)) for h, t in handlers_tests)
    handler_filter = ((h, l) for h, l in handler_iter if l is not None)
    handler, line = next(handler_filter, (None, None))
    if handler:
        handler(line)
This is a bit more complex than your original code, but I think it compartmentalizes things in a much more scalable way. It does require you to maintain separate parallel lists of functions, but the payoff is that you can add as many as you want without having to write long if statements -- or calling your function twice! There are probably more sophisticated ways of organizing the above too -- this is really just a roughed-out example of what you could do. For example, you might be able to create a sorted container full of (priority, test_func, handler_func) tuples and iterate over it.
In any case, I think you should consider refactoring this long list of if/elif clauses.
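The container of (priority, test_func, handler_func) tuples mentioned above could be sketched roughly as follows. The helpers `starts_with` and `make_handler`, the prefixes, and the priorities are all hypothetical; the real tests and handlers would take their place.

```python
def starts_with(prefix):
    """Build a test that returns the rest of the line, or None if no match."""
    def test(line):
        return line[len(prefix):].strip() if line.startswith(prefix) else None
    return test


handled = []


def make_handler(kind):
    def handler(value, line):
        handled.append((kind, value))
    return handler


# Sort on the priority only, so functions are never compared on ties.
rules = sorted(
    [
        (2, starts_with("test "), make_handler("test")),
        (1, starts_with("cliche "), make_handler("cliche")),
        (3, starts_with("macro "), make_handler("macro")),
    ],
    key=lambda rule: rule[0],
)


def dispatch(line):
    """Run the first matching handler; each test is called at most once."""
    for _priority, test, handler in rules:
        value = test(line)
        if value is not None:
            handler(value, line)
            return True
    return False


matched = dispatch("macro foo")
unmatched = dispatch("something else")
```

Adding a new case then means adding one tuple rather than another elif branch.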
You could take a list of functions, turn it into a generator and return the first truthy result:
functions = [func_cliche_start, func_test_start, func_macro_start]
functions_gen = (f(line) for f in functions)
a = next((x for x in functions_gen if x), None)
Still seems a little strange, but much less repetition.
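For readers on Python 3.8 or newer, assignment expressions (the `:=` operator, PEP 572) are a direct counterpart of the C idiom from the question. A minimal sketch, in which the three func_*_start bodies are hypothetical stand-ins that return the rest of the line or None:

```python
def func_cliche_start(line):
    return line[7:] if line.startswith("cliche ") else None


def func_test_start(line):
    return line[5:] if line.startswith("test ") else None


def func_macro_start(line):
    return line[6:] if line.startswith("macro ") else None


def classify(line):
    # Each function is called at most once; its result is bound to `a`.
    if (a := func_cliche_start(line)):
        return ("cliche", a)
    elif (a := func_test_start(line)):
        return ("test", a)
    elif (a := func_macro_start(line)):
        return ("macro", a)
    return (None, None)


result = classify("test abc")
no_match = classify("nothing here")
```

As in the original if/elif chain, an empty string counts as no match because the conditions test truthiness.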
I have six possible cases that map to four different results. Instead of using an extended if/else statement, I was wondering if it would be more Pythonic to use a dictionary to call the functions that I would call inside the if/else, as a replacement for a "switch" statement like one might use in C# or PHP.
My switch statement depends on two values which I'm using to build a tuple, which I'll in turn use as the key to the dictionary that will function as my "switch". I will be getting the values for the tuple from two other functions (database calls), which is why I have the example one() and zero() functions.
This is the code pattern I'm thinking of using, which I stumbled on while playing around in the Python shell:
def one():
    #Simulated database value
    return 1

def zero():
    return 0

def run():
    #Shows the correct function ran
    print("RUN")
    return 1

def walk():
    print("WALK")
    return 1

def main():
    switch_dictionary = {}
    #These are the values that I will want to use to decide
    #which functions to use
    switch_dictionary[(0,0)] = run
    switch_dictionary[(1,1)] = walk
    #These are the tuples that I will build from the database
    zero_tuple = (zero(), zero())
    one_tuple = (one(), one())
    #These actually run the functions. In practice I will simply
    #have the one tuple which is dependent on the database information
    #to run the function that I defined before
    switch_dictionary[zero_tuple]()
    switch_dictionary[one_tuple]()
I don't have the actual code written or I would post it here, but I would like to know whether this method is considered a Python best practice. I'm still a Python learner in university, and if this method is a bad habit, then I would like to kick it now before I get out into the real world.
Note, the result of executing the code above is as expected, simply "RUN" and "WALK".
edit
For those of you who are interested, this is how the relevant code turned out. It's being used in a Google App Engine application. You should find the code is considerably tidier than my rough example pattern. It works much better than my prior convoluted if/else tree.
def GetAssignedAgent(self):
    tPaypal = PaypalOrder() #Parent class for this function
    tAgents = []
    Switch = {}
    #These are the different methods for the actions to take
    Switch[(0,0)] = tPaypal.AssignNoAgent
    Switch[(0,1)] = tPaypal.UseBackupAgents
    Switch[(0,2)] = tPaypal.UseBackupAgents
    Switch[(1,0)] = tPaypal.UseFullAgents
    Switch[(1,1)] = tPaypal.UseFullAndBackupAgents
    Switch[(1,2)] = tPaypal.UseFullAndBackupAgents
    Switch[(2,0)] = tPaypal.UseFullAgents
    Switch[(2,1)] = tPaypal.UseFullAgents
    Switch[(2,2)] = tPaypal.UseFullAgents
    #I'm only interested in the number up to 2, which is why
    #I can consider the Switch dictionary to be all options available.
    #The "state" is the current status of the customer agent system
    tCurrentState = (tPaypal.GetNumberofAvailableAgents(),
                     tPaypal.GetNumberofBackupAgents())
    tAgents = Switch[tCurrentState]()
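The same table can also be written as a single dictionary literal. The sketch below is only an illustration under assumptions: the handler functions are hypothetical stand-ins that return labels, and the `clamp` helper assumes the raw counts may exceed 2 and need capping before the lookup (the original code may already guarantee that).

```python
def assign_no_agent():
    return "none"


def use_backup_agents():
    return "backup"


def use_full_agents():
    return "full"


def use_full_and_backup_agents():
    return "full+backup"


# Keys are (available agents, backup agents), each capped at 2.
SWITCH = {
    (0, 0): assign_no_agent,
    (0, 1): use_backup_agents,
    (0, 2): use_backup_agents,
    (1, 0): use_full_agents,
    (1, 1): use_full_and_backup_agents,
    (1, 2): use_full_and_backup_agents,
    (2, 0): use_full_agents,
    (2, 1): use_full_agents,
    (2, 2): use_full_agents,
}


def clamp(count, upper=2):
    """Cap a count to the range covered by the table (assumed behaviour)."""
    return max(0, min(count, upper))


def get_assigned_agents(n_available, n_backup):
    state = (clamp(n_available), clamp(n_backup))
    return SWITCH[state]()
```

Building the dictionary once at module level also avoids recreating it on every call.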
Consider this idiom instead:
>>> def run():
...     print('run')
...
>>> def walk():
...     print('walk')
...
>>> def talk():
...     print('talk')
...
>>> switch={'run':run,'walk':walk,'talk':talk}
>>> switch['run']()
run
I think it is a little more readable than the direction you are heading.
edit
And this works as well:
>>> switch={0:run,1:walk}
>>> switch[0]()
run
>>> switch[max(0,1)]()
walk
You can even use this idiom for a switch / default type structure:
>>> default_value=1
>>> try:
...     switch[49]()
... except KeyError:
...     switch[default_value]()
...
Or (the less readable, more terse):
>>> switch.get(49,switch[default_value])()
walk
edit 2
Same idiom, extended to your comment:
>>> def get_t1():
...     return 0
...
>>> def get_t2():
...     return 1
...
>>> switch={(get_t1(),get_t2()):run}
>>> switch
{(0, 1): <function run at 0x100492d70>}
Readability matters
It is a reasonably common Python practice to dispatch to functions based on a dictionary or sequence lookup.
Given your use of indices for lookup, a list of lists would also work:
switch_list = [[run, None], [None, walk]]
...
switch_list[zero_tuple[0]][zero_tuple[1]]()
What is considered most Pythonic is that which maximizes clarity while meeting other operational requirements. In your example, the lookup tuple doesn't appear to have intrinsic meaning, so the operational intent is being lost in a magic constant. Try to make sure the business logic doesn't get lost in your dispatch mechanism. Using meaningful names for the constants would likely help.
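One way to give the lookup keys meaningful names, sketched with the standard library's Enum. The member names `BOTH_CLEAR` and `BOTH_SET` are hypothetical, since the example does not say what the two values mean; real names should come from the business logic.

```python
from enum import Enum


class State(Enum):
    # Hypothetical names for the two tuples used in the example.
    BOTH_CLEAR = (0, 0)
    BOTH_SET = (1, 1)


def run():
    return "RUN"


def walk():
    return "WALK"


DISPATCH = {
    State.BOTH_CLEAR: run,
    State.BOTH_SET: walk,
}


def dispatch(first, second):
    # State((first, second)) looks the member up by value and raises
    # ValueError for a combination that has no name.
    return DISPATCH[State((first, second))]()
```

An unnamed combination now fails at the point where the state is built, with an error that mentions the offending tuple.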