Pandas groupby count values above threshold - python

I have a groupby question that I can't solve. It is probably simple, but I can't get it to work nicely. I am trying to compute some statistics on a variable with pandas groupby chained with the very handy agg function. I would like to add to the list below a calculation of the number of values above a given threshold.
df = df.groupby(['scenario','Name','year','month'])["Value"].agg([np.min,np.max,np.mean,np.std])
Usually, I compute the number of values above a given threshold as shown below, but I can't find a way to add this to the aggregation function. Do you know how I could do that?
df = df[df > 0].groupby(['scenario','Name','year','month']).count()

Your answer works. Alternatively, you could keep it all on one line and avoid defining a separate function by using a lambda x: instead.
df = df.groupby(["scenario", "Name", "year", "month"])["Value"].agg([np.min, np.max, np.mean, np.std, lambda x: ((x > 0)*1).sum()])
The logic here: (x > 0) returns a boolean Series; * 1 converts the booleans to integers (True = 1, False = 0); .sum() adds up the 1s and 0s within each group, and since every value greater than 0 contributes a 1, the sum counts the values above the threshold.
Running a quick test on the time taken, your solution is faster, but I thought I would give an alternative solution anyway.
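For illustration, here is the boolean trick on its own, using a standalone Series with made-up values:
import pandas as pd

s = pd.Series([-1.5, 0.0, 2.3, 4.1])
print((s > 0) * 1)          # 0, 0, 1, 1 -- booleans converted to integers
print(((s > 0) * 1).sum())  # 2 -- the count of values above the threshold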

I found a solution by creating a function and passing it in the agg function.
def counta(x):
    m = np.count_nonzero(x > 10)
    return m
df = df.groupby(['scenario','Name','year','month'])["Value"].agg([np.min,np.max,np.mean,np.std,counta])
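For reference, a minimal, self-contained sketch of this approach with made-up data; string names are used for the built-in aggregations, which is equivalent to the np.* versions above:
import numpy as np
import pandas as pd

def counta(x):
    return np.count_nonzero(x > 10)

# made-up data just to make the snippet runnable
df = pd.DataFrame({
    'scenario': ['a', 'a', 'a', 'b'],
    'Name': ['x', 'x', 'x', 'x'],
    'year': [2020, 2020, 2020, 2020],
    'month': [1, 1, 1, 1],
    'Value': [5.0, 12.0, 20.0, -3.0],
})

out = (df.groupby(['scenario', 'Name', 'year', 'month'])['Value']
         .agg(['min', 'max', 'mean', 'std', counta]))
print(out)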

Related

Pandas apply function to multiple columns with sliding window

I need to calculate a metric over a sliding window of a DataFrame. If the metric needed just one column, I'd use rolling, but somehow I can't get it to work with two or more columns.
Below is how I calculate the metric using a regular loop.
def mean_squared_error(aa, bb):
    return np.sum((aa - bb) ** 2) / len(aa)

def rolling_metric(df_, col_a, col_b, window, metric_fn):
    result = []
    for i, id_ in enumerate(df_.index):
        if i < (df_.shape[0] - window + 1):
            slice_idx = df_.index[i: i + window - 1]
            slice_a, slice_b = df_.loc[slice_idx, col_a], df_.loc[slice_idx, col_b]
            result.append(metric_fn(slice_a, slice_b))
        else:
            result.append(None)
    return pd.Series(data=result, index=df_.index)
df = pd.DataFrame(data=(np.random.rand(1000, 2)*10).round(2), columns = ['y_true', 'y_pred'] )
%time df2 = rolling_metric(df, 'y_true', 'y_pred', window=7, metric_fn=mean_squared_error)
This takes close to a second for just 1000 rows.
Please suggest a faster, vectorized way to calculate such a metric over a sliding window.
In this specific case:
You can calculate the squared error beforehand and then take its rolling mean with .rolling(...).mean():
df['sq_error'] = (df['y_true'] - df['y_pred'])**2
%time df['sq_error'].rolling(6).mean().dropna()
Please note that in your example the actual window size is 6 (print the slice length to verify, as shown below), which is why I set the window to 6 in my snippet.
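To see the off-by-one, you can check the length of one slice produced by the loop above (same slicing expression, made-up data):
import numpy as np
import pandas as pd

df = pd.DataFrame(data=(np.random.rand(1000, 2) * 10).round(2),
                  columns=['y_true', 'y_pred'])

window = 7
slice_idx = df.index[0: 0 + window - 1]  # the slicing used inside rolling_metric
print(len(slice_idx))  # 6 -- one row short of the intended window of 7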
You can even write it like this:
%time df['y_true'].subtract(df['y_pred']).pow(2).rolling(6).mean().dropna()
In general:
In case you cannot reduce it to a single column, as of pandas 1.3.0 you can use the method='table' parameter to apply the function to the entire DataFrame. This, however, has the following requirements:
This is only implemented when using the numba engine. So, you need to set engine='numba' in apply and have it installed.
You need to set raw=True in apply: this means in your function you will operate on numpy arrays instead of the DataFrame. This is a consequence of the previous point.
Therefore, your computation could be something like this:
WIN_LEN = 6

def mean_sq_err_table(arr, min_window=WIN_LEN):
    if len(arr) < min_window:
        return np.nan
    else:
        return np.mean((arr[:, 0] - arr[:, 1])**2)

df.rolling(WIN_LEN, method='table').apply(mean_sq_err_table, engine='numba', raw=True).dropna()
Because it uses numba, this is also relatively fast.

Pandas: agg() gives me 'Series' objects are mutable, thus they cannot be hashed

I'm trying to agg() a df at the same time I make a subsetting from one of the columns:
indi = pd.DataFrame({"PONDERA":[1,2,3,4], "ESTADO": [1,1,2,2]})
empleo = indi.agg(ocupados = (indi.PONDERA[indi["ESTADO"]==1], sum) )
but I'm getting 'Series' objects are mutable, thus they cannot be hashed
I want to sum the values of "PONDERA" only when "ESTADO" == 1.
Expected output:
   ocupados
0         3
I'm trying to imitate R function summarise(), so I want to do it in one step and agg some other columns too.
In R would be something like:
empleo <- indi %>%
  summarise(poblacion = sum(PONDERA),
            ocupados = sum(PONDERA[ESTADO == 1]))
Is this even the correct approach?
Thank you all in advance.
Generally, agg takes a function as its argument, not a Series itself. In your case, though, it's more beneficial to separate the filtering and the summation.
One of the options would be the following:
empleo = indi.query("ESTADO == 1")[["PONDERA"]].sum()
(Use single square brackets, ["PONDERA"], if you want a plain number instead of a one-element pd.Series.)
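For illustration, the bracket choice changes the return type (same indi frame as in the question):
# double brackets select a one-column DataFrame, so .sum() returns a Series
print(indi.query("ESTADO == 1")[["PONDERA"]].sum())  # PONDERA    3

# single brackets select a Series, so .sum() returns a plain scalar
print(indi.query("ESTADO == 1")["PONDERA"].sum())    # 3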
Another option would be to use loc to filter the dataframe to rows where ESTADO == 1, and sum the values of the PONDERA column:
indi.loc[indi.ESTADO == 1, ['PONDERA']].sum()
Thanks to @Henry's input.
A bit fancy, but the output is exactly the format you want, and the syntax is similar to what you tried:
Use DataFrameGroupBy.agg() instead of DataFrame.agg():
empleo = (indi.loc[indi['ESTADO'] == 1]
              .groupby('ESTADO')
              .agg(ocupados=('PONDERA', sum))
              .reset_index(drop=True)
         )
Result:
print(empleo) gives:
   ocupados
0         3
Here are two different ways you can get the scalar value 3.
option1 = indi.loc[indi['ESTADO'].eq(1),'PONDERA'].sum()
option2 = indi['PONDERA'].where(indi['ESTADO'].eq(1)).sum()
However, your expected output shows this value in a dataframe. To do this, you can create a new dataframe with the desired column name "ocupados".
outputdf = pd.DataFrame({'ocupados':[option1]})
Based on the comment you provided, is this what you are looking for?
(indi.agg(poblacion=("PONDERA", 'sum'),
          ocupados=('PONDERA', lambda x: x.where(indi['ESTADO'].eq(1)).sum())))

Pandas: assign values based on a previously unknown number of conditions

I have a function that receives a DataFrame, and a dictionary of column name, operator, and threshold.
Function looks like:
df = pd.DataFrame(...)
df["passed_thresholds"] = False
threshold_dict = {"height": (operator.lt, 0.7), "width": (operator.gt, 0.1)}
def my_func(df, threshold_dict):
    # return df with "passed_thresholds" set to True for rows that meet the thresholds.
What I want to do is to find all the rows in df that meet the thresholds in threshold_dict and set the "passed_thresholds" column to be True for those rows only. Usually I can do this pretty easily with:
df.loc[(df["height"] < 0.7) & (df["width"] > 0.1), "passed_thresholds"] = True
But the issue here is that I won't know how many elements will be inside threshold_dict or what their values will be. By the way, threshold_dict is flexible and I can change how it looks/works if you have a better idea for it too. For example, maybe passing in an operator function isn't the best idea.
Let us try concat with a loop over the dict, then all:
out = pd.concat([y[0](df[x], y[1]) for x, y in threshold_dict.items()], axis=1).all(1)
df['passed_thresholds'] = out
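In case the one-liner is hard to read, here is a step-by-step sketch of the same idea, with made-up height and width values:
import operator
import pandas as pd

df = pd.DataFrame({"height": [0.5, 0.9, 0.6], "width": [0.2, 0.3, 0.05]})
threshold_dict = {"height": (operator.lt, 0.7), "width": (operator.gt, 0.1)}

# one boolean Series per condition, e.g. operator.lt(df["height"], 0.7)
masks = [op(df[col], threshold) for col, (op, threshold) in threshold_dict.items()]

# a row passes only if every condition is True
df["passed_thresholds"] = pd.concat(masks, axis=1).all(axis=1)
print(df)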

Calculate the mean on a Groupby Object in Pandas after applying .nsmallest(2)

I have a pandas DataFrame which holds the performance results for many athletes. Now I want to group the data by 'BIB#' and 'COURSE', so I write:
grupper = df.groupby(['BIB#', 'COURSE'])
Next, I want to find the two best runs (column = 'FINISH') for each 'BIB#' and 'COURSE', so I write:
x = grupper.apply(lambda x: x.nsmallest(2, 'FINISH'))
This gives me a DataFrame with the two best runs for each 'BIB#' and 'COURSE' group.
Then, I want to calculate the mean of the two best runs for each athlete, per 'BIB#' and 'COURSE', but can't find an appropriate solution. I have tried to apply mean() as in the code below, but that calculates the mean for every column in the dataframe, which is not what I want.
x = grupper.apply(lambda x: x.nsmallest(2, 'FINISH')).mean()
What can I do?
I think you need to call mean inside apply, after nsmallest:
x = grupper['FINISH'].apply(lambda x: x.nsmallest(2).mean())
Your solution should also work like this:
x = grupper.apply(lambda x: x.nsmallest(2, 'FINISH').mean())
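A minimal sketch with made-up results, using the same column names as in the question:
import pandas as pd

# made-up results: three runs per athlete on one course
df = pd.DataFrame({
    'BIB#': [1, 1, 1, 2, 2, 2],
    'COURSE': ['A', 'A', 'A', 'A', 'A', 'A'],
    'FINISH': [52.1, 50.3, 51.0, 49.8, 50.5, 53.2],
})

grupper = df.groupby(['BIB#', 'COURSE'])

# mean of the two fastest runs per (BIB#, COURSE) group
x = grupper['FINISH'].apply(lambda s: s.nsmallest(2).mean())
print(x)  # BIB# 1 -> (50.3 + 51.0) / 2 = 50.65, BIB# 2 -> (49.8 + 50.5) / 2 = 50.15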

Constructing Mode and Corresponding Count Functions Using Custom Aggregation Functions for GroupBy in Dask

So dask has now been updated to support custom aggregation functions for groupby. (Thanks to the dev team and @chmp for working on this!) I am currently trying to construct a mode function and a corresponding count function. Basically, what I envision is that mode returns, for each grouping, a list of the most common values for a specific column (i.e. [4, 1, 2]). Additionally, there is a corresponding count function that returns the number of instances of those values, i.e. 3.
Now I am currently trying to implement this in code. As per the groupby.py file, the parameters for custom aggregations are as follows:
Parameters
----------
name : str
the name of the aggregation. It should be unique, since intermediate
result will be identified by this name.
chunk : callable
a function that will be called with the grouped column of each
partition. It can either return a single series or a tuple of series.
The index has to be equal to the groups.
agg : callable
a function that will be called to aggregate the results of each chunk.
Again the argument(s) will be grouped series. If ``chunk`` returned a
tuple, ``agg`` will be called with all of them as individual positional
arguments.
finalize : callable
an optional finalizer that will be called with the results from the
aggregation.
Here is the provided code for mean:
custom_mean = dd.Aggregation(
    'custom_mean',
    lambda s: (s.count(), s.sum()),
    lambda count, sum: (count.sum(), sum.sum()),
    lambda count, sum: sum / count,
)
df.groupby('g').agg(custom_mean)
I am trying to think of the best way to do this. Currently I have the following functions:
from collections import Counter

def custom_count(x):
    count = Counter(x)
    freq_list = list(count.values())
    max_cnt = max(freq_list)
    total = freq_list.count(max_cnt)
    return count.most_common(total)
custom_mode = dd.Aggregation(
    'custom_mode',
    lambda s: custom_count(s),
    lambda s1: s1.extend(),
    lambda s2: ......
)
However I am getting stuck on understanding how exactly the agg part should be working. Any help on this problem would be appreciated.
Thanks!
Admittedly, the docs are currently somewhat light on detail. Thanks for bringing this issue to my attention. Please let me know if this answer helps and I will contribute an updated version of the docs to dask.
To your question: for a single return value, the different steps of the aggregation are equivalent to:
res = chunk(df.groupby('g')['col'])
res = agg(res.groupby(level=[0]))
res = finalize(res)
In these terms, the mode function could be implemented as follows:
def chunk(s):
    # for the comments, assume only a single grouping column; the
    # implementation can handle multiple group columns.
    #
    # s is a grouped series. value_counts creates a multi-series like
    # (group, value): count
    return s.value_counts()

def agg(s):
    # s is a grouped multi-index series. In .apply the full sub-df will be
    # passed, multi-index and all. Group on the value level and sum the counts.
    # The result of the lambda function is a series. Therefore, the result of
    # the apply is a multi-index series like (group, value): count
    return s.apply(lambda s: s.groupby(level=-1).sum())

    # faster version using pandas internals
    s = s._selected_obj
    return s.groupby(level=list(range(s.index.nlevels))).sum()

def finalize(s):
    # s is a multi-index series of the form (group, value): count. First
    # manually group on the group part of the index. The lambda will receive a
    # sub-series with multi index. Next, drop the group part from the index.
    # Finally, determine the index with the maximum value, i.e., the mode.
    level = list(range(s.index.nlevels - 1))
    return (
        s.groupby(level=level)
        .apply(lambda s: s.reset_index(level=level, drop=True).argmax())
    )

mode = dd.Aggregation('mode', chunk, agg, finalize)
Note that this implementation does not match the DataFrame .mode function in case of ties. This version will return one of the tied values instead of all of them.
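For illustration, a plain-pandas sketch of the difference in tie handling, with made-up values:
import pandas as pd

s = pd.Series([2, 2, 3, 3])
print(s.mode().tolist())          # [2, 3] -- pandas .mode returns every tied value
print(s.value_counts().idxmax())  # a single value (2 or 3) -- as this aggregation would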
The mode aggregation can now be used as in
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({
    'col': [0, 1, 1, 2, 3] * 10,
    'g0': [0, 0, 0, 1, 1] * 10,
    'g1': [0, 0, 0, 1, 1] * 10,
})
ddf = dd.from_pandas(df, npartitions=10)
res = ddf.groupby(['g0', 'g1']).agg({'col': mode}).compute()
print(res)
