Find a period with specific characteristics in a Pandas DataFrame - python

I have a meteorological DataFrame, indexed by TimeStamp, and I want to find all the possible 24-hour periods present in the DataFrame that meet these conditions:
at least 6 hours of Rainfalls with Temperature > 10°C
a minimum of 6 consecutive hours of Relative Humidity > 90%.
The hours taken into consideration may also overlap (a period with 6 hours of both RH > 90 and Rainfalls > 0 is sufficient).
A sample DataFrame with 48 hours can be created by:
import numpy as np
import pandas as pd

df = pd.DataFrame({'TimeStamp': pd.date_range('1/5/2015 00:00:00', periods=48, freq='H'),
                   'Temperature': np.random.choice([11, 12, 13], 48),
                   'Rainfalls': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.2,0.3,0.3,0.3,0.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                   'RelativeHumidity': [95,95,95,95,95,95,80,80,80,80,80,80,80,80,85,85,85,85,85,85,85,85,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80,80]})
df = df.set_index('TimeStamp')
As output I just want the indexes of the TimeStamps at which each period with the mentioned characteristics starts. In the case of the sample df, only the first TimeStamp should be returned.
I have tried to use the df.rolling() function, but I only managed to find the 6 consecutive hours of RH > 90.
Thanks in advance for the help.

I hope I've understood your question correctly. This example finds all groups where Temperature > 10 and RH > 90 with a minimum length of 6, and then prints the first index of each of these groups:
x = (df.Temperature > 10).astype(int) + (df.RelativeHumidity > 90).astype(int)
out = (
    x.groupby((x != x.shift(1)).cumsum().values)
    .apply(lambda x: x.index[0] if (x.iat[0] == 2) and len(x) > 5 else np.nan)
    .dropna()
)
print(out)
Prints:
1 2015-01-05
dtype: datetime64[ns]
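For reference, the grouping key used above is the classic run-length trick: (x != x.shift(1)).cumsum() increments every time the combined condition changes, so each block of consecutive identical values gets its own group label. A minimal, self-contained sketch on a toy Series (not the weather data):
import pandas as pd

x = pd.Series([0, 2, 2, 2, 1, 2, 2])
runs = (x != x.shift(1)).cumsum()
print(runs.tolist())  # [1, 2, 2, 2, 3, 4, 4] -- one label per consecutive run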

Related

Improving performance for a nested for loop iterating over dates

I am looking to learn how to improve the performance of code over a large dataframe (10 million rows). My solution loops over multiple dates (2023-01-10, 2023-01-20, 2023-01-30) for different combinations of category_a and category_b.
The working approach is shown below: it iterates over the dates for each pairing of the two categories by first locating the subset for that pair. I would like to refactor it to see if there is a more efficient approach.
My input (df) looks like:
        date  category_a  category_b  outflow  open  inflow  max  close  buy random_str
0 2023-01-10           4           1        1     0       0   10      0    0          a
1 2023-01-20           4           1        2     0       0   20    nan  nan          a
2 2023-01-30           4           1       10     0       0   20    nan  nan          a
3 2023-01-10           4           2        2     0       0   10      0    0          b
4 2023-01-20           4           2        2     0       0   20    nan  nan          b
5 2023-01-30           4           2        0     0       0   20    nan  nan          b
with 2 pairs, (4, 1) and (4, 2), over the days, and my expected output (results) looks like this:
        date  category_a  category_b  outflow  open  inflow  max  close  buy random_str
0 2023-01-10           4           1        1     0       0   10     -1   23          a
1 2023-01-20           4           1        2    -1      23   20     20   10          a
2 2023-01-30           4           1       10    20      10   20     20  nan          a
3 2023-01-10           4           2        2     0       0   10     -2   24          b
4 2023-01-20           4           2        2    -2      24   20     20    0          b
5 2023-01-30           4           2        0    20       0   20     20  nan          b
I have a working solution using pandas dataframes that takes a subset and then loops over it, but I would like to see how I can improve its performance using perhaps NumPy, Numba, pandas-multiprocessing or Dask. Another great idea was to rewrite it in BigQuery SQL.
I am not sure what the best solution would be and I would appreciate any help in improving the performance.
Minimum working example
The code below generates the input dataframe.
import pandas as pd
import numpy as np
# prepare the input df
df = pd.DataFrame({
    'date': ['2023-01-10', '2023-01-20', '2023-01-30', '2023-01-10', '2023-01-20', '2023-01-30'],
    'category_a': [4, 4, 4, 4, 4, 4],
    'category_b': [1, 1, 1, 2, 2, 2],
    'outflow': [1.0, 2.0, 10.0, 2.0, 2.0, 0.0],
    'open': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'inflow': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'max': [10.0, 20.0, 20.0, 10.0, 20.0, 20.0],
    'close': [0.0, np.nan, np.nan, 0.0, np.nan, np.nan],
    'buy': [0.0, np.nan, np.nan, 0.0, np.nan, np.nan],
    'random_str': ['a', 'a', 'a', 'b', 'b', 'b']
})
df['date'] = pd.to_datetime(df['date'])
# get unique pairs of category_a and category_b in a dictionary
unique_pairs = df.groupby(['category_a', 'category_b']).size().reset_index().rename(columns={0:'count'})[['category_a', 'category_b']].to_dict('records')
unique_dates = np.sort(df['date'].unique())
Using this input dataframe and NumPy, the code below is what I am trying to optimize.
df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using dictionary comprehension
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']
    # subset the dataframe for the pair
    df_subset = df.loc[(df['category_a'] == category_a) & (df['category_b'] == category_b)]
    log.info(f"running for {category_a} and {category_b}")  # 'log' is assumed to be a configured logging.Logger
    # day 0
    df_subset.loc[day_0, 'close'] = df_subset.loc[day_0, 'open'] + df_subset.loc[day_0, 'inflow'] - df_subset.loc[day_0, 'outflow']
    # loop over the single pair using the dates
    for count, date in enumerate(unique_dates[1:], start=1):
        previous_date = unique_dates[count - 1]
        df_subset.loc[date, 'open'] = df_subset.loc[previous_date, 'close']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']
        # check whether the closing value falls short of max; if so, set buy to cover the next period's deficit
        if df_subset.loc[date, 'close'] < df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'max'] - df_subset.loc[date, 'close'] + df_subset.loc[date, 'inflow']
        elif df_subset.loc[date, 'close'] > df_subset.loc[date, 'max']:
            df_subset.loc[previous_date, 'buy'] = 0
        else:
            df_subset.loc[previous_date, 'buy'] = df_subset.loc[date, 'inflow']
        df_subset.loc[date, 'inflow'] = df_subset.loc[previous_date, 'buy']
        df_subset.loc[date, 'close'] = df_subset.loc[date, 'open'] + df_subset.loc[date, 'inflow'] - df_subset.loc[date, 'outflow']
    # store all the dataframes in the container myset
    myset[count_pair] = df_subset

# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
After which we can check that the solution is the same as what we expected.
from pandas.testing import assert_frame_equal
expected = pd.DataFrame({
    'date': [pd.Timestamp('2023-01-10 00:00:00'), pd.Timestamp('2023-01-20 00:00:00'), pd.Timestamp('2023-01-30 00:00:00'),
             pd.Timestamp('2023-01-10 00:00:00'), pd.Timestamp('2023-01-20 00:00:00'), pd.Timestamp('2023-01-30 00:00:00')],
    'category_a': [4, 4, 4, 4, 4, 4],
    'category_b': [1, 1, 1, 2, 2, 2],
    'outflow': [1, 2, 10, 2, 2, 0],
    'open': [0.0, -1.0, 20.0, 0.0, -2.0, 20.0],
    'inflow': [0.0, 23.0, 10.0, 0.0, 24.0, 0.0],
    'max': [10, 20, 20, 10, 20, 20],
    'close': [-1.0, 20.0, 20.0, -2.0, 20.0, 20.0],
    'buy': [23.0, 10.0, np.nan, 24.0, 0.0, np.nan],
    'random_str': ['a', 'a', 'a', 'b', 'b', 'b']
})
# check that the result is the same as expected
assert_frame_equal(result, expected)
SQL to create first table
The solution can also be in SQL; if so, you can use the following code to create the initial table.
I am also busy trying to implement a solution in BigQuery SQL using a user-defined function to keep the logic going. That would be a nice approach to the problem too.
WITH data AS (
  SELECT DATE '2023-01-10' as date, 4 as category_a, 1 as category_b, 1 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-20' as date, 4 as category_a, 1 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-30' as date, 4 as category_a, 1 as category_b, 10 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'a' as random_str
  UNION ALL
  SELECT DATE '2023-01-10' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 10 as max, 0 as close, 0 as buy, 'b' as random_str
  UNION ALL
  SELECT DATE '2023-01-20' as date, 4 as category_a, 2 as category_b, 2 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
  UNION ALL
  SELECT DATE '2023-01-30' as date, 4 as category_a, 2 as category_b, 0 as outflow, 0 as open, 0 as inflow, 20 as max, NULL as close, NULL as buy, 'b' as random_str
)
SELECT
  ROW_NUMBER() OVER (ORDER BY date) as row_num,
  date,
  category_a,
  category_b,
  outflow,
  open,
  inflow,
  max,
  close,
  buy,
  random_str
FROM data
Efficient algorithm
First of all, the complexity of the algorithm can be improved. Indeed, (df['category_a'] == category_a) & (df['category_b'] == category_b) scans the whole dataframe, and this is done for each item of unique_pairs. The running time is O(U R), where U = len(unique_pairs) and R = len(df).
An efficient solution is to perform a groupby, that is, to split the dataframe into groups sharing the same pair of categories. This operation can be done in O(R) time, where R is the number of rows in the dataframe. In practice, Pandas may implement this using a (comparison-based) sort running in O(R log R) time.
Faster access & Conversion to Numpy
Moreover, accessing a dataframe item by item using loc is very slow. Indeed, Pandas needs to locate the column using an internal dictionary, find the row based on the provided date, extract the value at the ith row and jth column, and create a new object to return, not to mention the several checks performed along the way (e.g. types and bounds). On top of that, Pandas introduces significant overhead, partially because its code is typically interpreted with CPython.
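As a rough, hedged illustration of that overhead (exact numbers are machine-dependent), compare label-based per-item access with plain NumPy indexing:
import timeit
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000, dtype=np.float64))
a = s.to_numpy()
print(timeit.timeit(lambda: s.loc[50_000], number=10_000))  # per-item Pandas access
print(timeit.timeit(lambda: a[50_000], number=10_000))      # per-item NumPy access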
A faster solution is to extract the columns ahead of time and to iterate over the rows using integers instead of labels (like dates). The catch is that the order of the sorted dates may not match the order in each dataframe subset. I guess it does for your input dataframe in practice, but if it does not, you can sort each precomputed group by date. I also assume all the dates are present in every subset dataframe (again, if this is not the case, you can correct the result of the groupby). Each column can then be converted to a NumPy array so that access is faster. The result is pure-NumPy code that no longer uses Pandas. Computationally-intensive NumPy code is great since it can often be heavily optimized, especially when the target arrays contain native numerical types.
Here is the implementation so far:
df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using dictionary comprehension
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}
groups = dict(list(df.groupby(['category_a', 'category_b'])))

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']
    # subset the dataframe for the pair
    df_subset = groups[(category_a, category_b)]
    # Extraction of the Pandas columns and conversion to NumPy arrays
    col_open = df_subset['open'].to_numpy()
    col_close = df_subset['close'].to_numpy()
    col_inflow = df_subset['inflow'].to_numpy()
    col_outflow = df_subset['outflow'].to_numpy()
    col_max = df_subset['max'].to_numpy()
    col_buy = df_subset['buy'].to_numpy()
    # day 0
    col_close[0] = col_open[0] + col_inflow[0] - col_outflow[0]
    # loop over the single pair using integer indices
    for i in range(1, len(unique_dates)):
        col_open[i] = col_close[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]
        # check whether the closing value falls short of max; if so, set buy to cover the next period's deficit
        if col_close[i] < col_max[i]:
            col_buy[i-1] = col_max[i] - col_close[i] + col_inflow[i]
        elif col_close[i] > col_max[i]:
            col_buy[i-1] = 0
        else:
            col_buy[i-1] = col_inflow[i]
        col_inflow[i] = col_buy[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]
    # store all the dataframes in the container myset
    myset[count_pair] = df_subset

# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
This code is not only faster, but also a bit easier to read.
Fast execution using Numba
At this point, the general solution would be to use vectorized functions, but that is really not easy to do efficiently (if it is even possible) here due to the loop-carried dependencies and the conditionals. A fast solution is to use a JIT compiler like Numba to generate a very fast implementation. Numba is designed to work efficiently on natively-typed NumPy arrays, so this is the perfect use case. Note that Numba needs the input parameters to have well-defined (native) types. Providing the types manually causes Numba to generate the code eagerly (when the function is defined) instead of lazily (on the first call).
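As a small side illustration of lazy versus eager compilation (a toy function, unrelated to the actual computation):
import numba as nb

@nb.njit                         # lazy: compiled on the first call, for the types it sees
def add_lazy(a, b):
    return a + b

@nb.njit('int64(int64, int64)')  # eager: compiled right away for the declared signature
def add_eager(a, b):
    return a + b

print(add_lazy(1, 2), add_eager(1, 2))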
Here is the final resulting code:
import numba as nb

@nb.njit('(float64[:], float64[:], float64[:], int64[:], int64[:], float64[:], int64)')
def compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, n):
    # Important checks to avoid out-of-bounds accesses that are
    # not checked by Numba for the sake of performance.
    # If they do not hold and are not checked here, the
    # function can simply crash.
    assert col_open.size == n and col_close.size == n
    assert col_inflow.size == n and col_outflow.size == n
    assert col_max.size == n and col_buy.size == n
    # day 0
    col_close[0] = col_open[0] + col_inflow[0] - col_outflow[0]
    # loop over the single pair using integer indices
    for i in range(1, n):
        col_open[i] = col_close[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]
        # check whether the closing value falls short of max; if so, set buy to cover the next period's deficit
        if col_close[i] < col_max[i]:
            col_buy[i-1] = col_max[i] - col_close[i] + col_inflow[i]
        elif col_close[i] > col_max[i]:
            col_buy[i-1] = 0
        else:
            col_buy[i-1] = col_inflow[i]
        col_inflow[i] = col_buy[i-1]
        col_close[i] = col_open[i] + col_inflow[i] - col_outflow[i]


df = df.set_index('date')
day_0 = unique_dates[0]  # first date

# Using dictionary comprehension
list_of_numbers = list(range(len(unique_pairs)))
myset = {key: None for key in list_of_numbers}
groups = dict(list(df.groupby(['category_a', 'category_b'])))

for count_pair, value in enumerate(unique_pairs):
    # pair of category_a and category_b
    category_a = value['category_a']
    category_b = value['category_b']
    # subset the dataframe for the pair
    df_subset = groups[(category_a, category_b)]
    # Extraction of the Pandas columns and conversion to NumPy arrays
    col_open = df_subset['open'].to_numpy()
    col_close = df_subset['close'].to_numpy()
    col_inflow = df_subset['inflow'].to_numpy()
    col_outflow = df_subset['outflow'].to_numpy()
    col_max = df_subset['max'].to_numpy()
    col_buy = df_subset['buy'].to_numpy()
    # Numba-accelerated computation
    compute(col_open, col_close, col_inflow, col_outflow, col_max, col_buy, len(unique_dates))
    # store all the dataframes in the container myset
    myset[count_pair] = df_subset

# make myset into a dataframe
result = pd.concat(myset.values()).reset_index(drop=False)
result
Feel free to change the types of the parameters if they do not match the real-world input data types (e.g. int32 vs int64, or float64 vs int64). Note that you can replace things like float64[:] with float64[::1] if you know the input arrays are contiguous, which is likely the case. This generates faster code.
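For example, a hedged sketch of the eager-typing syntax with contiguous arrays (a toy function, not the compute kernel above; adjust dtypes to your data):
import numba as nb
import numpy as np

@nb.njit('float64[::1](float64[::1], float64)')  # [::1] asserts C-contiguous arrays
def scale(arr, factor):
    out = np.empty_like(arr)
    for i in range(arr.size):
        out[i] = arr[i] * factor
    return out

print(scale(np.array([1.0, 2.0, 3.0]), 2.0))  # [2. 4. 6.]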
Also please note that myset can be a list, since count_pair is an increasing integer. That is simpler and faster, although the dict might still be useful in your real-world code.
Performance results
The Numba function call runs in about 1 µs on my machine, as opposed to 7.1 ms for the initial code. This means the hot part of the code is about 7100 times faster, just on this tiny example. That being said, Pandas takes some time to convert the columns to NumPy arrays, to create the groups and to merge the dataframes. The first of these takes a small constant time that is negligible for large arrays. The two latter operations take more time on bigger input dataframes, and they are actually the main bottleneck on my machine (both take about 1 ms on the small example). Overall, the whole initial code takes 16.5 ms on my machine for the tiny example dataframe, while the new one takes 3.1 ms. This means the code is 5.3 times faster, just for this small input. On bigger input dataframes the speed-up should be significantly better. Finally, please note that df.groupby(['category_a', 'category_b']) was actually already precomputed, so I am not even sure we should include it in the benchmark ;).

Groupby on second level index that respects the previous level

I'm looking to have a two-level index, of which one level is of type datetime and the other one is int. I'd like to resample the time level to 1-minute bins, and bin the int level into intervals of 5.
Currently I've only done the first part, leaving the second level untouched:
x = w.groupby([pd.Grouper(level='time', freq='1min'), pd.Grouper(level=1)]).sum()
The problem is that it's not good to use bins generated from the entire range of the data for pd.cut(), because most of them will be empty. I want to limit the bins to the context of each 5-second interval.
In other words, I want to replace the second argument (pd.Grouper(level=1)) with pd.cut(rows_from_level0, my_bins), where my_bins is an array built from the values of the respective 5-second group, in intervals of 5 (e.g. for [34, 54, 29, 31] -> [30, 35, 40, 45, 50, 55]).
How my_bins is computed can be seen below:
def roundTo(num, base=5):
    return base * round(num/base)

arr_min = roundTo(min(arr))  # arr: the values of the current group
arr_max = roundTo(max(arr))
dif = arr_max - arr_min
my_bins = np.linspace(arr_min, arr_max, dif//5 + 1)
Basically I'm not sure how to make the second level pd.cut aware of the rows from the first level index in order to produce the bins.
One way to go is to extract the level values, do some math, then groupby on that:
N = 5
df.groupby([pd.Grouper(level='datetime', freq='1min'),
            df.index.get_level_values(level=1) // N * N]).sum()
You would get something similar to this:
                          data
datetime            lvl1
2021-01-01 00:00:00 5        9
                    15       1
                    25       4
                    60       9
2021-01-01 00:01:00 5        8
                    25       7
                    85       2
                    90       6
2021-01-01 00:02:00 0        9
                    70       8
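A small self-contained usage sketch (with made-up data roughly along the lines of the question, so names like lvl1 are assumptions) showing the whole pattern end to end:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_arrays(
    [pd.Timestamp('2021-01-01') + pd.to_timedelta(rng.integers(0, 180, 20), unit='s'),
     rng.integers(0, 100, 20)],
    names=['datetime', 'lvl1'])
w = pd.DataFrame({'data': rng.integers(1, 10, 20)}, index=idx)

N = 5
out = w.groupby([pd.Grouper(level='datetime', freq='1min'),
                 w.index.get_level_values(level=1) // N * N]).sum()
print(out)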

Calculate average based on available data points

Imagine I have the following data frame:
  Product  Month 1  Month 2  Month 3  Month 4  Total
  Stuff A        5        0        3        3     11
  Stuff B       10       11        4        8     33
  Stuff C        0        0       23       30     53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
There is a NumPy function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np

df['Average Sales'] = [np.trim_zeros(row, trim='f').mean()
                       for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
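To make the trim='f' behaviour concrete (it drops only the leading, i.e. front, zeros, which is exactly what the question needs for Stuff C):
import numpy as np

row = np.array([0, 0, 23, 30])              # the Stuff C months
print(np.trim_zeros(row, trim='f'))         # [23 30]
print(np.trim_zeros(row, trim='f').mean())  # 26.5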
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First, create a boolean mask using ne (not equal) to zero. Then, use cummax along the rows, which means that once a non-zero value appears, the mask remains True until the end of the row. If a row starts with zeros, the mask stays False until the first non-zero value, then turns True and remains True.
Next, use pd.DataFrame.where to keep only the values where that boolean mask is True; the other values (the leading zeros) become NaN and are not used in the calculation of mean.
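A quick, self-contained look at the intermediate mask may make this clearer:
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]}).set_index(['Product', 'Total'])

mask = df.ne(0).cummax(axis=1)
print(mask)
# Stuff A:  True  True  True  True   (the leading 5 makes every month count)
# Stuff B:  True  True  True  True
# Stuff C: False False  True  True   (leading zeros stay excluded until Month 3)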
If you don't mind it being a little memory-inefficient, you could put your dataframe into a NumPy array. NumPy makes it easy to drop the zeros from an array, and then you can use the mean function to calculate the average. It could look something like this:
import numpy as np

arr = np.array(Stuff_A_DF)          # Stuff_A_DF: the row (or values) you want to average
mean = arr[np.nonzero(arr)].mean()  # keep only the non-zero entries
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.

Moving average 2 day span using Pandas

I have a dataframe, df, that I would like to calculate the moving average for the next 2 days, with a window of 5. I then wish to store the results in a variable within Python:
    date  count
08052020     50
08152020     10
08252020     30
09052020     10
09152020     15
09252020      5
This is what I am doing:
count = df['new'] = df['count'].rolling(2).mean()
print(count)
Any insight is appreciated.
Updated Desired output:
count = [23, 14]
I think you need:
df['new'] = df['count'].rolling(5).mean()
count = df["new"].dropna().to_list()
print(count)
Output:
[23.0, 14.0]
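A fully reproducible sketch of that answer (rebuilding the frame from the question; with 6 rows and a window of 5 there are exactly two complete windows, hence the two values):
import pandas as pd

df = pd.DataFrame({'date': ['08052020', '08152020', '08252020',
                            '09052020', '09152020', '09252020'],
                   'count': [50, 10, 30, 10, 15, 5]})
df['new'] = df['count'].rolling(5).mean()
count = df['new'].dropna().to_list()
print(count)  # [23.0, 14.0]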

Pandas: Calculate Median Based on Multiple Conditions in Each Row

I am trying to calculate median values on the fly based on multiple conditions in each row of a data frame and am not getting there.
Basically, for every row, I am counting the number of people in the same department with rank B with pay greater than the pay listed in that row. I was able to get the count to work properly with a lambda function:
df['B Count'] = df.apply(lambda x: sum(df[(df['Department'] == x['Department']) & (df['Rank'] == 'B')]['Pay'] > x['Pay']), axis=1)
However, I now need to calculate the median for each case satisfying those conditions. So in row x of the data frame, I need the median of df['Pay'] for all others matching x['Department'] and df['Rank'] == 'B'. I can't apply .median() instead of sum(), as that gives me the median count, not the median pay. Any thoughts?
Using the fake data below, the 'B Count' code from above counts the number of B's in each Department with higher pay than each A. That part works fine. What I want is to then construct the 'B Median' column, calculating the median pay of the B's in each Department with higher pay than each A in the same Department.
Person  Department  Rank   Pay  B Count  B Median
     1  One         A     1000        1      1500
     2  One         B      800
     3  One         A      500        2      1150
     4  One         A     3000        0
     5  One         B     1500
     6  Two         B     2000
     7  Two         B     1800
     8  Two         A     1500        3      1800
     9  Two         B     1700
    10  Two         B     1000
Well, I was able to do what I wanted to do with a function:
def median_b(x):
    if x['B Count'] == 0:
        return np.nan
    else:
        return df[(df['Department'] == x['Department']) & (df['Rank'] == 'B') &
                  (df['Pay'] > x['Pay'])]['Pay'].median()

df['B Median'] = df.apply(median_b, axis=1)
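For what it's worth, a self-contained sketch that rebuilds the sample data and runs the function above (note that apply also fills in values for the B rows, which the table in the question leaves blank):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Person': range(1, 11),
    'Department': ['One'] * 5 + ['Two'] * 5,
    'Rank': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'B'],
    'Pay': [1000, 800, 500, 3000, 1500, 2000, 1800, 1500, 1700, 1000],
})
df['B Count'] = df.apply(
    lambda x: sum(df[(df['Department'] == x['Department']) & (df['Rank'] == 'B')]['Pay'] > x['Pay']),
    axis=1)
df['B Median'] = df.apply(median_b, axis=1)
print(df.loc[df['Rank'] == 'A'])  # matches the B Count / B Median values shown above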
Do any of you know of better ways to achieve this result?
