Python: Compare data against the 95th percentile of a running window dataset - python

I have a large DataFrame of thousands of rows but only 2 columns. The 2 columns are of the below format:
Dt          Val
2020-01-01  10.5
2020-01-01  11.2
2020-01-01  10.9
2020-01-03  11.3
2020-01-05  12.0
The first column is date and the second column is a value. For each date, there may be zero, one or more values.
What I need to do is the following: compute the 95th percentile based on the 30 days that just passed and check whether the current value is above or below that 95th percentile value. There must, however, be a minimum of 50 values available for the past 30 days.
For example, if a record has date "2020-12-01" and value "10.5", then I first need to see how many values are available for the date range 2020-11-01 to 2020-11-30. If there are at least 50 values over that range, I compute the 95th percentile of those values and compare 10.5 against it. If 10.5 is greater than the 95th percentile value, the result for that record is "Above Threshold". If 10.5 is less than the 95th percentile value, the result is "Below Threshold". If there are fewer than 50 values over the range 2020-11-01 to 2020-11-30, the result for that record is "Insufficient Data".
I would like to avoid running a loop if possible, as looping through thousands of records to process them one by one may be expensive in both resources and time. I hope someone can advise on a simple(r) Python / pandas solution here.

Use rolling on a DatetimeIndex to get the number of values available and the 95th percentile over the last 30 days. Here is an example with a 3-day rolling window:
import datetime
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6]},
                  index=[datetime.date(2020, 10, 1), datetime.date(2020, 10, 1),
                         datetime.date(2020, 10, 2), datetime.date(2020, 10, 3),
                         datetime.date(2020, 10, 3), datetime.date(2020, 10, 4)])
df.index = pd.DatetimeIndex(df.index)
df['number_of_values'] = df['val'].rolling('3D').count()
df['rolling_percentile'] = df['val'].rolling('3D').quantile(0.95, interpolation='nearest')
Then you can simply do your comparison:
# Above Threshold
(df['val'] > df['rolling_percentile']) & (df['number_of_values'] >= 50)
# Below Threshold
(df['val'] < df['rolling_percentile']) & (df['number_of_values'] >= 50)
# Insufficient Data
df['number_of_values'] < 50
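If you want the three outcomes as a single labelled column, here is a minimal sketch using np.select, assuming the two columns computed above; the question does not say how to label exact ties, so they fall through to a default here:

import numpy as np

conditions = [
    df['number_of_values'] < 50,           # checked first, so it takes precedence
    df['val'] > df['rolling_percentile'],
    df['val'] < df['rolling_percentile'],
]
choices = ['Insufficient Data', 'Above Threshold', 'Below Threshold']
df['result'] = np.select(conditions, choices, default='At Threshold')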
To exclude the current date from the window, the closed argument would not work when there is more than one row on a day, so you may instead use rolling apply:
import numpy as np

def f(x, metric):
    # drop the rows that share the current (last) timestamp
    x = x[x.index != x.index[-1]]
    if metric == 'count':
        return len(x)
    elif metric == 'percentile':
        return x.quantile(0.95, interpolation='nearest')
    else:
        return np.nan
df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6]},
                  index=[datetime.date(2020, 10, 1), datetime.date(2020, 10, 1),
                         datetime.date(2020, 10, 2), datetime.date(2020, 10, 3),
                         datetime.date(2020, 10, 3), datetime.date(2020, 10, 4)])
df.index = pd.DatetimeIndex(df.index)
# raw=False (the default) so that f receives a Series with its index
df['count'] = df.rolling('3D')['val'].apply(f, raw=False, args=('count',))
df['percentile'] = df.rolling('3D')['val'].apply(f, raw=False, args=('percentile',))
            val  count  percentile
2020-10-01    1    0.0         NaN
2020-10-01    2    0.0         NaN
2020-10-02    3    2.0         2.0
2020-10-03    4    3.0         3.0
2020-10-03    5    3.0         3.0
2020-10-04    6    3.0         5.0


Upsampling and dividing data in pandas

I am trying to upsample a pandas datetime-indexed dataframe so that the resulting data is equally divided over the new entries.
For instance, let's say I have a dataframe which stores a cost each month, and I want to get a dataframe which summarizes the equivalent costs per day for each month:
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
                    [pd.to_datetime('2023-02-01'), 14]],
                   columns=['time', 'cost'])
        .set_index('time'))
Daily costs are $1 (or whatever currency you like) in January and $0.50 in February; that daily breakdown is my goal.
After a lot of struggle, I managed to obtain the following code snippet, which seems to do what I want:
from datetime import datetime
from dateutil.relativedelta import relativedelta

# add a value to perform a correct resampling
df.loc[df.index.max() + relativedelta(months=1)] = 0
# forward-fill over the right scale,
# then divide each entry by the number of rows in the month
df = (df
      .resample('1d')
      .ffill()
      .iloc[:-1]
      .groupby(lambda x: datetime(x.year, x.month, 1))
      .transform(lambda x: x / x.count()))
However, this is not entirely OK:
using transform forces me to have dataframes with a single column;
I need to hardcode my original frequency several times in different formats (when adding an extra value at the end of the dataframe, and in the groupby), which makes designing a function around it hard;
it only works with an evenly-spaced datetime index (even if that's OK in my case);
it remains complex.
Does anyone have a suggestion to improve that code snippet?
What if we took df's month indices and expanded each into a range of days, dividing df's values by the number of those days and assigning the result to each day, all with list comprehensions (edit: for equally distributed values per day):
import pandas as pd

# initial DataFrame
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
                    [pd.to_datetime('2023-02-01'), 14]],
                   columns=['time', 'cost'])
        .set_index('time'))
# reformat to months
df.index = df.index.strftime('%m-%Y')

df1 = pd.concat(      # concatenate the resulting DataFrames into one
    [pd.DataFrame(    # make a DataFrame from a row in df
        [v / pd.Period(i).days_in_month                 # the month's value divided by its number of days,
         for d in range(pd.Period(i).days_in_month)],   # repeated once per day of the month
        index=pd.date_range(start=i,                    # days range
                            periods=pd.Period(i).days_in_month,
                            freq='D'))
     for i, v in df.iterrows()])  # for each of df's indices and values
df1
Output:
cost
2023-01-01 1.0
2023-01-02 1.0
2023-01-03 1.0
2023-01-04 1.0
2023-01-05 1.0
2023-01-06 1.0
2023-01-07 1.0
2023-01-08 1.0
2023-01-09 1.0
2023-01-10 1.0
2023-01-11 1.0
... ...
2023-02-13 0.5
2023-02-14 0.5
2023-02-15 0.5
2023-02-16 0.5
2023-02-17 0.5
2023-02-18 0.5
2023-02-19 0.5
2023-02-20 0.5
2023-02-21 0.5
2023-02-22 0.5
2023-02-23 0.5
2023-02-24 0.5
2023-02-25 0.5
2023-02-26 0.5
2023-02-27 0.5
2023-02-28 0.5
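If a uniform split is all that is needed, it can also be vectorized across any number of columns without iterrows. A minimal sketch, assuming df still carries its original month-start DatetimeIndex (i.e. before the strftime reformatting above):

import numpy as np
import pandas as pd

days = np.asarray(df.index.days_in_month)   # length of each row's month
daily = df.div(days, axis=0)                # per-day cost for each month
daily = daily.reindex(
    pd.date_range(df.index.min(),
                  df.index.max() + pd.offsets.MonthEnd(0),
                  freq='D'),
    method='ffill')                         # repeat each value over its days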
What could be done to avoid a uniform distribution of daily costs, and to handle multiple columns? Here's an extended df:
# additional columns and a row
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31, 62, 23],
[pd.to_datetime('2023-02-01'), 14, 28, 51],
[pd.to_datetime('2023-03-01'), 16, 33, 21]],
columns=['time', 'cost1', 'cost2', 'cost3']
).set_index("time"))
# reformat to months
df.index = df.index.strftime('%m-%Y')
df
Output:
cost1 cost2 cost3
time
01-2023 31 62 23
02-2023 14 28 51
03-2023 16 33 21
Here's what I came up with for cases where monthly costs should be upsampled into randomized daily costs, inspired by this question. This solution scales to any number of columns and rows:
import numpy as np

df1 = pd.concat(    # concatenate the resulting DataFrames into one
    [pd.DataFrame(  # make a DataFrame from a row in df
        # here we make a Series of random Dirichlet-distributed numbers,
        # with the length of the month and the column's value as the sum
        [pd.Series((np.random.dirichlet(np.ones(pd.Period(i).days_in_month), size=1) * v
                    ).flatten())  # the product is an ndarray that needs flattening
         for v in row],           # for every column value in a row
        # index named after the columns because of the created DataFrame's shape
        index=df.columns
     # transpose and set the proper index
     ).T.set_index(pd.date_range(start=i,
                                 periods=pd.Period(i).days_in_month,
                                 freq='D'))
     for i, row in df.iterrows()])  # iterate over every row
Output:
cost1 cost2 cost3
2023-01-01 1.703177 1.444117 0.160151
2023-01-02 0.920706 3.664460 0.823405
2023-01-03 1.210426 1.194963 0.294093
2023-01-04 0.214737 1.286273 0.923881
2023-01-05 1.264553 0.380062 0.062829
... ... ... ...
2023-03-27 0.124092 0.615885 0.251369
2023-03-28 0.520578 1.505830 1.632373
2023-03-29 0.245154 3.094078 0.308173
2023-03-30 0.530927 0.406665 1.149860
2023-03-31 0.276992 1.115308 0.432090
90 rows × 3 columns
To verify the monthly sums:
df1.groupby(pd.Grouper(freq='M')).agg('sum')
Output:
cost1 cost2 cost3
2023-01-31 31.0 62.0 23.0
2023-02-28 14.0 28.0 51.0
2023-03-31 16.0 33.0 21.0

Pandas dataframe vectorized bucketing/aggregation?

The Task
I have a dataframe that looks like this:
date                 money_spent ($)  meals_eaten  weight
2021-01-01 10:00:00  350              5            140
2021-01-02 18:00:00  250              2            170
2021-01-03 12:10:00  200              3            160
2021-01-04 19:40:00  100              1            150
I want to discretize this so that it "cuts" the rows every $X, giving me statistics on how much is being done for every $X I spend.
So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
first date of the cut
average meals_eaten
minimum weight
maximum weight
So the final table would be two rows like this:
date                 cumulative_spent ($)  meals_eaten  min_weight  max_weight
2021-01-01 10:00:00  600                   3.5          140         170
2021-01-03 12:10:00  300                   2            150         160
My Approach:
My first instinct is to calculate the cumsum() of money_spent (assume the data is sorted by date), then use pd.cut() to make a new column, call it spent_bin, that determines each row's bin.
Note: In this toy example, spent_bin would basically be [0, 500] for the first two rows and (500, 1000] for the last two.
Then it's fairly simple: I do a groupby on spent_bin, then aggregate as follows:
.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max']
})
What I've Tried
import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-05 22:07:00", "money_spent": 100, "meals_eaten": 1, "weight": 150}]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
For some reason, I can't get the cut step to work; my toy code is above. The labels are not cleanly [0, 500], (500, 1000]. Honestly, I'd settle for [350, 500], (500, 800] (these are the actual cumulative-sum values at the edges of the cuts), but I can't even get that to work, even though I'm doing exactly the same thing as the documentation example. Any help with this?
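Side note: an integer second argument to pd.cut requests that many equal-width bins over the data's range, not bins of that width, which is why the labels look odd. A minimal sketch with explicit $500 edges, assuming the toy df above:

import numpy as np

edges = np.arange(0, df.cum_spent.max() + 500, 500)  # [0, 500, 1000] here
print(pd.cut(df.cum_spent, bins=edges))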
Caveats and Difficulties:
It's pretty easy to write this in a loop of course, just doing a while cum_spent < 500:. The problem is that I have millions of rows in my actual dataset; it currently takes 20 minutes to process a single df this way.
There's also a minor issue that sometimes rows will break the interval. When that happens, I want that last row included. This problem is in the toy example where row #2 actually ends at $600 not $500. But it is the first row that ends at or surpasses $500, so I'm including it in the first bin.
A customized function can achieve the cumulative sum with the reset-at-limit behavior:
df['new'] = cumli(df['money_spent'].values, 500)
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date=('date', 'first'),
    meals_eaten=('meals_eaten', 'mean'),
    min_weight=('weight', 'min'),
    max_weight=('weight', 'max')).sort_index(ascending=False)
Out[81]:
date meals_eaten min_weight max_weight
new
1 2021-01-01 3.5 140 170
0 2021-01-03 2.0 150 160
from numba import njit

@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        check = 0
        total += y
        if total >= lim:
            total = 0   # reset once the running total reaches the limit
            check = 1   # flag the row that closes the bucket
        result.append(check)
    return result
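A quick check of the flags on the toy data, and why the reversed cumsum works as the group key (a sketch, assuming the snippet above):

import numpy as np

flags = cumli(np.array([350, 250, 200, 100]), 500)
print(flags)  # [0, 1, 0, 0] -- the 1 marks the row that closes a bucket
# Reversing before cumsum makes every row up to and including a closing
# row share one label, so the groupby buckets rows exactly as required.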

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds on that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another has 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to use to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID",
       "MeanTravelTimeSeconds", "RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat) - 1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within each change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
Output:
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
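As an aside, pandas can generate the same per-key sequence number without the manual loop; a one-line sketch, assuming the two-level (Date, Observation) index built above:

df["SeqNo"] = df.groupby(level=["Date", "Observation"]).cumcount()

cumcount restarts at 0 within each group, which is exactly what the loop computes.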

Pandas: Calculate average of values for a time frame

I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute data from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
Average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018
The next average will be from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, the average for 06.01.2018
The next average will be from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, the average for 07.01.2018
and so on... We increment the day by 1 but calculate each average over the past 5 days counted from that day, including the current date.
For a given day, there are 24 hours * 60 minutes = 1440 data points, so I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with time format [DD.MM.YYYY] (without hh:mm:ss), where Value is the average over the 5 days of data including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate the average of the data from each day back over the past 5 days (including the current date), with the result shown as above.
I tried iterating in a Python loop, but I wanted something better than that, done with Pandas.
Perhaps this will work?
import numpy as np
import pandas as pd

# Create one year of random data spaced evenly in 1-minute intervals.
np.random.seed(0)  # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
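A time-based window is a possible variant: a '5D' offset window also copes with gaps in the minute data, and min_periods reproduces the NaN warm-up above. A sketch, assuming the original two-column df from the setup (before it was reassigned):

avg = (df.set_index('Time')['Value']
         .rolling('5D', min_periods=5*24*60)
         .mean()
         .resample('D')
         .last())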

Maximum Monthly Values whilst retaining the Date at which those values occurred

I have daily rainfall data that looks like the following:
Date Rainfall (mm)
1922-01-01 0.0
1922-01-02 0.0
1922-01-03 0.0
1922-01-04 0.0
1922-01-05 31.5
1922-01-06 0.0
1922-01-07 0.0
1922-01-08 0.0
1922-01-09 0.0
1922-01-10 0.0
1922-01-11 0.0
1922-01-12 9.1
1922-01-13 6.4
I am trying to work out the maximum value for each month for each year, and also what date the maximum value occurred on. I have been using the code:
rain_data.groupby(pd.Grouper(freq = 'M'))['Rainfall (mm)'].max()
This returns the correct maximum values, but it returns the end date of each month rather than the date the maximum event occurred on.
1974-11-30 0.0
I have also tried using .idxmax(), but this also just returns the end-of-month dates.
Any suggestions on how I could get the correct date?
pd.Grouper seems to change the order within groups for Datetime, which breaks the usual trick of .sort_values + .tail. Instead, group on the year and month:
df.sort_values('Rainfall (mm)').groupby([df.Date.dt.year, df.Date.dt.month]).tail(1)
Sample Data + Output
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Date': pd.date_range('1922-01-01', freq='D', periods=100),
                   'Rainfall (mm)': np.random.randint(1, 100, 100)})
df.sort_values('Rainfall (mm)').groupby([df.Date.dt.month, df.Date.dt.year]).tail(1)
# Date Rainfall (mm)
#82 1922-03-24 92
#35 1922-02-05 98
#2 1922-01-03 99
#90 1922-04-01 99
The problem with pd.Grouper is that it creates a DatetimeIndex with an end-of-month frequency, which we don't really need since we're using .apply. It does give you a new index that is nicely sorted by date, though!
(df.groupby(pd.Grouper(key='Date', freq='1M'))
   .apply(lambda x: x.loc[x['Rainfall (mm)'].idxmax()])
   .reset_index(drop=True))
# Date Rainfall (mm)
#0 1922-01-03 99
#1 1922-02-05 98
#2 1922-03-24 92
#3 1922-04-01 99
You can also do it with .drop_duplicates, using the first 7 characters of the date string as the Year-Month:
(df.assign(ym=df.Date.astype(str).str[0:7])
   .sort_values('Rainfall (mm)')
   .drop_duplicates('ym', keep='last')
   .drop(columns='ym'))
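Yet another sketch: pick each month's argmax row directly with groupby + idxmax, assuming Date is a column as in the sample data above:

df.loc[df.groupby(df.Date.dt.to_period('M'))['Rainfall (mm)'].idxmax()]

to_period('M') gives a Year-Month key, and idxmax returns the original row labels of each month's maximum.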
