I have a dataset with daily data and am trying to create aggregate statistical summaries based on a 3 calendar month rolling window. So for example, given this dataset:
date amount
0 2015-01-01 100
1 2015-01-05 500
2 2015-02-12 50
3 2015-03-25 50
4 2015-03-04 100
5 2015-04-19 500
6 2015-05-31 50
7 2015-05-01 100
8 2015-06-09 500
9 2015-07-15 50
If I wanted to calculate the kurtosis and standard deviation of amount, I would get the following:
date sd kurtosis
0 2015-01-01 NaN NaN
1 2015-02-01 NaN NaN
2 2015-03-01 171 4.7
3 2015-04-01 189 3.8
4 2015-05-01 171 4.7
5 2015-06-01 213 -5.8
6 2015-07-01 189 3.8
Note that these measures are calculated on the daily values for the current and prior 2 months. Is there a way of solving this using rolling?
You could set the min_periods argument to 1. As described in the docs, its default value is the window size,
and any window containing fewer observations than min_periods produces NaN.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
In your case, you're rolling over a window of size 3. I don't know how you have grouped the days into months, but on the same time scale it would look like this:
df['sd'] = df['amount'].rolling(3, min_periods=1).std()
df['kurtosis'] = df['amount'].rolling(3, min_periods=1).kurt()
Employing rolling() could be difficult here, as you have to look both ahead and backward in your dataframe to get the desired window.
Here is an approach that uses a mask to get the desired window for each index (i.e., the window between the first day of the month two calendar months before the current month and the last day of the current month) and then applies a function to the amount column in that window:
df = df.set_index('date')
df["roll_std"] = [
    df[
        (df.index >= (curr + pd.offsets.MonthBegin(1) - pd.offsets.MonthBegin(3)))
        & (df.index <= (curr + pd.offsets.MonthBegin(1) - pd.offsets.Day(1)))
    ]["amount"].std(ddof=0)
    for curr in df.index
]
df["roll_krt"] = [
    df[
        (df.index >= (curr + pd.offsets.MonthBegin(1) - pd.offsets.MonthBegin(3)))
        & (df.index <= (curr + pd.offsets.MonthBegin(1) - pd.offsets.Day(1)))
    ]["amount"].kurtosis()
    for curr in df.index
]
This will create new columns in the original day-level dataframe. You can then do the final housekeeping to condense it down to month-level and set the first two months to null, if you need to.
df["yr_mon"] = df.index + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
monthly_df = (
    df[["yr_mon", "roll_std", "roll_krt"]]
    .drop_duplicates(subset="yr_mon")
    .sort_values(by="yr_mon")
    .reset_index(drop=True)
)
monthly_df.loc[:1, ["roll_std", "roll_krt"]] = None
monthly_df
# yr_mon roll_std roll_krt
# 0 2015-01-01 NaN NaN
# 1 2015-02-01 NaN NaN
# 2 2015-03-01 171.464282 4.663335
# 3 2015-04-01 188.745861 3.750693
# 4 2015-05-01 171.464282 4.663335
# 5 2015-06-01 213.234026 -5.794877
# 6 2015-07-01 188.745861 3.750693
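For what it's worth, the same three-calendar-month window can also be expressed with month Periods instead of offset arithmetic. This is only a sketch of the same idea; it assumes df still has 'date' as its index and an 'amount' column, as above:
# month Period of each row
months = df.index.to_period('M')
uniq = months.sort_values().unique()
alt = pd.DataFrame({
    "yr_mon": [m.to_timestamp() for m in uniq],
    # rows whose month falls in [m-2, m], i.e. current plus prior 2 calendar months
    "roll_std": [df.loc[(months >= m - 2) & (months <= m), "amount"].std(ddof=0) for m in uniq],
    "roll_krt": [df.loc[(months >= m - 2) & (months <= m), "amount"].kurtosis() for m in uniq],
})
As above, you would still blank out the first two months if you want them as NaN.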
What I have and am trying to do:
A dataframe, with headers: event_id, location_id, start_date, end_date.
An event can only have one location, start and end.
A location can have multiple events, starts and ends, and they can overlap.
The goal here is to be able to say, given any time T, for location X, how many events were there?
E.g.
Given three events, all for location 2:
Event.    Start.       End.
Event 1   2022-05-01   2022-05-07
Event 2   2022-05-04   2022-05-10
Event 3   2022-05-02   2022-05-05
Time T.      Count of Events
2022-05-01   1
2022-05-02   2
2022-05-03   2
2022-05-04   3
2022-05-05   3
2022-05-06   2
**What I have tried so far, but got stuck on:**
(I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the iterrows and how to have that apply here.)
Try to get an array or dataframe that has a 365 day date range for each location ID.
E.g.
[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]
Then convert that array to a dataframe, and merge it with the original dataframe like:
index  location  time        event  location2  start       end
0      1         2022-01-01  1      10         2022-11-07  2022-11-12
1      1         2022-01-01  2      4          2022-02-16  2022-03-05
2      1         2022-01-01  3      99         2022-06-10  2022-06-15
3      1         2022-01-01  4      2          2021-12-31  2022-01-05
4      1         2022-01-01  5      5          2022-05-08  2022-05-22
Then perform some kind of reduction that returns the count:
location  Time        Count
1         2022-01-01  10
1         2022-01-02  3
1         2022-01-03  13
..        ...         ...
99        2022-01-01  4
99        2022-01-02  0
99        2022-01-03  7
99        2022-01-04  12
I've done something similar, tying events to other events where their dates overlapped, using .loc[...], but I don't think that would work here, and I'm kind of just stumped.
Where I got stuck was creating an array that combines the location ID and date range, because they're different lengths, and I couldn't figure out how to repeat the location ID for every date in the range.
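For reference, one way to repeat every location for every date is pd.MultiIndex.from_product; this is a minimal sketch, assuming a location_id column as in the headers listed above:
locations = df['location_id'].unique()
dates = pd.date_range('2022-01-01', '2022-12-31')
# from_product pairs every location with every date in one go
grid = (pd.MultiIndex.from_product([locations, dates], names=['location_id', 'time'])
          .to_frame(index=False))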
Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!!
Thank you :)
Update per comment
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the location of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
                       'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
                                    for date in date_range]})
# get the length of each list, which is the count
new_df['Count'] = new_df['Location'].str.len()
Date Location Count
0 2022-05-01 [Event 1] 1
1 2022-05-02 [Event 1, Event 3] 2
2 2022-05-03 [Event 1, Event 3] 2
3 2022-05-04 [Event 1, Event 2, Event 3] 3
4 2022-05-05 [Event 1, Event 2, Event 3] 3
5 2022-05-06 [Event 1, Event 2] 2
6 2022-05-07 [Event 1, Event 2] 2
7 2022-05-08 [Event 2] 1
8 2022-05-09 [Event 2] 1
9 2022-05-10 [Event 2] 1
IIUC you can try something like
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
# df.le is less than or equal to
# df.ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
6 2022-05-07 2
7 2022-05-08 1
8 2022-05-09 1
9 2022-05-10 1
Depending on how large your date range is, we may need to take a different approach, as things may get slow if you have a range of two years instead of the 10 days in the example.
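If the range does get long, a sorted-search sketch (assuming the 'Start.' and 'End.' columns are already datetime) only scans the frame once instead of once per date:
import numpy as np

starts = np.sort(df['Start.'].values)
ends = np.sort(df['End.'].values)
date_range = pd.date_range(starts.min(), ends.max())
# events active on a date = (# starts <= date) - (# ends < date)
counts = (np.searchsorted(starts, date_range.values, side='right')
          - np.searchsorted(ends, date_range.values, side='left'))
fast_df = pd.DataFrame({'Date': date_range, 'Count': counts})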
You can also use a custom date range if you do not want to use the min and max values from the whole frame:
min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but #It_is_Chris is the real MVP.
Update! - with the enormous help from #It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:
# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# create location range
loc_range = original_df['location'].unique()
# create a new list that combines every date with every location
combined_list = []
for item in date_range:
    for location in loc_range:
        combined_list.append(
            {
                'Date': item,
                'location': location
            }
        )
# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)
# use merge to put original data together with the new dataframe
merged_df = pd.merge(combined_df,original_df, how="left", on="location")
# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[(pd.to_datetime(merged_df['Date'])>=pd.to_datetime(merged_df['start'])) & (pd.to_datetime(merged_df['Date'])<=pd.to_datetime(merged_df['end']))]
# use groupby to push out a table as sought Date - Location - Count
output_merged_df = merged_df.groupby(['Date','location']).size()
The result looked like this:
Note - the sorting was not as I have it here; I believe I would need to add some additional sorting to the dataframe before outputting as a CSV.
Date        location  count
2022-01-01  1         1
2022-01-01  2         4
2022-01-01  3         1
2022-01-01  4         10
2022-01-01  5         3
2022-01-01  6         1
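If the table needs to be flat and sorted before writing to CSV, a small follow-up sketch:
output_merged_df = (merged_df.groupby(['Date', 'location'])
                             .size()
                             .reset_index(name='count')
                             .sort_values(['Date', 'location']))
# 'counts.csv' is just an example filename
output_merged_df.to_csv('counts.csv', index=False)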
I'm looking to add a %Y%m%d date column to my dataframe using a period column that holds integers 1-32, which represent monthly data points starting at a defined environment variable "odate" (e.g. if odate=20190531 then period 1 should be 20190531, period 2 should be 20190630, etc.).
I tried defining a dictionary with the period numbers in the column as the keys and odate + MonthEnd(period - 1) as the values.
This works fine and well; however, I want to improve the code to be flexible given changes in the number of periods.
Is there a function that will allow me to fill the date columns with the odate in period 1 and then subsequent month ends for subsequent periods?
example dataset:
odate=20190531
period value
1 5.5
2 5
4 6.2
3 5
5 40
11 5
desired dataset:
odate=20190531
period value date
1 5.5 2019-05-31
2 5 2019-06-30
4 6.2 2019-08-31
3 5 2019-07-31
5 40 2019-09-30
11 5 2020-03-31
You can use pd.date_range():
pd.date_range(start='2019-05-31', periods=100, freq='M')
You can change the total periods depending on what you need; freq='M' means a month-end frequency.
Here is a list of Offset Aliases you can use for the freq parameter.
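To connect that range back to the period column, a minimal sketch (assuming periods start at 1 and odate is the first month end):
odate = pd.Timestamp('20190531')
# build enough month ends to cover the largest period, then pick by position
month_ends = pd.date_range(start=odate, periods=df['period'].max(), freq='M')
df['date'] = month_ends[df['period'] - 1]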
If you just want to add or subtract some number of periods to a date, you can use pd.DateOffset:
odate = pd.Timestamp('20191031')
odate
>> Timestamp('2019-10-31 00:00:00')
odate - pd.DateOffset(months=4)
>> Timestamp('2019-06-30 00:00:00')
odate + pd.DateOffset(months=4)
>> Timestamp('2020-02-29 00:00:00')
To add month ends based on the period column:
odate = pd.Timestamp('20190531')
df['date'] = df.period.apply(lambda x: odate + pd.offsets.MonthEnd(x-1))
df
period value date
0 1 5.5 2019-05-31
1 2 5.0 2019-06-30
2 4 6.2 2019-08-31
3 3 5.0 2019-07-31
4 5 40.0 2019-09-30
5 11 5.0 2020-03-31
To improve performance use list-comprehension:
df['date'] = [odate + pd.offsets.MonthEnd(period-1) for period in df.period]
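The dictionary approach mentioned in the question also works as a one-off lookup; a quick sketch:
# build the period -> month-end lookup once, then map it onto the column
lookup = {p: odate + pd.offsets.MonthEnd(p - 1) for p in range(1, df['period'].max() + 1)}
df['date'] = df['period'].map(lookup)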
How can I compare each row's "Price" value with the next 2 rows? I want to run a function for every row: if the current price is lower than the price in both of the following 2 hours, I want to assign "Low" to the current row's "Action" column. If the current price is higher than in both of the following 2 hours, assign "High". If the current price is neither the highest nor the lowest of the 3 hours compared, assign "Hold".
So how can I take the Price from each row and compare it to the following 2 rows with Pandas? The dataframe looks like this:
data.head()
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 NaN
1 2018-01-01 1 2643 January 2 NaN
2 2018-01-01 2 2610 January 3 NaN
3 2018-01-01 3 2470 January 4 NaN
4 2018-01-01 4 2474 January 5 NaN
The desired output in this case would look like this:
data.head()
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 Hold
1 2018-01-01 1 2643 January 2 High
2 2018-01-01 2 2610 January 3 High
3 2018-01-01 3 2470 January 4 Low
4 2018-01-01 4 2474 January 5 Hold
Thank you.
edit: this could probably be done with for loops, but I'm sure pandas has a better way to do it
You can use data['Price'].shift(-1) to bring the next row's price into the current row, and data['Price'].shift(-2) to bring the price from 2 rows ahead.
Next you can use boolean indexing with .loc to select the rows where both of the next two prices are higher (or lower) than the current price and fill in the desired value.
See below how this is done:
# Check if the current price is lower than the next 2 rows and assign to the column 'Action' the value 'Low' if this is true
data.loc[(data['Price'].shift(-2)> data['Price']) & (data['Price'].shift(-1) > data['Price']), 'Action'] = 'Low'
# Check if the current price is higher than the next 2 rows and assign to the column 'Action' the value 'High' if this is true
data.loc[(data['Price'].shift(-2)< data['Price']) & (data['Price'].shift(-1) < data['Price']), 'Action'] = 'High'
# fill the rest of the rows with the value Hold
data['Action'] = data['Action'].fillna('Hold')
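One caveat: with this approach the last two rows also end up as 'Hold', because their look-ahead prices are missing. If you would rather leave them empty, a small follow-up sketch:
# blank out rows that don't have two future prices to compare against
data.loc[data['Price'].shift(-2).isna(), 'Action'] = None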
We can write some conditions for this and choose values based on those conditions with np.select. In the conditions we use .shift(-1) and .shift(-2) to compare the current row to the next two rows.
Note: the last two rows will return Unknown since we don't have two more rows of data to compare with, which makes sense.
# Print the extended dataframe which is used
print(df)
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 NaN
1 2018-01-01 1 2643 January 2 NaN
2 2018-01-01 2 2610 January 3 NaN
3 2018-01-01 3 2470 January 4 NaN
4 2018-01-01 4 2474 January 5 NaN
5 2018-01-01 5 2475 January 6 NaN
6 2018-01-01 6 2471 January 7 NaN
Define conditions, choices and apply np.select
import numpy as np

conditions = [
    (df['Price'] > df['Price'].shift(-1)) & (df['Price'] > df['Price'].shift(-2)),
    (df['Price'].between(df['Price'].shift(-1), df['Price'].shift(-2)))
    | (df['Price'].between(df['Price'].shift(-2), df['Price'].shift(-1))),
    (df['Price'] < df['Price'].shift(-1)) & (df['Price'] < df['Price'].shift(-2)),
]
choices = ['High', 'Hold', 'Low']
df['Action'] = np.select(conditions, choices, default='Unknown')
print(df)
Date Time Price Month Hour Action
0 2018-01-01 0 2633 January 1 Hold
1 2018-01-01 1 2643 January 2 High
2 2018-01-01 2 2610 January 3 High
3 2018-01-01 3 2470 January 4 Low
4 2018-01-01 4 2474 January 5 Hold
5 2018-01-01 5 2475 January 6 Unknown
6 2018-01-01 6 2471 January 7 Unknown
I started by creating the source DataFrame, a bit longer than your head():
df = pd.DataFrame(data=[['2018-01-01', 0, 2633, 'January', 1],
                        ['2018-01-01', 1, 2643, 'January', 2],
                        ['2018-01-01', 2, 2610, 'January', 3],
                        ['2018-01-01', 3, 2470, 'January', 4],
                        ['2018-01-01', 4, 2474, 'January', 5],
                        ['2018-01-01', 5, 2475, 'January', 6]],
                  columns=['Date', 'Time', 'Price', 'Month', 'Hour'])
df
The first step is to compute 2 auxiliary columns: P1, the difference between the current price and the next hour's price, and P2, the difference between the current price and the price 2 hours ahead:
df['P1'] = df.Price.diff(-1).fillna(0, downcast='infer')
df['P2'] = df.Price.diff(-2).fillna(0, downcast='infer')
Then we need a function to be applied to each row:
def fn(row):
    if row.P1 < 0 and row.P2 < 0:
        return 'Low'
    elif row.P1 > 0 and row.P2 > 0:
        return 'High'
    else:
        return 'Hold'
And the last step is to compute the new column (applying the above function)
and delete the auxiliary columns:
df['Action'] = df.apply(fn, axis=1)
df.drop(['P1', 'P2'], axis=1, inplace=True)
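If you want to skip apply altogether, the same rule can be vectorized; a sketch using np.select:
import numpy as np

p1 = df.Price.diff(-1).fillna(0)
p2 = df.Price.diff(-2).fillna(0)
# lower than both following prices -> 'Low'; higher than both -> 'High'; otherwise 'Hold'
df['Action'] = np.select([(p1 < 0) & (p2 < 0), (p1 > 0) & (p2 > 0)],
                         ['Low', 'High'], default='Hold')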
Currently working with an interesting transport smart card dataset. Each line in the current data represents a trip (e.g. a bus trip from A to B). Any trips within 60 min need to be grouped into a journey.
The current table:
CustomerID SegmentID Origin Dest StartTime EndTime Fare Type
0 A001 101 A B 7:30am 7:45am 1.5 Bus
1 A001 102 B C 7:50am 8:30am 3.5 Train
2 A001 103 C B 17:10pm 18:00pm 3.5 Train
3 A001 104 B A 18:10pm 18:30pm 1.5 Bus
4 A002 105 K Y 11:30am 12:30pm 3.0 Train
5 A003 106 P O 10:23am 11:13am 4.0 Ferrie
and convert it into something like:
CustomerID JourneyID Origin Dest Start Time End Time Fare Type NumTrips
0 A001 1 A C 7:30am 8:30am 5 Intermodal 2
1 A001 2 C A 17:10pm 18:30pm 5 Intermodal 2
2 A002 6 K Y 11:30am 12:30pm 3 Train 1
3 A003 8 P O 10:23am 11:13am 4 Ferrie 1
I'm new to Python and Pandas and have no idea how to start, so any guidance would be appreciated.
Here's a fairly complete answer. You didn't fully specify the concept of a single journey so I took a guess. You could adjust mask below to better suit your own definition.
# get rid of am/pm and convert to proper datetime
# converts to year 1900 b/c it's not specified, doesn't matter here
df['StTime'] = pd.to_datetime( df.StartTime.str[:-2], format='%H:%M' )
df['EndTime'] = pd.to_datetime( df.EndTime.str[:-2], format='%H:%M' )
# some of the later processing is easier if you use duration
# instead of arrival time
df['Duration'] = df.EndTime-df.StTime
# get rid of some nuisance variables for clarity
df = df[['CustomerID','Origin','Dest','StTime','Duration','Fare','Type']]
First, we need to figure out a way to group the rows. As this is not well specified in the question, I'll group by Customer ID where Start Times are within 1 hr. Note that for tri-modal trips this actually implies that start times of the first and third trips could differ by more than one hour, as long as first+second and second+third are each individually under 1 hour. This seems like a natural way to do it, but for your actual use case you'd have to adjust this for your desired definition. There are quite a few ways you could proceed here.
mask1 = df.StTime - df.StTime.shift(1) <= pd.Timedelta('01:00:00')
mask2 = (df.CustomerID == df.CustomerID.shift(1))
mask = ( mask1 & mask2 )
Now we can use the mask with cumsum to generate a tripID:
df['JourneyID'] = 1
df.loc[mask, 'JourneyID'] = 0
df['JourneyID'] = df['JourneyID'].cumsum()
df['NumTrips'] = 1
df[['CustomerID','StTime','Fare','JourneyID']]
CustomerID StTime Fare JourneyID
0 A001 1900-01-01 07:30:00 1.5 1
1 A001 1900-01-01 07:50:00 3.5 1
2 A001 1900-01-01 17:10:00 3.5 2
3 A001 1900-01-01 18:10:00 1.5 2
4 A002 1900-01-01 11:30:00 3.0 3
5 A003 1900-01-01 10:23:00 4.0 4
Now, for each column just aggregate appropriately:
df2 = df.groupby('JourneyID').agg({'Origin': sum, 'CustomerID': min,
                                   'Dest': sum, 'StTime': min,
                                   'Fare': sum, 'Duration': sum,
                                   'Type': sum, 'NumTrips': sum})
StTime Dest Origin Fare Duration Type CustomerID NumTrips
JourneyID
1 1900-01-01 07:30:00 BC AB 5 00:55:00 BusTrain A001 2
2 1900-01-01 17:10:00 BA CB 5 01:10:00 TrainBus A001 2
3 1900-01-01 11:30:00 Y K 3 01:00:00 Train A002 1
4 1900-01-01 10:23:00 O P 4 00:50:00 Ferrie A003 1
Note that Duration includes only travel time and not the time in-between trips (e.g. if start time of second trip is later than the end time of first trip).
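If you want the output to match the sample more closely (first origin, last destination, a single Type label), named aggregation is one option; a sketch that assumes pandas 0.25 or newer:
df2 = df.groupby('JourneyID').agg(
    CustomerID=('CustomerID', 'first'),
    Origin=('Origin', 'first'),
    Dest=('Dest', 'last'),
    StTime=('StTime', 'min'),
    Fare=('Fare', 'sum'),
    Duration=('Duration', 'sum'),
    NumTrips=('NumTrips', 'sum'),
    # label multi-mode journeys as 'Intermodal', otherwise keep the single mode
    Type=('Type', lambda s: 'Intermodal' if s.nunique() > 1 else s.iloc[0]),
)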
I have a pandas dataframe with columns 'Date' and 'Skew' (a float). I want to average the skew values between every Tuesday and then store the result in a list or dataframe. I tried using a lambda as given in this question Pandas, groupby and summing over specific months, but it only helps to sum over a particular week; I cannot go across weeks, i.e. from one Tuesday to another. Can you show how to do the same?
Here's an example with random data
import numpy as np

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})
min_date = df.Date.min()
start = min_date.dayofweek
if start < 1:
    min_date = min_date - np.timedelta64(6 + start, 'D')
elif start > 1:
    min_date = min_date - np.timedelta64(start - 1, 'D')
df.groupby((df.Date - min_date).dt.days // 7)[['Skew']].mean()
Input:
>>> df
Date Skew
0 2013-01-01 10.082080
1 2013-01-02 10.907402
2 2013-01-03 8.485768
3 2013-01-04 9.221740
4 2013-01-05 10.137910
5 2013-01-06 9.084963
6 2013-01-07 9.457736
7 2013-01-08 10.092777
Output:
Skew
Date
0 9.625371
1 9.993275
2 10.041077
3 9.837709
4 9.901311
5 9.985390
6 10.123757
7 9.782892
8 9.889291
9 9.853204
10 10.190098
11 10.594125
12 10.012265
13 9.278008
14 10.530251
Logic: find the relative week from the first week's Tuesday, group by it, and take each group's (i.e. each week's) mean.
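A resample-based alternative (a sketch): with the weekly anchor 'W-MON', each bin runs from a Tuesday through the following Monday, i.e. from one Tuesday up to (but not including) the next:
# weekly bins labelled by their closing Monday
weekly = df.resample('W-MON', on='Date')['Skew'].mean()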