Tricky groupby/moving average by date calculation - python

I am having trouble illustrating my problem in the form the data is actually in without complicating things, so bear with me. The following screenshot is for explaining the problem only (i.e. the data is not in this form):
I would like to identify the past 14 days with a number > 0 across all bins (i.e. the total row has a value greater than 0). This would include all days except for days 5 and 12 (highlighted in red). I would then like to sum across bins horizontally for those 14 days (i.e. sum all days except for 5 and 12, by bin), with the goal of ultimately calculating a 14-day average by Bin number.
Note the example above would be for one “Lane”, and my data has > 10,000 of them. The example also only illustrates today being day 16, but I would like to apply this logic to every day in the data set, i.e. on day 20 (along with any other date) it would look at the last 14 days with a value across all bins, then use that date range to aggregate across Bin. This is a screenshot sample of how the data looks:
A simple example using the data as it is structured, with only 3 Bins, 1 Lane, and a 3 data point/date look back:
Lane Date Bin KG
AMS-ORD 2018-08-26 3 10
AMS-ORD 2018-08-29 1 25
AMS-ORD 2018-08-30 2 30
AMS-ORD 2018-09-03 2 20
AMS-ORD 2018-09-04 1 40
Note KG here is a sum. Again, this is for one day (i.e. today), but I would like every date in my data set to follow the same logic. The output would look like the following:
Lane Date Bin KG Average
AMS-ORD 2018-09-04 1 40 13.33
AMS-ORD 2018-09-04 2 50 16.67
AMS-ORD 2018-09-04 3 0 -
I have messed around with .rolling(14).mean(), .tail(), and some others. The problem I have is specifying the correct date range for the correct Bin aggregation.
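For what it's worth, here is a rough, untested sketch of how the logic described above might be expressed with the data as structured (the column names follow the question; the function name and the lookback parameter are mine):

import pandas as pd

def lane_bin_average(lane_df, as_of, lookback=14):
    """For one Lane: take the last `lookback` dates on or before `as_of` whose
    total KG across all Bins is > 0, then average KG per Bin over that window."""
    daily_totals = lane_df.groupby('Date')['KG'].sum()
    active = daily_totals[(daily_totals > 0) & (daily_totals.index <= as_of)]
    window = active.index.sort_values()[-lookback:]
    in_window = lane_df[lane_df['Date'].isin(window)]
    # Bins with no KG in the window will be absent from the result; reindex
    # over all Bin numbers with fill_value=0 if they should appear as 0.
    return in_window.groupby('Bin')['KG'].sum() / lookback

Applying something like this per Lane and per date (e.g. inside a groupby('Lane') loop) would then reproduce the table above for every day in the data set.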

Related

Data preparation to convert two date field in one

I have two date columns, each with a corresponding dollar amount in another column. I want to plot them in a single chart, but that requires some data preparation in Python.
Actual table:
StartDate   start$   EndDate   End$
5 June      500      7 June    300
7 June      600      10 June   550
8 June      900      10 June   600
Expected table:
PythonDate   start$   End$
5 June       500      0
6 June       0        0
7 June       600      300
8 June       900      0
9 June       0        0
10 June      0        1150
Any solution in Python?
I can suggest a basic logic and you can figure out how to implement it. It's not difficult, and it will be good learning too:
Read only the subset of columns you need from the input table as a single dataframe. Build two such dataframes, filling the column that is missing with 0, and then append them together.
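A minimal sketch of that suggested approach, assuming the dates are parsed into proper datetimes (the year here is made up, and the reindex step fills the missing calendar days with 0):

import pandas as pd

# sample frame mirroring the "Actual table" (the year is assumed)
df = pd.DataFrame({
    'StartDate': pd.to_datetime(['2021-06-05', '2021-06-07', '2021-06-08']),
    'start$':    [500, 600, 900],
    'EndDate':   pd.to_datetime(['2021-06-07', '2021-06-10', '2021-06-10']),
    'End$':      [300, 550, 600],
})

# one frame keyed on StartDate, with End$ filled as 0
starts = df[['StartDate', 'start$']].rename(columns={'StartDate': 'Date'})
starts['End$'] = 0

# one frame keyed on EndDate, with start$ filled as 0
ends = df[['EndDate', 'End$']].rename(columns={'EndDate': 'Date'})
ends['start$'] = 0

# append the two, sum per calendar day, and fill missing days with 0
out = (pd.concat([starts, ends])
       .groupby('Date')[['start$', 'End$']]
       .sum())
out = out.reindex(pd.date_range(out.index.min(), out.index.max()), fill_value=0)
print(out)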

Counting consecutive days of temperature data

So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
lat lon time day ssta
5940 24.125 262.375 1984-06-03 3 -1.233751
21072 24.125 262.375 1984-06-04 4 -1.394495
19752 24.125 262.375 1984-06-05 5 -1.379742
10223 24.125 262.375 1984-06-27 27 -1.276407
47355 24.125 262.375 1984-06-28 28 -1.840763
... ... ... ... ... ...
16738 30.875 278.875 2015-06-30 30 -1.345640
3739 30.875 278.875 2020-06-16 16 -1.212824
25335 30.875 278.875 2020-06-17 17 -1.446407
41891 30.875 278.875 2021-06-01 1 -1.714249
27740 30.875 278.875 2021-06-03 3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TL;DR: I want to identify (and retain the information of) events that span 5+ consecutive days, e.g. day 3 through day 8, or day 21 through day 30, etc.
I think rather than filtering your original data you should try to do it the pandas way, which in this case means obtaining a series of True/False values depending on your condition.
Your data seems not to include temperatures so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
Will generate a DataFrame with a single column containing random temperatures between 10 and 40 degrees. Notice that I can just work with the auto generated index but you might have to switch it to a column like time or date or something like that using .set_index. Say we are interested in the consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False array with that information. Notice that this format is very useful since we can index with it. E.g. df[is_over_30] will give us the rows of the dataframe for days where the temperature is over 30 deg. Now we wanna shift the True/False values in is_over_30 one spot forward and generate a new series that is true if both are true like so
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here and could write 3 more of those & rolls. But there is a way to write it more concisely.
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that even though the last 4 days can't be consecutively over 30 deg, this might still happen here, since roll shifts the first values into the positions relevant for that. But you can just set the last 4 values to False to resolve this.
is_consecutively_over_30[-4:] = False
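To then pull out the matching rows, the mask can be used to index the frame directly; a small usage sketch continuing from the code above:

# rows that begin a run of at least 5 consecutive days over 30 degrees
spell_starts = df[is_consecutively_over_30]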
You can pull the day ranges of the spells using this approach:
import pandas as pd
import numpy as np

min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    df['last'].iat[-1] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
start_day end_day spell_days
7 5 10 6
16 21 27 7
Also, using date arithmetic, 'day' could represent a serial day number relative to some base date, like 1900-01-01. That way, spells that span month and year boundaries could be handled, and it would be trivial to convert the serial number back to a date using date arithmetic.
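A small sketch of that idea, assuming a 'time' column of datetimes as in the question's data (the base date is arbitrary):

import pandas as pd

# serial day number relative to a base date, so spells can cross month/year boundaries
base = pd.Timestamp('1900-01-01')
df['serial_day'] = (pd.to_datetime(df['time']) - base).dt.days
# converting back: base + pd.to_timedelta(df['serial_day'], unit='D')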

Multiple math operations on timeseries data using groupby

I have a dataframe/series containing hourly sampled data over a couple of years. I'd like to sum the values for each month, then calculate the mean of those monthly totals over all the years.
I can get a multi-index dataframe/series of the totals using:
df.groupby([df.index.year, df.index.month]).sum()
Date & Time Date & Time
2016 3 220.246292
4 736.204574
5 683.240291
6 566.693919
7 948.116766
8 761.214823
9 735.168033
10 771.210572
11 542.314915
12 434.467037
2017 1 728.983901
2 639.787918
3 709.944521
4 704.610437
5 685.729297
6 760.175060
7 856.928659
But I don't know how to then combine the data to get the means.
I might be totally off on the wrong track too. Also not sure I've labelled the question very well.
I think you need the mean per year, i.e. per the first level of the MultiIndex:
df.groupby([df.index.year, df.index.month]).sum().groupby(level=0).mean()
(In older pandas versions this could also be written as .mean(level=0), but that form has since been removed.)
You can use groupby twice, once to get the monthly sum, once to get the mean of monthly sum:
(df.groupby(pd.Grouper(freq='M')).sum()
   .groupby(pd.Grouper(freq='Y')).mean()
)
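For reference, a self-contained sketch of the groupby-twice idea on dummy hourly data (the column name and date range here are made up):

import numpy as np
import pandas as pd

# dummy hourly samples spanning parts of two years
idx = pd.date_range('2016-03-01', '2017-07-31 23:00', freq='h')
df = pd.DataFrame({'value': np.random.rand(len(idx))}, index=idx)

# total per (year, month), then the mean of those monthly totals per year
monthly_totals = df.groupby([df.index.year, df.index.month]).sum()
per_year_mean = monthly_totals.groupby(level=0).mean()
print(per_year_mean)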

Traversing groups of group by object pandas

I need help with a big pandas issue.
As a lot of people asked for the real input and the real desired output in order to answer the question, here it goes:
So I have the following dataframe
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01  1     2                         7                28,57        28,57
2017-01-01  2     1                         7                14,28        42,85
2017-01-01  4     3                         7                42,85        85,7
2017-01-01  10    1                         7                14,28        100
2017-02-02  1     2                         14               14,28        14,28
2017-02-02  2     3                         14               21,42        35,7
2017-02-02  4     4                         14               28,57        64,27
2017-02-02  10    5                         14               35,71        100
2017-03-03  1     3                         17               17,64        17,64
2017-03-03  2     3                         17               17,64        35,28
2017-03-03  4     5                         17               29,41        64,69
2017-03-03  10    6                         17               35,29        100
- The column %_exercises is the value of (cumulative_num_exercises / total_exercises) * 100.
- The column %_exercises_accum is the cumulative sum of %_exercises within each month. (Note that at the end of each month it reaches the value 100.)
- I need to calculate, with this data, the % of users that contributed to reach 50%, 80% and 90% of the total exercises during each month.
- In order to do so, I have thought to create a new column, called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises on each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop, and with two main ifs:
if df.iloc[i]['Date'] == df.iloc[i - 1]['Date']:
    # calculations to determine the percentage or percentages to which a user
    # from the second to the last row of the same month group contributes
    # (the same user can contribute to all the percentages, or to more than one)
    ...
else:
    # calculations to determine to which percentage of exercises the first
    # member of each month group contributes
    ...
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Doing while loops inside the for, because when a user suddenly reaches a big percentage we need to go back over the users in the same month and change their category column value to 50, as they contributed to the 50% but didn't reach it themselves. For instance, in this situation:
Date %_exercises_accum
2017-01-01 1,24
2017-01-01 3,53
2017-01-01 20,25
2017-01-01 55,5
The desired output for the given dataframe at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with the values 508090 or 8090 mean that that user is contributing to:
508090: the 50%, 80% and 90% of total exercises in a month.
8090: the 80% and 90% of exercises in a month.
Does anyone know how can I simplify this for loop by traversing the groups of a group by object?
Thank you very much!
Without a clear sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd reiterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
import pandas as pd

dates = ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
         '2017-02-02', '2017-02-02', '2017-02-02', '2017-02-02',
         '2017-03-03', '2017-03-03', '2017-03-03', '2017-03-03']
df = pd.DataFrame(
    {'date': pd.to_datetime(dates),
     'user': [1, 2, 4, 10, 1, 2, 4, 10, 1, 2, 4, 10],
     'cumulative_num_exercises': [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
     'total_exercises': [7, 7, 7, 7, 14, 14, 14, 14, 17, 17, 17, 17]}
)
df = df.set_index('date')

for idx in df.index.unique():
    hold = df.loc[idx]
    ### YOUR CODE GOES HERE ###
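A minimal sketch of iterating the groups of a groupby object directly, which is what the question title asks about (the per-group category logic is left as a placeholder):

# using df built as above, with 'date' as the index
for date, month_group in df.groupby(level=0):
    # month_group is the sub-DataFrame for one date;
    # the per-month category calculations would go here
    print(date, len(month_group))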

Average hourly week profile for a year excluding weekend days and holidays

With Pandas I have created a DataFrame from an imported .csv file (this file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DatetimeIndex for the dates.
I would like to be able to reformat this data into average hourly week and weekend profile results. With the week profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there are no resampling techniques in Pandas that do this directly, so I used several offset aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) to create monthly and daily profiles.
I was thinking of using the business day frequency to create a new date index of working days, comparing it to my DataFrame's DatetimeIndex for every half hour, and then returning values for working days and weekend days when true or false respectively to create a new dataset, but I am not sure how to do this.
PS: I am just getting into Python and Pandas.
Dummy data (for future reference, more likely to get an answer if you post some in a copy-paste-able form)
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
                             end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
    if date.weekday() in [5, 6]:
        return 'weekend'
    elif date.date() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
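Since the original question asks for an average hourly profile rather than totals, the same grouping with .mean() should give that directly (a minor variation on the code above):

# average value per hour of day, split by weekday / weekend / holiday
profile = df.groupby(['day_group', 'hour'])['a'].mean()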
