Efficiently combining groupby, last and count in pandas - python

From a list of logs, I want to get the number of active events at each timestamp for a specific event type.
A sample log input looks like this:
time              id  event
2022-03-01 10:00  1   A
2022-03-01 11:00  2   B
2022-03-01 12:00  3   A
2022-03-01 13:00  1   B
2022-03-01 14:00  4   A
2022-03-01 15:00  2   C
2022-03-01 16:00  1   A
...               ..  ...
What I want is basically how many ids have event A active at each time in the df, like in the table below.
time              eventA
2022-03-01 10:00  1
2022-03-01 11:00  1
2022-03-01 12:00  2
2022-03-01 13:00  1
2022-03-01 14:00  2
2022-03-01 15:00  2
2022-03-01 16:00  3
...               ...
I achieved this with some basic pandas operations:
import pandas as pd

df = pd.DataFrame(
    {
        "time": pd.date_range("2022-03-01 10:00", periods=7, freq="H"),
        "id": [1, 2, 3, 1, 4, 2, 1],
        "event": ["A", "B", "A", "B", "A", "C", "A"],
    }
)

timestamps = df.time
values = []
for timestamp in timestamps:
    filtered_df = df.loc[df.time <= timestamp]
    eventA = filtered_df.groupby("id").last().groupby("event").count()["time"]["A"]
    values.append({"time": timestamp, "eventA": eventA})
df_count = pd.DataFrame(values)
In my case though, I have to go over >50,000 rows, and this basic approach becomes very inefficient time-wise.
Is there a better approach to achieve the desired result? I guess there might be some pandas groupby aggregation methods that could help here, but I found none that helped me.

(df.set_index(['time', 'id']).unstack().ffill()
   .stack().value_counts(['time', 'event']).unstack().fillna(0))
The first line takes care of getting the latest event from each id at each hour by forward-filling the NaNs
                    event
id                      1    2    3    4
time
2022-03-01 10:00:00     A  NaN  NaN  NaN
2022-03-01 11:00:00     A    B  NaN  NaN
2022-03-01 12:00:00     A    B    A  NaN
2022-03-01 13:00:00     B    B    A  NaN
2022-03-01 14:00:00     B    B    A    A
2022-03-01 15:00:00     B    C    A    A
2022-03-01 16:00:00     A    C    A    A
The second line does the counting, which gives:
event                  A    B    C
time
2022-03-01 10:00:00  1.0  0.0  0.0
2022-03-01 11:00:00  1.0  1.0  0.0
2022-03-01 12:00:00  2.0  1.0  0.0
2022-03-01 13:00:00  1.0  2.0  0.0
2022-03-01 14:00:00  2.0  2.0  0.0
2022-03-01 15:00:00  2.0  1.0  1.0
2022-03-01 16:00:00  3.0  0.0  1.0
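If only the A column is needed in the exact shape of df_count from the question, a small follow-up sketch (assuming the table above has been assigned to a variable, here called counts, which is not in the original answer) could be:

# counts is the time x event table produced by the one-liner above (assumed variable name)
df_count = (counts[['A']]
              .rename(columns={'A': 'eventA'})
              .astype(int)      # fillna(0) leaves the counts as floats
              .reset_index())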

Related

Pandas dataframe expand rows in specific times

I have a dataframe:
df = T1 C1
01/01/2022 11:20 2
01/01/2022 15:40 8
01/01/2022 17:50 3
I want to expand it such that:
- I will have the value at the specific given times
- I will have a row for each round timestamp
So if the times are given in l = [01/01/2022 15:46, 01/01/2022 11:28]
I will have:
df_new = T1 C1
01/01/2022 11:20 2
01/01/2022 11:28 2
01/01/2022 12:00 2
01/01/2022 13:00 2
01/01/2022 14:00 2
01/01/2022 15:00 2
01/01/2022 15:40 8
01/01/2022 15:46 8
01/01/2022 16:00 8
01/01/2022 17:00 8
01/01/2022 17:50 3
You can add the extra dates and ffill:
df['T1'] = pd.to_datetime(df['T1'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')
(pd.concat([df, pd.DataFrame({'T1': extra})])
   .sort_values(by='T1', ignore_index=True)
   .ffill()
)
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
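This only adds the round hours; if the specific times from l in the question are also needed, one possible extension of the same concat-and-ffill idea (a sketch, assuming df['T1'] is already datetime as above) is:

l = pd.to_datetime(['01/01/2022 15:46', '01/01/2022 11:28'])
extra = pd.date_range(df['T1'].min().ceil('H'), df['T1'].max().floor('H'), freq='1h')

# concatenate the round hours and the requested times, then forward-fill C1
(pd.concat([df, pd.DataFrame({'T1': extra}), pd.DataFrame({'T1': l})])
   .sort_values(by='T1', ignore_index=True)
   .ffill()
)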
Here is a way to do what your question asks that will ensure:
- there are no duplicate times in T1 in the output, even if any of the times in the original are round hours
- the results are of the same type as the values in the C1 column of the input (in this case, integers, not floats).
hours = pd.date_range(df.T1.min().ceil("H"), df.T1.max().floor("H"), freq="60min")
idx_new = df.set_index('T1').join(pd.DataFrame(index=hours), how='outer', sort=True).index
df_new = (df.set_index('T1')
            .reindex(index=idx_new, method='ffill')
            .reset_index()
            .rename(columns={'index': 'T1'}))
Output:
T1 C1
0 2022-01-01 11:20:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 8
8 2022-01-01 17:50:00 3
Example of how round dates in the input are handled:
df = pd.DataFrame({
    #'T1': pd.to_datetime(['01/01/2022 11:20', '01/01/2022 15:40', '01/01/2022 17:50']),
    'T1': pd.to_datetime(['01/01/2022 11:00', '01/01/2022 15:40', '01/01/2022 17:00']),
    'C1': [2, 8, 3]})
Input:
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 15:40:00 8
2 2022-01-01 17:00:00 3
Output (no duplicates):
T1 C1
0 2022-01-01 11:00:00 2
1 2022-01-01 12:00:00 2
2 2022-01-01 13:00:00 2
3 2022-01-01 14:00:00 2
4 2022-01-01 15:00:00 2
5 2022-01-01 15:40:00 8
6 2022-01-01 16:00:00 8
7 2022-01-01 17:00:00 3
Another possible solution, based on pandas.DataFrame.resample:
df['T1'] = pd.to_datetime(df['T1'])
(pd.concat([df, df.set_index('T1').resample('1H').asfreq().reset_index()])
.sort_values('T1').ffill().dropna().reset_index(drop=True))
Output:
T1 C1
0 2022-01-01 11:20:00 2.0
1 2022-01-01 12:00:00 2.0
2 2022-01-01 13:00:00 2.0
3 2022-01-01 14:00:00 2.0
4 2022-01-01 15:00:00 2.0
5 2022-01-01 15:40:00 8.0
6 2022-01-01 16:00:00 8.0
7 2022-01-01 17:00:00 8.0
8 2022-01-01 17:50:00 3.0
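Because the concat introduces NaNs before the ffill, C1 comes out as float here; if the integer dtype matters (as in the previous answer), one small follow-up, assuming the chained expression above is assigned to a variable named out (an assumed name, not in the original), would be:

out['C1'] = out['C1'].astype(int)  # restore the integer dtype lost through the temporary NaNs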

Mapping ranges of date in pandas dataframe

I would like to map values defined in a dictionary of date: value into a DataFrame of dates.
Consider the following example:
import pandas as pd
df = pd.DataFrame(range(19), index=pd.date_range(start="2010-01-01", end="2010-01-10", freq="12H"))
dct = {
    "2009-01-01": 1,
    "2010-01-05": 2,
    "2020-01-01": 3,
}
I would like to get something like this:
df
0 test
2010-01-01 00:00:00 0 1.0
2010-01-01 12:00:00 1 1.0
2010-01-02 00:00:00 2 1.0
2010-01-02 12:00:00 3 1.0
2010-01-03 00:00:00 4 1.0
2010-01-03 12:00:00 5 1.0
2010-01-04 00:00:00 6 1.0
2010-01-04 12:00:00 7 1.0
2010-01-05 00:00:00 8 2.0
2010-01-05 12:00:00 9 2.0
2010-01-06 00:00:00 10 2.0
2010-01-06 12:00:00 11 2.0
2010-01-07 00:00:00 12 2.0
2010-01-07 12:00:00 13 2.0
2010-01-08 00:00:00 14 2.0
2010-01-08 12:00:00 15 2.0
2010-01-09 00:00:00 16 2.0
2010-01-09 12:00:00 17 2.0
2010-01-10 00:00:00 18 2.0
I have tried the following, but I get a column full of NaN:
df["test"] = pd.Series(df.index.map(dct), index=df.index).ffill()
Any suggestions?
There are missing values because the types do not match: the dict keys are strings, while the DataFrame has datetimes in a DatetimeIndex, and both sides need the same type. Convert the dict to a helper Series with a datetime index, then use Series.asfreq to add the dates in between:
dct = {
    "2009-01-01": 1,
    "2010-01-05": 2,
    "2020-01-01": 3,
}

s = pd.Series(dct).rename(lambda x: pd.to_datetime(x)).asfreq('d', method='ffill')
df["test"] = df.index.to_series().dt.normalize().map(s)
print(df)
0 test
2010-01-01 00:00:00 0 1
2010-01-01 12:00:00 1 1
2010-01-02 00:00:00 2 1
2010-01-02 12:00:00 3 1
2010-01-03 00:00:00 4 1
2010-01-03 12:00:00 5 1
2010-01-04 00:00:00 6 1
2010-01-04 12:00:00 7 1
2010-01-05 00:00:00 8 2
2010-01-05 12:00:00 9 2
2010-01-06 00:00:00 10 2
2010-01-06 12:00:00 11 2
2010-01-07 00:00:00 12 2
2010-01-07 12:00:00 13 2
2010-01-08 00:00:00 14 2
2010-01-08 12:00:00 15 2
2010-01-09 00:00:00 16 2
2010-01-09 12:00:00 17 2
2010-01-10 00:00:00 18 2
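An alternative worth mentioning (not from the answer above, just a sketch) is pd.merge_asof, which maps every timestamp to the most recent dictionary key without expanding the dictionary into a daily Series:

import pandas as pd

df = pd.DataFrame(range(19), index=pd.date_range(start="2010-01-01", end="2010-01-10", freq="12H"))
dct = {"2009-01-01": 1, "2010-01-05": 2, "2020-01-01": 3}

# turn the dict into a sorted frame of (date, value) breakpoints
breaks = pd.DataFrame({"date": pd.to_datetime(list(dct)), "test": list(dct.values())}).sort_values("date")

# for every row, merge_asof picks the last breakpoint whose date <= the row's timestamp
out = pd.merge_asof(df.rename_axis("date").reset_index(), breaks, on="date", direction="backward")
df["test"] = out["test"].to_numpy()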

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
                            '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
                            '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
                   'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
                   'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column consecutive_hour such that if the value at a particular timestamp is less than 1000, that timestamp counts as "3:00" hours, and consecutive such occurrences accumulate to 6:00, 9:00, ... as above.
Lastly, I want to summarize the table, counting the occurrences of each consecutive-hours value and the number of days, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours': [3, 6, 9, 12],
                           'number_of_day': [2, 0, 2, 0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame, and more. I have spent several days on this but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by jezrael:
Python pandas cumsum with reset everytime there is a 0
df["date"] = pd.to_datetime(df["date"])  # ensure datetimes so pd.Grouper can group by day

cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
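To see what the cumcount_reset lambda does in isolation: on a boolean Series it produces a running count of consecutive True values that resets to 0 at every False. A minimal sketch on hypothetical data:

b = pd.Series([True, True, False, True, True, True, False])
print(cumcount_reset(b).tolist())   # [1, 2, 0, 1, 2, 3, 0]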
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
               .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
               .rename("number_of_day") \
               .rename_axis("consecutive_hour") \
               .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0

count contiguous NaN values by unique values

I have contiguous periods of NaN values per CODE. I want to count the NaN values in each contiguous period of NaNs per CODE, and I also want the start and end date of each contiguous period.
df:
CODE TMIN
1998-01-01 00:00:00 12 2.5
1999-01-01 00:00:00 12 NaN
2000-01-01 00:00:00 12 NaN
2001-01-01 00:00:00 12 2.2
2002-01-01 00:00:00 12 NaN
1998-01-01 00:00:00 41 NaN
1999-01-01 00:00:00 41 NaN
2000-01-01 00:00:00 41 5.0
2001-01-01 00:00:00 41 9.0
2002-01-01 00:00:00 41 8.0
1998-01-01 00:00:00 52 2.0
1999-01-01 00:00:00 52 NaN
2000-01-01 00:00:00 52 NaN
2001-01-01 00:00:00 52 NaN
2002-01-01 00:00:00 52 1.0
1998-01-01 00:00:00 91 NaN
Expected results:
Start_Date           End_Date             CODE  number of contiguous missing values
1999-01-01 00:00:00 2000-01-01 00:00:00 12 2
2002-01-01 00:00:00 2002-01-01 00:00:00 12 1
1998-01-01 00:00:00 1999-01-01 00:00:00 41 2
1999-01-01 00:00:00 2001-01-01 00:00:00 52 3
1998-01-01 00:00:00 1998-01-01 00:00:00 91 1
How can I solve this? Thanks!
You can try grouping by the cumsum of the non-null values:
df['group'] = df.TMIN.notna().cumsum()

(df[df.TMIN.isna()]
   .groupby(['group', 'CODE'])
   .agg(Start_Date=('group', lambda x: x.index.min()),
        End_Date=('group', lambda x: x.index.max()),
        cont_missing=('TMIN', 'size'))
)
Output:
Start_Date End_Date cont_missing
group CODE
1 12 1999-01-01 00:00:00 2000-01-01 00:00:00 2
2 12 2002-01-01 00:00:00 2002-01-01 00:00:00 1
41 1998-01-01 00:00:00 1999-01-01 00:00:00 2
6 52 1999-01-01 00:00:00 2001-01-01 00:00:00 3
7 91 1998-01-01 00:00:00 1998-01-01 00:00:00 1
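To get the flat table from the question (without the helper group level), a small follow-up sketch could reset the index and drop the helper column:

out = (df[df.TMIN.isna()]
         .groupby(['group', 'CODE'])
         .agg(Start_Date=('group', lambda x: x.index.min()),
              End_Date=('group', lambda x: x.index.max()),
              cont_missing=('TMIN', 'size'))
         .reset_index()
         .drop(columns='group'))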

Finding maximum null values in stretch and generating flag

I have a dataframe with a datetime index and two columns. For each particular date, I have to find the maximum stretch of null values in column 'X' and replace it with zero in both columns for that date. In addition, I have to create a third column named 'Flag' which carries the value 1 for every zero imputation in the other two columns, and 0 otherwise. In the example below, on January 1st the maximum stretch of null values is 3 rows, so I have to replace those with zero. Similarly, I have to repeat the process for January 2nd.
Below is my sample data:
Datetime X Y
01-01-2018 00:00 1 1
01-01-2018 00:05 nan 2
01-01-2018 00:10 2 nan
01-01-2018 00:15 3 4
01-01-2018 00:20 2 2
01-01-2018 00:25 nan 1
01-01-2018 00:30 nan nan
01-01-2018 00:35 nan nan
01-01-2018 00:40 4 4
02-01-2018 00:00 nan nan
02-01-2018 00:05 2 3
02-01-2018 00:10 2 2
02-01-2018 00:15 2 5
02-01-2018 00:20 2 2
02-01-2018 00:25 nan nan
02-01-2018 00:30 nan 1
02-01-2018 00:35 3 nan
02-01-2018 00:40 nan nan
"Below is the result that I am expecting"
Datetime X Y Flag
01-01-2018 00:00 1 1 0
01-01-2018 00:05 nan 2 0
01-01-2018 00:10 2 nan 0
01-01-2018 00:15 3 4 0
01-01-2018 00:20 2 2 0
01-01-2018 00:25 0 0 1
01-01-2018 00:30 0 0 1
01-01-2018 00:35 0 0 1
01-01-2018 00:40 4 4 0
02-01-2018 00:00 nan nan 0
02-01-2018 00:05 2 3 0
02-01-2018 00:10 2 2 0
02-01-2018 00:15 2 5 0
02-01-2018 00:20 2 2 0
02-01-2018 00:25 nan nan 0
02-01-2018 00:30 nan 1 0
02-01-2018 00:35 3 nan 0
02-01-2018 00:40 nan nan 0
This question is an extension of a previous question; here is the link: Python - Find maximum null values in stretch and replacing with 0
First, create consecutive groups of NaNs for each column, filled with unique values:
df1 = df.isna()
# number each stretch of NaNs within a day; keep the group id only on NaN positions
df2 = df1.ne(df1.groupby(df1.index.date).shift()).cumsum().where(df1)
# offset Y's group ids so they cannot collide with X's
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 6.0 108.0
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 8.0 144.0
2018-02-01 00:30:00 8.0 NaN
2018-02-01 00:35:00 NaN 180.0
2018-02-01 00:40:00 10.0 180.0
Then get the group with the maximum count - here group 4:
a = df2.stack().value_counts().index[0]
print (a)
4.0
Get a mask of the matching rows to set to 0, and for the Flag column cast the mask to integer so that True/False maps to 1/0:
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
EDIT:
Added a new condition to process only dates from a given list:
dates = df.index.floor('d')
filtered = ['2018-01-01','2019-01-01']
m = dates.isin(filtered)
df1 = df.isna() & m[:, None]
df2 = df1.ne(df1.groupby(dates).shift()).cumsum().where(df1)
df2['Y'] *= len(df2)
print (df2)
X Y
Datetime
2018-01-01 00:00:00 NaN NaN
2018-01-01 00:05:00 2.0 NaN
2018-01-01 00:10:00 NaN 36.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:20:00 NaN NaN
2018-01-01 00:25:00 4.0 NaN
2018-01-01 00:30:00 4.0 72.0
2018-01-01 00:35:00 4.0 72.0
2018-01-01 00:40:00 NaN NaN
2018-02-01 00:00:00 NaN NaN
2018-02-01 00:05:00 NaN NaN
2018-02-01 00:10:00 NaN NaN
2018-02-01 00:15:00 NaN NaN
2018-02-01 00:20:00 NaN NaN
2018-02-01 00:25:00 NaN NaN
2018-02-01 00:30:00 NaN NaN
2018-02-01 00:35:00 NaN NaN
2018-02-01 00:40:00 NaN NaN
a = df2.stack().value_counts().index[0]
#solution working also if no NaNs per filtered rows (prevent IndexError: index 0 is out of bounds)
#a = next(iter(df2.stack().value_counts().index), -1)
mask = df2.eq(a).any(axis=1)
df.loc[mask,:] = 0
df['Flag'] = mask.astype(int)
print (df)
X Y Flag
Datetime
2018-01-01 00:00:00 1.0 1.0 0
2018-01-01 00:05:00 NaN 2.0 0
2018-01-01 00:10:00 2.0 NaN 0
2018-01-01 00:15:00 3.0 4.0 0
2018-01-01 00:20:00 2.0 2.0 0
2018-01-01 00:25:00 0.0 0.0 1
2018-01-01 00:30:00 0.0 0.0 1
2018-01-01 00:35:00 0.0 0.0 1
2018-01-01 00:40:00 4.0 4.0 0
2018-02-01 00:00:00 NaN NaN 0
2018-02-01 00:05:00 2.0 3.0 0
2018-02-01 00:10:00 2.0 2.0 0
2018-02-01 00:15:00 2.0 5.0 0
2018-02-01 00:20:00 2.0 2.0 0
2018-02-01 00:25:00 NaN NaN 0
2018-02-01 00:30:00 NaN 1.0 0
2018-02-01 00:35:00 3.0 NaN 0
2018-02-01 00:40:00 NaN NaN 0
