How to split a Pandas dataframe into chunks from NaN to NaN? - python

Let's say I have the following data:
import pandas as pd
csv = [
['2019-05-01 00:00', ],
['2019-05-01 01:00', 2],
['2019-05-01 02:00', 4],
['2019-05-01 03:00', ],
['2019-05-01 04:00', 2],
['2019-05-01 05:00', 4],
['2019-05-01 06:00', 6],
['2019-05-01 07:00', ],
['2019-05-01 08:00', ],
['2019-05-01 09:00', 2]]
df = pd.DataFrame(csv, columns=["DateTime", "Value"])
So I am working with a time series with gaps in data:
DateTime Value
0 2019-05-01 00:00 NaN
1 2019-05-01 01:00 2.0
2 2019-05-01 02:00 4.0
3 2019-05-01 03:00 NaN
4 2019-05-01 04:00 2.0
5 2019-05-01 05:00 4.0
6 2019-05-01 06:00 6.0
7 2019-05-01 07:00 NaN
8 2019-05-01 08:00 NaN
9 2019-05-01 09:00 2.0
Now, I want to work with each chunk of existing data one by one. That is, I want to split the series into the contiguous pieces between NaNs. The goal is to iterate over these chunks so I can pass each one individually to another function that can't handle gaps in the data. Then I want to store the result back in the original dataframe in its corresponding place. As a trivial example, let's say the function calculates the average value of the chunk. Expected result:
DateTime Value ChunkAverage
0 2019-05-01 00:00 NaN NaN
1 2019-05-01 01:00 2.0 3.0
2 2019-05-01 02:00 4.0 3.0
3 2019-05-01 03:00 NaN NaN
4 2019-05-01 04:00 2.0 4.0
5 2019-05-01 05:00 4.0 4.0
6 2019-05-01 06:00 6.0 4.0
7 2019-05-01 07:00 NaN NaN
8 2019-05-01 08:00 NaN NaN
9 2019-05-01 09:00 2.0 2.0
I know this could be done the "traditional way" with loops, "if" clauses, index slicing, etc., but I suspect there is something more efficient and safer built into Pandas. I just can't figure out what.

You can use df.groupby together with pd.Series.isna and pd.Series.cumsum:
g = df.Value.isna().cumsum()
df.assign(chunk = df.Value.groupby(g).transform('mean').mask(df.Value.isna()))
# df['chunk'] = df.Value.groupby(g).transform('mean').mask(df.Value.isna())
# df['chunk'] = df.Value.groupby(g).transform('mean').where(df.Value.notna())
DateTime Value chunk
0 2019-05-01 00:00 NaN NaN
1 2019-05-01 01:00 2.0 3.0
2 2019-05-01 02:00 4.0 3.0
3 2019-05-01 03:00 NaN NaN
4 2019-05-01 04:00 2.0 4.0
5 2019-05-01 05:00 4.0 4.0
6 2019-05-01 06:00 6.0 4.0
7 2019-05-01 07:00 NaN NaN
8 2019-05-01 08:00 NaN NaN
9 2019-05-01 09:00 2.0 2.0
Note:
df.assign(...) returns a new dataframe.
df['chunk'] = ... mutates the original dataframe in place.
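If you need to pass each contiguous chunk to an arbitrary function rather than compute a simple mean, you can iterate over the same grouping. This is a minimal sketch, where my_func is a hypothetical stand-in for your gap-intolerant function:
import numpy as np

g = df.Value.isna().cumsum()
mask = df.Value.notna()

def my_func(chunk):          # hypothetical placeholder for the function that can't handle gaps
    return chunk.mean()      # e.g. the chunk average

df['ChunkAverage'] = np.nan
for _, group in df[mask].groupby(g[mask]):
    df.loc[group.index, 'ChunkAverage'] = my_func(group['Value'])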

One possibility would be to add a separator column, based on the NaNs in Value, and group by that:
df['separator'] = df['Value'].isna().cumsum()
grp = df.groupby('separator').agg(count=pd.NamedAgg(column='Value', aggfunc='count'))
print(grp)
This counts the non-NaN values in each group:
           count
separator
1              2
2              3
3              0
4              1
How you want to fill the NaNs depends a bit on what you want to achieve with the calculation.
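If you want to write a per-chunk statistic back onto the original rows with this separator column (as in the expected ChunkAverage output), one possible sketch is to map the grouped aggregate back through the separator and blank out the NaN rows:
# per-chunk mean, mapped back to each row via its separator value
chunk_mean = df.groupby('separator')['Value'].mean()
df['ChunkAverage'] = df['separator'].map(chunk_mean).where(df['Value'].notna())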

Related

Converting a data frame of events into a timetable format

I am working on converting a list of online classes into a heat map using Python & Pandas, and I've come to a dead end. Right now, I have a data frame 'data' of events, each with a day of the week in the 'DAY' column and the hour of the event in the 'TIME' column. The dataset is displayed as follows:
ID TIME DAY
108 15 Saturday
110 15 Sunday
112 16 Wednesday
114 16 Friday
116 15 Monday
.. ... ...
639 12 Wednesday
640 12 Saturday
641 18 Saturday
642 16 Thursday
643 15 Friday
I'm looking for a way to count the repetitions of every 'TIME' value for every 'DAY' and then present these counts in a new table 'event_count'. I need to turn the linear data in my 'data' table into a more timetable-like form that can later be converted into a visual heatmap.
Sounds like a difficult transformation, but I feel like I'm missing something very obvious. The 'event_count' table should look something like this:
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
10 5 2 4 6 1 0 2
11 4 2 4 6 1 0 2
12 6 2 4 6 1 0 2
13 3 2 4 6 1 0 2
14 7 2 4 6 1 0 2
I tried to achieve this through pivot_table and stack; however, the best I got was a list of all days of the week with mean averages for time. Could you advise me which direction I should look into and how I can approach solving this?
IIUC you can do something like this:
df is from your given example data.
import pandas as pd
df = pd.DataFrame({
    'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
    'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
    'DAY': ['Saturday', 'Sunday', 'Wednesday', 'Friday', 'Monday',
            'Wednesday', 'Saturday', 'Saturday', 'Thursday', 'Friday']
})
weekdays = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
out = (pd.crosstab(index=df['TIME'], columns=df['DAY'], values=df['TIME'], aggfunc='count')
         .sort_index(axis=0)          # sort by the index 'TIME'
         .reindex(weekdays, axis=1)   # order the columns by weekday
         .rename_axis(None, axis=1)   # drop the 'DAY' columns name
         .reset_index()               # 'TIME' from index to column
       )
print(out)
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 12 NaN NaN 1.0 NaN NaN 1.0 NaN
1 15 1.0 NaN NaN NaN 1.0 1.0 1.0
2 16 NaN NaN 1.0 1.0 1.0 NaN NaN
3 18 NaN NaN NaN NaN NaN 1.0 NaN
You were also on the right path with pivot_table. I'm not sure what was missing to get you the right result, but here is one approach with it. I added `margins`; maybe it is also interesting for you to get the total amount of each index/column.
weekdays = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Total']
out2 = (pd.pivot_table(data=df, index='TIME', columns='DAY', aggfunc='count', margins=True, margins_name='Total')
          .droplevel(0, axis=1)
          .reindex(weekdays, axis=1)
        )
print(out2)
DAY Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
TIME
12 NaN NaN 1.0 NaN NaN 1.0 NaN 2
15 1.0 NaN NaN NaN 1.0 1.0 1.0 4
16 NaN NaN 1.0 1.0 1.0 NaN NaN 3
18 NaN NaN NaN NaN NaN 1.0 NaN 1
Total 1.0 NaN 2.0 1.0 2.0 3.0 1.0 10
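Since a heatmap usually needs a numeric value in every cell, you may want to replace the NaNs (day/hour combinations that never occur) with 0 before plotting. A small sketch, assuming you plot the crosstab result out with a library such as seaborn:
heat = out.set_index('TIME').fillna(0).astype(int)

# optional plotting step (assumes seaborn is installed)
# import seaborn as sns
# sns.heatmap(heat, annot=True, cmap='viridis')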

Efficiently combining groupby, last and count in pandas

From a list of logs, I want to get the number of active events at each timestamp for a specific event type.
A sample log input looks like this:
time              id  event
2022-03-01 10:00   1  A
2022-03-01 11:00   2  B
2022-03-01 12:00   3  A
2022-03-01 13:00   1  B
2022-03-01 14:00   4  A
2022-03-01 15:00   2  C
2022-03-01 16:00   1  A
...               ..  ...
What I want is basically how many ids have event A active at each time in the df, like in the table below.
time              eventA
2022-03-01 10:00  1
2022-03-01 11:00  1
2022-03-01 12:00  2
2022-03-01 13:00  1
2022-03-01 14:00  2
2022-03-01 15:00  2
2022-03-01 16:00  3
...               ...
I achieved this with some basic pandas operations:
df = pd.DataFrame(
    {
        "time": pd.date_range("2022-03-01 10:00", periods=7, freq="H"),
        "id": [1, 2, 3, 1, 4, 2, 1],
        "event": ["A", "B", "A", "B", "A", "C", "A"],
    }
)
timestamps = df.time
values = []
for timestamp in timestamps:
    filtered_df = df.loc[df.time <= timestamp]
    eventA = filtered_df.groupby("id").last().groupby("event").count()["time"]["A"]
    values.append({"time": timestamp, "eventA": eventA})
df_count = pd.DataFrame(values)
In my case though, I have to go over >50,000 rows, and this basic approach becomes very inefficient time-wise.
Is there a better approach to achieve the desired result? I guess there might be some pandas groupby aggregation methods that could help here, but I found none that did.
df.set_index(['time', 'id']).unstack().ffill() \
  .stack().value_counts(['time', 'event']).unstack().fillna(0)
The first line gets the latest event for each id at each hour by forward-filling the NaNs:
event
id 1 2 3 4
time
2022-03-01 10:00:00 A NaN NaN NaN
2022-03-01 11:00:00 A B NaN NaN
2022-03-01 12:00:00 A B A NaN
2022-03-01 13:00:00 B B A NaN
2022-03-01 14:00:00 B B A A
2022-03-01 15:00:00 B C A A
2022-03-01 16:00:00 A C A A
The second line does the counting, which gives:
event A B C
time
2022-03-01 10:00:00 1.0 0.0 0.0
2022-03-01 11:00:00 1.0 1.0 0.0
2022-03-01 12:00:00 2.0 1.0 0.0
2022-03-01 13:00:00 1.0 2.0 0.0
2022-03-01 14:00:00 2.0 2.0 0.0
2022-03-01 15:00:00 2.0 1.0 1.0
2022-03-01 16:00:00 3.0 0.0 1.0
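If you only need the count of active A events (the eventA column from your desired output), you can pull that single column out. A short sketch, assuming the counting result above is stored in a variable named counts:
# counts is the DataFrame produced by the answer above
df_count = counts['A'].astype(int).rename('eventA').reset_index()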

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
                            '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
                            '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
                   'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
                   'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column, consecutive_hour, such that whenever the value at a particular timestamp is less than 1000, the row gets 3 hours, and consecutive such occurrences accumulate to 6, 9, etc., as above.
Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more, spent several days but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# running count of consecutive True values, reset to 0 at each False
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

# "date" must be datetime for pd.Grouper(freq="D"); the sample df holds strings
df["date"] = pd.to_datetime(df["date"])

df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
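As a quick illustration of what cumcount_reset does on its own (a toy example, not part of the original data):
b = pd.Series([True, True, False, True, True, True])
print(cumcount_reset(b).tolist())   # [1, 2, 0, 1, 2, 3]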
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0

Filling missing dates by imputing on previous dates in Python

I have a time series that I want to lag and predict on for future data one year ahead. It looks like:
Date        Energy  Pred Energy  Lag Error
.
2017-09-01       9          8.4
2017-10-01      10          9
2017-11-01      11         10
2017-12-01      12         11.5
2018-01-01       1          1.3
NaT                               (pred - true)
NaT
NaT
NaT
.
.
.
All I want to do is impute dates into the NaT entries to continue from 2018-01-01 to 2019-01-01 (just fill them like we're in Excel drag and drop) because there are enough NaT positions to fill up to that point.
I've tried model['Date'].fillna() with various methods, and it either just repeats the same previous date or drops things I don't want to drop.
Any way to just fill these NaTs with 1 month increments like the previous data?
Make the df and set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd
df = pd.read_clipboard(sep=",", parse_dates=True)
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
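If you then want Date back as a regular column (as in your original table) rather than as the index, one possible follow-up is:
df = df.rename_axis('Date').reset_index()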
Help from:
Pandas Set DatetimeIndex

Assistance needed in python pandas to reduce lines of code and cycle time

I have a DF where I am calculating and filling in the EMI value in fields:
account Total Start Date End Date EMI
211829 107000 05/19/17 01/22/19 5350
320563 175000 08/04/17 10/30/18 12500
648336 246000 02/26/17 08/25/19 8482.7586206897
109996 175000 11/23/17 11/27/19 7291.6666666667
121213 317000 09/07/17 04/12/18 45285.7142857143
Then, based on the date range, I create new fields like Jan17, Feb17, Mar17, etc. and fill them up with the code below.
jant17 = pd.to_datetime('2017-01-01')
febt17 = pd.to_datetime('2017-02-01')
mart17 = pd.to_datetime('2017-03-01')
jan17 = pd.to_datetime('2017-01-31')
feb17 = pd.to_datetime('2017-02-28')
mar17 = pd.to_datetime('2017-03-31')
df.loc[(df['Start Date'] <= jan17) & (df['End Date'] >= jant17), 'Jan17'] = df['EMI']
But the drawback is that when I have to forecast until 2019 or 2020, this becomes too many lines of code to write, and whenever there is an update I need to modify too many lines. To reduce the lines of code I tried an alternate method using for loops, but the code started taking very long to execute.
monthend = { 'Jan17' : pd.to_datetime('2017-01-31'),
'Feb17' : pd.to_datetime('2017-02-28'),
'Mar17' : pd.to_datetime('2017-03-31')}
monthbeg = { 'Jant17' : pd.to_datetime('2017-01-01'),
'Febt17' : pd.to_datetime('2017-02-01'),
'Mart17' : pd.to_datetime('2017-03-01')}
for mend in monthend.values():
    for mbeg in monthbeg.values():
        for coln in colnames:   # colnames: the list of month-column names (defined elsewhere)
            df.loc[(df['Start Date'] <= mend) & (df['End Date'] >= mbeg), coln] = df['EMI']
This greatly reduced the number of lines of code but increased the execution time from 3-4 minutes to over an hour. Is there a better way to code this with fewer lines and less processing time?
I think you can create a helper df with start dates, end dates and the names of the new columns, loop over its rows and create the new columns in the original df:
dates = pd.DataFrame({'start': pd.date_range('2017-01-01', freq='MS', periods=10),
                      'end': pd.date_range('2017-01-01', freq='M', periods=10)})
dates['names'] = dates.start.dt.strftime('%b%y')
print (dates)
end start names
0 2017-01-31 2017-01-01 Jan17
1 2017-02-28 2017-02-01 Feb17
2 2017-03-31 2017-03-01 Mar17
3 2017-04-30 2017-04-01 Apr17
4 2017-05-31 2017-05-01 May17
5 2017-06-30 2017-06-01 Jun17
6 2017-07-31 2017-07-01 Jul17
7 2017-08-31 2017-08-01 Aug17
8 2017-09-30 2017-09-01 Sep17
9 2017-10-31 2017-10-01 Oct17
#if necessary convert to datetimes
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
def f(x):
    df.loc[(df['Start Date'] <= x.start) & (df['End Date'] >= x.end), x.names] = df['EMI']

dates.apply(f, axis=1)
print (df)
account Total Start Date End Date EMI Jan17 Feb17 \
0 211829 107000 2017-05-19 2019-01-22 5350.000000 NaN NaN
1 320563 175000 2017-08-04 2018-10-30 12500.000000 NaN NaN
2 648336 246000 2017-02-26 2019-08-25 8482.758621 NaN NaN
3 109996 175000 2017-11-23 2019-11-27 7291.666667 NaN NaN
4 121213 317000 2017-09-07 2018-04-12 45285.714286 NaN NaN
Mar17 Apr17 May17 Jun17 Jul17 \
0 NaN NaN NaN 5350.000000 5350.000000
1 NaN NaN NaN NaN NaN
2 8482.758621 8482.758621 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Aug17 Sep17 Oct17
0 5350.000000 5350.000000 5350.000000
1 NaN 12500.000000 12500.000000
2 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN
4 NaN NaN 45285.714286
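If the helper df should cover the whole forecast horizon rather than a fixed 10 periods, one option (a sketch, assuming Start Date and End Date are already datetimes) is to derive the months from the data itself:
# span the months from the earliest Start Date to the latest End Date
start = df['Start Date'].min().to_period('M').to_timestamp()
months = pd.date_range(start, df['End Date'].max(), freq='MS')

dates = pd.DataFrame({'start': months,
                      'end': months + pd.offsets.MonthEnd(0)})
dates['names'] = dates.start.dt.strftime('%b%y')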
