Daily total active time in a pandas dataframe - python

I'm new to Python and my English is not so good, so I'll try to explain my problem with the example below.
In :ds # is my dataframe
Out :DateStarted DateCompleted DayStarted DayCompleted \
1460 2017-06-12 14:03:32 2017-06-12 14:04:07 2017-06-12 2017-06-12
14445 2017-06-13 13:39:16 2017-06-13 13:40:32 2017-06-13 2017-06-13
14109 2017-06-21 10:25:36 2017-06-21 10:32:17 2017-06-21 2017-06-21
16652 2017-06-27 15:44:28 2017-06-27 15:44:41 2017-06-27 2017-06-27
30062 2017-07-05 09:49:01 2017-07-05 10:04:00 2017-07-05 2017-07-05
22357 2017-08-31 09:06:00 2017-08-31 09:10:31 2017-08-31 2017-08-31
39117 2017-09-08 08:43:07 2017-09-08 08:44:51 2017-09-08 2017-09-08
41903 2017-09-15 12:54:40 2017-09-15 14:00:06 2017-09-15 2017-09-15
74633 2017-09-27 12:41:09 2017-09-27 13:16:04 2017-09-27 2017-09-27
69315 2017-10-23 08:25:28 2017-10-23 08:26:09 2017-10-23 2017-10-23
87508 2017-10-30 12:19:19 2017-10-30 12:19:45 2017-10-30 2017-10-30
86828 2017-11-03 12:20:09 2017-11-03 12:24:56 2017-11-03 2017-11-03
89877 2017-11-06 13:52:05 2017-11-06 13:52:50 2017-11-06 2017-11-06
94970 2017-11-07 08:09:53 2017-11-07 08:10:15 2017-11-07 2017-11-07
94866 2017-11-28 14:38:14 2017-11-30 07:51:04 2017-11-28 2017-11-30
DailyTotalActiveTime diff
1460 NaN 35.0
14445 NaN 76.0
14109 NaN 401.0
16652 NaN 13.0
30062 NaN 899.0
22357 NaN 271.0
39117 NaN 104.0
41903 NaN 3926.0
74633 NaN 2095.0
69315 NaN 41.0
87508 NaN 26.0
86828 NaN 287.0
89877 NaN 45.0
94970 NaN 22.0
94866 NaN 148370.0
In the DailyTotalActiveTime column, I want to calculate how much active time each specific day has in total. The diff column is in seconds.
I tried this, but got no results:
for i in ds['diff']:
    if i <= 86400:
        ds['DailyTotalActiveTime'] == i
    else:
        ds['DailyTotalActiveTime'] == 86400
        ds['DailyTotalActiveTime'] + 1 == i - 86400
What can I do? Again, sorry for the explanation.

You should use = (assignment) instead of == (comparison).
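Beyond the = fix, the loop's apparent intent (cap each row's time at one day and carry the remainder) can be done without a loop at all. A minimal sketch with made-up data mirroring the question's diff column (the column names are from the question; the overflow column is a hypothetical addition):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question's "diff" column (seconds).
ds = pd.DataFrame({'diff': [35.0, 899.0, 148370.0]})

# Cap each duration at one day (86400 s); anything beyond that would
# belong to the following day's total, as the question intends.
ds['DailyTotalActiveTime'] = np.minimum(ds['diff'], 86400)
ds['overflow'] = np.maximum(ds['diff'] - 86400, 0)
```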

To get you halfway there, you could do something like the following (I am sure there must be a simpler way, but I can't see it right now):
df['datestarted'] = pd.to_datetime(df['datestarted'])
df['datecompleted'] = pd.to_datetime(df['datecompleted'])
df['daystarted'] = df['datestarted'].dt.date
df['daycompleted'] = df['datecompleted'].dt.date
df['Date'] = df['daystarted']  # This is the unique date per row.
for row in df.itertuples():
    if (row.daycompleted - row.daystarted) > pd.Timedelta(days=0):
        for i in range(1, (row.daycompleted - row.daystarted).days + 1):
            df2 = pd.DataFrame([row]).drop('Index', axis=1)
            df2['Date'] = df2['Date'] + pd.Timedelta(days=i)
            df = pd.concat([df, df2])  # df.append was removed in pandas 2.0
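Once multi-day rows have been expanded to one row per date as above, the remaining step would be a plain groupby-sum per Date, capped at one day. A hedged sketch on hypothetical already-expanded data (the column names are illustrative):

```python
import pandas as pd

# Hypothetical frame after the expansion step: one row per calendar
# date, with the seconds attributable to that date.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-11-06', '2017-11-28', '2017-11-28']),
    'seconds': [45.0, 33706.0, 86400.0],
})

# Total active seconds per day, capped at 86400 (one full day).
daily = df.groupby('Date')['seconds'].sum().clip(upper=86400)
```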

Related

From hours to String

I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was that, for times between 05:00 and 11:59, a new df would be created with data labelled "morning". To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then passed them to a list with "morning" str
text_morning=[str('morning') for x in hour_morning if x==True]
I get the error in the last line because it only returns 'morning' string values; it is as if the x ignored the if condition. Why is this happening and how do I fix it?
Do
text_morning=[str('morning') if x==True else 'not_morning' for x in hour_morning ]
You can also use np.where:
text_morning = np.where(hour_morning, 'morning', 'not morning')
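Both fixes give the same labels; a small runnable sketch with two made-up timestamps:

```python
import numpy as np
import pandas as pd

# Two hypothetical timestamps: one evening, one morning.
df = pd.DataFrame({'Dates': ['2017-01-01 23:30:00', '2017-09-26 09:39:00']})

hour_morning = (pd.to_datetime(df['Dates'])
                  .dt.strftime('%H:%M:%S')
                  .between('05:00:00', '11:59:00'))

# The if/else list comprehension and np.where produce identical labels.
text_lc = ['morning' if x else 'not morning' for x in hour_morning]
text_np = np.where(hour_morning, 'morning', 'not morning')
```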
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN

Pandas fillna() method not filling all missing values

I have rain and temperature data sourced from Environment Canada, but it contains some NaN values.
start_date = '2015-12-31'
end_date = '2021-05-26'
mask = (data['date'] > start_date) & (data['date'] <= end_date)
df = data.loc[mask]
print(df)
date time rain_gauge_value temperature
8760 2016-01-01 00:00:00 0.0 -2.9
8761 2016-01-01 01:00:00 0.0 -3.4
8762 2016-01-01 02:00:00 0.0 -3.6
8763 2016-01-01 03:00:00 0.0 -3.6
8764 2016-01-01 04:00:00 0.0 -4.0
... ... ... ... ...
56107 2021-05-26 19:00:00 0.0 22.0
56108 2021-05-26 20:00:00 0.0 21.5
56109 2021-05-26 21:00:00 0.0 21.1
56110 2021-05-26 22:00:00 0.0 19.5
56111 2021-05-26 23:00:00 0.0 18.5
[47352 rows x 4 columns]
Find the rows with a NaN value
null = df[df['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
11028 2016-04-04 12:00:00 NaN -6.9
11986 2016-05-14 10:00:00 NaN NaN
11987 2016-05-14 11:00:00 NaN NaN
11988 2016-05-14 12:00:00 NaN NaN
11989 2016-05-14 13:00:00 NaN NaN
... ... ... ... ...
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
[6346 rows x 4 columns]
This is my dataframe I want to use to fill the NaN values
print(rain_df)
date time rain_gauge_value temperature
0 2015-12-28 00:00:00 0.1 -6.0
1 2015-12-28 01:00:00 0.0 -7.0
2 2015-12-28 02:00:00 0.0 -8.0
3 2015-12-28 03:00:00 0.0 -8.0
4 2015-12-28 04:00:00 0.0 -7.0
... ... ... ... ...
48043 2021-06-19 19:00:00 0.6 20.0
48044 2021-06-19 20:00:00 0.6 19.0
48045 2021-06-19 21:00:00 0.8 18.0
48046 2021-06-19 22:00:00 0.4 17.0
48047 2021-06-19 23:00:00 0.0 16.0
[48048 rows x 4 columns]
But when I use the fillna() method, some of the values don't get substituted.
null = null.fillna(rain_df)
null = null[null['rain_gauge_value'].isnull()]
print(null)
date time rain_gauge_value temperature
48057 2020-06-25 09:00:00 NaN NaN
48058 2020-06-25 10:00:00 NaN NaN
48059 2020-06-25 11:00:00 NaN NaN
48060 2020-06-25 12:00:00 NaN NaN
48586 2020-07-17 10:00:00 NaN NaN
48587 2020-07-17 11:00:00 NaN NaN
48588 2020-07-17 12:00:00 NaN NaN
49022 2020-08-04 14:00:00 NaN NaN
49023 2020-08-04 15:00:00 NaN NaN
49024 2020-08-04 16:00:00 NaN NaN
49025 2020-08-04 17:00:00 NaN NaN
50505 2020-10-05 09:00:00 NaN 11.3
54083 2021-03-03 11:00:00 NaN -5.1
54084 2021-03-03 12:00:00 NaN -4.5
How can I resolve this issue?
When using fillna, you usually want a filling method, like filling with the previous/next value or the column mean. We can do something like this:
nulls_index = df['rain_gauge_value'].isnull()
df = df.ffill()  # forward-fill as an example; fillna(method='ffill') is deprecated
nulls_after_fill = df[nulls_index]
take a look at:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You need to inform pandas how you want to patch. It may be obvious to you that you want to use the "patch" dataframe's values when the dates and times line up, but it won't be obvious to pandas. See my dummy example:
import numpy as np
import pandas as pd
from datetime import date, time

raw = pd.DataFrame(dict(date=[date(2015, 12, 28), date(2015, 12, 28)],
                        time=[time(0, 0, 0), time(0, 0, 1)],
                        temp=[1., np.nan], rain=[4., np.nan]))
raw
date time temp rain
0 2015-12-28 00:00:00 1.0 4.0
1 2015-12-28 00:00:01 NaN NaN
patch = pd.DataFrame(dict(date=[date(2015, 12, 28), date(2015, 12, 28)],
                          time=[time(0, 0, 0), time(0, 0, 1)],
                          temp=[5., 5.], rain=[10., 10.]))
patch
date time temp rain
0 2015-12-28 00:00:00 5.0 10.0
1 2015-12-28 00:00:01 5.0 10.0
You need the indexes of raw and patch to correspond to how you want to patch the raw data (in this case, based on date and time):
raw.set_index(['date','time']).fillna(patch.set_index(['date','time']))
returns
temp rain
date time
2015-12-28 00:00:00 1.0 4.0
00:00:01 5.0 10.0
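Applied to the question's frames, the same idea would look roughly like this. A hedged sketch on miniature hypothetical versions of df (with a gap) and rain_df (the patch source); the real frames are much larger:

```python
import pandas as pd

# Tiny stand-ins for the question's df and rain_df.
df = pd.DataFrame({'date': ['2016-04-04', '2016-04-04'],
                   'time': ['11:00:00', '12:00:00'],
                   'rain_gauge_value': [0.0, None],
                   'temperature': [-6.0, -6.9]})
rain_df = pd.DataFrame({'date': ['2016-04-04'],
                        'time': ['12:00:00'],
                        'rain_gauge_value': [0.2],
                        'temperature': [-6.9]})

# Align both frames on (date, time) so fillna knows which rows correspond.
patched = (df.set_index(['date', 'time'])
             .fillna(rain_df.set_index(['date', 'time']))
             .reset_index())
```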

How to handle columns mixed with datetime.time and timestamp

I would like to have a dataframe that all columns in the datetime.time format. But my original dataframe is like
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 04:48:00 05:00:00 05:05:00 05:10:00 05:15:00
1 06:45:00 06:57:00 07:02:00 07:07:00 07:12:00
2 1900-01-01 00:48:00 NaN 1900-01-01 01:03:00 1900-01-01 01:08:00 1900-01-01 01:13:00
3 05:09:00 NaN NaN 05:36:00 05:41:00
What I would like to get is
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 04:48:00 05:00:00 05:05:00 05:10:00 05:15:00
1 06:45:00 06:57:00 07:02:00 07:07:00 07:12:00
2 00:48:00 NaN 01:03:00 01:08:00 01:13:00
3 05:09:00 NaN NaN 05:36:00 05:41:00
The datatypes of those values are
> type(test_result.iloc[0,0])
datetime.time
> type(test_result.iloc[2,0])
pandas._libs.tslibs.timestamps.Timestamp
I tried to_datetime(format= "%H:%M:%S", error = "coerce"), datetime.strptime(test_result['Moorabbin'],"%H:%M:%S").time() and test_result.astype('datetime64[ns]', copy=True, errors='ignore'), but nothing worked. Could anyone please help?
One approach would be as follows.
Make sure the dtype is 'object'; you can convert it to datetime after you have reduced the data to the required length.
Then do df_new = df.apply(lambda x: x.str.split(' ').str[-1], axis=1)
Input
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 4:48:00 5:00:00 5:05:00 5:10:00 5:15:00
1 6:45:00 6:57:00 7:02:00 7:07:00 7:12:00
2 1/1/1900 0:48:00 NaN 1/1/1900 1:03:00 1/1/1900 1:08:00 1/1/1900 1:13:00
3 5:09:00 NaN NaN 5:36:00 5:41:00
output (df_new)
Moorabbin Mordialloc Aspendale Edithvale Chelsea
0 4:48:00 5:00:00 5:05:00 5:10:00 5:15:00
1 6:45:00 6:57:00 7:02:00 7:07:00 7:12:00
2 0:48:00 NaN 1:03:00 1:08:00 1:13:00
3 5:09:00 NaN NaN 5:36:00 5:41:00
Note: the result is of object dtype, not a datetime object, but you can convert these columns to datetime using pd.to_datetime.
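Since the question shows the cells are actual datetime.time and Timestamp objects (not strings), another sketch would normalise each cell directly, calling .time() only on the Timestamps. Illustrative data; the real frame has more columns:

```python
import datetime
import pandas as pd

# Hypothetical column mixing datetime.time values and Timestamps,
# matching the dtypes shown in the question.
s = pd.Series([datetime.time(4, 48), pd.Timestamp('1900-01-01 00:48:00')])

# Timestamps expose .time(); plain datetime.time values pass through.
normalised = s.map(lambda v: v.time() if isinstance(v, pd.Timestamp) else v)
```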

Calculate mean on dataframe given a frequency using pandas

I have a dataframe where values are measured each 30 minutes, as shown below:
2015-01-01 00:00:00 94.50
2015-01-01 00:30:00 78.75
2015-01-01 01:00:00 85.87
2015-01-01 01:30:00 85.88
2015-01-01 02:00:00 84.75
2015-01-01 02:30:00 87.50
So, each day has 48 values. The first column is the time index, created by using:
date = pd.date_range('1/1/2015', periods=len(series), freq='30min')
series = series.values.reshape(-1, 1)
df = pd.DataFrame(series, index=date)
What I would like to do is to obtain the mean for each time of the day and weekday. Something like this:
My initial idea was to group by weekday and by frequency (30 min.) as follows:
df= df.groupby([ df.index.weekday,df.index.freq])
print(df.describe())
count mean std min 25% 50% 75%
0 2015-01-05 00:30:00 1.0 93.75 NaN 93.75 93.75 93.75 93.75
2015-01-05 01:00:00 1.0 110.25 NaN 110.25 110.25 110.25 110.25
2015-01-05 01:30:00 1.0 110.88 NaN 110.88 110.88 110.88 110.88
2015-01-05 02:00:00 1.0 90.12 NaN 90.12 90.12 90.12 90.12
2015-01-05 02:30:00 1.0 91.50 NaN 91.50 91.50 91.50 91.50
2015-01-05 03:00:00 1.0 94.13 NaN 94.13 94.13 94.13 94.13
2015-01-05 03:30:00 1.0 90.62 NaN 90.62 90.62 90.62 90.62
2015-01-05 04:00:00 1.0 91.88 NaN 91.88 91.88 91.88 91.88
2015-01-05 04:30:00 1.0 92.50 NaN 92.50 92.50 92.50 92.50
2015-01-05 05:00:00 1.0 98.12 NaN 98.12 98.12 98.12 98.12
2015-01-05 05:30:00 1.0 105.75 NaN 105.75 105.75 105.75 105.75
2015-01-05 06:00:00 1.0 100.50 NaN 100.50 100.50 100.50 100.50
2015-01-05 06:30:00 1.0 82.25 NaN 82.25 82.25 82.25 82.25
2015-01-05 07:00:00 1.0 81.75 NaN 81.75 81.75 81.75 81.75
2015-01-05 07:30:00 1.0 90.50 NaN 90.50 90.50 90.50 90.50
2015-01-05 08:00:00 1.0 89.50 NaN 89.50 89.50 89.50 89.50
2015-01-05 08:30:00 1.0 89.63 NaN 89.63 89.63 89.63 89.63
2015-01-05 09:00:00 1.0 84.62 NaN 84.62 84.62 84.62 84.62
2015-01-05 09:30:00 1.0 86.63 NaN 86.63 86.63 86.63 86.63
2015-01-05 10:00:00 1.0 96.12 NaN 96.12 96.12 96.12 96.12
2015-01-05 10:30:00 1.0 104.13 NaN 104.13 104.13 104.13 104.13
2015-01-05 11:00:00 1.0 101.12 NaN 101.12 101.12 101.12 101.12
2015-01-05 11:30:00 1.0 85.88 NaN 85.88 85.88 85.88 85.88
2015-01-05 12:00:00 1.0 77.12 NaN 77.12 77.12 77.12 77.12
2015-01-05 12:30:00 1.0 78.88 NaN 78.88 78.88 78.88 78.88
2015-01-05 13:00:00 1.0 76.62 NaN 76.62 76.62 76.62 76.62
2015-01-05 13:30:00 1.0 78.63 NaN 78.63 78.63 78.63 78.63
2015-01-05 14:00:00 1.0 85.37 NaN 85.37 85.37 85.37 85.37
2015-01-05 14:30:00 1.0 103.63 NaN 103.63 103.63 103.63 103.63
2015-01-05 15:00:00 1.0 112.87 NaN 112.87 112.87 112.87 112.87
... ... ... .. ... ... ... ...
6 2016-10-02 09:30:00 1.0 84.75 NaN 84.75 84.75 84.75 84.75
2016-10-02 10:00:00 1.0 60.49 NaN 60.49 60.49 60.49 60.49
2016-10-02 10:30:00 1.0 76.25 NaN 76.25 76.25 76.25 76.25
2016-10-02 11:00:00 1.0 68.13 NaN 68.13 68.13 68.13 68.13
2016-10-02 11:30:00 1.0 54.15 NaN 54.15 54.15 54.15 54.15
2016-10-02 12:00:00 1.0 79.91 NaN 79.91 79.91 79.91 79.91
2016-10-02 12:30:00 1.0 72.79 NaN 72.79 72.79 72.79 72.79
2016-10-02 13:00:00 1.0 77.49 NaN 77.49 77.49 77.49 77.49
2016-10-02 13:30:00 1.0 77.65 NaN 77.65 77.65 77.65 77.65
2016-10-02 14:00:00 1.0 70.44 NaN 70.44 70.44 70.44 70.44
2016-10-02 14:30:00 1.0 82.47 NaN 82.47 82.47 82.47 82.47
2016-10-02 15:00:00 1.0 41.53 NaN 41.53 41.53 41.53 41.53
2016-10-02 15:30:00 1.0 66.65 NaN 66.65 66.65 66.65 66.65
2016-10-02 16:00:00 1.0 55.23 NaN 55.23 55.23 55.23 55.23
2016-10-02 16:30:00 1.0 59.45 NaN 59.45 59.45 59.45 59.45
2016-10-02 17:00:00 1.0 79.92 NaN 79.92 79.92 79.92 79.92
2016-10-02 17:30:00 1.0 58.48 NaN 58.48 58.48 58.48 58.48
2016-10-02 18:00:00 1.0 92.56 NaN 92.56 92.56 92.56 92.56
2016-10-02 18:30:00 1.0 86.92 NaN 86.92 86.92 86.92 86.92
2016-10-02 19:00:00 1.0 88.61 NaN 88.61 88.61 88.61 88.61
2016-10-02 19:30:00 1.0 99.21 NaN 99.21 99.21 99.21 99.21
2016-10-02 20:00:00 1.0 81.02 NaN 81.02 81.02 81.02 81.02
2016-10-02 20:30:00 1.0 84.83 NaN 84.83 84.83 84.83 84.83
2016-10-02 21:00:00 1.0 59.29 NaN 59.29 59.29 59.29 59.29
2016-10-02 21:30:00 1.0 95.99 NaN 95.99 95.99 95.99 95.99
2016-10-02 22:00:00 1.0 76.95 NaN 76.95 76.95 76.95 76.95
2016-10-02 22:30:00 1.0 112.49 NaN 112.49 112.49 112.49 112.49
2016-10-02 23:00:00 1.0 88.85 NaN 88.85 88.85 88.85 88.85
2016-10-02 23:30:00 1.0 122.40 NaN 122.40 122.40 122.40 122.40
2016-10-03 00:00:00 1.0 82.84 NaN 82.84 82.84 82.84 82.84
By looking at this, you can see it just groups by weekday, with each timestamp still forming its own group. So this is not the proper way to group in order to calculate the mean I wanted.
I'd use df.index.weekday and df.index.time
df.groupby([ df.index.weekday,df.index.time]).mean()
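A tiny runnable example of that grouping, on synthetic data with two Mondays sharing the same clock times so each (weekday, time) bucket actually averages two values:

```python
import pandas as pd

# Two Mondays (2015-01-05 and 2015-01-12), two half-hour slots each.
idx = pd.to_datetime(['2015-01-05 00:00', '2015-01-05 00:30',
                      '2015-01-12 00:00', '2015-01-12 00:30'])
df = pd.DataFrame({'value': [90.0, 100.0, 110.0, 80.0]}, index=idx)

# Mean per (weekday, time-of-day) bucket.
means = df.groupby([df.index.weekday, df.index.time]).mean()
```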

Pandas: Find first occurrence - on daily basis in a timeseries

I'm struggling with this, so any input is appreciated. I want to iterate over the values in a dataframe column and return the first instance a value is seen each day. Groupby looked like a good option, but when using df.groupby(grouper).first() with the grouper set to daily, the following output is seen.
In [95]:
df.groupby(grouper).first()
Out[95]:
test_1
2014-03-04 1.0
2014-03-05 1.0
This only gives the day the value was seen in test_1 and does not reset first() on a daily basis, which is what I need (see the desired output below).
I want to preserve the time this value was seen in the following format:
This is the input dataframe:
test_1
2014-03-04 09:00:00 NaN
2014-03-04 10:00:00 NaN
2014-03-04 11:00:00 NaN
2014-03-04 12:00:00 NaN
2014-03-04 13:00:00 NaN
2014-03-04 14:00:00 1.0
2014-03-04 15:00:00 NaN
2014-03-04 16:00:00 1.0
2014-03-05 09:00:00 1.0
This is the desired output:
test_1 test_output
2014-03-04 09:00:00 NaN NaN
2014-03-04 10:00:00 NaN NaN
2014-03-04 11:00:00 NaN NaN
2014-03-04 12:00:00 NaN NaN
2014-03-04 13:00:00 NaN NaN
2014-03-04 14:00:00 1.0 1.0
2014-03-04 15:00:00 NaN NaN
2014-03-04 16:00:00 1.0 NaN
2014-03-05 09:00:00 1.0 NaN
I just want to mark the time when an event first occurs in a new column named test_output.
Admins: please note this question is different from the one marked as a duplicate, as this requires a rolling one-day first occurrence.
Try this, using this data:
rng = pd.DataFrame({'test_1': [None, None, None, None, 1, 1, 1, None,
                               None, None, 1, None, None, None]},
                   index=pd.date_range('4/2/2014', periods=14, freq='BH'))
rng
test_1
2014-04-02 09:00:00 NaN
2014-04-02 10:00:00 NaN
2014-04-02 11:00:00 NaN
2014-04-02 12:00:00 NaN
2014-04-02 13:00:00 1.0
2014-04-02 14:00:00 1.0
2014-04-02 15:00:00 1.0
2014-04-02 16:00:00 NaN
2014-04-03 09:00:00 NaN
2014-04-03 10:00:00 NaN
2014-04-03 11:00:00 1.0
2014-04-03 12:00:00 NaN
2014-04-03 13:00:00 NaN
2014-04-03 14:00:00 NaN
The output is this:
# pd.TimeGrouper has been removed from pandas; pd.Grouper is the replacement.
rng['test_output'] = rng['test_1'].loc[rng.groupby(pd.Grouper(freq='D'))['test_1'].idxmin()]
test_1 test_output
2014-04-02 09:00:00 NaN NaN
2014-04-02 10:00:00 NaN NaN
2014-04-02 11:00:00 NaN NaN
2014-04-02 12:00:00 NaN NaN
2014-04-02 13:00:00 1.0 1.0
2014-04-02 14:00:00 1.0 NaN
2014-04-02 15:00:00 1.0 NaN
2014-04-02 16:00:00 NaN NaN
2014-04-03 09:00:00 NaN NaN
2014-04-03 10:00:00 NaN NaN
2014-04-03 11:00:00 1.0 1.0
2014-04-03 12:00:00 NaN NaN
2014-04-03 13:00:00 NaN NaN
2014-04-03 14:00:00 NaN NaN
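An alternative sketch that avoids idxmin (which only coincides with "first occurrence" because all values are equal): use first_valid_index per day. Hypothetical five-row data for illustration:

```python
import pandas as pd

# One day of hourly data with NaN gaps; first 1.0 appears at 10:00.
idx = pd.date_range('2014-03-04 09:00', periods=5, freq='h')
df = pd.DataFrame({'test_1': [None, 1.0, None, 1.0, None]}, index=idx)

# First non-NaN timestamp of each calendar day.
first_idx = (df.groupby(pd.Grouper(freq='D'))['test_1']
               .apply(lambda s: s.first_valid_index())
               .dropna())

# Keep the value only at each day's first occurrence; NaN elsewhere.
df['test_output'] = df['test_1'].where(df.index.isin(first_idx))
```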
