I have the following df1 below, showing hhmm times. These values represent literal times but are in the wrong format: e.g. 845 should be 08:45, and 1125 should be 11:25.
CU Parameters 31-07-2017 01-08-2017 02-08-2017 03-08-2017
CU0111-039820-L Time of Full Charge 1125 0 1359 1112
CU0111-041796-H Time of Full Charge 1233 0 0 1135
CU0111-046907-0 Time of Full Charge 845 0 1229 1028
CU0111-046933-6 Time of Full Charge 1053 0 0 1120
CU0111-050103-K Time of Full Charge 932 0 1314 1108
CU0111-052525-J Time of Full Charge 1214 1424 1307 1254
CU0111-052534-M Time of Full Charge 944 0 0 1128
CU0111-052727-7 Time of Full Charge 1136 0 1443 1114
I need to convert all of these values into valid timestamps of hh:mm, and then work out the average of these timestamps, excluding the values that are '0'.
CU Parameters 31-07-2017 01-08-2017 02-08-2017 03-08-2017
CU0111-039820-L Time of Full Charge 11:25 0 13:59 11:12
CU0111-041796-H Time of Full Charge 12:33 0 0 11:35
CU0111-046907-0 Time of Full Charge 08:45 0 12:29 10:28
CU0111-046933-6 Time of Full Charge 10:53 0 0 11:20
CU0111-050103-K Time of Full Charge 09:32 0 13:14 11:08
CU0111-052525-J Time of Full Charge 12:14 14:24 13:07 12:54
CU0111-052534-M Time of Full Charge 09:44 0 0 11:28
CU0111-052727-7 Time of Full Charge 11:36 0 14:43 11:14
End result:
Average time of charge: hh:mm (excluding 0 values)
Number of no charges: =count(number of 0)
I have tried something along these lines, to no avail:
text = df1[col_list].astype(str)
df1[col_list] = text.str[:-2] + ':' + text.str[-2:]
hhmm = df1[col_list]                                   # these are now strings
minutes = (hhmm / 100).astype(int) * 60 + hhmm % 100   # fails: arithmetic on strings
df[col_list] = pd.to_timedelta(minutes, 'm')           # also note: 'df' here should be 'df1'
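For reference, a purely numeric version of that attempt would work — a minimal sketch, assuming df1[col_list] still holds plain integers like 845 and 1125 (i.e. before any string conversion):
import pandas as pd

hhmm = df1[col_list]                                  # keep the columns numeric
minutes = (hhmm // 100) * 60 + hhmm % 100             # e.g. 845 -> 8*60 + 45 = 525
df1[col_list] = minutes.apply(pd.to_timedelta, unit='m')   # 0 simply becomes 00:00:00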
I think you can convert all values to timedeltas first:
cols = df.columns.difference(['CU','Parameters'])
# replace 0 with '0000', then build 'hh:mm:00' strings that to_timedelta can parse
df[cols] = (df[cols].replace(0, '0000')
                    .astype(str)
                    .apply(lambda x: pd.to_timedelta(x.str[:-2] + ':' + x.str[-2:] + ':00')))
print (df)
CU Parameters 31-07-2017 01-08-2017 02-08-2017 \
0 CU0111-039820-L Time of Full Charge 11:25:00 00:00:00 13:59:00
1 CU0111-041796-H Time of Full Charge 12:33:00 00:00:00 00:00:00
2 CU0111-046907-0 Time of Full Charge 08:45:00 00:00:00 12:29:00
3 CU0111-046933-6 Time of Full Charge 10:53:00 00:00:00 00:00:00
4 CU0111-050103-K Time of Full Charge 09:32:00 00:00:00 13:14:00
5 CU0111-052525-J Time of Full Charge 12:14:00 14:24:00 13:07:00
6 CU0111-052534-M Time of Full Charge 09:44:00 00:00:00 00:00:00
7 CU0111-052727-7 Time of Full Charge 11:36:00 00:00:00 14:43:00
03-08-2017
0 11:12:00
1 11:35:00
2 10:28:00
3 11:20:00
4 11:08:00
5 12:54:00
6 11:28:00
7 11:14:00
And then create new columns for the average of the non-null timedeltas, and count the zeros as a sum of True values:
df['avg'] = df[cols][df[cols].ne(0)].mean(axis=1)
df['number no charges'] = df[cols].eq(0).sum(axis=1)
print (df)
CU Parameters 31-07-2017 01-08-2017 02-08-2017 \
0 CU0111-039820-L Time of Full Charge 11:25:00 00:00:00 13:59:00
1 CU0111-041796-H Time of Full Charge 12:33:00 00:00:00 00:00:00
2 CU0111-046907-0 Time of Full Charge 08:45:00 00:00:00 12:29:00
3 CU0111-046933-6 Time of Full Charge 10:53:00 00:00:00 00:00:00
4 CU0111-050103-K Time of Full Charge 09:32:00 00:00:00 13:14:00
5 CU0111-052525-J Time of Full Charge 12:14:00 14:24:00 13:07:00
6 CU0111-052534-M Time of Full Charge 09:44:00 00:00:00 00:00:00
7 CU0111-052727-7 Time of Full Charge 11:36:00 00:00:00 14:43:00
03-08-2017 avg number no charges
0 11:12:00 12:12:00 1
1 11:35:00 12:04:00 2
2 10:28:00 10:34:00 1
3 11:20:00 11:06:30 2
4 11:08:00 11:18:00 1
5 12:54:00 13:09:45 0
6 11:28:00 10:36:00 2
7 11:14:00 12:31:00 1
print (df[cols][df[cols].ne(0)])
01-08-2017 02-08-2017 03-08-2017 31-07-2017
0 NaT 13:59:00 11:12:00 11:25:00
1 NaT NaT 11:35:00 12:33:00
2 NaT 12:29:00 10:28:00 08:45:00
3 NaT NaT 11:20:00 10:53:00
4 NaT 13:14:00 11:08:00 09:32:00
5 14:24:00 13:07:00 12:54:00 12:14:00
6 NaT NaT 11:28:00 09:44:00
7 NaT 14:43:00 11:14:00 11:36:00
print (df[cols].eq(0))
01-08-2017 02-08-2017 03-08-2017 31-07-2017
0 True False False False
1 True True False False
2 True False False False
3 True True False False
4 True False False False
5 False False False False
6 True True False False
7 True False False False
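To produce the 'End result' lines from the question, a short sketch on top of the frame above (assuming df[cols] now holds the timedeltas):
stacked = df[cols].stack()                        # all day-columns in one Series
valid = stacked[stacked.ne(pd.Timedelta(0))]      # drop the zero placeholders
print('Average time of charge:', valid.mean())
print('Number of no charges:', int(stacked.eq(pd.Timedelta(0)).sum()))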
I have data about how many messages each account sends, aggregated to an hourly level. For each row, I would like to add a column with the average of the previous 7 days' messages (the 7-day sum divided by 7). I know I can group by account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average, because there isn't a row in the data if the account didn't send any messages that day (and I'd rather not balloon my data by adding those rows in, if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?
Note: For any day not in the data, assume 0 messages sent.
Raw Data:
Account | Messages | Date | Hour
12 5 2022-07-11 09:00:00
12 6 2022-07-13 10:00:00
12 10 2022-07-13 11:00:00
12 9 2022-07-15 16:00:00
12 1 2022-07-19 13:00:00
15 2 2022-07-12 10:00:00
15 13 2022-07-13 11:00:00
15 3 2022-07-17 16:00:00
15 4 2022-07-22 13:00:00
Desired Output:
Account | Messages | Date | Hour | Rolling Previous 7 Day Average
12 5 2022-07-11 09:00:00 0
12 6 2022-07-13 10:00:00 0.714
12 10 2022-07-13 11:00:00 0.714
12 9 2022-07-15 16:00:00 3
12 1 2022-07-19 13:00:00 3.571
15 2 2022-07-12 10:00:00 0
15 13 2022-07-13 11:00:00 0.286
15 3 2022-07-17 16:00:00 2.143
15 4 2022-07-22 13:00:00 0.429
I hope I've understood your question right:
df["Date"] = pd.to_datetime(df["Date"])
df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
"sum"
)
df["Rolling Previous 7 Day Average"] = (
df.set_index("Date")
.groupby("Account")["Messages_tmp"]
.rolling("7D")
.apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
).values
df = df.drop(columns="Messages_tmp")
print(df)
Prints:
Account Messages Date Hour Rolling Previous 7 Day Average
0 12 5 2022-07-11 09:00:00 0.000000
1 12 6 2022-07-13 10:00:00 0.714286
2 12 10 2022-07-13 11:00:00 0.714286
3 12 9 2022-07-15 16:00:00 3.000000
4 12 1 2022-07-19 13:00:00 3.571429
5 15 2 2022-07-12 10:00:00 0.000000
6 15 13 2022-07-13 11:00:00 0.285714
7 15 3 2022-07-17 16:00:00 2.142857
8 15 4 2022-07-22 13:00:00 0.428571
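An alternative sketch of the same computation (my own variant, assuming df['Date'] is already datetime, as above): build a dense daily matrix per account, take a 7-day rolling sum that excludes the current day, and map it back onto the hourly rows.
daily = (df.groupby(['Account', 'Date'])['Messages'].sum()
           .unstack('Account')                    # days x accounts
           .asfreq('D', fill_value=0)             # insert the missing days as 0
           .fillna(0))
prev7 = (daily.rolling('7D', closed='left').sum() / 7).fillna(0)
df['Rolling Previous 7 Day Average'] = [
    prev7.at[d, a] for a, d in zip(df['Account'], df['Date'])
]
The closed='left' window is what makes the current day drop out of its own average.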
I have this dataframe with these datatypes:
Date Time
0 2022-05-20 17:07:00
1 2022-05-20 09:14:00
2 2022-05-19 18:56:00
3 2022-05-19 13:53:00
4 2022-05-19 13:52:00
... ... ...
81 2022-04-22 09:53:00
82 2022-04-20 18:20:00
83 2022-04-20 12:53:00
84 2022-04-20 12:12:00
85 2022-04-20 09:50:00
86 rows × 2 columns
Date datetime64[ns]
Time object
dtype: object
I tried
df1 = df[['Date','Time']].groupby(['Date']).agg(['count'])
and got
Time
Date count
2022-04-20 4
2022-04-22 4
2022-04-25 3
2022-04-26 6
2022-04-27 4
2022-04-28 4
2022-04-29 4
2022-05-02 4
2022-05-03 4
2022-05-04 4
The Time column also disappears when I tried
df = df.groupby(['Date'])['Date'].count().reset_index(name='Counts')
0 2022-04-20 4
1 2022-04-22 4
2 2022-04-25 2
3 2022-04-26 6
4 2022-04-27 4
So the Time column is just gone. How do I get a dataframe where Date is the index, with the Time values on that date and a count of the occurrences of that date? My project is to find the differences in Time within a date when the number of entries for that date is even. For example, if there are 4 time entries on 5/19/2022, then I need to find the difference between entry 1 and entry 2, then between entry 3 and entry 4, and sum those to get the final result. I don't know if there is a more elegant way to do it other than a dataframe.
You can merge the count by date back onto the original df. Does that help?
df2 = df.groupby(['Date'])['Date'].count().reset_index(name='count')
df3 = df.merge(df2, on='Date', how='left')
df3.set_index('Date', inplace=True)
df3
Time count
Date
2022-05-20 17:07:00 2
2022-05-20 09:14:00 2
2022-05-19 18:56:00 3
2022-05-19 13:53:00 3
2022-05-19 13:52:00 3
2022-04-22 09:53:00 1
2022-04-20 18:20:00 4
2022-04-20 12:53:00 4
2022-04-20 12:12:00 4
2022-04-20 09:50:00 4
To make each date appear only once:
df2 = df.groupby(['Date'])['Date'].count().reset_index(name='count')
df3 = df.merge(df2, on='Date', how='left')
df3 = df3.reset_index()
df3['index'] = 'col'   # added to make use of pd.pivot below, a workaround
df3.pivot(index=['Date','Time','count'], columns='index')
Date Time count
2022-04-20 09:50:00 4
12:12:00 4
12:53:00 4
18:20:00 4
2022-04-22 09:53:00 1
2022-05-19 13:52:00 3
13:53:00 3
18:56:00 3
2022-05-20 09:14:00 2
You can use nunique:
df['count'] = df.groupby('Date').transform('nunique')
print(df)
# Output
Date Time count
0 2022-05-20 0 days 17:07:00 2
1 2022-05-20 0 days 09:14:00 2
2 2022-05-19 0 days 18:56:00 3
3 2022-05-19 0 days 13:53:00 3
4 2022-05-19 0 days 13:52:00 3
81 2022-04-22 0 days 09:53:00 1
82 2022-04-20 0 days 18:20:00 4
83 2022-04-20 0 days 12:53:00 4
84 2022-04-20 0 days 12:12:00 4
85 2022-04-20 0 days 09:50:00 4
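For the pairing project mentioned in the question (difference between entries 1 and 2, then 3 and 4, summed per date), a minimal sketch, assuming Time is first converted from object to timedelta:
import pandas as pd

df['Time'] = pd.to_timedelta(df['Time'])

def paired_diff(s):
    s = s.sort_values()
    if len(s) % 2:               # an unpaired entry: skip this date
        return pd.NaT
    return (s.iloc[1::2].values - s.iloc[0::2].values).sum()

per_date = df.groupby('Date')['Time'].apply(paired_diff)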
I want the time without the date in Pandas.
I want to keep the time as dtype datetime64[ns] and not as an object so that I can determine periods between times.
The closest I have gotten is as follows, but it gives back the date in the new column, not the time that I need as dtype datetime64[ns].
df_pres_mf['time'] = pd.to_datetime(df_pres_mf['time'], format='%H:%M', errors='coerce')  # returns the dummy date (1900-01-01) plus the actual time, as dtype datetime64[ns]
df_pres_mf['just_time'] = df_pres_mf['time'].dt.date               # gives the date part (as objects), not the time
df_pres_mf['normalised_time'] = df_pres_mf['time'].dt.normalize()  # zeroes the time, keeping datetime64[ns]
df_pres_mf.head()
Returns the date as 1900-01-01 and not the time that is needed.
Edit: Data
time
1900-01-01 11:16:00
1900-01-01 15:20:00
1900-01-01 09:55:00
1900-01-01 12:01:00
You could do it like Vishnudev suggested but then you would have dtype: object (or even strings, after using dt.strftime), which you said you didn't want.
What you are looking for doesn't exist, but the closest thing I can get you is converting to timedeltas, which won't seem like a solution at first but is actually very useful.
Convert it like this:
# sample df
df
>>
time
0 2021-02-07 09:22:00
1 2021-05-10 19:45:00
2 2021-01-14 06:53:00
3 2021-05-27 13:42:00
4 2021-01-18 17:28:00
df["timed"] = df.time - df.time.dt.normalize()
df
>>
time timed
0 2021-02-07 09:22:00 0 days 09:22:00 # this is just the time difference
1 2021-05-10 19:45:00 0 days 19:45:00 # since midnight, which is essentially the
2 2021-01-14 06:53:00 0 days 06:53:00 # same thing as regular time, except
3 2021-05-27 13:42:00 0 days 13:42:00 # that you can go over 24 hours
4 2021-01-18 17:28:00 0 days 17:28:00
this allows you to calculate periods between times like this:
# subtract the last time from the current
df["difference"] = df.timed - df.timed.shift()
df
Out[48]:
time timed difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 # <-- this is because the last
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 # time was later than the current
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 # (see below)
To get rid of the negative differences, make them absolute:
df["abs_difference"] = df.difference.abs()
df
>>
time timed difference abs_difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 0 days 12:52:00 ### <<--
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 0 days 06:49:00
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 0 days 03:46:00
Use proper formatting according to your date format and convert to datetime
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
Format according to the preferred format
df['time'].dt.strftime('%H:%M')
Output
0 11:16
1 15:20
2 09:55
3 12:01
Name: time, dtype: object
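One caveat with this route, as noted above: dt.strftime returns strings (dtype object), so for arithmetic you would have to convert back, e.g.:
periods = pd.to_timedelta(df['time'].dt.strftime('%H:%M') + ':00')   # back to timedelta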
I have a time series on a minute level where I log certain events. For simplification, here I use a binary classification: is there an event or not. I want to get some daily stats on it.
I tried to explain what I have and what I want to get in the figure below.
So, in summary, I would like to detect all events (1) with their duration.
What would be the easiest way of doing this in Python/Pandas?
Here is extract from the dataframe
Time
2020-01-27 09:26:00 0
2020-01-27 09:28:00 0
2020-01-27 09:30:00 0
2020-01-27 09:32:00 0
2020-01-27 09:34:00 0
2020-01-27 09:36:00 0
2020-01-27 09:38:00 0
2020-01-27 09:40:00 0
2020-01-27 09:42:00 0
2020-01-27 09:44:00 0
2020-01-27 09:46:00 0
2020-01-27 09:48:00 1
2020-01-27 09:50:00 1
2020-01-27 09:52:00 1
2020-01-27 09:54:00 1
2020-01-27 09:56:00 1
2020-01-27 09:58:00 1
2020-01-27 10:00:00 1
2020-01-27 10:02:00 1
2020-01-27 10:04:00 1
2020-01-27 10:06:00 1
2020-01-27 10:08:00 1
2020-01-27 10:10:00 1
2020-01-27 10:12:00 1
2020-01-27 10:14:00 1
2020-01-27 10:16:00 1
2020-01-27 10:18:00 1
2020-01-27 10:20:00 1
2020-01-27 10:22:00 1
2020-01-27 10:24:00 1
I've solved it myself. Here is how:
I calculated the difference between the current and previous timestep, and based on that I counted the number of occurrences of an event and calculated the time difference between the start of an event (1) and its end (-1).
Sessions['Diff'] = Sessions['Event'].diff()               # 1 marks an event start, -1 an event end
begin = Sessions.loc[Sessions['Diff'] == 1].index
cutoffs = Sessions.loc[Sessions['Diff'] == -1].index
idx = cutoffs.searchsorted(begin)                         # first end at or after each start
mask = idx < len(cutoffs)                                 # drop starts with no matching end
idx = idx[mask]
begin = begin[mask]
end = cutoffs[idx]
result = pd.DataFrame({'begin': begin, 'end': end})
result['dT'] = result['end'] - result['begin']
result['dT'] = result['dT'].dt.total_seconds().div(60)    # duration in minutes
result = result.set_index('begin', drop=False)
result = result.rename(columns={'begin': 'sessions', 'dT': 'avg_dT'})
result['tot_dT'] = result['avg_dT']                       # copy so resample can mean one column and sum the other
Sessions_daily = result.resample('D').apply({'sessions': 'count', 'avg_dT': 'mean', 'tot_dT': 'sum'})
Where Sessions is a dataframe of the time series (minute sampling) indicating whether there is an event (1) or not (0).
This resulted in
sessions avg_dT tot_dT
begin
2020-01-03 5 31.200000 156.0
2020-01-04 0 NaN 0.0
2020-01-05 0 NaN 0.0
2020-01-06 0 NaN 0.0
2020-01-07 9 39.333333 354.0
2020-01-08 8 38.000000 304.0
2020-01-09 8 33.000000 264.0
2020-01-10 8 39.250000 314.0
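For comparison, a shorter run-id sketch (my own variant, assuming the same minute-level frame with an 'Event' column of 0/1 and a DatetimeIndex):
import pandas as pd

run_id = Sessions['Event'].ne(Sessions['Event'].shift()).cumsum()    # new id at every 0/1 flip
events = (Sessions[Sessions['Event'].eq(1)]
          .groupby(run_id)
          .apply(lambda g: pd.Series({'begin': g.index[0],
                                      'end': g.index[-1],
                                      'dT': (g.index[-1] - g.index[0]).total_seconds() / 60})))
Note this measures from the first to the last event minute, so it can differ by one sampling step from the searchsorted approach above.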
I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works; the value of duration is NaT after running the code.
UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
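A hedged guess at why the original attempts returned NaT (which "check your data" hints at): groupby(...).sum() gives back a Series indexed by user_id, while df has a 0..n row index, so a plain assignment aligns on mismatched indexes:
per_user = df.groupby('user_id')['interval'].sum()   # index: user_id values
df['duration'] = per_user        # misaligned -> NaT except where a row label equals a user_id
df['duration'] = df['user_id'].map(per_user)         # aligned; equivalent to transform('sum')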