How to calculate the duration of events using Pandas when times are given as strings? - python

After finding the following link regarding calculating time differences using Pandas, I'm still stuck attempting to fit that knowledge to my own data. Here's what my dataset looks like:
In [10]: df
Out[10]:
id time
0 420 1/3/2018 8:32
1 420 1/3/2018 8:36
2 420 1/3/2018 8:42
3 425 1/7/2018 12:35
4 425 1/7/2018 14:29
5 425 1/7/2018 16:15
6 425 1/7/2018 16:36
7 427 1/11/2018 20:50
8 428 1/13/2018 16:35
9 428 1/13/2018 17:36
I'd like to perform a groupby or another function on ID where the output is:
In [11]: pd.groupby(df[id])
Out [11]:
id time (duration)
0 420 0:10
1 425 4:01
2 427 0:00
3 428 1:01
The types for id and time are int64 and object respectively. Using python3 and pandas 0.20.
Edit:
Coming from SQL, it appears this would be functionally equivalent to:
select id, max(time) - min(time)
from df
group by id
Edit 2:
Thank you all for the quick responses. All of the solutions give me some version of the following error. Not sure what is relevant to my particular dataset that I'm missing here:
TypeError: unsupported operand type(s) for -: 'str' and 'str'

groupby with np.ptp (peak-to-peak, i.e. max - min). Convert the strings to datetimes first, otherwise the subtraction fails:
df['time'] = pd.to_datetime(df['time'])
df.groupby('id').time.apply(np.ptp)
id
420 00:10:00
425 04:01:00
427 00:00:00
428 01:01:00
Name: time, dtype: timedelta64[ns]

Convert the time column to datetimes (this avoids the TypeError from Edit 2), then group the dataframe by event IDs and select the smallest and the largest times:
df['time'] = pd.to_datetime(df['time'])
df1 = df.groupby('id').agg([max, min])
Find the difference:
(df1[('time','max')] - df1[('time','min')]).reset_index()
# id 0
#0 420 00:10:00
#1 425 04:01:00
#2 427 00:00:00
#3 428 01:01:00

You need to sort the dataframe by time and group by id before getting the difference between time in each group.
df['time'] = pd.to_datetime(df['time'])
df.sort_values(by='time').groupby('id')['time'].apply(lambda g: g.max() - g.min()).reset_index(name='duration')
Output:
id duration
0 420 00:10:00
1 425 04:01:00
2 427 00:00:00
3 428 01:01:00
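Putting the pieces together, here is a self-contained sketch that rebuilds a few rows of the sample data (times as strings, as in the question), converts them, and computes the per-id duration:

```python
import pandas as pd

# rebuild a few rows of the sample data; times are strings, as in the question
df = pd.DataFrame({
    'id':   [420, 420, 420, 425, 425, 425, 425, 427],
    'time': ['1/3/2018 8:32', '1/3/2018 8:36', '1/3/2018 8:42',
             '1/7/2018 12:35', '1/7/2018 14:29', '1/7/2018 16:15',
             '1/7/2018 16:36', '1/11/2018 20:50'],
})

# the TypeError in Edit 2 comes from subtracting strings; convert first
df['time'] = pd.to_datetime(df['time'])

# max - min per group, equivalent to the SQL in Edit 1
duration = df.groupby('id')['time'].agg(lambda g: g.max() - g.min())
print(duration)
```

A single-row group such as id 427 yields a zero timedelta, matching the desired output.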

How to unify (collapse) multiple columns into one assigning unique values

Edited my previous question:
I want to distinguish each device (FOUR types) attached to a particular building's particular elevator (represented by height).
As there are no unique IDs for the devices, I want to identify them and assign a unique ID to each by grouping on ('BldgID', 'BldgHt', 'Device').
Then count their testing results, i.e. how many times each failed (NG) out of the total number of tests (NG + OK) per date, over a duration of a few months.
Original dataframe looks like this
BldgID BldgHt Device Date Time Result
1074 34.0 790 2018/11/20 10:30 OK
1072 31.0 780 2018/11/19 11:10 NG
1072 36.0 780 2018/11/17 05:30 OK
1074 10.0 790 2018/11/19 06:10 OK
1074 10.0 790 2018/12/20 11:50 NG
1076 17.0 760 2018/08/15 09:20 NG
1076 17.0 760 2018/09/20 13:40 OK
As 'Time' is irrelevant, I dropped it. I want to find the number of NG per day for each set ('BldgID', 'BldgHt', 'Device').
#aggregate both functions in a single groupby
df1 = (mel_df.groupby(['BldgID','BldgHt','Device','Date'])['Result']
             .agg([('NG', lambda x: (x == 'NG').sum()), ('ALL', 'count')])
             .round(2).reset_index())
#create New_ID by inserting a Series zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1),
              index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Now the filtered DataFrame looks like:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 2
8 002 1076 17.0 760 2018/11/20 1 1
If I group by ['BldgID', 'BldgHt', 'Device', 'Date'] then I get the per-day 'NG'.
But it treats every day separately; if I assign unique IDs I can plot how each unique device behaves across days.
If I group by ['BldgID', 'BldgHt', 'Device'] then I get the overall 'NG' for that set (or unique device), which is not my goal.
What I want to achieve is:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
001 1072 31.0 780 2018/11/19 1 2
1072 31.0 780 2018/12/30 3 4
002 1076 17.0 760 2018/11/20 1 1
1076 17.0 760 2018/09/20 2 4
003 1072 36.0 780 2018/08/15 1 3
Any tips would be very much appreciated.
Use:
#compute both aggregations in a single groupby
df1 = (mel_df.groupby(['BldgID','BldgHt','Device','Date'])['Result']
             .agg([('NG', lambda x: (x == 'NG').sum()), ('ALL', 'count')])
             .round(2).reset_index())
#filter out rows with 0 NG
mel_df2 = df1[df1.NG != 0]
#keep only the first row per Date
mel_df2 = mel_df2.drop_duplicates('Date')
#create New_ID by inserting a Series zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1), index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Output from data from question:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 1
8 002 1076 17.0 760 2018/11/20 1 1
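Since the stated goal is one ID per (BldgID, BldgHt, Device) combination rather than one ID per row, groupby(...).ngroup() may be a closer fit than a row-numbering Series. A sketch on toy data resembling the desired output (the NG/ALL values are taken from the question's example, not computed here):

```python
import pandas as pd

df1 = pd.DataFrame({
    'BldgID': [1072, 1072, 1076, 1076],
    'BldgHt': [31.0, 31.0, 17.0, 17.0],
    'Device': [780, 780, 760, 760],
    'Date':   ['2018/11/19', '2018/12/30', '2018/11/20', '2018/09/20'],
    'NG':     [1, 3, 1, 2],
    'ALL':    [2, 4, 1, 4],
})

# one label per (BldgID, BldgHt, Device) group, zero-filled to 3 digits;
# every row of the same device gets the same New_ID
ids = df1.groupby(['BldgID', 'BldgHt', 'Device']).ngroup().add(1)
df1.insert(0, 'New_ID', ids.astype(str).str.zfill(3))
print(df1)
```

This keeps all dates for each device under one ID, which should make the per-day plots per device straightforward.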

How can I loop through pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the per-group difference with DataFrameGroupBy.diff, and replace missing values with a 0 timedelta using Series.fillna:
#if the values are already strings, astype(str) can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds with Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
If you just want to iterate over the groupby object (based on your original question title), you can:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works for 10,000 or 100,000 rows, it does not scale well to 10^6 rows or more.
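A compact, runnable version of the diff-per-group approach on a few rows of the sample data:

```python
import pandas as pd

# a few rows of the sample data, Time as strings
df = pd.DataFrame({
    'ID':       [995812, 995812, 995812, 995820, 995820],
    'Location': [696, 730, 761, 381, 761],
    'Time':     ['07:10:36', '07:11:41', '07:12:30', '06:55:07', '07:12:44'],
})

df['Time'] = pd.to_timedelta(df['Time'])
df = df.sort_values(['ID', 'Location', 'Time'])
# difference to the previous row within each ID; first row per group -> 0
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
print(df)
```

The first row of each ID gets a zero delta, matching the expected output in the question.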

how to multiply all values from a column in a particular year in pandas

I'm trying to multiply all values in a particular year and push it to another column. With the code below I'm getting this error
TypeError: ("'NoneType' object is not callable", 'occurred at index
I'm getting NaT and NaN when I use shift(1). How can I get it to work?
def check_date():
    next_row = df.Date.shift(1)
    first_row = df.Date
    date1 = pd.to_datetime(first_row).year
    date2 = pd.to_datetime(next_row).year
    if date1 == date2:
        df['all_data_in_year'] = date1 * date2
df.apply(check_date(), axis=1)
DataSet:
Date Open High Low Last Close Total Trade Quantity Turnover (Lacs)
31/12/10 816 824.5 807.3 815 818.45 1165987 9529.64
31/01/11 675 680 654 670.1 669.35 535039 3553.92
28/02/11 550 561.6 542 548.5 548.4 749166 4136.09
31/03/11 621.5 624.7 607.1 618 616.25 628572 3866
29/04/11 654.7 657.95 626 631 632.05 833213 5338.91
31/05/11 575 590 565.6 589.3 585.15 908185 5239.36
30/06/11 527 530.7 521.3 524 524.6 534496 2804.89
29/07/11 496.95 502.9 486 486.2 489.7 500743 2477.96
30/08/11 365.95 382.7 365 380 376.65 844439 3171.6
30/09/11 362.4 365.9 348.1 352 352.75 617537 2196.56
31/10/11 430 439.5 425 429.1 431.2 1033903 4493.97
30/11/11 349.05 354.95 344.15 348 350 686735 2404.1
30/12/11 353 355.9 340.1 340.1 342.75 740222 2565.39
31/01/12 443 451.45 428 445.5 446 1344942 5952.77
29/02/12 485.55 505.9 484 497 495.1 1011007 5004.46
30/03/12 421 436.45 418.4 432.5 432.95 867832 3740.04
30/04/12 410.35 419.4 406.85 414.3 414.05 418539 1733.81
31/05/12 362 363.05 351.2 359 358.3 840753 3000.41
29/06/12 385.05 395.3 382.9 388 389.75 1171690 4581.58
31/07/12 377.75 386 367.7 380.5 381.35 499246 1886.06
31/08/12 473.7 473.7 394.25 399 400.85 631225 2544.24
I think it is better to avoid loops (apply uses loops under the hood) and use numpy.where:
#sample DataFrame with sample datetimes
rng = pd.date_range('2017-04-03', periods=10, freq='8M')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
date1 = df.Date.shift(1).dt.year
date2 = df.Date.dt.year
df['all_data_in_year'] = np.where(date1 == date2, date1 * date2, np.nan)
print (df)
Date a all_data_in_year
0 2017-04-30 0 NaN
1 2017-12-31 1 4068289.0
2 2018-08-31 2 NaN
3 2019-04-30 3 NaN
4 2019-12-31 4 4076361.0
5 2020-08-31 5 NaN
6 2021-04-30 6 NaN
7 2021-12-31 7 4084441.0
8 2022-08-31 8 NaN
9 2023-04-30 9 NaN
EDIT1:
df['new'] = df.groupby( pd.to_datetime(df['Date']).dt.year)['Close'].transform('prod')
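EDIT1's transform('prod') broadcasts each year's product back onto every row of that year. A small sketch with invented Close values (the day-first dates follow the question's format):

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  ['31/01/11', '28/02/11', '31/01/12', '29/02/12'],
    'Close': [2.0, 3.0, 4.0, 5.0],
})

# parse day-first dates and extract the year
year = pd.to_datetime(df['Date'], format='%d/%m/%y').dt.year
# product of Close within each year, aligned back to the original rows
df['new'] = df.groupby(year)['Close'].transform('prod')
print(df)
```

Both 2011 rows receive 2.0 * 3.0 and both 2012 rows receive 4.0 * 5.0, with no explicit loop.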

Pandas, sorting a dataframe in a useful way to find the difference between times. Why are key and value errors appearing?

I have a pandas DataFrame with the columns ['date', 'sensorId', 'readerId', 'rssi'], plus a fifth time column derived from date:
df_json['time'] = df_json.date.dt.time
I am aiming to find people who have entered a store (rssi > 380). However this would be much more accurate if I could also check every record a sensorId appears in and whether the time in that record is within 5 seconds of the current record.
Data from the dataFrame: (df_json)
date sensorId readerId rssi
0 2017-03-17 09:15:59.453 4000068 76 352
0 2017-03-17 09:20:17.708 4000068 56 374
1 2017-03-17 09:20:42.561 4000068 60 392
0 2017-03-17 09:44:21.728 4000514 76 352
0 2017-03-17 10:32:45.227 4000461 76 332
0 2017-03-17 12:47:06.639 4000046 43 364
0 2017-03-17 12:49:34.438 4000046 62 423
0 2017-03-17 12:52:28.430 4000072 62 430
1 2017-03-17 12:52:32.593 4000072 62 394
0 2017-03-17 12:53:17.708 4000917 76 335
0 2017-03-17 12:54:24.848 4000072 25 402
1 2017-03-17 12:54:35.738 4000072 20 373
I would like to use jezrael's answer of df['date'].diff(). However I cannot successfully use this, I receive many different errors. The ['date'] column is of dtype datetime64[ns].
How the data is stored above is not useful, for the .diff() to be of any use the data must be stored as below (dfEntered):
Sample Data: dfEntered
date sensorId readerId time rssi
2017-03-17 4000046 43 12:47:06.639000 364
62 12:49:34.438000 423
4000068 56 09:20:17.708000 374
60 09:20:42.561000 392
76 09:15:59.453000 352
4000072 20 12:54:35.738000 373
12:54:42.673000 374
25 12:54:24.848000 402
12:54:39.723000 406
62 12:52:28.430000 430
12:52:32.593000 394
4000236 18 13:28:14.834000 411
I am planning on replacing 'time' with 'date'. Time is of dtype object and I cannot seem to cast it or diff() it; 'date' will be just as useful.
The only way (I have found) of having df_json appear as dfEntered is with:
dfEntered = df_json.groupby(by=[df_json.date.dt.time, 'sensorId', 'readerId', 'date'])
If I do:
dfEntered = df_json.groupby(by=[df_json.date.dt.time, 'sensorId', 'readerId'])['date'].diff()
results in:
File "processData.py", line 61, in <module>
dfEntered = df_json.groupby(by=[df_json.date.dt.date, 'sensorId', 'readerId', 'rssi'])['date'].diff()
File "<string>", line 17, in diff
File "C:\Users\danie\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 614, in wrapper
raise ValueError
ValueError
If I do:
dfEntered = df_json.groupby(by=[df_json.date.dt.date, 'sensorId', 'readerId', 'rssi'])['time'].count()
print(dfEntered['date'])
Results in:
File "processData.py", line 65, in <module>
print(dfEntered['date'])
File "C:\Users\danie\Anaconda2\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\danie\Anaconda2\lib\site-packages\pandas\core\indexes\multi.py", line 821, in get_value
raise e1
KeyError: 'date'
I applied a .count() to the groupby just so that I can output it. I had previously tried .agg({'date':'diff'}), which results in the ValueError, but the dtype is datetime64[ns] (at least in the original df_json; I cannot view the dtype of dfEntered['date']).
If the above would work I would like to have a df of [df_json.date.dt.date, 'sensorId', 'readerId', 'mask'] mask being true if they entered a store.
I then have the below df (contains sensorIds that received a text)
sensor_id sms_status date_report rssi readerId
0 5990100 SUCCESS 2017-05-03 13:41:28.412800 500 10
1 5990001 SUCCESS 2017-05-03 13:41:28.412800 500 11
2 5990100 SUCCESS 2017-05-03 13:41:30.413000 500 12
3 5990001 SUCCESS 2017-05-03 13:41:31.413100 500 13
4 5990100 SUCCESS 2017-05-03 13:41:34.413400 500 14
5 5990001 SUCCESS 2017-05-03 13:41:35.413500 500 52
6 5990100 SUCCESS 2017-05-03 13:41:38.413800 500 60
7 5990001 SUCCESS 2017-05-03 13:41:39.413900 500 61
I would then like to merge the two together on day, sensorId, readerId.
I am hoping that would result in a df that could appear as [df_json.date.dt.date, 'sensorId', 'readerId', 'mask'] and therefore I could say that a sensorId with a mask of true is a conversion. A conversion being that sensorId received a text that day and also entered the store that day.
I'm beginning to worry that my end aim isn't even achievable, as I simply do not understand how pandas works yet :D (damn errors)
UPDATE
dfEntered = dfEntered.reset_index()
This is allowing me to access the date and apply a diff.
I don't quite understand the theory of how this problem occurred, and why reset_index() fixed this.
I think you need boolean indexing with a mask created with diff:
df = pd.DataFrame({'rssi': [500,530,1020,1201,1231,10],
'time': pd.to_datetime(['2017-01-01 14:01:08','2017-01-01 14:01:14',
'2017-01-01 14:01:17', '2017-01-01 14:01:27',
'2017-01-01 14:01:29', '2017-01-01 14:01:30'])})
print (df)
rssi time
0 500 2017-01-01 14:01:08
1 530 2017-01-01 14:01:14
2 1020 2017-01-01 14:01:17
3 1201 2017-01-01 14:01:27
4 1231 2017-01-01 14:01:29
5 10 2017-01-01 14:01:30
print (df['time'].diff())
0 NaT
1 00:00:06
2 00:00:03
3 00:00:10
4 00:00:02
5 00:00:01
Name: time, dtype: timedelta64[ns]
mask = (df['time'].diff() >'00:00:05') & (df['rssi'] > 380)
print (mask)
0 False
1 True
2 False
3 True
4 False
5 False
dtype: bool
df1 = df[mask]
print (df1)
rssi time
1 530 2017-01-01 14:01:14
3 1201 2017-01-01 14:01:27
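The answer diffs the whole time column; the question, though, asks to compare each record only against earlier records of the same sensorId. One way to get that is a per-group diff, sketched here on a few invented readings (and note the direction of the comparison: "within 5 seconds" means a gap of at most 5 seconds):

```python
import pandas as pd

df = pd.DataFrame({
    'sensorId': [4000072, 4000072, 4000046, 4000046],
    'date': pd.to_datetime(['2017-03-17 12:52:28', '2017-03-17 12:52:32',
                            '2017-03-17 12:47:06', '2017-03-17 12:49:34']),
    'rssi': [430, 394, 364, 423],
})

df = df.sort_values(['sensorId', 'date'])
# time since the previous reading of the same sensor (NaT for each group's first row)
df['gap'] = df.groupby('sensorId')['date'].diff()
# "entered": strong signal within 5 seconds of the previous reading of that sensor;
# NaT gaps compare False, so each sensor's first reading never qualifies
df['entered'] = (df['gap'] <= pd.Timedelta(seconds=5)) & (df['rssi'] > 380)
print(df)
```

Grouping by sensorId before the diff is also what reset_index() enabled in the UPDATE: once the group keys are ordinary columns again, 'date' can be selected and diffed without MultiIndex lookups getting in the way.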

Time Delta of Year Excluding Certain Days

I am making a heat map that has Company Name on the x axis, months on the y-axis, and shaded regions as the number of calls.
I am taking a slice of data from a database for the past year in order to create the heat map. However, this means that if you hover over the current month, say for example today is July 13, you will get the calls of July 1-13 of this year, and the calls of July 13-31 from last year added together. In the current month, I only want to show calls from July 1-13.
#This section selects the last year of data
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
#Only retrieve data before now (ignore typos that are future dates)
mask = df['recvd_dttm'] <= datetime.datetime.now()
df = df.loc[mask]
# get first and last datetime for final year of data
range_max = df['recvd_dttm'].max()
range_min = range_max - datetime.timedelta(days=365)
# take slice with final year of data
df = df[(df['recvd_dttm'] >= range_min) &
        (df['recvd_dttm'] <= range_max)]
You can use the pd.tseries.offsets.MonthEnd() to achieve your goal here.
import pandas as pd
import numpy as np
import datetime as dt
np.random.seed(0)
val = np.random.randn(600)
date_rng = pd.date_range('2014-01-01', periods=600, freq='D')
df = pd.DataFrame(dict(dates=date_rng,col=val))
print(df)
col dates
0 1.7641 2014-01-01
1 0.4002 2014-01-02
2 0.9787 2014-01-03
3 2.2409 2014-01-04
4 1.8676 2014-01-05
5 -0.9773 2014-01-06
6 0.9501 2014-01-07
7 -0.1514 2014-01-08
8 -0.1032 2014-01-09
9 0.4106 2014-01-10
.. ... ...
590 0.5433 2015-08-14
591 0.4390 2015-08-15
592 -0.2195 2015-08-16
593 -1.0840 2015-08-17
594 0.3518 2015-08-18
595 0.3792 2015-08-19
596 -0.4700 2015-08-20
597 -0.2167 2015-08-21
598 -0.9302 2015-08-22
599 -0.1786 2015-08-23
[600 rows x 2 columns]
print(df.dates.dtype)
datetime64[ns]
datetime_now = dt.datetime.now()
datetime_now_month_end = datetime_now + pd.tseries.offsets.MonthEnd(1)
print(datetime_now_month_end)
2015-07-31 03:19:18.292739
datetime_start = datetime_now_month_end - pd.tseries.offsets.DateOffset(years=1)
print(datetime_start)
2014-07-31 03:19:18.292739
print(df[(df.dates > datetime_start) & (df.dates < datetime_now)])
col dates
212 0.7863 2014-08-01
213 -0.4664 2014-08-02
214 -0.9444 2014-08-03
215 -0.4100 2014-08-04
216 -0.0170 2014-08-05
217 0.3792 2014-08-06
218 2.2593 2014-08-07
219 -0.0423 2014-08-08
220 -0.9559 2014-08-09
221 -0.3460 2014-08-10
.. ... ...
550 0.1639 2015-07-05
551 0.0963 2015-07-06
552 0.9425 2015-07-07
553 -0.2676 2015-07-08
554 -0.6780 2015-07-09
555 1.2978 2015-07-10
556 -2.3642 2015-07-11
557 0.0203 2015-07-12
558 -1.3479 2015-07-13
559 -0.7616 2015-07-14
[348 rows x 2 columns]
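The MonthEnd trick can be checked deterministically by pinning "now" to a fixed timestamp (invented here; in practice you would use pd.Timestamp.now()):

```python
import pandas as pd

# fixed "now" so the example is deterministic
now = pd.Timestamp('2015-07-13 12:00:00')
dates = pd.Series(pd.date_range('2014-07-01', '2015-07-13', freq='D'))

month_end = now + pd.tseries.offsets.MonthEnd(1)   # end of the current month
start = month_end - pd.DateOffset(years=1)          # same month end, one year back
window = dates[(dates > start) & (dates <= now)]
# July 2014 (the tail of last year's current month) is excluded entirely,
# so the current month is never double-counted in the heat map
print(window.min(), window.max())
```

Anchoring the window at the previous year's month end is what removes the "July 14-31 from last year" overlap described in the question.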
