Getting time difference per unique row items using pandas - python

Can someone please show me how to use pandas to get the time difference per unique Round in the following data (df):
Round Order Date
1 1 2011.02.04 00:20:21
1 2 2011.02.04 00:25:11
1 3 2011.02.04 00:35:10
1 4 2011.02.04 00:47:10
2 1 2011.02.04 00:21:21
2 2 2011.02.04 00:31:11
2 3 2011.02.04 00:41:10
Because of the sequential order in column 'Order', the time difference for each Round is the date value in its last row minus the date value in its first row (row 4 minus row 1 for Round 1). So I want to arrive at this table (time_df):
Round TimeDiff
1 26.39
2 19.39

You can use groupby and take the difference between max and min:
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m.%d %H:%M:%S')
print(df)
Round Order Date
0 1 1 2011-02-04 00:20:21
1 1 2 2011-02-04 00:25:11
2 1 3 2011-02-04 00:35:10
3 1 4 2011-02-04 00:47:10
4 2 1 2011-02-04 00:21:21
5 2 2 2011-02-04 00:31:11
6 2 3 2011-02-04 00:41:10
print(df.groupby('Round')['Date'].apply(lambda x: x.max() - x.min()))
Round
1 00:26:49
2 00:19:49
Name: Date, dtype: timedelta64[ns]

I would do it this way:
In [324]: df
Out[324]:
Round Order Date
0 1 1 2011-02-04 00:20:21
1 1 2 2011-02-04 00:25:11
2 1 3 2011-02-04 00:35:10
3 1 4 2011-02-04 00:47:10
4 2 1 2011-02-04 00:21:21
5 2 2 2011-02-04 00:31:11
6 2 3 2011-02-04 00:41:10
In [325]: grp = df.groupby('Round')
In [327]: grp.Date.max()-grp.Date.min()
Out[327]:
Round
1 00:26:49
2 00:19:49
Name: Date, dtype: timedelta64[ns]
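If you want the result shaped like time_df in the question, with the difference expressed in minutes, you could convert the timedeltas to a numeric column (a sketch; the column name TimeDiff and the rounding are assumptions):
grp = df.groupby('Round')
time_df = ((grp['Date'].max() - grp['Date'].min())
           .dt.total_seconds().div(60).round(2)   # timedelta -> minutes as float
           .rename('TimeDiff')
           .reset_index())
print(time_df)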

Related

Create a column that contains the sum of rows above within group

Here's the dataset I've got. Basically I would like to create a column containing the sum of the values before the date (which means the sum of the values that are above the row) within the same group. So the first row of each group is always supposed to be 0.
group  date        value
1      10/04/2022  2
1      12/04/2022  3
1      17/04/2022  5
1      22/04/2022  1
2      11/04/2022  3
2      15/04/2022  2
2      17/04/2022  4
The column I want would look like this.
Could you give me an idea how to create such a column?
group  date        value  sum
1      10/04/2022  2      0
1      12/04/2022  3      2
1      17/04/2022  5      5
1      22/04/2022  1      10
2      11/04/2022  3      0
2      15/04/2022  2      3
2      17/04/2022  4      5
You can use groupby.transform and call Series.cumsum().shift():
df['sum'] = (df
             # sort the dataframe if needed
             .assign(date=pd.to_datetime(df['date'], dayfirst=True))
             .sort_values(['group', 'date'])
             .groupby('group')['value']
             .transform(lambda col: col.cumsum().shift())
             .fillna(0))
print(df)
group date value sum
0 1 10/04/2022 2 0.0
1 1 12/04/2022 3 2.0
2 1 17/04/2022 5 5.0
3 1 22/04/2022 1 10.0
4 2 11/04/2022 3 0.0
5 2 15/04/2022 2 3.0
6 2 17/04/2022 4 5.0
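An equivalent way to get the sum of the rows above, assuming the frame is already sorted by group and date, is to take the running total and subtract each row's own value (a sketch):
# cumsum() includes the current row, so subtracting the row's own value
# leaves the sum of the rows above; the first row of each group becomes 0
df['sum'] = df.groupby('group')['value'].cumsum() - df['value']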

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know, for each row, how many of the date columns are after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
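One caveat for all of the above: comparing against pd.to_datetime('2016-12-31') assumes the Date columns already hold datetimes (ISO-formatted strings happen to compare correctly against the string '2016-12-31', but not against a Timestamp). If the columns were read in as strings, you could convert them first (a sketch; it assumes every wanted column has 'Date' in its name):
date_cols = df.filter(like='Date').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)   # convert each Date column
df['Count'] = (df[date_cols] > pd.Timestamp('2016-12-31')).sum(axis=1)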

How to bin rows into 10 minutes interval?

Assume this is my sample data:
ID datetime
0 2 2015-01-09 19:05:39
1 1 2015-01-10 20:33:38
2 1 2015-01-10 20:33:38
3 1 2015-01-10 20:45:39
4 1 2015-01-10 20:46:39
5 1 2015-01-10 20:46:59
6 1 2015-01-10 20:50:39
I want to create a new column "BIN" which tells us which 10-minute bin each row belongs to,
i.e. take the minimum datetime as the starting point and count 10-minute intervals from there. In this example the first row happens to be the minimum, but that is not the case in my real data, which is not sorted.
ID datetime bin
0 2 2015-01-09 19:05:39 1
1 1 2015-01-10 20:33:38 2
2 1 2015-01-10 20:33:38 2
3 1 2015-01-10 20:45:39 3
4 1 2015-01-10 20:46:39 3
5 1 2015-01-10 20:46:59 3
6 1 2015-01-10 20:50:39 3
First subtract the minimum datetime to get timedeltas, then floor them to 10-minute values with Series.dt.floor, rank with Series.rank, and finally convert to integers with Series.astype:
df['datetime'] = pd.to_datetime(df['datetime'])
df['bin'] = (df['datetime'].sub(df['datetime'].min())
                           .dt.floor('10Min')
                           .rank(method='dense')
                           .astype(int))
print (df)
ID datetime bin
0 2 2015-01-09 19:05:39 1
1 1 2015-01-10 20:33:38 2
2 1 2015-01-10 20:33:38 2
3 1 2015-01-10 20:45:39 3
4 1 2015-01-10 20:46:39 3
5 1 2015-01-10 20:46:59 3
6 1 2015-01-10 20:50:39 3
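If you also want to see which 10-minute window each row falls into, not just a dense bin number, you could label each row with the start of its window, anchored at the earliest timestamp (a sketch; the column name window_start is an assumption):
start = df['datetime'].min()
# floor the elapsed time to 10-minute steps and add it back onto the anchor
df['window_start'] = start + (df['datetime'] - start).dt.floor('10Min')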
If your dataframe is called df, and the bins you are referring to range from 1 to 6 within each hour (where 1 covers minutes 0-10 and 6 covers minutes 50-60), then you can use the following:
import numpy as np

df['datetime'] = pd.to_datetime(df['datetime'])
# np.ceil works element-wise on a Series (math.ceil does not)
df['bin'] = np.ceil(df['datetime'].dt.minute / 10).astype(int)

Grouping by date range with pandas

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Date is m-d-y
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would be by user_id and by dates within +/- 3 days of each other, so the groupby, summing val, would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Can anyone think of a way this could be done (somewhat) easily? I know there are some problematic aspects of this, for example what to do if the dates string together endlessly, each three days apart, but the exact data I'm using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates)
        .set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).
I've come up with a very ugly solution, but it still works...
df = df.sort_values(['user_id', 'date'])
df['Key'] = (df.sort_values(['user_id', 'date'])
               .groupby('user_id')['date'].diff().dt.days
               .lt(3).ne(True).cumsum())
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
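The same idea, written out a bit more explicitly (a sketch; it assumes 'date' has already been parsed with pd.to_datetime):
df = df.sort_values(['user_id', 'date'])
gap = df.groupby('user_id')['date'].diff().dt.days
# start a new key at each user's first row (NaN gap) and whenever the gap is 3 days or more
df['Key'] = (gap.isna() | gap.ge(3)).cumsum()
out = (df.groupby(['user_id', 'Key'], as_index=False)
         .agg({'val': 'sum', 'date': 'first'}))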

Pandas: count some values in a column

I have a dataframe; here is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits; how can I count the days per month instead?
I think you need to groupby ID, month and day, and aggregate with size:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month, df.used_at.dt.day]).size()
print (df1)
ID                                used_at  used_at
574c4969b017ae6481db9a7c77328bc3  5        1          1
                                  6        1          1
                                  7        1          1
                                  8        1          1
                                           2          1
                                           3          1
                                           4          1
e990fae0f48b7daf52619b5ccbec61bc  5        1          2
                                           2          1
                                  6        1          3
dtype: int64
Or by date - it is the same as by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID                                used_at
574c4969b017ae6481db9a7c77328bc3  2015-05-01    1
                                  2015-06-01    1
                                  2015-07-01    1
                                  2015-08-01    1
                                  2015-08-02    1
                                  2015-08-03    1
                                  2015-08-04    1
e990fae0f48b7daf52619b5ccbec61bc  2015-05-01    2
                                  2015-05-02    1
                                  2015-06-01    3
dtype: int64
Differences between count and size:
size counts NaN values, count does not.
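If the goal is the number of distinct days per ID and month rather than per-day row counts, one option is to group by ID and month and count unique dates (a sketch; it also shows the pd.to_datetime conversion in case used_at was read in as strings):
df['used_at'] = pd.to_datetime(df['used_at'])
days_per_month = (df.groupby(['ID', df['used_at'].dt.to_period('M')])['used_at']
                    .apply(lambda s: s.dt.date.nunique()))
print(days_per_month)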
