Below is a script for a simplified version of the DataFrame in question:
import pandas as pd

plan_dates = pd.DataFrame({'start_date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
                           'end_date':   ['2021-01-03', '2021-01-04', '2021-02-03', '2021-03-04', '2021-03-05']})
plan_dates
start_date end_date
0 2021-01-01 2021-01-03
1 2021-01-02 2021-01-04
2 2021-01-03 2021-02-03
3 2021-01-04 2021-03-04
4 2021-01-05 2021-03-05
I would like to create a new DataFrame which has 2 columns:
date
count of active plans (the number of rows in plan_dates whose start_date/end_date range contains that date)
INTENDED DF:
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
Any help would be greatly appreciated.
First convert both columns to datetimes and add one day to end_date. Then repeat the index with Index.repeat using the difference in days, add counter values with GroupBy.cumcount converted by to_timedelta, and finally count with Series.value_counts, with some data cleaning and conversion to a DataFrame:
# convert to datetimes; shift end_date by one day so the generated range includes it
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date']) + pd.Timedelta(1, unit='d')

# number of days covered by each plan
s = plan_dates['end_date'].sub(plan_dates['start_date']).dt.days

# repeat each row once per covered day
df = plan_dates.loc[plan_dates.index.repeat(s)].copy()

# 0, 1, 2, ... within each original row, used as a day offset
counter = df.groupby(level=0).cumcount()

# add the offsets to start_date, then count how many plans cover each date
df1 = (df['start_date'].add(pd.to_timedelta(counter, unit='d'))
       .value_counts()
       .sort_index()
       .rename_axis('date')
       .reset_index(name='count_active_plans'))
print(df1)
date count_active_plans
0 2021-01-01 1
1 2021-01-02 2
2 2021-01-03 3
3 2021-01-04 3
4 2021-01-05 3
.. ... ...
59 2021-03-01 2
60 2021-03-02 2
61 2021-03-03 2
62 2021-03-04 2
63 2021-03-05 1
[64 rows x 2 columns]
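An alternative sketch with the same result: build the per-day dates explicitly with pd.date_range (which includes both endpoints) and count them with value_counts. This is not the approach above but a plain-Python variant, and it is slower for long ranges because it materializes every date:
import pandas as pd

plan_dates = pd.DataFrame({'start_date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
                           'end_date':   ['2021-01-03', '2021-01-04', '2021-02-03', '2021-03-04', '2021-03-05']})
plan_dates['start_date'] = pd.to_datetime(plan_dates['start_date'])
plan_dates['end_date'] = pd.to_datetime(plan_dates['end_date'])

# one entry per active day of each plan
active = pd.Series([d for s, e in zip(plan_dates['start_date'], plan_dates['end_date'])
                    for d in pd.date_range(s, e)])

df1 = (active.value_counts()
             .sort_index()
             .rename_axis('date')
             .reset_index(name='count_active_plans'))
print(df1.head())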
I'm trying to aggregate a dataframe by both ID and date. Suppose I had a dataframe:
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-03 0 20
2 2000-02-17 0 30
3 2000-01-04 1 40
I would like to aggregate the Price by ID and date (weekly frequency) and get a dataframe like:
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-17 0 30
2 2000-01-04 1 40
I understand it can be achieved by iterating over the IDs and using a Grouper to aggregate the price. Is there a more efficient way that avoids iterating the IDs? Many thanks.
Use Grouper with an aggregate sum. I'm not sure about the right frequency for the Grouper, though, because the intended output doesn't match any standard weekly anchor exactly:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W', key='Publish date'),'ID'], sort=False)['Price']
.sum()
.reset_index())
print(df)
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-09 0 20
2 2000-02-20 0 30
3 2000-01-09 1 40
Or, with weeks anchored on Monday:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W-MON', key='Publish date'),'ID'], sort=False)['Price']
.sum()
.reset_index())
print(df)
Publish date ID Price
0 2000-01-03 0 30
1 2000-02-21 0 30
2 2000-01-10 1 40
Or:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='7D', key='Publish date'),'ID'], sort=False)['Price']
.sum()
.reset_index())
print(df)
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-13 0 30
2 2000-01-02 1 40
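Note that none of the three anchors reproduces the intended output exactly, because the Grouper labels each group with the week-bin edge rather than the first actual publish date. A possible sketch for keeping that first date, assuming pandas >= 0.25 for named aggregation (the helper column first_date is needed because the Grouper consumes 'Publish date' as a grouping key):
df['Publish date'] = pd.to_datetime(df['Publish date'])

# copy the dates so the earliest one in each group survives the groupby
df['first_date'] = df['Publish date']

out = (df.groupby([pd.Grouper(freq='W-MON', key='Publish date'), 'ID'], sort=False)
         .agg(date=('first_date', 'min'), Price=('Price', 'sum'))
         .reset_index(level='ID')
         .reset_index(drop=True)
         [['date', 'ID', 'Price']]
         .rename(columns={'date': 'Publish date'}))
print(out)
  Publish date  ID  Price
0   2000-01-02   0     30
1   2000-02-17   0     30
2   2000-01-04   1     40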
id time_taken
1 2017-06-21 07:36:53
2 2017-06-21 07:32:28
3 2017-06-22 08:55:09
4 2017-06-22 08:04:31
5 2017-06-21 03:38:46
current_time = '2017-06-22 10:08:16'
I want to create df2 containing the rows where the difference between current_time and time_taken is greater than 24 hours, i.e.:
1 2017-06-21 07:36:53
2 2017-06-21 07:32:28
5 2017-06-21 03:38:46
You can convert the Timedelta to total seconds with total_seconds and compare, or compare directly with a Timedelta, then filter by boolean indexing:
current_time = '2017-06-22 10:08:16'
df['time_taken'] = pd.to_datetime(df['time_taken'])
df = df[(pd.to_datetime(current_time) - df['time_taken']).dt.total_seconds() > 60 * 60 * 24]
print(df)
id time_taken
0 1 2017-06-21 07:36:53
1 2 2017-06-21 07:32:28
4 5 2017-06-21 03:38:46
Or:
df = df[(pd.to_datetime(current_time) - df['time_taken']) > pd.Timedelta(24, unit='h')]
print(df)
id time_taken
0 1 2017-06-21 07:36:53
1 2 2017-06-21 07:32:28
4 5 2017-06-21 03:38:46
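If current_time should be the actual present moment rather than a hardcoded string, pd.Timestamp.now() can be substituted (a small variation, not part of the original question):
df2 = df[pd.Timestamp.now() - df['time_taken'] > pd.Timedelta(hours=24)]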
I have a df like this:
Client Status Dat_Start Dat_End
1 A 2015-01-01 2015-01-19
1 B 2016-01-01 2016-02-02
1 A 2015-02-12 2015-02-20
1 B 2016-01-30 2016-03-01
I'd like to get the average difference between the two dates (Dat_End and Dat_Start) for Status='A', grouping by the Client column, using Pandas syntax.
So it would be something like this SQL:
SELECT Client, AVG(Dat_End - Dat_Start) AS Date_Diff
FROM Table
WHERE Status = 'A'
GROUP BY Client
Thanks!
Calculate the timedeltas (converting both columns with pd.to_datetime first if they are strings):
df['duration'] = df.Dat_End - df.Dat_Start
df
Out[92]:
Client Status Dat_Start Dat_End duration
0 1 A 2015-01-01 2015-01-19 18 days
1 1 B 2016-01-01 2016-02-02 32 days
2 1 A 2015-02-12 2015-02-20 8 days
3 1 B 2016-01-30 2016-03-01 31 days
For pandas < 0.20, filter and aggregate sum and count:
df[df.Status=='A'].groupby('Client').duration.agg(['sum', 'count'])
Out[98]:
sum count
Client
1 26 days 2
In pandas 0.20, mean was added to groupby for timedeltas, so this will work:
df[df.Status=='A'].groupby('Client').duration.mean()
In [10]: df.loc[df.Status == 'A'].groupby('Client') \
.apply(lambda x: (x.Dat_End-x.Dat_Start).mean()).reset_index()
Out[10]:
Client 0
0 1 13 days
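Putting it together as a self-contained sketch that mirrors the SQL, assuming pandas 0.20+ so that mean works on the timedelta groupby (the pd.to_datetime calls are only needed if the columns are strings):
import pandas as pd

df = pd.DataFrame({'Client': [1, 1, 1, 1],
                   'Status': ['A', 'B', 'A', 'B'],
                   'Dat_Start': ['2015-01-01', '2016-01-01', '2015-02-12', '2016-01-30'],
                   'Dat_End': ['2015-01-19', '2016-02-02', '2015-02-20', '2016-03-01']})
df['Dat_Start'] = pd.to_datetime(df['Dat_Start'])
df['Dat_End'] = pd.to_datetime(df['Dat_End'])

# WHERE Status = 'A' ... GROUP BY Client ... AVG(Dat_End - Dat_Start)
out = (df.loc[df.Status == 'A']
         .assign(Date_Diff=lambda d: d.Dat_End - d.Dat_Start)
         .groupby('Client', as_index=False)['Date_Diff'].mean())
print(out)
   Client Date_Diff
0       1   13 days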
I have a dataframe; here is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits; how can I count the days per month instead?
I think you need to group by ID, month, and day, and aggregate with size:
df['used_at'] = pd.to_datetime(df['used_at'])
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month, df.used_at.dt.day]).size()
print(df1)
ID used_at used_at
574c4969b017ae6481db9a7c77328bc3 5 1 1
6 1 1
7 1 1
8 1 1
2 1
3 1
4 1
e990fae0f48b7daf52619b5ccbec61bc 5 1 2
2 1
6 1 3
dtype: int64
Or by date, which is the same as by year, month, and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print(df1)
ID used_at
574c4969b017ae6481db9a7c77328bc3 2015-05-01 1
2015-06-01 1
2015-07-01 1
2015-08-01 1
2015-08-02 1
2015-08-03 1
2015-08-04 1
e990fae0f48b7daf52619b5ccbec61bc 2015-05-01 2
2015-05-02 1
2015-06-01 3
dtype: int64
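Note the question asks for the number of distinct days per month, while both snippets above count rows per day. A possible sketch counting unique dates per ID and month with nunique (to_period('M') as the month label is one choice among several):
df['used_at'] = pd.to_datetime(df['used_at'])
days_per_month = (df.groupby(['ID', df['used_at'].dt.to_period('M')])['used_at']
                    .apply(lambda s: s.dt.date.nunique()))
print(days_per_month)
ID                                used_at
574c4969b017ae6481db9a7c77328bc3  2015-05    1
                                  2015-06    1
                                  2015-07    1
                                  2015-08    4
e990fae0f48b7daf52619b5ccbec61bc  2015-05    2
                                  2015-06    1
Name: used_at, dtype: int64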
Difference between count and size: size counts NaN values, count does not.
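A small illustration of that difference on a toy frame:
demo = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, None, 2.0]})
print(demo.groupby('g')['v'].size())   # a: 2, b: 1 - NaN included
print(demo.groupby('g')['v'].count())  # a: 1, b: 1 - NaN excluded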