Aggregation based on previous month from event date - Python

I'm stuck on a problem; it would be great if you could help me :)
I created a DataFrame with pandas that looks like this:
HostName  Date
A         2021-01-01 12:30
B         2021-01-01 12:42
B         2021-02-01 12:30
A         2021-02-01 12:40
A         2021-02-25 12:40
A         2021-03-01 12:41
A         2021-03-01 12:42
I tried to aggregate based on the previous month, but it's not working.
The end result should look like this:
HostName  Date              previous month
A         2021-01-01 12:30  NaN
B         2021-01-01 12:42  NaN
B         2021-02-01 12:30  1
A         2021-02-01 12:40  NaN
A         2021-02-25 12:40  1
A         2021-03-01 12:41  2
A         2021-03-01 12:42  3
For every row, Date should look one month back and aggregate the number of HostName occurrences found.
For example, row number 6 counts HostName A from 2021-02-01 12:41 to 2021-03-01 12:41.
What I tried and failed with:
Extract the previous month:
df['Date Before'] = df['Date'] - pd.DateOffset(months=1)
and aggregate within this month:
df.resample('M', on='Date').HostName.count()
df.groupby('HostName').resample('M', on='Date Before').HostName.count()
Please help me, many thanks!

Use shift to look back n rows in a DataFrame column. df is the group-by result.
import numpy as np
import pandas as pd
from io import StringIO

data1 = """HostName\tDate
A\t2021-01-01 12:30
B\t2021-01-01 12:42
B\t2021-02-01 12:30
A\t2021-02-01 12:40
A\t2021-02-25 12:40
A\t2021-03-01 12:41
A\t2021-03-01 12:42"""
df = pd.read_table(StringIO(data1), sep='\t')
df['Date'] = pd.to_datetime(df['Date'])

grouped = df.groupby('HostName')['Date']

def previous_date(group):
    # previous event time within the same HostName, in time order
    return group.sort_values().shift(1)

df['Previous Date'] = grouped.apply(previous_date)
# gap to the previous event, in whole days
df['Previous Count'] = (df['Date'] - df['Previous Date']).dt.days
print(df.sort_values(by=["HostName", "Date"]))
# 1 when a previous event exists and is at least a day older
df['Con'] = np.where(df['Previous Date'].notnull() & (df['Previous Count'] > 0), 1, 0)
print(df.sort_values(by=["HostName", "Date"]))
output:
HostName Date Previous Date Previous Count Con
0 A 2021-01-01 12:30:00 NaT NaN 0
3 A 2021-02-01 12:40:00 2021-01-01 12:30:00 31.0 1
4 A 2021-02-25 12:40:00 2021-02-01 12:40:00 24.0 1
5 A 2021-03-01 12:41:00 2021-02-25 12:40:00 4.0 1
6 A 2021-03-01 12:42:00 2021-03-01 12:41:00 0.0 0
1 B 2021-01-01 12:42:00 NaT NaN 0
2 B 2021-02-01 12:30:00 2021-01-01 12:42:00 30.0 1
Use cumsum to create a running total by HostName, for example:
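A minimal sketch of that running total, assuming the df and Con columns built above (the Running Total name is illustrative):
df = df.sort_values(['HostName', 'Date'])
# cumulative sum of the 0/1 Con flag within each HostName
df['Running Total'] = df.groupby('HostName')['Con'].cumsum()
print(df[['HostName', 'Date', 'Con', 'Running Total']])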

I found a solution.
Original:
HostName Date
0 A 2021-01-01 12:30:00
1 B 2021-01-01 12:42:00
2 B 2021-02-01 12:30:00
3 A 2021-02-01 12:40:00
4 A 2021-02-25 12:40:00
5 A 2021-03-01 12:41:00
6 A 2021-03-01 12:42:00
Get the month before:
df['Month Before'] = df['Date'] - pd.DateOffset(months=1)
Order the DataFrame:
df = df.sort_values(['HostName','Date'])
Shift by HostName:
df['prev_value'] = df.groupby('HostName')['Date'].shift()
Check:
df['Con'] = np.where((df['Month Before'] <= df['prev_value']) & (df['prev_value'].notnull()), 1, 0)
And group:
gpc = df.groupby(['HostName','Con'])['HostName']
df['Count Per Host'] = gpc.cumcount()
It looks like this:
HostName Date Month Before prev_value Con Count Per Host
0 A 2021-01-01 12:30:00 2020-12-01 12:30:00 NaT 0 0
3 A 2021-02-01 12:40:00 2021-01-01 12:40:00 2021-01-01 12:30:00 0 0
4 A 2021-02-25 12:40:00 2021-01-25 12:40:00 2021-02-01 12:40:00 1 1
5 A 2021-03-01 12:41:00 2021-02-01 12:41:00 2021-02-25 12:40:00 1 2
6 A 2021-03-01 12:42:00 2021-02-01 12:42:00 2021-03-01 12:41:00 1 3
1 B 2021-01-01 12:42:00 2020-12-01 12:42:00 NaT 0 0
2 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2021-01-01 12:42:00 1 0
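For comparison, a sketch of a time-window variant: pandas can count events in a trailing window directly. Note this approximates "previous month" as a fixed 30 days and counts every event in the window (current row included) rather than the consecutive-chain logic above, so the numbers differ; the one and Count Last 30D names are illustrative:
df = df.sort_values(['HostName', 'Date']).assign(one=1)
# sum of ones per HostName over the trailing 30 days, current row included
counts = df.groupby('HostName').rolling('30D', on='Date')['one'].sum()
# drop the HostName level so the result aligns with df's original index
df['Count Last 30D'] = counts.droplevel(0).astype(int)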

Related

For each date - is it between any of the provided date bounds?

Data:
df:
ts_code
2018-01-01 A
2018-02-07 A
2018-03-11 A
2022-07-08 A
df_cal:
start_date end_date
2018-02-07 2018-03-12
2018-10-22 2018-11-16
2019-01-07 2019-03-08
2019-03-11 2019-04-22
2019-05-24 2019-07-02
2019-08-06 2019-09-09
2019-10-09 2019-11-05
2019-11-29 2020-01-14
2020-02-03 2020-02-21
2020-02-28 2020-03-05
2020-03-19 2020-04-28
2020-05-06 2020-07-13
2020-07-24 2020-08-31
2020-11-02 2021-01-13
2020-09-11 2020-10-13
2021-01-29 2021-02-18
2021-03-09 2021-04-30
2021-05-06 2021-07-22
2021-07-28 2021-09-14
2021-10-12 2021-12-13
2022-04-27 2022-06-30
Expected result:
ts_code col
2018-01-01 A 0
2018-02-07 A 1
2018-03-11 A 1
2022-07-08 A 0
Goal:
I want to assign values to a new column col: to 1 if df.index is between any of df_cal date ranges, and to 0 otherwise.
Reference:
I referred to this post, but it only works for one condition, while I have many date ranges. And I don't want to use a DataFrame join to achieve it, because that would break the index order.
You can check with numpy broadcasting (df1 is df_cal and df2 is df from the question):
df2['new'] = np.any((df1.end_date.values >= df2.index.values[:, None]) &
                    (df1.start_date.values <= df2.index.values[:, None]), axis=1).astype(int)
df2
Out[55]:
ts_code col new
2018-01-01 A 0 0
2018-02-07 A 1 1
2018-03-11 A 1 1
2022-07-08 A 0 0
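An alternative sketch that avoids materializing the full N-by-M boolean matrix at once: build an IntervalIndex over df_cal and test each timestamp. It assumes both frames hold datetime values; closed='both' mirrors the >= / <= comparisons above, and the iv name is illustrative:
iv = pd.IntervalIndex.from_arrays(pd.to_datetime(df_cal['start_date']),
                                  pd.to_datetime(df_cal['end_date']),
                                  closed='both')
# contains() checks every interval elementwise; any() asks "inside at least one range?"
df['col'] = [int(iv.contains(ts).any()) for ts in pd.to_datetime(df.index)]
This keeps the index order intact, at the cost of a Python-level loop over the rows.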

All month ends until the end date

I have a df with two columns:
index start_date end_date
0 2000-01-03 2000-01-20
1 2000-01-04 2000-01-31
2 2000-01-05 2000-02-02
3 2000-01-05 2000-02-17
...
5100 2020-12-29 2021-01-11
5111 2020-12-30 2021-03-15
I would like to add columns with all the month-end dates between the start and end date, so that if the end_date falls in the middle of a month, the end of that month is still taken into account.
So, my df would look like this:
index start_date end_date first_monthend second_monthend third_monthend fourth_monthend
0 2000-01-03 2000-01-20 2000-01-31 0 0 0
1 2000-01-04 2000-01-31 2000-01-31 0 0 0
2 2000-01-05 2000-02-02 2000-01-31 2000-02-28 0 0
3 2000-01-05 2000-02-17 2000-01-31 2000-02-28 0 0
... ... ... ... ... ...
5100 2020-12-29 2021-02-11 2020-12-31 2021-01-31 2021-02-28 0
5111 2020-12-30 2021-03-15 2020-12-31 2021-01-31 2021-02-28 2021-03-31
I would be very grateful if you could help me.
If you need to parse the months between the start and end datetimes and add the last day of each month, use a custom function with period_range:
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
def f(x):
    r = pd.period_range(x['start_date'],
                        x['end_date'], freq='m').to_timestamp(how='end').normalize()
    return pd.Series(r)
df = df.join(df.apply(f, axis=1).fillna(0).add_suffix('_monthend'))
print (df)
start_date end_date 0_monthend 1_monthend \
0 2000-01-03 2000-01-20 2000-01-31 0
1 2000-01-04 2000-01-31 2000-01-31 0
2 2000-01-05 2000-02-02 2000-01-31 2000-02-29 00:00:00
3 2000-01-05 2000-02-17 2000-01-31 2000-02-29 00:00:00
5100 2020-12-29 2021-01-11 2020-12-31 2021-01-31 00:00:00
5111 2020-12-30 2021-03-15 2020-12-31 2021-01-31 00:00:00
2_monthend 3_monthend
0 0 0
1 0 0
2 0 0
3 0 0
5100 0 0
5111 2021-02-28 00:00:00 2021-03-31 00:00:00
If you don't need missing values replaced by 0:
df = df.join(df.apply(f, axis=1).add_suffix('_monthend'))
print (df)
start_date end_date 0_monthend 1_monthend 2_monthend 3_monthend
0 2000-01-03 2000-01-20 2000-01-31 NaT NaT NaT
1 2000-01-04 2000-01-31 2000-01-31 NaT NaT NaT
2 2000-01-05 2000-02-02 2000-01-31 2000-02-29 NaT NaT
3 2000-01-05 2000-02-17 2000-01-31 2000-02-29 NaT NaT
5100 2020-12-29 2021-01-11 2020-12-31 2021-01-31 NaT NaT
5111 2020-12-30 2021-03-15 2020-12-31 2021-01-31 2021-02-28 2021-03-31
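A similar sketch with pd.date_range instead of period_range; pd.offsets.MonthEnd(0) rolls end_date forward to the last day of its month (a no-op if it already is one), which handles the "end date in the middle of a month" requirement. The month_ends name is illustrative:
def month_ends(x):
    # freq='M' yields the month-end timestamps inside the (extended) range
    r = pd.date_range(x['start_date'],
                      x['end_date'] + pd.offsets.MonthEnd(0),
                      freq='M')
    return pd.Series(r)

df = df.join(df.apply(month_ends, axis=1).add_suffix('_monthend'))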

Count how many different users there are on a specific day

I created a DataFrame with pandas that looks like this:
HostName  Date
B         2021-01-01 12:30
A         2021-01-01 12:45
C         2021-01-01 12:46
A         2021-02-01 12:42
B         2021-02-01 12:43
A         2021-02-01 12:45
B         2021-02-25 12:46
C         2021-03-01 12:41
A         2021-03-01 12:42
A         2021-03-01 12:43
C         2021-03-01 12:45
For every day, it should count how many different HostNames there are from the beginning of the day (example: 2021-01-01 00:00) up to the specific row.
Example:
Let's take 2021-01-01:
HostName  Date
B         2021-01-01 12:30
A         2021-01-01 12:45
C         2021-01-01 12:46
There are three rows:
The first result would be 1, because it was the first row of the day (B).
The second result would be 2, because from the beginning of the day until this line there are two different HostNames (B, A).
The third result would be 3, because from the beginning of the day until this line there are three different HostNames (B, A, C).
The end result should look like this:
HostName  Date              Result
B         2021-01-01 12:30  1
A         2021-01-01 12:45  2
C         2021-01-01 12:46  3
A         2021-02-01 12:42  1
B         2021-02-01 12:43  2
A         2021-02-01 12:45  2
B         2021-02-25 12:46  1
C         2021-03-01 12:41  1
A         2021-03-01 12:42  2
A         2021-03-01 12:43  2
C         2021-03-01 12:45  2
What I tried but failed with:
df.groupby(['HostName', 'Date'])['HostName'].cumcount() + 1
or
def f(x):
    one = x['HostName'].to_numpy()
    twe = x['Date'].to_numpy()
    both = x[['HostName', 'Date']].shift(1).to_numpy()
    x['Host_1D_CumCount_Conn'] = [np.sum((one == a) & (twe == b)) for a, b in both]
    return x

df.groupby('HostName').apply(f)
Use GroupBy.transform with a lambda function combining Series.duplicated and a cumulative sum:
df['Result'] = (df.groupby(df['Date'].dt.date)['HostName']
                  .transform(lambda x: (~x.duplicated()).cumsum()))
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
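To see why (~x.duplicated()).cumsum() yields a running count of distinct values, here is a minimal standalone sketch:
import pandas as pd

s = pd.Series(['B', 'A', 'A', 'C'])   # one day's HostNames in time order
first_seen = ~s.duplicated()          # True only at the first occurrence of each value
print(first_seen.cumsum().tolist())   # [1, 2, 2, 3] -- the running distinct count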
An alternative, faster solution: create helper columns d for the date and new for rows not duplicated per d and HostName, then use GroupBy.cumsum:
df['Result'] = (df.assign(d=df['Date'].dt.date,
                          new=lambda x: ~x.duplicated(['d', 'HostName']))
                  .groupby('d')['new']
                  .cumsum())
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
You can groupby the Date and use expanding+nunique. The issue is that, currently, expanding only works with numerical values (I wish we could simply do expanding().nunique()).
Thus we have to cheat a bit and factorize the column to numbers before applying pd.Series.nunique.
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)['HostName']
                  .apply(lambda s: pd.Series(s.factorize()[0])
                                     .expanding().apply(pd.Series.nunique))
                  .astype(int)
                  .values)
output:
HostName Date Result
0 B 2021-01-01 12:30 1
1 A 2021-01-01 12:45 2
2 C 2021-01-01 12:46 3
3 A 2021-02-01 12:42 1
4 B 2021-02-01 12:43 2
5 A 2021-02-01 12:45 2
6 B 2021-02-25 12:46 1
7 C 2021-03-01 12:41 1
8 A 2021-03-01 12:42 2
9 A 2021-03-01 12:43 2
10 C 2021-03-01 12:45 2

Aggregate efficiently between dates

Hello, I have a df that looks like this:
HostName Date
0 B 2021-01-01 12:42:00
1 B 2021-02-01 12:30:00
2 B 2021-02-01 12:40:00
3 B 2021-02-25 12:40:00
4 B 2021-03-01 12:41:00
5 B 2021-03-01 12:42:00
6 B 2021-03-02 12:43:00
7 B 2021-03-03 12:44:00
8 B 2021-04-04 12:44:00
9 B 2021-06-05 12:44:00
10 B 2021-08-06 12:44:00
11 B 2021-09-07 12:44:00
12 A 2021-03-12 12:45:00
13 A 2021-03-13 12:46:00
I want to do an aggregation. This is how I solved the problem, but it's not efficient at all; with 1M rows it would take a long time. Is there a better way to aggregate efficiently between dates?
The end result:
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
TheList = []
for index, row in df.iterrows():
    TheList.append(df[(df['Date'] > (df['Date'].iloc[index] - pd.DateOffset(months=1))) &
                      (df['Date'] <= df['Date'].iloc[index])]
                   .groupby(['HostName']).size()[row[0]])
df['ds'] = TheList
Is there a better way to do it with the same result?
Here broadcasting is used within each group, and the True values are counted with sum inside a custom function passed to GroupBy.transform.
Notice: performance also depends on the length of the groups; with a few very large groups there may be memory problems.
df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    a = x.to_numpy()
    b = x.sub(pd.DateOffset(months=1)).to_numpy()
    # for each row, count the dates falling in (date - 1 month, date]
    return np.sum((a > b[:, None]) & (a <= a[:, None]), axis=1)

df['ds'] = df.groupby('HostName')['Date'].transform(f)
print (df)
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
Unfortunately, loops are needed if memory is a problem:
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date']).sub(pd.DateOffset(months=1))

def f(x):
    one = x['Date'].to_numpy()
    both = x[['Date', 'Date1']].to_numpy()
    x['ds'] = [np.sum((one > b) & (one <= a)) for a, b in both]
    return x

df = df.groupby('HostName').apply(f)
print (df)
HostName Date Date1 ds
0 B 2021-01-01 12:42:00 2020-12-01 12:42:00 1
1 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2
2 B 2021-02-01 12:40:00 2021-01-01 12:40:00 3
3 B 2021-02-25 12:40:00 2021-01-25 12:40:00 3
4 B 2021-03-01 12:41:00 2021-02-01 12:41:00 2
5 B 2021-03-01 12:42:00 2021-02-01 12:42:00 3
6 B 2021-03-02 12:43:00 2021-02-02 12:43:00 4
7 B 2021-03-03 12:44:00 2021-02-03 12:44:00 5
8 B 2021-04-04 12:44:00 2021-03-04 12:44:00 1
9 B 2021-06-05 12:44:00 2021-05-05 12:44:00 1
10 B 2021-08-06 12:44:00 2021-07-06 12:44:00 1
11 B 2021-09-07 12:44:00 2021-08-07 12:44:00 1
12 A 2021-03-12 12:45:00 2021-02-12 12:45:00 1
13 A 2021-03-13 12:46:00 2021-02-13 12:46:00 2
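If the broadcasting version runs out of memory and the loop version is too slow, a sketch of a middle ground: sort each group once and use binary search (np.searchsorted) instead of the n-by-n comparison matrix. It assumes df['Date'] is already datetime64 and reproduces the (Date - 1 month, Date] counting above; the count_last_month name is illustrative:
import numpy as np
import pandas as pd

def count_last_month(x):
    s = x.sort_values()
    a = s.to_numpy()
    # insertion points just past each window's lower and upper bound
    lo = np.searchsorted(a, (s - pd.DateOffset(months=1)).to_numpy(), side='right')
    hi = np.searchsorted(a, a, side='right')
    return pd.Series(hi - lo, index=s.index)

df['ds'] = df.groupby('HostName', group_keys=False)['Date'].apply(count_last_month)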

date and hours from datetime [python]

I have a datetime parameter in a pandas DataFrame, with times that include minutes and seconds.
index date
0 2021-03-01 07:55:00
1 2021-03-01 07:56:13
2 2021-03-01 07:56:43
3 2021-03-01 07:57:19
4 2021-03-01 07:57:57
5 2021-03-01 11:39:25
6 2021-03-01 11:39:59
7 2021-03-01 11:40:53
8 2021-03-01 11:41:44
9 2021-03-01 11:43:31
How can I create a parameter like this (date and hour)?
index date
0 2021-03-01 07:00:00
1 2021-03-01 07:00:00
2 2021-03-01 07:00:00
3 2021-03-01 07:00:00
4 2021-03-01 07:00:00
5 2021-03-01 11:00:00
6 2021-03-01 11:00:00
7 2021-03-01 11:00:00
8 2021-03-01 11:00:00
9 2021-03-01 11:00:00
Update
The data is in fact a pandas DataFrame with the column containing datetime objects. Given that the column is named "date", here's one way to effect the change:
df = df['date'].apply(lambda dt: dt.replace(minute=0, second=0))
print(df)
0 2021-03-01 07:00:00
1 2021-03-01 07:00:00
2 2021-03-01 07:00:00
3 2021-03-01 07:00:00
4 2021-03-01 07:00:00
5 2021-03-01 11:00:00
6 2021-03-01 11:00:00
7 2021-03-01 11:00:00
8 2021-03-01 11:00:00
9 2021-03-01 11:00:00
Name: date, dtype: datetime64[ns]
Original answer follows...
Use datetime.replace() to reset the minute and second in the datetime object:
from datetime import datetime
dt = datetime.strptime('2021-03-01 07:55:00', '%Y-%m-%d %H:%M:%S')
dt = dt.replace(minute=0, second=0)
print(dt)
# 2021-03-01 07:00:00
Your examples do not appear to have resolution smaller than one second; however, if yours does, you could also set microseconds to 0:
dt = dt.replace(minute=0, second=0, microsecond=0)
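For the DataFrame case there is also a vectorized alternative (a sketch, assuming a datetime64 column named 'date'):
df['date'] = df['date'].dt.floor('h')   # truncates minutes, seconds and microseconds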
