Count How Many Different Users Have in specific day - python

I created a dataframe with pandas:
looks like that:
HostName
Date
B
2021-01-01 12:30
A
2021-01-01 12:45
C
2021-01-01 12:46
A
2021-02-01 12:42
B
2021-02-01 12:43
A
2021-02-01 12:45
B
2021-02-25 12:46
C
2021-03-01 12:41
A
2021-03-01 12:42
A
2021-03-01 12:43
C
2021-03-01 12:45
For every day, it should count how many different HostName
there is form the beginning of the day (example: 2021-01-01 00:00) to the specific row
Example:
for example lets take the 2021-01-01
HostName
Date
B
2021-01-01 12:30
A
2021-01-01 12:45
C
2021-01-01 12:46
there is tree rows:
the first result would be 1 - because its was the first row in the day.(B)
the second result would be 2 - because form the beginning of
the day till this line there is two different Hostname (B,A)
the third result would be 3 - because form the beginning of the day till this
line there is tree different Hostname ( B,A,C)
the end result should look like this:
HostName
Date
Result
B
2021-01-01 12:30
1
A
2021-01-01 12:45
2
C
2021-01-01 12:46
3
A
2021-02-01 12:42
1
B
2021-02-01 12:43
2
A
2021-02-01 12:45
2
B
2021-02-25 12:46
1
C
2021-03-01 12:41
1
A
2021-03-01 12:42
2
A
2021-03-01 12:43
2
C
2021-03-01 12:45
2
what it try do to but failed:
df.groupby(['HostName','Date')['HostName'].cumcount() + 1
or
def f(x):
one = x['HostName'].to_numpy()
twe = x['Date'].to_numpy()
both = x[['HostName','Date']].shift(1).to_numpy()
x['Host_1D_CumCount_Conn'] = [np.sum((one == a) & (twe == b)) for a, b in both]
return x
df.groupby('HostName').apply(f)

Use lambda function in GroupBy.transform with lambda function with Series.duplicated and cumulative sum:
df['Result'] = (df.groupby(df['Date'].dt.date)['HostName']
.transform(lambda x: (~x.duplicated()).cumsum()))
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
Alternative solution, faster is create helper columns d for dates and duplicates per d with HostName and use GroupBy.cumsum:
df['Result'] = (df.assign(d = df['Date'].dt.date,
new = lambda x: ~x.duplicated(['d','HostName']))
.groupby('d')['new']
.cumsum())
print (df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2

You can groupby the Date and use expanding+nunique. The issue is that, currently, expanding only works with numerical values (I wish we could simply do expanding().nunique()).
Thus we have to cheat a bit and factorize the column to numbers before applying pd.Series.nunique.
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)
['HostName']
.apply(lambda s: pd.Series(s.factorize()[0]).expanding().apply(pd.Series.nunique))
.astype(int)
.values
)
output:
HostName Date Result
0 B 2021-01-01 12:30 1
1 A 2021-01-01 12:45 2
2 C 2021-01-01 12:46 3
3 A 2021-02-01 12:42 1
4 B 2021-02-01 12:43 2
5 A 2021-02-01 12:45 2
6 B 2021-02-25 12:46 1
7 C 2021-03-01 12:41 1
8 A 2021-03-01 12:42 2
9 A 2021-03-01 12:43 2
10 C 2021-03-01 12:45 2

Related

Aggregation based on previous month from eventdate

I'm Stuck on a problem it would be great if you could help me :)
I created a dataframe with pandas:
looks like that:
HostName
Date
A
2021-01-01 12:30
B
2021-01-01 12:42
B
2021-02-01 12:30
A
2021-02-01 12:40
A
2021-02-25 12:40
A
2021-03-01 12:41
A
2021-03-01 12:42
I try to Aggregat based on the previous month but it's not working.
the end result should look like this:
HostName
Date
previous month
A
2021-01-01 12:30
Nan
B
2021-01-01 12:42
Nan
B
2021-02-01 12:30
1
A
2021-02-01 12:40
Nan
A
2021-02-25 12:40
1
A
2021-03-01 12:41
2
A
2021-03-01 12:42
3
for every row Date should look one-month before and Aggregat the number of Hostnames found.
for example row number 6 count HostName A from 2021-02-01 12:41 to 2021-03-01 12:41
what I try to do and failed:
extract the previous month:
df['Date Before'] = df['Date'] - pd.DateOffset(months=1)
and Aggregate between this month
df.resample('M', on='Date').HostName.count()
df.groupby('HostName').resample('M', on='Date Before').HostName.count()
Please Help Me many thanks!!!
use shift to look back a n rows for a dataframe column. df is the group by results.
data1="""HostName Date
A 2021-01-01 12:30
B 2021-01-01 12:42
B 2021-02-01 12:30
A 2021-02-01 12:40
A 2021-02-25 12:40
A 2021-03-01 12:41
A 2021-03-01 12:42"""
df = pd.read_table(StringIO(data1), sep='\t')
df['Date']=pd.to_datetime(df['Date'])
grouped=df.groupby('HostName')['Date']
def previous_date(group):
return group.sort_values().shift(1)
df['Previous Date']=grouped.apply(previous_date)
df['Previous Count']=df.apply(lambda x: x['Date']-x['Previous Date'],axis=1)
print(df.sort_values(by=["HostName","Date"]))
df['Con'] = np.where( (df['Previous Date'].notnull() & df['Previous Count']>0) , 1, 0)
print(df.sort_values(by=["HostName","Date"]))
output:
HostName Date Previous Date Previous Count Con
0 A 2021-01-01 12:30:00 NaT NaN 0
3 A 2021-02-01 12:40:00 2021-01-01 12:30:00 31.0 1
4 A 2021-02-25 12:40:00 2021-02-01 12:40:00 24.0 1
5 A 2021-03-01 12:41:00 2021-02-25 12:40:00 4.0 1
6 A 2021-03-01 12:42:00 2021-03-01 12:41:00 0.0 0
1 B 2021-01-01 12:42:00 NaT NaN 0
2 B 2021-02-01 12:30:00 2021-01-01 12:42:00 30.0 1
use cumsum to create a running total by hostname
​
i found solution:
original:
HostName Date
0 A 2021-01-01 12:30:00
1 B 2021-01-01 12:42:00
2 B 2021-02-01 12:30:00
3 A 2021-02-01 12:40:00
4 A 2021-02-25 12:40:00
5 A 2021-03-01 12:41:00
6 A 2021-03-01 12:42:00
get month before
df['Month Before'] = df['Date'] - pd.DateOffset(months=1)
order datafarme
df = df.sort_values(['HostName','Date'])
shift by Host
df['prev_value'] = df.groupby('HostName')['Date'].shift()
checking
df['Con'] = np.where((df['Month Before'] <= df['prev_value']) | (df['prev_value'].notnull()) , 1, 0)
and group
gpc = df.groupby(['HostName','Con'])['HostName']
df['Count Per Host'] = gpc.cumcount()
look like that
HostName Date Month Before prev_value Con CountPerHost
0 A 2021-01-01 12:30:00 2020-12-01 12:30:00 NaT 0 0
3 A 2021-02-01 12:40:00 2021-01-01 12:40:00 2021-01-01 12:30:00 0 0
4 A 2021-02-25 12:40:00 2021-01-25 12:40:00 2021-02-01 12:40:00 1 1
5 A 2021-03-01 12:41:00 2021-02-01 12:41:00 2021-02-25 12:40:00 1 2
6 A 2021-03-01 12:42:00 2021-02-01 12:42:00 2021-03-01 12:41:00 1 3
1 B 2021-01-01 12:42:00 2020-12-01 12:42:00 NaT 0 0
2 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2021-01-01 12:42:00 1 0

Aggregate efficiently between dates

Hello I Have a Df Look like that:
HostName Date
0 B 2021-01-01 12:42:00
1 B 2021-02-01 12:30:00
2 B 2021-02-01 12:40:00
3 B 2021-02-25 12:40:00
4 B 2021-03-01 12:41:00
5 B 2021-03-01 12:42:00
6 B 2021-03-02 12:43:00
7 B 2021-03-03 12:44:00
8 B 2021-04-04 12:44:00
9 B 2021-06-05 12:44:00
10 B 2021-08-06 12:44:00
11 B 2021-09-07 12:44:00
12 A 2021-03-12 12:45:00
13 A 2021-03-13 12:46:00
i what do to aggregation this is how I solved the problem but its not efficient at all and if there are 1M rows
it will take a long time
is there a better way to Aggregate efficiently between dates?
end results:
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
TheList = []
for index, row in df.iterrows():
TheList.append((df[(df['Date'] > (df['Date'].iloc[index] - pd.DateOffset(months=1))) & (df['Date'] <= df['Date'].iloc[index])].groupby(['HostName']).size()[row[0]]))
df['ds'] = TheList
is there is a better way to do it but with the same result?
Here is used broadcasting between groups and for count Trues is used sum in custom function in GroupBy.transform:
Notice: Performance depends also by length of groups, if few very big groups here should be problem with memory.
df['Date'] = pd.to_datetime(df['Date'])
def f(x):
a = x.to_numpy()
b = x.sub(pd.DateOffset(months=1)).to_numpy()
return np.sum((a > b[:, None]) & (a <= a[:, None]), axis=1)
df['ds'] = df.groupby('HostName')['Date'].transform(f)
print (df)
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
Unfortunately need loops if memory problems:
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date']).sub(pd.DateOffset(months=1))
def f(x):
one = x['Date'].to_numpy()
both = x[['Date','Date1']].to_numpy()
x['ds'] = [np.sum((one > b) & (one <= a)) for a, b in both]
return x
df = df.groupby('HostName').apply(f)
print (df)
HostName Date Date1 ds
0 B 2021-01-01 12:42:00 2020-12-01 12:42:00 1
1 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2
2 B 2021-02-01 12:40:00 2021-01-01 12:40:00 3
3 B 2021-02-25 12:40:00 2021-01-25 12:40:00 3
4 B 2021-03-01 12:41:00 2021-02-01 12:41:00 2
5 B 2021-03-01 12:42:00 2021-02-01 12:42:00 3
6 B 2021-03-02 12:43:00 2021-02-02 12:43:00 4
7 B 2021-03-03 12:44:00 2021-02-03 12:44:00 5
8 B 2021-04-04 12:44:00 2021-03-04 12:44:00 1
9 B 2021-06-05 12:44:00 2021-05-05 12:44:00 1
10 B 2021-08-06 12:44:00 2021-07-06 12:44:00 1
11 B 2021-09-07 12:44:00 2021-08-07 12:44:00 1
12 A 2021-03-12 12:45:00 2021-02-12 12:45:00 1
13 A 2021-03-13 12:46:00 2021-02-13 12:46:00 2

Time sequence in pandas dataframe

Below is the sample of dataframe (df):-
alpha
value
0
a
5
1
a
8
2
a
4
3
b
2
4
b
1
I know how to make the sequence (numbers) as per the group:
df["serial"] = df.groupby("alpha").cumcount()+1
alpha
value
serial
0
a
5
1
1
a
8
2
2
a
4
3
3
b
2
1
4
b
1
2
But instead of number I need date-time in sequence having 30 mins interval:
Expected result:
alpha
value
serial
0
a
5
2021-01-01 23:30:00
1
a
8
2021-01-02 00:00:00
2
a
4
2021-01-02 00:30:00
3
b
2
2021-01-01 23:30:00
4
b
1
2021-01-02 00:00:00
You can simply multiply your result with a pd.Timedelta:
print ((df.groupby("alpha").cumcount()+1)*pd.Timedelta(minutes=30)+pd.Timestamp("2021-01-01 23:00:00"))
0 2021-01-01 23:30:00
1 2021-01-02 00:00:00
2 2021-01-02 00:30:00
3 2021-01-01 23:30:00
4 2021-01-02 00:00:00
dtype: datetime64[ns]
Try with to_datetime and groupby with cumcount, and then multiplying by pd.Timedelta for 30 minutes:
>>> df['serial'] = pd.to_datetime('2021-01-01 23:30:00') + df.groupby('alpha').cumcount() * pd.Timedelta(minutes=30)
>>> df
alpha value serial
0 a 5 2021-01-01 23:30:00
1 a 8 2021-01-02 00:00:00
2 a 4 2021-01-02 00:30:00
3 b 2 2021-01-01 23:30:00
4 b 1 2021-01-02 00:00:00
>>>

find days between 2 dates in python but only number

I was trying to find difference of a series of dates and a date. for example, the series is
from may1 to june1 which is
date = pd.DataFrame()
In [0]: date['test'] = pd.date_range("2021-05-01", "2021-06-01", freq = "D")
Out[0]: date
test
0 2021-05-01 00:00:00
1 2021-05-02 00:00:00
2 2021-05-03 00:00:00
3 2021-05-04 00:00:00
4 2021-05-05 00:00:00
5 2021-05-06 00:00:00
6 2021-05-07 00:00:00
7 2021-05-08 00:00:00
8 2021-05-09 00:00:00
9 2021-05-10 00:00:00
In[1]
date['test'] = date['test'].dt.date
Out[1]:
test
0 2021-05-01
1 2021-05-02
2 2021-05-03
3 2021-05-04
4 2021-05-05
5 2021-05-06
6 2021-05-07
7 2021-05-08
8 2021-05-09
9 2021-05-10
In[2]:date['base'] = dt.strptime("2021-05-01",'%Y-%m-%d')
Out[2]:
0 2021-05-01 00:00:00
1 2021-05-01 00:00:00
2 2021-05-01 00:00:00
3 2021-05-01 00:00:00
4 2021-05-01 00:00:00
5 2021-05-01 00:00:00
6 2021-05-01 00:00:00
7 2021-05-01 00:00:00
8 2021-05-01 00:00:00
9 2021-05-01 00:00:00
In[3]:date['base'] = date['base'].dt.date
Out[3]:
base
0 2021-05-01
1 2021-05-01
2 2021-05-01
3 2021-05-01
4 2021-05-01
5 2021-05-01
6 2021-05-01
7 2021-05-01
8 2021-05-01
9 2021-05-01
In[4]:date['test']-date['base']
Out[4]:
diff
0 0 days 00:00:00.000000000
1 1 days 00:00:00.000000000
2 2 days 00:00:00.000000000
3 3 days 00:00:00.000000000
4 4 days 00:00:00.000000000
5 5 days 00:00:00.000000000
6 6 days 00:00:00.000000000
7 7 days 00:00:00.000000000
8 8 days 00:00:00.000000000
9 9 days 00:00:00.000000000
10 10 days 00:00:00.000000000
the only thing i could get is this. I don't want anything other than the number 1-10 cuz i need them for further numerical calculation but i can't get rid of those. Also how could i construct a time series which just outputs the date not the hms after it? i don't want to manually .dt.date for all of those and it sometimes mess things up
You don't need to create a column base for this, simply do:
>>> (date['test'] - pd.to_datetime("2021-05-01", format='%Y-%m-%d')).dt.days
0 0
1 1
2 2
3 3
4 4
...
27 27
28 28
29 29
30 30
31 31
Name: test, dtype: int64
You can convert the timestamps first to epoch seconds (they are actually stored internally as some number, and likely a factor of epoch seconds)
Using pandas datetime to unix timestamp seconds
import pandas as pd
# start df with date column
df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
# create a column for datetimes
df["ts"] = (df["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
>>> df
date ts
0 2021-05-01 1619827200
1 2021-05-02 1619913600
2 2021-05-03 1620000000
3 2021-05-04 1620086400
...
31 2021-06-01 1622505600
This will allow you to do integer math before converting back
>>> df["days"] = (df["ts"] - min(df["ts"])) // (60*60*24) # 1 day in seconds
>>> df
date ts days
0 2021-05-01 1619827200 0
1 2021-05-02 1619913600 1
2 2021-05-03 1620000000 2
3 2021-05-04 1620086400 3
...
31 2021-06-01 1622505600 31
Alternatively, with a naive day-based series, you can use the index as the day offset (as that's how the DataFrame was generated)!
>>> import pandas as pd
>>> df = pd.DataFrame({"date": pd.date_range("2021-05-01", "2021-06-01", freq = "D")})
>>> df["days"] = df.index
>>> df
date days
0 2021-05-01 0
1 2021-05-02 1
2 2021-05-03 2
3 2021-05-04 3
...
31 2021-06-01 31

How to copy paste values from another dataset conditional on a column

I have df1
Id Data Group_Id
0 1 A 1
1 2 B 2
2 3 B 3
...
100 4 A 101
101 5 A 102
...
and df2
Timestamp Group_Id
2012-01-01 00:00:05.523 1
2013-07-01 00:00:10.757 2
2014-01-12 00:00:15.507. 3
...
2016-03-05 00:00:05.743 101
2017-12-24 00:00:10.407 102
...
I want to match the 2 datasets by Group_Id, then copy only date from Timestamp in df2 and paste to a new column in df1 based on corresponding Group_Id, name the column day1.
Then I want to add 6 more columns next to day1, name them day2, ..., day7 with the next six days based on day1. So it looks like:
Id Data Group_Id day1 day2 day3 ... day7
0 1 A 1 2012-01-01 2012-01-02 2012-01-03 ...
1 2 B 2 2013-07-01 2013-07-02 2013-07-03 ...
2 3 B 3 2014-01-12 2014-01-13 2014-01-14 ...
...
100 4 A 101 2016-03-05 2016-03-06 2016-03-07 ...
101 5 A 102 2017-12-24 2017-12-25 2017-12-26 ...
...
Thanks.
First we need merge here
df1=df1.merge(df2,how='left')
s=pd.DataFrame([pd.date_range(x,periods=6,freq ='D') for x in df1.Timestamp],index=df1.index)
s.columns+=1
df1.join(s.add_prefix('Day'))
another approach here, basically just merges the dfs, grabs the date from the timestamp and makes 6 new columns adding a day each time:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df3 = df1.merge(df2, on='Group_Id')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp']) #only necessary if not already timestamp
df3['day1'] = df3['Timestamp'].dt.date
for i in (range(1,7)):
df3['day'+str(i+1)] = df3['day1'] + pd.Timedelta(i,unit='d')
output:
Id Data Group_Id Timestamp day1 day2 day3 day4 day5 day6 day7
0 1 A 1 2012-01-01 00:00:05.523 2012-01-01 2012-01-02 2012-01-03 2012-01-04 2012-01-05 2012-01-06 2012-01-07
1 2 B 2 2013-07-01 00:00:10.757 2013-07-01 2013-07-02 2013-07-03 2013-07-04 2013-07-05 2013-07-06 2013-07-07
2 3 B 3 2014-01-12 00:00:15.507 2014-01-12 2014-01-13 2014-01-14 2014-01-15 2014-01-16 2014-01-17 2014-01-18
3 4 A 101 2016-03-05 00:00:05.743 2016-03-05 2016-03-06 2016-03-07 2016-03-08 2016-03-09 2016-03-10 2016-03-11
4 5 A 102 2017-12-24 00:00:10.407 2017-12-24 2017-12-25 2017-12-26 2017-12-27 2017-12-28 2017-12-29 2017-12-30
note that I copied your data frame into a csv and only had the 5 entires so the index is not the same as your example (i.e. 100, 101)
you can delete the timestamp col if not needed

Categories

Resources