df['check'] = ((df['id'] == 123) & (df['date1'] >= date1)) | ((df['id'] == 456) & (df['date2'] >= date2))
present = df.groupby(['id', 'month', 'check'])['userid'].nunique().reset_index(name="usercount")
This is my code. The expected output should have the number of unique users per month in the usercount column, grouped by id. I used id, month and check in the groupby.
The check column is of type bool, based on the first line of my code, but in the output of the present dataframe, users are counted whether their check value is True or False.
It should only count the users whose check value is True.
Please help me out with this.
You need to filter by the check column with boolean indexing, not pass it to the by parameter of groupby:
#first convert datetimes to start of months
df['month'] = df['month'].dt.floor('d') - pd.offsets.MonthBegin(1)
print (df)
check month id userid
0 True 2019-06-01 123 a
1 False 2019-02-01 123 b
2 False 2019-01-01 123 c
3 False 2019-02-01 123 d
4 True 2019-06-01 123 e
5 True 2020-07-01 123 f
6 True 2020-07-01 123 g
7 True 2020-06-01 123 h
print (df[df['check']])
check month id userid
0 True 2019-06-01 123 a
4 True 2019-06-01 123 e
5 True 2020-07-01 123 f
6 True 2020-07-01 123 g
7 True 2020-06-01 123 h
present = (df[df['check']].groupby(['id', 'month'])['userid']
                          .nunique()
                          .reset_index(name="usercount"))
print (present)
id month usercount
0 123 2019-06-01 2
1 123 2020-06-01 1
2 123 2020-07-01 2
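If you do want to keep check in the groupby as in the question, an equivalent sketch (not part of the original answer) is to filter the grouped result afterwards and drop the helper column:
present = df.groupby(['id', 'month', 'check'])['userid'].nunique().reset_index(name="usercount")
#keep only the rows counted for check == True, then drop the helper column
present = present[present['check']].drop(columns='check')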
I want to filter for AccountIDs that have transaction data for at least 3 months. This is just a small fraction of the entire dataset.
Here is what I did, but I am not sure it is right.
data = data.groupby('AccountID').apply(lambda x: x['TransactionDate'].nunique() >= 3)
I get a Series of boolean values as output, but I want a pandas DataFrame.
TransactionDate AccountID TransactionAmount
0 2020-12-01 8 400000.0
1 2020-12-01 22 25000.0
2 2020-12-02 22 551500.0
3 2020-01-01 17 116.0
4 2020-01-01 24 2000.0
5 2020-01-02 68 6000.0
6 2020-03-03 20 180000.0
7 2020-03-01 66 34000.0
8 2020-02-01 66 20000.0
9 2020-02-01 66 40600.0
The output I get:
AccountID
1 True
2 True
3 True
4 True
5 True
You are close, but you need GroupBy.transform to repeat the aggregated values across a Series the same length as the original df, so that filtering with boolean indexing is possible:
data = data[data.groupby('AccountID')['TransactionDate'].transform('nunique') >= 3]
If some dates fall on different days within the same month, first use Series.dt.to_period to build a helper column of month periods, so unique months are counted rather than unique dates:
s = data.assign(new = data['TransactionDate'].dt.to_period('m')).groupby('AccountID')['new'].transform('nunique')
data = data[s >= 3]
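For comparison, a small sketch of the difference between the apply from the question and transform, using the same data frame: apply returns one boolean per group, while transform repeats the per-group result for every row, which is what makes boolean indexing possible.
per_group = data.groupby('AccountID')['TransactionDate'].nunique()            #one value per AccountID
per_row = data.groupby('AccountID')['TransactionDate'].transform('nunique')   #one value per row of data
print (len(per_group))   #number of distinct AccountIDs
print (len(per_row))     #same length as data, so (per_row >= 3) can index data directly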
How does one return all observations from a dataframe that hold only the max value for each unique key? I tried groupby and max, but because the flag values differ I get more than one value back per key. See the example below:
import pandas as pd
key = [111,111,222,333,444,555]
flag = [0,1,1,1,1,1]
date = [pd.to_datetime('2020-01-01'),pd.to_datetime('2020-01-02'),
pd.to_datetime('2020-02-01'),pd.to_datetime('2020-04-01'),
pd.to_datetime('2020-03-01'),pd.to_datetime('2020-05-01')]
df_dic = {'key':key, 'flag':flag, 'date':date}
df = pd.DataFrame(df_dic)
df
df.groupby(['key', 'flag']).agg({'date':'max'}).reset_index()
This returns:
key flag date
0 111 0 2020-01-01
1 111 1 2020-01-02
2 222 1 2020-02-01
3 333 1 2020-04-01
I want to return only the observations with the max date per unique key like so:
key flag date
0 111 1 2020-01-02
1 222 1 2020-02-01
2 333 1 2020-04-01
You can avoid groupby and use sort_values and drop_duplicates:
print(df.sort_values('date').drop_duplicates('key', keep='last'))
key flag date
1 111 1 2020-01-02
2 222 1 2020-02-01
4 444 1 2020-03-01
3 333 1 2020-04-01
5 555 1 2020-05-01
Note that you can add the key column to sort_values if the order matters to you in the result: df.sort_values(['key', 'date'])...
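For completeness, a sketch of that ordered variant with the same df as above:
print(df.sort_values(['key', 'date']).drop_duplicates('key', keep='last'))
key flag date
1 111 1 2020-01-02
2 222 1 2020-02-01
3 333 1 2020-04-01
4 444 1 2020-03-01
5 555 1 2020-05-01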
One possible solution (although maybe not the best one) is to use an inner join after the groupby. This ensures that the flags are taken from the original dataframe, whether they are 0 or 1.
df.groupby(['key'])\
  .agg({'date': 'max'})\
  .reset_index()\
  .merge(df, how='inner', on=['key', 'date'])
# key date flag
# 0 111 2020-01-02 1
# 1 222 2020-02-01 1
# 2 333 2020-04-01 1
# 3 444 2020-03-01 1
# 4 555 2020-05-01 1
I have a df that looks like this:
And I'm trying to turn it into this:
The following code gets me a list of lists that I can convert to a df, and it includes the first 3 columns of the expected output, but I'm not sure how to get the number columns I need (note: I have way more than 3 number columns, but I'm using this as a simple illustration).
x=[['ID','Start','End','Number1','Number2','Number3']]
for i in range(len(df)):
    if not(df.iloc[i-1]['DateSpellIndicator']):
        ID = df.iloc[i]['ID']
        start = df.iloc[i]['Date']
    if not(df.iloc[i]['DateSpellIndicator']):
        newrow = [ID, start, df.iloc[i]['Date'],...]
        x.append(newrow)
Here's one way to do it by making use of pandas groupby.
Input Dataframe:
ID DATE NUM TORF
0 1 2020-01-01 40 True
1 1 2020-02-01 50 True
2 1 2020-03-01 60 False
3 1 2020-06-01 70 True
4 2 2020-07-01 20 True
5 2 2020-08-01 30 False
Output Dataframe:
END ID Number1 Number2 Number3 START
0 2020-08-01 2 20 30.0 NaN 2020-07-01
1 2020-06-01 1 70 NaN NaN 2020-06-01
2 2020-03-01 1 40 50.0 60.0 2020-01-01
Code:
new_df = pd.DataFrame()
#create groups based on ID
for index, row in df.groupby('ID'):
    #within each group split at the occurrence of False
    dfnew = np.split(row, np.where(row.TORF == False)[0] + 1)
    for sub_df in dfnew:
        #within each subgroup
        if sub_df.empty == False:
            dfmod = pd.DataFrame({'ID': sub_df['ID'].iloc[0],
                                  'START': sub_df['DATE'].iloc[0],
                                  'END': sub_df['DATE'].iloc[-1]}, index=[0])
            j = 0
            for nindex, srow in sub_df.iterrows():
                dfmod['Number{}'.format(j+1)] = srow['NUM']
                j = j + 1
            #concatenate the existing and modified dataframes
            new_df = pd.concat([dfmod, new_df], axis=0)
new_df.reset_index(drop=True)
Some of the steps could be reduced to get the same output.
I used cumsum to get the first and last date, and list to collect the columns the way you want. Please note the output has different column names than your example; I assume you can rename them as needed.
df['new1'] = ~df['datespell']
df['new2'] = df['new1'].cumsum()-df['new1']
check = df.groupby(['id', 'new2']).agg({'date': {'start': 'first', 'end': 'last'}, 'number': {'cols': lambda x: list(x)}})
check.columns = check.columns.droplevel(0)
check.reset_index(inplace=True)
pd.concat([check,check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)
id new2 start end 0 1 2
0 1 0 2020-01-01 2020-03-01 40.0 50.0 60.0
1 1 1 2020-06-01 2020-06-01 70.0 NaN NaN
2 2 1 2020-07-01 2020-08-01 20.0 30.0 NaN
Here is the dataframe I used:
id date number datespell new1 new2
0 1 2020-01-01 40 True False 0
1 1 2020-02-01 50 True False 0
2 1 2020-03-01 60 False True 0
3 1 2020-06-01 70 True False 1
4 2 2020-07-01 20 True False 1
5 2 2020-08-01 30 False True 1
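Note that the nested-dict form of agg used above was removed in pandas 1.0 (it raises a SpecificationError on newer versions). A sketch of the same aggregation with named aggregation instead, keeping the column names used above:
check = (df.groupby(['id', 'new2'])
           .agg(start=('date', 'first'),
                end=('date', 'last'),
                cols=('number', list))
           .reset_index())
pd.concat([check, check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)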
I am trying to 'join' two DataFrames based on a condition.
Condition
if df1.Year == df2.Year &
df1.Date >= df2.BeginDate or df1.Date <= df2.EndDate &
df1.ID == df2.ID
#if the condition is True, I would love to add an extra column (binary) to df1, something like
#df1.condition = Yes or No.
My data looks like this:
df1:
Year Week ID Date
2020 1 123 2020-01-01 00:00:00
2020 1 345 2020-01-01 00:00:00
2020 2 123 2020-01-07 00:00:00
2020 1 123 2020-01-01 00:00:00
df2:
Year BeginDate EndDate ID
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 123
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 123
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 978
2020 2020-09-21 00:00:00 2020-01-02 00:00:00 978
end_df: #Expected output
Year Week ID Condition
2020 1 123 True #Year is matching, week1 is between the dates, ID is matching too
2019 1 345 False #Year is not matching
2020 2 187 False # ID is not matching
2020 1 123 True # Same as first row.
I thought to solve this by looping over two DataFrames:
for index, row in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row['Year'] == row2['Year']:
            if row['ID'] == row2['ID']:
                .....
                .....
                row['Condition'] = True
            else:
                row['Condition'] = False
However... this is leading to error after error.
Really looking forward how you guys will tackle this problem. Many thanks in advance!
UPDATE 1
I created a loop. However, it is taking ages (and I am not sure how to add the value to a new column).
Note: in df1 I created a 'Date' column (in the same format as the BeginDate and EndDate from df2).
The key question now: how can I add the True value (at the end of the loop) to df1 in an extra column?
for index, row in df1.iterrows():
    row['Year'] = str(row['Year'])
    for index1, row1 in df2.iterrows():
        row1['Year'] = str(row1['Year'])
        if row['Year'] == row1['Year']:
            row['ID'] = str(row['ID'])
            row1['ID'] = str(row1['ID'])
            if row['ID'] == row1['ID']:
                if row['Date'] >= row1['BeginDate'] and row['Date'] <= row1['EndDate']:
                    print("I would like to add this YES to df1 in an extra column")
Edit 2
Trying @davidbilla's solution: it looks like the 'condition' column is not working correctly. As you can see, it matches even when df1.Year != df2.Year. Note that df2 is sorted by ID (so all the same unique numbers should be there).
I guess you are expecting something like this, if you are trying to match the dataframes row-wise (i.e. compare row 1 of df1 with row 1 of df2):
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID'])&((df1['Date']>=df2['BeginDate'])|(df1['Date']<=df2['EndDate'])), True, False)
np.where takes the condition as the first parameter; the second parameter is the value used where the condition is True, and the third parameter is the value used where it is False.
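For illustration, a minimal np.where example on a small made-up Series (not from the question):
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 10])
#np.where(condition, value_if_true, value_if_false)
print(np.where(s > 3, 'big', 'small'))
# ['small' 'big' 'big']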
EDIT 1:
Based on your sample dataset
df1 = pd.DataFrame([[2020,1,123],[2020,1,345],[2020,2,123],[2020,1,123]],
                   columns=['Year','Week','ID'])
df2 = pd.DataFrame([[2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',123],
                    [2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',123],
                    [2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',978],
                    [2020,'2020-09-21 00:00:00','2020-01-02 00:00:00',978]],
                   columns=['Year','BeginDate','EndDate','ID'])
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID']),True, False)
# &((df1['Date']>=df2['BeginDate'])or(df1['Date']<=df2['EndDate'])) - removed this condition as the df has no Date field
print(df1)
Output:
Year Week ID condition
0 2020 1 123 True
1 2020 1 345 False
2 2020 2 123 False
3 2020 1 123 False
EDIT 2: To compare one row in df1 with all rows in df2
df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))
This takes df1['Year'] and compares it against all values of df2['Year'].
Based on the sample dataset:
df1:
Year Date ID
0 2020 2020-01-01 123
1 2020 2020-01-01 345
2 2020 2020-10-01 123
3 2020 2020-11-13 123
df2:
Year BeginDate EndDate ID
0 2020 2020-01-01 2020-02-01 123
1 2020 2020-01-01 2020-01-02 123
2 2020 2020-03-01 2020-05-01 978
3 2020 2020-09-21 2020-10-01 978
Code change:
date_range = list(zip(df2['BeginDate'], df2['EndDate']))

def check_date(date):
    for (s, e) in date_range:
        if date >= s and date <= e:
            return True
    return False
df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))
df1['date_compare'] = df1['Date'].apply(lambda x: check_date(x)) # you can directly store this in df1['condition']. I just wanted to print the values so have used a new field
df1['condition'] = (df1['condition']==True)&(df1['date_compare']==True)
Output:
Year Date ID condition date_compare
0 2020 2020-01-01 123 True True # Year match, ID match and Date is within the range of df2 row 1
1 2020 2020-01-01 345 False True # Year match, ID no match
2 2020 2020-10-01 123 True True # Year match, ID match, Date is within range of df2 row 4
3 2020 2020-11-13 123 False False # Year match, ID match, but Date is not in range of any row in df2
EDIT 3:
Based on the updated question (earlier I thought it was OK if the three values year, id and date matched df2 across any rows, not necessarily the same row), I think I have a better understanding of your requirement now.
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['Date'] = pd.to_datetime(df1['Date'])
df1['condition'] = False
for idx1, row1 in df1.iterrows():
    match = False
    for idx2, row2 in df2.iterrows():
        if (row1['Year']==row2['Year']) & \
           (row1['ID']==row2['ID']) & \
           (row1['Date']>=row2['BeginDate']) & \
           (row1['Date']<=row2['EndDate']):
            match = True
    df1.at[idx1, 'condition'] = match
Output - Set 1:
DF1:
Year Date ID
0 2020 2020-01-01 123
1 2020 2020-01-01 123
2 2020 2020-01-01 345
3 2020 2020-01-10 123
4 2020 2020-11-13 123
DF2:
Year BeginDate EndDate ID
0 2020 2020-01-15 2020-02-01 123
1 2020 2020-01-01 2020-01-02 123
2 2020 2020-03-01 2020-05-01 978
3 2020 2020-09-21 2020-10-01 978
DF1 result:
Year Date ID condition
0 2020 2020-01-01 123 True
1 2020 2020-01-01 123 True
2 2020 2020-01-01 345 False
3 2020 2020-01-10 123 False
4 2020 2020-11-13 123 False
Output - Set 2:
DF1:
Year Date ID
0 2019 2019-01-01 s904112
1 2019 2019-01-01 s911243
2 2019 2019-01-01 s917131
3 2019 2019-01-01 sp986214
4 2019 2019-01-01 s510006
5 2020 2020-01-10 s540006
DF2:
Year BeginDate EndDate ID
0 2020 2020-01-27 2020-09-02 s904112
1 2020 2020-01-27 2020-09-02 s904112
2 2020 2020-01-03 2020-03-15 s904112
3 2020 2020-04-15 2020-01-05 s904112
4 2020 2020-01-05 2020-05-15 s540006
5 2019 2019-01-05 2019-05-15 s904112
DF1 Result:
Year Date ID condition
0 2019 2019-01-01 s904112 False
1 2019 2019-01-01 s911243 False
2 2019 2019-01-01 s917131 False
3 2019 2019-01-01 sp986214 False
4 2019 2019-01-01 s510006 False
5 2020 2020-01-10 s540006 True
The 2nd row of the desired output has Year 2019, so I assume the 2nd row of df1.Year is also 2019 rather than 2020.
If I understand correctly, you need to merge and then filter out Date values outside the BeginDate-EndDate range. First, there are duplicates and invalid date ranges in df2, so we need to drop duplicates and invalid ranges before the merge. Invalid date ranges are ranges where BeginDate >= EndDate, which is the case for index 3 of df2.
#convert all date columns of both `df1` and `df2` to datetime dtype
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['BeginDate', 'EndDate']] = df2[['BeginDate', 'EndDate']].apply(pd.to_datetime)
#left-merge on `Year`, `ID` and use `eval` to compute
#the `Condition` column where `Date` is between `BeginDate` and `EndDate`.
#Finally assign back to `df1`
df1['Condition'] = (df1.merge(df2.loc[df2.BeginDate < df2.EndDate].drop_duplicates(),
                              on=['Year','ID'], how='left')
                       .eval('Condition = BeginDate <= Date <= EndDate')['Condition'])
Out[614]:
Year Week ID Date Condition
0 2020 1 123 2020-01-01 True
1 2019 1 345 2020-01-01 False
2 2020 2 123 2020-01-07 False
3 2020 1 123 2020-01-01 True
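If you prefer to avoid eval, a sketch of the same check with Series.between on an intermediate merge result (the merged name is only for illustration):
merged = df1.merge(df2.loc[df2.BeginDate < df2.EndDate].drop_duplicates(),
                   on=['Year', 'ID'], how='left')
#inclusive on both ends, same as BeginDate <= Date <= EndDate
df1['Condition'] = merged['Date'].between(merged['BeginDate'], merged['EndDate'])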
I have a data frame that looks like this.
ATM ID Ref no Timestamp
1 11 2020/02/01 15:10:23
1 11 2020/02/01 15:11:03
1 111 2020/02/06 17:45:41
1 111 2020/02/06 18:11:03
2 22 2020/02/07 15:11:03
2 22 2020/02/07 15:25:01
2 22 2020/02/07 15:38:51
2 222 2020/02/07 15:11:03
and I would like to group it by ATM ID and Ref no, returning one row per ATM ID and Ref no combination, with the duration between the first and last timestamp for that combination.
Output format:
ATM ID Ref no Timestamp Diff
1 11 2020/02/01 15:11:03 00:00:40
1 111 2020/02/06 18:11:03 00:25:22
2 22 2020/02/07 15:38:51 00:27:48
2 222 2020/02/07 15:11:03 00:00:00
Use a custom lambda function in GroupBy.agg to take the difference between the last and first values:
df1 = (df.groupby(['ATM ID','Ref no'])['Timestamp']
         .agg(lambda x: x.iat[-1] - x.iat[0])
         .reset_index(name='diff'))
print (df1)
ATM ID Ref no diff
0 1 11 00:00:40
1 1 111 00:25:22
2 2 22 00:27:48
3 2 222 00:00:00
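Note that Timestamp has to be a real datetime dtype for the subtraction to return a timedelta; if it is still stored as strings, a conversion like this sketch (assuming the format shown in the question) is needed first:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y/%m/%d %H:%M:%S')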
Or aggregate last and first and create the new column with DataFrame.assign:
df1 = (df.groupby(['ATM ID','Ref no'])['Timestamp']
         .agg(['last','first'])
         #pop removes the helper columns as it returns them, so only diff is left
         .assign(diff = lambda x: x.pop('last') - x.pop('first'))
         .reset_index()
       )
print (df1)
ATM ID Ref no diff
0 1 11 00:00:40
1 1 111 00:25:22
2 2 22 00:27:48
3 2 222 00:00:00