Python - Count duplicate user IDs' occurrences in a given month

If I create a DataFrame from
df = pd.DataFrame({"date": ['2022-08-10', '2022-08-18', '2022-08-18', '2022-08-20', '2022-08-20', '2022-08-24',
                            '2022-08-26', '2022-08-30', '2022-09-3', '2022-09-8', '2022-09-13'],
                   "id": ['A', 'B', 'C', 'D', 'E', 'B', 'A', 'F', 'G', 'F', 'H']})
df['date'] = pd.to_datetime(df['date'])
(Table 1 below shows the data.)
I am interested in counting how many times an ID appears in a given month. For example, in a given month A, B and F all occur twice, whilst everything else occurs once. The difficulty with this data is that the frequency of dates is not evenly spread out.
I attempted to resample on date by month, with the hope of counting duplicates:
df.resample('M', on='date')['id']
But all the functions that can be used on the resampled object just give me the number of unique occurrences, rather than how many times each ID occurred.
A rough example of the output is below [Table 2].
All of the examples I have seen merely count the total or unique occurrences in a given month; this question is focused on finding out how many occurrences each ID had in a month.
Thank you for your time.
[Table 1] - Data

idx  date        id
0    2022-08-10  A
1    2022-08-18  B
2    2022-08-18  C
3    2022-08-20  D
4    2022-08-20  E
5    2022-08-24  B
6    2022-08-26  A
7    2022-08-30  F
8    2022-09-03  G
9    2022-09-08  F
10   2022-09-13  H
[Table 2] - Rough example of desired output

id  occurrences in a month
A   2
B   2
C   1
D   1
E   1
F   2
G   1
H   1

Use Series.dt.to_period for month periods, count values per id and month with GroupBy.size, then aggregate with sum:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurrences in a month'))
print(df1)

  id  occurrences in a month
0  A                       2
1  B                       2
2  C                       1
3  D                       1
4  E                       1
5  F                       2
6  G                       1
7  H                       1
Or use Grouper:
df1 = (df.groupby(['id', pd.Grouper(freq='M', key='date')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurrences in a month'))
print(df1)
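Note: on recent pandas (2.2 and later, if I recall the deprecation schedule correctly), the 'M' offset alias used by Grouper raises a FutureWarning in favour of 'ME' (month end), so the Grouper variant would become:

df1 = (df.groupby(['id', pd.Grouper(freq='ME', key='date')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurrences in a month'))

The 'M' passed to to_period is a period alias, not an offset alias, and is unaffected.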
EDIT:
df = pd.DataFrame({"date": ['2022-08-10', '2022-08-18', '2022-08-18', '2022-08-20', '2022-08-20', '2022-08-24', '2022-08-26',
                            '2022-08-30', '2022-09-3', '2022-09-8', '2022-09-13', '2050-12-15'],
                   "id": ['A', 'B', 'C', 'D', 'E', 'B', 'A', 'F', 'G', 'F', 'H', 'H']})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
print(df)
Because the counts are taken per month (or per day, or per date) first and then summed per id, the result is the same as counting per id directly:
df1 = df.groupby('id').size().reset_index(name='occurrences')
print(df1)

  id  occurrences
0  A            2
1  B            2
2  C            1
3  D            1
4  E            1
5  F            2
6  G            1
7  H            2
The intermediate per-month and per-day counts sum to the same totals per id:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size())
print(df1)

id  date
A   2022-08    2
B   2022-08    2
C   2022-08    1
D   2022-08    1
E   2022-08    1
F   2022-08    1
    2022-09    1
G   2022-09    1
H   2022-09    1
    2050-12    1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.to_period('d')])
         .size())
print(df1)

id  date
A   2022-08-10    1
    2022-08-26    1
B   2022-08-18    1
    2022-08-24    1
C   2022-08-18    1
D   2022-08-20    1
E   2022-08-20    1
F   2022-08-30    1
    2022-09-08    1
G   2022-09-03    1
H   2022-09-13    1
    2050-12-15    1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.day])
         .size())
print(df1)

id  date
A   10    1
    26    1
B   18    1
    24    1
C   18    1
D   20    1
E   20    1
F   8     1
    30    1
G   3     1
H   13    1
    15    1
dtype: int64
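If you instead want to see how the counts are laid out per month, a small pivot (my sketch, not part of the original answer) makes the distribution easy to inspect:

monthly = (df.groupby([df['date'].dt.to_period('m'), 'id'])
             .size()
             .unstack(fill_value=0))
print(monthly)

This yields one row per month and one column per id, with zeros where an id did not appear in that month.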

Related

Generate a crosstab type dataframe with a binary count value in pandas

I have a pandas dataframe like this:

UIID  ISBN
a     12
b     13

I want to compare each UIID with each ISBN and add a count column to the dataframe:

UIID  ISBN  Count
a     12    1
a     13    0
b     12    0
b     13    1

How can this be done in pandas? I know the crosstab function does a similar thing, but I want the data in this format.
Use crosstab with melt:
df = pd.crosstab(df['UIID'], df['ISBN']).reset_index().melt('UIID', value_name='count')
print(df)

  UIID  ISBN  count
0    a    12      1
1    b    12      0
2    a    13      0
3    b    13      1
Alternative solution with GroupBy.size and reindex by MultiIndex.from_product:
s = df.groupby(['UIID', 'ISBN']).size()
mux = pd.MultiIndex.from_product(s.index.levels, names=s.index.names)
df = s.reindex(mux, fill_value=0).reset_index(name='count')
print(df)

  UIID  ISBN  count
0    a    12      1
1    a    13      0
2    b    12      0
3    b    13      1
You can also use pd.DataFrame.unstack:
df = pd.crosstab(df.UIID, df.ISBN).unstack().reset_index()
print(df)

   ISBN UIID  0
0    12    a  1
1    12    b  0
2    13    a  0
3    13    b  1
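For completeness, a compact variant of the same idea (my sketch, not one of the answers above) is to stack the crosstab instead of melting it, which also keeps the desired column order:

import pandas as pd

df = pd.DataFrame({'UIID': ['a', 'b'], 'ISBN': [12, 13]})

# crosstab produces the 2x2 indicator grid; stack tips it back into long form
out = (pd.crosstab(df['UIID'], df['ISBN'])
         .stack()
         .reset_index(name='count'))
print(out)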

Python Pandas - Deal with duplicates

I want to deal with duplicates in a pandas df:
df = pd.DataFrame({'A': [1, 1, 1, 2, 1],
                   'B': [2, 2, 1, 2, 1],
                   'C': [2, 2, 1, 1, 1],
                   'D': ['a', 'c', 'a', 'c', 'c']})
df
I want to keep only rows with unique values of A, B, C and create binary columns D_a and D_c, so the result will be something like this, without doing super slow loops on each row:
result = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 1, 2], 'C': [2, 1, 1], 'D_a': [1, 1, 0], 'D_c': [1, 1, 1]})
Thanks a lot.
You can use:
df1 = (df.groupby(['A', 'B', 'C'])['D']
         .value_counts()
         .unstack(fill_value=0)
         .add_prefix('D_')
         .clip(upper=1)  # clip_upper was removed in pandas 1.0
         .reset_index()
         .rename_axis(None, axis=1))
print(df1)

   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
Using get_dummies + sum:
df = (df.set_index(['A', 'B', 'C'])
        .D.str.get_dummies()
        .groupby(level=[0, 1, 2]).sum()  # sum(level=...) was removed in pandas 2.0
        .add_prefix('D_')
        .reset_index())
df

   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
You can do something like this:
df.loc[df['D'] == 'a', 'D_a'] = 1
df.loc[df['D'] == 'c', 'D_c'] = 1
This puts a 1 in a new column wherever an "a" or "c" appears:

   A  B  C  D  D_a  D_c
0  1  2  2  a  1.0  NaN
1  1  2  2  c  NaN  1.0
2  1  1  1  a  1.0  NaN
3  2  2  1  c  NaN  1.0
4  1  1  1  c  NaN  1.0

Then you have to replace the NaN with 0:
df = df.fillna(0)
Next, select only the columns you need and drop the duplicates:
df = df[["A", "B", "C", "D_a", "D_c"]].drop_duplicates()
Hope this is the solution you were looking for.
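On newer pandas versions, a similar result can be sketched with pd.get_dummies on the whole frame followed by a groupby max, which acts as a logical OR over the indicator columns (my variant, not from the answers above):

out = (pd.get_dummies(df, columns=['D'], prefix='D')
         .groupby(['A', 'B', 'C'], as_index=False)
         .max())
# recent pandas produces boolean dummies; cast to 0/1 to match the question
out[['D_a', 'D_c']] = out[['D_a', 'D_c']].astype(int)
print(out)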

Removing last rows of each group based on condition in a pandas dataframe

I have the following dataframe:

  name gender  count
0    A      M      3
1    A      F      2
2    A    NaN      3
3    B    NaN      2
4    C      F      4
5    D      M      5
6    D    NaN      5
I would like to build a resulting dataframe df1 which deletes the last row of each name group if that group contains more than one row. For example, name A is present 3 times, hence the last row containing A should be removed. B and C are only present once, hence the rows containing them should be retained.
Resulting dataframe df1 should be like this:

  name gender  count
0    A      M      3
1    A      F      2
2    B    NaN      2
3    C      F      4
4    D      M      5
Please advise.
Use:
In [4598]: (df.groupby('name').apply(lambda x: x.iloc[:-1] if len(x) > 1 else x)
              .reset_index(drop=True))
Out[4598]:
  name gender  count
0    A      M      3
1    A      F      2
2    B    NaN      2
3    C      F      4
4    D      M      5
Using groupby + head:
g = (df.groupby('name', as_index=False, group_keys=False)
       .apply(lambda x: x.head(-1) if x.shape[0] > 1 else x))
print(g)

  name gender  count
0    A      M      3
1    A      F      2
3    B    NaN      2
4    C      F      4
5    D      M      5
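Both answers rely on groupby.apply, which can be slow on large frames. A vectorized alternative (my sketch, using duplicated and transform, not from the answers above) marks the last row of each name and drops it only when the group has more than one row:

import pandas as pd
import numpy as np

# reconstructing the question's data
df = pd.DataFrame({'name': ['A', 'A', 'A', 'B', 'C', 'D', 'D'],
                   'gender': ['M', 'F', np.nan, np.nan, 'F', 'M', np.nan],
                   'count': [3, 2, 3, 2, 4, 5, 5]})

is_last = ~df.duplicated('name', keep='last')              # True only on the last row of each name
group_size = df.groupby('name')['name'].transform('size')  # rows per name, broadcast to each row
df1 = df[~(is_last & group_size.gt(1))].reset_index(drop=True)
print(df1)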

Pandas groupby and average across unique values

I have the following dataframe:

   ID ID2  SCORE  X  Y
0   0   a     10  1  2
1   0   b     20  2  3
2   0   b     20  3  4
3   0   b     30  4  5
4   1   c      5  5  6
5   1   d      6  6  7
What I would like to do is to group by ID and ID2 and average the SCORE, taking into consideration only UNIQUE scores.
Now, if I use the standard df.groupby(['ID', 'ID2'])['SCORE'].mean() I get roughly 23.33, where what I am looking for is a score of 25.
I know I can filter out X and Y, drop the duplicates and do that, but I want to keep them as they are relevant.
How can I achieve that?
If I understand correctly:
In [41]: df.groupby(['ID', 'ID2'])['SCORE'].agg(lambda x: x.unique().sum() / x.nunique())
Out[41]:
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64
or a bit easier:
In [43]: df.groupby(['ID', 'ID2'])['SCORE'].agg(lambda x: x.unique().mean())
Out[43]:
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64
You can get the unique scores within groups of ('ID', 'ID2') by dropping duplicates beforehand:
cols = ['ID', 'ID2', 'SCORE']
d1 = df.drop_duplicates(cols)
d1.groupby(cols[:-1]).SCORE.mean()

ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64
You could also use:
In [108]: df.drop_duplicates(['ID', 'ID2', 'SCORE']).groupby(['ID', 'ID2'])['SCORE'].mean()
Out[108]:
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64
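All three answers return only the aggregated score, while the question mentions wanting to keep X and Y. One option (my sketch, assuming the goal is to attach the unique-score mean back onto the original rows) is transform, which broadcasts the per-group scalar to every row:

import pandas as pd

# reconstructing the question's data
df = pd.DataFrame({'ID': [0, 0, 0, 0, 1, 1],
                   'ID2': list('abbbcd'),
                   'SCORE': [10, 20, 20, 30, 5, 6],
                   'X': [1, 2, 3, 4, 5, 6],
                   'Y': [2, 3, 4, 5, 6, 7]})

# mean over the unique scores of each (ID, ID2) group, kept alongside X and Y
df['SCORE_unique_mean'] = (df.groupby(['ID', 'ID2'])['SCORE']
                             .transform(lambda s: s.unique().mean()))
print(df)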

Merge dataframes on nearest datetime / timestamp

I have two data frames as follows:
A = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"],
                  "date": ["06/22/2014", "07/02/2014", "01/01/2015", "01/01/1991", "08/02/1999"]})
B = pd.DataFrame({"ID": ["A", "A", "C", "B", "B"],
                  "date": ["02/15/2015", "06/30/2014", "07/02/1999", "10/05/1990", "06/24/2014"],
                  "value": ["3", "5", "1", "7", "8"]})
Which look like the following:
>>> A
  ID       date
0  A 2014-06-22
1  A 2014-07-02
2  C 2015-01-01
3  B 1991-01-01
4  B 1999-08-02
>>> B
  ID       date value
0  A 2015-02-15     3
1  A 2014-06-30     5
2  C 1999-07-02     1
3  B 1990-10-05     7
4  B 2014-06-24     8
I want to merge A with the values of B using the nearest date. In this example none of the dates match, but it could be the case that some do.
The output should be something like this:
>>> C
  ID        date value
0  A  06/22/2014     8
1  A  07/02/2014     5
2  C  01/01/2015     3
3  B  01/01/1991     7
4  B  08/02/1999     1
It seems to me that there should be a native function in pandas that allows this.
Note: a similar question has been asked here:
pandas.merge: match the nearest time stamp >= the series of timestamps
You can use reindex with method='nearest' and then merge:
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
A.sort_values('date', inplace=True)
B.sort_values('date', inplace=True)
B1 = B.set_index('date').reindex(A.set_index('date').index, method='nearest').reset_index()
print(B1)
print(pd.merge(A, B1, on='date'))

  ID_x       date ID_y value
0    B 1991-01-01    B     7
1    B 1999-08-02    C     1
2    A 2014-06-22    B     8
3    A 2014-07-02    A     5
4    C 2015-01-01    A     3

You can also add the suffixes parameter:
print(pd.merge(A, B1, on='date', suffixes=('_', '')))

  ID_       date ID value
0   B 1991-01-01  B     7
1   B 1999-08-02  C     1
2   A 2014-06-22  B     8
3   A 2014-07-02  A     5
4   C 2015-01-01  A     3
Another option is merge_asof:
pd.merge_asof(A, B, on="date", direction='nearest')
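merge_asof needs both frames sorted by the merge key, and with the call above it matches the nearest date across all IDs, like the reindex approach. If matches should be restricted to the same ID, pass by='ID' (an assumption about the intended semantics; the question does not say):

# dates must be real datetimes and both frames sorted on the key
A['date'] = pd.to_datetime(A['date'])
B['date'] = pd.to_datetime(B['date'])
A = A.sort_values('date')
B = B.sort_values('date')

C = pd.merge_asof(A, B, on='date', direction='nearest')                  # global nearest date
C_by_id = pd.merge_asof(A, B, on='date', by='ID', direction='nearest')   # nearest within each ID
print(C_by_id)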
