Pandas groupby and average across unique values - python

I have the following dataframe
   ID ID2  SCORE  X  Y
0   0   a     10  1  2
1   0   b     20  2  3
2   0   b     20  3  4
3   0   b     30  4  5
4   1   c      5  5  6
5   1   d      6  6  7
What I would like to do is group by ID and ID2 and average SCORE, taking into consideration only unique scores.
Now, if I use the standard df.groupby(['ID', 'ID2'])['SCORE'].mean(), I get 23.33 for the (0, b) group, whereas what I am looking for is a score of 25.
I know I could drop X and Y, drop the duplicates and then take the mean, but I want to keep those columns as they are relevant.
How can I achieve that?
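For reference, a minimal sketch that rebuilds the sample frame above (values copied from the question), so the snippets below can be run as-is:
import pandas as pd

df = pd.DataFrame({'ID':    [0, 0, 0, 0, 1, 1],
                   'ID2':   ['a', 'b', 'b', 'b', 'c', 'd'],
                   'SCORE': [10, 20, 20, 30, 5, 6],
                   'X':     [1, 2, 3, 4, 5, 6],
                   'Y':     [2, 3, 4, 5, 6, 7]})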

If I understand correctly:
In [41]: df.groupby(['ID', 'ID2'])['SCORE'].agg(lambda x: x.unique().sum()/x.nunique())
Out[41]:
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64
Or a bit easier:
In [43]: df.groupby(['ID', 'ID2'])['SCORE'].agg(lambda x: x.unique().mean())
Out[43]:
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64

You can get the unique scores within groups of ('ID', 'ID2') by dropping duplicates beforehand.
cols = ['ID', 'ID2', 'SCORE']
d1 = df.drop_duplicates(cols)
d1.groupby(cols[:-1]).SCORE.mean()
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64

You could also use
In [108]: df.drop_duplicates(['ID', 'ID2', 'SCORE']).groupby(['ID', 'ID2'])['SCORE'].mean()
Out[108]:
ID  ID2
0   a      10
    b      25
1   c       5
    d       6
Name: SCORE, dtype: int64

Related

Python - Count duplicate user ID occurrences in a given month

If I create a DataFrame from
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26','2022-08-30','2022-09-3','2022-09-8','2022-09-13'],
"id": ['A','B','C','D','E','B','A','F','G','F','H']})
df['date'] = pd.to_datetime(df['date'])
(Table 1 below shows the data.)
I am interested in counting how many times an ID appears in a given month. For example, in a given month A, B and F all occur twice whilst everything else occurs once. The difficulty with this data is that the dates are not evenly spread out.
I attempted to resample on date by month, with the hope of counting duplicates.
df.resample('M', on='date')['id']
But all the functions that can be used on resample just give me the number of unique occurrences rather than how many times each ID occurred.
A rough example of the output is below [Table 2]
All of the examples I have seen merely count how many total or unique occurrences there are in a given month; this question is focused on finding out how many occurrences each ID had in a month.
Thank you for your time.
[Table 1] - Data
idx  date        id
0    2022-08-10  A
1    2022-08-18  B
2    2022-08-18  C
3    2022-08-20  D
4    2022-08-20  E
5    2022-08-24  B
6    2022-08-26  A
7    2022-08-30  F
8    2022-09-03  G
9    2022-09-08  F
10   2022-09-13  H
[Table 2] - Rough example of desired output
id   occurences in a month
A    2
B    2
C    1
D    1
E    1
F    2
G    1
H    1
Use Series.dt.to_period to get month periods and count values per id with GroupBy.size, then aggregate the counts with sum:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurences in a month'))
print (df1)
id occurences in a month
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 1
Or use Grouper:
df1 = (df.groupby(['id', pd.Grouper(freq='M', key='date')])
         .size()
         .groupby(level=0)
         .sum()
         .reset_index(name='occurences in a month'))
print (df1)
EDIT:
df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26',
'2022-08-30','2022-09-3','2022-09-8','2022-09-13','2050-12-15'],
"id": ['A','B','C','D','E','B','A','F','G','F','H','H']})
df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')
print (df)
Because we first count per month (or per day or per date) and then sum the counts, the result is the same as counting per id directly:
df1 = df.groupby('id').size().reset_index(name='occurences')
print (df1)
id occurences
0 A 2
1 B 2
2 C 1
3 D 1
4 E 1
5 F 2
6 G 1
7 H 2
It is the same sum of counts per id:
df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
         .size())
print (df1)
id  date
A   2022-08    2
B   2022-08    2
C   2022-08    1
D   2022-08    1
E   2022-08    1
F   2022-08    1
    2022-09    1
G   2022-09    1
H   2022-09    1
    2050-12    1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.to_period('d')])
         .size())
print (df1)
id  date
A   2022-08-10    1
    2022-08-26    1
B   2022-08-18    1
    2022-08-24    1
C   2022-08-18    1
D   2022-08-20    1
E   2022-08-20    1
F   2022-08-30    1
    2022-09-08    1
G   2022-09-03    1
H   2022-09-13    1
    2050-12-15    1
dtype: int64
df1 = (df.groupby(['id', df['date'].dt.day])
         .size())
print (df1)
id  date
A   10    1
    26    1
B   18    1
    24    1
C   18    1
D   20    1
E   20    1
F    8    1
    30    1
G    3    1
H   13    1
    15    1
dtype: int64

Assign list of arrays to several columns in Pandas DataFrame (Performance-optimized)

Given the following DF:
df = pd.DataFrame(data=np.random.randint(1,10,size=(10,4)),columns=list("abcd"),dtype=np.int64)
Let's say I want to update the first two columns with a list of two NumPy arrays that have specific dtypes (e.g. np.int8 and np.float32) --> update_vals = [np.arange(1,11,dtype=np.int8), np.ones(10,dtype=np.float32)]
I can do the following that works: df[["a","b"]] = pd.DataFrame(dict(zip(list("ab"),update_vals)))
Expected outcome of column dtypes:
a: np.int8
b: np.float32
c, d: np.int64
Is there maybe a faster way to do this?
Update
Why not simply:
df['a'] = update_vals[0]
df['b'] = update_vals[1]
print(df.dtypes)
# Output:
a int8
b float32
c int64
d int64
dtype: object
Or:
for col, arr in zip(df.columns, update_vals):
    df[col] = arr
Use:
df[['a', 'b']] = np.array(update_vals).T
print(df)
# Output:
a b c d
0 1 1 1 2
1 2 1 5 1
2 3 1 4 8
3 4 1 6 3
4 5 1 3 4
5 6 1 8 2
6 7 1 3 1
7 8 1 8 7
8 9 1 4 1
9 10 1 3 6
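One caveat worth noting (my own observation, not part of the answers above): np.array(update_vals) stacks both arrays into a single 2-D array with one common dtype, so the int8/float32 split asked for in the question is lost before the assignment even happens; the per-column assignments shown earlier keep each array's own dtype.
import numpy as np

update_vals = [np.arange(1, 11, dtype=np.int8), np.ones(10, dtype=np.float32)]
# The stacked array is promoted to the common dtype, so column 'a'
# would no longer come out as int8 after the assignment.
print(np.array(update_vals).dtype)  # float32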

How to normalize dataframe based on a column weight

I have this dataframe, and I want to normalize/standardize it (columns B, C, D) using column A as weight.
 A  B  C   D
34  5  1  12
26  9  0   2
10  0  4   1
Is that possible?
It sounds like you would like to divide the values in columns B, C, and D by the corresponding row value in column A.
To do this with a pandas dataframe called df:
print(df)
A B C D
34 5 1 12
26 9 0 2
10 0 4 1
cols = df.columns[1:]
for column in cols:
    df[column] = df[column] / df["A"]
print(df)
A B C D
34 0.147059 0.029412 0.352941
26 0.346154 0.000000 0.076923
10 0.000000 0.400000 0.100000
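A vectorized alternative to the loop (just a sketch, assuming the original df and that 'A' is the first column) does the same division in a single call:
# Divide every column except 'A' by column 'A', row by row.
df[df.columns[1:]] = df[df.columns[1:]].div(df['A'], axis=0)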

pandas dataframe delete groups with more than n rows in groupby

I have a dataframe:
df = [type1  type2  type3  val1  val2  val3
      a      b      q      1     2     3
      a      c      w      3     5     2
      b      c      t      2     9     0
      a      b      p      4     6     7
      a      c      m      2     1     8
      a      b      h      8     6     3
      a      b      e      4     2     7]
I want to apply groupby based on columns type1, type2 and delete from the dataframe the groups with more than 2 rows. So the new dataframe will be:
df = [type1  type2  type3  val1  val2  val3
      a      c      w      3     5     2
      b      c      t      2     9     0
      a      c      m      2     1     8]
What is the best way to do so?
Use GroupBy.transform to get the size of each group as a Series with the same length as the original, so you can filter with Series.le (i.e. <=) in boolean indexing:
df = df[df.groupby(['type1','type2'])['type1'].transform('size').le(2)]
print (df)
type1 type2 type3 val1 val2 val3
1 a c w 3 5 2
2 b c t 2 9 0
4 a c m 2 1 8
If performance is not important, or the DataFrame is small, it is possible to use DataFrameGroupBy.filter:
df = df.groupby(['type1','type2']).filter(lambda x: len(x) <= 2)
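If you ever need the opposite selection, keeping only the groups with more than 2 rows, the same transform pattern works with Series.gt (a sketch applied to the original df, not part of the answers above):
# Keep ('type1', 'type2') groups that have more than 2 rows.
df_large = df[df.groupby(['type1', 'type2'])['type1'].transform('size').gt(2)]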

Change data type of a specific column of a pandas dataframe

I want to sort a dataframe with many columns by a specific column, but first I need to change its type from object to int. How can I change the data type of this specific column while keeping the original column positions?
df['colname'] = df['colname'].astype(int) works when changing from float values to int, at least.
I have tried the following:
df['column']=df.column.astype('int64')
and it worked for me.
You can use reindex with the column sorted by sort_values after casting it to int with astype:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'colname':['7','3','9'],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B D E F colname
0 1 4 1 5 7 7
1 2 5 3 3 4 3
2 3 6 5 6 3 9
print (df.colname.astype(int).sort_values())
1 3
0 7
2 9
Name: colname, dtype: int32
print (df.reindex(df.colname.astype(int).sort_values().index))
A B D E F colname
1 2 5 3 3 4 3
0 1 4 1 5 7 7
2 3 6 5 6 3 9
print (df.reindex(df.colname.astype(int).sort_values().index).reset_index(drop=True))
A B D E F colname
0 2 5 3 3 4 3
1 1 4 1 5 7 7
2 3 6 5 6 3 9
If the first solution does not work because of None or bad data, use to_numeric:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'colname':['7','3','None'],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B D E F colname
0 1 4 1 5 7 7
1 2 5 3 3 4 3
2 3 6 5 6 3 None
print (pd.to_numeric(df.colname, errors='coerce').sort_values())
1 3.0
0 7.0
2 NaN
Name: colname, dtype: float64
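If an integer dtype is still needed after to_numeric has introduced NaN, one option (my addition, assuming pandas 1.0 or newer) is the nullable Int64 extension dtype, which can hold missing values:
# NaN rows become <NA> instead of forcing the whole column to float.
print (pd.to_numeric(df.colname, errors='coerce').astype('Int64'))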
To simply change one column, here is what you can do:
df['column_name'] = df.column_name.apply(int)
You can replace int with the desired datatype, e.g. np.int64, str or category.
For multiple datatype changes, I would recommend the following:
df = pd.read_csv(data, dtype={'Col_A': str, 'Col_B': 'int64'})
The documentation provides all the information needed. Let's take the toy dataframe from the docs:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
If we want to cast col1 to int32 we can use, for instance, a dictionary:
df.astype({'col1': 'int32'})
In addition, this approach allows you to avoid the SettingWithCopyWarning.
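A small follow-up (my note, not from the answer above): astype returns a new DataFrame rather than changing df in place, so assign the result back if you want to keep the converted column:
df = df.astype({'col1': 'int32'})
print (df.dtypes)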
