multiple groupby and get unique count - python

I have the following df1:
id period color size rate
1 01 red 12 30
1 02 red 12 30
2 01 blue 12 35
3 03 blue 12 35
4 01 blue 12 35
4 02 blue 12 35
5 01 pink 10 40
6 01 pink 10 40
I need to create an index column that is a combination of color-size-rate, and count the number of unique id(s) that have that combination. My output df after the transformation should look like this:
id period color size rate index count
1 01 red 12 30 red-12-30 1
1 02 red 12 30 red-12-30 1
2 01 blue 12 35 blue-12-35 3
3 03 blue 12 35 blue-12-35 3
4 01 blue 12 35 blue-12-35 3
4 02 blue 12 35 blue-12-35 3
5 01 pink 10 40 pink-10-40 2
6 01 pink 10 40 pink-10-40 2
I am able to get a count, but it is counting the number of occurrences rather than the number of unique ids:
1 01 red 12 30 red-12-30 2
1 02 red 12 30 red-12-30 2
2 01 blue 12 35 blue-12-35 4
2 03 blue 12 35 blue-12-35 4
4 01 blue 12 35 blue-12-35 4
4 02 blue 12 35 blue-12-35 4
This is wrong as it is not actually grouping by id to count unique ones.
Appreciate any pointers in this direction.
Adding an edit here as my requirement changed:
The count needs to also be grouped by 'period', i.e. my final df should be:
index period count
red-12-30 01 1
red-12-30 02 1
blue-12-35 01 2
blue-12-35 03 1
blue-12-35 02 1
pink-10-40 01 2
When I try adding another groupby on 'period', I get a dimension mismatch error.
Thank you in advance.
Solution from @anky:

You can try aggregating a join to create the index column, then group on the same key and get the number of unique ids with groupby + transform('nunique'):
# join the three columns row-wise to build the composite key
idx = df[['color', 'size', 'rate']].astype(str).agg('-'.join, axis=1)
# count unique ids per key and broadcast the result back to every row
out = df.assign(index=idx, count=df.groupby(idx)['id'].transform('nunique'))
print(out)
id period color size rate index count
0 1 1 red 12 30 red-12-30 1
1 1 2 red 12 30 red-12-30 1
2 2 1 blue 12 35 blue-12-35 3
3 3 3 blue 12 35 blue-12-35 3
4 4 1 blue 12 35 blue-12-35 3
5 4 2 blue 12 35 blue-12-35 3
6 5 1 pink 10 40 pink-10-40 2
7 6 1 pink 10 40 pink-10-40 2
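For the edited requirement (counts per period), one sketch is to pass both the composite key and 'period' to groupby and aggregate with nunique instead of transforming; the frame below is a reconstruction of the sample data from the question:

```python
import pandas as pd

# reconstruction of the sample frame from the question
df = pd.DataFrame({
    'id':     [1, 1, 2, 3, 4, 4, 5, 6],
    'period': ['01', '02', '01', '03', '01', '02', '01', '01'],
    'color':  ['red', 'red', 'blue', 'blue', 'blue', 'blue', 'pink', 'pink'],
    'size':   [12, 12, 12, 12, 12, 12, 10, 10],
    'rate':   [30, 30, 35, 35, 35, 35, 40, 40],
})

idx = df[['color', 'size', 'rate']].astype(str).agg('-'.join, axis=1)
# group by the composite key AND 'period', then count unique ids per pair
out = (df.groupby([idx.rename('index'), 'period'])['id']
         .nunique()
         .reset_index(name='count'))
print(out)
```

Because this aggregates instead of transforming, the result has one row per index-period pair rather than one row per original row, which avoids the dimension mismatch mentioned in the question.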

Related

Iterate over rows and calculate values

I have the following pandas dataframe:
temp stage issue_datetime
20 1 2022/11/30 19:20
21 1 2022/11/30 19:21
20 1 None
25 1 2022/11/30 20:10
30 2 None
22 2 2022/12/01 10:00
22 2 2022/12/01 10:01
31 3 2022/12/02 11:00
32 3 2022/12/02 11:01
19 1 None
20 1 None
I want to get the following result:
temp stage num_issues
20 1 3
21 1 3
20 1 3
25 1 3
30 2 2
22 2 2
22 2 2
31 3 2
32 3 2
19 1 0
20 1 0
Basically, I need to count the non-None values per continuous run of stage and create a new column called num_issues.
How can I do it?
You can find the blocks of continuous values with cumsum on the comparison to the shifted column, then group by those blocks and transform the count of non-null values:
# a new block starts whenever 'stage' differs from the previous row
blocks = df['stage'].ne(df['stage'].shift()).cumsum()
df['num_issues'] = df['issue_datetime'].notna().groupby(blocks).transform('sum')
# or, equivalently, since 'count' ignores nulls:
# df['num_issues'] = df['issue_datetime'].groupby(blocks).transform('count')
Output:
temp stage issue_datetime num_issues
0 20 1 2022/11/30 19:20 3
1 21 1 2022/11/30 19:21 3
2 20 1 None 3
3 25 1 2022/11/30 20:10 3
4 30 2 None 2
5 22 2 2022/12/01 10:00 2
6 22 2 2022/12/01 10:01 2
7 31 3 2022/12/02 11:00 2
8 32 3 2022/12/02 11:01 2
9 19 1 None 0
10 20 1 None 0
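To see why a plain groupby('stage') would wrongly merge the two separate runs of stage 1, it helps to print the block labels; a minimal sketch with the stage values from the question:

```python
import pandas as pd

# the 'stage' column reconstructed from the question
stage = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 3, 1, 1])

# comparing each value to its predecessor marks where a new run starts;
# the cumulative sum then gives every continuous run its own label
blocks = stage.ne(stage.shift()).cumsum()
print(blocks.tolist())  # -> [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

The trailing stage-1 rows get label 4, distinct from the first run's label 1, so their issue count (0) is computed separately.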

multiple cumulative sum based on grouped columns

I have a dataset where I would like to sum two columns and then perform a subtraction while displaying a cumulative sum.
Data
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2
a q122 4 1 5 50 25 20 55 21 1 1
a q222 1 1 2 50 25 20 57 22 0 0
a q322 0 0 0 50 25 20 57 22 5 5
b q122 5 5 10 100 30 40 110 27 4 4
b q222 2 2 4 100 30 70 114 29 5 1
b q322 3 4 7 100 30 70 121 33 0 1
Desired
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2 finalt1
a q122 4 1 5 50 25 20 55 21 1 1 28
a q222 1 1 2 50 25 20 57 22 0 0 29
a q322 0 0 0 50 25 20 57 22 5 5 24
b q122 5 5 10 100 30 40 110 27 4 4 31
b q222 2 2 4 100 30 70 114 29 5 1 28
b q322 3 4 7 100 30 70 121 33 0 1 31
Logic
Create 'finalt1' column by summing 't1' and 'cur_t1'
initially and then subtracting 'de_t1' cumulatively and grouping by 'id' and 'date'
Doing
df['finalt1'] = df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
I am still researching on how to subtract the 'de_t1' column cumulatively.
I can't test right now, but logically:
(df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
.sub(df.groupby('id')['de_t1'].cumsum())
)
Of note, there was also this possibility to avoid grouping twice (it is calculating both cumsums at once and computing the difference), but it is actually slower:
df['cur_t1'].add(df.groupby('id')[['de_t1', 't1']].cumsum().diff(axis=1)['t1'])
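As a sanity check, the chained add/sub version reproduces the desired finalt1 column on a reconstruction of the relevant sample columns:

```python
import pandas as pd

# reconstruction of the relevant columns from the question
df = pd.DataFrame({
    'id':     ['a', 'a', 'a', 'b', 'b', 'b'],
    't1':     [4, 1, 0, 5, 2, 3],
    'cur_t1': [25, 25, 25, 30, 30, 30],
    'de_t1':  [1, 0, 5, 4, 5, 0],
})

# running sum of t1 minus running sum of de_t1, both per id,
# offset by the constant cur_t1
df['finalt1'] = (df['cur_t1']
                 .add(df.groupby('id')['t1'].cumsum())
                 .sub(df.groupby('id')['de_t1'].cumsum()))
print(df['finalt1'].tolist())  # -> [28, 29, 24, 31, 28, 31]
```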

Multiple values to pivot in dataframe

I have a dataset where I would like to pivot the entire dataframe, using certain columns as values.
Data
id date sun moon stars total pcp base final status space galaxy
aa Q1 21 5 1 2 8 0 200 41 5 1 1
aa Q2 21 4 1 2 7 1 200 50 6 2 1
Desired
id date type pcp base final final2 status type2 final3
aa Q1 21 sun 0 200 41 5 5 space 1
aa Q1 21 moon 0 200 41 1 5 galaxy 1
aa Q1 21 stars 0 200 41 2 5 space 1
aa Q2 21 sun 1 200 50 4 6 space 2
aa Q2 21 moon 1 200 50 1 6 galaxy 1
aa Q2 21 stars 1 200 50 2 6 space 2
Doing
df.drop(columns='total').melt(['id','date','final','final2','base','ppp'],var_name='type',value_name='ppp')
This works well for pivoting the first set of values (sun, moon, etc.); however, I am not sure how to incorporate the second set, space and galaxy.
Any suggestion is appreciated
This is a partial answer:
cols = ['id', 'date', 'pcp', 'base', 'final', 'status']
df = df.drop(columns='total')
df1 = df.melt(id_vars=cols, value_vars=['sun', 'moon', 'stars'], var_name='type')
df2 = df.melt(id_vars=cols, value_vars=['galaxy', 'space'], var_name='type2')
out = pd.merge(df1, df2, on=cols)
At this point, your dataframe looks like:
>>> out
id date pcp base final status type value_x type2 value_y
0 aa Q1 21 0 200 41 5 sun 5 galaxy 1
1 aa Q1 21 0 200 41 5 sun 5 space 1
2 aa Q1 21 0 200 41 5 moon 1 galaxy 1
3 aa Q1 21 0 200 41 5 moon 1 space 1
4 aa Q1 21 0 200 41 5 stars 2 galaxy 1
5 aa Q1 21 0 200 41 5 stars 2 space 1
6 aa Q2 21 1 200 50 6 sun 4 galaxy 1
7 aa Q2 21 1 200 50 6 sun 4 space 2
8 aa Q2 21 1 200 50 6 moon 1 galaxy 1
9 aa Q2 21 1 200 50 6 moon 1 space 2
10 aa Q2 21 1 200 50 6 stars 2 galaxy 1
11 aa Q2 21 1 200 50 6 stars 2 space 2
Now the remaining question is how final3 is determined, so that the dataframe can be reduced to the desired shape.

groupby two columns and count unique values from a third column

I have the following df1:
id period color size rate
1 01 red 12 30
1 02 red 12 30
2 01 blue 12 35
3 03 blue 12 35
4 01 blue 12 35
4 02 blue 12 35
5 01 pink 10 40
6 01 pink 10 40
I need to create a new df2 with an index that is an aggregate of 3 columns color-size-rate, then groupby 'period' and get the count of unique ids.
My final df should be have the following structure:
index period count
red-12-30 01 1
red-12-30 02 1
blue-12-35 01 2
blue-12-35 03 1
blue-12-35 02 1
pink-10-40 01 2
Thank you in advance for your help.
try .agg('-'.join) and .groupby
df1 = (df.groupby([df[["color", "size", "rate"]].astype(str)
                     .agg("-".join, axis=1).rename("index"), "period"])
         .agg(count=("id", "nunique"))
         .reset_index())
print(df1)
index period count
0 blue-12-35 1 2
1 blue-12-35 2 1
2 blue-12-35 3 1
3 pink-10-40 1 2
4 red-12-30 1 1
5 red-12-30 2 1
You can also achieve this with a single groupby:
df2 = df1.groupby(['color', 'size', 'rate', 'period'])['id'].nunique().reset_index(name='count')
df2['index'] = df2.apply(lambda x: '-'.join([str(x['color']), str(x['size']), str(x['rate'])]), axis=1)

Pandas DataFrame Return Value from Column Index

I have a dataframe that holds column numbers for another dataframe. Is there a way I can return the value from the other dataframe instead of just having the column index?
I basically want to match up the index between the Push and df dataframes. The values in the Push dataframe indicate which column I want to return from the df dataframe.
Push dataframe:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
df dataframe:
0 1 2 3 4
0 10 11 22 33 44
1 10 11 22 33 44
2 10 11 22 33 44
3 10 11 22 33 44
4 10 11 22 33 44
return:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
You can do it with np.take; however, this function works on the flattened array, so push must be shifted like this:
# shift each row's column indices into flat (row-major) offsets
push1 = push.values + np.arange(0, 25, 5)[:, None]
pd.DataFrame(df.values.take(push1))
EDIT
No, I just reinvented np.choose:
In [24]: df
Out[24]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 20 21 22 23 24
3 30 31 32 33 34
4 40 41 42 43 44
In [25]: push
Out[25]:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
In [27]: np.choose(push.T,df).T
Out[27]:
0 1
0 1 2
1 10 13
2 20 23
3 31 33
4 40 42
Using melt then replace; notice df1 is your push and df2 is your df:
df1.astype(str).replace(df2.melt().drop_duplicates()
                           .set_index('variable')['value'].to_dict())
Out[31]:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
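A third option, sketched here on a reconstruction of the sample frames, is plain NumPy fancy indexing: pair each row number with the column indices stored in push and index the underlying array directly:

```python
import numpy as np
import pandas as pd

# reconstruction of the frames from the question
df = pd.DataFrame([[10, 11, 22, 33, 44]] * 5)
push = pd.DataFrame({0: [1, 0, 0, 1, 0], 1: [2, 3, 3, 3, 2]})

# broadcast the row numbers against the per-row column indices
rows = np.arange(len(df))[:, None]
out = pd.DataFrame(df.to_numpy()[rows, push.to_numpy()])
print(out)
```

Unlike np.take, this needs no manual flattening, and unlike np.choose it is not limited to 32 choices.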