pandas getting highest frequency value for each group in another column - python

I have a Pandas dataframe like this:
   id color    size  test
0   0  blue  medium     1
1   1  blue   small     2
2   5  blue   small     4
3   2  blue     big     3
4   3   red   small     4
5   4   red   small     5
My desired output is this:
color size
blue small
red small
I've tried:
df = df[['id', 'color', 'size']]
df = df.groupby(['color'])['size'].value_counts()
and get this:
color  size
blue   small     2
       big       1
       medium    1
red    small     2
Name: size, dtype: int64
but it turns into a series and the columns seem all messed up.
Basically, for each of the groups of 'color', I want the 'size' with the highest frequency. I'm really having a lot of trouble with this. Any suggestions? Thanks so much.

We can sort_values on the group sizes, then take the tail (the largest count) of each color group:
s=df.groupby(['color','size']).size().sort_values().groupby(level=0).tail(1).reset_index()
  color   size  0
0  blue  small  2
1   red  small  2
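An alternative sketch (my own suggestion, not part of the answer above): since value_counts counts occurrences within each group, idxmax picks the most frequent size directly and reset_index returns exactly the two columns the question asks for:
out = (df.groupby('color')['size']
         .agg(lambda s: s.value_counts().idxmax())  # size with the highest count per color
         .reset_index())
print(out)
  color   size
0  blue  small
1   red  small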

Related

How to replace empty value with value from another row

I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3],
                        'id2': [1, 1, 1, 2, 2, 2],
                        'color': ["red", "", "green", "yellow", "", "blue"],
                        'rate': ["good", "", "good", "average", "", "good"]})
   id  id2   color     rate
0   1    1     red     good
1   2    1
2   3    1   green     good
3   1    2  yellow  average
4   2    2
5   3    2    blue     good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
   id  id2   color     rate
0   1    1     red     good
1   2    1     red     good
2   3    1   green     good
3   1    2  yellow  average
4   2    2  yellow  average
5   3    2    blue     good
I prefer not to fill the empty cells with values from the previous row. I would like to specify an id and fill the empty cells from the rows that have that specific id.
IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
   id  id2   color     rate
0   1    1     red     good
1   2    1     red     good
2   3    1   green     good
3   1    2  yellow  average
4   2    2  yellow  average
5   3    2    blue     good
Note this only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1; please expand on the question if that is not the intention.
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
Result df:
   id  id2   color     rate
0   1    1     red     good
1   2    1     red     good
2   3    1   green     good
3   1    2  yellow  average
4   2    2  yellow  average
5   3    2    blue     good
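A related sketch (my own, under the assumption that, as in this example, each id2 group starts with its id == 1 row): broadcast that row's values with transform('first') and let .loc align on the index when assigning to the empty cells:
mask = df['color'].eq('')
df.loc[mask, ['color', 'rate']] = df.groupby('id2')[['color', 'rate']].transform('first')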

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
       a  b
0    red  0
1    red  1
2    red  3
3   blue  0
4   blue  1
5   blue  3
6  black  4
7  black  2
I want to sample one row per group in column a (a sample size of 1), with the condition that in the resulting sample the values of column b are unique, giving me a result something like:
       a  b
2    red  0
3   blue  1
5  black  4
The below result is not acceptable:
       a  b
2    red  1
3   blue  1
5  black  4
Right now I am using the code below with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group in column 'a'.
You can use df.groupby('a').first(), which takes the first row of each group and would return the output you are looking for.
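If the randomness of sample matters, a brute-force sketch (my own suggestion, not part of the answer; GroupBy.sample needs pandas >= 1.1) is to redraw until column b happens to be unique. Note this loops forever if no duplicate-free sample exists:
while True:
    sample = df.groupby('a').sample(n=1)
    if sample['b'].is_unique:  # accept the draw only when b has no duplicates
        break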

Using groupby() and value_counts()

The goal is to count how many times the color appears in each group. In the desired outcome, the first group's color (blue) appears 3 times, the second group's (yellow) appears 2 times, and the third group's (blue) appears 5 times.
So far this is what I got.
df.groupby(['X','Y','Color']).Color.value_counts()
but this produces only a count of 1, since the color for each row appears once.
The final output should be like this:
Thanks in advance for any assistance.
If you pass 'size' to the transform function, the per-group count is broadcast back to every row, so the result stays in tabular form without aggregating the rows away.
df['Count'] = df.groupby(['X','Y']).Color.transform('size')
df.set_index(['X','Y'], inplace=True)
df
       Color  Count
X Y
A B     Blue      3
  B     Blue      3
  B     Blue      3
C D   Yellow      2
  D   Yellow      2
E F     Blue      5
  F     Blue      5
  F     Blue      5
  F     Blue      5
  F     Blue      5
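An equivalent sketch (my own, starting from the original flat frame before set_index) that computes each group's size once with size() and merges it back, rather than broadcasting with transform:
counts = df.groupby(['X', 'Y']).size().reset_index(name='Count')  # one row per (X, Y) group
df = df.merge(counts, on=['X', 'Y'])  # attach the group count to every row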

Count the amount of NaNs in each group

In this dataframe, I'm trying to count how many NaNs there are for each color within the color column.
This is what the sample data looks like. In reality, there are 100k rows.
    color  value
0    blue     10
1    blue    NaN
2     red    NaN
3     red    NaN
4     red      8
5     red    NaN
6  yellow      2
I'd like the output to look like this:
    color  count
0    blue      1
1     red      3
2  yellow      0
You can use isna, group by the color column, and sum to add up all the True rows in each group:
df.value.isna().groupby(df.color).sum().reset_index()
    color  value
0    blue     1.0
1     red     3.0
2  yellow     0.0
Also you may use agg() and isnull() or isna() as follows:
df.groupby('color').agg({'value': lambda x: x.isnull().sum()}).reset_index()
Use isna().sum()
df.groupby('color').value.apply(lambda x: x.isna().sum())
color
blue      1
red       3
yellow    0
Another option uses size and count: size counts all rows in each group while count counts only the non-NaN values, so the difference is the number of NaNs.
g = df.groupby('color')['value']
g.size() - g.count()
Out[115]:
color
blue      1
red       3
yellow    0
Name: value, dtype: int64
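To get the exact two-column frame the question asks for, a small sketch combining the answers above renames the summed Series on the way out:
out = (df['value'].isna()
         .groupby(df['color'])
         .sum()
         .astype(int)  # ensure integer counts rather than floats
         .reset_index(name='count'))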

Create column in pandas based on two other columns and table

import pandas as pd

table = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     columns=['High', 'Middle', 'Low'],
                     index=['Blue', 'Green', 'Red'])
df = pd.DataFrame(data=[['High', 'Blue'],
                        ['High', 'Green'],
                        ['Low', 'Red'],
                        ['Middle', 'Blue'],
                        ['Low', 'Blue'],
                        ['Low', 'Red']],
                  columns=['A', 'B'])
>>> df
        A      B
0    High   Blue
1    High  Green
2     Low    Red
3  Middle   Blue
4     Low   Blue
5     Low    Red
>>> table
       High  Middle  Low
Blue      1       2    3
Green     4       5    6
Red       7       8    9
I'm trying to add a third column 'C' based on the values in the table, so the first row would get a value of 1, the second 4, etc.
If this were a one-dimensional lookup, I would convert the table to a dictionary and use df['C'] = df['A'].map(table). However, since this is two-dimensional, I can't figure out how to use map or apply.
Ideally I would convert the table to dictionary format so I can save it together with other dictionaries in a JSON file; however, this is not essential.
pandas lookup
table.lookup(df.B,df.A)
Out[248]: array([1, 4, 9, 2, 3, 9], dtype=int64)
#df['C'] = table.lookup(df.B, df.A)
Or df.apply(lambda x: table.loc[x['B'], x['A']], axis=1), though I personally do not like apply.
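Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a minimal replacement sketch uses Index.get_indexer on the underlying numpy array (assuming every label in df exists in table):
rows = table.index.get_indexer(df['B'])     # positional row indices for the B labels
cols = table.columns.get_indexer(df['A'])   # positional column indices for the A labels
df['C'] = table.to_numpy()[rows, cols]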
You can use a merge for this:
df2 = (df.merge(table.stack().reset_index(),
                left_on=['A', 'B'], right_on=['level_1', 'level_0'])
         .drop(['level_0', 'level_1'], axis=1)
         .rename(columns={0: 'C'}))
>>> df2
        A      B  C
0    High   Blue  1
1    High  Green  4
2     Low    Red  9
3     Low    Red  9
4  Middle   Blue  2
5     Low   Blue  3
