In this dataframe, I'm trying to count how many NaNs there are for each color in the color column.
This is what the sample data looks like; in reality, there are 100k rows.
color value
0 blue 10
1 blue NaN
2 red NaN
3 red NaN
4 red 8
5 red NaN
6 yellow 2
I'd like the output to look like this:
color count
0 blue 1
1 red 3
2 yellow 0
You can use DataFrame.isna, group by the color column, and sum to add up all the True rows in each group:
df.value.isna().groupby(df.color).sum().reset_index()
color value
0 blue 1.0
1 red 3.0
2 yellow 0.0
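The boolean sums come back as floats and the result column keeps the name value; a small follow-up sketch (same idea as above) casts to integers and renames the column to match the desired output:
out = (df['value'].isna()
         .groupby(df['color'])
         .sum()
         .astype(int)                  # boolean sums back to plain integers
         .reset_index(name='count'))   # index 'color' back to a column, values named 'count'
print(out)
#     color  count
# 0    blue      1
# 1     red      3
# 2  yellow      0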
You can also use agg() with isnull() (or isna()) as follows:
df.groupby('color').agg({'value': lambda x: x.isnull().sum()}).reset_index()
Use isna().sum()
df.groupby('color').value.apply(lambda x: x.isna().sum())
color
blue 1
red 3
yellow 0
You can also use size and count: the difference between the group size and the non-NaN count is the number of NaNs per group.
g=df.groupby('color')['value']
g.size()-g.count()
Out[115]:
color
blue 1
red 3
yellow 0
Name: value, dtype: int64
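If you also want the tidy two-column frame from the question, the same difference can be wrapped up like this (a small sketch reusing g from above):
out = (g.size() - g.count()).reset_index(name='count')
print(out)
#     color  count
# 0    blue      1
# 1     red      3
# 2  yellow      0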
I'm aiming to drop duplicate values in a df. However, I want to drop rows where two values in separate columns are the same. Using the data below, I want to drop rows where Value and Item are duplicates, but keep the row where df['Group1'] == df['Group'].
Note: df = df.drop_duplicates(['Value', 'Item']) will not always be ideal, as it depends on the ordering in Group. For instance, for the duplicates found in Items 80.0 and 260.0 the first row should be kept, but for Item 300.0 the second row should be kept. I don't want to sort values here either, as the strings could change. For example, the groups could be Blue and Green, which would alter the intended output.
df = pd.DataFrame({
'Value' : ['X','X','Y','Z','D','D','E','E','X'],
'Item' : [80.0,80.0,200.0,210.0,260.0,260.0,300.0,300.0,310.0],
'Group' : ['Red','Green','Red','Green','Red','Green','Green','Red','Green'],
'Group1' : ['Red','Red','Red','Red','Red','Red','Red','Red','Red'],
'Group2' : ['Green','Green','Green','Green','Green','Green','Green','Green','Green'],
})
df = df[df['Group1'] == df['Group']].drop_duplicates(subset = ['Item','Value'])
If I perform df = df.drop_duplicates(['Value', 'Item']), the output is:
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
2 Y 200.0 Red Red Green
3 Z 210.0 Green Red Green
4 D 260.0 Red Red Green
6 E 300.0 Green Red Green # incorrect
8 X 310.0 Green Red Green
intended output:
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
1 Y 200.0 Red Red Green
2 Z 210.0 Green Red Green
3 D 260.0 Red Red Green
4 E 300.0 Red Red Green
5 X 310.0 Green Red Green
df1 = df.drop_duplicates(subset = ['Item','Value'])
df2 = df[df['Group'] == df['Group1']]
Dataframe df1 drops duplicate rows on the columns Item and Value.
Dataframe df2 keeps the rows where the values in the columns Group and Group1 are the same.
I want to keep the row where df['Group1'] == df['Group'].
The one thing left to do is to replace values in dataframe df1 with the values from dataframe df2 wherever both their Item and Value column values are the same.
pandas.DataFrame.update() can modify in place using non-NA values from another DataFrame. You can use it like:
df1.set_index(['Value', 'Item'], inplace=True)
df1.update(df2.set_index(['Value', 'Item']))
df1.reset_index(inplace=True) # to recover the initial structure
# print(df1)
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
1 Y 200.0 Red Red Green
2 Z 210.0 Green Red Green
3 D 260.0 Red Red Green
4 E 300.0 Red Red Green
5 X 310.0 Green Red Green
Besides update, you can use the index of dataframe df2 to slice df1 and then assign:
df1.set_index(['Value', 'Item'], inplace=True)
df2.set_index(['Value', 'Item'], inplace=True)
df1.loc[df2.index] = df2
df1.reset_index(inplace=True)
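Putting the steps together as one runnable sketch (the same update approach as above, just collected, starting again from the original df and restoring the original column order at the end):
df1 = df.drop_duplicates(subset=['Item', 'Value']).set_index(['Value', 'Item'])
df2 = df[df['Group'] == df['Group1']].set_index(['Value', 'Item'])
df1.update(df2)                      # overwrite matching (Value, Item) rows with df2's values
out = df1.reset_index()[df.columns]  # back to a flat frame in the original column order
print(out)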
df = (pd.concat([df[df.Group == df.Group1], df[df.Group != df.Group1]])
        .drop_duplicates(subset=['Item', 'Value'])
        .sort_index())
Concatenating the matching rows (Group == Group1) first means drop_duplicates keeps them in preference to the non-matching ones; sort_index then restores the original row order.
Output
Value Item Group Group1 Group2
0 X 80.0 Red Red Green
2 Y 200.0 Red Red Green
3 Z 210.0 Green Red Green
4 D 260.0 Red Red Green
7 E 300.0 Red Red Green
8 X 310.0 Green Red Green
I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I prefer not to fill the empty cells with values from the previous row. Instead, I would like to specify an id and replace the empty cells with values from the rows that have that id.
IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
This only works if the number of rows with empty values in [color, rate] is equal to the number of rows with id == 1. Please expand on the question if this is not the intention.
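If the positions don't line up (the caveat above), here is a sketch that looks up the reference row by id2 explicitly; it assumes each id2 group has exactly one row with id == 1:
ref = df[df['id'] == 1].set_index('id2')[['color', 'rate']]   # reference rows to copy from
mask = df['color'].eq('')                                     # rows with empty cells
df.loc[mask, ['color', 'rate']] = ref.reindex(df.loc[mask, 'id2']).to_numpy()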
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
The resulting df matches the intended output shown above.
Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a with a sample size of 1, but with the condition that the values of column b in the resulting sample are unique, so that it gives me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the below code with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects a row at random from each group in column 'a'.
You can use df.groupby('a').first(), which chooses the first row of each group and would return the output you are looking for.
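A minimal usage sketch of the suggested approach (note that first() is deterministic, unlike sample):
out = df.groupby('a', as_index=False).first()   # first row of each 'a' group, 'a' kept as a column
print(out)
#        a  b
# 0  black  4
# 1   blue  0
# 2    red  0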
I have a Pandas dataframe like this:
id color size test
0 0 blue medium 1
1 1 blue small 2
2 5 blue small 4
3 2 blue big 3
4 3 red small 4
5 4 red small 5
My desired output is this:
color size
blue small
red small
I've tried:
df = df[['id', 'color', 'size']]
df = df.groupby(['color'])['size'].value_counts()
and get this:
color size
blue small 2
big 1
medium 1
red small 2
Name: size, dtype: int64
but it turns into a series and the columns seem all messed up.
Basically, for each of the groups of 'color', I want the 'size' with the highest frequency. I'm really having a lot of trouble with this. Any suggestions? Thanks so much.
We can use sort_values on the group sizes and then take the tail of each group:
s=df.groupby(['color','size']).size().sort_values().groupby(level=0).tail(1).reset_index()
color size 0
0 blue small 2
1 red small 2
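If you only want the color/size pairs to match the desired output, one small variation on the answer above is to name the count column and then drop it (a sketch):
out = (df.groupby(['color', 'size']).size()
         .sort_values()
         .groupby(level=0).tail(1)            # keep the most frequent size per color
         .reset_index(name='count')[['color', 'size']])
print(out)
#   color   size
# 0  blue  small
# 1   red  small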
I have a df like this:
Key Class
1 Green
1 NaN
1 NaN
2 Red
2 NaN
2 NaN
and I want to forward fill Class. I'm using this code:
df=df.Class.fillna(method='ffill')
and this returns:
Green
Green
Green
Red
Red
Red
how can I retain the Key column while doing this?
df['Class'] = df.Class.fillna(method='ffill')
In your code you're assigning the result (a Series) to the whole dataframe; instead, you need to assign it only to the Class column.
or another way is to do the following
In [126]:
df.ffill()
Out[126]:
Key Class
0 1 Green
1 1 Green
2 1 Green
3 2 Red
4 2 Red
5 2 Red
You can also set the inplace parameter to True if you don't want to create a new copy of your df:
df.ffill(inplace=True)
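If Class should only be filled within each Key, so that a value never leaks from one key into the next (an assumption on my part; with the sample data the result is identical), you can forward fill per group instead:
df['Class'] = df.groupby('Key')['Class'].ffill()   # ffill restarts at each Key group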