Using groupby() and value_counts() - python

The goal is to count the occurrences of each color within each group.
In the outcome below, the first group's color, Blue, appears 3 times; the second group's Yellow appears 2 times; and the third group's Blue appears 5 times.
So far this is what I have:
df.groupby(['X','Y','Color']).Color.value_counts()
but this produces only counts of 1, since the color in each row appears once.
The final output should be like this:
       Color  Count
X Y
A B     Blue      3
  B     Blue      3
  B     Blue      3
C D   Yellow      2
  D   Yellow      2
E F     Blue      5
  F     Blue      5
  F     Blue      5
  F     Blue      5
  F     Blue      5
Thanks in advance for any assistance.

If you pass 'size' to the transform function, the per-group count is broadcast back to every row, so the result stays in tabular form without collapsing the groups.
df['Count'] = df.groupby(['X','Y']).Color.transform('size')
df.set_index(['X','Y'], inplace=True)
df
       Color  Count
X Y
A B     Blue      3
  B     Blue      3
  B     Blue      3
C D   Yellow      2
  D   Yellow      2
E F     Blue      5
  F     Blue      5
  F     Blue      5
  F     Blue      5
  F     Blue      5
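
For reference, here is a minimal, self-contained sketch of the above; the input data is reconstructed from the output shown, so treat it as an assumption:

import pandas as pd

# Hypothetical data inferred from the output shown above.
df = pd.DataFrame({
    'X': ['A'] * 3 + ['C'] * 2 + ['E'] * 5,
    'Y': ['B'] * 3 + ['D'] * 2 + ['F'] * 5,
    'Color': ['Blue'] * 3 + ['Yellow'] * 2 + ['Blue'] * 5,
})

# transform('size') broadcasts each group's row count back onto every row.
df['Count'] = df.groupby(['X', 'Y']).Color.transform('size')
print(df.set_index(['X', 'Y']))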

Related

Pandas, count rows per unique value

I have a pandas dataframe like this:
c1 c2
0 A red
1 B blue
2 B blue
3 C red
4 C red
5 C blue
6 D blue
All I want to do is find out how many red/blue values there are per all values in c1. Something like this:
   red  blue
A    1     0
B    0     2
C    2     1
D    0     1
I tried using masks and groupby() but failed to come up with a solution; the main constraint is that I am not allowed to use loops. It feels like there is an obvious solution, but I'm not that good at using pandas :/ Any advice?
Simple groupby with value_counts.
df.groupby('c1')['c2'].value_counts().unstack(fill_value=0)
Alternatively, groupby with the size of the groups...
df.groupby(['c1','c2']).size()
Output:
c1  c2
A   red     1
B   blue    2
C   blue    1
    red     2
D   blue    1
dtype: int64
It's not exactly as you wanted, but gives you the important information. You said "Something like this"...
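
If you want exactly the table from the question (colors as columns, zeros filled in), pd.crosstab is another option; a minimal sketch, with the frame rebuilt from the question's data:

import pandas as pd

df = pd.DataFrame({
    'c1': ['A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'c2': ['red', 'blue', 'blue', 'red', 'red', 'blue', 'blue'],
})

# Cross-tabulate c1 against c2; absent combinations are filled with 0.
print(pd.crosstab(df['c1'], df['c2']))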

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a with a sample size of 1, with the one condition that the values of column b in the resulting sample are unique, giving me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the code below with pandas' sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group in column 'a'.
You can use df.groupby('a').first(), which takes the first row of each group instead of a random one.
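
Note, though, that first() is deterministic and on this data picks b=0 for both red and blue, so it does not by itself guarantee uniqueness in b. A minimal sketch of one way to actually enforce unique b values, shuffling and then de-duplicating (this is a different approach from first(), and an unlucky shuffle can still leave a group without a row):

import pandas as pd

df = pd.DataFrame({
    'a': ['red', 'red', 'red', 'blue', 'blue', 'blue', 'black', 'black'],
    'b': [0, 1, 3, 0, 1, 3, 4, 2],
})

# Shuffle all rows, keep the first occurrence of each b, then take one
# row per group in 'a'. Uniqueness in b is enforced by drop_duplicates.
result = (
    df.sample(frac=1)
      .drop_duplicates('b')
      .groupby('a')
      .head(1)
)
print(result)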

pandas getting highest frequency value for each group in another column

I have a Pandas dataframe like this:
id color size test
0 0 blue medium 1
1 1 blue small 2
2 5 blue small 4
3 2 blue big 3
4 3 red small 4
5 4 red small 5
My desired output is this:
color size
blue small
red small
I've tried:
df = df[['id', 'color', 'size']]
df = df.groupby(['color'])['size'].value_counts()
and get this:
color  size
blue   small     2
       big       1
       medium    1
red    small     2
Name: size, dtype: int64
but it turns into a series and the columns seem all messed up.
Basically, for each of the groups of 'color', I want the 'size' with the highest frequency. I'm really having a lot of trouble with this. Any suggestions? Thanks so much.
We can sort the group sizes with sort_values, then take the tail of each group:
s = df.groupby(['color','size']).size().sort_values().groupby(level=0).tail(1).reset_index()
  color   size  0
0  blue  small  2
1   red  small  2
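
An alternative sketch that picks each group's most frequent size directly with value_counts().idxmax() (ties resolve to whichever value value_counts lists first); the frame is rebuilt from the question's data:

import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 5, 2, 3, 4],
    'color': ['blue', 'blue', 'blue', 'blue', 'red', 'red'],
    'size': ['medium', 'small', 'small', 'big', 'small', 'small'],
})

# For each color, keep the size with the highest frequency.
result = df.groupby('color')['size'].agg(lambda s: s.value_counts().idxmax())
print(result)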

How to repeatedly loop through list to assign values

I have two pandas data frames. Within df1 I have a string column with a finite list of unique values. I want to make those values a list, then loop through and append a new column onto df2. The value would loop through the list and then start over for the entire range of the second data frame.
df1
my_value
0 A
1 B
2 C
df2
color
0 red
1 orange
2 yellow
3 green
4 blue
5 indigo
6 violet
7 maroon
8 brown
9 black
What I want
color my_value
0 red A
1 orange B
2 yellow C
3 green A
4 blue B
5 indigo C
6 violet A
7 maroon B
8 brown C
9 black A
# create the list
my_list = pd.Series(df1.my_value.values).to_list()
# build the new column
my_new_column = []
for i in range(len(df2)):
    assigned_value = my_list[i]
    my_new_column.append(assigned_value)
df2['my_new_column'] = my_new_column
The list and the range have differing lengths, which is where I'm getting hung up.
This is probably super straightforward and I'm looking right past the solution; please feel free to link me to another question if this is answered elsewhere. Thanks for your input!
You can use zip with itertools.cycle() to cycle through the shorter list/Series:
import itertools
import pandas as pd

df1 = pd.Series(data=['a', 'b', 'c'], name='my_values')
df2 = pd.Series(data=['red', 'orange', 'yellow', 'green', 'blue',
                      'indigo', 'violet', 'maroon', 'brown', 'black'], name='color')

# zip stops at the shorter iterable, and cycle(df1) repeats a, b, c forever.
df2 = pd.concat([df2, pd.Series([b for a, b in zip(df2, itertools.cycle(df1))], name='my_value')], axis=1)
df2
color my_value
0 red a
1 orange b
2 yellow c
3 green a
4 blue b
5 indigo c
6 violet a
7 maroon b
8 brown c
9 black a
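
A loop-free alternative sketch using modulo indexing instead of itertools (the same cycling idea, assuming df1 and df2 are DataFrames as in the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'my_value': ['A', 'B', 'C']})
df2 = pd.DataFrame({'color': ['red', 'orange', 'yellow', 'green', 'blue',
                              'indigo', 'violet', 'maroon', 'brown', 'black']})

# Row i of df2 gets element i % 3 of df1, cycling A, B, C.
values = df1['my_value'].to_numpy()
df2['my_value'] = values[np.arange(len(df2)) % len(values)]
print(df2)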

Concatenate two pandas dataframe with the same index but on different positions

I have a data frame like
id value_right color_right value_left color_left
1 asd red dfs blue
2 dfs blue afd green
3 ccd yellow asd blue
4 hty red hrr red
I need to get the left values below the right values, something like
id value color
1 asd red
1 dfs blue
2 dfs blue
2 afd green
3 ccd yellow
3 asd blue
4 hty red
4 hrr red
I tried splitting into two data frames and interleaving them by id, using the modulus of the id value, but I ended up with only half of the data. Any ideas?
Take views of the desired left and right side dfs, rename the columns, then concat them and sort on the 'id' column:
In [205]:
left = df[['id','value_left','color_left']].rename(columns={'value_left':'value','color_left':'color'})
right = df[['id','value_right','color_right']].rename(columns={'value_right':'value','color_right':'color'})
merged = pd.concat([right, left]).sort_values('id', kind='stable')  # sort_values replaces the removed .sort(); a stable sort keeps each right row above its left partner
merged
Out[205]:
   id value   color
0   1   asd     red
0   1   dfs    blue
1   2   dfs    blue
1   2   afd   green
2   3   ccd  yellow
2   3   asd    blue
3   4   hty     red
3   4   hrr     red
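
pd.wide_to_long can do the same reshape in one call; a sketch, assuming the column suffixes are exactly _right and _left:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value_right': ['asd', 'dfs', 'ccd', 'hty'],
    'color_right': ['red', 'blue', 'yellow', 'red'],
    'value_left': ['dfs', 'afd', 'asd', 'hrr'],
    'color_left': ['blue', 'green', 'blue', 'red'],
})

# Stack the _right/_left column pairs into rows keyed by id.
long = pd.wide_to_long(df, stubnames=['value', 'color'],
                       i='id', j='side', sep='_', suffix=r'\w+')
print(long.reset_index(level='side', drop=True).sort_index(kind='stable'))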
