Pandas, count rows per unique value - python

I have a pandas dataframe like this:
c1 c2
0 A red
1 B blue
2 B blue
3 C red
4 C red
5 C blue
6 D blue
All I want to do is find out how many red/blue values there are for each value in c1. Something like this:
red blue
A 1 0
B 0 2
C 2 1
D 0 1
I tried using masks and groupby() but failed to come up with a solution. The main constraint is that I am not allowed to use loops. It feels like there is an obvious solution, but I'm not that good with pandas :/ Any advice?

A simple groupby with value_counts:
df.groupby('c1')['c2'].value_counts().unstack(fill_value=0)
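For reference, a minimal runnable sketch with the example data from the question; pd.crosstab builds the same kind of contingency table in a single call:
import pandas as pd

# Reproduce the example frame from the question.
df = pd.DataFrame({
    'c1': ['A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'c2': ['red', 'blue', 'blue', 'red', 'red', 'blue', 'blue'],
})

# Count c2 values per c1 group and pivot each c2 value into its own column.
print(df.groupby('c1')['c2'].value_counts().unstack(fill_value=0))

# Equivalent one-liner.
print(pd.crosstab(df['c1'], df['c2']))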

Alternatively, groupby with the size of the groups...
df.groupby(['c1','c2']).size()
Output:
c1 c2
A red 1
B blue 2
C blue 1
red 2
D blue 1
dtype: int64
It's not exactly as you wanted, but gives you the important information. You said "Something like this"...
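If the exact wide shape from the question is needed, the same grouped sizes can be reshaped with unstack. A small sketch, reusing the example data:
import pandas as pd

df = pd.DataFrame({
    'c1': ['A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'c2': ['red', 'blue', 'blue', 'red', 'red', 'blue', 'blue'],
})

# Unstacking the grouped sizes yields one zero-filled column per color.
print(df.groupby(['c1', 'c2']).size().unstack(fill_value=0))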

Related

How to replace empty value with value from another row

I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I prefer not to replace the empty cells with values from the previous row. Instead, I would like to specify an id and fill the empty cells from the rows that have that specific id.
IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
This only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1. If that is not the intention, please expand on the question.
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
Result df: identical to the desired output shown in the question.
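For completeness, a runnable sketch of this second approach, assuming the example frame from the question:
import pandas as pd

df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3],
                        'id2': [1, 1, 1, 2, 2, 2],
                        'color': ['red', '', 'green', 'yellow', '', 'blue'],
                        'rate': ['good', '', 'good', 'average', '', 'good']})

# Rows with an empty color receive the values from the rows where id == 1;
# this relies on both selections having the same length and order.
empty_rows = df.loc[df['color'] == '', ['color', 'rate']].index
df.loc[empty_rows, ['color', 'rate']] = df.loc[df['id'] == 1, ['color', 'rate']].values
print(df)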

Sampling of a pandas dataframe with groupby on one column such that the sample has no duplicates in another column

Assume the below dummy dataframe:
a b
0 red 0
1 red 1
2 red 3
3 blue 0
4 blue 1
5 blue 3
6 black 4
7 black 2
I want to sample on column a with a sample size of 1 per group. The only condition is that the values of column b in the resulting sample must be unique, so that it gives me a result something like:
a b
2 red 0
3 blue 1
5 black 4
The below result is not acceptable:
a b
2 red 1
3 blue 1
5 black 4
Right now I am using the code below with the pandas sample function, but it returns duplicate values in column b:
df.groupby('a').sample(n=1)
sample(n=1) selects one row at random from each group of column 'a'.
You can use df.groupby('a').first(), which chooses the first row of each group and would return the output you are looking for.
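Note that first() is deterministic, and with the dummy data above it picks b = 0 for both red and blue, so uniqueness in column b is still not guaranteed. If a random sample with the uniqueness constraint is what's needed, a simple rejection-sampling sketch (assuming a duplicate-free combination exists, as it does here) could look like this:
import pandas as pd

df = pd.DataFrame({'a': ['red', 'red', 'red', 'blue', 'blue', 'blue', 'black', 'black'],
                   'b': [0, 1, 3, 0, 1, 3, 4, 2]})

# Redraw one row per group until the sampled 'b' values are all distinct.
while True:
    sample = df.groupby('a').sample(n=1)
    if sample['b'].is_unique:
        break
print(sample)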

Using groupby() and value_counts()

The goal is to identify the count of colors within each group.
As the desired outcome below shows, the first group's color, blue, appears 3 times; the second group's yellow appears 2 times; and the third group's blue appears 5 times.
So far this is what I got.
df.groupby(['X','Y','Color']).Color.value_counts()
but this produces only a count of 1 for every row, since each row's color appears exactly once within its own (X, Y, Color) group.
The final output should look like the Color/Count table shown in the answer below.
Thanks in advance for any assistance.
If you pass 'size' to transform, the per-group counts are broadcast back to every row, so the result stays in tabular form without collapsing the groups.
df['Count'] = df.groupby(['X','Y']).Color.transform('size')
df.set_index(['X','Y'], inplace=True)
df
Color Count
X Y
A B Blue 3
B Blue 3
B Blue 3
C D Yellow 2
D Yellow 2
E F Blue 5
F Blue 5
F Blue 5
F Blue 5
F Blue 5
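For reference, a minimal sketch that reconstructs a frame consistent with the output above (the exact X, Y and Color values are assumptions inferred from that table):
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'A', 'A', 'C', 'C', 'E', 'E', 'E', 'E', 'E'],
    'Y': ['B', 'B', 'B', 'D', 'D', 'F', 'F', 'F', 'F', 'F'],
    'Color': ['Blue', 'Blue', 'Blue', 'Yellow', 'Yellow',
              'Blue', 'Blue', 'Blue', 'Blue', 'Blue'],
})

# Broadcast each (X, Y) group's size back onto its rows.
df['Count'] = df.groupby(['X', 'Y']).Color.transform('size')
print(df.set_index(['X', 'Y']))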

How to convert straight forward a dataframe column into a dataframe with column values as column indexes?

Sorry if the title is not very descriptive; I was not able to figure out a better description.
I hope the example will help to explain my question.
I have one dataframe with one column:
import pandas as pd
df=pd.DataFrame(data=[1,2,2,3,3,1],index=(('blue','A'), ('blue','B'),('red','A'), ('red','B'),('black','A'), ('black','B')))
0
blue A 1
B 2
red A 2
B 3
black A 3
B 1
I want to transform the column into a dataframe whose column labels are the values of the original column. This might be the result:
Out[14]:
1 2 3
blue A 1 0 0
B 0 2 0
red A 0 2 0
B 0 0 3
black A 0 0 3
B 1 0 0
It would also be fine for me to get True/False values, whichever method is more straightforward.
Thanks in advance
Run:
result = pd.get_dummies(df[0])
and you will get:
1 2 3
blue A 1 0 0
B 0 1 0
red A 0 1 0
B 0 0 1
black A 0 0 1
B 1 0 0
Values other than 1 are not needed, because the "true" source value is in the column name.
If you want this result as a boolean DataFrame, append .astype(bool) to the above code.
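A small end-to-end sketch using the frame from the question; note that newer pandas versions return boolean dummies by default, in which case .astype(int) converts back to 0/1:
import pandas as pd

df = pd.DataFrame(
    data=[1, 2, 2, 3, 3, 1],
    index=[('blue', 'A'), ('blue', 'B'), ('red', 'A'),
           ('red', 'B'), ('black', 'A'), ('black', 'B')],
)

# One indicator column per distinct value of df[0].
result = pd.get_dummies(df[0])

# Integer 0/1 view and boolean view, whichever is more convenient.
print(result.astype(int))
print(result.astype(bool))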

How to repeatedly loop through list to assign values

I have two pandas data frames. Within df1 I have a string column with a finite list of unique values. I want to make those values a list, then loop through it and append a new column onto df2. The values would cycle through the list, starting over from the beginning, for the entire length of the second data frame.
df1
my_value
0 A
1 B
2 C
df2
color
0 red
1 orange
2 yellow
3 green
4 blue
5 indigo
6 violet
7 maroon
8 brown
9 black
What I want
color my_value
0 red A
1 orange B
2 yellow C
3 green A
4 blue B
5 indigo C
6 violet A
7 maroon B
8 brown C
9 black A
# create list
my_list = pd.Series(df1.my_value.values).to_list()

# create column
my_new_column = []
for i in range(len(df2)):
    assigned_value = my_list[i]
    my_new_column.append(assigned_value)

df2['my_new_column'] = my_new_column
return df2
The list and the range have different lengths, which is where I'm getting hung up (my_list[i] raises an IndexError once i goes past the end of the list).
This is probably super straightforward and I'm completely overlooking the solution; please feel free to link me to another question if this is answered elsewhere. Thanks for your input!
You can use zip with itertools.cycle() to cycle through the shorter list/Series:
import itertools
import pandas as pd

df1 = pd.Series(data=['a', 'b', 'c'], name='my_values')
df2 = pd.Series(data=['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet', 'maroon', 'brown', 'black'], name='color')

# zip stops when df2 is exhausted, while cycle(df1) repeats 'a', 'b', 'c' indefinitely.
df2 = pd.concat([df2, pd.Series([b for a, b in zip(df2, itertools.cycle(df1))], name='my_value')], axis=1)
df2
df2
color my_value
0 red a
1 orange b
2 yellow c
3 green a
4 blue b
5 indigo c
6 violet a
7 maroon b
8 brown c
9 black a
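An alternative sketch without itertools, using NumPy modular indexing to tile df1's values over the length of df2 (same example data, rebuilt here as DataFrames):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'my_value': ['A', 'B', 'C']})
df2 = pd.DataFrame({'color': ['red', 'orange', 'yellow', 'green', 'blue',
                              'indigo', 'violet', 'maroon', 'brown', 'black']})

# Positions 0..len(df2)-1 wrapped modulo len(df1) repeat the short list cyclically.
df2['my_value'] = df1['my_value'].to_numpy()[np.arange(len(df2)) % len(df1)]
print(df2)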
