Create a matrix from two columns - python

I'm trying to create a matrix from two columns within an excel sheet. The first column is a key with multiple repeating instances and the second column references the different values tied to the key. I'd like to be able to create a matrix of all the values in the second column to reference the number of times they are paired together for all the key instances.
a b
1 red
1 blue
1 green
2 yellow
2 red
3 blue
3 green
3 yellow
and I'd like to turn this sample dataframe into
color red blue yellow green
red 0 1 1 1
blue 1 0 1 2
yellow 1 1 0 1
green 1 2 1 0
Essentially using column a as a groupby() to segment each key then making counts of the relationships encountered as a running tally. Can't quite figure out how to implement a pivot table or a cross tab to accomplish this (if that's even the best route).

Use how='cross' as a parameter of pd.merge (available since pandas 1.2). I assume you have no ('a', 'b') duplicates, such as two (1, red) rows.
out = (
    pd.merge(df, df, how='cross')
    .query('a_x == a_y & b_x != b_y')[['b_x', 'b_y']]
    .assign(dummy=1)
    .pivot_table(values='dummy', index='b_x', columns='b_y', aggfunc='count', fill_value=0)
    .rename_axis(index=None, columns=None)
)
print(out)
# Output:
blue green red yellow
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0

This looks like a self-join on a (an inner merge of the frame with itself), so I went with that:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                   'b': ['red', 'blue', 'green', 'yellow', 'red', 'blue', 'green', 'yellow']})
df_count = (df.merge(df, on='a')
              .groupby(['b_x', 'b_y']).count()
              .reset_index()
              .pivot(index='b_x', columns='b_y', values='a'))
np.fill_diagonal(df_count.values, 0)  # a color paired with itself is not a relationship
df_count.index.name = 'color'
df_count.columns.name = None
print(df_count)
blue green red yellow
color
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0

import numpy as np
import pandas as pd
s = pd.crosstab(df.a, df.b) # crosstabulate
s = s.T @ s # transpose and take dot product
np.fill_diagonal(s.values, 0) # Fill the diagonals with 0
print(s)
b blue green red yellow
b
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
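The first answer assumes there are no duplicate ('a', 'b') rows. The sketch below is a self-contained variant of the crosstab and dot-product idea that tolerates such duplicates by clipping the indicator matrix to 0/1 before multiplying. The clipping step and the names indicator and co are assumptions added for illustration, not part of the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3, 3, 3],
    "b": ["red", "blue", "green", "yellow", "red", "blue", "green", "yellow"],
})

# Indicator matrix: one row per key, one column per color.
# clip(upper=1) keeps a duplicated (a, b) row from being counted twice.
indicator = pd.crosstab(df["a"], df["b"]).clip(upper=1)

# (colors x keys) @ (keys x colors): number of keys shared by each pair of colors.
co = indicator.T @ indicator

# Zero the diagonal on a copy, so the DataFrame's own buffer is not mutated in place.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
co = pd.DataFrame(values, index=co.index, columns=co.columns)
print(co)
```

Rebuilding the frame from a copied array avoids relying on in-place writes through .values, which may not be permitted in every pandas configuration.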

Related

Pandas Groupby Based on Values in Multiple Columns

I have a dataframe that I am trying to use pandas.groupby on to get the cumulative sum. The values that I am grouping by show up in two different columns, and I am having trouble getting the groupby to work correctly. My starting dataframe is:
df = pd.DataFrame({'col_A': ['red', 'red', 'blue', 'red'], 'col_B': ['blue', 'red', 'blue', 'red'], 'col_A_qty': [1, 1, 1, 1], 'col_B_qty': [1, 1, 1, 1]})
col_A col_B col_A_qty col_B_qty
red blue 1 1
red red 1 1
blue blue 1 1
red red 1 1
The result I am trying to get is:
col_A col_B col_A_qty col_B_qty red_cumsum blue_cumsum
red blue 1 1 1 1
red red 1 1 3 1
blue blue 1 1 3 3
red red 1 1 5 3
I've tried:
df.groupby(['col_A', 'col_B'])['col_A_qty'].cumsum()
but this groups on the combination of col_A and col_B. How can I use pandas.groupby to calculate the cumulative sum of red and blue, regardless of if it's in col_A or col_B?
Try two pivots:
out = pd.pivot(df,columns='col_A',values='col_A_qty').fillna(0).cumsum().add(pd.pivot(df,columns='col_B',values='col_B_qty').fillna(0).cumsum(),fill_value=0)
Out[404]:
col_A blue red
0 1.0 1.0
1 1.0 3.0
2 3.0 3.0
3 3.0 5.0
df = df.join(out)
A simple method is to define each cumsum column by two Series.cumsum, as follows:
df['red_cumsum'] = df['col_A'].eq('red').cumsum() + df['col_B'].eq('red').cumsum()
df['blue_cumsum'] = df['col_A'].eq('blue').cumsum() + df['col_B'].eq('blue').cumsum()
In each of col_A and col_B, check for values equal to 'red' / 'blue' (the results are boolean Series). Then apply Series.cumsum to these boolean Series to get the cumulative counts. This counts occurrences, which matches the desired output here because every quantity is 1. You don't really need pandas.groupby for this use case.
If you have multiple items in col_A and col_B, you can also iterate through the unique item list, as follows:
for item in np.unique(df[['col_A', 'col_B']]):
    df[f'{item}_cumsum'] = df['col_A'].eq(item).cumsum() + df['col_B'].eq(item).cumsum()
Result:
print(df)
col_A col_B col_A_qty col_B_qty red_cumsum blue_cumsum
0 red blue 1 1 1 1
1 red red 1 1 3 1
2 blue blue 1 1 3 3
3 red red 1 1 5 3
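If the quantity columns can hold values other than 1, counting matches is no longer the same as summing quantities. A minimal sketch that weights each match by its quantity, assuming that is the intended meaning of the cumulative sum (the names items, a_part and b_part are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "col_A": ["red", "red", "blue", "red"],
    "col_B": ["blue", "red", "blue", "red"],
    "col_A_qty": [1, 1, 1, 1],
    "col_B_qty": [1, 1, 1, 1],
})

# Every distinct item that appears in either column, in order of first appearance.
items = pd.unique(df[["col_A", "col_B"]].to_numpy().ravel())

for item in items:
    # Keep the quantity where the column matches the item, otherwise contribute 0.
    a_part = df["col_A_qty"].where(df["col_A"].eq(item), 0)
    b_part = df["col_B_qty"].where(df["col_B"].eq(item), 0)
    df[f"{item}_cumsum"] = (a_part + b_part).cumsum()

print(df)
```

With the sample data, where every quantity is 1, this gives the same columns as the boolean version.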

How to replace empty value with value from another row

I am dealing with a dataframe that has some empty cells. For example:
df = pd.DataFrame(data={'id': [1, 2, 3, 1, 2, 3], 'id2': [1,1,1,2,2,2], 'color': ["red", "", "green", "yellow", "", "blue"], 'rate':["good","","good","average","","good"]})
id id2 color rate
0 1 1 red good
1 2 1
2 3 1 green good
3 1 2 yellow average
4 2 2
5 3 2 blue good
For both the columns "color" and "rate", I would like to replace the empty rows with values from another row where id is 1. Therefore, I would like the dataframe to look like this in the end:
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
I prefer not to replace the empty cells with values from the previous row. I would like to indicate the id and replace the empty cells with rows that have the specific id.
IIUC you can groupby and transform with first, and finally assign to empty values:
df.loc[df["color"].eq(""), ["color", "rate"]] = df.groupby(df["id"].eq(1).cumsum())[["color","rate"]].transform("first")
print (df)
id id2 color rate
0 1 1 red good
1 2 1 red good
2 3 1 green good
3 1 2 yellow average
4 2 2 yellow average
5 3 2 blue good
It only works if the number of rows with empty values in [color, rate] equals the number of rows with id == 1. If that is not what you intend, please expand on the question.
If I understood you correctly:
empty_rows = df.loc[df['color']=='',['color','rate']].index
df.loc[empty_rows, ['color','rate']] = df.loc[df['id']==1,['color','rate']].values
The resulting df matches the desired output shown in the question.
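Another option is to look the replacement values up explicitly by block. This is only a sketch, and it assumes that id2 identifies the block each row belongs to and that every block has exactly one row with id == 1. The names cols, source, fill and empty are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={
    "id": [1, 2, 3, 1, 2, 3],
    "id2": [1, 1, 1, 2, 2, 2],
    "color": ["red", "", "green", "yellow", "", "blue"],
    "rate": ["good", "", "good", "average", "", "good"],
})
cols = ["color", "rate"]

# Assumption: each id2 block contains exactly one row with id == 1.
source = df.loc[df["id"].eq(1)].set_index("id2")[cols]

# For every row, fetch the id == 1 values of its block, relabelled to df's index.
fill = source.reindex(df["id2"]).set_axis(df.index)

# Overwrite only the rows whose color is empty.
empty = df["color"].eq("")
df.loc[empty, cols] = fill.loc[empty].to_numpy()
print(df)
```

Unlike a positional assignment, this does not require the number of empty rows to equal the number of id == 1 rows.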

Pandas count size of groupby groups idiomatically

I often want a dataframe of counts for how many members are in each group after a groupby operation in pandas. I have a verbose way of doing it with size and reset index and rename, but I'm sure there is a better way.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
np.random.seed(0)
colors = ['red','green','blue']
cdf = pd.DataFrame({
'color1':np.random.choice(colors,10),
'color2':np.random.choice(colors,10),
})
print(cdf)
#better way to do next line? (somehow use agg?)
gb_count = cdf.groupby(['color1','color2']).size().reset_index().rename(columns={0:'num'})
print(gb_count)
#cdf.groupby(['color1','color2']).count() #<-- this doesn't work
Final output:
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
To avoid getting your MultiIndex, use as_index=False:
cdf.groupby(['color1','color2'], as_index=False).size()
color1 color2 size
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
If you explicitly want to name the new column num, you can use reset_index with name=..., since groupby(...).size() returns a Series:
cdf.groupby(['color1','color2']).size().reset_index(name='num')
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
Another way is to compute the size in an agg operation, convert the result with to_frame (passing the preferred column name), and then reset the index:
gb_count = cdf.groupby(['color1','color2']).agg('size').to_frame('num').reset_index()
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
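DataFrame.value_counts (pandas 1.1 and later) can also count column combinations without an explicit groupby. The sketch below compares it with the groupby version on the question's data. The sort_index call is added so that rows are ordered by key rather than by count, and the names via_value_counts and via_groupby are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
colors = ["red", "green", "blue"]
cdf = pd.DataFrame({
    "color1": np.random.choice(colors, 10),
    "color2": np.random.choice(colors, 10),
})

# Count each (color1, color2) combination directly on the frame.
# sort_index() orders the result by the keys, as groupby does by default.
via_value_counts = (
    cdf.value_counts(["color1", "color2"])
       .sort_index()
       .reset_index(name="num")
)

# The groupby version from the answers above, for comparison.
via_groupby = cdf.groupby(["color1", "color2"]).size().reset_index(name="num")

print(via_value_counts)
```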

How to assign a new descriptive column while concatenating dataframes

I have two data frames that I want to concatenate in Python. However, I want to add another column, type, to distinguish the rows that come from each frame.
Here is my sample data:
import pandas as pd
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
columns=['numbers', 'colors'])
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']},
columns=['numbers', 'colors'])
pd.concat([df,df1])
This code will give me the following result:
numbers colors
0 1 red
1 2 white
2 3 blue
0 7 yellow
1 9 brown
2 9 blue
but what I would like to get is as follows:
numbers colors type
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
type column is going to help me to differentiate between the values of the two data frames.
Can anyone help me with this please?
Use DataFrame.assign for new columns:
df = pd.concat([df.assign(typ='first'),df1.assign(typ='second')])
print (df)
numbers colors typ
0 1 red first
1 2 white first
2 3 blue first
0 7 yellow second
1 9 brown second
2 9 blue second
Using a list comprehension:
df = pd.concat([d.assign(typ=f'id{i}') for i, d in enumerate([df, df1])], ignore_index=True)
numbers colors typ
0 1 red id0
1 2 white id0
2 3 blue id0
3 7 yellow id1
4 9 brown id1
5 9 blue id1
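A third possibility is to let pd.concat label the pieces through its keys argument and then copy that label into a column. This is a sketch rather than a recommendation. It names the column type as in the question, and the variable combined is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"numbers": [1, 2, 3], "colors": ["red", "white", "blue"]},
                  columns=["numbers", "colors"])
df1 = pd.DataFrame({"numbers": [7, 9, 9], "colors": ["yellow", "brown", "blue"]},
                   columns=["numbers", "colors"])

# keys= adds an outer index level that records which frame each row came from.
combined = pd.concat([df, df1], keys=["first", "second"])

# Copy the outer level into a regular column, then drop it from the index.
combined["type"] = combined.index.get_level_values(0)
combined = combined.droplevel(0)
print(combined)
```

The original row labels (0, 1, 2, 0, 1, 2) are kept, which matches the desired output in the question.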

How to return a dataframe of number of duplicates based on filtering the two columns in python

ID group categories
1 0 red
2 1 blue
3 1 green
4 1 green
1 0 blue
1 0 blue
2 1 red
3 0 red
4 0 red
4 1 red
Hi, I am new to Python. I am trying to get the count of duplicates in the ID column based on multiple conditions on the other two columns. So I filter on red and 0, and then I want the IDs that are repeated more than once.
df1 = df[(df['categories']=='red')& (df['group'] == 0)]
df1['ID'].value_counts()[df1['ID'].value_counts()>1]
There are almost 10 categories in the categories column, so I was wondering whether there is a way to write a function or for loop instead of repeating the same steps. The final goal is to see how many duplicate IDs there are in each group, given that the category is 'red'/'blue'/'green'. Thanks in advance.
P.S.: the group values don't change; group is a binary variable.
output
ID count
1 3
2 2
3 2
4 3
I think you can use groupby with SeriesGroupBy.value_counts:
s = df.groupby(['ID','group'])['categories'].value_counts()
print (s)
ID group categories
1 0 blue 2
red 1
2 1 blue 1
red 1
3 0 red 1
1 green 1
4 0 red 1
1 green 1
red 1
Name: categories, dtype: int64
out = s[s > 1].reset_index(name='count')
print (out)
ID group categories count
0 1 0 blue 2
Another solution is to get the duplicates first by filtering with duplicated, and then count:
dups = df[df.duplicated(['ID','group','categories'], keep=False)]
print (dups)
ID group categories
4 1 0 blue
5 1 0 blue
df1 = dups.groupby(['ID','group'])['categories'].value_counts().reset_index(name='count')
print (df1)
ID group categories count
0 1 0 blue 2
EDIT: To count all rows (categories) per ID, use GroupBy.size:
df1 = df.groupby('ID').size().reset_index(name='count')
print (df1)
ID count
0 1 3
1 2 2
2 3 2
3 4 3
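To cover every category without repeating the filter by hand, one option is to group on all three columns at once and then summarize. The sketch below assumes the goal is the number of distinct IDs that repeat within each (categories, group) combination. The names counts, repeated and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 1, 1, 2, 3, 4, 4],
    "group": [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
    "categories": ["red", "blue", "green", "green", "blue",
                   "blue", "red", "red", "red", "red"],
})

# Number of rows for every (category, group, ID) combination.
counts = df.groupby(["categories", "group", "ID"]).size()

# Keep only the combinations that occur more than once.
repeated = counts[counts > 1].reset_index(name="count")

# How many distinct IDs repeat within each (category, group).
summary = repeated.groupby(["categories", "group"])["ID"].nunique()

print(repeated)
print(summary)
```

With the sample data, only ID 1 in group 0 with category blue occurs more than once.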
