Pandas count size of groupby groups idiomatically - python

I often want a dataframe of counts for how many members are in each group after a groupby operation in pandas. I have a verbose way of doing it with size and reset index and rename, but I'm sure there is a better way.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
np.random.seed(0)
colors = ['red','green','blue']
cdf = pd.DataFrame({
'color1':np.random.choice(colors,10),
'color2':np.random.choice(colors,10),
})
print(cdf)
#better way to do next line? (somehow use agg?)
gb_count = cdf.groupby(['color1','color2']).size().reset_index().rename(columns={0:'num'})
print(gb_count)
#cdf.groupby(['color1','color2']).count() #<-- this doesn't work
Final output:
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1

To avoid getting a MultiIndex, use as_index=False (in recent pandas versions, size() then returns a DataFrame):
cdf.groupby(['color1','color2'], as_index=False).size()
color1 color2 size
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
If you explicitly want to name the new column num, you can use reset_index with name=..., since groupby(...).size() returns a Series:
cdf.groupby(['color1','color2']).size().reset_index(name='num')
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1

Another way is to call size through agg, convert the resulting Series with to_frame (passing the preferred column name), and then reset the index.
gb_count = cdf.groupby(['color1','color2']).agg('size').to_frame('num').reset_index()
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
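As a quick check that the three spellings above agree, here is a minimal sketch. The frame is typed out by hand to match the counts in the printed output (rather than relying on the random seed), and the names via_reset, via_frame and via_flag are only illustrative. The as_index=False variant assumes a pandas version in which size() returns a DataFrame in that mode.

```python
import pandas as pd

# Hand-written data that reproduces the counts shown in the output above
cdf = pd.DataFrame({
    'color1': ['blue', 'blue', 'green', 'green', 'green',
               'red', 'red', 'red', 'red', 'red'],
    'color2': ['green', 'red', 'blue', 'blue', 'blue',
               'green', 'green', 'green', 'green', 'red'],
})
keys = ['color1', 'color2']

# 1) size() returns a Series; reset_index(name=...) names the count column
via_reset = cdf.groupby(keys).size().reset_index(name='num')

# 2) the same count via agg('size'), then to_frame + reset_index
via_frame = cdf.groupby(keys).agg('size').to_frame('num').reset_index()

# 3) as_index=False keeps the keys as columns (assumes a recent pandas);
#    the count column is called 'size', so rename it
via_flag = cdf.groupby(keys, as_index=False).size().rename(columns={'size': 'num'})

print(via_reset)
```

All three frames should carry the same num column; the first form is the shortest when a custom column name is wanted.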

Related

Finding combinations of values across multiple rows and columns in pandas dataframe

I am trying to create all possible sets of values from a pandas dataframe, taking 3 values from each row.
For example, if I consider the dataframe below:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Amber
Chair Magenta Turquoise White Orange Violet Pink
Window Indigo Yellow Cerulean Grey Peach Aqua
I want to generate all possible solution sets, by taking 3 values from the first, second, third and fourth rows each.
This is what I tried:
from itertools import product
uniques = [df[i].unique().tolist() for i in df.columns ]
pd.DataFrame(product(*uniques), columns = df.columns)
But this generates all combinations with all 6 columns like this:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Aqua
Table Black Brown Red Blue Green Violet
Here, all values of Row 1 remain the same except for the last value, and all combinations are generated like this.
But what I need is this:
0 1 2 3 4 5 6 7 8 9
Table Black Red Blue Magenta White Orange Yellow Peach Aqua
Here, the first three values are from Row 1, the second 3 values are from Row 2, and the last 3 values are from Row 3.
Similarly, I want to display all such sets, and store them in a new dataframe.
Any help will be appreciated.
df
###
0 1 2 3 4 5 6
0 Table Black Brown Red Blue Green Amber
1 Chair Magenta Turquoise White Orange Violet Pink
2 Window Indigo Yellow Cerulean Grey Peach Aqua
from itertools import product
import random
uniques = [df[i].unique().tolist() for i in df.columns]
rl = list(product(*uniques))
pd.DataFrame(random.choices(rl))
product() generates all the combinations from sets, but you want a random result from the combinations list.
0 1 2 3 4 5 6
0 Table Black Brown White Orange Peach Aqua
Supplement: counting combinations
Suppose there are 3 sets. If you select one element from each set, how many combinations are there?
2 * 2 * 3 = 12
Let's check whether the total number of combinations is indeed 12.
list_of_lists = [['Yellow','Blue'],['Cat','Dog'],['Swim','Run','Sleep']]
combination = product(*list_of_lists)
combination_list = list(combination)
pd.DataFrame(combination_list)
###
0 1 2
0 Yellow Cat Swim
1 Yellow Cat Run
2 Yellow Cat Sleep
3 Yellow Dog Swim
4 Yellow Dog Run
5 Yellow Dog Sleep
6 Blue Cat Swim
7 Blue Cat Run
8 Blue Cat Sleep
9 Blue Dog Swim
10 Blue Dog Run
11 Blue Dog Sleep
Choosing one row at random from the table above then yields one set from the combinations.
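The product-over-columns code above mixes values across rows. If the goal is instead to pick 3 values from within each row and join one pick per row into a single output row, one possible sketch combines itertools.combinations (within a row) with itertools.product (across rows). It assumes that the order of values within a row is kept, that the first column counts as an ordinary value, and that exactly 3 values are taken per row; the frame is retyped from the question and the names per_row_picks and result are only illustrative.

```python
from itertools import chain, combinations, product

import pandas as pd

# Frame retyped from the question
df = pd.DataFrame([
    ['Table', 'Black', 'Brown', 'Red', 'Blue', 'Green', 'Amber'],
    ['Chair', 'Magenta', 'Turquoise', 'White', 'Orange', 'Violet', 'Pink'],
    ['Window', 'Indigo', 'Yellow', 'Cerulean', 'Grey', 'Peach', 'Aqua'],
])

# All ways to pick 3 values from each row (C(7, 3) = 35 picks per row)
per_row_picks = [list(combinations(row, 3)) for row in df.to_numpy().tolist()]

# One pick per row, flattened into a single 9-value output row
rows = [tuple(chain.from_iterable(picks)) for picks in product(*per_row_picks)]
result = pd.DataFrame(rows)

print(result.shape)
print(result.head())
```

With 35 picks in each of the 3 rows this yields 35 * 35 * 35 rows, so the result grows quickly as rows or columns are added.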

Create a matrix from two columns

I'm trying to create a matrix from two columns within an excel sheet. The first column is a key with multiple repeating instances and the second column references the different values tied to the key. I'd like to be able to create a matrix of all the values in the second column to reference the number of times they are paired together for all the key instances.
a b
1 red
1 blue
1 green
2 yellow
2 red
3 blue
3 green
3 yellow
and I'd like to turn this sample dataframe into
color red blue yellow green
red 0 1 1 1
blue 1 0 1 2
yellow 1 1 0 1
green 1 2 1 0
Essentially using column a as a groupby() to segment each key then making counts of the relationships encountered as a running tally. Can't quite figure out how to implement a pivot table or a cross tab to accomplish this (if that's even the best route).
Use how='cross' as parameter of pd.merge. I assume you have no ('a', 'b') duplicates like two (1, red).
out = (
    pd.merge(df, df, how='cross')
    .query('a_x == a_y & b_x != b_y')[['b_x', 'b_y']]
    .assign(dummy=1)
    .pivot_table(values='dummy', index='b_x', columns='b_y', aggfunc='count', fill_value=0)
    .rename_axis(index=None, columns=None)
)
print(out)
# Output:
blue green red yellow
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
This looks like a self-join on a, so I went with that (merge defaults to an inner join):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                   'b': ['red', 'blue', 'green', 'yellow', 'red', 'blue', 'green', 'yellow']})
df_count = df.merge(df, on = 'a').groupby(['b_x', 'b_y']).count().reset_index().pivot(index = 'b_x', columns='b_y', values='a')
np.fill_diagonal(df_count.values, 0)
df_count.index.name='color'
df_count.columns.name=None
blue green red yellow
color
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
import numpy as np
import pandas as pd
s = pd.crosstab(df.a, df.b) # crosstabulate
s = s.T @ s  # transpose and take the dot product
np.fill_diagonal(s.values, 0) # Fill the diagonals with 0
print(s)
b blue green red yellow
b
blue 0 2 1 1
green 2 0 1 1
red 1 1 0 1
yellow 1 1 1 0
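For reference, the crosstab idea can be written as a self-contained sketch on the sample data from the question. It assumes there are no duplicate (a, b) pairs, and it zeroes the diagonal on an explicit NumPy copy instead of writing through .values, since in-place writes through .values may not be allowed or may not propagate in every pandas version. The names indicator, counts and co_occurrence are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 3, 3, 3],
    'b': ['red', 'blue', 'green', 'yellow', 'red', 'blue', 'green', 'yellow'],
})

# Rows: keys in a, columns: colors in b, cells: 1 if the pair occurs
indicator = pd.crosstab(df['a'], df['b'])

# (colors x keys) dot (keys x colors) counts shared keys per color pair
counts = indicator.T.dot(indicator)

# Zero the diagonal on a writable copy, then rebuild the labelled frame
matrix = counts.to_numpy(copy=True)
np.fill_diagonal(matrix, 0)
co_occurrence = pd.DataFrame(
    matrix,
    index=counts.index.tolist(),
    columns=counts.columns.tolist(),
)

print(co_occurrence)
```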

Split (Explode) String into Multiple columns and rows - Python

Good afternoon, I am trying to split the text in a column into a specific format.
Here is my table:
UserId Application
1 Grey Blue::Black Orange;White:Green
2 Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId Application Role
1 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
2 Blue Pink Red
So far my code is
def unnesting(df, explode):
    # Repeat each index label once per element of the list in that row
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')

df['Application'] = df.Roles.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop(columns='Roles'), ['Application'])
This produces the following output:
UserId Application
1 Grey Blue
1 White
2 Yellow Purple
2 Blue Pink
I do not know how to add the second column (Role) for the second split, after the ::
Given this dataframe:
UserId Application
0 1 Grey Blue::Black Orange;White::Green
1 2 Yellow Purple::Orange Grey;Blue Pink::Red
you could at least achieve the last two columns directly via
df.Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_0', 'level_1'])
which results in
0 1
0 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
3 Blue Pink Red
However, setting UserId as the index first also carries the proper UserId column through:
result = df.set_index('UserId').Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_1'])
result.columns = ['UserId', 'Application', 'Role']
UserId Application Role
0 1 Grey Blue Black Orange
1 1 White Green
2 2 Yellow Purple Orange Grey
3 2 Blue Pink Red
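An alternative sketch for the same result uses Series.str.split together with DataFrame.explode, assuming a pandas version that provides explode. It uses the '::' form of the data shown in the answer above, and the names pairs, parts and result are only illustrative.

```python
import pandas as pd

# Data as given in the answer above (every pair separated by '::')
df = pd.DataFrame({
    'UserId': [1, 2],
    'Application': [
        'Grey Blue::Black Orange;White::Green',
        'Yellow Purple::Orange Grey;Blue Pink::Red',
    ],
})

# One row per 'Application::Role' pair; UserId is repeated automatically
pairs = (
    df.assign(Application=df['Application'].str.split(';'))
    .explode('Application')
    .reset_index(drop=True)
)

# Split each pair once on '::' into two columns
parts = pairs['Application'].str.split('::', n=1, expand=True)

result = pd.DataFrame({
    'UserId': pairs['UserId'],
    'Application': parts[0],
    'Role': parts[1],
})

print(result)
```

Compared with the stack-based version, this keeps UserId as a regular column throughout, so no index levels need to be dropped afterwards.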

Convert text table to pandas dataframe

Many times when I'm trying to answer questions on Stack Overflow, the question contains a table that I have to convert to a pandas dataframe in order to process it. For example, in this question:
http://stackoverflow.com/questions/43172116/pandas-count-some-value-in-all-columns
My question is, is there a faster way to convert it to a dataframe rather than writing:
df=pd.DataFrame({'graph':[1,2,3,4,5,6],
0:['blue','blue','red','red','blue','blue'],
1:['blue','blue','red','blue','red','blue'],
2:['blue','blue','blue','red','blue','blue'],
3:['blue','blue','blue','red','red','blue'],
4:['blue','blue','red','blue','red','blue']})
given that I can copy the text:
graph 0 1 2 3 4
1 blue blue blue blue blue
2 blue blue blue blue blue
3 blue red blue blue red
4 red blue red red blue
5 red red blue red red
6 blue blue blue blue blue
Make sure the desired data set is on the clipboard, then call pd.read_clipboard().
Step by step:
1. Select the desired data set.
2. Press Ctrl+C (on MS Windows).
3. Execute: df = pd.read_clipboard()
In [40]: df = pd.read_clipboard()
In [41]: df
Out[41]:
graph 0 1 2 3 4
0 1 blue blue blue blue blue
1 2 blue blue blue blue blue
2 3 blue red blue blue red
3 4 red blue red red blue
4 5 red red blue red red
5 6 blue blue blue blue blue
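When a clipboard is not available (for example in a script or on a remote session), a similar result can be sketched by pasting the text into a string and parsing it with pd.read_csv and a whitespace separator. The variable name text is only illustrative, and this assumes that no cell contains spaces.

```python
import io

import pandas as pd

# Table text pasted from the question
text = """graph 0 1 2 3 4
1 blue blue blue blue blue
2 blue blue blue blue blue
3 blue red blue blue red
4 red blue red red blue
5 red red blue red red
6 blue blue blue blue blue
"""

# Treat any run of whitespace as the column separator
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

print(df)
```

Note that the header labels are parsed as strings here, so the columns are '0' to '4' rather than the integers used in the hand-written dictionary.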

Concatenate two pandas dataframe with the same index but on different positions

I have a data frame like
id value_right color_right value_left color_left
1 asd red dfs blue
2 dfs blue afd green
3 ccd yellow asd blue
4 hty red hrr red
I need to get the left values below the right values, something like
id value color
1 asd red
1 dfs blue
2 dfs blue
2 afd green
3 ccd yellow
3 asd blue
4 hty red
4 hrr red
I tried to split in two data frames and to interleave using the id, but I got only half of the data on it, using the mod of the value of the id. Any ideas?
Select the left-side and right-side columns into separate dataframes, rename their columns to match, then concat them and sort on the 'id' column:
In [205]:
left = df[['id','value_left','color_left']].rename(columns={'value_left':'value','color_left':'color'})
right = df[['id','value_right','color_right']].rename(columns={'value_right':'value','color_right':'color'})
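# DataFrame.sort was removed from pandas; sort_values with a stable sort keeps each right row above its left row
merged = pd.concat([right,left]).sort_values('id', kind='mergesort')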
merged
Out[205]:
id value color
0 1 asd red
0 1 dfs blue
1 2 dfs blue
1 2 afd green
2 3 ccd yellow
2 3 asd blue
3 4 hty red
3 4 hrr red
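Another way to express the same reshape, as a sketch, is pd.wide_to_long, which splits column names such as value_right into a stub (value) and a suffix (right). It assumes the id column is unique per row; the name side for the suffix column is an arbitrary choice, and the frame is retyped from the question.

```python
import pandas as pd

# Frame retyped from the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value_right': ['asd', 'dfs', 'ccd', 'hty'],
    'color_right': ['red', 'blue', 'yellow', 'red'],
    'value_left': ['dfs', 'afd', 'asd', 'hrr'],
    'color_left': ['blue', 'green', 'blue', 'red'],
})

# Stubs 'value' and 'color'; the suffix after '_' ('right'/'left') goes into 'side'
long = pd.wide_to_long(
    df, stubnames=['value', 'color'], i='id', j='side', sep='_', suffix=r'\w+'
).reset_index()

# Sort by id, with 'right' before 'left' within each id (descending on side)
result = (
    long.sort_values(['id', 'side'], ascending=[True, False])
    [['id', 'value', 'color']]
    .reset_index(drop=True)
)

print(result)
```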
