Summing over a multiindex pandas DataFrame - python

Let's say I create the following dataframe with df.set_index(['Class', 'subclass']), bearing in mind there are multiple Classes (A through Z), each with subclasses.
Class subclass
A a
A b
A c
A d
B a
B b
How would I count the subclasses in each Class and create a separate column named 'No. of classes', so that I can see the Class with the greatest number of subclasses? I was thinking of some sort of for loop which runs through the Class letters and counts the subclasses while the Class letter stays the same. However, that seems counterintuitive for such a problem. Is there a simpler approach, such as df.groupby(...).count()?
The desired output would be:
Class subclass No. of classes
A a 4
A b
A c
A d
B a 2
B b
I have tried the level parameter as shown in group multi-index pandas dataframe, but this doesn't seem to work for me.
EDIT:
I did not mention that I wanted a return of the Class with the greatest number of subclasses. I achieved this with:
df.reset_index().groupby('Class')['subclass'].nunique().idxmax()
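For reference, a minimal runnable sketch of both steps on the flat (non-indexed) sample data: the per-class count broadcast with transform, and the idxmax lookup from the edit above:

```python
import pandas as pd

df = pd.DataFrame({"Class": list("AAAABB"), "subclass": list("abcdab")})

# Per-class subclass count, broadcast back to every row.
df["No. of classes"] = df.groupby("Class")["subclass"].transform("nunique")

# Class with the greatest number of subclasses.
top = df.groupby("Class")["subclass"].nunique().idxmax()
```

Here top is "A" and the new column holds 4 for every A row and 2 for every B row.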

You can use transform, but this produces duplicated values:
df['No. of classes'] = df.groupby(level='Class')['val'].transform('size')
print (df)
val No. of classes
Class subclass
A a 1 4
b 4 4
c 5 4
d 4 4
B a 1 2
b 2 2
But if you need empty values:
df['No. of classes'] = (df.groupby(level='Class')
                          .apply(lambda x: pd.Series([len(x)] + [np.nan] * (len(x) - 1)))
                          .values)
print (df)
val No. of classes
Class subclass
A a 1 4.0
b 4 NaN
c 5 NaN
d 4 NaN
B a 1 2.0
b 2 NaN
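A shorter route to the same blanked-out column is to keep the group size only on the first row of each Class and mask the rest; a sketch using the same sample data (the val column is the one from the output above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Class": list("AAAABB"), "subclass": list("abcdab"), "val": [1, 4, 5, 4, 1, 2]}
).set_index(["Class", "subclass"])

# Size of each Class group, broadcast back to every row.
sizes = df.groupby(level="Class")["val"].transform("size")

# Keep the size only on the first row of each Class; blank out the rest.
first_of_class = ~df.index.get_level_values("Class").duplicated()
df["No. of classes"] = sizes.where(first_of_class)
```

This yields 4.0 on the first A row, 2.0 on the first B row, and NaN elsewhere.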
Another solution to get the Class with the greatest number of subclasses:
df = (df.groupby(level=['Class'])
        .apply(lambda x: x.index.get_level_values('subclass').nunique())
        .idxmax())
print (df)
A

You can use transform to add an aggregated calculation back to the original df as a new column:
In [165]:
df['No. of classes'] = df.groupby('Class')['subclass'].transform('count')
df
Out[165]:
Class subclass No. of classes
0 A a 4
1 A b 4
2 A c 4
3 A d 4
4 B a 2
5 B b 2

Related

How would I count all unique values of a dataframe in python without double counting?

Let's suppose I have a python data frame that looks something like this:
Factor_1 Factor_2 Factor_3 Factor_4 Factor_5
A B A Nan Nan
B D F A Nan
F A D B A
Something like this, in which I have 5 columns that hold different factors. I would like to create a column that counts how many times each factor appears in the dataframe, but without double counting within a row: if a value appears more than once in one row, it only counts as 1. For example, if one row has A, B, C, A, A, only one A would be counted. The expected output would be this:
Factor Count
A 3
B 3
D 2
F 2
Nan 2
I used a code I was helped with:
df.stack(dropna=False).value_counts(dropna=False)
I was using an if to drop the double counts, but I would like to know if there is a practical and simple way to do this, like the code above, and not with an if, because what I am doing is not efficient.
You can use Series.unique + Series.value_counts:
s = pd.Series(np.hstack(df.T.apply(pd.Series.unique))).value_counts(dropna=False)
B 3
A 3
F 2
D 2
NaN 2
dtype: int64
Here is a way following your logic, additionally chaining a conditional check using groupby on level=0:
s = df.stack(dropna=False)
s.groupby(level=0).apply(lambda x: x[~x.duplicated()]).value_counts(dropna=False)
A 3
B 3
D 2
F 2
NaN 2
dtype: int64
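The same row-wise dedup can also be done without stack, by collecting the unique values of each row with pd.unique (which keeps a single NaN per row); a sketch with the sample frame, column names assumed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Factor_1": ["A", "B", "F"],
                   "Factor_2": ["B", "D", "A"],
                   "Factor_3": ["A", "F", "D"],
                   "Factor_4": [np.nan, "A", "B"],
                   "Factor_5": [np.nan, np.nan, "A"]})

# pd.unique keeps one copy of each value per row (including one NaN),
# then value_counts tallies across all rows.
counts = pd.Series(
    np.concatenate([pd.unique(row) for _, row in df.iterrows()])
).value_counts(dropna=False)
```

This gives A 3, B 3, D 2, F 2, NaN 2, matching the outputs above.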

Rearrange rows based on condition alternating?

I have a bunch of rows which I want to rearrange one after the other based on a particular column.
df
B/S
0 B
1 B
2 S
3 S
4 B
5 S
I have thought about doing a loc based on B and S and then concatenating them all into a new dataframe, but that doesn't seem like good practice for pandas.
Is there a pandas centric approach to this?
Output required
B/S
0 B
2 S
1 B
3 S
4 B
5 S
We can achieve this by making smart use of reset_index (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
m = df['B/S'].eq('B')
b = df[m].reset_index(drop=True)
s = df[~m].reset_index(drop=True)
out = pd.concat([b, s]).sort_index().reset_index(drop=True)
B/S
0 B
1 S
2 B
3 S
4 B
5 S
If you want to keep your index information, we can slightly adjust the approach:
m = df['B/S'].eq('B')
b = df[m].reset_index()
s = df[~m].reset_index()
out = pd.concat([b, s]).sort_index().set_index('index')
B/S
index
0 B
2 S
1 B
3 S
4 B
5 S
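An alternative that avoids splitting the frame at all: number the rows within each group with cumcount, then sort on that counter so the groups interleave. A sketch, assuming the same two groups as in the sample:

```python
import pandas as pd

df = pd.DataFrame({"B/S": ["B", "B", "S", "S", "B", "S"]})

# Number each row within its own group (0, 1, 2 for the Bs; 0, 1, 2 for the Ss),
# then sort by that counter so the groups alternate: B, S, B, S, ...
out = (df.assign(order=df.groupby("B/S").cumcount())
         .sort_values(["order", "B/S"])
         .drop(columns="order")
         .reset_index(drop=True))
```

This relies on "B" sorting before "S"; with other labels you would sort the tie-break column accordingly.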

Rename name in Python Pandas MultiIndex

I am trying to rename a column name in a pandas MultiIndex, but it doesn't work. Here you can see my Series object. By the way, why does the dataframe df_injury_record become a Series object in this function?
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
In the next line you will see my attempt to rename the column.
Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
But after this, the column still has the same name.
Regards
One possible problem is a pandas version under 0.24, or forgetting to assign the result back, as mentioned by anky_91:
df_injury_record = pd.DataFrame({'Surface':list('aaaabbbbddd'),
'BodyPart':list('abbbbdaaadd')})
Frequency_BodyPart = df_injury_record.groupby(["Surface","BodyPart"]).size()
print (Frequency_BodyPart)
Surface BodyPart
a a 1
b 3
b a 2
b 1
d 1
d a 1
d 2
dtype: int64
Frequency_BodyPart = Frequency_BodyPart.rename_axis(index={'Surface': 'Class'})
print (Frequency_BodyPart)
Class BodyPart
a a 1
b 3
b a 2
b 1
d 1
d a 1
d 2
dtype: int64
If you want a 3-column DataFrame, this also works for older pandas versions:
df = Frequency_BodyPart.reset_index(name='count').rename(columns={'Surface': 'Class'})
print (df)
Class BodyPart count
0 a a 1
1 a b 3
2 b a 2
3 b b 1
4 b d 1
5 d a 1
6 d d 2
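On older pandas where rename_axis doesn't accept a dict, the level name can also be set directly on the index via Index.set_names; a sketch with the same data:

```python
import pandas as pd

df_injury_record = pd.DataFrame({"Surface": list("aaaabbbbddd"),
                                 "BodyPart": list("abbbbdaaadd")})
Frequency_BodyPart = df_injury_record.groupby(["Surface", "BodyPart"]).size()

# Rename the first level of the MultiIndex directly.
Frequency_BodyPart.index = Frequency_BodyPart.index.set_names("Class", level=0)
```

After this the index levels are named Class and BodyPart, as in the rename_axis output above.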

total no. of combinations of a column with other in pandas df

I have a table in a pandas df:
id_x id_y
a b
b c
c d
d a
b a
and so on (around 1000 rows).
I want to find the count of combinations for each id_x with id_y.
i.e. a has the combinations a-b and d-a (2 in total);
similarly, b has 2 combinations in total: b-c, and also a-b, which should be considered a combination for b (a-b = b-a).
and create a dataframe df2 which has
id combinations
a 2
b 2
c 2 #(c-d and b-c)
d 1
and so on (distinct product_id's).
I tried this code:
df.groupby(['id_x']).size().reset_index()
but I am getting the wrong result:
id_x 0
0 a 1
1 b 1
2 c 1
3 d 1
What approach should I follow?
My Python skills are at a beginner level.
Thanks in advance.
You can first sort all rows by applying sorted, then create a Series with stack, and finally count with value_counts:
df = df.apply(sorted,axis=1).drop_duplicates().stack().value_counts()
print (df)
d 2
a 2
b 2
c 2
dtype: int64
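Note that on recent pandas, df.apply(sorted, axis=1) returns a Series of lists rather than a DataFrame, which breaks drop_duplicates; a version-robust sketch of the same idea sorts the values with NumPy instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id_x": list("abcdb"), "id_y": list("bcdaa")})

# Sort each pair so (b, a) becomes (a, b), drop duplicate pairs,
# then count how often each id appears in the remaining pairs.
pairs = pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                     columns=df.columns).drop_duplicates()
counts = pairs.stack().value_counts()
```

With the sample data this again yields a count of 2 for each of a, b, c, d.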

Pandas group by aggregate using division

I'm wondering how to aggregate data within a grouped pandas dataframe using a function that takes into account the values stored in some column of the dataframe. This would be useful for operations where the order of operations matters, such as division.
For example I have:
In [8]: df
Out[8]:
class cat xer
0 a 1 2
1 b 1 4
2 c 1 9
3 a 2 6
4 b 2 8
5 c 2 3
I want to group by class and, for each class, divide the xer value corresponding to cat == 1 by the one for cat == 2. In other words, the entries in the final output should be:
class div
0 a 0.33 (i.e. 2/6)
1 b 0.5 (i.e. 4/8)
2 c 3 (i.e. 9/3)
Is this possible to do using groupby? I can't quite figure out how to do it without manually iterating through each class, and even then it's not clean or fun.
Without doing anything too clever:
In [11]: one = df[df["cat"] == 1].set_index("class")["xer"]
In [12]: two = df[df["cat"] == 2].set_index("class")["xer"]
In [13]: one / two
Out[13]:
class
a 0.333333
b 0.500000
c 3.000000
Name: xer, dtype: float64
Given your DataFrame, you can use the following (in Python 3 reduce lives in functools, and the deprecated pd.np alias should be replaced by importing numpy directly):
from functools import reduce
import numpy as np

df.groupby('class').agg({'xer': lambda L: reduce(np.divide, L)})
Which gives you:
xer
class
a 0.333333
b 0.500000
c 3.000000
This caters for > 2 per group (if needs be), but you might want to ensure your df is sorted by cat first to ensure they appear in the right order.
You may want to rearrange your data to make it easier to view:
df2 = df.set_index(['class', 'cat']).unstack()
>>> df2
xer
cat 1 2
class
a 2 6
b 4 8
c 9 3
You can then do the following to get your desired result:
>>> df2.iloc[:,0].div(df2.iloc[:, 1])
class
a 0.333333
b 0.500000
c 3.000000
Name: (xer, 1), dtype: float64
This is one approach, step by step:
# get cat==1 and cat==2 merged by class
grouped = df[df.cat==1].merge(df[df.cat==2], on='class')
# calculate div
grouped['div'] = grouped.xer_x / grouped.xer_y
# return the final dataframe
grouped[['class', 'div']]
which yields:
class div
0 a 0.333333
1 b 0.500000
2 c 3.000000
