How to use nlargest on a multilevel pivot_table in pandas?

I'm summing the values in a pivot table using pandas.
import numpy as np
import pandas as pd

dfr = pd.DataFrame({'A':   [1,1,1,1,2,2,2,2],
                    'B':   [1,2,2,3,1,2,2,2],
                    'C':   [1,1,1,2,1,1,2,2],
                    'Val': [1,1,1,1,1,1,1,1]})
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum)
dfr
Output:
A B C | Val
------|----
1 1 1 |  1
  2 1 |  2
  3 2 |  1
2 1 1 |  1
  2 1 |  1
    2 |  2
The way I need the output is to show only the largest value within each group of A, like this:
A B C | Val
------|----
1 2 1 |  2
2 2 2 |  2
I've googled around a bit and tried using nlargest() in different ways without being able to produce the result I want. Anyone got any ideas?

I think you need groupby + nlargest by level A:
# starting again from the original, unpivoted dfr
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum)
dfr = dfr.groupby(level='A')['Val'].nlargest(1).reset_index(level=0, drop=True).reset_index()
print (dfr)
   A  B  C  Val
0  1  2  1    2
1  2  2  2    2
because if you use pivot_table instead, the other index levels are lost:
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum).reset_index()
dfr = dfr.pivot_table(values='Val', index='A', aggfunc=lambda x: x.nlargest(1))
print (dfr)
   Val
A
1    2
2    2
And if you use all levels, it returns the nlargest within every (A, B, C) group, which is simply every row (not what you want):
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum).reset_index()
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=lambda x: x.nlargest(1))
print (dfr)
       Val
A B C
1 1 1    1
  2 1    2
  3 2    1
2 1 1    1
  2 1    1
    2    2
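For completeness, an alternative sketch that is not from the original answer, assuming ties within an A group may be broken arbitrarily: build the summed pivot from the original, unpivoted dfr, sort by Val descending, and keep the first row of each A group.
# piv is the summed pivot table with MultiIndex (A, B, C)
piv = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum)
res = piv.sort_values('Val', ascending=False).groupby(level='A').head(1)
res = res.sort_index().reset_index()
print (res)
   A  B  C  Val
0  1  2  1    2
1  2  2  2    2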

Related

Subtract values in a column in blocks

Suppose there is the following dataframe:
import pandas as pd
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'], 'Value': [1, 2, 3, 4, 5, 6]})
I would like to subtract the values of group A from those of groups B and C and make a new column with the difference. That is, I would like to do something like this:
df[df['Group'] == 'B']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
df[df['Group'] == 'C']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
and place the result in a new column. Is there a way of doing it without a for loop?
Assuming you want to subtract the first A from the first B/C, the second A from the second B/C, etc., the easiest might be to reshape:
df2 = (df
 .assign(cnt=df.groupby('Group').cumcount())
 .pivot(index='cnt', columns='Group', values='Value')
)
# Group  A  B  C
# cnt
# 0      1  3  5
# 1      2  4  6
# melt flattens df2 back into A,A,B,B,C,C order, which matches df's row order
df['new_col'] = df2.sub(df2['A'], axis=0).melt()['value']
A variant using groupby:
df['new_col'] = (df
.assign(cnt=df.groupby('Group').cumcount())
.groupby('cnt', group_keys=False)
.apply(lambda d: d['Value'].sub(d.loc[d['Group'].eq('A'), 'Value'].iloc[0]))
)
output:
  Group  Value  new_col
0     A      1        0
1     A      2        0
2     B      3        2
3     B      4        2
4     C      5        4
5     C      6        4
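Another sketch, assuming (as in the sample data) that the group-A row always comes first within each position: group by the position counter and subtract the first value of each position group with transform.
cnt = df.groupby('Group').cumcount()
df['new_col'] = df['Value'] - df.groupby(cnt)['Value'].transform('first')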

Pandas how to perform outer merge with specific order of adding rows

I have two data frames:
df:
  col1  col2
0    x     1
1    a     2
2    b     3
3    c     4
and
df2:
  col1  col2
0    x     1
1    a     2
2    f     6
3    c     4
And I want to obtain a data frame in which each new row from df2 is added after the row with the same index from df, like this:
  col1  col2
0    x     1
1    a     2
2    b     3
3    f     6
4    c     4
df1 = pd.DataFrame({
    'col1': ['x', 'a', 'b', 'c'],
    'col2': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'col1': ['x', 'a', 'f', 'c'],
    'col2': [1, 2, 6, 4]
})
In order to get your output, I concatenated the two dataframes, dropped the duplicated rows, and sorted by the index, as requested; you can then renumber the rows with reset_index(drop=True).
df = pd.concat([df1, df2]).drop_duplicates().sort_index().reset_index(drop=True)
# Output
  col1  col2
0    x     1
1    a     2
2    b     3
3    f     6
4    c     4
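One caveat the original answer does not mention: sort_index is not guaranteed to be stable by default, so when df1 and df2 both contribute a distinct row at the same index position, their relative order may not be preserved. A cautious variant requests a stable sort explicitly:
df = (pd.concat([df1, df2])
        .drop_duplicates()
        .sort_index(kind='stable')  # keep df1's row ahead of df2's row at equal index values
        .reset_index(drop=True))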

Concatenating multiple pandas dataframes when columns are not aligned

I have 3 dataframes:
df1
A B C
1 1 1
2 2 2
df2
A B C
3 3 3
4 4 4
df3
A B
5 5
So I want to concat all the dataframes to obtain the following one:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 NaN
I tried pd.concat([df1,df2,df3]) with both axis=0 and axis=1 but neither works as expected.
df = pd.concat([df1, df2, df3], ignore_index=True)
df.fillna("NA", inplace=True)  # optional: show missing values as the string "NA"
If the common column names are identical, this works nicely - the shared columns are aligned properly:
print (df1.columns.tolist())
['A', 'B', 'C']
print (df2.columns.tolist())
['A', 'B', 'C']
print (df3.columns.tolist())
['A', 'B']
If there might be trailing whitespace in the column names, you can use str.strip:
print (df1.columns.tolist())
['A', 'B ', 'C']
df1.columns = df1.columns.str.strip()
print (df1.columns.tolist())
['A', 'B', 'C']
The parameter ignore_index=True creates a default RangeIndex after concat, avoiding a duplicated index; adding the sort parameter avoids a FutureWarning:
df = pd.concat([df1,df2,df3], ignore_index=True, sort=True)
print (df)
   A  B    C
0  1  1  1.0
1  2  2  2.0
2  3  3  3.0
3  4  4  4.0
4  5  5  NaN
I think you need to tell concat to ignore the index:
result = pd.concat([df1,df2,df3], ignore_index=True)

Pandas GroupBy on column names

I have a dataframe, we can proxy by
df = pd.DataFrame({'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]})
and a category series
category = pd.Series(['A', 'B', 'B', 'A'], ['a', 'b', 'c', 'd'])
I'd like to get a sum of df's columns grouped into the categories 'A', 'B'. Maybe something like:
result = df.groupby(??, axis=1).sum()
returning
result = pd.DataFrame({'A':[3,3,4], 'B':[1,1,0]})
Use groupby + sum on the columns (the axis=1 is important here):
df.groupby(df.columns.map(category.get), axis=1).sum()
   A  B
0  3  1
1  3  1
2  4  0
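Note that axis=1 in groupby is deprecated in recent pandas versions; a sketch of an equivalent, assuming transposing the frame is acceptable, groups the transposed rows and transposes back:
df.T.groupby(df.columns.map(category.get)).sum().T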
After reindexing you can assign the category as the columns of df:
df = df.reindex(columns=category.index)
df.columns = category
df.groupby(df.columns.values, axis=1).sum()
Out[1255]:
   A  B
0  3  1
1  3  1
2  4  0
Or use pd.Series.get:
df.groupby(category.get(df.columns), axis=1).sum()
Out[1262]:
   A  B
0  3  1
1  3  1
2  4  0
Here is what I did to group a dataframe with duplicated column names:
data_df:
1  1  2  1
q  r  f  t
Code:
df_grouped = data_df.groupby(data_df.columns, axis=1).agg(lambda x: ' '.join(x.values))
df_grouped:
1      2
q r t  f
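The same pattern works for numeric data with duplicated column names; a hypothetical sketch (num_df is made up here), summing instead of joining strings:
num_df = pd.DataFrame([[1, 2, 3, 4]], columns=['1', '1', '2', '1'])
num_df.groupby(num_df.columns, axis=1).sum()
#    1  2
# 0  7  3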

Python/Pandas: Selecting dataframe rows with multiple conditions - col 1 matches col 2 OR col 3

If I have a df like so:
dfdict = {'1': ['a', 'a', 'a', 'b'], '2': ['a', 'b', 'c', 'a'], '3': ['b', 'a', 'd', 'c']}
df1 = pd.DataFrame(dfdict)
   1  2  3
0  a  a  b
1  a  b  a
2  a  c  d
3  b  a  c
I want to save only the rows where col 1 matches col 2 OR col 1 matches col 3. In this case, rows 0 and 1 would be saved:
   1  2  3
0  a  a  b
1  a  b  a
I tried:
df2 = df1.loc[df1['1'] == df1['2'] & df1['1'] == df1['3']]
but I get the error TypeError: unsupported operand type(s) for &: 'str' and 'bool'.
I would also like to get the other lines where col 1 does NOT match 2 OR 3, i.e. rows 2 and 3, in a separate df.
Option 1
eq, fixing your code - you need | (OR) rather than &, and since & binds more tightly than ==, the original comparisons would also have needed parentheses:
df1[df1['1'].eq(df1['2']) | df1['1'].eq(df1['3'])]
   1  2  3
0  a  a  b
1  a  b  a
Option 2
np.vectorize
f = np.vectorize(lambda x, y, z: x in (y, z))
df1[f(df1['1'], df1['2'], df1['3'])]
   1  2  3
0  a  a  b
1  a  b  a
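To also get the rows where col 1 matches neither col 2 nor col 3 in a separate df, as asked, build the mask once and invert it with ~ (a small sketch based on Option 1):
mask = df1['1'].eq(df1['2']) | df1['1'].eq(df1['3'])
df_match = df1[mask]      # rows 0 and 1
df_nomatch = df1[~mask]   # rows 2 and 3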
