Reorder Multiindex Pandas Dataframe - python

I would like to reorder the columns in a dataframe, and keep the underlying values in the right columns.
For example, this is the dataframe I have:
import numpy as np
import pandas as pd

cols = [['Three', 'Two'], ['A', 'D', 'C', 'B']]
header = pd.MultiIndex.from_product(cols)
df = pd.DataFrame([[1,4,3,2,5,8,7,6]]*4, index=np.arange(1,5), columns=header)
df.loc[:,('One','E')] = 9
df.loc[:,('One','F')] = 10
>>> df
And I would like to change it as follows:
header2 = pd.MultiIndex(levels=[['One', 'Two', 'Three'], ['E', 'F', 'A', 'B', 'C', 'D']],
                        labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2], [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]])
df2 = pd.DataFrame([[9,10,1,2,3,4,5,6,7,8]]*4, index=np.arange(1,5), columns=header2)
>>> df2

First, define a categorical ordering on each column level. Then call sort_index on the column axis (axis=1) with both levels.
v = pd.Categorical(df.columns.get_level_values(0),
                   categories=['One', 'Two', 'Three'],
                   ordered=True)
v2 = pd.Categorical(df.columns.get_level_values(1),
                    categories=['E', 'F', 'C', 'B', 'A', 'D'],
                    ordered=True)
df.columns = pd.MultiIndex.from_arrays([v, v2])
df = df.sort_index(axis=1, level=[0, 1])
df
  One      Two           Three
    E   F    C  B  A  D     C  B  A  D
1   9  10    7  6  5  8     3  2  1  4
2   9  10    7  6  5  8     3  2  1  4
3   9  10    7  6  5  8     3  2  1  4
4   9  10    7  6  5  8     3  2  1  4
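Alternatively, if you already know the exact target order, you can skip the Categorical step and reindex the original df from the question directly. A minimal sketch of my own (not part of the answer above); the tuple list mirrors header2 from the question:
# Spell out the desired column order explicitly and reindex along the columns.
target = pd.MultiIndex.from_tuples([
    ('One', 'E'), ('One', 'F'),
    ('Two', 'A'), ('Two', 'B'), ('Two', 'C'), ('Two', 'D'),
    ('Three', 'A'), ('Three', 'B'), ('Three', 'C'), ('Three', 'D'),
])
df2 = df.reindex(columns=target)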


Ordering a dataframe with a key function and multiple columns

I have the following
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005, 1.00001, -2.12132, -2.12137, 1.00003, -2.12135]
})
print(df)
print(df.sort_values(by=['col5']))
print(df.sort_values(by=['col2']))
print(df.sort_values(by='col2', key=lambda col: col.abs() ))
So far so good.
However I would like to order the dataframe by two columns:
First col6 and then col5
But with the following conditions:
col6 only has to consider 4 decimals (meaning that 1.00005 and 1.00001 should be considered equal)
col6 should be compared by absolute value (meaning 1.00005 is less than -2.12132)
So the desired output would be
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
How can I combine the usage of keys with multiple columns?
If you want to use arbitrary conditions on different columns, the easiest (and most efficient) approach is to use numpy.lexsort:
import numpy as np
out = df.iloc[np.lexsort([df['col5'].abs(), df['col6'].round(4)])]
NB: unlike sort_values, with lexsort the keys with higher priority go at the end of the list.
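A minimal sketch of that priority rule, with made-up values (not from the question):
import numpy as np

primary = np.array([2, 1, 2, 1])      # sorted first, so it goes LAST in the key list
secondary = np.array([1, 1, 0, 0])    # tie-breaker, so it goes first
np.lexsort([secondary, primary])      # array([3, 1, 2, 0])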
If you really want to use sort_values, you can use a custom function that chooses the operation to apply depending on the Series name:
def sorter(s):
    funcs = {
        'col5': lambda s: s.abs(),
        'col6': lambda s: s.round(4)
    }
    return funcs[s.name](s) if s.name in funcs else s
out = df.sort_values(by=['col6', 'col5'], key=sorter)
Output:
col1 col2 col3 col4 col5 col6
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
1 A -1 1 B 1 1.00001
4 D 7 2 e 7 1.00003
0 A 2 0 a 2 1.00005
Provided example
Reading the question and the provided example again, I think you might want:
df.iloc[np.lexsort([df['col5'], np.trunc(df['col6'].abs()*10**4)/10**4])]
Output:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
round() should not be used to truncate, because rounding 1.00005 to 4 decimals can give 1.0001 rather than the truncated 1.0000.
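A quick illustrative sketch of that point (not part of the proposed code below; the exact result of round() depends on the binary float representation):
import numpy as np

x = 1.00005
np.trunc(x * 10**4) / 10**4   # 1.0 -> truncation simply drops the fifth decimal
round(x, 4)                   # rounding, not truncation; may round up to 1.0001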
Proposed code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005, 1.00001, -2.12132, -2.12137, 1.00003, -2.12135]
})
r = df.sort_values(by=['col6', 'col5'], key=lambda c: c.apply(lambda x: abs(float(str(x)[:-1]))) if c.name=='col6' else c)
print(r)
Result:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
Other coding style, inspired by Mozway
I have read Mozway's inspiring answer.
Very interesting, but since s is a Series you should use the following script:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005, 1.00001, -2.12132, -2.12137, 1.00003, -2.12135]
})
def truncate(x):
    s = str(x).split('.')
    s[1] = s[1][:4]
    return '.'.join(s)

def sorter(s):
    funcs = {
        'col5': lambda s: s,
        'col6': lambda s: s.apply(lambda x: abs(float(truncate(x))))
    }
    return funcs[s.name](s) if s.name in funcs else s
out = df.sort_values(by=['col6', 'col5'], key=sorter)
print(out)

Function equivalent of Excel's SUMIFS()

I have a sales table with columns item, week, and sales. I want to create a week-to-date sales column (wtd sales) that is a weekly roll-up of sales per item.
I have no idea how to create this in Python.
I'm stuck at groupby(), which probably is not the answer. Can anyone help?
output_df['wtd sales'] = input_df.groupby(['item'])['sales'].transform(wtd)
As I stated in my comment, you are looking for cumsum():
import pandas as pd
df = pd.DataFrame({
    'items': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'weeks': [1, 2, 3, 4, 1, 2, 3, 4],
    'sales': [100, 101, 102, 130, 10, 11, 12, 13]
})
df.groupby(['items'])['sales'].cumsum()
Which results in:
0 100
1 201
2 303
3 433
4 10
5 21
6 33
7 46
Name: sales, dtype: int64
I'm using:
pd.__version__
'1.5.1'
Putting it all together:
import pandas as pd
df = pd.DataFrame({
    'items': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'weeks': [1, 2, 3, 4, 1, 2, 3, 4],
    'sales': [100, 101, 102, 130, 10, 11, 12, 13]
})
df['wtds'] = df.groupby(['items'])['sales'].cumsum()
Resulting in:
items weeks sales wtds
0 A 1 100 100
1 A 2 101 201
2 A 3 102 303
3 A 4 130 433
4 B 1 10 10
5 B 2 11 21
6 B 3 12 33
7 B 4 13 46
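One caveat worth a small sketch (my own assumption, not part of the answer above): cumsum() accumulates in row order, so if the rows are not already ordered by week within each item, sort first:
import pandas as pd

df = pd.DataFrame({
    'items': ['A', 'A', 'B', 'B'],
    'weeks': [2, 1, 2, 1],   # weeks deliberately out of order
    'sales': [101, 100, 11, 10]
})

# Sort by item and week so the running total follows the week order.
df = df.sort_values(['items', 'weeks'])
df['wtds'] = df.groupby('items')['sales'].cumsum()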

Pandas dataframe create new row for every value over multiple columns [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 1 year ago.
I have a pandas dataframe with unique values in the ID column.
df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'STAT': ['X', 'X', 'X'],
                   'IN1': [1, 3, 7],
                   'IN2': [2, 5, 8],
                   'IN3': [3, 6, 9]})
I need to create a new dataframe where I have a row for each value in IN1, IN2 and IN3 with corresponding ID and STAT:
df_new = pd.DataFrame({'IN': [1, 2, 3, 3, 5, 6, 7, 8, 9],
                       'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                       'STAT': ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X']})
You can use pandas.wide_to_long:
(pd.wide_to_long(df, ['IN'], j='to_drop', i='ID')
   .droplevel('to_drop')
   .sort_index()
   .reset_index()
)
output:
ID STAT IN
0 A X 1
1 A X 2
2 A X 3
3 B X 3
4 B X 5
5 B X 6
6 C X 7
7 C X 8
8 C X 9
You can use melt
df.melt(id_vars=['ID','STAT'], value_name='IN')
Gives:
ID STAT variable IN
0 A X IN1 1
1 B X IN1 3
2 C X IN1 7
3 A X IN2 2
4 B X IN2 5
5 C X IN2 8
6 A X IN3 3
7 B X IN3 6
8 C X IN3 9
To reshape the df to match the desired output:
(df.melt(id_vars=['ID','STAT'], value_name='IN')
   .sort_values(by='ID')
   .drop('variable', axis=1)
)
Gives the exact same results.
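If you also want the exact column order and index of df_new, a small extra step (my own addition, not part of the answer above) finishes the job:
out = (df.melt(id_vars=['ID', 'STAT'], value_name='IN')
         .sort_values(by='ID')
         .drop('variable', axis=1)
         .reset_index(drop=True)[['IN', 'ID', 'STAT']])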

Filter rows of DataFrame by list of tuples

Let's assume I have the following DataFrame:
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
       'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
       'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
       'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following I want to take the values of column a and b, where c='e' and use those values to select respective rows of df (which would filter rows 2, 3, 6, 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df.loc[list_tup]
Results in a TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to Python and pandas, so I assume I'm missing something fundamental.
I believe it's better to use groupby().transform() and boolean indexing in this use case:
valids = (df['c'].eq('e')                 # check if `c` is 'e'
            .groupby([df['a'], df['b']])  # group by `a` and `b`
            .transform('any')             # check if `True` occurs in the group;
                                          # use the same label for all rows in the group
          )
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter() which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
You could try an inner join.
import pandas as pd
dic = {'a' : [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
       'b' : [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
       'c' : ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
       'd' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
Let us try
m = df[['a','b']].apply(tuple, axis=1).isin(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False).tolist())
out = df[m].copy()
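A further sketch of my own (assuming pandas >= 0.24, not one of the answers above): compare the (a, b) pairs through a MultiIndex without setting the index on df.
# Build the key tuples where c == 'e', then test membership of every (a, b) pair.
keys = list(df.loc[df['c'] == 'e', ['a', 'b']].itertuples(index=False, name=None))
mask = pd.MultiIndex.from_frame(df[['a', 'b']]).isin(keys)
out = df[mask]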

How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first records of each group after groupby, and then delete that row from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups  # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Try this?
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})
t = df.drop_duplicates(subset=['item_id'], keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or using ~isin ("not in").
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: Create a dataframe from a dictionary. Group by item_id and find the max value. Enumerate over the grouped result and use the key, which is a numeric position, to look up the item_id index label. Build a result_df dataframe if you want one.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()
result_df = pd.DataFrame(columns=['item_id', 'cost'])
for key, value in enumerate(grouped):
    index = grouped.index[key]
    result_df = result_df.append({'item_id': index, 'cost': value}, ignore_index=True)
print(result_df.head(5))
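For what it's worth, the same result_df can be built without the row-by-row append (DataFrame.append was removed in pandas 2.0); a minimal sketch:
# Max cost per item_id, keeping item_id as a regular column.
result_df = df_temp.groupby('item_id', as_index=False)['cost'].max()
print(result_df.head(5))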
