Split a MultiIndex DataFrame based on another DataFrame - python

Say you have a MultiIndex DataFrame:
     x  y  z
a 1  0  1  2
  2  3  4  5
b 1  0  1  2
  2  3  4  5
  3  6  7  8
c 1  0  1  2
  2  0  4  6
Now you have another DataFrame:
  col1  col2
0    a     1
1    b     1
2    b     3
3    c     1
4    c     2
How do you split the multiindex DataFrame based on the one above?

Use loc with a list of tuples:
df = df1.loc[df2.set_index(['col1', 'col2']).index.tolist()]
print(df)
     x  y  z
a 1  0  1  2
b 1  0  1  2
  3  6  7  8
c 1  0  1  2
  2  0  4  6
df = df1.loc[[tuple(x) for x in df2.values.tolist()]]
print(df)
     x  y  z
a 1  0  1  2
b 1  0  1  2
  3  6  7  8
c 1  0  1  2
  2  0  4  6
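A small variant, sketched here as my own addition (not from the answers above): pandas >= 0.24 offers MultiIndex.from_frame, which builds the tuple index straight from df2's columns without converting to a list of tuples by hand:

```python
import pandas as pd

# rebuild the example frames from the question
df1 = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 1, 2], [0, 4, 6]],
    index=pd.MultiIndex.from_tuples(
        [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('b', 3), ('c', 1), ('c', 2)]),
    columns=['x', 'y', 'z'])
df2 = pd.DataFrame({'col1': ['a', 'b', 'b', 'c', 'c'], 'col2': [1, 1, 3, 1, 2]})

# MultiIndex.from_frame turns df2's rows directly into index tuples
subset = df1.loc[pd.MultiIndex.from_frame(df2)]
print(subset)
```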
Or use join:
df = df2.join(df1, on=['col1', 'col2']).set_index(['col1', 'col2'])
print(df)
           x  y  z
col1 col2
a    1     0  1  2
b    1     0  1  2
     3     6  7  8
c    1     0  1  2
     2     0  4  6

Or simply use isin:
df[df.index.isin(list(zip(df2['col1'], df2['col2'])))]
Out[342]:
               0  1  2  3
index1 index2
a      1       1  0  1  2
b      1       1  0  1  2
       3       3  6  7  8
c      1       1  0  1  2
       2       2  0  4  6

You can also do this using the MultiIndex reindex method https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
## Recreate your dataframes
import pandas as pd

tuples = [('a', 1), ('a', 2),
          ('b', 1), ('b', 2),
          ('b', 3), ('c', 1),
          ('c', 2)]
data = [[1, 0, 1, 2],
        [2, 3, 4, 5],
        [1, 0, 1, 2],
        [2, 3, 4, 5],
        [3, 6, 7, 8],
        [1, 0, 1, 2],
        [2, 0, 4, 6]]
idx = pd.MultiIndex.from_tuples(tuples, names=['index1', 'index2'])
df = pd.DataFrame(data=data, index=idx)
df2 = pd.DataFrame([['a', 1],
                    ['b', 1],
                    ['b', 3],
                    ['c', 1],
                    ['c', 2]])

# Answer the question
idx_subset = pd.MultiIndex.from_tuples([(a, b) for a, b in df2.values],
                                       names=['index1', 'index2'])
out = df.reindex(idx_subset)
print(out)
               0  1  2  3
index1 index2
a      1       1  0  1  2
b      1       1  0  1  2
       3       3  6  7  8
c      1       1  0  1  2
       2       2  0  4  6
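One practical difference worth noting (my sketch, not part of the original answer): reindex keeps requested keys that are missing from df as rows of NaN, whereas selecting the same labels with loc raises a KeyError in recent pandas versions:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([('a', 1), ('b', 1)], names=['index1', 'index2'])
df = pd.DataFrame([[0, 1], [2, 3]], index=idx, columns=['x', 'y'])

# ('z', 9) does not exist in df's index
wanted = pd.MultiIndex.from_tuples([('a', 1), ('z', 9)], names=['index1', 'index2'])

out = df.reindex(wanted)        # the missing key becomes a NaN row
print(out)

try:
    df.loc[list(wanted)]        # same labels via loc -> KeyError (pandas >= 1.0)
except KeyError:
    print('loc raised KeyError')
```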

Related

Calculate DataFrame mode based on grouped data

I have the following DataFrame:
>>> df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 3, 3, 3], "b": [1, 5, 7, 9, 2, 4, 6, 14, 5], "c": [1, 0, 0, 1, 1, 1, 1, 0, 1]})
>>> df
   a   b  c
0  1   1  1
1  1   5  0
2  1   7  0
3  1   9  1
4  2   2  1
5  2   4  1
6  3   6  1
7  3  14  0
8  3   5  1
I want to calculate the mode of column c for every unique value in a and then select the rows where c has this value.
This is my own solution:
>>> major_types = df.groupby(['a'])['c'].apply(lambda x: pd.Series.mode(x)[0])
>>> df = df.merge(major_types, how="left", right_index=True, left_on="a", suffixes=("", "_major"))
>>> df = df[df['c'] == df['c_major']].drop(columns="c_major", axis=1)
Which outputs the following:
>>> df
   a  b  c
1  1  5  0
2  1  7  0
4  2  2  1
5  2  4  1
6  3  6  1
8  3  5  1
This is very inefficient for large DataFrames. Any idea what to do?
IIUC, use GroupBy.transform instead of apply + merge:
df.loc[df['c'].eq(df.groupby('a')['c'].transform(lambda x: x.mode()[0]))]
   a  b  c
1  1  5  0
2  1  7  0
4  2  2  1
5  2  4  1
6  3  6  1
8  3  5  1
Or compare each (a, c) group's size against the maximum group size within its a group (note this keeps all rows tied for the maximum count):
s = df.groupby(['a', 'c'])['c'].transform('size')
df.loc[s.eq(s.groupby(df['a']).transform('max'))]
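Another option, sketched here as my own addition (the name major is mine): compute each group's modal value once with agg, then map it back onto column a, so mode runs once per group rather than being merged back:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   "b": [1, 5, 7, 9, 2, 4, 6, 14, 5],
                   "c": [1, 0, 0, 1, 1, 1, 1, 0, 1]})

# one mode per group of a (mode()[0] picks the smallest value on ties)
major = df.groupby('a')['c'].agg(lambda s: s.mode()[0])

# keep the rows whose c equals their group's mode
out = df[df['c'].eq(df['a'].map(major))]
print(out)
```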

Python dataframe, drop everything after a certain record

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'counter': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
                   'status': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'a', 'a', 'b'],
                   'additional_data': [12, 35, 13, 523, 6, 12, 6, 1, 46, 236]},
                  columns=['id', 'counter', 'status', 'additional_data'])
df
Out[37]:
   id  counter status  additional_data
0   1        1      a               12
1   1        2      b               35
2   1        3      b               13
3   1        4      c              523
4   2        1      a                6
5   2        2      a               12
6   2        3      a                6
7   3        1      a                1
8   3        2      a               46
9   3        3      b              236
The id column indicates which rows belong together, counter gives the order of the rows, and status is a special status code. I want to drop all rows after the first occurrence of a row with status='b' within each id, keeping that first 'b' row.
The final output should look like this:
   id  counter status  additional_data
0   1        1      a               12
1   1        2      b               35
4   2        1      a                6
5   2        2      a               12
6   2        3      a                6
7   3        1      a                1
8   3        2      a               46
9   3        3      b              236
All help is, as always, greatly appreciated.
Use a custom function with idxmax, which returns the index of the first True value, and slice up to it (inclusive) so the first 'b' row is not lost:
def f(x):
    m = x['status'].eq('b')
    if m.any():
        # slice up to and including the first row where status == 'b'
        x = x.loc[:m.idxmax()]
    return x

a = df.groupby('id', group_keys=False).apply(f)
print(a)
   id  counter status  additional_data
0   1        1      a               12
1   1        2      b               35
4   2        1      a                6
5   2        2      a               12
6   2        3      a                6
7   3        1      a                1
8   3        2      a               46
9   3        3      b      236
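If the groupby/apply above is too slow, here is a vectorized sketch (my addition, not from the answer): count how many 'b' rows precede each row within its id and keep the rows where that count is zero, which retains the first 'b' itself:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'counter': [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
                   'status': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'a', 'a', 'b'],
                   'additional_data': [12, 35, 13, 523, 6, 12, 6, 1, 46, 236]})

is_b = df['status'].eq('b')
# number of 'b' rows seen strictly before the current row, within each id
prior_b = is_b.groupby(df['id']).cumsum() - is_b.astype(int)
out = df[prior_b.eq(0)]
print(out)
```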

My dataframe has a list as an index, how do I access a cell or edit a cell

My pandas dataframe looks like this:
             A    B    C    D    E
(Name1, 1) NaN  NaN  NaN  NaN  NaN
(Name2, 2) NaN  NaN  NaN  NaN  NaN
How do I access a particular cell or change its value?
I created the dataframe using this:
from itertools import product
id = list(product(array1, array2))
data = pd.DataFrame(index=id, columns=array3)
I think you need a MultiIndex:
np.random.seed(124)
array1 = np.array(['Name1', 'Name2'])
array2 = np.array([1, 2])
array3 = np.array(list('ABCDE'))
idx = pd.MultiIndex.from_product([array1, array2])
data = pd.DataFrame(np.random.randint(10, size=[len(idx), len(array3)]),
                    index=idx, columns=array3)
print(data)
         A  B  C  D  E
Name1 1  1  7  2  9  0
      2  4  4  5  5  6
Name2 1  9  6  0  8  9
      2  9  0  2  2  1
print (data.index)
MultiIndex(levels=[['Name1', 'Name2'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
data.loc[('Name1', 2), 'B'] = 20
print(data)
          A   B  C  D  E
Name1 1   1   7  2  9  0
      2   4  20  5  5  6
Name2 1   9   6  0  8  9
      2   9   0  2  2  1
For more complicated selections, use slicers:
idx = pd.IndexSlice
data.loc[idx['Name1', 2], 'B'] = 20
print(data)
          A   B  C  D  E
Name1 1   1   7  2  9  0
      2   4  20  5  5  6
Name2 1   9   6  0  8  9
      2   9   0  2  2  1
idx = pd.IndexSlice
print(data.loc[idx['Name1', 2], 'A'])
4
# select all rows with 2 in the second level, column A
idx = pd.IndexSlice
print(data.loc[idx[:, 2], 'A'])
Name1  2    4
Name2  2    9
Name: A, dtype: int32
# select 1 from the second level and slice between columns B and D
idx = pd.IndexSlice
print(data.loc[idx[:, 1], idx['B':'D']])
         B  C  D
Name1 1  7  2  9
Name2 1  6  0  8
For simpler selections use DataFrame.xs:
print(data.xs('Name1', axis=0, level=0))
   A  B  C  D  E
1  1  7  2  9  0
2  4  4  5  5  6
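As a small extension (my sketch, not from the original answer): xs can also select from the second level, and drop_level=False keeps both index levels in the result:

```python
import numpy as np
import pandas as pd

np.random.seed(124)
idx = pd.MultiIndex.from_product([['Name1', 'Name2'], [1, 2]])
data = pd.DataFrame(np.random.randint(10, size=(4, 5)),
                    index=idx, columns=list('ABCDE'))

# select all rows whose second index level equals 2 (that level is dropped)
print(data.xs(2, axis=0, level=1))

# drop_level=False keeps the full two-level index in the result
print(data.xs(2, axis=0, level=1, drop_level=False))
```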

How to sort rows in pandas with a non-standard order

I have a pandas dataframe, say:
df = pd.DataFrame ([['a', 3, 3], ['b', 2, 5], ['c', 4, 9], ['d', 1, 43]], columns = ['col 1' , 'col2', 'col 3'])
or:
  col 1  col2  col 3
0     a     3      3
1     b     2      5
2     c     4      9
3     d     1     43
If I want to sort by col2, I can use df.sort_values, which sorts ascending or descending.
However, if I want to sort the rows so that col2 is: [4, 2, 1, 3], how would I do that?
Try this:
sortMap = {4: 1, 2: 2, 1: 3, 3: 4}
df["new"] = df['col2'].map(sortMap)
df.sort_values('new', inplace=True)
df
  col 1  col2  col 3  new
2     c     4      9    1
1     b     2      5    2
3     d     1     43    3
0     a     3      3    4
An alternative way to build the mapping dict:
ll = [4, 2, 1, 3]
sortMap = dict(zip(ll, range(len(ll))))
One way is to convert that column to a Categorical type, which can have an arbitrary ordering.
In [51]: df['col2'] = df['col2'].astype('category', categories=[4, 1, 2, 3], ordered=True)
In [52]: df.sort_values('col2')
Out[52]:
  col 1  col2  col 3
2     c     4      9
3     d     1     43
1     b     2      5
0     a     3      3
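Note the astype(..., categories=...) signature above is from older pandas; current versions no longer accept categories through astype. The same idea can be written with CategoricalDtype (here using the question's order [4, 2, 1, 3]), roughly:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame([['a', 3, 3], ['b', 2, 5], ['c', 4, 9], ['d', 1, 43]],
                  columns=['col 1', 'col2', 'col 3'])

# ordered categorical encoding the custom order 4 < 2 < 1 < 3
cat = CategoricalDtype(categories=[4, 2, 1, 3], ordered=True)
out = df.assign(col2=df['col2'].astype(cat)).sort_values('col2')
print(out)
```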
alternative solution:
In [409]: lst = [4, 2, 1, 3]
In [410]: srt = pd.Series(np.arange(len(lst)), index=lst)
In [411]: srt
Out[411]:
4    0
2    1
1    2
3    3
dtype: int32
In [412]: df.assign(x=df.col2.map(srt))
Out[412]:
  col 1  col2  col 3  x
0     a     3      3  3
1     b     2      5  1
2     c     4      9  0
3     d     1     43  2
In [413]: df.assign(x=df.col2.map(srt)).sort_values('x')
Out[413]:
  col 1  col2  col 3  x
2     c     4      9  0
1     b     2      5  1
3     d     1     43  2
0     a     3      3  3
In [414]: df.assign(x=df.col2.map(srt)).sort_values('x').drop(columns='x')
Out[414]:
  col 1  col2  col 3
2     c     4      9
1     b     2      5
3     d     1     43
0     a     3      3
NOTE: I like @chrisb's solution more - it's much more elegant and will probably be faster.

cumcount groupby --- how to list by groups

My question is related to this question
import pandas as pd
df = pd.DataFrame(
    [['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
     ['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
     ['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
    columns=['c1', 'c2', 'v1'])
df['CNT'] = df.groupby(['c1', 'c2']).cumcount() + 1
I got column 'CNT'. But I'd like to break it apart according to group 'c2' to obtain cumulative count of 'X' and 'Y' respectively.
   c1 c2  v1  CNT  Xcnt  Ycnt
0   A  X   3    1     1     0
1   A  X   5    2     2     0
2   A  Y   7    1     2     1
3   A  Y   1    2     2     2
4   B  X   3    1     1     0
5   B  X   1    2     2     0
6   B  X   3    3     3     0
7   B  Y   1    1     3     1
8   C  X   7    1     1     0
9   C  Y   4    1     1     1
10  C  Y   1    2     1     2
11  C  Y   6    3     1     3
Any suggestions? I am just starting to explore Pandas and appreciate your help.
I don't know a way to do this directly, but starting from the calculated CNT column, you can do it as follows:
Make the Xcnt and Ycnt columns:
In [13]: df['Xcnt'] = df['CNT'][df['c2']=='X']
In [14]: df['Ycnt'] = df['CNT'][df['c2']=='Y']
In [15]: df
Out[15]:
   c1 c2  v1  CNT  Xcnt  Ycnt
0   A  X   3    1     1   NaN
1   A  X   5    2     2   NaN
2   A  Y   7    1   NaN     1
3   A  Y   1    2   NaN     2
4   B  X   3    1     1   NaN
5   B  X   1    2     2   NaN
6   B  X   3    3     3   NaN
7   B  Y   1    1   NaN     1
8   C  X   7    1     1   NaN
9   C  Y   4    1   NaN     1
10  C  Y   1    2   NaN     2
11  C  Y   6    3   NaN     3
Next, we want to fill the NaN's per group of c1 by forward filling:
In [23]: df['Xcnt'] = df.groupby('c1')['Xcnt'].fillna(method='ffill')
In [24]: df['Ycnt'] = df.groupby('c1')['Ycnt'].fillna(method='ffill').fillna(0)
In [25]: df
Out[25]:
   c1 c2  v1  CNT  Xcnt  Ycnt
0   A  X   3    1     1     0
1   A  X   5    2     2     0
2   A  Y   7    1     2     1
3   A  Y   1    2     2     2
4   B  X   3    1     1     0
5   B  X   1    2     2     0
6   B  X   3    3     3     0
7   B  Y   1    1     3     1
8   C  X   7    1     1     0
9   C  Y   4    1     1     1
10  C  Y   1    2     1     2
11  C  Y   6    3     1     3
For Ycnt an extra fillna was needed to convert the NaN's to 0's where a group started with NaN's (nothing to fill forward from).
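The same result can be reached in one pass with where and a grouped forward fill; a compact sketch (my addition, not from the answer):

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
     ['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
     ['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
    columns=['c1', 'c2', 'v1'])

cnt = df.groupby(['c1', 'c2']).cumcount() + 1
for val in ['X', 'Y']:
    # keep the running count only on matching rows, then carry it forward per c1
    df[val + 'cnt'] = (cnt.where(df['c2'].eq(val))
                          .groupby(df['c1']).ffill()
                          .fillna(0).astype(int))
print(df)
```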
