Pandas MultiIndex: add new indexes for each existing index - python

I have a dataframe that looks like this:
   ID Type
0  a1    y
1  a1    y
2  a2    y
3  a2    n
4  a3    n
I want to re-index it to look like this:
ID Subindex Type
a1 1        y
   2        y
a2 1        y
   2        n
a3 1        n
Is there any command in pandas that could do this? Thank you so much!

To number the items in each group, use cumcount:
import pandas as pd

df = pd.DataFrame({'ID': ['a1', 'a1', 'a2', 'a2', 'a3'],
                   'Type': ['y', 'y', 'y', 'n', 'n']})
df['Subindex'] = df.groupby('ID').cumcount() + 1
print(df)
yields
   ID Type  Subindex
0  a1    y         1
1  a1    y         2
2  a2    y         1
3  a2    n         2
4  a3    n         1
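If you then want the MultiIndex display from the question, a set_index call on the two columns should produce it (a minimal sketch building on the answer above):
df = df.set_index(['ID', 'Subindex'])  # 'ID' and 'Subindex' become the row MultiIndex
print(df)
This leaves only the Type column, with each ID shown once per group, as in the desired output.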

Related

Add a list to a dataframe in Python

I have a df
   col1         col2
0     1    ONE AAKLD
1     2     TWO ERBB
2     3  THE COCCNUT
3     4     WOW AACE
and I have the following lists
list1 = ['a1', 'a2', 'a3']
list2 = ['b1', 'b2', 'b3']
list3 = ['c1', 'c2', 'c3']
I want to add the list values to different columns of particular rows in the df based on a condition, i.e., if col2 in the df contains AA, then list1 should be appended to that row, and so on.
Expected output:
   col1         col2   1   2   3
0     1    ONE AAKLD  a1  a2  a3
1     2     TWO ERBB  b1  b2  b3
2     3  THE COCCNUT  c1  c2  c3
3     4     WOW AACE  a1  a2  a3
Thanks
The code below produces the required output.
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['AA', 'BB', 'CC', 'AA']})
list1 = ['a1', 'a2', 'a3']
list2 = ['b1', 'b2', 'b3']
list3 = ['c1', 'c2', 'c3']

df = pd.DataFrame([list1, list2, list3], columns=['1', '2', '3'])
# Build the join key from column '1': take the first character,
# uppercase it, and double it ('a1' -> 'AA', 'b1' -> 'BB', 'c1' -> 'CC')
df['col2'] = df['1'].str.split('', expand=True)[1].str.upper() * 2
pd.merge(df_1, df, left_on='col2', right_on='col2')
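For readability, the same key can be built without the split trick (a sketch; .str[0] takes each string's first character):
df['col2'] = df['1'].str[0].str.upper() * 2  # 'a1' -> 'AA', 'b1' -> 'BB', 'c1' -> 'CC'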
Update as per OP's comments below.
Building upon the earlier merge logic; you can easily change the matching criteria if the need arises.
import pandas as pd
import numpy as np

## Data Prep
df_1 = pd.DataFrame({'col1': [1, 2, 3, 4],
                     'col2': ['ONE AAKLD', 'TWO ERBB', 'THE COCCNUT', 'WOW AACE']})
df_1['join_col'] = 1
list1 = ['a1', 'a2', 'a3']
list2 = ['b1', 'b2', 'b3']
list3 = ['c1', 'c2', 'c3']

## Logic
df = pd.DataFrame([list1, list2, list3], columns=['1', '2', '3'])
df['match_col'] = df['1'].str.split('', expand=True)[1].str.upper() * 2
df['join_col'] = 1
# Cross join on the constant join_col, then keep rows whose match_col
# appears as a substring of col2
df_2 = pd.merge(df_1, df, left_on='join_col', right_on='join_col')
df_2['Is_Match'] = df_2[['col2', 'match_col']].apply(lambda x: x[1] in x[0], axis=1)
df_2[df_2['Is_Match'] == True][['col1', 'col2', '1', '2', '3']]
Output: the four matched rows with columns col1, col2, 1, 2, 3, as in the expected table above.
Another option is using map:
d = {'AA': list1, 'BB': list2, 'CC': list3}
# map each col2 value to its list, then expand the lists into columns
df_1[[1, 2, 3]] = df_1['col2'].map(d).agg(pd.Series)
print(df_1)
   col1 col2   1   2   3
0     1   AA  a1  a2  a3
1     2   BB  b1  b2  b3
2     3   CC  c1  c2  c3
3     4   AA  a1  a2  a3
UPD
It's not absolutely clear what your data is; anyway, you can try this (it won't work properly if a string contains more than one doubled character):
# the regex captures the first doubled uppercase letter, e.g. 'AA'
df_1[[1, 2, 3]] = df_1['col2'].str.extract(r'(([A-Z])\2)')[0].map(d).agg(pd.Series)
>>> df_1
   col1         col2   1   2   3
0     1    ONE AAKLD  a1  a2  a3
1     2     TWO ERBB  b1  b2  b3
2     3  THE COCCNUT  c1  c2  c3
3     4     WOW AACE  a1  a2  a3
You can play with pd.concat() and Pandas' Series.
df = pd.DataFrame({"col1":[1,2,3],"col2":[4,5,6]})
lst = pd.Series([7,8,9])
pd.concat((df,lst),axis=1)
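Note that the concatenated Series shows up as a column named 0; giving it a name first yields a proper label (a small sketch; the name 'col3' is mine). Also, pd.concat aligns on the index, so the Series should share the DataFrame's index:
pd.concat((df, lst.rename('col3')), axis=1)  # 'col3' is an assumed column name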

How to map count of groups to another dataframe in pandas

I have 2 dataframes which look like this:
df1 = pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
                    'B': ['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2', 'C2'],
                    'rank': [2, 5, 1, 8, 6, 3, 4, 7]})
Out[3]:
   A   B  rank
0  A  C1     2
1  B  C1     5
2  C  C1     1
3  D  C2     8
4  E  C2     6
5  F  C2     3
6  G  C2     4
7  H  C2     7
df2 = pd.DataFrame({'B': ['C1', 'C1', 'C1', 'C2'],
                    'C': [1, 2, 3, 4]})
Out[6]:
    B  C
0  C1  1
1  C1  2
2  C1  3
3  C2  4
I would like to select the 3 highest-ranked rows (by column "rank") of df1, but I can select a maximum of 4 names per group (column B), and this cap must include the count of rows each group already has in df2.
The resulting dataframe should look like this:
   A   B  rank
2  C  C1     1
5  F  C2     3
6  G  C2     4
Logic:
The count of rows in df2 for group C1 is 3 (leaving a maximum of 1 more row to select from this group in df1), and the count for C2 is 1 (leaving a maximum of 3 rows to select from df1).
Item C has the highest rank, so it gets selected; the total count of group C1 is now 4.
Items F and G are the next highest ranked and are part of group C2; that group's total count becomes 3, so less than 4.
I tried the following:
df1.sort_values('rank').groupby('B').head(4).head(5)
but this only caps each group in B at 4 rows within df1 and ignores the counts in df2.
Here's an idea:
max_per_group = 4
# maximal number of rows still available to pick from each group
max_sizes = max_per_group - df2.groupby('B').size()
# at most 4 best-ranked rows from each group
heads = df1.sort_values('rank').groupby('B').head(max_per_group)
# enumerate the rows within each group
enum = heads.groupby('B').cumcount()
# keep rows whose position is below the group's remaining quota
heads[enum < heads['B'].map(max_sizes).fillna(max_per_group)].head(3)
Output:
   A   B  rank
2  C  C1     1
5  F  C2     3
6  G  C2     4
First, find the number remaining by group:
In [4]: remaining = (4 - df2.groupby('B').size()).to_dict()  # {'C1': 1, 'C2': 3}
Then, select that number from each sorted group in your groupby:
In [5]: (
   ...:     df1.sort_values('rank').groupby('B').apply(
   ...:         lambda x: x.sort_values('rank').head(remaining.get(x.name, 4))
   ...:     ).sort_values('rank').iloc[:3].reset_index('B', drop=True)
   ...: )
Out[5]:
   A   B  rank
2  C  C1     1
5  F  C2     3
6  G  C2     4
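The same idea fits in a small reusable helper (a sketch; the function name select_with_caps and its defaults are mine):
def select_with_caps(df1, df2, n=3, max_per_group=4):
    used = df2.groupby('B').size()           # rows each group already has in df2
    ranked = df1.sort_values('rank')
    pos = ranked.groupby('B').cumcount()     # 0-based position within each group
    room = ranked['B'].map(max_per_group - used).fillna(max_per_group)
    return ranked[pos < room].head(n)

select_with_caps(df1, df2)  # returns the same three rows as above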

How can I find the "set difference" of rows in two dataframes on a subset of columns in Pandas?

I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A    1    2    AA
B    1    3    A
A    3    2    B
df2
C1 | C2 | C3 | C4
A    1    3    E
B    1    2    C
Q    4    1    Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on the specified columns removed):
C1 | C2 | C3 | C4
A    3    2    B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'],
                    'C2': [1, 1, 3],
                    'C3': [2, 3, 2],
                    'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'],
                    'C2': [1, 1, 4],
                    'C3': [3, 2, 1],
                    'C4': ['E', 'C', 'Z']})
Adding a key and using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1', 'C2'])
# keep only df1 rows whose key does not appear among the common rows
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
You can use an anti-join method: do an outer join on the specified columns and have merge indicate each row's source. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
  C1  C2  C3_x C4_x  C3_y C4_y      _merge
0  A   1   2.0   AA   3.0    E        both
1  B   1   3.0    A   2.0    C        both
2  A   3   2.0    B   NaN  NaN   left_only
3  Q   4   NaN  NaN   1.0    Z  right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
  C1  C2   C3 C4
2  A   3  2.0  B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
  C1  C2   C3 C4
3  Q   4  1.0  Z
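A variant that avoids the rename/drop step is to merge only the key columns of df2 (a sketch of the same anti-join idea; the name df1_only is mine):
>>> df1_only = df1.merge(df2[['C1', 'C2']].drop_duplicates(), on=['C1', 'C2'],
...                      how='left', indicator=True)
>>> df1_only[df1_only['_merge'] == 'left_only'].drop(columns='_merge')
  C1  C2  C3  C4
2  A   3   2   B
Because only C1 and C2 come from df2, no _x/_y suffixes are created and C3/C4 keep their original dtypes.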
import pandas as pd

df1 = pd.DataFrame({'C1': ['A', 'B', 'A'], 'C2': [1, 1, 3], 'C3': [2, 3, 2], 'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({'C1': ['A', 'B', 'Q'], 'C2': [1, 1, 4], 'C3': [3, 2, 1], 'C4': ['E', 'C', 'Z']})
common = pd.merge(df1, df2, on=['C1', 'C2'])
# drop rows whose C1 and C2 values both occur among the common rows
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]
Caveat: the two isin checks are independent, so a row can be wrongly excluded when its C1 matches one common row and its C2 matches a different one; the indicator-based anti-join above avoids this.
df1:
  C1  C2  C3  C4
0  A   1   2  AA
1  B   1   3   A
2  A   3   2   B
df2:
  C1  C2  C3  C4
0  A   1   3   E
1  B   1   2   C
2  Q   4   1   Z
common:
  C1  C2  C3_x C4_x  C3_y C4_y
0  A   1     2   AA     3    E
1  B   1     3    A     2    C
R1:
  C1  C2  C3  C4
2  A   3   2   B
R2:
  C1  C2  C3  C4
2  Q   4   1   Z

Find intersection of two sets of columns in python pandas dataframe for each row without looping

I have the following pandas.DataFrame:
df = pd.DataFrame({'A1': ['a', 'a', 'd'], 'A2': ['b', 'c', 'c'],
                   'B1': ['d', 'a', 'c'], 'B2': ['e', 'd', 'e']})
  A1 A2 B1 B2
0  a  b  d  e
1  a  c  a  d
2  d  c  c  e
I would like to choose the rows in which the values in A1 and A2 differ from those in B1 and B2, i.e., rows where the intersection of the values in ['A1', 'A2'] and ['B1', 'B2'] is empty; in the above example only row 0 should be chosen.
So far the best I could do is to loop over every row of my data frame with the following code:
for i in df.index.values:
    if df.loc[i, ['A1', 'A2']].isin(df.loc[i, ['B1', 'B2']]).sum() > 0:
        df = df.drop(i, 0)
Is there a way to do this without looping?
You can test for that directly, like this:
Code:
df[(df.A1 != df.B1) & (df.A2 != df.B2) & (df.A1 != df.B2) & (df.A2 != df.B1)]
Test Code:
df = pd.DataFrame({'A1': ['a', 'a', 'd'], 'A2': ['b', 'c', 'c'],
                   'B1': ['d', 'a', 'c'], 'B2': ['e', 'd', 'e']})
print(df)
print(df[(df.A1 != df.B1) & (df.A2 != df.B2) &
         (df.A1 != df.B2) & (df.A2 != df.B1)])
Results:
  A1 A2 B1 B2
0  a  b  d  e
1  a  c  a  d
2  d  c  c  e

  A1 A2 B1 B2
0  a  b  d  e
By using intersection:
df['Key1'] = df[['A1', 'A2']].values.tolist()
df['Key2'] = df[['B1', 'B2']].values.tolist()
df.apply(lambda x: len(set(x['Key1']).intersection(x['Key2'])) == 0, axis=1)
Out[517]:
0     True
1    False
2    False
dtype: bool
df[df.apply(lambda x: len(set(x['Key1']).intersection(x['Key2'])) == 0, axis=1)].drop(['Key1', 'Key2'], 1)
Out[518]:
  A1 A2 B1 B2
0  a  b  d  e
In today's edition of Way More Complicated Than It Needs To Be:
Chapter 1
We bring you map, generators, and set logic.
mask = list(map(lambda x: not bool(x),
                (set.intersection(*map(set, pair))
                 for pair in df.values.reshape(-1, 2, 2).tolist())))
df[mask]
  A1 A2 B1 B2
0  a  b  d  e
Chapter 2
NumPy broadcasting
v = df.values
# v[:, :2, None] has shape (n, 2, 1) and v[:, None, 2:] has shape (n, 1, 2),
# so broadcasting compares every A value against every B value in each row
df[(v[:, :2, None] != v[:, None, 2:]).all((1, 2))]
  A1 A2 B1 B2
0  a  b  d  e

How to convert column names into column values in pandas - python

df = pd.DataFrame(index=['x', 'y'], data={'a': [1, 2], 'b': [3, 4]})
How can I convert the column names into the values of a column? This is my desired output:
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
You can use:
print (df.T.unstack().reset_index(level=1, name='c1')
         .rename(columns={'level_1': 'c2'})[['c1', 'c2']])
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
Or:
print (df.stack().reset_index(level=1, name='c1')
         .rename(columns={'level_1': 'c2'})[['c1', 'c2']])
   c1 c2
x   1  a
x   3  b
y   2  a
y   4  b
Try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1': 'c2', 0: 'c1'})
Out[279]:
        c2  c1
level_0
x        a   1
x        b   3
y        a   2
y        b   4
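To drop the leftover level_0 index name, a rename_axis(None) can be chained on (a sketch):
df.stack().reset_index().set_index('level_0').rename(columns={'level_1': 'c2', 0: 'c1'}).rename_axis(None)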
Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1
In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
    ...:    .set_index('index'))
Out[62]:
      c2  c1
index
x      a   1
y      a   2
x      b   3
y      b   4
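If the row order from the question (x, x, y, y) matters, a sort_index() after the melt should restore it (a sketch; pandas' stable sort keeps the a/b order within each index label):
(pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
   .set_index('index')
   .sort_index()[['c1', 'c2']])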
