pandas combine_first but always overwrite - python

Can the following be improved upon? It achieves the desired result of copying values from df2 to df1 where the index can be matched, but it seems inefficient and clunky.
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=pd.MultiIndex.from_tuples(['AB', 'AC']), columns=['X', 'Y', 'Z'])
df2 = pd.DataFrame([102, 103], index=pd.MultiIndex.from_tuples(['AC', 'AD']), columns=['Y'])
desired = df2.combine_first(df1).combine_first(df2)
print(df1)
print(df2)
print(desired)
Output:
df1
     X  Y  Z
A B  0  1  2
  C  3  4  5
df2
       Y
A C  102
  D  103
desired
       X    Y    Z
A B    0    1    2
  C    3  102    5
  D  NaN  103  NaN
The closest I could come to using slicing was
print(df1.loc[df2.index, df2.columns])  # this works, demonstrating the LHS below is OK
df1.loc[df2.index, df2.columns] = df2   # this fails, as does assigning df2.values

Why not use merge?
>>> df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
>>> df3
       X  Y_x    Z    Y_y
A B  0.0  1.0  2.0    NaN
  C  3.0  4.0  5.0  102.0
  D  NaN  NaN  NaN  103.0
>>> df3['Y'] = df3['Y_y'].combine_first(df3['Y_x'])
>>> df3.drop(['Y_x', 'Y_y'], axis=1)
       X    Z      Y
A B  0.0  2.0    1.0
  C  3.0  5.0  102.0
  D  NaN  NaN  103.0
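For the literal "always overwrite" behaviour in the title, a minimal sketch (my addition, not from the answers above): reindex df1 to the union of both indexes, then let DataFrame.update overwrite in place with df2's non-NA values.
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                   index=pd.MultiIndex.from_tuples([('A', 'B'), ('A', 'C')]),
                   columns=['X', 'Y', 'Z'])
df2 = pd.DataFrame([102, 103],
                   index=pd.MultiIndex.from_tuples([('A', 'C'), ('A', 'D')]),
                   columns=['Y'])

# grow df1 so that every row of df2 exists, then overwrite in place
desired = df1.reindex(index=df1.index.union(df2.index))
desired.update(df2)
print(desired)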

Related

Move Null rows to the bottom of the dataframe

I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
                    'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after the OP's edit)
You were close but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer (from before the question was edited; note that sort_values sorts the non-null values too)
Use sort_values with the na_position='last' argument:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
2 3.0 c
4 5.0 e
5 6.0 f
7 8.0 h
3 NaN d
6 NaN g
There is no built-in for this in pandas yet; use Series.isna with Series.argsort to get the positions, then change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp')
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
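On pandas 1.1+ there is also a one-liner via the key argument of sort_values; a sketch (my addition), using a stable sort so the original order inside each group is kept:
df1 = df1.sort_values('a', key=lambda s: s.isna(), kind='stable').reset_index(drop=True)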

Pivot Table to fill pairs of observation in pandas

The objective is to get a table with values of pair T1-T2. I have data in form of:
df
  T1 T2  Score
0  A  B      5
1  A  C      8
2  B  C      4
I tried:
df.pivot_table('Score','T1','T2')
T2    B    C
T1
A   5.0  8.0
B   NaN  4.0
I expected:
   A  B  C
A     5  8
B  5     4
C  8  4
So it's kind of like a correlation table, since the A-B pair is the same as the B-A pair in this case.
First add all possible index and column values with reindex, create another pivot with T1 and T2 swapped, and finally use combine_first:
idx = np.unique(df[['T1','T2']].values.ravel())
df1 = df.pivot_table('Score','T1','T2').reindex(index=idx, columns=idx)
df2 = df.pivot_table('Score','T2','T1').reindex(index=idx, columns=idx)
df = df1.combine_first(df2)
print (df)
      A    B    C
T1
A   NaN  5.0  8.0
B   5.0  NaN  4.0
C   8.0  4.0  NaN
Another method using merge:
df1 = df.pivot_table('Score','T1','T2')
df2 = df.pivot_table('Score','T2','T1')
common_val = np.intersect1d(df['T1'].unique(), df['T2'].unique()).tolist()
df = df1.merge(df2, how='outer', left_index=True, right_index=True, on=common_val)
print(df)
     B    C    A
A  5.0  8.0  NaN
B  NaN  4.0  5.0
C  4.0  NaN  8.0
Another way:
In [11]: df1 = df.set_index(['T1', 'T2']).unstack(1)
In [12]: df1.columns = df1.columns.droplevel(0)
In [13]: df2 = df1.reindex(index=df1.index | df1.columns, columns=df1.index | df1.columns)
In [14]: df2.update(df2.T)
In [15]: df2
Out[15]:
     A    B    C
A  NaN  5.0  8.0
B  5.0  NaN  4.0
C  8.0  4.0  NaN
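Yet another sketch (my addition, building on the same idea): append the pairs with T1 and T2 swapped and pivot once, so the table is symmetric by construction:
import pandas as pd

df = pd.DataFrame({'T1': ['A', 'A', 'B'],
                   'T2': ['B', 'C', 'C'],
                   'Score': [5, 8, 4]})

# every pair in both orientations, then a single pivot
both = pd.concat([df, df.rename(columns={'T1': 'T2', 'T2': 'T1'})])
print(both.pivot_table('Score', 'T1', 'T2'))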

Merging/Combining Dataframes in Pandas

I have a df1, example:
   B  A  C
B     1
A        1
C  2
and a df2, example:
   C  E  D
C     2  3
E        1
D  2
The column and row 'C' is common in both dataframes.
I would like to combine these dataframes such that I get:
   B  A  C  D  E
B     1
A        1
C  2        3  2
D        2
E           1
Is there an easy way to do this? pd.concat and pd.append do not seem to work. Thanks!
Edit: df1.combine_first(df2) works (thanks @jezrael), but can we keep the original ordering?
The problem is that combine_first always sorts the column and index names, so you need reindex with the combined column names:
idx = df1.columns.append(df2.columns).unique()
print (idx)
Index(['B', 'A', 'C', 'E', 'D'], dtype='object')
df = df1.combine_first(df2).reindex(index=idx, columns=idx)
print (df)
B A C E D
B NaN 1.0 NaN NaN NaN
A NaN NaN 1.0 NaN NaN
C 2.0 NaN NaN 2.0 3.0
E NaN NaN NaN NaN 1.0
D NaN NaN 2.0 NaN NaN
A more general solution:
c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)
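For reference, a reconstruction (mine, inferred from the tables above) of the example frames, so the snippets can be run as-is:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'B': [np.nan, np.nan, 2.0],
                    'A': [1.0, np.nan, np.nan],
                    'C': [np.nan, 1.0, np.nan]},
                   index=['B', 'A', 'C'])
df2 = pd.DataFrame({'C': [np.nan, np.nan, 2.0],
                    'E': [2.0, np.nan, np.nan],
                    'D': [3.0, 1.0, np.nan]},
                   index=['C', 'E', 'D'])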

Combine two rows in Dataframe using a Unique Value

I converted a list into a DataFrame and now my data looks like this.
I want to use the unique Business ID to merge two rows in this DataFrame. How can I do this?
Use first in a groupby to get the first non-null value:
Consider the data frame df
df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
   Bars BusID  Nightlife
0   NaN     A        1.0
1   1.0     A        NaN
2   1.0     B        NaN
3   NaN     B        1.0
Then
df.groupby('BusID', as_index=False).first()
  BusID  Bars  Nightlife
0     A   1.0        1.0
1     B   1.0        1.0
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
#    a    b    c
# 0  1  5.0  NaN
# 1  2  6.0  NaN
# 2  3  NaN  7.0
# 3  1  NaN  8.0
new_df = df.groupby('a').sum()
new_df
#      b    c
# a
# 1  5.0  8.0
# 2  6.0  0.0
# 3  0.0  7.0
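One caveat with the sum approach (my note, not from the original answer): all-NaN groups come out as 0.0, as rows 2 and 3 above show. On pandas 0.22+ you can pass min_count=1 to keep them as NaN:
new_df = df.groupby('a').sum(min_count=1)
#      b    c
# a
# 1  5.0  8.0
# 2  6.0  NaN
# 3  NaN  7.0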

retrieve list element from pandas dataframe column

Assuming there is a pandas.DataFrame like:
pd.DataFrame([[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]], columns=['A', 'B'])
What's the easiest way to produce two pandas.DataFrames that contain the 1st and 2nd element, respectively, of every list value in the frame (NaN where the cell is NaN)?
pd.DataFrame([[np.nan,np.nan],[1,3],[11,33]],columns=['A','B'])
pd.DataFrame([[np.nan,np.nan],[2,4],[22,44]],columns=['A','B'])
You can use:
# replace NaN with [] - a bit of a hack
df = df.mask(df.isnull(), pd.Series([[]] * len(df.columns), index=df.columns), axis=1)
print (df)
A B
0 [] []
1 [1, 2] [3, 4]
2 [11, 22] [33, 44]
# create a new df from each column, concatenate them together
df3 = pd.concat([pd.DataFrame(df[col].values.tolist()) for col in df],
                axis=1,
                keys=df.columns)
print (df3)
      A           B
      0     1     0     1
0   NaN   NaN   NaN   NaN
1   1.0   2.0   3.0   4.0
2  11.0  22.0  33.0  44.0
# select by xs
df1 = df3.xs(0, level=1, axis=1)
print (df1)
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
df2 = df3.xs(1, level=1, axis=1)
print (df2)
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0
You can do what you need with a function that returns the n-th element of each cell in a row.
Code:
def row_element(elem_num):
    def func(row):
        ret = []
        for item in row:
            try:
                ret.append(item[elem_num])
            except TypeError:  # NaN is a float and not subscriptable; keep it as-is
                ret.append(item)
        return ret
    return func
Test Code:
df = pd.DataFrame(
    [[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]],
    columns=['A', 'B'])
print(df)
print(df.apply(row_element(0), axis=1))
print(df.apply(row_element(1), axis=1))
Results:
          A         B
0       NaN       NaN
1    [1, 2]    [3, 4]
2  [11, 22]  [33, 44]
      A     B
0   NaN   NaN
1   1.0   3.0
2  11.0  33.0
      A     B
0   NaN   NaN
1   2.0   4.0
2  22.0  44.0
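A shorter alternative sketch (my addition): the Series.str accessor also indexes into list-valued cells, and NaN simply passes through:
df1 = df.apply(lambda col: col.str[0])  # first element of every list cell
df2 = df.apply(lambda col: col.str[1])  # second element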
