Can the following be improved upon?
It achieves the desired result of copying values from df2 to df1 wherever the index can be matched, but it seems inefficient and clunky.
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=pd.MultiIndex.from_tuples(['AB', 'AC']), columns=['X', 'Y', 'Z'])
df2 = pd.DataFrame([102, 103], index=pd.MultiIndex.from_tuples(['AC', 'AD']), columns=['Y'])
desired = df2.combine_first(df1).combine_first(df2)
print(df1)
print(df2)
print(desired)
Output:
df1
     X  Y  Z
A B  0  1  2
  C  3  4  5
df2
       Y
A C  102
  D  103
desired
       X    Y    Z
A B    0    1    2
  C    3  102    5
  D  NaN  103  NaN
The closest I could come to using slicing was
print(df1.loc[df2.index, df2.columns]) # This works, demonstrated lhs of below is OK
df1.loc[df2.index, df2.columns] = df2 # This fails, as does df2.values
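The assignment fails because df2.index contains ('A', 'D'), a label df1 does not have, and .loc cannot enlarge a frame through a list-like indexer. If you only need to update rows that already exist in df1 (giving up the new AD row), restricting both sides to the shared labels works; a minimal sketch:
common = df1.index.intersection(df2.index)
df1.loc[common, df2.columns] = df2.loc[common]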
Why not use merge?
>>> df3 = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
>>> df3
       X  Y_x    Z    Y_y
A B  0.0  1.0  2.0    NaN
  C  3.0  4.0  5.0  102.0
  D  NaN  NaN  NaN  103.0
>>> df3['Y'] = df3['Y_y'].combine_first(df3['Y_x'])
>>> df3.drop(['Y_x', 'Y_y'], axis=1)
       X    Z      Y
A B  0.0  2.0    1.0
  C  3.0  5.0  102.0
  D  NaN  NaN  103.0
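As an aside, the second combine_first in the question appears to be redundant for this example: df2.combine_first(df1) alone already yields the desired frame, since df2's values take precedence wherever both frames have one and df1 fills the gaps over the union of the indexes.
desired = df2.combine_first(df1)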
Related
I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
                    'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after the OP's edit)
You were close, but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer
Use sort_values with the na_position='last' argument. Note that this sorts the non-null values too, so it only gives the order you want when column a is already sorted; the output below appears to be based on the question before its edit, when the third value was still 3 rather than 10:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
2 3.0 c
4 5.0 e
5 6.0 f
7 8.0 h
3 NaN d
6 NaN g
There is no built-in function for this in pandas yet. Use Series.isna with Series.argsort to get the new positions, then change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
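One caveat not mentioned in the original answer: Series.argsort defaults to an unstable quicksort, so on larger frames the relative order of the non-NaN rows is not guaranteed. Passing kind='mergesort' makes the sort stable:
df1 = df1.iloc[df1['a'].isna().argsort(kind='mergesort')].reset_index(drop=True)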
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp')
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
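On pandas 1.1 or newer, sort_values also takes a key argument, which lets you sort by NaN-ness alone (a sketch; kind='mergesort' keeps the tied rows in their original order):
df1 = (df1.sort_values('a', key=lambda s: s.isna(), kind='mergesort')
          .reset_index(drop=True))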
The objective is to get a table with the values for each T1-T2 pair. I have data in the form:
df
T1 T2 Score
0 A B 5
1 A C 8
2 B C 4
I tried:
df.pivot_table('Score','T1','T2')
T2    B    C
T1
A   5.0  8.0
B   NaN  4.0
I expected:
   A  B  C
A     5  8
B  5     4
C  8  4
So it's kind of like a correlation table, I think, because the A-B pair is the same as B-A in this case.
First build the full set of labels, reindex both pivot tables to it (the second pivot with T1 and T2 swapped), and finally combine them with combine_first:
idx = np.unique(df[['T1','T2']].values.ravel())
df1 = df.pivot_table('Score','T1','T2').reindex(index=idx, columns=idx)
df2 = df.pivot_table('Score','T2','T1').reindex(index=idx, columns=idx)
df = df1.combine_first(df2)
print (df)
      A    B    C
T1
A   NaN  5.0  8.0
B   5.0  NaN  4.0
C   8.0  4.0  NaN
Another method using merge:
df1 = df.pivot_table('Score','T1','T2')
df2 = df.pivot_table('Score','T2','T1')
common_val = np.intersect1d(df['T1'].unique(), df['T2'].unique()).tolist()
df = df1.merge(df2, how='outer', left_index=True, right_index=True, on=common_val)
print(df)
B C A
A 5.0 8.0 NaN
B NaN 4.0 5.0
C 4.0 NaN 8.0
Another way:
In [11]: df1 = df.set_index(['T1', 'T2']).unstack(1)
In [12]: df1.columns = df1.columns.droplevel(0)
In [13]: df2 = df1.reindex(index=df1.index | df1.columns, columns=df1.index | df1.columns)
In [14]: df2.update(df2.T)
In [15]: df2
Out[15]:
     A    B    C
A  NaN  5.0  8.0
B  5.0  NaN  4.0
C  8.0  4.0  NaN
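A more compact alternative, not taken from the answers above (a sketch): concatenate the frame with a copy whose T1 and T2 labels are swapped, then pivot once. Since every pair then appears in both orientations, the result is symmetric by construction:
out = (pd.concat([df, df.rename(columns={'T1': 'T2', 'T2': 'T1'})])
         .pivot_table('Score', 'T1', 'T2'))
print(out)
T2    A    B    C
T1
A   NaN  5.0  8.0
B   5.0  NaN  4.0
C   8.0  4.0  NaN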
I have a df1, example:
   B  A  C
B     1
A        1
C  2
and a df2, example:
   C  E  D
C     2  3
E        1
D  2
The column and row 'C' are common to both dataframes.
I would like to combine these dataframes such that I get:
   B  A  C  D  E
B     1
A        1
C  2        3  2
D        2
E           1
Is there an easy way to do this? pd.concat and DataFrame.append do not seem to work. Thanks!
Edit: df1.combine_first(df2) works (thanks @jezrael), but can we keep the original ordering?
The problem is that combine_first always sorts the columns and index, so you need to reindex with the combined column names:
idx = df1.columns.append(df2.columns).unique()
print (idx)
Index(['B', 'A', 'C', 'E', 'D'], dtype='object')
df = df1.combine_first(df2).reindex(index=idx, columns=idx)
print (df)
     B    A    C    E    D
B  NaN  1.0  NaN  NaN  NaN
A  NaN  NaN  1.0  NaN  NaN
C  2.0  NaN  NaN  2.0  3.0
E  NaN  NaN  NaN  NaN  1.0
D  NaN  NaN  2.0  NaN  NaN
A more general solution:
c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)
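For reference, one way to reproduce the example frames above (a sketch; the dtypes are assumed to be float because of the missing values):
df1 = pd.DataFrame([[np.nan, 1, np.nan],
                    [np.nan, np.nan, 1],
                    [2, np.nan, np.nan]],
                   index=list('BAC'), columns=list('BAC'))
df2 = pd.DataFrame([[np.nan, 2, 3],
                    [np.nan, np.nan, 1],
                    [2, np.nan, np.nan]],
                   index=list('CED'), columns=list('CED'))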
I converted a list into a DataFrame and now my data looks like this.
I want to use the unique Business ID to merge two rows in this DataFrame. How can I do this?
Use first in a groupby to get the first non-null value per column.
Consider the data frame df
df = pd.DataFrame(dict(
    Bars=[np.nan, 1, 1, np.nan],
    BusID=list('AABB'),
    Nightlife=[1, np.nan, np.nan, 1]
))
df
Bars BusID Nightlife
0 NaN A 1.0
1 1.0 A NaN
2 1.0 B NaN
3 NaN B 1.0
Then
df.groupby('BusID', as_index=False).first()
BusID Bars Nightlife
0 A 1.0 1.0
1 B 1.0 1.0
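Note that first takes the first non-null value per column within each group (NaN is skipped), which is why both Bars and Nightlife come out filled; it is not simply the first row of each group.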
You could use something like df.groupby('Business ID').sum(). As an example:
df = pd.DataFrame(data={'a': [1, 2, 3, 1],
                        'b': [5, 6, None, None],
                        'c': [None, None, 7, 8]})
df
# a b c
# 0 1 5.0 NaN
# 1 2 6.0 NaN
# 2 3 NaN 7.0
# 3 1 NaN 8.0
new_df = df.groupby('a').sum()
new_df
# b c
# a
# 1 5.0 8.0
# 2 6.0 0.0
# 3 0.0 7.0
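As the output shows, sum() turns the all-NaN groups into 0.0. If you would rather keep NaN in those cells, groupby's sum accepts a min_count argument:
new_df = df.groupby('a').sum(min_count=1)
new_df
#      b    c
# a
# 1  5.0  8.0
# 2  6.0  NaN
# 3  NaN  7.0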
Assuming there is a pandas.DataFrame like:
pd.DataFrame([[np.nan,np.nan],[[1,2],[3,4]],[[11,22],[33,44]]],columns=['A','B'])
What's the easiest way to produce two pandas.DataFrames that contain, respectively, the 1st and 2nd element of every list value in the frame (NaN if the position is NaN)?
pd.DataFrame([[np.nan,np.nan],[1,3],[11,33]],columns=['A','B'])
pd.DataFrame([[np.nan,np.nan],[2,4],[22,44]],columns=['A','B'])
You can use:
# replace NaN with [] - a bit of a hack
df = df.mask(df.isnull(), pd.Series([[]] * len(df.columns), index=df.columns), axis=1)
print (df)
A B
0 [] []
1 [1, 2] [3, 4]
2 [11, 22] [33, 44]
# create a new df from each column, then concatenate them together
df3 = pd.concat([pd.DataFrame(df[col].values.tolist()) for col in df],
                axis=1,
                keys=df.columns)
print (df3)
       A            B
       0     1      0     1
0    NaN   NaN    NaN   NaN
1    1.0   2.0    3.0   4.0
2   11.0  22.0   33.0  44.0
# select each element position with xs
df1 = df3.xs(0, level=1, axis=1)
print (df1)
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
df2 = df3.xs(1, level=1, axis=1)
print (df2)
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0
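A shorter route, not from the original answer (a sketch): the .str accessor's positional indexing also works element-wise on lists and returns NaN wherever indexing fails, including the NaN cells of the original frame:
df1 = df.apply(lambda col: col.str[0])
df2 = df.apply(lambda col: col.str[1])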
You can do what you need with a function that returns the n-th element of each value in a row.
Code:
def row_element(elem_num):
    # Build a row function which extracts the elem_num-th entry from each
    # value in the row, passing through values (such as NaN) that cannot
    # be indexed.
    def func(row):
        ret = []
        for item in row:
            try:
                ret.append(item[elem_num])
            except TypeError:
                ret.append(item)
        return ret
    return func
Test Code:
df = pd.DataFrame(
    [[np.nan, np.nan], [[1, 2], [3, 4]], [[11, 22], [33, 44]]],
    columns=['A', 'B'])
print(df)
print(df.apply(row_element(0), axis=1))
print(df.apply(row_element(1), axis=1))
Results:
A B
0 NaN NaN
1 [1, 2] [3, 4]
2 [11, 22] [33, 44]
A B
0 NaN NaN
1 1.0 3.0
2 11.0 33.0
A B
0 NaN NaN
1 2.0 4.0
2 22.0 44.0