Merging/Combining Dataframes in Pandas - python

I have a df1, for example:
B A C
B 1
A 1
C 2
and a df2, for example:
C E D
C 2 3
E 1
D 2
The column and row 'C' is common in both dataframes.
I would like to combine these dataframes such that I get,
B A C D E
B 1
A 1
C 2 2 3
D 1
E 2
Is there an easy way to do this? pd.concat and pd.append do not seem to work. Thanks!
Edit: df1.combine_first(df2) works (thanks @jezrael), but can we keep the original ordering?

The problem is that combine_first always sorts the column names and the index, so you need to reindex with the combined column names:
idx = df1.columns.append(df2.columns).unique()
print (idx)
Index(['B', 'A', 'C', 'E', 'D'], dtype='object')
df = df1.combine_first(df2).reindex(index=idx, columns=idx)
print (df)
B A C E D
B NaN 1.0 NaN NaN NaN
A NaN NaN 1.0 NaN NaN
C 2.0 NaN NaN 2.0 3.0
E NaN NaN NaN NaN 1.0
D NaN NaN 2.0 NaN NaN
More general solution:
c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)
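For reference, the general solution above can be run end to end as follows. This is a minimal sketch: the two frames are a hypothetical reconstruction of the question's data, with cell positions assumed from the printed result, and the names rows, cols and combined are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frames; the exact cell
# positions are an assumption taken from the printed result above.
df1 = pd.DataFrame(
    [[np.nan, 1, np.nan],
     [np.nan, np.nan, 1],
     [2, np.nan, np.nan]],
    index=list("BAC"), columns=list("BAC"),
)
df2 = pd.DataFrame(
    [[np.nan, 2, 3],
     [np.nan, np.nan, 1],
     [2, np.nan, np.nan]],
    index=list("CED"), columns=list("CED"),
)

# Labels in order of first appearance: df1's labels, then df2's new ones.
cols = df1.columns.append(df2.columns).unique()
rows = df1.index.append(df2.index).unique()

# combine_first returns sorted labels; reindex restores the original order.
combined = df1.combine_first(df2).reindex(index=rows, columns=cols)
print(combined)
```

Using separate rows and cols keeps the sketch valid even when the index and the columns do not share the same labels.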

Related

Python Pandas: how do I fill non-empty cells with their corresponding column names?

Here's the original df:
A B C
32 4
2
2
9 2
2 6
I want to fill in the cells that have data with their column names. The output will look like this:
A B C
A C
B
A
A C
A B
Thanks
RJ
Another way is np.where and would be very fast as well:
out = df.copy()
out[:] = np.where(df.notna(),df.columns,np.nan)
print(out)
A B C
0 A NaN C
1 NaN B NaN
2 A NaN NaN
3 A NaN C
4 A B NaN
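A self-contained version of the np.where idea might look like the sketch below. The sample frame is an assumption (which cells are empty is inferred from the expected output in the question), and the result is built with the DataFrame constructor rather than by assigning into a copy, which is a small variation on the snippet above.

```python
import numpy as np
import pandas as pd

# Sample data; which cells are empty is an assumption taken from the
# expected output shown in the question.
df = pd.DataFrame({
    "A": [32, np.nan, 2, 9, 2],
    "B": [np.nan, 2, np.nan, np.nan, 6],
    "C": [4, np.nan, np.nan, 2, np.nan],
})

# Column labels as an object array so they broadcast across the rows
# and can sit next to NaN in the same result array.
labels = df.columns.to_numpy(dtype=object)

# Where a cell holds data take the column label, otherwise NaN.
out = pd.DataFrame(
    np.where(df.notna(), labels, np.nan),
    index=df.index, columns=df.columns,
)
print(out)
```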
We could do stack and unstack
s=df.stack()
s[:]=s.index.get_level_values(1)
s=s.unstack()
s
Out[496]:
A B C
0 A NaN C
1 NaN B NaN
2 A NaN NaN
3 A NaN C
4 A B NaN
Alternatively, we can use .transform & .mask:
m = df.notna()
df = m.transform(lambda s: [s.name] * s.size).mask(~m)
#print(df)
A B C
0 A NaN C
1 NaN B NaN
2 A NaN NaN
3 A NaN C
4 A B NaN
Try this,
df.where(df.isnull(),df.columns.tolist())
[out]
A B C
A NaN C
NaN B NaN
A NaN NaN
A NaN C
A B NaN

Pivot Table to fill pairs of observation in pandas

The objective is to get a table with values of pair T1-T2. I have data in form of:
df
T1 T2 Score
0 A B 5
1 A C 8
2 B C 4
I tried:
df.pivot_table('Score','T1','T2')
B C
A 5.0 8.0
B NaN 4.0
I expected:
A B C
A 5 8
B 5 4
C 8 4
So it is something like a correlation table, because the A-B pair is the same as B-A in this case.
First pivot and use reindex to add every possible label to both the index and the columns, then build a second pivot with T1 and T2 swapped, and finally merge the two with combine_first:
idx = np.unique(df[['T1','T2']].values.ravel())
df1 = df.pivot_table('Score','T1','T2').reindex(index=idx, columns=idx)
df2 = df.pivot_table('Score','T2','T1').reindex(index=idx, columns=idx)
df = df1.combine_first(df2)
print (df)
A B C
T1
A NaN 5.0 8.0
B 5.0 NaN 4.0
C 8.0 4.0 NaN
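The same steps as a self-contained sketch, with the question's data rebuilt inline. The names labels, upper, lower and sym are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"T1": ["A", "A", "B"],
                   "T2": ["B", "C", "C"],
                   "Score": [5, 8, 4]})

# Every label that appears on either side of a pair.
labels = np.unique(df[["T1", "T2"]].to_numpy().ravel())

# Upper triangle (T1 -> T2) and its mirror (T2 -> T1), both padded to
# the full label set so that they align cell by cell.
upper = (df.pivot_table(values="Score", index="T1", columns="T2")
           .reindex(index=labels, columns=labels))
lower = (df.pivot_table(values="Score", index="T2", columns="T1")
           .reindex(index=labels, columns=labels))

# Fill the gaps of one triangle with the values of the other.
sym = upper.combine_first(lower)
print(sym)
```

The diagonal stays NaN because no pair scores a label against itself.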
Another method using merge:
df1 = df.pivot_table('Score','T1','T2')
df2 = df.pivot_table('Score','T2','T1')
common_val = np.intersect1d(df['T1'].unique(), df['T2'].unique()).tolist()
df = df1.merge(df2, how='outer', left_index=True, right_index=True, on=common_val)
print(df)
B C A
A 5.0 8.0 NaN
B NaN 4.0 5.0
C 4.0 NaN 8.0
Another way:
In [11]: df1 = df.set_index(['T1', 'T2']).unstack(1)
In [12]: df1.columns = df1.columns.droplevel(0)
In [13]: df2 = df1.reindex(index=df1.index.union(df1.columns), columns=df1.index.union(df1.columns))
In [14]: df2.update(df2.T)
In [15]: df2
Out[15]:
A B C
A NaN 5.0 8.0
B 5.0 NaN 4.0
C 8.0 4.0 NaN

drops a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column,axis=1)
Defining the function above raises no error, but calling df.apply(check) produces a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, it is not the approach I would recommend. Instead, create a boolean mask, m = ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns to keep.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A':list('abcdef'),
'B':[np.nan,np.nan,np.nan,5,5,np.nan],
'C':[np.nan,8,np.nan,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is to use dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So, for a vectorized solution, subtract the allowed number of NA values from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
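The conversion between "at most N missing values" and thresh can be wrapped in a small function. The sketch below is only an illustration: drop_sparse_columns is a hypothetical helper name, not part of pandas, and the sample frame is a shortened version of the one used above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_na):
    """Return a copy of frame without columns holding more than max_na NaNs.

    Hypothetical helper, not a pandas function.
    """
    # thresh is the required number of non-NA values, so convert the
    # allowed number of NaNs into a minimum count of valid values.
    return frame.dropna(axis=1, thresh=len(frame) - max_na)

df = pd.DataFrame({
    "A": list("abcdef"),
    "B": [np.nan, np.nan, np.nan, 5, 5, np.nan],   # 4 NaNs -> dropped
    "C": [np.nan, 8, np.nan, np.nan, 2, 3],        # 3 NaNs -> dropped
    "E": [5, 3, 6, 9, 2, np.nan],                  # 1 NaN  -> kept
})

kept = drop_sparse_columns(df, max_na=2)

# The same rule written as a boolean mask over the columns.
mask = df.isna().sum().le(2)
print(kept.columns.tolist())
```

Because dropna returns a new object, the original df is left untouched.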
I suggest using DataFrame.pipe to apply the function to the whole input DataFrame, and changing df.column to df[column]: dot notation does not work with a column name held in a variable, because it tries to select a column literally named column:
df = pd.DataFrame({'A':list('abcdef'),
'B':[np.nan,np.nan,np.nan,5,5,np.nan],
'C':[np.nan,8,np.nan,np.nan,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column,axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
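The difference between dot notation and bracket notation mentioned in this answer can be seen in isolation. This is a minimal sketch on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
column = "B"

# Bracket access evaluates the variable and looks up the label "B".
print(df[column].tolist())

# Attribute access ignores the variable and looks for a column that is
# literally named "column", which does not exist here.
try:
    df.column
except AttributeError as exc:
    print("AttributeError:", exc)
```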
Alternatively, you can use count which counts non-null values
In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b

Pandas: Reshape two columns into one row

I want to reshape a pandas DataFrame from two columns into one row:
import numpy as np
import pandas as pd
df_a = pd.DataFrame({ 'Type': ['A', 'B', 'C', 'D', 'E'], 'Values':[2,4,7,9,3]})
df_a
Type Values
0 A 2
1 B 4
2 C 7
3 D 9
4 E 3
df_b = df_a.pivot(columns='Type', values='Values')
df_b
Which gives me this:
Type A B C D E
0 2.0 NaN NaN NaN NaN
1 NaN 4.0 NaN NaN NaN
2 NaN NaN 7.0 NaN NaN
3 NaN NaN NaN 9.0 NaN
4 NaN NaN NaN NaN 3.0
When I want it condensed into a single row like this:
Type A B C D E
0 2.0 4.0 7.0 9.0 3.0
I believe you don't need pivot; the DataFrame constructor alone is enough:
df_b = pd.DataFrame([df_a['Values'].values], columns=df_a['Type'].values)
print (df_b)
A B C D E
0 2 4 7 9 3
Or set_index with transpose by T:
df_b = df_a.set_index('Type').T.rename({'Values':0})
print (df_b)
Type A B C D E
0 2 4 7 9 3
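Both suggestions can be checked side by side. The sketch below rebuilds df_a from the question; the names from_ctor and from_transpose are only illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({"Type": ["A", "B", "C", "D", "E"],
                     "Values": [2, 4, 7, 9, 3]})

# Option 1: one row of values, with the Type entries as column labels.
from_ctor = pd.DataFrame([df_a["Values"].to_numpy()],
                         columns=df_a["Type"].to_numpy())

# Option 2: make Type the index, transpose so the labels become columns,
# then relabel the single remaining row from "Values" to 0.
from_transpose = df_a.set_index("Type").T.rename(index={"Values": 0})

print(from_ctor)
print(from_transpose)
```

Neither route goes through pivot, so no NaN padding is introduced and the values keep their integer dtype.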
Another way:
df_a['col'] = 0
df_a.set_index(['col','Type'])['Values'].unstack().reset_index().drop('col', axis=1)
Type A B C D E
0 2 4 7 9 3
We can fix your df_b
df_b.ffill().iloc[[-1],:]
Out[360]:
Type A B C D E
4 2.0 4.0 7.0 9.0 3.0
Or we do
df_a.assign(key=[0]*len(df_a)).pivot(columns='Type', values='Values',index='key')
Out[366]:
Type A B C D E
key
0 2 4 7 9 3

Arranging columns in a pandas DataFrame

I am currently working on a dataframe from a cross-tab operation.
pd.crosstab(data['One'],data['two'], margins=True).apply(lambda r: r/len(data)*100,axis = 1)
Columns come out in the following order
A B C D E All
B
C
D
E
All 100
But I want the columns ordered as shown below:
A C D B E All
B
C
D
E
All 100
Is there an easy way to organize the columns?
When I use colnames=['C', 'D','B','E'] it returns an error:
'AssertionError: arrays and names must have the same length'
You can use reindex or change the order by selecting a subset (the older reindex_axis method was removed in pandas 1.0):
colnames=['C', 'D','B','E']
new_cols = colnames + ['All']
#solution 1 change ordering by reindexing along axis 1
df1 = df.reindex(new_cols, axis=1)
#solution 2 change ordering by reindexing
df1 = df.reindex(columns=new_cols)
#solution 3 change order by subset
df1 = df[new_cols]
print (df1)
C D B E All
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN 100.0
To reorder the columns of any dataframe in pandas, just index with a list of the columns in the order you want:
columns = ['A', 'C', 'D', 'B', 'E', 'All']
df2 = df.loc[:, columns]
print(df2)
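Applied to a cross-tab, this could look like the sketch below. The toy values in data are made up for illustration (only the column names One and two come from the question), and the percentages are computed by dividing the whole table, which is the same arithmetic as the apply call in the question.

```python
import pandas as pd

# Toy data; the column names "One" and "two" follow the question,
# the values are made up for illustration.
data = pd.DataFrame({"One": ["B", "C", "D", "E", "B", "C"],
                     "two": ["A", "B", "C", "D", "E", "A"]})

# Cell percentages of the grand total, including the "All" margins.
table = pd.crosstab(data["One"], data["two"], margins=True) / len(data) * 100

# Desired order; every label must already exist or .loc raises KeyError.
order = ["A", "C", "D", "B", "E", "All"]
reordered = table.loc[:, order]
print(reordered)
```

If some labels might be missing, table.reindex(columns=order) is the more forgiving choice, since it adds NaN columns instead of raising.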
