Pivot Table to fill pairs of observations in pandas - python

The objective is to get a table with the value for each T1-T2 pair. I have data in the form of:
df
T1 T2 Score
0 A B 5
1 A C 8
2 B C 4
I tried:
df.pivot_table('Score','T1','T2')
B C
A 5.0 8.0
B NaN 4.0
I expected:
A B C
A 5 8
B 5 4
C 8 4
So it's kind of like a correlation table, I think, because the A-B pair is the same as the B-A pair in this case.

First add all possible index and column values with reindex, build another pivot with T1 and T2 swapped, and finally use combine_first:
idx = np.unique(df[['T1','T2']].values.ravel())
df1 = df.pivot_table('Score','T1','T2').reindex(index=idx, columns=idx)
df2 = df.pivot_table('Score','T2','T1').reindex(index=idx, columns=idx)
df = df1.combine_first(df2)
print (df)
A B C
T1
A NaN 5.0 8.0
B 5.0 NaN 4.0
C 8.0 4.0 NaN

Another method using merge:
df1 = df.pivot_table('Score','T1','T2')
df2 = df.pivot_table('Score','T2','T1')
common_val = np.intersect1d(df['T1'].unique(), df['T2'].unique()).tolist()
df = df1.merge(df2, how='outer', left_index=True, right_index=True, on=common_val)
print(df)
B C A
A 5.0 8.0 NaN
B NaN 4.0 5.0
C 4.0 NaN 8.0

Another way:
In [11]: df1 = df.set_index(['T1', 'T2']).unstack(1)
In [12]: df1.columns = df1.columns.droplevel(0)
In [13]: df2 = df1.reindex(index=df1.index.union(df1.columns), columns=df1.index.union(df1.columns))
In [14]: df2.update(df2.T)
In [15]: df2
Out[15]:
A B C
A NaN 5.0 8.0
B 5.0 NaN 4.0
C 8.0 4.0 NaN
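Yet another route (a minimal sketch, assuming the same df with columns T1, T2 and Score as above) is to duplicate the rows with T1 and T2 swapped and pivot only once, since every pair then appears in both orientations:
# concatenate the original rows with a swapped copy, then a single
# pivot_table fills both triangles of the symmetric table
swapped = df.rename(columns={'T1': 'T2', 'T2': 'T1'})
out = pd.concat([df, swapped], ignore_index=True).pivot_table('Score', 'T1', 'T2')
print(out)
This should give the same symmetric table as above, with NaN on the diagonal.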

Related

Creating two shifted columns in grouped pandas data-frame

I have looked all over and I still can't find an example of how to create two shifted columns in a Pandas Dataframe within its groups.
I have done it with one column as follows:
data_frame['previous_category'] = data_frame.groupby('id')['category'].shift()
But I have to do it with 2 columns, shifting one upwards and the other downwards.
Any ideas?
It is possible with a custom function and GroupBy.apply, because one column needs to be shifted down and the other shifted up:
df = pd.DataFrame({
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'F':list('aaabbb')
})
def f(x):
    x['B'] = x['B'].shift()
    x['C'] = x['C'].shift(-1)
    return x
df = df.groupby('F').apply(f)
print (df)
B C F
0 NaN 8.0 a
1 4.0 9.0 a
2 5.0 NaN a
3 NaN 2.0 b
4 5.0 3.0 b
5 5.0 NaN b
If you want to shift both columns the same way, just pass the list of columns to the groupby selection:
df[['B','C']] = df.groupby('F')[['B','C']].shift()
print (df)
B C F
0 NaN NaN a
1 4.0 7.0 a
2 5.0 8.0 a
3 NaN NaN b
4 5.0 4.0 b
5 5.0 2.0 b
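If you would rather avoid GroupBy.apply, the same result can be reached by assigning each shifted column separately (a small sketch, starting again from the original sample df above):
# shift B down and C up within each group, without apply
df = df.assign(B=df.groupby('F')['B'].shift(),
               C=df.groupby('F')['C'].shift(-1))
print(df)
This should reproduce the first output above, with B shifted down and C shifted up per group.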

Move Null rows to the bottom of the dataframe

I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
'b': list('abcdefgh')})
df1
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 NaN d
4 5.0 e
5 6.0 f
6 NaN g
7 8.0 h
I would like to move all the rows where a is np.nan to the bottom of the dataframe
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
'b': list('abcefhdg')})
df2
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer (after the OP's edit)
You were close but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Old answer
Use sort_values with the na_position='last' argument:
df1 = df1.sort_values('a', na_position='last')
print(df1)
a b
0 1.0 a
1 2.0 b
2 3.0 c
4 5.0 e
5 6.0 f
7 8.0 h
3 NaN d
6 NaN g
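Note that sort_values orders the non-missing rows by value as well; with the edited df1 from the question, where 10.0 comes early by position but late by value, the rows come back in value order rather than their original order, which is why the other answers preserve the original row order instead. A quick check (a sketch, assuming the df1 defined in the question):
print(df1.sort_values('a', na_position='last'))
#        a  b
# 0   1.0  a
# 1   2.0  b
# 4   5.0  e
# 5   6.0  f
# 7   8.0  h
# 2  10.0  c   <- sorted by value, no longer in its original position
# 3   NaN  d
# 6   NaN  g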
This does not exist as a built-in in pandas yet, so use Series.isna with Series.argsort for the positions and change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp')
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
a b
0 1.0 a
1 2.0 b
2 10.0 c
3 5.0 e
4 6.0 f
5 8.0 h
6 NaN d
7 NaN g
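In newer pandas (1.1+), sort_values also accepts a key callable, so the same stable reordering can be done in one call (a sketch, again assuming the df1 defined in the question):
# sort only on "is the value missing?"; the stable sort keeps the
# original order among the non-missing rows
df1 = (df1.sort_values('a', key=lambda s: s.isna(), kind='stable')
          .reset_index(drop=True))
print(df1)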

Drop a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column, axis=1)
There is no error when running the code above, but when I do df.apply(check), I get a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[np.nan,np.nan,np.nan,5,5,np.nan],
    'C':[np.nan,8,np.nan,np.nan,2,3],
    'D':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,np.nan],
    'F':list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract it from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation with a dynamic column name stored in a variable fails (it tries to select a column literally named 'column'):
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[np.nan,np.nan,np.nan,5,5,np.nan],
                   'C':[np.nan,8,np.nan,np.nan,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,np.nan],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
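To see why the original df.column failed: attribute access looks up a column literally named "column" instead of substituting the loop variable, while bracket access uses the variable's value (a minimal standalone sketch):
import pandas as pd

demo = pd.DataFrame({'A': [1, None, 3]})
column = 'A'
print(demo[column].isnull().sum())   # 1 -> bracket access uses the variable's value
# demo.column would raise AttributeError, since no column is literally named "column"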
Alternatively, you can use count, which counts non-null values:
In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b

Merging/Combining Dataframes in Pandas

I have a df1, example:
   B  A  C
B     1
A        1
C  2
and a df2, example:
   C  E  D
C     2  3
E        1
D  2
The column and row 'C' are common to both dataframes.
I would like to combine these dataframes so that I get:
   B  A  C  D  E
B     1
A        1
C  2        3  2
D        2
E           1
Is there an easy way to do this? pd.concat and pd.append do not seem to work. Thanks!
Edit: df1.combine_first(df2) works (thanks @jezrael), but can we keep the original ordering?
The problem is that combine_first always sorts the column and index names, so you need to reindex with the combined column names:
idx = df1.columns.append(df2.columns).unique()
print (idx)
Index(['B', 'A', 'C', 'E', 'D'], dtype='object')
df = df1.combine_first(df2).reindex(index=idx, columns=idx)
print (df)
B A C E D
B NaN 1.0 NaN NaN NaN
A NaN NaN 1.0 NaN NaN
C 2.0 NaN NaN 2.0 3.0
E NaN NaN NaN NaN 1.0
D NaN NaN 2.0 NaN NaN
More general solution:
c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)

Pandas: General Data Imputation Based on Column Dtype

I'm working with a dataset with ~80 columns, many of which contain NaN. I definitely don't want to manually inspect dtype for each column and impute based on that.
So I wrote a function to impute a column's missing values based on its dtype:
def impute_df(df, col):
    # if col is float, impute mean
    if df[col].dtype == "int64":
        df[col].fillna(df[col].mean(), inplace=True)
    else:
        df[col].fillna(df[col].mode()[0], inplace=True)
But to use this, I'd have to loop over all columns in my DataFrame, something like:
for col in train_df.columns:
    impute_df(train_df, col)
And I know looping in Pandas is generally slow. Is there a better way of going about this?
Thanks!
I think you need select_dtypes to separate the numeric and non-numeric columns, and then apply fillna to each filtered set of columns:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[np.nan,5,4,5,5,4],
                   'C':[7,8,np.nan,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':['a','a','b','b','b',np.nan]})
print (df)
A B C D E F
0 a NaN 7.0 1 5 a
1 b 5.0 8.0 3 3 a
2 c 4.0 NaN 5 6 b
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 4.0 3.0 0 4 NaN
cols1 = df.select_dtypes([np.number]).columns
cols2 = df.select_dtypes(exclude = [np.number]).columns
df[cols1] = df[cols1].fillna(df[cols1].mean())
df[cols2] = df[cols2].fillna(df[cols2].mode().iloc[0])
print (df)
A B C D E F
0 a 4.6 7.0 1 5 a
1 b 5.0 8.0 3 3 a
2 c 4.0 4.8 5 6 b
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 4.0 3.0 0 4 b
I think you do not need a function here. For example:
df=pd.DataFrame({'A':[1,np.nan,3,4],'A_1':[1,np.nan,3,4],'B':['A','A',np.nan,'B']})
v=df.select_dtypes(exclude=['object']).columns
t=~df.columns.isin(v)
df.loc[:,v]=df.loc[:,v].fillna(df.loc[:,v].mean().to_dict())
df.loc[:,t]=df.loc[:,t].fillna(df.loc[:,t].mode().iloc[0].to_dict())
df
Out[1440]:
A A_1 B
0 1.000000 1.000000 A
1 2.666667 2.666667 A
2 3.000000 3.000000 A
3 4.000000 4.000000 B
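Another compact option (a sketch, assuming a DataFrame like the ones above, where every column has at least one non-missing value) is to build one mapping of column name to fill value and pass it to a single fillna call:
# mean for numeric columns, mode for everything else, in one fillna
num_cols = df.select_dtypes('number').columns
fill = {c: (df[c].mean() if c in num_cols else df[c].mode()[0]) for c in df.columns}
df = df.fillna(fill)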
