Pandas: Reshape two columns into one row - python

I want to reshape a pandas DataFrame from two columns into one row:
import numpy as np
import pandas as pd
df_a = pd.DataFrame({ 'Type': ['A', 'B', 'C', 'D', 'E'], 'Values':[2,4,7,9,3]})
df_a
  Type  Values
0    A       2
1    B       4
2    C       7
3    D       9
4    E       3
df_b = df_a.pivot(columns='Type', values='Values')
df_b
Which gives me this:
Type    A    B    C    D    E
0     2.0  NaN  NaN  NaN  NaN
1     NaN  4.0  NaN  NaN  NaN
2     NaN  NaN  7.0  NaN  NaN
3     NaN  NaN  NaN  9.0  NaN
4     NaN  NaN  NaN  NaN  3.0
When I want it condensed into a single row like this:
Type    A    B    C    D    E
0     2.0  4.0  7.0  9.0  3.0

I believe you don't need pivot here; it's better to use the DataFrame constructor only:
df_b = pd.DataFrame([df_a['Values'].values], columns=df_a['Type'].values)
print (df_b)
   A  B  C  D  E
0  2  4  7  9  3
Or use set_index with transpose via T:
df_b = df_a.set_index('Type').T.rename({'Values':0})
print (df_b)
Type  A  B  C  D  E
0     2  4  7  9  3
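The only difference from the first answer's output is the Type columns-name label; if you want that gone too (my addition, not part of the original answer), clear it:
df_b = df_a.set_index('Type').T.rename({'Values': 0})
df_b.columns.name = None  # drop the 'Type' label printed above the columns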

Another way:
df_a['col'] = 0
df_a.set_index(['col','Type'])['Values'].unstack().reset_index().drop('col', axis=1)
Type  A  B  C  D  E
0     2  4  7  9  3

We can fix your df_b:
df_b.ffill().iloc[[-1], :]
Out[360]:
Type    A    B    C    D    E
4     2.0  4.0  7.0  9.0  3.0
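If the leftover row label 4 matters, a small addition (mine, not from the original answer) resets it to 0:
df_b = df_b.ffill().iloc[[-1]].reset_index(drop=True)  # index becomes 0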
Or we can do:
df_a.assign(key=[0]*len(df_a)).pivot(columns='Type', values='Values', index='key')
Out[366]:
Type  A  B  C  D  E
key
0     2  4  7  9  3
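To match the target frame exactly, the key index can then be dropped; a short sketch (my addition):
df_b = (df_a.assign(key=0)  # a scalar broadcasts, same as [0]*len(df_a)
            .pivot(index='key', columns='Type', values='Values')
            .reset_index(drop=True))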

create new pandas column based on if and else rule

I have this dataframe and I want to create column e:
df
    a    b    c  d
    1    2    1  2
  NaN  NaN    3  1
  NaN  NaN  NaN  5
    4    5    0  2
I want to create the new column based on these criteria:
Take the highest of column a vs column b.
If there is no value in column a and column b, then look at column c.
If there is no value in column c, then look at column d.
df
    a    b    c  d  e
    1    2    1  2  2
  NaN  NaN    3  1  3
  NaN  NaN  NaN  5  5
    4    5    0  2  5
My idea only gets as far as step number 2:
def e(x):
    if x['a'] >= x['b']:
        return x['a']
    elif x['a'] <= x['b']:
        return x['b']
    else:
        x['c']

df['e'] = df.apply(e, axis=1)
IIUC, use pandas.DataFrame.bfill:
df["e"] = df.bfill(1)[["a", "b"]].max(1)
print(df)
Output:
     a    b    c  d    e
0  1.0  2.0  1.0  2  2.0
1  NaN  NaN  3.0  1  3.0
2  NaN  NaN  NaN  5  5.0
3  4.0  5.0  0.0  2  5.0
You can always use np.where()
df['e'] = df['d']
df['e'] = np.where((df['a'].isna()) & (df['b'].isna()) & (df['c'].notnull()), df['c'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['a'] >= df['b']), df['a'], df['e'])
df['e'] = np.where((df['a'].notnull()) & (df['b'].notnull()) & (df['b'] > df['a']), df['b'], df['e'])
df
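Not from the original answers, but the same priority order can also be written with np.select, where conditions are checked top to bottom and the first match wins:
import numpy as np

conditions = [
    df[['a', 'b']].notna().any(axis=1),  # a or b present -> row-wise max of a and b
    df['c'].notna(),                     # otherwise fall back to c
    df['d'].notna(),                     # then to d
]
choices = [df[['a', 'b']].max(axis=1), df['c'], df['d']]
df['e'] = np.select(conditions, choices, default=np.nan)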
First get the maximum of the a and b values and assign it to a column, then back fill missing values across columns and take the first column, which prioritizes c and then d:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
     a    b    c  d    e
0  1.0  2.0  1.0  2  2.0
1  NaN  NaN  3.0  1  3.0
2  NaN  NaN  NaN  5  5.0
3  4.0  5.0  0.0  2  5.0
If you want to test only the a, b, c, d columns, with possibly some other columns present:
df['e'] = df[['a','b']].max(axis=1).fillna(df.c).fillna(df.d)
print (df)
     a    b    c  d    e
0  1.0  2.0  1.0  2  2.0
1  NaN  NaN  3.0  1  3.0
2  NaN  NaN  NaN  5  5.0
3  4.0  5.0  0.0  2  5.0
If the second row is changed to c=3, d=5, the output is:
df['e'] = df.assign(a = df[['a','b']].max(axis=1)).bfill(axis=1).iloc[:, 0]
print (df)
     a    b    c  d    e
0  1.0  2.0  1.0  2  2.0
1  NaN  NaN  3.0  5  3.0  <- changed d=5
2  NaN  NaN  NaN  5  5.0
3  4.0  5.0  0.0  2  5.0

Move Null rows to the bottom of the dataframe

I have a dataframe:
df1 = pd.DataFrame({'a': [1, 2, 10, np.nan, 5, 6, np.nan, 8],
                    'b': list('abcdefgh')})
df1
      a  b
0   1.0  a
1   2.0  b
2  10.0  c
3   NaN  d
4   5.0  e
5   6.0  f
6   NaN  g
7   8.0  h
I would like to move all the rows where a is np.nan to the bottom of the dataframe:
df2 = pd.DataFrame({'a': [1, 2, 10, 5, 6, 8, np.nan, np.nan],
                    'b': list('abcefhdg')})
df2
      a  b
0   1.0  a
1   2.0  b
2  10.0  c
3   5.0  e
4   6.0  f
5   8.0  h
6   NaN  d
7   NaN  g
I have tried this:
na = df1[df1.a.isnull()]
df1.dropna(subset = ['a'], inplace=True)
df1 = df1.append(na)
df1
Is there a cleaner way to do this? Or is there a function that I can use for this?
New answer after the OP's edit
You were close but you can clean up your code a bit by using the following:
df1 = pd.concat([df1[df1['a'].notnull()], df1[df1['a'].isnull()]], ignore_index=True)
print(df1)
      a  b
0   1.0  a
1   2.0  b
2  10.0  c
3   5.0  e
4   6.0  f
5   8.0  h
6   NaN  d
7   NaN  g
Old answer
Use sort_values with the na_position='last' argument (note that this also sorts the non-NaN values, so it does not preserve the original row order):
df1 = df1.sort_values('a', na_position='last')
print(df1)
      a  b
0   1.0  a
1   2.0  b
4   5.0  e
5   6.0  f
7   8.0  h
2  10.0  c
3   NaN  d
6   NaN  g
A built-in for this does not exist in pandas yet, so use Series.isna with Series.argsort for the positions and change the ordering with DataFrame.iloc:
df1 = df1.iloc[df1['a'].isna().argsort()].reset_index(drop=True)
print (df1)
      a  b
0   1.0  a
1   2.0  b
2  10.0  c
3   5.0  e
4   6.0  f
5   8.0  h
6   NaN  d
7   NaN  g
Or a pure pandas solution with a helper column and DataFrame.sort_values:
df1 = (df1.assign(tmp=df1['a'].isna())
          .sort_values('tmp')
          .drop('tmp', axis=1)
          .reset_index(drop=True))
print (df1)
      a  b
0   1.0  a
1   2.0  b
2  10.0  c
3   5.0  e
4   6.0  f
5   8.0  h
6   NaN  d
7   NaN  g
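On pandas >= 1.1 the helper column can be avoided entirely (my addition, assuming a recent pandas): sort_values accepts a key function, and a stable sort keeps the original order within the non-NaN and NaN groups:
df1 = df1.sort_values('a', key=lambda s: s.isna(),
                      kind='stable', ignore_index=True)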

Drop a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column, axis=1)
There is no error in executing the above code, but when doing df.apply(check) there are a ton of errors.
P.S. I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach you should take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select the columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
   A  D    E  F
0  a  1  5.0  a
1  b  3  3.0  a
2  c  5  6.0  a
3  d  7  9.0  b
4  e  1  2.0  b
5  f  0  NaN  b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
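Equivalently, the negation can be folded into the comparison; a minor variant of the same mask (my phrasing):
m = df.isnull().sum().le(2)  # True for columns with at most 2 NaNs
df = df.loc[:, m]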
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract N from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
   A  D    E  F
0  a  1  5.0  a
1  b  3  3.0  a
2  c  5  6.0  a
3  d  7  9.0  b
4  e  1  2.0  b
5  f  0  NaN  b
I suggest using DataFrame.pipe to apply the function to the input DataFrame, changing df.column to df[column], because dot notation fails with dynamic column names from a variable (it tries to select a column literally named column):
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
                   'C': [np.nan, 8, np.nan, np.nan, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, np.nan],
                   'F': list('aaabbb')})
print (df)
   A    B    C  D    E  F
0  a  NaN  NaN  1  5.0  a
1  b  NaN  8.0  3  3.0  a
2  c  NaN  NaN  5  6.0  a
3  d  5.0  NaN  7  9.0  b
4  e  5.0  2.0  1  2.0  b
5  f  NaN  3.0  0  NaN  b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
   A  D    E  F
0  a  1  5.0  a
1  b  3  3.0  a
2  c  5  6.0  a
3  d  7  9.0  b
4  e  1  2.0  b
5  f  0  NaN  b
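Dropping columns in place while iterating over them works here but is easy to get wrong; a non-mutating sketch of the same check (max_na is a name I'm introducing):
def check(df, max_na=2):
    # keep only columns with at most `max_na` missing values
    return df.loc[:, df.isnull().sum().le(max_na)]

df = df.pipe(check)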
Alternatively, you can use count, which counts non-null values:
In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
Out[23]:
   A  D    E  F
0  a  1  5.0  a
1  b  3  3.0  a
2  c  5  6.0  a
3  d  7  9.0  b
4  e  1  2.0  b
5  f  0  NaN  b

Merging/Combining Dataframes in Pandas

I have a df1, example:
   B  A  C
B     1
A        1
C  2
,and a df2, example:
   C  E  D
C     2  3
E        1
D  2
Column and row 'C' are common to both dataframes.
I would like to combine these dataframes such that I get:
   B  A  C  D  E
B     1
A        1
C  2        3  2
D        2
E           1
Is there an easy way to do this? pd.concat and pd.append do not seem to work. Thanks!
Edit: df1.combine_first(df2) works (thanks @jezrael), but can we keep the original ordering?
The problem is that combine_first always sorts the column and index names, so you need to reindex with the combined column names:
idx = df1.columns.append(df2.columns).unique()
print (idx)
Index(['B', 'A', 'C', 'E', 'D'], dtype='object')
df = df1.combine_first(df2).reindex(index=idx, columns=idx)
print (df)
     B    A    C    E    D
B  NaN  1.0  NaN  NaN  NaN
A  NaN  NaN  1.0  NaN  NaN
C  2.0  NaN  NaN  2.0  3.0
E  NaN  NaN  NaN  NaN  1.0
D  NaN  NaN  2.0  NaN  NaN
More general solution:
c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)
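For anyone reproducing this, a sketch of how the example frames could be constructed (assuming the blank cells are NaN):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'B': [np.nan, np.nan, 2.0],
                    'A': [1.0, np.nan, np.nan],
                    'C': [np.nan, 1.0, np.nan]},
                   index=['B', 'A', 'C'])
df2 = pd.DataFrame({'C': [np.nan, np.nan, 2.0],
                    'E': [2.0, np.nan, np.nan],
                    'D': [3.0, 1.0, np.nan]},
                   index=['C', 'E', 'D'])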

Pandas: concatenate and reindex dataframes

I would like to combine two pandas dataframes into a new third dataframe using a new index. Suppose I start with the following:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(25).reshape((5,5))*2,index = ['A','B','C','D','E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
I would like the least convoluted way to achieve the following result:
NewIndex  OldIndex   df  df1
       1         A    1    2
       2         B    1    2
       3         C    1    2
       4         D    1    2
       5         E    1    2
       6         A    1    2
       7         B    1    2
       8         C    1    2
       9         D    1    2
      10         E    1    2
      11         A  NaN    2
      12         B  NaN    2
      13         C  NaN    2
      14         D  NaN    2
      15         E  NaN    2
      16         A    1  NaN
      17         B    1  NaN
      18         C    1  NaN
      19         D    1  NaN
      20         E    1  NaN
What's the best way to do this?
You have to unstack your dataframes and then reindex the concatenated dataframe.
import numpy as np
import pandas as pd
# test data
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(25).reshape((5,5))*2,index = ['A','B','C','D','E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
# unstack tables and concat
newdf = pd.concat([df.unstack(),df1.unstack()], axis=1)
# reset multiindex for level 1
newdf.reset_index(1, inplace=True)
# rename columns
newdf.columns = ['OldIndex','df','df1']
# drop old index
newdf = newdf.reset_index().drop('index', axis=1)
# set index from 1
newdf.index = np.arange(1, len(newdf) + 1)
# rename new index
newdf.index.name='NewIndex'
print(newdf)
Output:
         OldIndex   df  df1
NewIndex
1               A  1.0  2.0
2               B  1.0  2.0
3               C  1.0  2.0
4               D  1.0  2.0
5               E  1.0  2.0
6               A  1.0  2.0
7               B  1.0  2.0
8               C  1.0  2.0
9               D  1.0  2.0
10              E  1.0  2.0
11              A  NaN  2.0
12              B  NaN  2.0
13              C  NaN  2.0
14              D  NaN  2.0
15              E  NaN  2.0
16              A  1.0  NaN
17              B  1.0  NaN
18              C  1.0  NaN
19              D  1.0  NaN
20              E  1.0  NaN
21              A  NaN  NaN
22              B  NaN  NaN
23              C  NaN  NaN
24              D  NaN  NaN
25              E  NaN  NaN
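Note the question's desired output stops at row 20: it omits column 4, which is NaN in both frames. If that matters, the all-NaN rows can be dropped and the 1-based index rebuilt (my addition):
newdf = newdf.dropna(how='all', subset=['df', 'df1'])
newdf.index = np.arange(1, len(newdf) + 1)
newdf.index.name = 'NewIndex'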
