Pandas force NaN to bottom of each column at each index - python
I have a DataFrame in which multiple rows share each index value. The first index, for example, has the following structure:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0])
Out[4]:
    ID   Name  val1  val2  val3
0    A  first     1     1   NaN
0  NaN    NaN     2   NaN     2
0  NaN    NaN   NaN     3     3
I would like to sort/order each column so that the NaNs sit at the bottom of each column within that index - a result which looks like this:
    ID   Name  val1  val2  val3
0    A  first     1     1     2
0  NaN    NaN     2     3     3
0  NaN    NaN   NaN   NaN   NaN
A more explicit example might look like this:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0],
                   ["B", "second", 4.0, 4.0, np.nan],
                   [np.nan, np.nan, 5.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 6.0, 6.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0, 1, 1, 1])
Out[5]:
    ID    Name  val1  val2  val3
0    A   first     1     1   NaN
0  NaN     NaN     2   NaN     2
0  NaN     NaN   NaN     3     3
1    B  second     4     4   NaN
1  NaN     NaN     5   NaN     5
1  NaN     NaN   NaN     6     6
with the desired result looking like this:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN   NaN   NaN   NaN
1    B  second     4     4     5
1  NaN     NaN     5     6     6
1  NaN     NaN   NaN   NaN   NaN
I have many thousands of rows in this dataframe, with each index containing up to a few hundred rows. The desired result will be very helpful when I write the dataframe out with to_csv.
I have attempted sort_values(['val1','val2','val3']) on the whole dataframe, but this leaves the indices disordered. I have tried iterating through each index and sorting in place, but this too does not push the NaNs to the bottom of each index's columns. I have also tried filling the NaNs with another value, such as 0, via fillna, but I have not been successful there either.
While I am certainly using it wrong, the na_position parameter of sort_values does not produce the desired outcome, though it seems this is likely what I want.
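To see why na_position alone cannot help here: sort_values reorders whole rows, so a NaN in one column travels with the row's other values instead of sinking within its own column. A minimal sketch of that failure mode, reusing the question's first frame:

```python
import numpy as np
import pandas as pd

# The question's first frame.
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0])

# sort_values moves entire rows at once: here val1 is already ascending,
# so the row order is unchanged and val3 keeps its NaN at the top.
out = df.sort_values(["val1", "val2", "val3"], na_position="last")
print(out)
```

In other words, na_position controls where NaN rows land in a row-wise sort; it never rearranges columns independently.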
Edit:
The final df's index is not required to be in numerical order as in my second example.
By changing ignore_index to False in the single line of @Leb's third code block,
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
to
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=False)
and by creating a temp df for all rows in a given index, I was able to make this work - not pretty, but it orders things how I need them. If someone (certainly) has a better way, please let me know.
new_df = df.loc[0]  # df.ix was removed in later pandas; .loc works here
new_df = pd.concat([new_df[col].sort_values().reset_index(drop=True)
                    for col in new_df], axis=1, ignore_index=False)
max_index = df.index[-1]
for i in range(1, max_index + 1):
    tmp = df.loc[i]
    tmp = pd.concat([tmp[col].sort_values().reset_index(drop=True)
                     for col in tmp], axis=1, ignore_index=False)
    new_df = pd.concat([new_df, tmp])
In [10]: new_df
Out[10]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
1  NaN     NaN     2     3     3
2  NaN     NaN   NaN   NaN   NaN
0    B  second     4     4     5
1  NaN     NaN     5     6     6
2  NaN     NaN   NaN   NaN   NaN
I know the issue of pushing NaNs to an edge has been discussed on GitHub. For your particular frame, I'd probably do it manually at the Python level and not worry much about performance. Something like
>>> df.groupby(level=0, sort=False).transform(lambda x: sorted(x, key=pd.isnull))
    ID    Name val1 val2 val3
0    A   first    1    1    2
0  NaN     NaN    2    3    3
0  NaN     NaN  NaN  NaN  NaN
1    B  second    4    4    5
1  NaN     NaN    5    6    6
1  NaN     NaN  NaN  NaN  NaN
should work. Note that since sorted is a stable sort, and we're using pd.isnull as the key (where False < True), we push the NaNs to the end while preserving the order of the remaining objects. Also note that here I'm grouping just on the index; we could alternatively have grouped on whatever we wanted.
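The key-function mechanics can be checked in isolation on a toy list (illustrative data, not from the question): pd.isnull maps real values to False and NaN to True, and False < True.

```python
import numpy as np
import pandas as pd

values = [np.nan, 3.0, np.nan, 1.0, 2.0]

# sorted() is stable: entries with equal keys keep their relative order,
# so 3.0, 1.0, 2.0 stay in their original order (all keyed False) while
# the NaNs (keyed True) move to the end.
result = sorted(values, key=pd.isnull)
print(result)  # [3.0, 1.0, 2.0, nan, nan]
```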
Given df:
pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
              [np.nan, np.nan, 2.0, np.nan, 2.0],
              [np.nan, np.nan, np.nan, 3.0, 3.0]],
             columns=["ID", "Name", "val1", "val2", "val3"],
             index=[0, 1, 2])
I changed the index to make sure the order stays.
df
Out[127]:
    ID   Name  val1  val2  val3
0    A  first     1     1   NaN
1  NaN    NaN     2   NaN     2
2  NaN    NaN   NaN     3     3
Using:
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
Will give:
Out[130]:
     0      1    2    3    4
0    A  first    1    1    2
1  NaN    NaN    2    3    3
2  NaN    NaN  NaN  NaN  NaN
Same for:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0],
                   ["B", "second", 4.0, 4.0, np.nan],
                   [np.nan, np.nan, 5.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 6.0, 6.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0, 1, 1, 1])
df
Out[132]:
    ID    Name  val1  val2  val3
0    A   first     1     1   NaN
0  NaN     NaN     2   NaN     2
0  NaN     NaN   NaN     3     3
1    B  second     4     4   NaN
1  NaN     NaN     5   NaN     5
1  NaN     NaN   NaN     6     6
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
Out[133]:
     0       1    2    3    4
0    A   first    1    1    2
1    B  second    2    3    3
2  NaN     NaN    4    4    5
3  NaN     NaN    5    6    6
4  NaN     NaN  NaN  NaN  NaN
5  NaN     NaN  NaN  NaN  NaN
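One way to keep this per-column trick from mixing values across indices, sketched here as an assumption rather than as part of the original answer, is to run it inside each group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0],
                   ["B", "second", 4.0, 4.0, np.nan],
                   [np.nan, np.nan, 5.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 6.0, 6.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0, 1, 1, 1])

def push_nans_down(group):
    # Sort each column of one group independently; sort_values puts NaNs
    # last by default, and reset_index realigns the columns by position.
    return pd.concat([group[col].sort_values().reset_index(drop=True)
                      for col in group], axis=1)

result = df.groupby(level=0, sort=False).apply(push_nans_down)
print(result)
```

Because the sort runs per group, values from index 1 can never migrate into the rows of index 0.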
After additional comments
new = pd.concat([df[col].sort_values().reset_index(drop=True)
                 for col in df.iloc[:, 2:]], axis=1, ignore_index=True)
new.index = df.index
cols = df.iloc[:, 2:].columns
new.columns = cols
df.drop(cols, inplace=True, axis=1)
df = pd.concat([df, new], axis=1)
df
Out[37]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN     4     4     5
1    B  second     5     6     6
1  NaN     NaN   NaN   NaN   NaN
1  NaN     NaN   NaN   NaN   NaN
In [219]:
# Series.sort was removed in later pandas; sort_values is the equivalent.
# Returning an array makes transform assign the sorted values by position.
df.groupby(level=0).transform(lambda x: x.sort_values(na_position='last').to_numpy())
Out[219]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN   NaN   NaN   NaN
1    B  second     4     4     5
1  NaN     NaN     5     6     6
1  NaN     NaN   NaN   NaN   NaN
Related
Clean way to rearrange columns that are repeated and have nans in them
I have the following dataframe:

Subject  Val1  Val1  Int  Val1  Val1  Int2  Val1
A        1     2     3    NaN   NaN   Sp    NaN
B        NaN   NaN   NaN  2     3     NaN   NaN
C        NaN   NaN   4    NaN   NaN   0     3
D        NaN   NaN   3    NaN   NaN   8     NaN

I want to end up with only 2 columns named Val1, because Val1 has at most 2 non-NaN values for any given subject. Namely, the output would look like this:

Subject  Val1  Val1  Int  Int2
A        1     2     3    Sp
B        2     3     NaN  NaN
C        3     NaN   4    0
D        NaN   NaN   3    8

Is there a function in pandas to do this in a clean way? Clean meaning only a few lines of code. One way would be to iterate through the rows with a for loop and bring all non-NaN values to the left, but I'd like something cleaner and more efficient.
The idea is to group by the duplicated column names and, per group, use a lambda that sorts each row's values by missing-ness, so the all-NaN columns can be removed in the last step:

df = df.set_index('Subject')
f = lambda x: pd.DataFrame(x.apply(sorted, key=pd.isna, axis=1).tolist(), index=x.index)
df = df.groupby(level=0, axis=1).apply(f).dropna(axis=1, how='all').droplevel(1, axis=1)
print (df)

         Int Int2  Val1  Val1
Subject
A        3.0   Sp   1.0   2.0
B        NaN  NaN   2.0   3.0
C        4.0    0   3.0   NaN
D        3.0    8   NaN   NaN
Merge almost identical rows after removing nan values
I have a dataframe like:

pri_col  col1  col2  Date
r1       3     4     2020-09-10
r1       4     1     2020-09-11
r1       2     7     2020-09-12
r1       6     4     2020-09-13

Note: there are many more unique values in the 'pri_col' column; this is just a sample, so I'm giving a single value here. Also, for a single value of 'pri_col', the value of 'Date' will always be unique. I need the dataframe like:

pri_col  col1_2020-09-10  col1_2020-09-11  col1_2020-09-12  col1_2020-09-13  col2_2020-09-10  col2_2020-09-11  col2_2020-09-12  col2_2020-09-13
r1       3                4                2                6                4                1                7                4

Following a previous solution, I have tried this:

df = (df.reset_index()
        .melt(id_vars=['index','pri_col','Date'], var_name='cols', value_name='val')
        .pivot(index=['index','pri_col'], columns=['cols','Date'], values='val'))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1).rename_axis(None)
print (df)

But this is the resulting dataframe:

pri_col  col1_2020-09-10  col1_2020-09-11  col1_2020-09-12  col1_2020-09-13  col2_2020-09-10  col2_2020-09-11  col2_2020-09-12  col2_2020-09-13
r1       3                NaN              NaN              NaN              4                NaN              NaN              NaN
r1       NaN              4                NaN              NaN              NaN              1                NaN              NaN
r1       NaN              NaN              2                NaN              NaN              NaN              7                NaN
r1       NaN              NaN              NaN              6                NaN              NaN              NaN              4

How do I solve the issue? Also, I had asked a question recently that may sound similar.
IIUC, use pandas.DataFrame.set_index with unstack:

new_df = df.set_index(['pri_col', 'Date']).unstack()
new_df.columns = ["%s_%s" % (i, j) for i, j in new_df.columns]
print(new_df)

Output:

         col1_2020-09-10  col1_2020-09-11  col1_2020-09-12  col1_2020-09-13  \
pri_col
r1                     3                4                2                6

         col2_2020-09-10  col2_2020-09-11  col2_2020-09-12  col2_2020-09-13
pri_col
r1                     4                1                7                4
How to insert an empty column after each column in an existing dataframe
I have a dataframe that looks as follows:

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6], "C": [2, 3, 4, 5]})

I would like to insert an empty column (with type string) after each existing column in the dataframe, such that the output looks like:

   A  col1  B  col2  C  col3
0  1   NaN  3   NaN  2   NaN
1  2   NaN  4   NaN  3   NaN
2  3   NaN  5   NaN  4   NaN
3  4   NaN  6   NaN  5   NaN
Actually there's a much simpler way, thanks to reindex:

df.reindex([x for i, c in enumerate(df.columns, 1) for x in (c, f'col{i}')], axis=1)

Result:

   A  col1  B  col2  C  col3
0  1   NaN  3   NaN  2   NaN
1  2   NaN  4   NaN  3   NaN
2  3   NaN  5   NaN  4   NaN
3  4   NaN  6   NaN  5   NaN

Here's the other, more complicated way:

import numpy as np
df.join(pd.DataFrame(np.empty(df.shape, dtype=object), columns=df.columns + '_sep')).sort_index(axis=1)

   A A_sep  B B_sep  C C_sep
0  1  None  3  None  2  None
1  2  None  4  None  3  None
2  3  None  5  None  4  None
3  4  None  6  None  5  None
This solution worked for me:

merged = pd.concat([myDataFrame, pd.DataFrame(columns=[' '])], axis=1)
This is what you can do:

for count in range(len(df.columns)):
    # Insert np.nan rather than the string 'NaN' so the columns hold real missing values.
    df.insert(count * 2 + 1, 'col' + str(count + 1), np.nan)
print(df)

Output:

   A  col1  B  col2  C  col3
0  1   NaN  3   NaN  2   NaN
1  2   NaN  4   NaN  3   NaN
2  3   NaN  5   NaN  4   NaN
3  4   NaN  6   NaN  5   NaN
How to fill and merge df with 10 empty rows?
How do I fill a df with empty rows, or create a df with empty rows? I have this df:

df = pd.DataFrame(columns=["naming", "type"])

How do I fill this df with empty rows?
Specify index values:

df = pd.DataFrame(columns=["naming", "type"], index=range(10))
print (df)

  naming type
0    NaN  NaN
1    NaN  NaN
2    NaN  NaN
3    NaN  NaN
4    NaN  NaN
5    NaN  NaN
6    NaN  NaN
7    NaN  NaN
8    NaN  NaN
9    NaN  NaN

If you need empty strings:

df = pd.DataFrame('', columns=["naming", "type"], index=range(10))
print (df)

  naming type
0
1
2
3
4
5
6
7
8
9
Unmelt Pandas DataFrame
I have a pandas dataframe with two id variables:

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                   'num': [10, 10, 12, 13, 14, 15],
                   'q': ['a', 'b', 'd', 'a', 'b', 'z'],
                   'v': [2, 4, 6, 8, 10, 12]})

   id  num  q   v
0   1   10  a   2
1   1   10  b   4
2   1   12  d   6
3   2   13  a   8
4   2   14  b  10
5   3   15  z  12

I can pivot the table with:

df.pivot('id', 'q', 'v')

and end up with something close:

q     a    b    d    z
id
1     2    4    6  NaN
2     8   10  NaN  NaN
3   NaN  NaN  NaN   12

However, what I really want is (the original unmelted form):

id  num    a    b    d    z
1   10     2    4  NaN  NaN
1   12   NaN  NaN    6  NaN
2   13     8  NaN  NaN  NaN
2   14   NaN   10  NaN  NaN
3   15   NaN  NaN  NaN   12

In other words: 'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to retrieve the original unmelted form), 'q' are my columns, and 'v' are my values in the table.

Update: I found a close solution from Wes McKinney's blog:

df.pivot_table(index=['id','num'], columns='q')

          v
q         a     b    d     z
id num
1  10     2     4  NaN   NaN
   12   NaN   NaN    6   NaN
2  13     8   NaN  NaN   NaN
   14   NaN    10  NaN   NaN
3  15   NaN   NaN  NaN    12

However, the format is not quite the same as what I want above.
You could use set_index and unstack:

In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q  id  num    a     b    d     z
0   1   10  2.0   4.0  NaN   NaN
1   1   12  NaN   NaN  6.0   NaN
2   2   13  8.0   NaN  NaN   NaN
3   2   14  NaN  10.0  NaN   NaN
4   3   15  NaN   NaN  NaN  12.0
You're really close, slaw. Just rename your column index to None and you've got what you want:

df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)

Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, pandas will error out with:

DataError: No numeric types to aggregate

To resolve this, you can specify your own aggregation function by using a custom lambda:

df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc=lambda x: x)
You can remove the name q:

df1.columns = df1.columns.tolist()

Zero's answer plus removing q:

df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns = df1.columns.tolist()

   id  num    a     b    d     z
0   1   10  2.0   4.0  NaN   NaN
1   1   12  NaN   NaN  6.0   NaN
2   2   13  8.0   NaN  NaN   NaN
3   2   14  NaN  10.0  NaN   NaN
4   3   15  NaN   NaN  NaN  12.0
This might work just fine. Pivot:

df2 = df.pivot_table(index=['id', 'num'], columns='q', values='v').reset_index()

Then concatenate the first-level column names with the second:

df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
I came up with a close solution:

df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)

I still can't figure out how to drop 'q' from the dataframe.
It can be done in three steps:

# 1: Prepare an auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])

# 2: pivot is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''

# 3: Bring back the 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])

This is the result, but with a different order of columns:

     a     b    d     z  id  num
0  2.0   4.0  NaN   NaN   1   10
1  NaN   NaN  6.0   NaN   1   12
2  8.0   NaN  NaN   NaN   2   13
3  NaN  10.0  NaN   NaN   2   14
4  NaN   NaN  NaN  12.0   3   15

Alternatively, with the proper order:

def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()
    df.columns.name = ''
    return df

df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')