Pandas force NaN to bottom of each column at each index - python
I have a DataFrame in which multiple rows share each index value. The first index, for example, has the following structure:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0])
Out[4]:
    ID   Name  val1  val2  val3
0    A  first     1     1   NaN
0  NaN    NaN     2   NaN     2
0  NaN    NaN   NaN     3     3
I would like to sort/order each column so that the NaNs sit at the bottom of each column within that index - a result which looks like this:
    ID   Name  val1  val2  val3
0    A  first     1     1     2
0  NaN    NaN     2     3     3
0  NaN    NaN   NaN   NaN   NaN
A more explicit example might look like this:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0],
                   ["B", "second", 4.0, 4.0, np.nan],
                   [np.nan, np.nan, 5.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 6.0, 6.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0, 1, 1, 1])
Out[5]:
    ID    Name  val1  val2  val3
0    A   first     1     1   NaN
0  NaN     NaN     2   NaN     2
0  NaN     NaN   NaN     3     3
1    B  second     4     4   NaN
1  NaN     NaN     5   NaN     5
1  NaN     NaN   NaN     6     6
with the desired result looking like this:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN   NaN   NaN   NaN
1    B  second     4     4     5
1  NaN     NaN     5     6     6
1  NaN     NaN   NaN   NaN   NaN
I have many thousands of rows in this dataframe, with each index containing up to a few hundred rows. The desired result will be very helpful when I write the dataframe out with to_csv.
I have attempted sort_values(['val1','val2','val3']) on the whole dataframe, but this leaves the indices disordered. I have tried iterating through each index and sorting in place, but this too does not push the NaNs to the bottom of each index's columns. I have also tried filling the NaNs with another value, such as 0, via fillna, but I have not been successful there either.
While I am certainly using it wrong, the na_position parameter of sort_values does not produce the desired outcome, though it seems this is likely what I want.
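To see why na_position alone cannot help here: sort_values reorders whole rows, so a NaN in one column travels with the row's other values instead of sinking within its own column. A minimal sketch of that failure mode, reusing the question's first frame:

```python
import numpy as np
import pandas as pd

# The question's first frame.
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0])

# sort_values moves entire rows at once: here val1 is already ascending,
# so the row order is unchanged and val3 keeps its NaN at the top.
out = df.sort_values(["val1", "val2", "val3"], na_position="last")
print(out)
```

In other words, na_position controls where NaN rows land in a row-wise sort; it never rearranges columns independently.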
Edit:
The final df's index is not required to be in numerical order as in my second example.
By changing ignore_index to False in the single line of @Leb's third code block,
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
to
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=False)
and by creating a temp df for all rows in a given index, I was able to make this work - not pretty, but it orders things how I need them. If someone (certainly) has a better way, please let me know.
new_df = df.loc[0]  # df.ix was removed in later pandas; .loc works here
new_df = pd.concat([new_df[col].sort_values().reset_index(drop=True)
                    for col in new_df], axis=1, ignore_index=False)
max_index = df.index[-1]
for i in range(1, max_index + 1):
    tmp = df.loc[i]
    tmp = pd.concat([tmp[col].sort_values().reset_index(drop=True)
                     for col in tmp], axis=1, ignore_index=False)
    new_df = pd.concat([new_df, tmp])
In [10]: new_df
Out[10]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
1  NaN     NaN     2     3     3
2  NaN     NaN   NaN   NaN   NaN
0    B  second     4     4     5
1  NaN     NaN     5     6     6
2  NaN     NaN   NaN   NaN   NaN
I know the issue of pushing NaNs to an edge has been discussed on GitHub. For your particular frame, I'd probably do it manually at the Python level and not worry much about performance. Something like
>>> df.groupby(level=0, sort=False).transform(lambda x: sorted(x, key=pd.isnull))
    ID    Name val1 val2 val3
0    A   first    1    1    2
0  NaN     NaN    2    3    3
0  NaN     NaN  NaN  NaN  NaN
1    B  second    4    4    5
1  NaN     NaN    5    6    6
1  NaN     NaN  NaN  NaN  NaN
should work. Note that since sorted is a stable sort, and we're using pd.isnull as the key (where False < True), we push the NaNs to the end while preserving the order of the remaining objects. Also note that here I'm grouping just on the index; we could alternatively have grouped on whatever we wanted.
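The key-function mechanics can be checked in isolation on a toy list (illustrative data, not from the question): pd.isnull maps real values to False and NaN to True, and False < True.

```python
import numpy as np
import pandas as pd

values = [np.nan, 3.0, np.nan, 1.0, 2.0]

# sorted() is stable: entries with equal keys keep their relative order,
# so 3.0, 1.0, 2.0 stay in their original order (all keyed False) while
# the NaNs (keyed True) move to the end.
result = sorted(values, key=pd.isnull)
print(result)  # [3.0, 1.0, 2.0, nan, nan]
```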
Given df:
pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
              [np.nan, np.nan, 2.0, np.nan, 2.0],
              [np.nan, np.nan, np.nan, 3.0, 3.0]],
             columns=["ID", "Name", "val1", "val2", "val3"],
             index=[0, 1, 2])
I changed the index to make sure the order stays.
df
Out[127]:
    ID   Name  val1  val2  val3
0    A  first     1     1   NaN
1  NaN    NaN     2   NaN     2
2  NaN    NaN   NaN     3     3
Using:
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
Will give:
Out[130]:
     0      1    2    3    4
0    A  first    1    1    2
1  NaN    NaN    2    3    3
2  NaN    NaN  NaN  NaN  NaN
Same for:
df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0],
                   ["B", "second", 4.0, 4.0, np.nan],
                   [np.nan, np.nan, 5.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 6.0, 6.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0, 1, 1, 1])
df
Out[132]:
    ID    Name  val1  val2  val3
0    A   first     1     1   NaN
0  NaN     NaN     2   NaN     2
0  NaN     NaN   NaN     3     3
1    B  second     4     4   NaN
1  NaN     NaN     5   NaN     5
1  NaN     NaN   NaN     6     6
pd.concat([df[col].sort_values().reset_index(drop=True) for col in df], axis=1, ignore_index=True)
Out[133]:
     0       1    2    3    4
0    A   first    1    1    2
1    B  second    2    3    3
2  NaN     NaN    4    4    5
3  NaN     NaN    5    6    6
4  NaN     NaN  NaN  NaN  NaN
5  NaN     NaN  NaN  NaN  NaN
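One way to keep this per-column trick from mixing values across indices, sketched here as an assumption rather than as part of the original answer, is to run it inside each group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["A", "first", 1.0, 1.0, np.nan],
                   [np.nan, np.nan, 2.0, np.nan, 2.0],
                   [np.nan, np.nan, np.nan, 3.0, 3.0],
                   ["B", "second", 4.0, 4.0, np.nan],
                   [np.nan, np.nan, 5.0, np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 6.0, 6.0]],
                  columns=["ID", "Name", "val1", "val2", "val3"],
                  index=[0, 0, 0, 1, 1, 1])

def push_nans_down(group):
    # Sort each column of one group independently; sort_values puts NaNs
    # last by default, and reset_index realigns the columns by position.
    return pd.concat([group[col].sort_values().reset_index(drop=True)
                      for col in group], axis=1)

result = df.groupby(level=0, sort=False).apply(push_nans_down)
print(result)
```

Because the sort runs per group, values from index 1 can never migrate into the rows of index 0.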
After additional comments
new = pd.concat([df[col].sort_values().reset_index(drop=True)
                 for col in df.iloc[:, 2:]], axis=1, ignore_index=True)
new.index = df.index
cols = df.iloc[:, 2:].columns
new.columns = cols
df.drop(cols, inplace=True, axis=1)
df = pd.concat([df, new], axis=1)
df
Out[37]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN     4     4     5
1    B  second     5     6     6
1  NaN     NaN   NaN   NaN   NaN
1  NaN     NaN   NaN   NaN   NaN
In [219]:
# Series.sort was removed in later pandas; sort_values is the equivalent.
# Returning an array makes transform assign the sorted values by position.
df.groupby(level=0).transform(lambda x: x.sort_values(na_position='last').to_numpy())
Out[219]:
    ID    Name  val1  val2  val3
0    A   first     1     1     2
0  NaN     NaN     2     3     3
0  NaN     NaN   NaN   NaN   NaN
1    B  second     4     4     5
1  NaN     NaN     5     6     6
1  NaN     NaN   NaN   NaN   NaN
Related
Clean way to rearrange columns that are repeated and have nans in them
I have the following dataframe:

Subject  Val1  Val1  Int  Val1  Val1  Int2  Val1
A        1     2     3    NaN   NaN   Sp    NaN
B        NaN   NaN   NaN  2     3     NaN   NaN
C        NaN   NaN   4    NaN   NaN   0     3
D        NaN   NaN   3    NaN   NaN   8     NaN

I want to end up with only 2 columns named Val1, because Val1 has at most 2 non-NaN values for any given subject. Namely, the output would look like this:

Subject  Val1  Val1  Int  Int2
A        1     2     3    Sp
B        2     3     NaN  NaN
C        3     NaN   4    0
D        NaN   NaN   3    8

Is there a function in pandas to do this in a clean way? Clean meaning only a few lines of code. One way would be to iterate through the rows with a for loop and bring all non-NaN values to the left, but I'd like something cleaner and more efficient.
The idea is to group by the duplicated column names and, per group, use a lambda that sorts each row's values by missing-ness, so the all-NaN columns can be removed in the last step:

df = df.set_index('Subject')
f = lambda x: pd.DataFrame(x.apply(sorted, key=pd.isna, axis=1).tolist(), index=x.index)
df = df.groupby(level=0, axis=1).apply(f).dropna(axis=1, how='all').droplevel(1, axis=1)
print (df)

         Int Int2  Val1  Val1
Subject
A        3.0   Sp   1.0   2.0
B        NaN  NaN   2.0   3.0
C        4.0    0   3.0   NaN
D        3.0    8   NaN   NaN
Merge almost identical rows after removing nan values
I have a dataframe like:

pri_col  col1  col2  Date
r1       3     4     2020-09-10
r1       4     1     2020-09-11
r1       2     7     2020-09-12
r1       6     4     2020-09-13

Note: there are many more unique values in the 'pri_col' column; this is just a sample, so I'm giving a single value here. Also, for a single value of 'pri_col', the value of 'Date' will always be unique. I need the dataframe like:

pri_col  col1_2020-09-10  col1_2020-09-11  col1_2020-09-12  col1_2020-09-13  col2_2020-09-10  col2_2020-09-11  col2_2020-09-12  col2_2020-09-13
r1       3                4                2                6                4                1                7                4

Following a previous solution, I have tried this:

df = (df.reset_index()
        .melt(id_vars=['index','pri_col','Date'], var_name='cols', value_name='val')
        .pivot(index=['index','pri_col'], columns=['cols','Date'], values='val'))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1).rename_axis(None)
print (df)

But this is the resulting dataframe:

pri_col  col1_2020-09-10  col1_2020-09-11  col1_2020-09-12  col1_2020-09-13  col2_2020-09-10  col2_2020-09-11  col2_2020-09-12  col2_2020-09-13
r1       3                NaN              NaN              NaN              4                NaN              NaN              NaN
r1       NaN              4                NaN              NaN              NaN              1                NaN              NaN
r1       NaN              NaN              2                NaN              NaN              NaN              7                NaN
r1       NaN              NaN              NaN              6                NaN              NaN              NaN              4

How do I solve the issue? Also, I had asked a question recently that may sound similar.
IIUC, use pandas.DataFrame.set_index with unstack:

new_df = df.set_index(['pri_col', 'Date']).unstack()
new_df.columns = ["%s_%s" % (i, j) for i, j in new_df.columns]
print(new_df)

Output:

         col1_2020-09-10  col1_2020-09-11  col1_2020-09-12  col1_2020-09-13  \
pri_col
r1                     3                4                2                6

         col2_2020-09-10  col2_2020-09-11  col2_2020-09-12  col2_2020-09-13
pri_col
r1                     4                1                7                4
How to insert an empty column after each column in an existing dataframe
I have a dataframe that looks as follows:

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [3, 4, 5, 6], "C": [2, 3, 4, 5]})

I would like to insert an empty column (with type string) after each existing column in the dataframe, such that the output looks like:

   A  col1  B  col2  C  col3
0  1   NaN  3   NaN  2   NaN
1  2   NaN  4   NaN  3   NaN
2  3   NaN  5   NaN  4   NaN
3  4   NaN  6   NaN  5   NaN
Actually there's a much simpler way, thanks to reindex:

df.reindex([x for i, c in enumerate(df.columns, 1) for x in (c, f'col{i}')], axis=1)

Result:

   A  col1  B  col2  C  col3
0  1   NaN  3   NaN  2   NaN
1  2   NaN  4   NaN  3   NaN
2  3   NaN  5   NaN  4   NaN
3  4   NaN  6   NaN  5   NaN

Here's the other, more complicated way:

import numpy as np
df.join(pd.DataFrame(np.empty(df.shape, dtype=object), columns=df.columns + '_sep')).sort_index(axis=1)

   A A_sep  B B_sep  C C_sep
0  1  None  3  None  2  None
1  2  None  4  None  3  None
2  3  None  5  None  4  None
3  4  None  6  None  5  None
This solution worked for me:

merged = pd.concat([myDataFrame, pd.DataFrame(columns=[' '])], axis=1)
This is what you can do:

for count in range(len(df.columns)):
    # Insert np.nan rather than the string 'NaN' so the columns hold real missing values.
    df.insert(count * 2 + 1, 'col' + str(count + 1), np.nan)
print(df)

Output:

   A  col1  B  col2  C  col3
0  1   NaN  3   NaN  2   NaN
1  2   NaN  4   NaN  3   NaN
2  3   NaN  5   NaN  4   NaN
3  4   NaN  6   NaN  5   NaN
How to fill and merge df with 10 empty rows?
How do I fill a df with empty rows, or create a df with empty rows? I have this df:

df = pd.DataFrame(columns=["naming", "type"])

How do I fill this df with empty rows?
Specify index values:

df = pd.DataFrame(columns=["naming", "type"], index=range(10))
print (df)

  naming type
0    NaN  NaN
1    NaN  NaN
2    NaN  NaN
3    NaN  NaN
4    NaN  NaN
5    NaN  NaN
6    NaN  NaN
7    NaN  NaN
8    NaN  NaN
9    NaN  NaN

If you need empty strings:

df = pd.DataFrame('', columns=["naming", "type"], index=range(10))
print (df)

  naming type
0
1
2
3
4
5
6
7
8
9
Unmelt Pandas DataFrame
I have a pandas dataframe with two id variables:

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3],
                   'num': [10, 10, 12, 13, 14, 15],
                   'q': ['a', 'b', 'd', 'a', 'b', 'z'],
                   'v': [2, 4, 6, 8, 10, 12]})

   id  num  q   v
0   1   10  a   2
1   1   10  b   4
2   1   12  d   6
3   2   13  a   8
4   2   14  b  10
5   3   15  z  12

I can pivot the table with:

df.pivot('id', 'q', 'v')

and end up with something close:

q     a    b    d    z
id
1     2    4    6  NaN
2     8   10  NaN  NaN
3   NaN  NaN  NaN   12

However, what I really want is (the original unmelted form):

id  num    a    b    d    z
1   10     2    4  NaN  NaN
1   12   NaN  NaN    6  NaN
2   13     8  NaN  NaN  NaN
2   14   NaN   10  NaN  NaN
3   15   NaN  NaN  NaN   12

In other words: 'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to retrieve the original unmelted form), 'q' are my columns, and 'v' are my values in the table.

Update: I found a close solution from Wes McKinney's blog:

df.pivot_table(index=['id','num'], columns='q')

          v
q         a     b    d     z
id num
1  10     2     4  NaN   NaN
   12   NaN   NaN    6   NaN
2  13     8   NaN  NaN   NaN
   14   NaN    10  NaN   NaN
3  15   NaN   NaN  NaN    12

However, the format is not quite the same as what I want above.
You could use set_index and unstack:

In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q  id  num    a     b    d     z
0   1   10  2.0   4.0  NaN   NaN
1   1   12  NaN   NaN  6.0   NaN
2   2   13  8.0   NaN  NaN   NaN
3   2   14  NaN  10.0  NaN   NaN
4   3   15  NaN   NaN  NaN  12.0
You're really close, slaw. Just rename your column index to None and you've got what you want:

df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)

Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, pandas will error out with:

DataError: No numeric types to aggregate

To resolve this, you can specify your own aggregation function by using a custom lambda:

df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc=lambda x: x)
You can remove the name q:

df1.columns = df1.columns.tolist()

Zero's answer plus removing q:

df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns = df1.columns.tolist()

   id  num    a     b    d     z
0   1   10  2.0   4.0  NaN   NaN
1   1   12  NaN   NaN  6.0   NaN
2   2   13  8.0   NaN  NaN   NaN
3   2   14  NaN  10.0  NaN   NaN
4   3   15  NaN   NaN  NaN  12.0
This might work just fine. Pivot:

df2 = df.pivot_table(index=['id', 'num'], columns='q', values='v').reset_index()

Then concatenate the first-level column names with the second:

df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
I came up with a close solution:

df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)

I still can't figure out how to drop 'q' from the dataframe.
It can be done in three steps:

# 1: Prepare an auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])

# 2: pivot is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''

# 3: Bring back the 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])

This is the result, but with a different order of columns:

     a     b    d     z  id  num
0  2.0   4.0  NaN   NaN   1   10
1  NaN   NaN  6.0   NaN   1   12
2  8.0   NaN  NaN   NaN   2   13
3  NaN  10.0  NaN   NaN   2   14
4  NaN   NaN  NaN  12.0   3   15

Alternatively, with the proper order:

def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()
    df.columns.name = ''
    return df

df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')