The idea is to sort values by two columns. Given the two columns below, I expect output something like this:
Expected output
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
3 NaN 5.0
4 10.0 NaN
5 24.0 24.7
6 31.0 31.4
However, using the code below
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'x': [2, 3, 4, 24, 31, '', 10],
                    'y': ['', '', 4.1, 24.7, 31.4, 5, '']})
df1.replace(r'^\s*$', np.nan, regex=True, inplace=True)
rslt_df = df1.sort_values(by=['x', 'y'], ascending=(True, True))
print(rslt_df)
produces the following:
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
6 10.0 NaN
3 24.0 24.7
4 31.0 31.4
5 NaN 5.0
Notice that in the last row, the 5.0 in column y is placed at the bottom. May I know what modification to the code is needed in order to obtain the intended output?
Try sorting by x filled with y (i.e. x.fillna(y)), then reindexing from those sorted values:
df1.reindex(df1['x'].fillna(df1['y']).sort_values().index).reset_index(drop=True)
To update the df1 variable:
df1 = (
df1.reindex(df1['x'].fillna(df1['y']).sort_values().index)
.reset_index(drop=True)
)
df1:
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
3 NaN 5.0
4 10.0 NaN
5 24.0 24.7
6 31.0 31.4
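Runnable end to end (building df1 with np.nan directly rather than empty strings plus replace):

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame with np.nan in place of empty strings.
df1 = pd.DataFrame({'x': [2, 3, 4, 24, 31, np.nan, 10],
                    'y': [np.nan, np.nan, 4.1, 24.7, 31.4, 5, np.nan]})

# Sort key: x where present, y otherwise; reindex the rows by that order.
key = df1['x'].fillna(df1['y'])
out = df1.reindex(key.sort_values().index).reset_index(drop=True)
print(out)
```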
With np.sort and argsort (np.sort places NaN last within each row, so column 0 holds the row-wise minimum of the non-NaN values):
df1.iloc[np.sort(df1[['x', 'y']], axis=1)[:, 0].argsort()]
x y
0 2.0 NaN
1 3.0 NaN
2 4.0 4.1
5 NaN 5.0
6 10.0 NaN
3 24.0 24.7
4 31.0 31.4
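The same idea, runnable end to end (with the question's empty strings replaced by np.nan up front, and reset_index added so the rows are renumbered):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [2, 3, 4, 24, 31, np.nan, 10],
                    'y': [np.nan, np.nan, 4.1, 24.7, 31.4, 5, np.nan]})

# np.sort puts NaN at the end of each row, so column 0 is the row-wise
# minimum of the non-NaN entries; argsort turns that into a row order.
order = np.sort(df1[['x', 'y']].to_numpy(), axis=1)[:, 0].argsort()
out = df1.iloc[order].reset_index(drop=True)
```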
Related
Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value, if both of them are not NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN)?
We can shift the dataframe forwards and backwards, add the two shifted frames together, divide by two, and use the result to fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
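A self-contained version of the above (rebuilding the question's frame), plus a hedged sketch for the follow-up about n neighbours; the n-window rule is my extrapolation, not part of the answer itself:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4],
                   'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})

# shift() is the row above, shift(-1) the row below; their sum is NaN
# unless BOTH neighbours are present, which is exactly the rule asked for.
s1, s2 = df.shift(), df.shift(-1)
filled = df.fillna((s1 + s2) / 2)

# Hedged sketch for the n-neighbour case: average n shifts in each
# direction; any NaN (or frame edge) inside the window leaves the gap
# unfilled. On this short frame that fills very little, but the rule holds.
n = 2
before = sum(df.shift(k) for k in range(1, n + 1)) / n
after = sum(df.shift(-k) for k in range(1, n + 1)) / n
filled_n = df.fillna((before + after) / 2)
```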
So I want to add/append data to a specific pandas DataFrame column, but without it causing NaN values in the remaining columns. I.e.:
DataFrame = pd.DataFrame(columns=["column1", "column2", "column3"])
for i in range(3):
    DataFrame = DataFrame.append({"column1": int(i)}, ignore_index=True)
    DataFrame = DataFrame.append({"column2": float(i * 2)}, ignore_index=True)
    DataFrame = DataFrame.append({"column3": int(i * 5)}, ignore_index=True)
print(DataFrame)
This will return:
column1 column2 column3
0 0.0 NaN NaN
1 NaN 0.0 NaN
2 NaN NaN 0.0
3 1.0 NaN NaN
4 NaN 2.0 NaN
5 NaN NaN 5.0
6 2.0 NaN NaN
7 NaN 4.0 NaN
8 NaN NaN 10.0
What we want returned:
column1 column2 column3
0 0.0 0.0 0.0
1 1.0 2.0 5.0
2 2.0 4.0 10.0
I know I can in this case use one .append for all the different columns. But I have some cases where the data to be appended will vary based on multiple conditions. Hence I'd like to know if it's possible to append to single columns in a dataframe without producing NaN values in the remaining columns. So that I can avoid writing hundreds of if else statements.
Or, if someone has a good idea for how to 'collapse' the NaN values: removing the NaN values without removing the entire row, so that if there is a NaN value at index 0 in column 3 and an integer 5 at index 1 in the same column, the 5 gets moved up to index 0.
Happy to hear any ideas.
IIUC for your current example you can try this:
DataFrame[['column2','column3']]=DataFrame[['column2','column3']].bfill()
Output:
column1 column2 column3
0 0.0 0.0 0.0
1 NaN 0.0 0.0
2 NaN 2.0 0.0
3 1.0 2.0 5.0
4 NaN 2.0 5.0
5 NaN 4.0 5.0
6 2.0 4.0 10.0
7 NaN 4.0 10.0
8 NaN NaN 10.0
Then remove the rows that still contain NaN:
DataFrame.dropna(inplace=True)
Output:
column1 column2 column3
0 0.0 0.0 0.0
3 1.0 2.0 5.0
6 2.0 4.0 10.0
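A hedged alternative (not from the answer above): if the goal is simply to avoid the NaN padding in the first place, collect the values per column in plain lists and build the frame once at the end:

```python
import pandas as pd

# Accumulate per-column values; conditional logic can append to any
# subset of these lists independently, and no NaN padding is created.
cols = {"column1": [], "column2": [], "column3": []}
for i in range(3):
    cols["column1"].append(int(i))
    cols["column2"].append(float(i * 2))
    cols["column3"].append(int(i * 5))

# All lists must end up the same length before constructing the frame.
result = pd.DataFrame(cols)
```

If the lists can end up with different lengths, pad the short ones (e.g. with None) before constructing the DataFrame.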
For example, I have 2 dfs:
df1
ID,col1,col2
1,5,9
2,6,3
3,7,2
4,8,5
and another df is
df2
ID,col1,col2
1,11,9
2,12,7
3,13,2
I first want to calculate the pairwise absolute difference between df2 and df1. I am using scipy.spatial.distance.cdist with a custom function subtract_:
import pandas as pd
from scipy.spatial.distance import cdist

def subtract_(a, b):
    return abs(a - b)

d1_s = df1[['col1']]
d2_s = df2[['col1']]
dist = cdist(d1_s, d2_s, metric=subtract_)
dist_df = pd.DataFrame(dist, columns=d2_s.values.ravel())
print(dist_df)
    11   12   13
0  6.0  7.0  8.0
1  5.0  6.0  7.0
2  4.0  5.0  6.0
3  3.0  4.0  5.0
Now I want to check these new columns (named 11, 12 and 13) for any values less than 5; wherever there is one, I want to do further calculations, like this:
For example, in column '11' the value less than 5 is the 4 at row 2. In that case I want to take df1['col2'] at row 2 (here the value 2) and subtract from it df2['col2'] at row 0 (because column '11' came from the value at row 0 of df2).
My for loop is so complex for this. It would be great, if there would be some easier way in pandas.
Any help, suggestions would be great.
The expected new dataframe is this
0,1,2
NaN,NaN,NaN
NaN,NaN,NaN
(2-9)=-7,NaN,NaN
(5-9)=-4,(5-7)=-2,NaN
Similar to Ben's answer, but with np.where:
pd.DataFrame(np.where(dist_df<5, df1.col2.values[:,None] - df2.col2.values, np.nan),
index=dist_df.index,
columns=dist_df.columns)
Output:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
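As a self-contained sketch, rebuilding both frames from the question; the distance matrix is computed here with plain NumPy broadcasting rather than cdist, which gives the same values:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'col1': [5, 6, 7, 8],
                    'col2': [9, 3, 2, 5]})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'col1': [11, 12, 13],
                    'col2': [9, 7, 2]})

# |df1.col1 - df2.col1| for every pair, via broadcasting (4,1) vs (3,)
dist = np.abs(df1['col1'].to_numpy()[:, None] - df2['col1'].to_numpy())
dist_df = pd.DataFrame(dist, columns=df2['col1'].to_numpy())

# Where the distance is < 5, take df1.col2 - df2.col2 for that pair.
out = pd.DataFrame(
    np.where(dist_df < 5,
             df1['col2'].to_numpy()[:, None] - df2['col2'].to_numpy(),
             np.nan),
    index=dist_df.index, columns=dist_df.columns)
```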
In your case, using numpy with mask (df here is the dist_df computed in the question):
df.mask(df < 5, df - (df1.col2.values[:, None] + df2.col2.values))
Out[115]:
11 12 13
0 6.0 7.0 8.0
1 5.0 6.0 7.0
2 -7.0 5.0 6.0
3 -11.0 -8.0 5.0
Update
Newdf = (df - (-df1.col2.values[:, None] + df2.col2.values) - df).where(df < 5)
Out[148]:
11 12 13
0 NaN NaN NaN
1 NaN NaN NaN
2 -7.0 NaN NaN
3 -4.0 -2.0 NaN
I would like to fill the df's NaN values with an average of adjacent elements.
Consider a dataframe:
df = pd.DataFrame({'val': [1,np.nan, 4, 5, np.nan, 10, 1,2,5, np.nan, np.nan, 9]})
val
0 1.0
1 NaN
2 4.0
3 5.0
4 NaN
5 10.0
6 1.0
7 2.0
8 5.0
9 NaN
10 NaN
11 9.0
My desired output is:
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0 <<< deadend
10 7.0 <<< deadend
11 9.0
I've looked into other solutions such as Fill cell containing NaN with average of value before and after, but this won't work in case of two or more consecutive np.nans.
Any help is greatly appreciated!
Use ffill + bfill and divide by 2:
df = (df.ffill()+df.bfill())/2
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 7.0
10 7.0
11 9.0
EDIT: If the first or last element contains NaN, then use (Dark's suggestion):
df = pd.DataFrame({'val': [np.nan, 1, np.nan, 4, 5, np.nan,
                           10, 1, 2, 5, np.nan, np.nan, 9, np.nan]})
df = (df.ffill()+df.bfill())/2
df = df.bfill().ffill()
print(df)
val
0 1.0
1 1.0
2 2.5
3 4.0
4 5.0
5 7.5
6 10.0
7 1.0
8 2.0
9 5.0
10 7.0
11 7.0
12 9.0
13 9.0
Although in the case of multiple NaNs in a row it doesn't produce the exact output you specified, other users reaching this page may actually prefer the effect of the method interpolate():
df = df.interpolate()
print(df)
val
0 1.0
1 2.5
2 4.0
3 5.0
4 7.5
5 10.0
6 1.0
7 2.0
8 5.0
9 6.333333
10 7.666667
11 9.0
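Both fills, end to end on the question's data (note that interpolate() gives a linear ramp across the double gap, 6.333333 and 7.666667, rather than a shared midpoint):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [1, np.nan, 4, 5, np.nan, 10,
                           1, 2, 5, np.nan, np.nan, 9]})

# Midpoint of the nearest non-NaN values on either side; a run of
# NaNs all receives the same midpoint (the "dead end" behaviour).
midpoint = (df.ffill() + df.bfill()) / 2

# Linear interpolation instead spreads a run of NaNs evenly.
linear = df.interpolate()
```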
Here is my DataFrame:
          0     1        2
0 0     0.0  20.0      NaN
  1     1.0  21.0      NaN
  2     2.0  22.0      NaN
  ID    NaN   NaN  11111.0
  Year  NaN   NaN   2011.0
1 0     3.0  23.0      NaN
  1     4.0  24.0      NaN
  2     5.0  25.0      NaN
  3     6.0  26.0      NaN
  ID    NaN   NaN  11111.0
  Year  NaN   NaN   2012.0
I want to convert the 'ID' and 'Year' rows to the DataFrame index, with 'ID' being level=0 and 'Year' being level=1. I tried using stack() but still cannot figure it out.
Edited: my desired output should look like below:
              0     1
11111 2011 0  0.0  20.0
           1  1.0  21.0
           2  2.0  22.0
      2012 0  3.0  23.0
           1  4.0  24.0
           2  5.0  25.0
           3  6.0  26.0
This should work:
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']  # just the ID/Year rows of column '2'
dfs = df1.unstack()                                  # one row per group, columns ID and Year
dfi = df1.index
dfn = df.drop(dfi).drop('2', axis=1).unstack()       # data rows, widened by row number
dfn.set_index([dfs.ID, dfs.Year]).stack()            # re-key by (ID, Year), restore row numbers
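A self-contained sketch of those steps. The frame is rebuilt from the printout above, assuming the inner index labels are the strings '0'..'3', 'ID', 'Year' and the columns are the integers 0, 1, 2 (the original may instead use the string '2', as the answer's code suggests):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0, '0'), (0, '1'), (0, '2'), (0, 'ID'), (0, 'Year'),
     (1, '0'), (1, '1'), (1, '2'), (1, '3'), (1, 'ID'), (1, 'Year')])
df = pd.DataFrame({
    0: [0, 1, 2, np.nan, np.nan, 3, 4, 5, 6, np.nan, np.nan],
    1: [20, 21, 22, np.nan, np.nan, 23, 24, 25, 26, np.nan, np.nan],
    2: [np.nan] * 3 + [11111, 2011] + [np.nan] * 4 + [11111, 2012],
}, index=idx)

# 1. Pull the ID/Year rows out of column 2 and pivot them into a
#    per-group lookup table (index: group, columns: ID and Year).
meta = df.loc[pd.IndexSlice[:, ['ID', 'Year']], 2]
lookup = meta.unstack()

# 2. Drop the metadata rows and column, widen the data rows by row
#    number, re-key by (ID, Year), then stack the row numbers back;
#    dropna removes the padding for group 0, which has no row '3'.
data = df.drop(index=meta.index).drop(columns=2).unstack()
out = data.set_index([lookup['ID'], lookup['Year']]).stack().dropna(how='all')
```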