I have a question about how to create a new column based on the values of several columns.
Input:
col1 col2 col3
1 1 1
NULL NULL NULL
2 NULL 2
NULL NULL 3
4 NULL NULL
5 5 NULL
Output
col1 col2 col3 new
1 1 1 1
NULL NULL NULL NULL
2 NULL 2 2
NULL NULL 3 3
4 NULL NULL 4
5 5 NULL 5
I am trying to use combine_first, but it does not seem to be a good choice since I have multiple columns that need to be combined.
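One option is to rename the columns to have the same name, then use groupby + first (note that grouping with axis=1 is deprecated in recent pandas versions, where transposing and grouping the rows is the usual substitute):
df['new'] = (df.rename(columns={col: 'col' for col in df.columns})
               .groupby(level=0, axis=1).first())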
You could also iteratively use combine_first:
df['new'] = float('nan')
for c in df.columns:
    df['new'] = df['new'].combine_first(df[c])
Or you could apply a lambda that selects non-NaN values row-wise (works for Python>=3.8 since it uses the walrus operator; could write the same function differently if you have Python<3.8):
df['new'] = df.apply(lambda x: res[0] if (res:=x[x.notna()].tolist()) else float('nan'), axis=1)
Output:
col1 col2 col3 new
0 1.0 1.0 1.0 1.0
1 NaN NaN NaN NaN
2 2.0 NaN 2.0 2.0
3 NaN NaN 3.0 3.0
4 4.0 NaN NaN 4.0
5 5.0 5.0 NaN 5.0
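Another possibility, not taken from the answers above but a common idiom: back-fill along each row and take the first column, which leaves the first non-missing value of every row. A minimal sketch, assuming the frame from the question (the name `first_valid` is only illustrative):

```python
import numpy as np
import pandas as pd

# Frame from the question, with NULL written as NaN
df = pd.DataFrame({
    "col1": [1, np.nan, 2, np.nan, 4, 5],
    "col2": [1, np.nan, np.nan, np.nan, np.nan, 5],
    "col3": [1, np.nan, 2, 3, np.nan, np.nan],
})

# Back-fill across the columns so the first non-NaN value of each row
# ends up in the first column, then keep only that column
first_valid = df.bfill(axis=1).iloc[:, 0]
df["new"] = first_valid
print(df)
```

Rows that are entirely missing stay NaN, which matches the expected output in the question.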
I have the following dataframe:
Subject Val1 Val1 Int Val1 Val1 Int2 Val1
A 1 2 3 NaN NaN Sp NaN
B NaN NaN NaN 2 3 NaN NaN
C NaN NaN 4 NaN NaN 0 3
D NaN NaN 3 NaN NaN 8 NaN
I want to end up with only 2 columns named Val1, because there are at most 2 non-NaN Val1 values for a given subject. Namely, the output would look like this:
Subject Val1 Val1 Int Int2
A 1 2 3 Sp
B 2 3 NaN NaN
C 3 NaN 4 0
D NaN NaN 3 8
Is there a function in pandas to do this in a clean way? Clean meaning only a few lines of code. One way would be to iterate through the rows with a for loop and bring all non-NaN values to the left, but I'd like something cleaner and more efficient as well.
The idea is to group by the duplicated column names, sort the values in each row of a group so that missing values go last, and then drop the columns that contain only missing values:
df = df.set_index('Subject')
f = lambda x: pd.DataFrame(x.apply(sorted, key=pd.isna, axis=1).tolist(), index=x.index)
df = df.groupby(level=0, axis=1).apply(f).dropna(axis=1, how='all').droplevel(1, axis=1)
print (df)
Int Int2 Val1 Val1
Subject
A 3.0 Sp 1.0 2.0
B NaN NaN 2.0 3.0
C 4.0 0 3.0 NaN
D 3.0 8 NaN NaN
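Since grouping with axis=1 is deprecated in newer pandas, here is a rough sketch of the same idea with an explicit loop over the unique column names. It rebuilds the frame from the question; the names `pieces`, `block` and `result` are only illustrative, and this is one possible way to write it rather than the answer's own method:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[1, 2, 3, nan, nan, "Sp", nan],
     [nan, nan, nan, 2, 3, nan, nan],
     [nan, nan, 4, nan, nan, 0, 3],
     [nan, nan, 3, nan, nan, 8, nan]],
    columns=["Val1", "Val1", "Int", "Val1", "Val1", "Int2", "Val1"],
    index=pd.Index(["A", "B", "C", "D"], name="Subject"),
)

pieces = []
for name in df.columns.unique():
    # All columns sharing this name
    block = df.loc[:, df.columns == name]
    # Stable sort of each row: non-missing values first, NaN last
    rows = [sorted(row, key=pd.isna) for row in block.to_numpy().tolist()]
    out = pd.DataFrame(rows, index=block.index, columns=block.columns)
    # Keep only the columns that still hold at least one value
    keep = out.notna().any(axis=0).to_numpy()
    pieces.append(out.loc[:, keep])

result = pd.concat(pieces, axis=1)
print(result)
```

The columns come out in order of first appearance (Val1, Val1, Int, Int2) rather than sorted alphabetically as in the groupby version.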
I have the following DataFrame structure with example data:
Col1 Col2 Col3
1 1 8
5 4 7
3 9 9
1 NaN NaN
Columns have a sequential ordering, meaning Col1 comes before Col2 and so on...
I want to compare if two (or more) subsequent columns have the same value. If so I want to drop the entire row. NaN values can appear but should not be treated as having the same value
So with the rows above, I'd like to have row 1 and 3 dropped (row 1: Col1->Col2 same value, row 3: Col2 -> Col3 same value) and row 2 and 4 to be kept in the dataframe.
How can I accomplish this? Thanks!
Use DataFrame.diff along the columns and keep only the rows that contain no 0 difference: DataFrame.ne(0) tests for "not equal to 0", DataFrame.all(axis=1) checks that every value in a row is True, and the resulting mask is used for boolean indexing:
df = df[df.diff(axis=1).ne(0).all(axis=1)]
print (df)
Col1 Col2 Col3
1 5 4.0 7.0
3 1 NaN NaN
Detail:
print (df.diff(axis=1))
Col1 Col2 Col3
0 NaN 0.0 7.0
1 NaN -1.0 3.0
2 NaN 6.0 0.0
3 NaN NaN NaN
print (df.diff(axis=1).ne(0))
Col1 Col2 Col3
0 True False True
1 True True True
2 True True False
3 True True True
print (df.diff(axis=1).ne(0).all(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
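If the columns are not numeric, diff is not available, but the same check can be written as a direct comparison of each column with its left neighbour. A small sketch under the assumption that the data matches the question; `same_as_previous` is a made-up name. NaN never compares equal to NaN, so missing values are not treated as duplicates:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 5, 3, 1],
    "Col2": [1, 4, 9, np.nan],
    "Col3": [8, 7, 9, np.nan],
})

# Compare every column (from the second on) with the column to its left
right = df.iloc[:, 1:].to_numpy()
left = df.iloc[:, :-1].to_numpy()
same_as_previous = right == left

# Keep rows where no pair of neighbouring columns is equal
kept = df[~same_as_previous.any(axis=1)]
print(kept)
```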
Let's take this dataframe :
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4,5,6], Col2=[4,np.nan,5,np.nan,1,5]))
Col1 Col2
0 1.0 4.0
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
4 5.0 1.0
5 6.0 5.0
I would like to extract the n last rows of df with no NaN.
Could you please help me to get this expected result ?
Col1 Col2
0 5 1
1 6 5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail and converting to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
#alternative
#df1 = df.dropna().iloc[-N:].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
EDIT: To get the last group of rows with no missing values, flag the rows that contain missing values with DataFrame.isna and DataFrame.any, reverse the order, and take the cumulative sum, so that only the last group has 0 in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Col1 Col2
4 5 1
5 6 5
If no row matches, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1 = [1,2,np.nan,4], Col2=[np.nan,np.nan,5,np.nan]))
print (df)
Col1 Col2
0 1.0 NaN
1 2.0 NaN
2 NaN 5.0
3 4.0 NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print (df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to find the index of the last row containing a NaN, and then filter by position (this assumes the default integer index):
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
Col1 Col2
4 5.0 1.0
5 6.0 5.0
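The two readings of the question (last N complete rows versus the trailing block of complete rows) can give different results, which is easy to see when N is larger than the trailing block. A short sketch using the frame from the question; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4, 5, 6],
                       Col2=[4, np.nan, 5, np.nan, 1, 5]))

# Reading 1: the last N rows without NaN, wherever they sit in the frame
N = 3
last_n_complete = df.dropna().tail(N)

# Reading 2: only the uninterrupted run of complete rows at the end
has_nan = df.isna().any(axis=1)
trailing_mask = has_nan.iloc[::-1].cumsum().eq(0).sort_index()
trailing_block = df[trailing_mask]

print(last_n_complete.index.tolist())
print(trailing_block.index.tolist())
```

With N = 3 the first reading reaches back to row 0, while the second stops at the last NaN and returns only rows 4 and 5.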
I have a list of NaN values in my dataframe and I want to replace NaN values with an empty string.
What I've tried so far, which isn't working:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';', encoding='utf-8')
df_conbid_N_1['Excep_Test'] = df_conbid_N_1['Excep_Test'].replace("NaN","")
Use fillna (docs):
An example -
df = pd.DataFrame({'no': [1, 2, 3],
'Col1':['State','City','Town'],
'Col2':['abc', np.nan, 'defg'],
'Col3':['Madhya Pradesh', 'VBI', 'KJI']})
df
no Col1 Col2 Col3
0 1 State abc Madhya Pradesh
1 2 City NaN VBI
2 3 Town defg KJI
df['Col2'] = df['Col2'].fillna('')
df
no Col1 Col2 Col3
0 1 State abc Madhya Pradesh
1 2 City VBI
2 3 Town defg KJI
Simple! You can do it this way:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';',encoding='utf-8').fillna("")
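To see why the replace("NaN", "") attempt in the question has no effect: even with dtype=str, read_csv stores empty fields as real missing values, not as the text "NaN". A small sketch with an in-memory CSV standing in for test-2019.csv (the file contents here are invented for illustration):

```python
import io
import pandas as pd

# Stand-in for the real file; the second row has an empty Excep_Test field
csv_text = "id;Excep_Test\n1;abc\n2;\n3;def\n"

df = pd.read_csv(io.StringIO(csv_text), dtype=str, sep=";")

# Looks for the literal string "NaN", so the missing value is left alone
unchanged = df["Excep_Test"].replace("NaN", "")

# Targets actual missing values
filled = df["Excep_Test"].fillna("")

# Alternatively, avoid creating missing values while reading
df_no_na = pd.read_csv(io.StringIO(csv_text), dtype=str, sep=";",
                       keep_default_na=False)

print(unchanged.isna().sum(), filled.tolist(), df_no_na["Excep_Test"].tolist())
```

Passing keep_default_na=False also stops strings such as "NA" or "null" from being read as missing, so it is only appropriate if that is what you want.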
We have pandas' fillna to fill missing values.
Let's go through some use cases with a sample dataframe:
df = pd.DataFrame({'col1':['John', np.nan, 'Anne'], 'col2':[np.nan, 3, 4]})
col1 col2
0 John NaN
1 NaN 3.0
2 Anne 4.0
As mentioned in the docs, fillna accepts the following as fill values:
value: scalar, dict, Series, or DataFrame
So we can replace with a constant value, such as an empty string with:
df.fillna('')
col1 col2
0 John
1 3
2 Anne 4
You can also replace with a dictionary mapping column_name:replace_value:
df.fillna({'col1':'Alex', 'col2':2})
col1 col2
0 John 2.0
1 Alex 3.0
2 Anne 4.0
Or you can also replace with another pd.Series or pd.DataFrame:
df_other = pd.DataFrame({'col1':['John', 'Franc', 'Anne'], 'col2':[5, 3, 4]})
df.fillna(df_other)
col1 col2
0 John 5.0
1 Franc 3.0
2 Anne 4.0
This is very useful since it allows you to fill missing values on the dataframes' columns using some extracted statistic from the columns, such as the mean or mode. Say we have:
df = pd.DataFrame(np.random.choice(np.r_[np.nan, np.arange(3)], (3,5)))
print(df)
0 1 2 3 4
0 NaN NaN 0.0 1.0 2.0
1 NaN 2.0 NaN 2.0 1.0
2 1.0 1.0 2.0 NaN NaN
Then we can easily do:
df.fillna(df.mean())
0 1 2 3 4
0 1.0 1.5 0.0 1.0 2.0
1 1.0 2.0 1.0 2.0 1.0
2 1.0 1.0 2.0 1.5 1.5
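The dictionary form and the statistics idea can be combined when columns have different types, for example the mode for a text column and the mean for a numeric one. A minimal sketch; the sample data and the name `fill_values` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["John", np.nan, "Anne", "John"],
    "col2": [np.nan, 3, 4, 5],
})

# One fill value per column: most frequent name, average number
fill_values = {
    "col1": df["col1"].mode().iloc[0],
    "col2": df["col2"].mean(),
}
filled = df.fillna(fill_values)
print(filled)
```

mode() can return several values when there is a tie, so taking .iloc[0] is a simplification.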
I currently have a dataframe which looks like this:
col1 col2 col3
1 2 3
2 3 NaN
3 4 NaN
2 NaN NaN
0 2 NaN
What I want to do is apply some condition to the column values and return the final result in a new column.
The condition is to assign values based on this order of priority, with 2 being the highest priority: [2,1,3,0,4]
I tried to define a function to append the final results but wasn't really getting anywhere... any thoughts?
The desired outcome would look something like:
col1 col2 col3 col4
1 2 3 2
2 3 NaN 2
3 4 NaN 3
2 NaN NaN 2
0 2 NaN 2
where col4 is the new column created.
Thanks
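First you may want to get rid of the NaNs (assign the result back, since fillna returns a new frame):
df = df.fillna(5)
and then apply a function to every row to find your value:
def func(x, l=(2, 1, 3, 0, 4, 5)):
    # return the first value in priority order that appears in the row
    for j in l:
        if j in x:
            return j
df['new'] = df.apply(lambda x: func(list(x)), axis=1)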
Output:
col1 col2 col3 new
0 1 2 3 2
1 2 3 5 2
2 3 4 5 3
3 2 5 5 2
4 0 2 5 2
Maybe a little late, but:
import numpy as np
def f(x):
    for i in [2,1,3,0,4]:
        if i in x.tolist():
            return i
    return np.nan
df["col4"] = df.apply(f, axis=1)
and the Output:
col1 col2 col3 col4
0 1 2.0 3.0 2
1 2 3.0 NaN 2
2 3 4.0 NaN 3
3 2 NaN NaN 2
4 0 2.0 NaN 2
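For larger frames a row-wise apply can be slow. One possible vectorized variant, sketched here under the assumption that the data matches the question, walks the priority list from lowest to highest and overwrites the result wherever the value is present, so the highest priority is written last:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 2, 0],
    "col2": [2, 3, 4, np.nan, 2],
    "col3": [3, np.nan, np.nan, np.nan, np.nan],
})
value_columns = ["col1", "col2", "col3"]
priority = [2, 1, 3, 0, 4]

# Start with NaN everywhere; rows with none of the values stay NaN
result = pd.Series(np.nan, index=df.index)
for value in reversed(priority):
    # True for rows that contain this value in any of the columns
    has_value = df[value_columns].eq(value).any(axis=1)
    result = result.mask(has_value, value)

df["col4"] = result
print(df)
```

The loop runs once per priority value rather than once per row, and the result column is float because NaN is used as the "no match" marker.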