I have three columns in a pandas DataFrame with NaN values:
>>> d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
>>> d
   col1  col2  col3
0     1     3     5
1     2     4     6
>>> d.loc[0, 'col2'] = np.nan
>>> d
   col1  col2  col3
0     1   NaN     5
1     2   4.0     6
>>> d.loc[1, 'col1'] = np.nan
>>> d
   col1  col2  col3
0   1.0   NaN     5
1   NaN   4.0     6
>>> d.loc[1, 'col3'] = np.nan
>>> d
   col1  col2  col3
0   1.0   NaN   5.0
1   NaN   4.0   NaN
Now, I would like column addition to produce the following output:
>>> d['col1'] + d['col3']
0    6.0
1    NaN
>>> d['col1'] + d['col2']
0    1.0
1    4.0
However, the actual output is:
>>> d['col1'] + d['col3']
0    6.0
1    NaN
>>> d['col1'] + d['col2']
0    NaN
1    NaN
Does anyone know how to achieve this? (That is, a NaN should act as 0 unless both operands are NaN.)
You can use add to get your sums, with fill_value=0:
>>> d.col1.add(d.col2, fill_value=0)
0    1.0
1    4.0
dtype: float64
>>> d.col1.add(d.col3, fill_value=0)
0    6.0
1    NaN
dtype: float64
When adding columns one and two, use Series.add with fill_value=0.
>>> d
   col1  col2  col3
0   1.0   NaN   5.0
1   NaN   4.0   NaN
>>>
>>> d['col1'].add(d['col2'], fill_value=0)
0    1.0
1    4.0
dtype: float64
DataFrames and Series have methods such as add, sub, and so on, which perform more sophisticated operations than the corresponding operators (+, -, ...) can provide. These methods take additional arguments, such as fill_value, that fine-tune the operation.
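A small self-contained sketch of how fill_value behaves across these methods (the data here is illustrative, not the frame from the question):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([np.nan, 2.0, 4.0])

# fill_value substitutes the given value for a missing operand, but only
# when at least one side has a value; NaN + NaN would stay NaN.
print(s1.add(s2, fill_value=0))   # 1.0, 2.0, 7.0
print(s1.sub(s2, fill_value=0))   # 1.0, -2.0, -1.0
print(s1.mul(s2, fill_value=1))   # 1.0, 2.0, 12.0
```

Note that fill_value=0 is the natural neutral element for add/sub, while 1 is the neutral element for mul.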
I have a data frame like this, for example:
    col1  col2
0      A     3
1      B     4
2      A   NaN
3      B     5
4      A     5
5      A   NaN
6      B   NaN
..   ...   ...
47     B     8
48     A     9
49     B   NaN
50     A   NaN
When I try df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index(), it gives me this output:
  col1  col2
0    A   NaN
1    B   NaN
I want to get the last non-NaN value after the groupby and agg. The desired output is below:
  col1  col2
0    A     9
1    B     8
Your solution works well for me, provided the NaN values are real missing values (np.nan).
Here is an alternative:
df = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')
If the NaNs are strings, first convert them to missing values:
df['col2'] = df['col2'].replace('NaN', np.nan)
df.groupby(['col1'], sort=False).agg({'col2':'last'}).reset_index()
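A minimal runnable sketch of both approaches on a shortened, illustrative version of the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B', 'A'],
                   'col2': [3, 4, np.nan, 8, 9]})

# 'last' skips missing values, so each group keeps its last non-NaN entry
out1 = df.groupby('col1', sort=False).agg({'col2': 'last'}).reset_index()

# alternative: drop the NaN rows first, then keep the last remaining row per col1
out2 = df.dropna(subset=['col2']).drop_duplicates('col1', keep='last')

print(out1)
print(out2)
```

Both give A -> 9 and B -> 8 here; the alternative also preserves the other columns of the kept rows.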
Let's take this DataFrame:
df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4, 5, 6], Col2=[4, np.nan, 5, np.nan, 1, 5]))
   Col1  Col2
0   1.0   4.0
1   2.0   NaN
2   NaN   5.0
3   4.0   NaN
4   5.0   1.0
5   6.0   5.0
I would like to extract the last n rows of df with no NaN.
Could you please help me get this expected result?
   Col1  Col2
0     5     1
1     6     5
EDIT: Let's say I don't know where the last NaN is.
Use DataFrame.dropna with DataFrame.tail and convert to integers:
N = 2
df1 = df.dropna().tail(N).astype(int)
# alternative
# df1 = df.dropna().iloc[-N:].astype(int)
print(df1)
   Col1  Col2
4     5     1
5     6     5
EDIT: To get the last group of rows with no missing values, detect missing values with DataFrame.isna and DataFrame.any, then reverse the order before taking the cumulative sum, so the trailing group of valid rows has 0 values in the mask:
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print(df1)
   Col1  Col2
4     5     1
5     6     5
If no rows match, it correctly returns an empty DataFrame:
df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4], Col2=[np.nan, np.nan, 5, np.nan]))
print(df)
   Col1  Col2
0   1.0   NaN
1   2.0   NaN
2   NaN   5.0
3   4.0   NaN
m = df.isna().any(axis=1).iloc[::-1].cumsum().eq(0).sort_index()
df1 = df[m].astype(int)
print(df1)
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Another way is to use isna with cumsum and drop_duplicates to find the position just past the last row containing a NaN, and then use positional filtering:
last_na = df.isna().cumsum(axis=0).drop_duplicates(keep='first').index.max() + 1
new_df = df.iloc[last_na:]
print(new_df)
   Col1  Col2
4   5.0   1.0
5   6.0   5.0
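One more variant of the same idea (my own sketch, not from the answers above): build the mask from notna with a reversed cumulative minimum, which keeps True only for the trailing run of fully valid rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(Col1=[1, 2, np.nan, 4, 5, 6],
                       Col2=[4, np.nan, 5, np.nan, 1, 5]))

# True for rows that contain no missing values
valid = df.notna().all(axis=1)
# reversed cummin flips to False at the first invalid row seen from the
# bottom, so only the trailing valid run stays True after re-reversing
m = valid[::-1].cummin()[::-1]
print(df[m].astype(int))
```

Like the cumsum mask, this returns an empty DataFrame when the last row itself contains a NaN.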
This question already has answers here:
How to replace NaN values by Zeroes in a column of a Pandas Dataframe?
(17 answers)
Closed 4 years ago.
I have NaN values in my dataframe and I want to replace them with an empty string.
What I've tried so far, which isn't working:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';', encoding='utf-8')
df_conbid_N_1['Excep_Test'] = df_conbid_N_1['Excep_Test'].replace("NaN","")
Use fillna (see the pandas docs):
An example:
df = pd.DataFrame({'no': [1, 2, 3],
                   'Col1': ['State', 'City', 'Town'],
                   'Col2': ['abc', np.nan, 'defg'],
                   'Col3': ['Madhya Pradesh', 'VBI', 'KJI']})
df
   no   Col1  Col2            Col3
0   1  State   abc  Madhya Pradesh
1   2   City   NaN             VBI
2   3   Town  defg             KJI
df['Col2'] = df['Col2'].fillna('')
df
   no   Col1  Col2            Col3
0   1  State   abc  Madhya Pradesh
1   2   City                   VBI
2   3   Town  defg             KJI
Simple! You can do it this way:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';',encoding='utf-8').fillna("")
We have pandas' fillna to fill missing values.
Let's go through some use cases with a sample dataframe:
df = pd.DataFrame({'col1': ['John', np.nan, 'Anne'], 'col2': [np.nan, 3, 4]})
   col1  col2
0  John   NaN
1   NaN   3.0
2  Anne   4.0
As mentioned in the docs, fillna accepts the following as fill values:
value: scalar, dict, Series, or DataFrame
So we can replace missing values with a constant value, such as an empty string, with:
df.fillna('')
   col1 col2
0  John
1           3
2  Anne     4
You can also replace with a dictionary mapping column_name:replace_value:
df.fillna({'col1': 'Alex', 'col2': 2})
   col1  col2
0  John   2.0
1  Alex   3.0
2  Anne   4.0
Or you can also replace with another pd.Series or pd.DataFrame:
df_other = pd.DataFrame({'col1': ['John', 'Franc', 'Anne'], 'col2': [5, 3, 4]})
df.fillna(df_other)
    col1  col2
0   John   5.0
1  Franc   3.0
2   Anne   4.0
This is very useful since it allows you to fill missing values on the dataframes' columns using some extracted statistic from the columns, such as the mean or mode. Say we have:
df = pd.DataFrame(np.random.choice(np.r_[np.nan, np.arange(3)], (3, 5)))
print(df)
     0    1    2    3    4
0  NaN  NaN  0.0  1.0  2.0
1  NaN  2.0  NaN  2.0  1.0
2  1.0  1.0  2.0  NaN  NaN
Then we can easily do:
df.fillna(df.mean())
     0    1    2    3    4
0  1.0  1.5  0.0  1.0  2.0
1  1.0  2.0  1.0  2.0  1.0
2  1.0  1.0  2.0  1.5  1.5
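Filling with the per-column mode follows the same pattern; since DataFrame.mode returns a frame (there can be ties), take its first row. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', np.nan, 'a', 'b'],
                   'col2': [1.0, 2.0, np.nan, 2.0]})

# df.mode() returns one row per modal value; iloc[0] picks the first mode
# of each column, giving a Series that fillna can broadcast column-wise
filled = df.fillna(df.mode().iloc[0])
print(filled)
```

This works for object columns too, where df.mean() would not apply.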
I have a pandas dataframe like:
      col1 col2  col3  col4 col5
index
0        a    c     1     2    f
1        a    c     1     2    f
2        a    d     1     2    f
3        b    d     1     2    g
4        b    e     1     2    g
5        b    e     1     2    g
If I group by two columns, like the following:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
I get:
           col3  col4
col1 col2
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4
Is it possible to convert this to:
col1  c_col3  d_col3  c_col4  d_col4  e_col3  e_col4
a          2       1       4       2     NaN     NaN
b        NaN       1     NaN       2       2       4
in an efficient manner, where col1 is the index?
Add unstack, which creates a MultiIndex in the columns, so flattening is then necessary:
df1 = df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'}).unstack()
# python 3.6+
df1.columns = [f'{j}_{i}' for i, j in df1.columns]
# python below 3.6
# df1.columns = ['{}_{}'.format(j, i) for i, j in df1.columns]
print(df1)
      c_col3  d_col3  e_col3  c_col4  d_col4  e_col4
col1
a        2.0     1.0     NaN     4.0     2.0     NaN
b        NaN     1.0     2.0     NaN     2.0     4.0
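The same wide result can also be reached in one step with pivot_table, which still leaves a MultiIndex to flatten; a self-contained sketch on data shaped like the question's:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'col2': ['c', 'c', 'd', 'd', 'e', 'e'],
                   'col3': [1, 1, 1, 1, 1, 1],
                   'col4': [2, 2, 2, 2, 2, 2]})

# pivot_table aggregates and reshapes in one call; the columns come out
# as a MultiIndex of (value, col2) pairs, flattened as with unstack
wide = df.pivot_table(index='col1', columns='col2',
                      values=['col3', 'col4'], aggfunc='sum')
wide.columns = [f'{j}_{i}' for i, j in wide.columns]
print(wide)
```

Missing (col1, col2) combinations come out as NaN, matching the unstack approach.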
I have a dataframe like the following:
  Col1
0    C
1    A
3    D
4    A
5    A
I would like to count, for each value, how many times it will re-occur later in the column (NaN when it never re-occurs), so I would get the following:
  Col1  Col2
0    C   NaN
1    A     2
3    D   NaN
4    A     1
5    A   NaN
Any ideas on how to do it? Thanks for the help!
Use GroupBy.cumcount with ascending=False, and then replace 0 with NaN:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).replace(0, np.nan)
print(df)
  Col1  Col2
0    C   NaN
1    A   2.0
3    D   NaN
4    A   1.0
5    A   NaN
Alternative solution with mask:
df['Col2'] = df.groupby('Col1').cumcount(ascending=False).mask(lambda x: x == 0)
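Both variants can be checked end to end on the question's data with this self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['C', 'A', 'D', 'A', 'A']}, index=[0, 1, 3, 4, 5])

# cumcount(ascending=False) counts the remaining occurrences within each
# group: the last occurrence gets 0, which we then turn into NaN
counts = df.groupby('Col1').cumcount(ascending=False)
df['Col2'] = counts.replace(0, np.nan)

# mask-based alternative: NaN out the positions where the countdown is 0
alt = counts.mask(lambda x: x == 0)

print(df)
```

Note that Col2 becomes float64 in both cases, since an integer column cannot hold NaN.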