How to replace NaN values in Python [duplicate]

This question already has answers here:
How to replace NaN values by Zeroes in a column of a Pandas Dataframe?
(17 answers)
Closed 4 years ago.
I have NaN values in my dataframe and I want to replace them with an empty string.
What I've tried so far, which isn't working:
df_conbid_N_1 = pd.read_csv("test-2019.csv",dtype=str, sep=';', encoding='utf-8')
df_conbid_N_1['Excep_Test'] = df_conbid_N_1['Excep_Test'].replace("NaN","")

Use fillna (docs):
An example:
df = pd.DataFrame({'no': [1, 2, 3],
                   'Col1': ['State', 'City', 'Town'],
                   'Col2': ['abc', np.nan, 'defg'],
                   'Col3': ['Madhya Pradesh', 'VBI', 'KJI']})
df
   no   Col1  Col2            Col3
0   1  State   abc  Madhya Pradesh
1   2   City   NaN             VBI
2   3   Town  defg             KJI
df.Col2.fillna('', inplace=True)
df
   no   Col1  Col2            Col3
0   1  State   abc  Madhya Pradesh
1   2   City                   VBI
2   3   Town  defg             KJI
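If you prefer to avoid inplace=True (mutating a single column in place is easy to get wrong on a copy), a minimal sketch of the same fill done by plain assignment, assuming the df built above:
df['Col2'] = df['Col2'].fillna('')  # assign the filled column back instead of mutating in place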

Simple! You can do it this way:
df_conbid_N_1 = pd.read_csv("test-2019.csv", dtype=str, sep=';', encoding='utf-8').fillna("")

We have pandas' fillna to fill missing values.
Let's go through some use cases with a sample dataframe:
df = pd.DataFrame({'col1':['John', np.nan, 'Anne'], 'col2':[np.nan, 3, 4]})
col1 col2
0 John NaN
1 NaN 3.0
2 Anne 4.0
As mentioned in the docs, fillna accepts the following as fill values:
value: scalar, dict, Series, or DataFrame
So we can replace with a constant value, such as an empty string with:
df.fillna('')
   col1 col2
0  John
1          3
2  Anne    4
You can also replace with a dictionary mapping column_name:replace_value:
df.fillna({'col1':'Alex', 'col2':2})
col1 col2
0 John 2.0
1 Alex 3.0
2 Anne 4.0
You can also replace with another pd.Series or pd.DataFrame:
df_other = pd.DataFrame({'col1':['John', 'Franc', 'Anne'], 'col2':[5, 3, 4]})
df.fillna(df_other)
col1 col2
0 John 5.0
1 Franc 3.0
2 Anne 4.0
This is very useful, since it allows you to fill missing values in the dataframe's columns using a statistic extracted from those columns, such as the mean or mode. Say we have:
df = pd.DataFrame(np.random.choice(np.r_[np.nan, np.arange(3)], (3,5)))
print(df)
0 1 2 3 4
0 NaN NaN 0.0 1.0 2.0
1 NaN 2.0 NaN 2.0 1.0
2 1.0 1.0 2.0 NaN NaN
Then we can easily do:
df.fillna(df.mean())
0 1 2 3 4
0 1.0 1.5 0.0 1.0 2.0
1 1.0 2.0 1.0 2.0 1.0
2 1.0 1.0 2.0 1.5 1.5
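The same idea works with other per-column statistics. For example, a minimal sketch using the mode instead of the mean, assuming the same random df as above (DataFrame.mode() returns a frame, so take its first row as the fill values):
df.fillna(df.mode().iloc[0])  # fill each column's NaN with that column's most frequent value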

Related

Python Dataframe Duplicated Columns while Merging multiple times

I have a main dataframe and a sub dataframe. I want to merge each column of the sub dataframe into the main dataframe, using the main dataframe's column as a reference. I have successfully arrived at my desired answer, except that I see duplicated columns from the main dataframe. Below are my present and expected answers.
Present solution:
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df =
Ref A Ref Z
0 1 NaN 1 1.0
1 2 2.0 2 2.0
2 3 3.0 3 NaN
3 4 NaN 4 NaN
Expected Answer:
df =
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
Update
Use duplicated:
>>> df.loc[:, ~df.columns.duplicated()]
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
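Putting it together, a minimal end-to-end sketch using the sample frames from the question, with the duplicated-column filter applied after the concat:
import pandas as pd

df = pd.DataFrame({'Ref': [1, 2, 3, 4]})
df1 = pd.DataFrame({'A': [2, 3], 'Z': [1, 2]})

# merge each sub-dataframe column against 'Ref', then concatenate side by side
parts = [df.merge(df1[col], left_on='Ref', right_on=col, how='left')
         for col in df1.columns]
out = pd.concat(parts, axis=1)

# keep only the first occurrence of each column label
out = out.loc[:, ~out.columns.duplicated()]
print(out)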
Old answer
You can use:
# Your code
...
df = pd.concat(df, axis=1)
# Use pop and insert to cleanup your dataframe
df.insert(0, 'Ref', df.pop('Ref').iloc[:, 0])
Output:
>>> df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
What about setting the 'Ref' column as the index while building the dataframe list (and then resetting the index so that you get Ref back as a column)?
df = pd.DataFrame({'Ref':[1,2,3,4]})
df1 = pd.DataFrame({'A':[2,3],'Z':[1,2]})
df = [df.merge(df1[col_name],left_on='Ref',right_on=col_name,how='left').set_index('Ref') for col_name in df1.columns]
df = pd.concat(df,axis=1)
df = df.reset_index()
   Ref    A    Z
0    1  NaN  1.0
1    2  2.0  2.0
2    3  3.0  NaN
3    4  NaN  NaN
This is a reduction process. Instead of the list comprehension, use a for loop, or even reduce:
from functools import reduce
reduce(lambda x, y : x.merge(df1[y],left_on='Ref',right_on=y,how='left'), df1.columns, df)
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN
The above is similar to:
for y in df1.columns:
    df = df.merge(df1[y], left_on='Ref', right_on=y, how='left')
df
Ref A Z
0 1 NaN 1.0
1 2 2.0 2.0
2 3 3.0 NaN
3 4 NaN NaN

pandas pivot table where the column contains a string with multiple categories

I have data in the form:
'cat' 'value'
a 1
a,b 2
a,b,c 3
b,c 2
b 1
which I would like to convert using a pivot table:
'a'  'b'  'c'
 1
 2    2
 3    3    3
      2    2
      1
How do I perform this? If I use the pivot command:
df.pivot(columns='cat', values='value')
which yields this result
'a'  'a,b'  'a,b,c'  'b,c'  'b'
 1
      2
             3
                      2
                            1
You can use .explode() after transforming the string into a list, and then pivot it normally:
df['cat'] = df['cat'].str.split(',')
df = df.explode('cat').pivot_table(index=df.explode('cat').index,columns='cat',values='value')
This outputs:
cat a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
You can then reset or rename the index if you don't want it to be named cat.
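A self-contained sketch of the same approach, building the sample data from the question and exploding into a temporary frame so explode only runs once (the column names follow the question):
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a,b', 'a,b,c', 'b,c', 'b'],
                   'value': [1, 2, 3, 2, 1]})

# split the comma-separated string into a list, explode to one row per category,
# then pivot back so each category becomes a column
exploded = df.assign(cat=df['cat'].str.split(',')).explode('cat')
out = exploded.pivot_table(index=exploded.index, columns='cat', values='value')
print(out)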
Try str.get_dummies and multiply by the value column (then replace 0 with NaN if necessary):
df['cat'].str.get_dummies(",").mul(df['value'],axis=0).replace(0,np.nan)
a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN

How to replace part of the data-frame with another data-frame

I have two dataframes and I would like to filter the data and replace a list of columns from df1 with the same columns in df2.
I want to filter this df by df1.loc[df1["name"]=="A"]
first_data = {"col1": [2, 3, 4, 5, 7],
              "col2": [4, 2, 4, 6, 4],
              "col3": [7, 6, 9, 11, 2],
              "col4": [14, 11, 22, 8, 5],
              "name": ["A", "A", "V", "A", "B"],
              "n_roll": [8, 2, 1, 3, 9]}
df1 = pd.DataFrame.from_dict(first_data)
and put the columns ["col1","col2","n_roll"] where name="A" in the same places in df2 (at the same indexes).
sec_df = {"col1": [55, 0, 57, 1, 3],
          "col2": [55, 0, 4, 4, 53],
          "col3": [55, 33, 9, 0, 2],
          "col4": [55, 0, 22, 4, 5],
          "name": ["A", "A", "V", "A", "B"],
          "n_roll": [8, 2, 1, 3, 9]}
df2 = pd.DataFrame.from_dict(sec_df)
If I put that list of cols=[col1,col2,col3,col4],
I would like to get this:
data = {"col1": [55, 0, 4, 1, 7],
        "col2": [55, 0, 4, 4, 4],
        "col3": [55, 33, 9, 0, 2],
        "col4": [55, 0, 22, 4, 5],
        "name": ["A", "A", "V", "A", "B"],
        "n_roll": [8, 2, 1, 3, 9]}
df = pd.DataFrame.from_dict(data)
df
You can achieve this with a double combine_first.
Combine a filtered version of df1 with df2.
However, the columns that were excluded in the filtered version of df1 are left behind as NaN values. That is okay: just do another combine_first with df2 to pick up those values!
(df1.loc[df1['name'] != 'A', ["col1","col2","n_roll"]]
.combine_first(df2)
.combine_first(df2))
Out[1]:
col1 col2 col3 col4 n_roll name
0 55.0 55.0 55.0 55.0 8.0 A
1 0.0 0.0 33.0 0.0 2.0 A
2 4.0 4.0 9.0 22.0 1.0 V
3 1.0 4.0 0.0 4.0 3.0 A
4 7.0 4.0 2.0 5.0 9.0 B
You can achieve this in one line:
df1=df1[df1.name!='A'].append(df2[df2.name=='A'].rename(columns={'hight':'n_roll'})).sort_index()
col1 col2 col3 col4 name n_roll
0 55 55 55 55 A 8
1 0 0 33 0 A 2
2 4 4 9 22 V 1
3 1 4 0 4 A 3
4 7 4 2 5 B 9
How it works
d = df1[df1.name != 'A']                                      # selects df1 where name is not A
df2[df2.name == 'A']                                          # selects df2 where name is A
e = df2[df2.name == 'A'].rename(columns={'hight': 'n_roll'})  # renames column hight to allow appending
d.append(e)                                                   # combines the dataframes
d.append(e).sort_index()                                      # sorts the index
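As an alternative to both answers (not from the thread), plain boolean-mask assignment with .loc produces the same expected frame; a minimal sketch, assuming df1 and df2 are built from first_data and sec_df as in the question:
cols = ['col1', 'col2', 'n_roll']
mask = df1['name'] != 'A'                    # rows whose values should come from df1

out = df2.copy()                             # start from df2
out.loc[mask, cols] = df1.loc[mask, cols]    # overwrite those rows/columns with df1's values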

How to Conditionally Set Column Values [duplicate]

This question already has answers here:
How to pass another entire column as argument to pandas fillna()
(7 answers)
Closed 3 years ago.
I would like my code to look at column one first. If it has a valid number for that row, take that value as the COL3 value. If not, the second option would be to take the value of COL2 as the value in COL3. Is there a function that can do this?
   COL1  COL2  COL3
0     1     2
1   nan     4
2     3   nan
3     4     8
4   nan    10
5     6   nan
COL3
0 1
1 4
2 3
3 4
4 10
5 6
Try this:
df['COL3'] = np.where(df['COL1'].isnull(), df['COL2'], df['COL1'])
IIUC:
df['COL3'] = df.bfill(axis=1)['COL1']
gives:
COL1 COL2 COL3
0 1.0 2.0 1.0
1 NaN 4.0 4.0
2 3.0 NaN 3.0
3 4.0 8.0 4.0
4 NaN 10.0 10.0
5 6.0 NaN 6.0
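Since the linked duplicate is about passing another column to fillna, the same result can also be written as a short sketch:
df['COL3'] = df['COL1'].fillna(df['COL2'])  # take COL1, fall back to COL2 where COL1 is NaN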

Pandas Summing Two Columns with Nan

I have three columns in a pandas dataframe with NaN:
>>> d=pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3':[5,6]})
>>> d
col1 col2 col3
0 1 3 5
1 2 4 6
>>> d['col2'].iloc[0]=np.nan
>>> d
col1 col2 col3
0 1 NaN 5
1 2 4.0 6
>>> d['col1'].iloc[1]=np.nan
>>> d
col1 col2 col3
0 1.0 NaN 5
1 NaN 4.0 6
>>> d['col3'].iloc[1]=np.nan
>>> d
col1 col2 col3
0 1.0 NaN 5.0
1 NaN 4.0 NaN
Now, I would like the column addition to have the following output:
>>> d['col1']+d['col3']
0 6.0
1 NaN
>>> d['col1']+d['col2']
0 1.0
1 4.0
However, in reality, the output is instead:
>>> d['col1']+d['col3']
0 6.0
1 NaN
>>> d['col1']+d['col2']
0 NaN
1 NaN
Does anyone know how to achieve this?
You can use add to get your sums, with fill_value=0:
>>> d.col1.add(d.col2, fill_value=0)
0 1.0
1 4.0
dtype: float64
>>> d.col1.add(d.col3, fill_value=0)
0 6.0
1 NaN
dtype: float64
When adding columns one and two, use Series.add with fill_value=0.
>>> d
col1 col2 col3
0 1.0 NaN 5.0
1 NaN 4.0 NaN
>>>
>>> d['col1'].add(d['col2'], fill_value=0)
0 1.0
1 4.0
dtype: float64
DataFrames and Series have methods like add, sub, ... to perform more sophisticated operations than the associated operators +, -, ... can provide.
These methods may take additional arguments that fine-tune the operation.
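If you want the sum to be NaN only when every operand is NaN, another option (not mentioned above) is DataFrame.sum with min_count; a minimal sketch, assuming the frame d from the question:
d[['col1', 'col2']].sum(axis=1, min_count=1)   # 1.0, 4.0 -- each row has at least one valid value
d[['col1', 'col3']].sum(axis=1, min_count=1)   # 6.0, NaN -- row 1 has no valid values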
