I am trying to fill None/NaN values in a Pandas DataFrame with 0s, but only for a subset of columns.
When I do:
import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print(df)
df.fillna(value=0, inplace=True)
print(df)
The output:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 NaN 7.0
3 NaN 6.0 8.0
a b c
0 1.0 4.0 0.0
1 2.0 5.0 0.0
2 3.0 0.0 7.0
3 0.0 6.0 8.0
It replaces every NaN with 0. What I want is to replace the NaNs only in columns a and b, not in c.
What is the best way of doing this?
You can select your desired columns and do it by assignment:
df[['a', 'b']] = df[['a','b']].fillna(value=0)
The resulting output is as expected:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can use a dict to fillna with a different value for each column:
df.fillna({'a':0,'b':0})
Out[829]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Then assign it back:
df=df.fillna({'a':0,'b':0})
df
Out[831]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
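A nice property of the dict form (this example is just an illustration, not from the original answer) is that each column can get its own fill value:
df = df.fillna({'a': 0, 'b': df['b'].mean()})  # fill a with 0, fill b with b's mean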
You can avoid making a copy of the object using Wen's solution and inplace=True:
df.fillna({'a':0, 'b':0}, inplace=True)
print(df)
Which yields:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)
This should work, and without the copy warning:
df[['a', 'b']] = df.loc[:,['a', 'b']].fillna(value=0)
Here's how you can do it all in one line:
df[['a', 'b']].fillna(value=0, inplace=True)
Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, value=0 tells it to fill NaNs with zero, and inplace=True will make the changes permanent, without having to make a copy of the object.
Or something like:
df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0
and if there are more columns:
for i in your_list:
    df.loc[df[i].isnull(), i] = 0
For some odd reason this DID NOT work (using pandas 0.25.1):
df[['col1', 'col2']].fillna(value=0, inplace=True)
Another solution:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Example:
import numpy as np

df = pd.DataFrame(data={'col1': [1, 2, np.nan], 'col2': [1, np.nan, 3], 'col3': [np.nan, 2, 3]})
output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 nan 2.00
2 nan 3.00 3.00
Apply the list comprehension to fill the NaN values:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 0.00 2.00
2 0.00 3.00 3.00
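One caveat worth sketching (an assumption about newer pandas, not part of the original answer): with copy-on-write enabled, an inplace fillna on df[col] may not write back into df, so a plain loop that assigns each filled column back is the safer pattern:
subset_cols = ['col1', 'col2']
for col in subset_cols:
    df[col] = df[col].fillna(0)  # assign back so the parent frame is updated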
Sometimes this syntax won't work:
df[['col1', 'col2']] = df[['col1', 'col2']].fillna(0)
Use .loc instead:
df.loc[:, ['col1', 'col2']] = df.loc[:, ['col1', 'col2']].fillna(0)
Related
How could I use the pandas pad function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pad.html) to pad only selected columns in a DataFrame?
For example, in the below dataframe (df):
import pandas as pd
import numpy as np
df = pd.DataFrame([[2, np.nan, 0, 1],[np.nan, np.nan, 5, np.nan ],[np.nan,3,np.nan, 3]],columns=list('ABCD'))
print(df)
A B C D
0 2.0 NaN 0.0 1.0
1 NaN NaN 5.0 NaN
2 NaN 3.0 NaN 3.0
I would like to pad only column A and column D and keep the rest of the columns as they are. The final dataframe should look like:
A B C D
0 2.0 NaN 0.0 1.0
1 2.0 NaN 5.0 1.0
2 2.0 3.0 NaN 3.0
Try column assignment and pad:
df[['A', 'D']] = df[['A', 'D']].pad()
>>> df
A B C D
0 2.0 NaN 0.0 1.0
1 2.0 NaN 5.0 1.0
2 2.0 3.0 NaN 3.0
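Note that pad() is just an alias for ffill() (and pad() is deprecated in newer pandas releases), so the same column assignment could equally be written as:
df[['A', 'D']] = df[['A', 'D']].ffill()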
I have the following df:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[.1, 2, 3], [.4, 5, 6], [7, 8, 9]]),
                  columns=['col1', 'b', 'c'])
out:
col1 b c
0 0.1 2.0 3.0
1 0.4 5.0 6.0
2 7.0 8.0 9.0
When a value begins with a point ('.'), I want to remove the point, but only if the value starts with one.
I've tried the following:
s = df['col1']
df['col1'] = s.mask(df['col1'].str.startswith('.',na=False),s.str.replace(".",""))
desired output:
col1 b c
0 1 2.0 3.0
1 4 5.0 6.0
2 7.0 8.0 9.0
However this does not work. Please help!
Since you have numerical values, you can multiply by 10 and replace with a condition:
df.mul(10).mask(df.ge(1),df)
#df['col1'] = df['col1'].mul(10).mask(df['col1'].ge(1),df['col1']) for 1 column
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
Use boolean masking. First create the mask:
mask=df['col1'].astype(str).str.startswith('0.')
Finally make use of that mask:
df.loc[mask,'col1']=df.loc[mask,'col1'].astype(str).str.lstrip('0.').astype(float)
Now if you print df you will get your desired output:
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
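If you want to be strict about removing only the leading "0." (lstrip('0.') strips every leading '0' and '.' character), a variant of the same masking idea using a regex replace, offered as a sketch rather than the original answer's code, would be:
mask = df['col1'].astype(str).str.startswith('0.')
# drop only the leading "0." and convert back to float
df.loc[mask, 'col1'] = (df.loc[mask, 'col1'].astype(str)
                        .str.replace(r'^0\.', '', regex=True)
                        .astype(float))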
Via NumPy's np.where():
df['col1'] = np.where(df['col1'] < 1, df['col1'] * 10, df['col1'])
df contents:
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
Let's say I have data like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value, if both of them are not NaN?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN)?
We can shift the dataframe forward and backward, add the two shifted frames together, divide by two, and use the result to fillna:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
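For the second part of the question (the average of the previous n and the succeeding n values, all of them required to be non-NaN), the same shifting idea generalises. A sketch, with n as an illustrative parameter:
n = 2
# sum the n previous and n succeeding rows; NaN propagates through the sum,
# so a value is only filled when every neighbour is present
neighbours = [df.shift(k) for k in range(1, n + 1)] + [df.shift(-k) for k in range(1, n + 1)]
df = df.fillna(sum(neighbours) / (2 * n))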
Taking a Pandas DataFrame df, I would like to be able to negate the value in particular columns for all rows/entries and also add another value. The value to be added is a fixed additive for each of those columns.
I believe I could duplicate df, say dfcopy = df.copy(), set all cell values in dfcopy to the particular numbers and then subtract df from dfcopy, but I am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So for example of how this should look:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then, negating only the values in columns 0, 3 and 4 and adding 10 (for example), we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.
You can first multiply by -1 with mul and then add 10 with add for those columns we select with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
pandas is very intuitive in letting you perform these operations.
negate:
df.iloc[:, [0,2,7,10,11]] = -df.iloc[:, [0,2,7,10,11]]
add a constant c:
df.iloc[:, [0,2,7,10,11]] = df.iloc[:, [0,2,7,10,11]] + c
or change to a constant value c:
df.iloc[:, [0,2,7,10,11]] = c
and any other arithmetic you can think of.
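Putting the two operations together on the question's example frame (the literal values below just recreate the table shown above):
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 1.0], 'B': [3.0, 1.0, 1.0],
                   'C': [1.0, 8.0, 1.0], 'D': [2.0, 5.0, 1.0],
                   'E': [7.0, 3.0, 6.0]})
c = 10
# negate columns 0, 3 and 4, then add the constant
df.iloc[:, [0, 3, 4]] = -df.iloc[:, [0, 3, 4]] + c
print(df)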
I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,np.nan,3],[2,3,7,np.nan],[4,5,np.nan,8],[5,np.nan,4,9],[np.nan,1,2,np.nan]], columns=['A','B','C','D'])
df = df[df['C'].notnull()]
df
It's just proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN