If I add two columns to create a third, any columns containing NaN (representing missing data in my world) cause the resulting output column to be NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...
With fillna():
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
or, as suggested:
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving:
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
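Another option that avoids calling fillna twice is Series.add with the fill_value keyword, which substitutes 0 for a missing operand on either side (and still yields NaN where both operands are missing):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})

# fill_value=0 treats a NaN on one side as 0; the result is NaN
# only where both operands are missing
frame['c'] = frame['a'].add(frame['b'], fill_value=0)
print(frame)
#      a    b    c
# 0  1.0  3.0  4.0
# 1  2.0  NaN  2.0
# 2  NaN  4.0  4.0
```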
Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
As an expansion to the answer above: if the frame contains a row in which every value is NaN, frame[["a", "b"]].sum(axis=1) returns 0 for that row rather than NaN (here an all-NaN row is appended first for illustration):
>>> frame.loc[3] = np.nan
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want rows that are all NaN to sum to NaN instead, add the min_count keyword argument, as described in the docs:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN
Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
You can do it with MultiIndex.from_product:
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(extra_level)]  # keep every len(extra_level)-th entry, i.e. the 'a' columns
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
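For comparison, here is a sketch of the same idea written the other way around: first give the existing columns the 'a' sublevel, then reindex to the full (column, sublevel) product so the missing 'b' columns appear as NaN:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(10).reshape(5, 2))

# attach the 'a' sublevel to the existing columns...
x.columns = pd.MultiIndex.from_product([x.columns, ['a']])

# ...then reindex to the full product; the (column, 'b') pairs
# that did not exist before are filled with NaN
full = pd.MultiIndex.from_product([x.columns.levels[0], ['a', 'b']])
x = x.reindex(columns=full)
print(x)
```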
Very much like @Ben.T I am using MultiIndex.from_product:
(x.assign(l='a')
  .set_index('l', append=True)
  .unstack()
  .reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a', 'b']]), axis=1))
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Here is a dataframe
a b c d
nan nan 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
nan nan 2 3
I want to replace the observations in columns 'a' and 'b' with 0s, but only in rows where both of them are NaN. Rows 1 and 5 have NaN in both 'a' and 'b', so only those rows should get 0s in those two columns.
so my output must be
a b c d
0 0 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
0 0 2 3
There might be an easier builtin function in pandas, but this one should work:
mask = df.a.isnull() & df.b.isnull()
df.loc[mask, ['a', 'b']] = df.loc[mask, ['a', 'b']].fillna(0)
Actually the solution from @Psidom is much easier to read.
You can create a boolean series based on the conditions on columns a/b, and then use loc to modify corresponding columns and rows:
df.loc[df[['a','b']].isnull().all(1), ['a','b']] = 0
df
# a b c d
#0 0.0 0.0 3 5
#1 NaN 1.0 2 3
#2 1.0 NaN 4 5
#3 2.0 3.0 7 9
#4 0.0 0.0 2 3
Or:
df.loc[df.a.isnull() & df.b.isnull(), ['a','b']] = 0
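For reference, a self-contained script reproducing the frame from the question (the frame here is typed in by hand from the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2, np.nan],
                   'b': [np.nan, 1, np.nan, 3, np.nan],
                   'c': [3, 2, 4, 7, 2],
                   'd': [5, 3, 5, 9, 3]})

# zero out 'a' and 'b' only where both are missing;
# rows with a partial NaN are left alone
df.loc[df[['a', 'b']].isnull().all(axis=1), ['a', 'b']] = 0
print(df)
```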
Suppose I have a dataframe:
a b c
0 1 2 NaN
1 2 NaN 4
3 NaN 4 NaN
I want to check for NaN in only some particular column's and want the resulting dataframe as:
a b c
0 1 2 NaN
3 NaN 4 NaN
Here I want to check for NaN in only Column 'a' and Column 'c'.
How this can be done?
You could do that with the isnull and any methods. Note that df.isnull().any(axis=1) checks every column, so it returns every row containing at least one NaN, including row 1, which is NaN only in column 'b':
In [264]: df
Out[264]:
     a    b    c
0    1    2  NaN
1    2  NaN    4
2  NaN    4  NaN
In [265]: df[df.isnull().any(axis=1)]
Out[265]:
     a    b    c
0    1    2  NaN
1    2  NaN    4
2  NaN    4  NaN
Note: if you just want the rows without any NaN, you could use the dropna method.
EDIT
If you want to check only certain columns, build the mask from a subset of the dataframe and apply it to the whole dataframe:
df_subset = df[['a', 'c']]
In [282]: df[df_subset.isnull().any(axis=1)]
Out[282]:
a b c
0 1 2 NaN
2 NaN 4 NaN
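Put together, a runnable sketch of the subset check (the frame here reconstructs the example by hand):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan],
                   'b': [2, np.nan, 4],
                   'c': [np.nan, 4, np.nan]})

# keep rows where 'a' or 'c' (but not necessarily 'b') holds a NaN
result = df[df[['a', 'c']].isnull().any(axis=1)]
print(result)
```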
Given a 3-column DataFrame, df:
a b c
0 NaN a True
1 1 b True
2 2 c False
3 3 NaN False
4 4 e True
[5 rows x 3 columns]
I would like to place a NaN in column c for each row where a NaN exists in any other column. My current approach is as follows:
for col in df:
    df['c'][pd.np.isnan(df[col])] = pd.np.nan
I strongly suspect that there is a way to do this via logical indexing instead of iterating through columns as I am currently doing.
How could this be done?
Thank you!
If you don't care about the bool/float issue, I propose:
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
a b c
0 NaN a NaN
1 1 b 1
2 2 c 0
3 3 NaN NaN
4 4 e 1
[5 rows x 3 columns]
If you really do, then starting again from your frame df you could:
>>> df["c"] = df["c"].astype(object)
>>> df.loc[df.isnull().any(axis=1), "c"] = np.nan
>>> df
a b c
0 NaN a NaN
1 1 b True
2 2 c False
3 3 NaN NaN
4 4 e True
[5 rows x 3 columns]
df.loc[df.loc[:, :'c'].isnull().any(axis=1), 'c'] = np.nan
Note that you may need to change the type of column c to float, or you'll get an error about being unable to assign NaN to an integer column.
Filter and select the rows where you have NaN for either 'a' or 'b', and assign 'c' to NaN:
In [18]:
df.loc[pd.isnull(df.a) | pd.isnull(df.b), 'c'] = np.nan
In [19]:
df
Out[19]:
     a    b    c
0  NaN    a  NaN
1    1    b    1
2    2    c    0
3    3  NaN  NaN
4    4    e    1
[5 rows x 3 columns]
If I make a dataframe like the following:
In [128]: test = pd.DataFrame({'a':[1,4,2,7,3,6], 'b':[2,2,2,1,1,1], 'c':[2,6,np.NaN, np.NaN, 1, np.NaN]})
In [129]: test
Out[129]:
a b c
0 1 2 2
1 4 2 6
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
basic sorts perform as expected. Sorting on column c appropriately segregates the nan values. Doing a multi-level sort on columns a and b orders them as expected:
In [133]: test.sort(columns='c', ascending=False)
Out[133]:
a b c
5 6 1 NaN
3 7 1 NaN
2 2 2 NaN
1 4 2 6
0 1 2 2
4 3 1 1
In [134]: test.sort(columns=['b', 'a'], ascending=False)
Out[134]:
a b c
1 4 2 6
2 2 2 NaN
0 1 2 2
3 7 1 NaN
5 6 1 NaN
4 3 1 1
But doing a multi-level sort with columns b and c does not give the expected result:
In [135]: test.sort(columns=['b', 'c'], ascending=False)
Out[135]:
a b c
1 4 2 6
0 1 2 2
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
And, in fact, even sorting just on column c but using the multi-level sort nomenclature fails:
In [136]: test.sort(columns=['c'], ascending=False)
Out[136]:
a b c
1 4 2 6
0 1 2 2
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
I would think that this should have given the exact same result as line 133 above. Is this a pandas bug or is there something I'm not getting? (FYI, pandas v0.11.0, numpy v1.7.1, python 2.7.2.5 32bit on windows 7)
This is an interesting corner case. Note that even vanilla python doesn't get this "correct":
>>> nan = float('nan')
>>> a = [ 6, 2, nan, nan, 1, nan]
>>> sorted(a)
[2, 6, nan, nan, 1, nan]
The reason is that NaN compares as neither greater nor less than the other elements -- so no strict ordering is defined, and Python leaves those elements where they are.
>>> nan > 6
False
>>> nan < 6
False
Pandas must make an explicit check in the single column case -- probably using np.argsort or np.sort, since, starting with numpy 1.4, np.sort puts NaN values at the end.
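As a side note for readers on current pandas: DataFrame.sort was removed in 0.20. Its replacement, sort_values, treats NaN consistently in both the single- and multi-column cases and exposes a na_position keyword:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [1, 4, 2, 7, 3, 6],
                     'b': [2, 2, 2, 1, 1, 1],
                     'c': [2, 6, np.nan, np.nan, 1, np.nan]})

# NaNs are pushed to the end of the ordering by default;
# na_position='first' pulls them to the front within each 'b' group
result = test.sort_values(['b', 'c'], ascending=False, na_position='first')
print(result)
```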
Thanks for the heads up above. I guess this is already a known issue. One stopgap solution I came up with is:
test['c2'] = test.c.fillna(value=test.c.min() - 1)
test = test.sort_values(['b', 'c2'])
test = test.drop('c2', axis=1)
This method wouldn't work in plain numpy, since .min() would return nan there, but pandas skips NaN in .min() by default, so it works fine.