Possible bug in pandas sort with NaN values

If I make a dataframe like the following:
In [128]: test = pd.DataFrame({'a':[1,4,2,7,3,6], 'b':[2,2,2,1,1,1], 'c':[2,6,np.NaN, np.NaN, 1, np.NaN]})
In [129]: test
Out[129]:
a b c
0 1 2 2
1 4 2 6
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
basic sorts perform as expected. Sorting on column c appropriately segregates the NaN values. Doing a multi-level sort on columns b and a orders them as expected:
In [133]: test.sort(columns='c', ascending=False)
Out[133]:
a b c
5 6 1 NaN
3 7 1 NaN
2 2 2 NaN
1 4 2 6
0 1 2 2
4 3 1 1
In [134]: test.sort(columns=['b', 'a'], ascending=False)
Out[134]:
a b c
1 4 2 6
2 2 2 NaN
0 1 2 2
3 7 1 NaN
5 6 1 NaN
4 3 1 1
But doing a multi-level sort with columns b and c does not give the expected result:
In [135]: test.sort(columns=['b', 'c'], ascending=False)
Out[135]:
a b c
1 4 2 6
0 1 2 2
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
And, in fact, even sorting just on column c but using the multi-level sort nomenclature fails:
In [136]: test.sort(columns=['c'], ascending=False)
Out[136]:
a b c
1 4 2 6
0 1 2 2
2 2 2 NaN
3 7 1 NaN
4 3 1 1
5 6 1 NaN
I would have thought this should give the exact same result as In [133] above. Is this a pandas bug, or is there something I'm not getting? (FYI: pandas v0.11.0, numpy v1.7.1, Python 2.7.2.5 32bit on Windows 7)

This is an interesting corner case. Note that even vanilla python doesn't get this "correct":
>>> nan = float('nan')
>>> a = [ 6, 2, nan, nan, 1, nan]
>>> sorted(a)
[2, 6, nan, nan, 1, nan]
The reason here is that NaN is neither greater nor less than the other elements -- so there is no strict ordering defined. Because of this, Python leaves them alone.
>>> nan > 6
False
>>> nan < 6
False
Pandas must make an explicit check in the single-column case -- probably using np.argsort or np.sort, since starting with NumPy 1.4, np.sort puts NaN values at the end.
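You can see the NumPy behavior directly:
>>> import numpy as np
>>> np.sort(np.array([6.0, 2.0, np.nan, np.nan, 1.0, np.nan]))
array([ 1.,  2.,  6., nan, nan, nan])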

Thanks for the heads up above. I guess this is already a known issue. One stopgap solution I came up with is:
test['c2'] = test.c.fillna(value=test.c.min() - 1)
test = test.sort(['b', 'c2'])
test = test.drop('c2', axis=1)
This method wouldn't work in plain numpy, since min() on an array containing NaN returns NaN, but pandas' min() skips NaN by default, so it works fine here.
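For anyone reading this later: DataFrame.sort was deprecated in pandas 0.17 and removed in 0.20 in favor of sort_values, which handles NaN explicitly and exposes na_position to control where the NaNs land, so the workaround above is no longer needed:
test.sort_values(['b', 'c'], ascending=False, na_position='last')
test.sort_values('c', ascending=False, na_position='first')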

Related

How to set a selection of values in a pandas DataFrame with a selection from another pd.DataFrame using labels

Assume 2 DataFrames:
df = pd.DataFrame(columns=['A','B','C'], data=[[0,1,2],[3,4,5],[6,7,8]])
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df1 = pd.DataFrame(columns=['A','B','C'], data=[[0,1,2],[3,4,5],[6,7,8]], index=[3,4,5])
df1
A B C
3 0 1 2
4 3 4 5
5 6 7 8
I want to set a part of df using values from df1 according to column labels, without changing anything in df but the selected fields. The desired result would be:
df
A B C
0 4 5 2
1 7 8 5
2 6 7 8
I tried:
df.loc[df['A'].isin([0,3]), ['A', 'B']] = df1.loc[df1['A'].isin([7,6]), ['B', 'C']]
But the result is:
A B C
0 NaN NaN 2.0
1 NaN NaN 5.0
2 6.0 7.0 8.0
Because I guess it still requires the indices to match. I feel like this is a pretty basic task, so I'm wondering if there is a simple way of doing this?
I also looked into merge and join, but these functions seem to serve a different purpose.
One possible solution, if the number of rows and columns of the two selections match, is to convert the output to a numpy array:
df.loc[df['A'].isin([0,3]), ['A', 'B']] = df1.loc[df1['A'].isin([3,6]), ['B', 'C']].to_numpy()
print (df)
print (df)
A B C
0 4 5 2
1 7 8 5
2 6 7 8
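Note that to_numpy() only exists in pandas 0.24+; on older versions the equivalent is .values:
df.loc[df['A'].isin([0,3]), ['A', 'B']] = df1.loc[df1['A'].isin([3,6]), ['B', 'C']].values
Either way, converting to a plain array strips df1's index, so pandas assigns the values positionally instead of trying to align the (non-matching) index labels.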

Get only two values from 4 specified columns and merge valid values into 2 columns

df:
index a b c d
-
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
df expect:
index one two
-
0 1 2
1 2 3
2 5 6
3 1 5
The output example above is self-explanatory. Basically, I just need to shift the two non-NaN values from columns [a, b, c, d] into another set of two columns ["one", "two"].
Back-fill the missing values along each row and select the first 2 columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
Or combine_first (note that pop both selects and removes a column, so no separate drop is needed afterwards):
df['two'] = df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns = ['index', 'one', 'two']
Or fillna:
df['two'] = df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns = ['index', 'one', 'two']
In both cases, print(df) gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output like jezrael's answer above, set the index first (works for both cases):
df = df.set_index('index')
and then print(df) gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0
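The answers above rely on the non-NaN values ending up in the first two columns after backfilling, or on a fixed b/c/d priority. A more general sketch (assuming each row has at least two non-NaN values, as in the example, and starting again from the original df) collects them per row with dropna:
out = df[['a', 'b', 'c', 'd']].apply(
    lambda row: pd.Series(row.dropna().to_numpy()[:2], index=['one', 'two']),
    axis=1,
)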

Find observations in which both columns are NaN and replace them with 0 in pandas DataFrame

Here is a dataframe
a b c d
nan nan 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
nan nan 2 3
I want to replace the observations in columns 'a' and 'b' where both of them are NaN with 0s. The first and last rows have NaN in both 'a' and 'b', so I want to replace only those rows with 0s in those two columns.
so my output must be
a b c d
0 0 3 5
nan 1 2 3
1 nan 4 5
2 3 7 9
0 0 2 3
There might be an easier builtin function in Pandas, but this one should work (.ix is long gone, so use .loc with the same row mask on both sides of the assignment):
mask = np.isnan(df.a) & np.isnan(df.b)
df.loc[mask, ['a', 'b']] = df.loc[mask, ['a', 'b']].fillna(0)
Actually, the solution from Psidom below is much easier to read.
You can create a boolean series based on the conditions on columns a/b, and then use loc to modify corresponding columns and rows:
df.loc[df[['a','b']].isnull().all(1), ['a','b']] = 0
df
# a b c d
#0 0.0 0.0 3 5
#1 NaN 1.0 2 3
#2 1.0 NaN 4 5
#3 2.0 3.0 7 9
#4 0.0 0.0 2 3
Or:
df.loc[df.a.isnull() & df.b.isnull(), ['a','b']] = 0
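Note that a plain df[['a','b']].fillna(0) would not work here: it also fills cells where only one of the two columns is NaN, losing the "missing" marker there. The row mask (.all(1) is shorthand for .all(axis=1)) is what restricts the fill to rows where both values are missing.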

Understanding how pandas join works

Can somebody please explain this result to me? In particular, I don't know where the NaNs in the result come from, or how the join decides which row to match with which in this case.
left_df = pd.DataFrame.from_dict({'unique_l':[0, 1, 2, 3, 4], 'join':['a', 'a', 'b','b', 'c'] })
right_df = pd.DataFrame.from_dict({'unique_r':[10, 11, 12, 13, 14], 'join':['a', 'b', 'b','c', 'c'] })
join unique_l
0 a 0
1 a 1
2 b 2
3 b 3
4 c 4
join unique_r
0 a 10
1 b 11
2 b 12
3 c 13
4 c 14
print left_df.join(right_df, on='join', rsuffix='_r')
join unique_l join_r unique_r
0 a 0 NaN NaN
1 a 1 NaN NaN
2 b 2 NaN NaN
3 b 3 NaN NaN
4 c 4 NaN NaN
The join method matches the on column of the calling frame against the index of the other frame. Here right_df has a default integer index, so the string labels in left_df['join'] never find a match and every right-hand column comes back NaN. What you want is merge:
In [6]: left_df.merge(right_df, on="join", suffixes=("_l", "_r"))
Out[6]:
join unique_l unique_r
0 a 0 10
1 a 1 10
2 b 2 11
3 b 2 12
4 b 3 11
5 b 3 12
6 c 4 13
7 c 4 14
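If you do want join, it can still get there: join matches left_df['join'] (via on=) against the index of the other frame, so move the key into right_df's index first. This reproduces the merge result above:
left_df.join(right_df.set_index('join'), on='join')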
Here is a related (but, IMO, not quite a duplicate) question that explains the difference between join and merge in more detail.

Pandas sum two columns, skipping NaN

If I add two columns to create a third, any row containing NaN in either column (NaN represents missing data in my world) makes the resulting value NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...
With fillna():
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
or, as suggested:
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving :
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
As an extension of the answer above, frame[["a", "b"]].sum(axis=1) returns 0 for a row in which every value is NaN (note the extra all-NaN row added to frame here to illustrate):
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want a row of all NaNs to sum to NaN instead, add the min_count argument, as described in the docs:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN
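One more option worth knowing about: Series.add with fill_value treats a missing value on one side as 0 but still returns NaN when both sides are missing, matching the min_count=1 behavior above:
>>> frame["c"] = frame["a"].add(frame["b"], fill_value=0)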
