Boolean indexing in Pandas DataFrame with MultiIndex columns

I have a DataFrame with MultiIndex columns:
import numpy as np
import pandas as pd
columns = pd.MultiIndex.from_arrays([['n1', 'n1', 'n2', 'n2'], ['p', 'm', 'p', 'm']])
values = [
    [1, 2, 3, 4],
    [np.nan, 6, 7, 8],
    [np.nan, 10, np.nan, 12],
]
df = pd.DataFrame(values, columns=columns)
n1 n2
p m p m
0 1.0 2 3.0 4
1 NaN 6 7.0 8
2 NaN 10 NaN 12
Now I want to set m to NaN whenever p is NaN. Here's the result I'm looking for:
n1 n2
p m p m
0 1.0 2.0 3.0 4.0
1 NaN NaN 7.0 8.0
2 NaN NaN NaN NaN
I know how to find out where p is NaN, for example using
mask = df.xs('p', level=1, axis=1).isnull()
n1 n2
0 False False
1 True False
2 True True
However, I don't know how to use this mask to set the corresponding m values in df to NaN.

You can use pd.IndexSlice to obtain a boolean ndarray indicating where the p columns are non-null, replace False with NaN (so that True acts as 1 and NaN propagates), and then multiply the m columns by the result:
x = df.loc[:, pd.IndexSlice[:,'p']].notna().replace({False:float('nan')}).values
df.loc[:, pd.IndexSlice[:,'m']] *= x
n1 n2
p m p m
0 1.0 2 3.0 4
1 NaN NaN 7.0 8
2 NaN NaN NaN NaN
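An alternative sketch that reuses the mask from the question: calling .values drops the mismatched column labels so alignment becomes positional, and DataFrame.mask then sets m to NaN wherever p is NaN (this assumes the p and m columns appear in the same level-0 order, as they do here):
mask = df.xs('p', level=1, axis=1).isnull()
df.loc[:, pd.IndexSlice[:, 'm']] = df.loc[:, pd.IndexSlice[:, 'm']].mask(mask.values)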

You can stack and unstack the transposed dataframe so the values are easy to select and change, and then stack, unstack and transpose again to get the original shape back:
df = df.T.stack(dropna=False).unstack(level=1)
df.loc[df['p'].isna(), 'm'] = np.nan
df = df.stack(dropna=False).unstack(1).T
After the first line, df is:
m p
n1 0 2.0 1.0
1 6.0 NaN
2 10.0 NaN
n2 0 4.0 3.0
1 8.0 7.0
2 12.0 NaN
And after the last:
n1 n2
m p m p
0 2.0 1.0 4.0 3.0
1 NaN NaN 8.0 7.0
2 NaN NaN NaN NaN
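An alternative sketch that avoids the transpose round-trip entirely (assuming, as here, that every top-level group has both a p and an m column): iterate over the first column level and mask m wherever the sibling p is NaN:
for grp in df.columns.levels[0]:
    df.loc[df[(grp, 'p')].isna(), (grp, 'm')] = np.nan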

Related

Randomly sample non-empty column values for each row of a pandas dataframe

For each row, I would like to randomly sample k columnar indices that correspond to non-null values.
If I start with this dataframe,
A = pd.DataFrame([
    [1, np.nan, 3, 5],
    [np.nan, 2, np.nan, 7],
    [4, 8, 9]
])
>>> A
0 1 2 3
0 1.0 NaN 3.0 5.0
1 NaN 2.0 NaN 7.0
2 4.0 8.0 9.0 NaN
If I wanted to randomly sample 2 non-null values for each row and change them to the value -1, one way that can be done is as follows:
import random

B = A.copy()
for i in A.index:
    s = A.loc[i]
    s = s[s.notnull()]
    col_idx = random.sample(s.index.tolist(), 2)
    B.iloc[i, col_idx] = -1
>>> B
0 1 2 3
0 -1.0 NaN -1.0 5.0
1 NaN -1.0 NaN -1.0
2 -1.0 -1.0 9.0 NaN
Is there a better way to do this natively in Pandas that avoids having to use a for loop? The pandas.DataFrame.sample method seems to keep the number of columns that are sampled in each row constant. But if the dataframe has empty holes, the number of non-null values for each row wouldn't be constant.
In your case, stack, then groupby with sample, change the values, and update back:
s = A.stack().groupby(level=0).sample(n=2)
s[:] = -1
A.update(s.unstack())
A
Out[122]:
0 1 2 3
0 1.0 NaN -1.0 -1.0
1 NaN -1.0 NaN -1.0
2 -1.0 8.0 -1.0 NaN
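For reference, GroupBy.sample was added in pandas 1.1, so this needs a reasonably recent version. The same steps with comments (a sketch, using the question's A):
s = A.stack()                       # long form: (row, col) -> value; NaN cells are dropped
s = s.groupby(level=0).sample(n=2)  # 2 random non-null cells per original row
s[:] = -1                           # overwrite the sampled values
A.update(s.unstack())               # align back onto the original shape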

Extending the Value of non-Missing Cells to Subsequent Rows in Pandas

This is what I have:
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[6,np.nan,np.nan,3,np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend non-missing values of B to missing values of B underneath, so I have:
A B C
0 1 6.0 6.0
1 2 NaN 6.0
2 3 NaN 6.0
3 4 3.0 3.0
4 5 NaN 3.0
I tried something like this, and it worked last night:
for i in df.index:
    df['C'][i] = np.where(pd.isnull(df['B'].iloc[i]), df['C'][i-1], df.B.iloc[i])
But when I woke up this morning it said it didn't recognize 'C.' I couldn't identify the conditions in which it worked and didn't work.
Thanks!
You could use the pandas fillna() method to forward-fill the missing values with the last non-null value. See the pandas documentation for more details.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, np.nan, 3, np.nan]
})
df['C'] = df['B'].fillna(method='ffill')
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0
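Note: in recent pandas releases (2.1+) the method= argument to fillna is deprecated; the equivalent current spelling of the line above is:
df['C'] = df['B'].ffill()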

dtypes muck things up when shifting on axis one (columns)

Consider the dataframe df
df = pd.DataFrame(dict(A=[1, 2], B=['X', 'Y']))
df
A B
0 1 X
1 2 Y
If I shift along axis=0 (the default)
df.shift()
A B
0 NaN NaN
1 1.0 X
It pushes all rows downwards one row as expected.
But when I shift along axis=1
df.shift(axis=1)
A B
0 NaN NaN
1 NaN NaN
Everything is null when I expected
A B
0 NaN 1
1 NaN 2
I understand why this happened. For axis=0, Pandas is operating column by column where each column is a single dtype and when shifting, there is a clear protocol on how to deal with the introduced NaN value at the beginning or end. But when shifting along axis=1 we introduce potential ambiguity of dtype from one column to the next. In this case, I'm trying to force int64 into an object column and Pandas decides to just null the values.
This becomes more problematic when the dtypes are int64 and float64
df = pd.DataFrame(dict(A=[1, 2], B=[1., 2.]))
df
A B
0 1 1.0
1 2 2.0
And the same thing happens
df.shift(axis=1)
A B
0 NaN NaN
1 NaN NaN
My Question
What are good options for creating a dataframe that is shifted along axis=1 in which the result has shifted values and dtypes?
For the int64/float64 case the result would look like:
df_shifted
A B
0 NaN 1
1 NaN 2
and
df_shifted.dtypes
A object
B int64
dtype: object
A more comprehensive example
df = pd.DataFrame(dict(A=[1, 2], B=[1., 2.], C=['X', 'Y'], D=[4., 5.], E=[4, 5]))
df
A B C D E
0 1 1.0 X 4.0 4
1 2 2.0 Y 5.0 5
Should look like this
df_shifted
A B C D E
0 NaN 1 1.0 X 4.0
1 NaN 2 2.0 Y 5.0
df_shifted.dtypes
A object
B int64
C float64
D object
E float64
dtype: object
It turns out that Pandas is shifting over blocks of similar dtypes
Define df as
df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
df
# dtypes: A=int, B=float, C=object, D=float, E=int, F=object
A B C D E F
0 1 3.0 X 5.0 7 W
1 2 4.0 Y 6.0 8 Z
It will shift the integers to the next integer column, the floats to the next float column and the objects to the next object column
df.shift(axis=1)
A B C D E F
0 NaN NaN NaN 3.0 1.0 X
1 NaN NaN NaN 4.0 2.0 Y
I don't know if that's a good idea, but that is what is happening.
Approaches
astype(object) first
dtypes = df.dtypes.shift(fill_value=object)
df_shifted = df.astype(object).shift(1, axis=1).astype(dtypes)
df_shifted
A B C D E F
0 NaN 1 3.0 X 5.0 7
1 NaN 2 4.0 Y 6.0 8
transpose
Transposing will make everything object; shift, transpose back, and restore the dtypes:
dtypes = df.dtypes.shift(fill_value=object)
df_shifted = df.T.shift().T.astype(dtypes)
df_shifted
A B C D E F
0 NaN 1 3.0 X 5.0 7
1 NaN 2 4.0 Y 6.0 8
itertuples
pd.DataFrame([(np.nan, *t[1:-1]) for t in df.itertuples()], columns=[*df])
A B C D E F
0 NaN 1 3.0 X 5.0 7
1 NaN 2 4.0 Y 6.0 8
Though I'd probably do this
pd.DataFrame([
    (np.nan, *t[:-1]) for t in
    df.itertuples(index=False, name=None)
], columns=[*df])
I tried using a numpy method. It works as long as you keep your data in a numpy array:
def shift_df(data, n):
    shifted = np.roll(data, n)
    shifted[:, :n] = np.nan
    return shifted

shift_df(df, 1)
array([[nan, 1, 1.0, 'X', 4.0],
[nan, 2, 2.0, 'Y', 5.0]], dtype=object)
But when you call the DataFrame constructor, all columns are converted to object even though the values in the array are float, int and object:
def shift_df(data, n):
    shifted = np.roll(data, n)
    shifted[:, :n] = np.nan
    shifted = pd.DataFrame(shifted)
    return shifted
print(shift_df(df, 1),'\n')
print(shift_df(df, 1).dtypes)
0 1 2 3 4
0 NaN 1 1 X 4
1 NaN 2 2 Y 5
0 object
1 object
2 object
3 object
4 object
dtype: object
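A possible refinement (a sketch combining this with the dtypes.shift trick from the first answer, not something from the original post): build the frame from the rolled object array, then restore the shifted dtypes column by column:
def shift_df(data, n):
    shifted = np.roll(data.to_numpy(), n, axis=1)     # roll within each row
    shifted[:, :n] = np.nan
    dtypes = data.dtypes.shift(n, fill_value=object)  # dtypes travel with the values
    return pd.DataFrame(shifted, columns=data.columns).astype(dtypes)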

How Can I replace NaN in a row with values in another row Pandas

I tried several methods to replace NaN in a row with values in another row, but none of them worked as expected. Here is my Dataframe:
test = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [4, 5, 6, np.nan, np.nan],
        "c": [7, 8, 9, np.nan, np.nan],
        "d": [7, 8, 9, np.nan, np.nan]
    }
)
a b c d
0 1 4.0 7.0 7.0
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
I need to replace the NaNs in the 4th row with the values from the first row, i.e.,
a b c d
0 1 **4.0 7.0 7.0**
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 **4.0 7.0 7.0**
4 5 NaN NaN NaN
And the second question is: how can I multiply some of the values in a row by a number? For example, I need to double the values in the second row for columns ['b', 'c', 'd'], so the result is:
a b c d
0 1 4.0 7.0 7.0
1 2 **10.0 16.0 16.0**
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
First of all, I suggest you do some reading on Indexing and selecting data in pandas.
Regarding the first question, you can use .loc[] with isnull() to perform boolean indexing on the column values:
mask_nans = test.loc[3,:].isnull()
test.loc[3, mask_nans] = test.loc[0, mask_nans]
And to double the values, you can multiply the sliced row by 2 directly, also using .loc[]:
test.loc[1,'b':] *= 2
a b c d
0 1 4.0 7.0 7.0
1 2 10.0 16.0 16.0
2 3 6.0 9.0 9.0
3 4 4.0 7.0 7.0
4 5 NaN NaN NaN
Indexing with labels
If you wish to filter by a, and a values are unique, consider making it your index to simplify your logic and make it more efficient:
test = test.set_index('a')
test.loc[4] = test.loc[4].fillna(test.loc[1])
test.loc[2] *= 2
Boolean masks
If a is not unique and Boolean masks are required, you can still use fillna with an additional step:
mask = test['a'].eq(4)
test.loc[mask] = test.loc[mask].fillna(test.loc[test['a'].eq(1).idxmax()])
test.loc[test['a'].eq(2)] *= 2
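A minimal end-to-end run of the label-indexed variant above (a sketch using the question's test frame):
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [4, 5, 6, np.nan, np.nan],
    "c": [7, 8, 9, np.nan, np.nan],
    "d": [7, 8, 9, np.nan, np.nan]
})
test = test.set_index('a')
test.loc[4] = test.loc[4].fillna(test.loc[1])  # fill row a=4 from row a=1
test.loc[2] *= 2                               # double row a=2
print(test.reset_index())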

Pandas: IndexingError: Unalignable boolean Series provided as indexer

I'm trying to run what I think is simple code to eliminate any columns with all NaNs, but can't get this to work (the same pattern with any(axis=1) works just fine when eliminating rows):
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan,np.nan], 'b':[4,np.nan,6,np.nan], 'c':[np.nan, 8,9,np.nan], 'd':[np.nan,np.nan,np.nan,np.nan]})
df = df[df.notnull().any(axis = 0)]
print(df)
Full error:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
Expected output:
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
You need loc, because you are filtering by columns:
print (df.notnull().any(axis = 0))
a True
b True
c True
d False
dtype: bool
df = df.loc[:, df.notnull().any(axis = 0)]
print (df)
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
Or filter the columns first and then select with []:
print (df.columns[df.notnull().any(axis = 0)])
Index(['a', 'b', 'c'], dtype='object')
df = df[df.columns[df.notnull().any(axis = 0)]]
print (df)
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
Or use dropna with the parameter how='all' to remove columns filled only with NaNs:
print (df.dropna(axis=1, how='all'))
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
You can use dropna with axis=1 and thresh=1:
In[19]:
df.dropna(axis=1, thresh=1)
Out[19]:
a b c
0 1.0 4.0 NaN
1 2.0 NaN 8.0
2 NaN 6.0 9.0
3 NaN NaN NaN
This will drop any column that doesn't have at least 1 non-NaN value, which means any column consisting entirely of NaNs gets dropped.
The reason your attempt failed is that the boolean mask:
In[20]:
df.notnull().any(axis = 0)
Out[20]:
a True
b True
c True
d False
dtype: bool
is indexed by the columns and therefore cannot be aligned against the row index, which is what plain [] indexing uses by default
I was facing the same issue while using a function from the fairlearn package. Resetting the index in place worked for me.
I came here because I tried to filter on the first 2 letters like this:
filtered = df[(df.Name[0:2] != 'xx')]
The fix was:
filtered = df[(df.Name.str[0:2] != 'xx')]
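The difference (a small illustrative sketch; the column values are made up): without .str, the slice takes the first two rows of the Series, so the resulting boolean mask is too short to align against the frame's index, whereas .str[0:2] slices the first two characters of every value:
import pandas as pd

df = pd.DataFrame({'Name': ['xxa', 'abc', 'xxb', 'def']})
df.Name[0:2] != 'xx'      # length-2 mask -> Unalignable boolean Series in df[...]
df.Name.str[0:2] != 'xx'  # one boolean per row
filtered = df[df.Name.str[0:2] != 'xx']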
