Conditionally selecting rows in a pandas DataFrame with a MultiIndex - python

I have a DataFrame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 6),
                  columns=pd.MultiIndex.from_arrays((['A', 'A', 'A', 'B', 'B', 'B'],
                                                     ['a', 'b', 'c', 'a', 'b', 'c'])))
df
A B
a b c a b c
0 -0.089902 -2.235642 0.282761 0.725579 1.266029 -0.354892
1 -1.753303 1.092057 0.484323 1.789094 -0.316307 0.416002
2 -0.409028 -0.920366 -0.396802 -0.569926 -0.538649 -0.844967
3 1.789569 -0.935632 0.004476 -1.873532 -1.136138 -0.867943
4 0.244112 0.298361 -1.607257 -0.181820 0.577446 0.556841
5 0.903908 -1.379358 0.361620 1.290646 -0.523404 -0.518992
I would like to select only the rows that have a value larger than 0 in column c. I figured that I would have to use pd.IndexSlice to select only the second-level index c.
idx = pd.IndexSlice
df.loc[:,idx[:,['c']]] > 0
A B
c c
0 True False
1 True True
2 False False
3 True False
4 False True
5 True False
So, now I would expect that I could simply do df[df.loc[:,idx[:,['c']]] > 0], however that gives me an unexpected result:
df[df.loc[:,idx[:,['c']]] > 0]
A B
a b c a b c
0 NaN NaN 0.282761 NaN NaN NaN
1 NaN NaN 0.484323 NaN NaN 0.416002
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN 0.004476 NaN NaN NaN
4 NaN NaN NaN NaN NaN 0.556841
5 NaN NaN 0.361620 NaN NaN NaN
What I would have liked to get is all values (not NaNs) and only the rows where any of the c columns is greater than 0.
A B
a b c a b c
0 -0.089902 -2.235642 0.282761 0.725579 1.266029 -0.354892
1 -1.753303 1.092057 0.484323 1.789094 -0.316307 0.416002
3 1.789569 -0.935632 0.004476 -1.873532 -1.136138 -0.867943
4 0.244112 0.298361 -1.607257 -0.181820 0.577446 0.556841
5 0.903908 -1.379358 0.361620 1.290646 -0.523404 -0.518992
So, I would probably need to sneak an any() in there somewhere; however, I am not sure how to do that. Any hints?

Another version using get_level_values
df[(df.iloc[:, df.columns.get_level_values(1) == 'c'] > 0).any(axis=1)]

You are looking for any:
df[(df.loc[:, idx[:, ['c']]] > 0).any(axis=1)]
Out[133]:
A B
a b c a b c
1 -0.423313 0.459464 -1.457655 -0.559667 -0.056230 1.338850
3 -0.072396 1.305868 -1.239441 -0.708834 0.348704 0.260532
4 -1.415575 1.229508 0.148254 -0.812806 1.379552 -1.195062
5 -0.336973 -0.469335 1.345719 0.847943 1.465100 -0.285792
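An equivalent filter (a small sketch) pulls the c columns out with DataFrame.xs instead of an IndexSlice and reduces them row-wise with any:
# select every column whose second level is 'c', compare to 0,
# and keep the rows where at least one of them is positive
df[(df.xs('c', axis=1, level=1) > 0).any(axis=1)]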

Related

align two pandas dataframes on values in one column, otherwise insert NA to match row number

I have two pandas DataFrames (df1, df2) with a different number of rows and columns and some matching values in a specific column in each df, with caveats (1) there are some unique values in each df, and (2) there are different numbers of matching values across the DataFrames.
Baby example:
df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
                    'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})
What I am seeking to do is create df3 where df2['id2'] is aligned/indexed to df1['id1'], such that:
NaN is added to df3[id2] when df2[id2] has fewer (or missing) matches to df1[id1]
NaN is added to df3[id2] & df3[var1] if df1[id1] exists but has no match to df2[id2]
'var1' is filled in for all cases of df3[var1] where df1[id1] and df2[id2] match
rows are dropped when df2[id2] has more matching values than df1[id1] (or no matches at all)
The resulting DataFrame (df3) should look as follows (Notice id2 = 5 and var1 = A are gone):
   id1  id2 var1
     1    1    B
     1    1    B
     1  NaN    B
     2    2    W
     2    2    W
     3    3    H
     3  NaN    H
     3  NaN    H
     3  NaN    H
     4    4    B
     6  NaN  NaN
     6  NaN  NaN
I cannot find a combination of merge/join/concatenate/align that correctly solves this problem. Currently, everything I have tried stacks the rows in sequence without adding NaN in the proper cells/rows and instead adds all the NaN values at the bottom of df3 (so id1 and id2 never align). Any help is greatly appreciated!
You can first assign a helper key column to both frames based on groupby.cumcount, then merge. Finally, fill in var1 by mapping id1 to the labels in df2:
def helper(data, col):
    return data.groupby(col).cumcount()

out = df1.assign(k=helper(df1, ['id1'])).merge(df2.assign(k=helper(df2, ['id2'])),
                                               left_on=['id1', 'k'], right_on=['id2', 'k'],
                                               how='left').drop(columns='k')
out['var1'] = out['id1'].map(dict(df2[['id2','var1']].drop_duplicates().to_numpy()))
Or, similarly but without assign, as HenryEcker suggests:
out = df1.merge(df2, left_on=['id1', helper(df1, ['id1'])],
                right_on=['id2', helper(df2, ['id2'])],
                how='left').drop(columns='key_1')
out['var1'] = out['id1'].map(dict(df2[['id2','var1']].drop_duplicates().to_numpy()))
print(out)
id1 id2 var1
0 1 1.0 B
1 1 1.0 B
2 1 NaN B
3 2 2.0 W
4 2 2.0 W
5 3 3.0 H
6 3 NaN H
7 3 NaN H
8 3 NaN H
9 4 4.0 B
10 6 NaN NaN
11 6 NaN NaN
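For reference, a self-contained, commented sketch of the same cumcount-key idea, using the example frames from the question (the intermediate names df1_k and df2_k are only for illustration):
import pandas as pd

df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
                    'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})

# occurrence number of each id within its own frame; it acts as a tie-breaker
# so the first 1 in df1 pairs with the first 1 in df2, and so on
df1_k = df1.assign(k=df1.groupby('id1').cumcount())
df2_k = df2.assign(k=df2.groupby('id2').cumcount())

# a left merge keeps every row of df1 and drops the surplus rows of df2
# (the extra 2s and the unmatched 5)
out = df1_k.merge(df2_k, left_on=['id1', 'k'],
                  right_on=['id2', 'k'], how='left').drop(columns='k')

# fill var1 wherever id1 has a known label anywhere in df2
out['var1'] = out['id1'].map(df2.drop_duplicates('id2').set_index('id2')['var1'])
print(out)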

how to drop NAs only if all elements are NAs in a groupby in pandas

I have a dataframe that looks like this
import pandas as pd
import numpy as np
fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b','c','c'], 'value': [1,2, np.nan, 1,2,3,4, np.nan, np.nan]})
I would like to drop the NAs by group only if all values inside the group are NAs. How could I do that?
Expected output:
fff = pd.DataFrame({'group': ['a','a','a','b','b','b','b'], 'value': [1,2, np.nan, 1,2,3,4]})
You can check value for NaN and use groupby().transform('any'):
fff = fff[(~fff['value'].isna()).groupby(fff['group']).transform('any')]
Output:
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
Create a boolean series with isna(), group it on fff['group'] and transform with all, then filter out (exclude) the rows where the result is True:
c = fff['value'].isna()
fff[~c.groupby(fff['group']).transform('all')]
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
Another option:
fff["cases"] = fff.groupby("group").cumcount()
fff["null"] = fff["value"].isnull()
fff["cases 2"] = fff.groupby(["group","null"]).cumcount()
fff[~((fff["value"].isnull()) & (fff["cases"] == fff["cases 2"]))][["group","value"]]
Output:
group value
0 a 1.0
1 a 2.0
2 a NaN
3 b 1.0
4 b 2.0
5 b 3.0
6 b 4.0
An addition to the answers already provided: keep only the groups that contain at least one non-NaN value, and filter the fff dataframe with the result variable.
result = fff.groupby("group")["value"].apply(lambda s: s.notna().any())
result = result[result].index.tolist()
fff.query("group == @result")
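A further option, not shown above (a minimal sketch using GroupBy.filter; it evaluates the predicate once per group, so it reads well but tends to be slower than the transform-based answers when there are many groups):
# keep a group only if it contains at least one non-NaN value
fff.groupby('group').filter(lambda g: g['value'].notna().any())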

Find the column name which has the 2nd maximum value for each row (pandas)

Based on this post: Find the column name which has the maximum value for each row, it is clear how to get the column name with the max value of each row using df.idxmax(axis=1).
The question is, how can I get the 2nd, 3rd and so on maximum value per row?
You need numpy.argsort for the positions and then reorder the column names by indexing:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
arr = np.argsort(-df.values, axis=1)
df1 = pd.DataFrame(df.columns[arr], index=df.index)
print (df1)
0 1 2 3 4
0 A B D E C
1 D B C E A
2 E A B C D
3 C D A E B
4 C A E D B
Verify:
#first column
print (df.idxmax(axis=1))
0 A
1 D
2 E
3 C
4 C
dtype: object
#last column
print (df.idxmin(axis=1))
0 C
1 A
2 D
3 B
4 B
dtype: object
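If only the column holding the 2nd-largest value is wanted, the argsort result above can be sliced directly (a small sketch reusing arr and df from the answer):
# column position 1 of the descending argsort is the runner-up in each row
second = pd.Series(df.columns[arr[:, 1]], index=df.index)
print(second)
which gives the second column of df1 above: B, B, A, D, A.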
While there is no method that directly returns the column holding a specific rank within a row, you can rank the elements of a pandas dataframe using the rank method.
For example, for a dataframe like this:
df = pd.DataFrame([[1, 2, 4],[3, 1, 7], [10, 4, 2]], columns=['A','B','C'])
>>> print(df)
A B C
0 1 2 4
1 3 1 7
2 10 4 2
You can get the ranks of each row by doing:
>>> df.rank(axis=1,method='dense', ascending=False)
A B C
0 3.0 2.0 1.0
1 2.0 3.0 1.0
2 1.0 2.0 3.0
By default, applying rank to dataframes returns float ranks, even with method='dense'. This can easily be fixed by casting to int:
>>> ranks = df.rank(axis=1,method='dense', ascending=False).astype(int)
>>> ranks
A B C
0 3 2 1
1 2 3 1
2 1 2 3
Finding the indices is a little trickier in pandas, but it comes down to applying a filter on a condition (e.g. ranks == 2):
>>> ranks.where(ranks==2)
A B C
0 NaN 2.0 NaN
1 2.0 NaN NaN
2 NaN 2.0 NaN
Applying where returns only the elements matching the condition, with the rest set to NaN. We can retrieve the row and column indices by doing:
>>> ranks.where(ranks==2).notnull().values.nonzero()
(array([0, 1, 2]), array([1, 0, 1]))
And for retrieving the column index or position within a row, which is the answer to your question:
>>> ranks.where(ranks==2).notnull().values.nonzero()[1]
array([1, 0, 1])
For the third element you just need to change the condition in where to ranks.where(ranks==3) and so on for other ranks.
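Another compact option, not covered in the answers above (a sketch using Series.nlargest applied row-wise; readable, though slower than the vectorized argsort approach):
# name of the column holding the 2nd-largest value in each row
df.apply(lambda row: row.nlargest(2).index[-1], axis=1)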

Pandas: add a sublevel to an index that depends on the upper one

This is a possible duplicate, but the solution provided there does not fit my problem, given the information I have.
The idea is quite simple. I have a matrix with a multilevel index (and in my case I didn't build the index, I only get the DataFrame):
# test = (('2','C'),('2','B'),('1','A'))
# test = pd.MultiIndex.from_tuples(test)
# pd.DataFrame(index=test, columns=test)
2 1
C B A
2 C NaN NaN NaN
B NaN NaN NaN
1 A NaN NaN NaN
I would like to add a sublevel on both axes as a function of A, B, C. E.g.:
2 1
C B A
kg kg m3
2 C kg NaN NaN NaN
B kg NaN NaN NaN
1 A m3 NaN NaN NaN
In reality the index is available through the DataFrame (I didn't build it), and I only know this mapping: {'C':'kg', 'B':'kg', 'A':'m3'}. I can get the index series and use an approach similar to the link above, but it is very slow, and I imagine there must be something simpler and more effective.
Source DF:
In [303]: df
Out[303]:
2 1
C B A
2 C NaN NaN NaN
B NaN NaN NaN
1 A NaN NaN NaN
Solution:
In [304]: cols = df.columns
In [305]: new_lvl = [d[c] for c in df.columns.get_level_values(1)]
In [306]: df.columns = pd.MultiIndex.from_arrays([cols.get_level_values(0),
                                                  cols.get_level_values(1),
                                                  new_lvl])
In [307]: df
Out[307]:
2 1
C B A
kg kg m3
2 C NaN NaN NaN
B NaN NaN NaN
1 A NaN NaN NaN
where d is:
In [308]: d = {'C':'kg', 'B':'kg', 'A':'m3'}
In [309]: d
Out[309]: {'A': 'm3', 'B': 'kg', 'C': 'kg'}
You can use set_index(..., append=True) to add the new index level:
test = (('2','C'),('2','B'),('1','A'))
test = pd.MultiIndex.from_tuples(test)
x = pd.DataFrame(index=test, columns=test)
# add new index
x['new'] = pd.Series(x.index.get_level_values(-1), index=x.index).replace({'C':'kg', 'B':'kg', 'A':'m3'})
x.set_index('new', append=True, inplace=True)
x.index.names = [None] * 3
# transpose dataframe and do the same thing
x = x.T
x['new'] = pd.Series(x.index.get_level_values(-1), index=x.index).replace({'C':'kg', 'B':'kg', 'A':'m3'})
x.set_index('new', append=True, inplace=True)
x.index.names = [None] * 3
x = x.T
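For completeness, a small sketch that wraps the from_arrays idea from the first answer in a helper and applies it to both axes at once (the helper name with_unit_level is just for illustration):
import pandas as pd

d = {'C': 'kg', 'B': 'kg', 'A': 'm3'}
test = pd.MultiIndex.from_tuples((('2', 'C'), ('2', 'B'), ('1', 'A')))
df = pd.DataFrame(index=test, columns=test)

def with_unit_level(idx, units):
    # append a third level derived from the second level via the lookup dict
    return pd.MultiIndex.from_arrays([idx.get_level_values(0),
                                      idx.get_level_values(1),
                                      [units[v] for v in idx.get_level_values(1)]])

df.columns = with_unit_level(df.columns, d)
df.index = with_unit_level(df.index, d)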

Make sure Column B = a certain value when Column A is Null - Python

I want to make sure that when Column A is NULL (in csv), or NaN (in dataframe), Column B is "Cash".
I've tried this:
check = df[df['A'].isnull()]['B']
check = check.to_string(index=False)
if "Cash" not in check:
    print("Column A Fail")
else:
    print("Column A Pass!")
But it is not working. Any suggestions?
I also need to make sure that it doesn't treat '0' as NaN
UPDATE:
my goal is not to assign 'Cash', but rather to make sure that it's already there, as a quality check
In [40]: df
Out[40]:
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN Cash
In [41]: df.query("A != A and B != 'Cash'")
Out[41]:
A B
0 NaN a
or using boolean indexing:
In [42]: df.loc[df.A.isnull() & (df.B != 'Cash')]
Out[42]:
A B
0 NaN a
OLD answer:
Alternative solution:
In [23]: df.B = np.where(df.A.isnull(), 'Cash', df.B)
In [24]: df
Out[24]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
another solution:
In [31]: df = df.mask(df.A.isnull(), df.assign(B='Cash'))
In [32]: df
Out[32]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Use loc to assign where A is null.
df.loc[df['A'].isnull(), 'B'] = 'Cash'
example
df = pd.DataFrame(dict(
A=[np.nan, 1, 2, np.nan],
B=['a', 'b', 'c', 'd']
))
print(df)
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN d
Then do
df.loc[df['A'].isnull(), 'B'] = 'Cash'
print(df)
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
Check if all B are 'Cash' where A is null:
(df.loc[df.A.isnull(), 'B'] == 'Cash').all()
According to the rules of logic, P => Q is equivalent to (not P) or Q. So
(~df.A.isnull() | (df.B == "Cash")).all()
checks all the rows.
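Wiring that implication into the asker's pass/fail message could look like this (a minimal sketch reusing the example frame from the first answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1, 2, np.nan],
                   'B': ['a', 'b', 'c', 'Cash']})

# "A is null implies B == 'Cash'", rewritten as (A not null) or (B == 'Cash'),
# then reduced over all rows
ok = (df['A'].notna() | (df['B'] == 'Cash')).all()
print("Column A Pass!" if ok else "Column A Fail")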
