Consolidating dataframes - python

I have 3 pandas dataframes with matching indices. Some operations have trimmed the dataframes in different ways (removed rows), so an index present in one dataframe may be missing from another.
I'd like to consolidate all 3 dataframes so that each contains only the rows whose indices are present in all 3 of them. How can this be achieved?
import pandas as pd
data = pd.DataFrame.from_dict({'a': [1,2,3,4], 'b': [3,4,5,6], 'c': [6,7,8,9]})
a = pd.DataFrame(data['a'])
b = pd.DataFrame(data['b'])
c = pd.DataFrame(data['c'])
a = a[a['a'] <= 3]
b = b[b['b'] >= 4]
# some operation here that removes rows that aren't present in all (intersection of all dataframe's indices)
print(a)
a
1 2
2 3
print(b)
b
1 4
2 5
print(c)
c
1 7
2 8
Update
Sorry, I got carried away and forgot what I wanted to achieve when I wrote the examples. The actual intent was to keep the 3 dataframes separate. Apologies for the misleading example (I corrected it now).

Use merge and pass left_index=True and right_index=True; the default merge type is inner, so only index values that exist in both the left and right frames will be kept.
In [6]:
a.merge(b, left_index=True, right_index=True).merge(c, left_index=True, right_index=True)
Out[6]:
a b c
1 2 4 7
2 3 5 8
[2 rows x 3 columns]
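If you have more than three dataframes, the same chained merge can be generalized with functools.reduce (a sketch, not part of the original answer):
from functools import reduce
# inner-merge on the index across any number of frames
merged = reduce(lambda left, right: left.merge(right, left_index=True, right_index=True), [a, b, c])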
To modify the original dataframes so that they now contain only the rows that exist in all of them, you can do this:
In [12]:
merged = a.merge(b, left_index=True, right_index=True).merge(c, left_index=True, right_index=True)
merged
Out[12]:
a b c
1 2 4 7
2 3 5 8
In [14]:
a = a.loc[merged.index]
b = b.loc[merged.index]
c = c.loc[merged.index]
In [15]:
print(a)
print(b)
print(c)
a
1 2
2 3
b
1 4
2 5
c
1 7
2 8
So we merge all of them on index values that are present in all of them and then use the index to filter the original dataframes.
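An equivalent sketch that never builds the merged frame is to intersect the three indices directly and use the result to trim each dataframe (this keeps them separate, as the update asks for):
common = a.index.intersection(b.index).intersection(c.index)
a, b, c = a.loc[common], b.loc[common], c.loc[common]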

Take a look at concat, which can be used for a variety of combination operations. Here you want the join type set to inner (because you want the intersection) and axis set to 1 (combining columns).
In [123]: pd.concat([a,b,c], join='inner', axis=1)
Out[123]:
a b c
1 2 4 7
2 3 5 8

Related

Python how to merge two dataframes with multiple columns while preserving row order in each column?

My data is contained within two dataframes. Within each dataframe, the entries are sorted in each column. I want to now merge the two dataframes while preserving row order. For example, suppose I have this:
The first dataframe "A1" looks like this:
index a b c
0 1 4 1
3 2 7 3
5 5 8 4
6 6 10 8
...
and the second dataframe "A2" looks like this (A1 and A2 are the same size):
index a b c
1 3 1 2
2 4 2 5
4 7 3 6
7 8 5 7
...
I want to merge both of these dataframes to get the final dataframe "data":
index a b c
0 1 4 1
1 3 1 2
2 4 2 5
3 2 7 3
...
Here is what I have tried:
data = A1.merge(A2, how='outer', left_index=True, right_index=True)
But I keep getting strange results. I don't even know if this works if you have multiple columns whose row order you need to preserve. I find that some of the entries become NaNs for some reason. I don't know how to fix it. I also tried data.join(A1, A2) but the compiler printed out that it couldn't join these two dataframes.
import pandas as pd
#Create Data Frame df and df1
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,0,11,12]},index=[0,3,5,6])
df1 = pd.DataFrame({'a':[13,14,15,16],'b':[17,18,19,20],'c':[21,22,23,24]},index=[1,2,4,7])
#Append df and df1 and sort by index.
df2 = df.append(df1)
print(df2.sort_index())
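Note that DataFrame.append was later deprecated and removed in pandas 2.0; the equivalent with pd.concat would be (a sketch using the same df and df1):
df2 = pd.concat([df, df1]).sort_index()
print(df2)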

Append only matching columns to dataframe

I have a sort of 'master' dataframe to which I'd like to append only the matching columns from another dataframe.
df:
A B C
1 2 3
df_to_append:
A B C D E
6 7 8 9 0
The problem is that when I use df.append(), it also appends the unmatched columns to df.
df = df.append(df_to_append, ignore_index=True)
Out:
A B C D E
1 2 3 NaN NaN
6 7 8 9 0
But my desired output would drop columns D and E, since they are not part of the original dataframe. Perhaps I need to use pd.concat? I don't think I can use pd.merge, since I don't have anything unique to merge on.
Using concat with join='inner':
pd.concat([df,df_to_append],join='inner')
Out[162]:
A B C
0 1 2 3
0 6 7 8
Just select the columns common to both dfs:
df.append(df_to_append[df.columns], ignore_index=True)
The simplest way is to select df's columns from df_to_append using df.columns, but if you don't know that all of the original columns are present in df_to_append, then you need the intersection of the two column sets:
cols = list(set(df.columns) & set(df_to_append.columns))
df.append(df_to_append[cols], ignore_index=True)
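In pandas versions where DataFrame.append is no longer available, the same result can be obtained with pd.concat (a sketch based on the answers above):
cols = df.columns.intersection(df_to_append.columns)
df = pd.concat([df, df_to_append[cols]], ignore_index=True)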

Matching on basis of a pair of columns in pandas

I have a dataframe df1 with multiple columns. I have df2 with the same set of columns. I want to get the records of df1 which aren't present in df2. I am able to perform this task as below:
df1[~df1['ID'].isin(df2['ID'])]
Now I want to do the same operation, but on the combination of NAME and ID. This means that if the NAME and ID together as a pair from df1 also exist as the same pair in df2, then that whole record should not be part of my result.
How do I accomplish this task using pandas?
I don't think that the currently accepted answer is actually correct. It was my impression that you would like to drop a value pair in df1 if that pair also exists in the other dataframe, independent of the row position that they take in the respective dataframes.
Consider the following dataframes
df1 = pd.DataFrame({'a': list('ABC'), 'b': list('CDF')})
df2 = pd.DataFrame({'a': list('ABAC'), 'b': list('CFFF')})
df1
a b
0 A C
1 B D
2 C F
df2
a b
0 A C
1 B F
2 A F
3 C F
So you would like to drop rows 0 and 2 in df1. However, with the above suggestion you get
df1.isin(df2)
a b
0 True True
1 True False
2 False True
What you can do instead is
compare_cols = ['a','b']
mask = pd.Series(list(zip(*[df1[c] for c in compare_cols]))).isin(list(zip(*[df2[c] for c in compare_cols])))
mask
0 True
1 False
2 True
dtype: bool
That is, you construct a Series of tuples from the columns you would like to compare coming from the first dataframe, and then check whether these tuples exist in the list of tuples obtained in the same way from the respective columns in the second dataframe.
Final step: df1 = df1.loc[~mask.values]
As pointed out by @rvrvrv in the comments, it is best to use mask.values instead of just mask in case df1 and mask do not have the same index (or unless you use the df1 index when constructing mask).
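As an additional sketch (not part of the original answer), the same pair-wise mask can be built by setting the compared columns as an index and using Index.isin, which compares the (a, b) tuples rather than each column independently:
compare_cols = ['a', 'b']
mask = df1.set_index(compare_cols).index.isin(df2.set_index(compare_cols).index)
df1 = df1[~mask]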
It's actually pretty easy.
df1[(~df1[['ID', 'Name']].isin(df2[['ID', 'Name']])).any(axis=1)]
You pass the column names that you want to compare as a list. The interesting part is what it outputs.
Let's say df1 equals:
ID Name
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 1 1
And df2 equals:
ID Name
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 1 9
Every (ID, Name) pair between df1 and df2 matches except for row 9. The result of my answer will return:
ID Name
9 1 1
Which is exactly what you want.
In more detail, when you do the mask:
~df1[['ID', 'Name']].isin(df2[['ID', 'Name']])
You get this:
ID Name
0 False False
1 False False
2 False False
3 False False
4 False False
5 False False
6 False False
7 False False
8 False False
9 False True
And we want to select the rows where one of those columns is True. For this, we can add any(axis=1) onto the end, which creates:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
And then when you index using this series, it will only select row 9.
isin() would not work here, as it also compares the index.
Let's have a look at a super powerful pandas tool: merge()
If we consider the nice example given by user3820991, we have:
df1 = pd.DataFrame({'a': list('ABC'), 'b': list('CDF')})
df2 = pd.DataFrame({'a': list('ABAC'), 'b': list('CFFF')})
df1
a b
0 A C
1 B D
2 C F
df2
a b
0 A C
1 B F
2 A F
3 C F
The default merge method in pandas is the 'inner' join. This gives you the equivalent of the isin() method for two columns:
df1.merge(df2[['a','b']], how='inner')
a b
0 A C
1 C F
If you would like the equivalent of not(isin()), then just change the merge method to an 'outer' join (a left join would work, but for the beauty of the example, you have more possibilities with the outer join).
This gives you all the rows in both dataframes; we only have to add indicator=True to be able to select the ones we want:
df1.merge(df2[['a','b']], how='outer', indicator=True)
a b _merge
0 A C both
1 B D left_only
2 C F both
3 B F right_only
4 A F right_only
We want the rows that are in df1 but not in df2, so 'left_only'. As a one-liner, you have:
pd.merge(df1, df2, on=['a','b'], how="outer", indicator=True
).query('_merge=="left_only"').drop(columns='_merge')
a b
1 B D
You can create a new column by concatenating NAME and ID and use this new column the same way you used ID in your question:
df1['temp'] = df1['NAME'].astype(str)+df1['ID'].astype(str)
df2['temp'] = df2['NAME'].astype(str)+df2['ID'].astype(str)
df1[~df1['temp'].isin(df2['temp'])].drop('temp', axis=1)
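One caveat, added here as an aside: without a separator, different NAME/ID pairs can collide (for example NAME 'A' with ID '12' and NAME 'A1' with ID '2' both become 'A12'). Adding a delimiter that cannot appear in the data avoids this:
df1['temp'] = df1['NAME'].astype(str) + '|' + df1['ID'].astype(str)
df2['temp'] = df2['NAME'].astype(str) + '|' + df2['ID'].astype(str)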

Compare Python Pandas DataFrames for matching rows

I have this DataFrame (df1) in Pandas:
df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
print df1
A B C D
0.860379 0.726956 0.394529 0.833217
0.014180 0.813828 0.559891 0.339647
0.782838 0.698993 0.551252 0.361034
0.833370 0.982056 0.741821 0.006864
0.855955 0.546562 0.270425 0.136006
0.491538 0.445024 0.971603 0.690001
0.911696 0.065338 0.796946 0.853456
0.744923 0.545661 0.492739 0.337628
0.576235 0.219831 0.946772 0.752403
0.164873 0.454862 0.745890 0.437729
I would like to check whether any rows (all columns) from another dataframe (df2) are present in df1. Here is df2:
df2 = df1.ix[4:8]
df2.reset_index(drop=True,inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True,inplace=True)
print df2
A B C D
0.855955 0.546562 0.270425 0.136006
0.491538 0.445024 0.971603 0.690001
0.911696 0.065338 0.796946 0.853456
0.744923 0.545661 0.492739 0.337628
0.576235 0.219831 0.946772 0.752403
2.000000 3.000000 4.000000 5.000000
14.000000 15.000000 16.000000 17.000000
I tried using df.lookup to search for one row at a time. I did it this way:
list1 = df2.ix[0].tolist()
cols = df1.columns.tolist()
print df1.lookup(list1, cols)
but I got this error message:
File "C:\Users\test.py", line 19, in <module>
print df1.lookup(list1, cols)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2217, in lookup
raise KeyError('One or more row labels was not found')
KeyError: 'One or more row labels was not found'
I also tried .all() using:
print (df2 == df1).all(1).any()
but I got this error message:
File "C:\Users\test.py", line 12, in <module>
print (df2 == df1).all(1).any()
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 884, in f
return self._compare_frame(other, func, str_rep)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 3010, in _compare_frame
raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects
I also tried isin() like this:
print df2.isin(df1)
but I got False everywhere, which is not correct:
A B C D
False False False False
False False False False
False False False False
False False False False
False False False False
False False False False
False False False False
False False False False
False False False False
False False False False
Is it possible to search for a set of rows in a DataFrame, by comparing it to another dataframe's rows?
EDIT:
Is it possible to drop df2 rows if those rows are also present in df1?
One possible solution to your problem would be to use merge. Checking whether any row (all columns) from another dataframe (df2) is present in df1 is equivalent to determining the intersection of the two dataframes. This can be accomplished with the following merge:
pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner')
For example, if df1 was
A B C D
0 0.403846 0.312230 0.209882 0.397923
1 0.934957 0.731730 0.484712 0.734747
2 0.588245 0.961589 0.910292 0.382072
3 0.534226 0.276908 0.323282 0.629398
4 0.259533 0.277465 0.043652 0.925743
5 0.667415 0.051182 0.928655 0.737673
6 0.217923 0.665446 0.224268 0.772592
7 0.023578 0.561884 0.615515 0.362084
8 0.346373 0.375366 0.083003 0.663622
9 0.352584 0.103263 0.661686 0.246862
and df2 was defined as:
A B C D
0 0.259533 0.277465 0.043652 0.925743
1 0.667415 0.051182 0.928655 0.737673
2 0.217923 0.665446 0.224268 0.772592
3 0.023578 0.561884 0.615515 0.362084
4 0.346373 0.375366 0.083003 0.663622
5 2.000000 3.000000 4.000000 5.000000
6 14.000000 15.000000 16.000000 17.000000
The function pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner') produces:
A B C D
0 0.259533 0.277465 0.043652 0.925743
1 0.667415 0.051182 0.928655 0.737673
2 0.217923 0.665446 0.224268 0.772592
3 0.023578 0.561884 0.615515 0.362084
4 0.346373 0.375366 0.083003 0.663622
The results are all of the rows (all columns) that are in both df1 and df2.
We can also modify this example if the columns are not the same in df1 and df2 and just compare the row values that are the same for a subset of the columns. If we modify the original example:
df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
df2 = df1.ix[4:8]
df2.reset_index(drop=True,inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True,inplace=True)
df2 = df2[['A', 'B', 'C']] # df2 has only columns A B C
Then we can look at the common columns using common_cols = list(set(df1.columns) & set(df2.columns)) between the two dataframes then merge:
pd.merge(df1, df2, on=common_cols, how='inner')
EDIT: New question (from the comments): having identified the rows from df2 that are also present in the first dataframe (df1), is it possible to take the result of the pd.merge() and then drop those rows from df2?
I do not know of a straightforward way to accomplish the task of dropping the rows from df2 that are also present in df1. That said, you could use the following:
ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)
There probably exists a better way to accomplish that task, but I am unaware of such a method/function.
EDIT 2: How to drop the rows from df2 that are also present in df1, as shown in @WR's answer.
The method provided df2[~df2['A'].isin(df12['A'])] does not account for all types of situations. Consider the following DataFrames:
df1:
A B C D
0 6 4 1 6
1 7 6 6 8
2 1 6 2 7
3 8 0 4 1
4 1 0 2 3
5 8 4 7 5
6 4 7 1 1
7 3 7 3 4
8 5 2 8 8
9 3 2 8 4
df2:
A B C D
0 1 0 2 3
1 8 4 7 5
2 4 7 1 1
3 3 7 3 4
4 5 2 8 8
5 1 1 1 1
6 2 2 2 2
df12:
A B C D
0 1 0 2 3
1 8 4 7 5
2 4 7 1 1
3 3 7 3 4
4 5 2 8 8
Using the above DataFrames with the goal of dropping rows from df2 that are also present in df1 would result in the following:
A B C D
0 1 1 1 1
1 2 2 2 2
Rows (1, 1, 1, 1) and (2, 2, 2, 2) are in df2 and not in df1. Unfortunately, using the provided method (df2[~df2['A'].isin(df12['A'])]) results in:
A B C D
6 2 2 2 2
This occurs because the value 1 in column A is found in both the intersection DataFrame (i.e. the row (1, 0, 2, 3)) and df2, so the filter removes both (1, 0, 2, 3) and (1, 1, 1, 1). This is unintended, since the row (1, 1, 1, 1) is not in df1 and should not be removed.
I think the following will provide a solution. It creates a dummy column that is later used to subset the DataFrame to the desired results:
df12['key'] = 'x'
temp_df = pd.merge(df2, df12, on=df2.columns.tolist(), how='left')
temp_df[temp_df['key'].isnull()].drop('key', axis=1)
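A sketch of an equivalent anti-join that uses merge's indicator flag instead of a dummy 'key' column (this mirrors the indicator approach shown in the previous question and is not part of the original answer; it assumes df1's rows are unique, otherwise matches would be duplicated):
merged = df2.merge(df1, on=df2.columns.tolist(), how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')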
@Andrew: I believe I found a way to drop the rows of one dataframe that are already present in another (i.e. to answer my EDIT) without using loops - let me know if you disagree and/or if my OP + EDIT did not clearly state this:
THIS WORKS
The columns for both dataframes are always the same - A, B, C and D. With this in mind, based heavily on Andrew's approach, here is how to drop the rows from df2 that are also present in df1:
common_cols = df1.columns.tolist() #generate list of column names
df12 = pd.merge(df1, df2, on=common_cols, how='inner') #extract common rows with merge
df2 = df2[~df2['A'].isin(df12['A'])]
Line 3 does the following: it extracts only the rows from df2 that do not match rows in df1.
For 2 rows to be different, ANY one column of one row must necessarily be different from the corresponding column in the other row.
Here, I picked column A to make this comparison; it is possible to use any one of the column names, but not ALL of the column names.
NOTE: this method is essentially the equivalent of the SQL NOT IN().

pandas: Boolean indexing with multi index

There are many questions here with similar titles, but I couldn't find one that's addressing this issue.
I have dataframes from many different origins, and I want to filter one by the other. Using boolean indexing works great when the boolean series is the same size as the filtered dataframe, but not when the size of the series is the same as a higher level index of the filtered dataframe.
In short, let's say I have this dataframe:
In [4]: df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
'b':[1,2,3,1,2,3,1,2,3],
'c':range(9)}).set_index(['a', 'b'])
Out[4]:
c
a b
1 1 0
2 1
3 2
2 1 3
2 4
3 5
3 1 6
2 7
3 8
And this series:
In [5]: filt = pd.Series({1:True, 2:False, 3:True})
Out[6]:
1 True
2 False
3 True
dtype: bool
And the output I want is this:
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
I am not looking for solutions that are not using the filt series, such as:
df[df.index.get_level_values('a') != 2]
df[df.index.get_level_values('a').isin([1,3])]
I want to know if I can use my input filt series as is, as I would use a filter on c:
filt = df.c < 7
df[filt]
If you transform your index 'a' back to a column, you can do it as follows:
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
'b':[1,2,3,1,2,3,1,2,3],
'c':range(9)})
>>> filt = pd.Series({1:True, 2:False, 3:True})
>>> df[filt[df['a']].values]
a b c
0 1 1 0
1 1 2 1
2 1 3 2
6 3 1 6
7 3 2 7
8 3 3 8
Edit: As suggested by @joris, this also works with indexes. Here is the code for your sample data:
>>> df[filt[df.index.get_level_values('a')].values]
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
If the boolean series is not aligned with the dataframe you want to index it with, you can first explicitly align it with align:
In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
In [26]: filt_aligned
Out[26]:
0
a b
1 1 True
2 True
3 True
2 1 False
2 False
3 False
3 1 True
2 True
3 True
And then you can index with it:
In [27]: df[filt_aligned[0]]
Out[27]:
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
Note: align didn't work with a Series, hence the to_frame in the align call, and hence the [0] above to get the Series back.
You can use pd.IndexSlice.
>>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
where filt[filt].index.values is just [1, 3]. In other words
>>> df.loc[pd.IndexSlice[[1, 3], :]]
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
so if you design your filter construction a bit differently, the expression gets shorter. The advantage over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is that you have more control over the indexing.
The topic of multiindex slicing is covered in more depth here.
Here is the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
print(df.loc[pd.IndexSlice[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])
The more readable (to my liking) solution is to reindex the boolean series (dataframe) to match the index of the multi-index df:
df.loc[filt.reindex(df.index, level='a')]
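For illustration only (a sketch, not part of the original answer), the intermediate mask produced by reindex broadcasts the per-'a' values across the full (a, b) MultiIndex before indexing:
mask = filt.reindex(df.index, level='a')
print(mask)        # True for rows with a == 1 or a == 3, False for a == 2
print(df.loc[mask])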
I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols, B = 100M rows x 3 cols. B has a MultiIndex, where the first (high) level is equal to the index of A. Let C be a slice from A of size 10k rows. My task was to get rows from B whose high-level index matches the indexes of C as fast as possible. C is selected at runtime; A and B are static.
I tried the solutions from here: get_level_values takes many seconds, and df.align didn't even finish, giving a MemoryError (and it also took seconds).
The solution which worked for me (in ~300msec during runtime) is the following:
For each index value i from A, find the first and the last (non-inclusive) positional indexes in B which contain i as the first level of MultiIndex. Store these pairs in A. This is done once and in advance.
Example code:
from collections import defaultdict

def construct_position_indexes(A, B):
    indexes = defaultdict(list)
    prev_index = 0
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            # record where the block for cur_index starts...
            indexes[cur_index].append(i)
            # ...and where the previous block ends (non-inclusive)
            if prev_index:
                indexes[prev_index].append(i)
            prev_index = cur_index
    # close the last block
    indexes[cur_index].append(i + 1)
    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # rows of A without a match become NaN (float), so we fix that
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A
At runtime, get positional boundaries from C and construct a list of all positional indexes to search for in B, and pass them to B.take():
def get_slice(B, C):
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)
I hope it's not too complicated. Essentially, the idea is, for each row in A, to store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B with.
This is a toy example:
A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)
dataA
A0 0
A1 1
A2 2
mindex = pd.MultiIndex.from_tuples([
('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)
dataB
A0 B0 0
B1 1
A1 B0 2
A2 B0 3
B1 4
B3 5
A = construct_position_indexes(A, B)
print(A)
dataA start_index end_index
A0 0 0 2
A1 1 2 3
A2 2 3 6
C = A.iloc[[0, 2], :]
print(C)
dataA start_index end_index
A0 0 0 2
A2 2 3 6
print(get_slice(B, C))
dataB
A0 B0 0
B1 1
A2 B0 3
B1 4
B3 5
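As an aside (not part of the original answer, and only reasonable when the data is small enough that the performance concern above does not apply), the same slice of B can be selected directly with partial label indexing on the first level:
print(B.loc[C.index.tolist()])   # selects all B rows whose first-level label appears in C's index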
Simply:
df.where(
filt.rename_axis('a').rename('c').to_frame()
).dropna().astype(int)
Explanation:
.rename_axis('a') renames the index as a (the index we want to filter by)
.rename('c') renames the column as c (the column that stores the values)
.to_frame() converts this Series into a DataFrame, for compatibility with df
df.where(...) filters the rows, leaving missing values (NaN) where filter is False
.dropna() removes the rows with missing values (in our case where a == 2)
.astype(int) converts from float back to int (the NaNs introduced by where force the column to float)
By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.
Not sure how fast/slow it would be on a large-scale dataframe, but what I sometimes do is
df.loc[filt[filt].index]
The problem is that the loc method only works with boolean inputs on a 1D index. If you provide the values of the first level elements you want to retain, you're good to go. So by filtering filt with itself (since it's on a 1D index) and keeping the values from its index, you achieve your goal.
Building on @Markus Dutschke's answer, note that the IndexSlice object can be created just once and then used over and over (even to slice up different objects). I find this creates more readable code, especially when using it twice to slice on both MultiIndex rows and columns in the same .loc.
Applying this to his answer and simplifying slightly (no need for .values):
idx = pd.IndexSlice
df.loc[idx[filt[filt].index, :], :]
or the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
idx = pd.IndexSlice
print(df.loc[idx[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[idx[filt[filt].index, :], :])
