pandas: Boolean indexing with a MultiIndex

There are many questions here with similar titles, but I couldn't find one that's addressing this issue.
I have dataframes from many different origins, and I want to filter one by the other. Boolean indexing works great when the boolean series is the same size as the filtered dataframe, but not when the series only has one entry per value of a higher level of the filtered dataframe's index.
In short, let's say I have this dataframe:
In [4]: df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
                           'b':[1,2,3,1,2,3,1,2,3],
                           'c':range(9)}).set_index(['a', 'b'])
Out[4]:
     c
a b
1 1  0
  2  1
  3  2
2 1  3
  2  4
  3  5
3 1  6
  2  7
  3  8
And this series:
In [5]: filt = pd.Series({1:True, 2:False, 3:True})
Out[5]:
1     True
2    False
3     True
dtype: bool
And the output I want is this:
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
I am not looking for solutions that bypass the filt series, such as:
df[df.index.get_level_values('a') != 2]
df[df.index.get_level_values('a').isin([1,3])]
I want to know if I can use my input filt series as is, as I would use a filter on c:
filt = df.c < 7
df[filt]

If you transform your index 'a' back to a column, you can do it as follows:
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
...                    'b':[1,2,3,1,2,3,1,2,3],
...                    'c':range(9)})
>>> filt = pd.Series({1:True, 2:False, 3:True})
>>> df[filt[df['a']].values]
   a  b  c
0  1  1  0
1  1  2  1
2  1  3  2
6  3  1  6
7  3  2  7
8  3  3  8
Edit:
As suggested by @joris, this also works with the index. Here is the code for your sample data:
>>> df[filt[df.index.get_level_values('a')].values]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
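To see why this works, note that the inner lookup expands filt to one boolean per row of df (looked up by each row's 'a' value), and .values drops that index so the mask is applied purely positionally. A small illustration using the question's df and filt:

inner = filt[df.index.get_level_values('a')]
print(inner.values)
# [ True  True  True False False False  True  True  True]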

If the boolean series is not aligned with the dataframe you want to index, you can first explicitly align it with align:
In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)

In [26]: filt_aligned
Out[26]:
         0
a b
1 1   True
  2   True
  3   True
2 1  False
  2  False
  3  False
3 1   True
  2   True
  3   True
And then you can index with it:
In [27]: df[filt_aligned[0]]
Out[27]:
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
Note: align doesn't work directly with a Series, hence the to_frame() in the align call, and hence the [0] above to get the series back.
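For reference, the whole approach as one runnable snippet (using the sample data from the question):

import pandas as pd

df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
                   'b':[1,2,3,1,2,3,1,2,3],
                   'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})

# broadcast filt over level 'a' of df's index, then use the result as a mask
df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
print(df[filt_aligned[0]])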

You can use pd.IndexSlice.
>>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
where filt[filt].index.values is just [1, 3]. In other words
>>> df.loc[pd.IndexSlice[[1, 3], :]]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
so if you design your filter construction a bit differently, the expression gets shorter. The advantage over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is that you have more control over the indexing.
The topic of multiindex slicing is covered in more depth here.
Here is the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
print(df.loc[pd.IndexSlice[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])

The more readable (to my liking) solution is to reindex the boolean series (or dataframe) to match the index of the MultiIndexed df:
df.loc[filt.reindex(df.index, level='a')]
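As a quick illustration (hedged; this is what reindex with level='a' should produce for the question's data), the reindexed series broadcasts each boolean across all rows that share the same 'a' value, giving a mask of the same length as df:

mask = filt.reindex(df.index, level='a')
print(mask.values)
# [ True  True  True False False False  True  True  True]
print(df.loc[mask])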

I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols, B = 100M rows x 3 cols. B has a MultiIndex, where the first (highest) level is equal to the index of A. Let C be a slice from A of size 10k rows. My task was to get the rows from B whose high-level index matches the index of C, as fast as possible. C is selected at runtime; A and B are static.
I tried the solutions from here: get_level_values takes many seconds; df.align didn't even finish, giving a MemoryError (and also took seconds before failing).
The solution which worked for me (in ~300msec during runtime) is the following:
For each index value i from A, find the first and the last (non-inclusive) positional indexes in B which contain i as the first level of MultiIndex. Store these pairs in A. This is done once and in advance.
Example code:
from collections import defaultdict

import pandas as pd

def construct_position_indexes(A, B):
    indexes = defaultdict(list)
    prev_index = 0
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            indexes[cur_index].append(i)
            if prev_index:
                indexes[prev_index].append(i)
            prev_index = cur_index
    # close the range for the last index value
    indexes[cur_index].append(i + 1)

    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # the joined columns become floats (NaN for missing keys), so we fix that
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A
At runtime, get positional boundaries from C and construct a list of all positional indexes to search for in B, and pass them to B.take():
def get_slice(B, C):
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)
I hope it's not too complicated. Essentially, the idea is, for each row in A, to store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B with.
This is a toy example:

A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)
#     dataA
# A0      0
# A1      1
# A2      2

mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)
#        dataB
# A0 B0      0
#    B1      1
# A1 B0      2
# A2 B0      3
#    B1      4
#    B3      5

A = construct_position_indexes(A, B)
print(A)
#     dataA  start_index  end_index
# A0      0            0          2
# A1      1            2          3
# A2      2            3          6

C = A.iloc[[0, 2], :]
print(C)
#     dataA  start_index  end_index
# A0      0            0          2
# A2      2            3          6

print(get_slice(B, C))
#        dataB
# A0 B0      0
#    B1      1
# A2 B0      3
#    B1      4
#    B3      5
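Since the first level of B is sorted in this example, the same start/end positions could likely also be computed without a Python loop, using searchsorted. This is my own sketch, not part of the original approach, and it assumes B's first index level is sorted:

# vectorized alternative (assumes B's first index level is sorted)
lvl = B.index.get_level_values(0)
A['start_index'] = lvl.searchsorted(A.index.values, side='left')
A['end_index'] = lvl.searchsorted(A.index.values, side='right')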

Simply:
df.where(
    filt.rename_axis('a').rename('c').to_frame()
).dropna().astype(int)
Explanation:
.rename_axis('a') renames the index as a (the index we want to filter by)
.rename('c') renames the column as c (the column that stores the values)
.to_frame() converts this Series into a DataFrame, for compatibility with df
df.where(...) filters the rows, leaving missing values (NaN) where filter is False
.dropna() removes the rows with missing values (in our case, where a == 2)
.astype(int) converts from float back to int (the NaN values introduced by where force the column to float)
By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.
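For clarity, this is the boolean frame that the chained renames build before it is handed to where (shown with the question's filt):

cond = filt.rename_axis('a').rename('c').to_frame()
print(cond)
#        c
# a
# 1   True
# 2  False
# 3   True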

Not sure how fast/slow it would be on a large-scale dataframe, but what I sometimes do is
df.loc[filt[filt].index]
The problem is that a boolean mask passed to loc has to match the full index, not just one level. If instead you provide the first-level values you want to retain, you're good to go. So by filtering filt with itself (it lives on a plain 1D index) and keeping the values from its index, you achieve your goal.
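Concretely, with the question's filt, filt[filt] keeps only the True entries, and its index gives the first-level labels to select:

print(filt[filt].index.tolist())   # [1, 3]
df.loc[filt[filt].index]           # same rows as the desired output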

Building on @Markus Dutschke's answer, note that the IndexSlice object can be created just once and then used over and over (even to slice up different objects). I find this creates more readable code, especially when using it twice to slice on both MultiIndex rows and columns in the same .loc.
Applying this to his answer and simplifying slightly (no need for .values):
idx = pd.IndexSlice
df.loc[idx[filt[filt].index, :], :]
or the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
idx = pd.IndexSlice
print(df.loc[idx[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[idx[filt[filt].index, :], :])


Pandas Insert a row above the Index and the Series data in a Dataframe

I've been through several trials; nothing seems to work so far.
I have tried df.insert(0, "XYZ", 555), which seemed to work until it didn't, for reasons I'm not certain of.
I understand that the issue is that the index is not considered a Series, and so df.iloc[0] does not allow you to insert data directly above the index column.
I've also tried manually adding a first index with the value "XYZ" to the list of indices in the dataframe definition, but nothing has worked.
Thanks for your help.
A B C D are my columns and range(5) is my index. I am trying to obtain the layout below, with an arbitrary row starting with type followed by a list of strings. Thanks.
      A          B          C          D
type  'string1'  'string2'  'string3'  'string4'
0
1
2
3
4
If you use Timestamps as the index, adding a custom single row with its own custom index will throw an error:
ValueError: Cannot add integral value to Timestamp without offset. I am guessing it's due to the difference in the operands, as when subtracting an integer from a Timestamp, for example? How could I fix this in a generic manner? Thanks!
If you want to insert a row before the first row, you can do it this way:
data:

In [57]: df
Out[57]:
  id  var
0  a    1
1  a    2
2  a    3
3  b    5
4  b    9
adding one row:

In [58]: df.loc[df.index.min() - 1] = ['z', -1]

In [59]: df
Out[59]:
   id  var
 0  a    1
 1  a    2
 2  a    3
 3  b    5
 4  b    9
-1  z   -1
sort index:

In [60]: df = df.sort_index()

In [61]: df
Out[61]:
   id  var
-1  z   -1
 0  a    1
 1  a    2
 2  a    3
 3  b    5
 4  b    9
optionally reset your index:

In [62]: df = df.reset_index(drop=True)

In [63]: df
Out[63]:
  id  var
0  z   -1
1  a    1
2  a    2
3  a    3
4  b    5
5  b    9
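Regarding the Timestamp-index comment above: df.index.min() - 1 fails there because an integer cannot be added to or subtracted from a Timestamp directly; subtracting a Timedelta (or another offset) instead should work. A hedged sketch, where the one-day offset is an arbitrary choice:

import pandas as pd

df = pd.DataFrame({'id': list('aaabb'), 'var': [1, 2, 3, 5, 9]},
                  index=pd.date_range('2016-01-02', periods=5))

# subtract a Timedelta instead of an integer to build the new label
df.loc[df.index.min() - pd.Timedelta(days=1)] = ['z', -1]
df = df.sort_index()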

Group by value of sum of columns with Pandas

I got lost in the pandas docs and features trying to figure out a way to group the columns of a DataFrame by the values of their sums.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}

In [3]: df = pd.DataFrame(dat)

In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped, since their sums are all equal to 1. The resulting DataFrame would have column labels equal to the sum that defines each group. Like this:
   1  9
0  2  2
1  1  3
2  0  4
Any ideas to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4

[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print(df.groupby('totals').sum().transpose())
# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4
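On recent pandas versions, where grouping with axis=1 is deprecated, the same result can be obtained with a transpose round-trip, which is essentially what the row-wise solution above does. A compact sketch, starting again from the original df = pd.DataFrame(dat):

df = pd.DataFrame(dat)
print(df.T.groupby(df.sum()).sum().T)
#    1  9
# 0  2  2
# 1  1  3
# 2  0  4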

How to concatenate two dataframes without duplicates?

I'd like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don't add):
Dataframe A:

   I  II
0  1   2
1  3   1

Dataframe B:

   I  II
0  5   6
1  3   1

New Dataframe:

   I  II
0  1   2
1  3   1
2  5   6
How can I do this?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
   A  B
0  1  2
1  3  1

>>> df2
   A  B
0  5  6
1  3  1

>>> pandas.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
   A  B
0  1  2
1  3  1
2  5  6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
If you already have a duplicate row in DataFrame A, then concatenating and dropping duplicates will also remove rows from DataFrame A that you might want to keep.
In that case you need to create a new column with a cumulative count before dropping duplicates. It all depends on your use case, but this is common with time-series data.
Here is an example:
df_1 = pd.DataFrame([
    {'date':'11/20/2015', 'id':4, 'value':24},
    {'date':'11/20/2015', 'id':4, 'value':24},
    {'date':'11/20/2015', 'id':6, 'value':34},
])
df_2 = pd.DataFrame([
    {'date':'11/20/2015', 'id':4, 'value':24},
    {'date':'11/20/2015', 'id':6, 'value':14},
])

df_1['count'] = df_1.groupby(['date','id','value']).cumcount()
df_2['count'] = df_2.groupby(['date','id','value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)

>>> df_tot
         date  id  value
0  11/20/2015   4     24
1  11/20/2015   4     24
2  11/20/2015   6     34
1  11/20/2015   6     14
I'm surprised that pandas doesn't offer a native solution for this task.
I don't think that it's efficient to just drop the duplicates if you work with large datasets (as Rian G suggested).
It is probably most efficient to use sets to find the non-overlapping indices, and then use a list comprehension to translate from index labels to row locations (booleans), which are needed to access rows via iloc. Below you find a function that performs the task. If you don't choose a specific column (col) to check for duplicates, then the indexes are used, as you requested. If you do choose a specific column, be aware that existing duplicate entries in a will remain in the result.
import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and type(a) is not pd.core.frame.DataFrame)
            or (b is not None and type(b) is not pd.core.frame.DataFrame)):
        raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        aind = a.index.values
        bind = b.index.values
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    return a.append(b.iloc[take_rows, :])
# Usage
a = pd.DataFrame([[1, 2, 3], [1, 5, 6], [1, 12, 13]], index=[1000, 2000, 5000])
b = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[1000, 2000, 3000])

append_non_duplicates(a, b)
#       0   1   2
# 1000  1   2   3    <- from a
# 2000  1   5   6    <- from a
# 5000  1  12  13    <- from a
# 3000  7   8   9    <- from b

append_non_duplicates(a, b, 0)
#       0   1   2
# 1000  1   2   3    <- from a
# 2000  1   5   6    <- from a
# 5000  1  12  13    <- from a
# 2000  4   5   6    <- from b
# 3000  7   8   9    <- from b
Another option:
concatenation = pd.concat([
    dfA,
    dfB[dfB['I'].isin(dfA['I']) == False],  # <-- the rows in dfB whose value in column 'I' does not appear in dfA
], ignore_index=True)
The object concatenation will be:
   I  II
0  1   2
1  3   1
2  5   6

pandas: filter rows of DataFrame with operator chaining

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I've found to filter rows is via normal bracket indexing
df_filtered = df[df['column'] == value]
This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?
df_filtered = df.mask(lambda x: x['column'] == value)
I'm not entirely sure what you want, and your last line of code does not help either, but anyway:
"Chained" filtering is done by "chaining" the criteria in the boolean index.
In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6
If you want to chain methods, you can add your own mask method and use that one.
In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4,4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.loc['d', 'A'] = df.loc['a', 'A']

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6
Filters can be chained using a Pandas query:
df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')
Filters can also be combined in a single query:
df_filtered = df.query('a > 0 and 0 < b < 2')
The answer from @lodagro is great. I would extend it by generalizing the mask function as:
def mask(df, f):
    return df[f(df)]
Then you can do stuff like:
df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)
Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80] # equivalent to df[df.A == 80] but chainable
df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]
If all you're doing is filtering, you can also omit the .loc.
pandas provides two alternatives to Wouter Overmeire's answer which do not require any overriding. One is .loc[.] with a callable, as in
df_filtered = df.loc[lambda x: x['column'] == value]
the other is .pipe(), as in
df_filtered = df.pipe(lambda x: x.loc[x['column'] == value])
I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/
I'll add other edits to make this post more useful.
pandas.DataFrame.query
query was made for exactly this purpose. Consider the dataframe df
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 5)),
    columns=list('ABCDE')
)
df

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5
Let's use query to filter all rows where D > B
df.query('D > B')

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5
Which we chain
df.query('D > B').query('C > B')
# equivalent to
# df.query('D > B and C > B')
# but defeats the purpose of demonstrating chaining

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5
My answer is similar to the others. If you do not want to create a new function you can use what pandas has defined for you already. Use the pipe method.
df.pipe(lambda d: d[d['column'] == value])
I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:
In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6
But I found that if you join the criteria with a pipe (|) instead of an ampersand, they are combined as an OR condition, satisfied whenever either of them is true (the (... == True) wrappers below are redundant, but harmless):
df[((df.A == 1) == True) | ((df.D == 6) == True)]
Just want to add a demonstration using loc to filter not only by rows but also by columns, and to mention some merits of the chained operation.
The code below can filter the rows by value.
df_filtered = df.loc[df['column'] == value]
By modifying it a bit you can filter the columns as well.
df_filtered = df.loc[df['column'] == value, ['year', 'column']]
So why do we want a chained method? The answer is that it is simple to read if you have many operations. For example,
res = df\
    .loc[df['station'] == 'USA', ['TEMP', 'RF']]\
    .groupby('year')\
    .agg(np.nanmean)
If you would like to apply all of the common boolean masks as well as a general purpose mask you can chuck the following in a file and then simply assign them all as follows:
pd.DataFrame = apply_masks()
Usage:
A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
A.le_mask("A", 0.7).ge_mask("B", 0.2)  # ... may be repeated as necessary
It's a little bit hacky but it can make things a little bit cleaner if you're continuously chopping and changing datasets according to filters.
There's also a general purpose filter adapted from Daniel Velkov above in the gen_mask function which you can use with lambda functions or otherwise if desired.
File to be saved (I use masks.py):
import pandas as pd

def eq_mask(df, key, value):
    return df[df[key] == value]

def ge_mask(df, key, value):
    return df[df[key] >= value]

def gt_mask(df, key, value):
    return df[df[key] > value]

def le_mask(df, key, value):
    return df[df[key] <= value]

def lt_mask(df, key, value):
    return df[df[key] < value]

def ne_mask(df, key, value):
    return df[df[key] != value]

def gen_mask(df, f):
    return df[f(df)]

def apply_masks():
    pd.DataFrame.eq_mask = eq_mask
    pd.DataFrame.ge_mask = ge_mask
    pd.DataFrame.gt_mask = gt_mask
    pd.DataFrame.le_mask = le_mask
    pd.DataFrame.lt_mask = lt_mask
    pd.DataFrame.ne_mask = ne_mask
    pd.DataFrame.gen_mask = gen_mask
    return pd.DataFrame

if __name__ == '__main__':
    pass
This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.
https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
You don't need to download the entire repo: saving the file and doing
from where import where as W
should suffice. Then you use it like this:
df = pd.DataFrame([[1, 2, True],
                   [3, 4, False],
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])

# On a specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])

# On the entire DataFrame - or a subset of it:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])
A slightly less stupid usage example:
data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]
By the way: even in the case in which you are just using boolean cols,
df.loc[W['cond1']].loc[W['cond2']]
can be much more efficient than
df.loc[W['cond1'] & W['cond2']]
because it evaluates cond2 only where cond1 is True.
DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.
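To illustrate the point about evaluating the second condition only on rows that pass the first, without the helper, here is a plain-pandas sketch using the df defined above (boolean column c, then a condition on a):

step1 = df.loc[df['c']]            # rows where the boolean column 'c' is True
step2 = step1.loc[step1['a'] > 2]  # this condition is only evaluated on step1's rows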
This is unappealing as it requires I assign df to a variable before being able to filter on its values.
df[df["column_name"] != 5].groupby("other_column_name")
seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.
So the way I see it, you do two things when subsetting your data ready for analysis:
get rows
get columns
Pandas has a number of ways of doing each of these and some techniques that help get rows and columns. For new Pandas users it can be confusing as there is so much choice.
Do you use iloc, loc, brackets, query, isin, np.where, mask etc...
Method chaining
Now method chaining is a great way to work when data wrangling. In R they have a simple way of doing it, you select() columns and you filter() rows.
So if we want to keep things simple in pandas, why not use filter() for columns and query() for rows? These both return dataframes, so there is no need to mess around with boolean indexing or to wrap the return value in df[ ].
So what does that look like?
df.filter(['col1', 'col2', 'col3']).query("col1 == 'sometext'")
You can then chain on any other methods like groupby, dropna(), sort_values(), reset_index() etc etc.
By being consistent and using filter() to get your columns and query() to get your rows it will be easier to read your code when coming back to it after a time.
But filter can select rows?
Yes, this is true, but by default query() gets rows and filter() gets columns. So if you stick with the defaults there is no need to use the axis= parameter.
query()
query() can be used with both and/or and &/|; you can also use the comparison operators >, <, >=, <=, ==, !=, as well as Python's in and not in.
You can pass a list to query using @my_list.
Some examples of using query to get rows
df.query('A > B')
df.query('a not in b')
df.query("series == '2206'")
df.query("col1 == #mylist")
df.query('Salary_in_1000 >= 100 & Age < 60 & FT_Team.str.startswith("S").values')
filter()
So filter is basically like using bracket notation df[] or df[[]] in that it uses labels to select columns. But it does more than the bracket notation.
filter has a like= parameter to help select columns with partial names.
df.filter(like='partial_name')
filter also has regex to help with selection
df.filter(regex='reg_string')
So to sum up, this way of working might not fit every situation, e.g. if you want to use indexing/slicing then iloc is the way to go. But it does seem to be a solid way of working and can simplify your workflow and code.
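A self-contained sketch of the filter-then-query pattern (the column names and values here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'col1': ['sometext', 'other', 'sometext'],
                   'col2': [1, 2, 3],
                   'col3': [4.0, 5.0, 6.0],
                   'col4': ['x', 'y', 'z']})

# keep three columns, then keep only the rows matching the query
out = df.filter(['col1', 'col2', 'col3']).query("col1 == 'sometext'")
print(out)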
You can also leverage the numpy library for logical operations. It's pretty fast.
df[np.logical_and(df['A'] == 1 ,df['B'] == 6)]
If you set your columns to search as indexes, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(3, size=(10, 5)),
    columns=list('ABCDE')
)
df
# Out[55]:
#    A  B  C  D  E
# 0  0  2  2  2  2
# 1  1  1  2  0  2
# 2  0  2  0  0  2
# 3  0  2  2  0  1
# 4  0  1  1  2  0
# 5  0  0  0  1  2
# 6  1  0  1  1  1
# 7  0  0  2  0  2
# 8  2  2  2  2  2
# 9  1  2  0  2  1

df.set_index(['A', 'D']).xs([0, 2]).reset_index()
# Out[57]:
#    A  D  B  C  E
# 0  0  2  2  2  2
# 1  0  2  1  1  0
