Pandas: filter on multiple columns [duplicate]

I am working in Pandas and want to apply multiple filters to a data frame across multiple fields.
I am actually working with another, more complex data frame, but I am simplifying the context for this question. Here is the setup for a sample data frame:
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=16)
rand_df = pd.DataFrame(np.random.randn(16, 4), index=dates, columns=list('ABCD'))
Applying one filter to this data frame is well documented and simple:
rand_df.loc[lambda df: df['A'] < 0]
Since the lambda looks like a simple boolean expression, it is tempting to do the following. It does not work: each comparison returns a whole boolean Series, and the and keyword tries to reduce each Series to a single True/False, so the two conditions cannot be combined this way:
rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-dfa05ab293f9> in <module>()
----> 1 rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
I have found two ways to successfully implement this. I will add them to the potential answers, so you can comment directly on them as solutions. However, I would like to solicit other approaches, since I am not really sure that either of these is a very standard approach for filtering a Pandas data frame.

In [3]: rand_df.query("A < 0 and B < 0")
Out[3]:
A B C D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199 1.202082 0.136657
2017-01-08 -1.445050 -0.367112 -2.617743 0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
or:
In [6]: rand_df[rand_df[['A','B']].lt(0).all(1)]
Out[6]:
A B C D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199 1.202082 0.136657
2017-01-08 -1.445050 -0.367112 -2.617743 0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
P.S. You will find plenty of examples of this pattern in the Pandas docs:

rand_df[(rand_df.A < 0) & (rand_df.B < 0)]

To keep the lambda form, combine the conditions element-wise with & instead of and:
rand_df.loc[lambda x: (x.A < 0) & (x.B < 0)]
# Or
# rand_df[lambda x: (x.A < 0) & (x.B < 0)]
A B C D
2017-01-12 -0.460918 -1.001184 -0.796981 0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
You can speed up the evaluation by using boolean numpy arrays
c1 = rand_df.A.values < 0
c2 = rand_df.B.values < 0
rand_df[c1 & c2]
A B C D
2017-01-12 -0.460918 -1.001184 -0.796981 0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120

Here is an approach that “chains” use of the ‘loc’ operation:
rand_df.loc[lambda df: df['A'] < 0].loc[lambda df: df['B'] < 0]

Here is an approach which involves writing a method to do the filtering. I am sure that some filters will be sufficiently complex or complicated that the method is the best way to go (this case is not that complex). Also, when I am using Pandas and I write a "for" loop, I feel like I am doing it wrong.
def lt_zero_ab(df):
    result = []
    for index, row in df.iterrows():
        if row['A'] < 0 and row['B'] < 0:
            result.append(index)
    return result

rand_df.loc[lt_zero_ab]
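If the filtering logic lives in its own function anyway, a vectorized variant avoids the row loop entirely: .loc also accepts a callable that returns a boolean mask. A minimal sketch of that idea (the function name here is just illustrative):
def lt_zero_ab_mask(df):
    # element-wise masks combined with &, no iteration needed
    return (df['A'] < 0) & (df['B'] < 0)

rand_df.loc[lt_zero_ab_mask]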


Pandas : Confused when extending DataFrame vs Series (Column/Index). Why the difference?

First off, let me say that I've already looked over various responses to similar questions, but so far, none of them has really made it clear to me why (or why not) the Series and DataFrame methodologies are different.
Also, some of the Pandas documentation is not clear. For example, looking up Series.reindex,
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
the examples suddenly switch to DataFrame rather than Series, yet the two methods don't seem to overlap exactly.
So, now to it, first with a DataFrame.
> df = pd.DataFrame(np.random.randn(6,4), index=range(6), columns=list('ABCD'))
> df
Out[544]:
A B C D
0 0.136833 -0.974500 1.708944 0.435174
1 -0.357955 -0.775882 -0.208945 0.120617
2 -0.002479 0.508927 -0.826698 -0.904927
3 1.955611 -0.558453 -0.476321 1.043139
4 -0.399369 -0.361136 -0.096981 0.092468
5 -0.130769 -0.075684 0.788455 1.640398
Now, to add new columns, I can do something simple (2 ways, same result).
> df[['X','Y']] = (99,-99)
> df.loc[:,['X','Y']] = (99,-99)
> df
Out[557]:
A B C D X Y
0 0.858615 -0.552171 1.225210 -1.700594 99 -99
1 1.062435 -1.917314 1.160043 -0.058348 99 -99
2 0.023910 1.262706 -1.924022 -0.625969 99 -99
3 1.794365 0.146491 -0.103081 0.731110 99 -99
4 -1.163691 1.429924 -0.194034 0.407508 99 -99
5 0.444909 -0.905060 0.983487 -4.149244 99 -99
Now, with a Series, I have hit a (mental?) block trying the same.
I'm going to be using a loop to construct a list of Series that will eventually be a data frame, but I want to deal with each 'row' as a Series first, (to make development easier).
> ss = pd.Series(np.random.randn(4), index=list('ABCD'))
> ss
Out[552]:
A 0.078013
B 1.707052
C -0.177543
D -1.072017
dtype: float64
> ss['X','Y'] = (99,-99)
Traceback (most recent call last):
...
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"
Same for,
> ss[['X','Y']] = (99,-99)
> ss.loc[['X','Y']] = (99,-99)
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"
The only way I can get this working is a rather clumsy (IMHO),
> ss['X'],ss['Y'] = (99,-99)
> ss
Out[560]:
A 0.078013
B 1.707052
C -0.177543
D -1.072017
X 99.000000
Y -99.000000
dtype: float64
I did think that, perhaps, reindexing the Series to add the new indices prior to assignment might solve the problem. It would, but then I hit an issue trying to change the index.
> ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
> xs = pd.Series([99,-99], index=['X','Y'], name='z')
Here I can concat my 2 Series to create a new one, and I can also concat the Series indices, eg,
> ss.index.append(xs.index)
Index(['A', 'B', 'C', 'D', 'X', 'Y'], dtype='object')
But I can't extend the current index with,
> ss.index = ss.index.append(xs.index)
ValueError: Length mismatch: Expected axis has 4 elements, new values have 6 elements
So, what intuitive leap must I make to understand why the former Series methods don't work, but (what looks like an equivalent) DataFrame method does work?
It makes passing multiple outputs back from a function into new Series elements a bit clunky. I can't 'on the fly' make up new Series index names to insert values into my existing Series object.
I don't think you can directly modify the Series in place to add multiple values at once.
If having a new object is not an issue:
ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')
# new object with updated index
ss = ss.reindex(ss.index.union(xs.index))
ss.update(xs)
Output:
A -0.369182
B -0.239379
C 1.099660
D 0.655264
X 99.000000
Y -99.000000
Name: z, dtype: float64
An in-place alternative using a function:
ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')

def extend(s1, s2):
    s1.update(s2)  # update common indices
    # add the labels that are new to s1
    for idx, val in s2[s2.index.difference(s1.index)].items():
        s1[idx] = val

extend(ss, xs)
Updated ss:
A 0.279925
B -0.098150
C 0.910179
D 0.317218
X 99.000000
Y -99.000000
Name: z, dtype: float64
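For comparison, when the new labels are guaranteed not to overlap the existing index, a shorter sketch is simply to concatenate the two Series (again producing a new object rather than modifying ss in place):
ss = pd.concat([ss, xs])  # index becomes A, B, C, D, X, Y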
While I have accepted #mozway's answer above since it nicely handles extending the Series even when there are possible index conflicts, I'm adding this 'answer' to demonstrate my point about the inconsistency in the extend operation between Series and DataFrame.
If I create my Series as single row DataFrames, as below, I can now extend the 'series' as I expected.
z=pd.Index(['z'])
ss = pd.DataFrame(np.random.randn(1,4), columns=list('ABCD'),index=z)
xs = pd.DataFrame([[99,-99]], columns=['X','Y'],index=z)
ss
Out[619]:
A B C D
z 1.052589 -0.337622 -0.791994 -0.266888
ss[['x','y']] = xs
ss
Out[620]:
A B C D x y
z 1.052589 -0.337622 -0.791994 -0.266888 99 -99
type(ss)
Out[621]: pandas.core.frame.DataFrame
Note that, as a DataFrame, I don't even need a Series for the extend object.
ss[['X','Y']] = [123,-123]
ss
Out[633]:
A B C D X Y
z 0.600981 -0.473031 0.216941 0.255252 123 -123
So I've simply extended the DataFrame, but it's still a DataFrame of 1 row.
I can now either 'squeeze' the DataFrame,
zz1=ss.squeeze()
type(zz1)
Out[624]: pandas.core.series.Series
zz1
Out[625]:
A 1.052589
B -0.337622
C -0.791994
D -0.266888
x 99.000000
y -99.000000
Name: z, dtype: float64
Alternatively, I can use 'iloc[0]' to get a Series directly. Note that 'loc' with a list of labels (e.g. ss.loc[['z']]) returns a DataFrame, not a Series, and would still require 'squeezing'.
zz2=ss.iloc[0]
type(zz2)
Out[629]: pandas.core.series.Series
zz2
Out[630]:
A 1.052589
B -0.337622
C -0.791994
D -0.266888
x 99.000000
y -99.000000
Name: z, dtype: float64
Please note, I'm not a Pandas 'wizard' so there may be other insights that I lack.

Looping to recode variables in python

I'm fairly new to programming and I have a question on using loops to recode variables in a pandas data frame that I was hoping I could get some help with.
I want to recode multiple columns in a pandas data frame from units of seconds to minutes. I've written a simple function in Python and can copy and repeat it for each column, which works, but I wanted to automate this. I appreciate the help.
The ivf.secondsUntilCC.xxx column contains the number of seconds until something happens. I want the new column ivf.minsUntilCC.xxx to be the number of minutes. The data frame name is data.
def f(x, y):
    return x[y] / 60
data['ivf.minsUntilCC.500'] = f(data,'ivf.secondsUntilCC.500')
data['ivf.minsUntilCC.1000'] = f(data,'ivf.secondsUntilCC.1000')
data['ivf.minsUntilCC.2000'] = f(data,'ivf.secondsUntilCC.2000')
data['ivf.minsUntilCC.3000'] = f(data,'ivf.secondsUntilCC.3000')
data['ivf.minsUntilCC.4000'] = f(data,'ivf.secondsUntilCC.4000')
I would use a vectorized approach:
In [27]: df
Out[27]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 906395 854268 701859 979647 914942
1 288577 300394 577555 880370 924162 897984
2 66705 493545 232603 682509 794074 204429
3 747828 504930 379035 29230 410390 287327
4 926553 913360 657640 336139 210202 356649
In [28]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')] /= 60
In [29]: df
Out[29]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 15106.583333 14237.800000 11697.650000 16327.450000 15249.033333
1 288577 5006.566667 9625.916667 14672.833333 15402.700000 14966.400000
2 66705 8225.750000 3876.716667 11375.150000 13234.566667 3407.150000
3 747828 8415.500000 6317.250000 487.166667 6839.833333 4788.783333
4 926553 15222.666667 10960.666667 5602.316667 3503.366667 5944.150000
Setup:
df = pd.DataFrame(np.random.randint(0,10**6,(5,6)),
columns=['X','ivf.minsUntilCC.500', 'ivf.minsUntilCC.1000',
'ivf.minsUntilCC.2000', 'ivf.minsUntilCC.3000',
'ivf.minsUntilCC.4000'])
Explanation:
In [26]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')]
Out[26]:
ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 906395 854268 701859 979647 914942
1 300394 577555 880370 924162 897984
2 493545 232603 682509 794074 204429
3 504930 379035 29230 410390 287327
4 913360 657640 336139 210202 356649
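If you want to keep the question's literal goal of creating new ivf.minsUntilCC.* columns from the existing ivf.secondsUntilCC.* columns (rather than dividing in place), one possible sketch is to loop over the matching column names while keeping the division itself vectorized:
sec_cols = data.columns[data.columns.str.startswith('ivf.secondsUntilCC.')]
for col in sec_cols:
    data[col.replace('secondsUntilCC', 'minsUntilCC')] = data[col] / 60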

Is there a query method or similar for pandas Series (pandas.Series.query())?

The pandas.DataFrame.query() method is of great use for (pre/post-)filtering data when loading or plotting. It comes in particularly handy for method chaining.
I often find myself wanting to apply the same logic to a pandas.Series, e.g. after having called a method such as df.value_counts(), which returns a pandas.Series.
Example
Let's assume there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to sum the points of each player (groupby -> agg), which will return a Series of ~1000 players and their overall points. Applying the .query logic, it would look something like this:
import random

df = pd.DataFrame({
    'Points': [random.choice([1, 3]) for x in range(100)],
    'Player': [random.choice(["A", "B", "C"]) for x in range(100)]})
(df
 .query("Points == 3")
 .Player.value_counts()
 .query("> 14")  # wished-for Series.query, which does not exist
 .hist())
The only solutions I find force me to do an unnecessary assignment and break the method chaining:
points_series = (df
    .query("Points == 3")
    .groupby("Player").size())
points_series[points_series > 100].hist()
Method chaining and the query method help keep the code legible, whereas the subsetting/filtering syntax can get messy quite quickly.
# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape
Please help me out of my dilemma! Thanks
If I understand correctly you can add query("Points > 100"):
df = pd.DataFrame({'Points':[50,20,38,90,0, np.Inf],
'Player':['a','a','a','s','s','s']})
print (df)
Player Points
0 a 50.000000
1 a 20.000000
2 a 38.000000
3 s 90.000000
4 s 0.000000
5 s inf
points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
print (points_series)
a = points_series[points_series > 100]
print (a)
Player
a 108.0
Name: Points, dtype: float64
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})
                   .query("Points > 100"))
print (points_series)
Points
Player
a 108.0
Another solution is Selection By Callable:
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})['Points']
                   .loc[lambda x: x > 100])
print (points_series)
Player
a 108.0
Name: Points, dtype: float64
Edited answer for the edited question:
np.random.seed(1234)
df = pd.DataFrame({
'Points': [np.random.choice([1,3]) for x in range(100)],
'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})
print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
C 19
B 16
Name: Player, dtype: int64
print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
Player
B 16
C 19
dtype: int64
Why not convert from Series to DataFrame, do the querying, and then convert back?
df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"]
Here, .to_frame() converts to DataFrame, while the trailing ["Points"] converts to Series.
The .query() method can then be used consistently whether the Pandas object has one column or several.
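Applied to the original method-chaining example, that round trip might look like this (a sketch; the intermediate frame and the column name "n" exist only for the .query call):
(df.query("Points == 3")
   .groupby("Player").size()
   .to_frame("n")
   .query("n > 15")["n"]
   .hist())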
Instead of query you can use pipe:
s.pipe(lambda x: x[x>0]).pipe(lambda x: x[x<10])

Pandas filter rows based on multiple conditions

I have some values in the risk column that are not Small, Medium, or High. I want to delete the rows whose value is not Small, Medium, or High. I tried the following:
df = df[(df.risk == "Small") | (df.risk == "Medium") | (df.risk == "High")]
But this returns an empty DataFrame. How can I filter them correctly?
I think you want:
df = df[(df.risk.isin(["Small","Medium","High"]))]
Example:
In [5]:
import pandas as pd
df = pd.DataFrame({'risk':['Small','High','Medium','Negligible', 'Very High']})
df
Out[5]:
risk
0 Small
1 High
2 Medium
3 Negligible
4 Very High
[5 rows x 1 columns]
In [6]:
df[df.risk.isin(['Small','Medium','High'])]
Out[6]:
risk
0 Small
1 High
2 Medium
[3 rows x 1 columns]
Another nice and readable approach is the following:
small_risk = df["risk"] == "Small"
medium_risk = df["risk"] == "Medium"
high_risk = df["risk"] == "High"
Then you can use it like this:
df[small_risk | medium_risk | high_risk]
or, to require both conditions at once (note that for mutually exclusive labels like these, the result is empty):
df[small_risk & medium_risk]
You could also use query:
df.query('risk in ["Small","Medium","High"]')
You can refer to variables in the environment by prefixing them with @. For example:
lst = ["Small","Medium","High"]
df.query("risk in @lst")
If the column name is multiple words, e.g. "risk factor", you can refer to it by surrounding it with backticks ` `:
df.query('`risk factor` in @lst')
The query method comes in handy if you need to chain multiple conditions. For example, the outcome of the following filter:
df[df['risk factor'].isin(lst) & (df['value']**2 > 2) & (df['value']**2 < 5)]
can be derived using the following expression:
df.query('`risk factor` in @lst and 2 < value**2 < 5')

subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have R code that subsets nicely:
k1 <- subset(data, Product = p.id & Month < mn & Year == yr, select = c(Time, Product))
Now, I want to do similar stuff in Python. This is what I have so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting..
I am beginning to feel that I am doing this the wrong way. Perhaps there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product = p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
thanks.
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary because of the precedence of the & operator relative to the comparison operators: & is an overloaded bitwise operator, and it binds more tightly than comparisons such as == and <, so without parentheses the expression does not group the way you would expect.
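To make the precedence point concrete, here is a sketch (using the same df, p_id, start_time and end_time as above):
# The unparenthesized form
#     df.Product == p_id & df.Time >= start_time
# is evaluated as the chained comparison
#     (df.Product == (p_id & df.Time)) and ((p_id & df.Time) >= start_time)
# because & binds tighter than == and >=; depending on the dtypes this fails
# with a TypeError or with "The truth value of a Series is ambiguous".
mask = (df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time)
k1 = df.loc[mask, ['Time', 'Product']]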
In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to R's subset, modulo the select argument:
With query() you'd do it like this:
df.query('Product == @p_id and Month < @mn and Year == @yr')[['Time', 'Product']]
Here's a simple example:
In [9]: df = pd.DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': np.random.poisson(100, size=10)})
In [10]: df
Out[10]:
gender price
0 m 89
1 f 123
2 f 100
3 m 104
4 m 98
5 m 103
6 f 100
7 f 109
8 f 95
9 m 87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
gender price
0 m 89
4 m 98
9 m 87
The final query that you're interested in can even take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time','Product']]
No need for data.loc or query, but I do think it is a bit long.
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']
And let's say you want to include products made before 2014. You could write,
df[df['Year'] < 2014]
This returns all the rows where the condition holds. You can chain additional conditions:
df[df['Year'] < 2014][df['Color'] == 'Red']
Then just choose the columns you want as directed above. For instance, the product color and key for the df above,
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
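A single boolean mask does the same selection in one step and avoids the chained indexing above, which can trigger a SettingWithCopyWarning if you later assign into the result; a sketch:
df.loc[(df['Year'] < 2014) & (df['Color'] == 'Red'), ['Product', 'Color']]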
Regarding some points mentioned in previous answers, and to improve readability:
No need for data.loc or query, but I do think it is a bit long.
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.
I like to write such expressions as follows: fewer brackets, faster to type, easier to read. Closer to R, too.
# c is just a convenience helper
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]
Creating an empty DataFrame with known column names:
Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns=Names)
Creating a DataFrame from a CSV file:
df = pd.read_csv('...../file_name.csv')
Creating a dynamic filter to subset a DataFrame:
i = 12
df[df['ActivityID'] <= i]
Creating a dynamic filter to subset the required columns of a DataFrame:
df[df['ActivityID'] == i][['TransactionID','ActivityID']]
