I have a pandas dataframe with about 50 columns and >100 rows. I want to select the columns 'col_x' and 'col_y' for the rows where 'col_z' < m. Is there a simple way to do this, similar to df[df['col_z'] < m] and df[['col_x','col_y']] but combined?
Let's break down your problem. You want to:
1. Filter rows based on some boolean condition.
2. Select a subset of columns from the result.
For the first point, the condition you'd need is -
df["col_z"] < m
For the second requirement, you'd want to specify the list of columns that you need -
["col_x", "col_y"]
How would you combine these two to produce the expected output with pandas? The most straightforward way is to use loc -
df.loc[df["col_z"] < m, ["col_x", "col_y"]]
The first argument selects rows, and the second argument selects columns.
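For instance, a minimal sketch with made-up data using the column names and threshold from your question (col_x, col_y, col_z and m are placeholders):
import pandas as pd

# Toy frame using the question's column names (values are made up)
df = pd.DataFrame({
    "col_x": [1, 2, 3, 4],
    "col_y": [10, 20, 30, 40],
    "col_z": [5, 15, 25, 35],
})
m = 20

# Rows where col_z < m, restricted to col_x and col_y
print(df.loc[df["col_z"] < m, ["col_x", "col_y"]])
   col_x  col_y
0      1     10
1      2     20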
More About loc
Think of this in terms of the relational algebra operations - selection and projection. If you're from the SQL world, this would be a relatable equivalent. The above operation, in SQL syntax, would look like this -
SELECT col_x, col_y   -- projection on columns
FROM df
WHERE col_z < m       -- selection on rows
pandas loc allows you to specify index labels for selecting rows. For example, if you have a dataframe -
col_x col_y
a 1 4
b 2 5
c 3 6
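If you want to follow along, this example frame can be built like so (a minimal sketch):
import pandas as pd

# Example frame with a string index, matching the table above
df = pd.DataFrame({"col_x": [1, 2, 3], "col_y": [4, 5, 6]}, index=["a", "b", "c"])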
To select rows a and c, and the column col_x, you'd use -
df.loc[['a', 'c'], ['col_x']]
col_x
a 1
c 3
Alternatively, to select by a boolean condition (using a Series/array of bool values, as your original question asks), here picking the rows where col_x is odd -
df.loc[(df.col_x % 2).ne(0), ['col_y']]
col_y
a 4
c 6
In more detail, df.col_x % 2 computes the remainder of each value divided by 2. The ne(0) then compares each result to 0 and returns True where it is not equal (which selects all the odd numbers). Here's what that expression results in -
(df.col_x % 2).ne(0)
a True
b False
c True
Name: col_x, dtype: bool
Further Reading
10 Minutes to Pandas - Selection by Label
Indexing and selecting data
Boolean indexing
Selection with .loc in python
pandas loc vs. iloc vs. ix vs. at vs. iat?
Related
I have a dataframe (called mydf_1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem is with your logical operator.
You want to keep the rows whose Date is neither '2020-05-01' nor '2020-05-04', so the two conditions have to be combined with AND, not OR: with |, every row passes, because any given date differs from at least one of the two values.
Note that on pandas Series the conditions are combined with the element-wise operators & and | rather than the Python keywords and/or; the corrected expression is shown below.
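Concretely, keeping your original style of writing both comparisons out, only the operator changes:
# Keep rows whose Date differs from both values
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') & (mydf_1['Date'] != '2020-05-04')]
my_df2.head()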
You can also use isin with the negation operator ~:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
The short explanation of your mistake (AND vs. OR) was already given by kanmaytacker.
Here are a few additional recommendations:
Indexing in pandas:
By label: .loc
By position: .iloc
Label-based selection also works without .loc, but it is slower because it is performed as chained operations instead of a single internal operation (see here). Also, with .loc you can select on more than one axis at a time.
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
The indexer here is a boolean mask. This is how numpy works and, as a consequence, pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool
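For reference, a frame like the following reproduces the mask and the selection shown above (the values in the filtered-out rows are only illustrative):
import pandas as pd

# Only row 2 satisfies (df['a']!=4) & (df['a']!=1); the other values are made up
df = pd.DataFrame({'a': [4, 1, 0], 'b': [7, 9, 4], 'c': [5, 8, 6]})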
I have a dataframe that contains three columns: two define the start and end of a period of time (a window) and another which contains an array of individual timepoints. I would like to determine if any of the individual points are within the window's start and end (the two other columns). The ideal output would be True/False for each row.
I can iterate through each row of the dataframe, extract the timepoints and start_window and end_window times and determine this one row at a time, but I was looking for a faster (no-loop) option.
Example of dataframe
row start_window end_window times (numpy array)
0 307.110309 307.710309 [307.48857, 307.6031]
1 309.140340 311.900309 [315.23134]
...
The output based on the above dataframe would be:
True
False
One way to do this is to use pd.DataFrame.apply:
df.apply(lambda x: any(x['start_window'] < i < x['end_window'] for i in x['times']), axis=1)
Output:
0 True
1 False
dtype: bool
Let us do it vectorized. Expand the arrays into their own columns, then compare each column against that row's window, aligning on the index (axis=0):
s = pd.DataFrame(df.times.tolist(), index=df.index)
(s.gt(df.start_window, axis=0) & s.lt(df.end_window, axis=0)).any(axis=1)
Out[277]:
0 True
1 False
dtype: bool
Here is another efficient solution.
t_max = df["times"].apply(max)
t_min = df["times"].apply(min)
out = (t_max > df["start_window"]) & (t_min < df["end_window"])
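Here is how it behaves on the example data from the question (a small sketch; note that this checks whether the window overlaps the [min, max] span of the timepoints, which matches "any point inside the window" as long as no gap between points straddles the whole window):
import numpy as np
import pandas as pd

# The example frame from the question, with one numpy array per row
df = pd.DataFrame({
    "start_window": [307.110309, 309.140340],
    "end_window":   [307.710309, 311.900309],
    "times": [np.array([307.48857, 307.6031]), np.array([315.23134])],
})

t_max = df["times"].apply(max)
t_min = df["times"].apply(min)
out = (t_max > df["start_window"]) & (t_min < df["end_window"])
print(out)
0     True
1    False
dtype: bool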
I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
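If you do need the second rule, one option (a sketch; it assumes non-numeric strings such as 'True' or 'blah' should simply never match the rule) is to convert value on the fly with pd.to_numeric and errors='coerce':
# Non-numeric strings become NaN, and NaN comparisons evaluate to False,
# so those rows are never dropped by this rule
numeric_value = pd.to_numeric(df.value, errors='coerce')
rule2 = df.property.str.startswith('propB') & (numeric_value < 4)
df = df[~rule2]  # Keep everything that does NOT match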
For the first one:
df = df.drop(df[df.property.str.startswith('propA') & (df.value != 'True')].index)
and the other one:
df = df.drop(df[df.property.str.startswith('propB') & (pd.to_numeric(df.value, errors='coerce') < 4)].index)
a = np.array([[1.,2.,3.],
[3.,4.,2.],
[8.,1.,3.]])
b = [8.,1.]
c = a[np.isclose(a[:,0:2],b)]
print(c)
I want to select full rows in a based on only a few columns. My attempt is above.
It works if I include the last column too in that condition, but I don't care about the last column. How do I select rows with 3 columns, based on a condition on 2?
Compare with np.isclose using the sliced version of a, then look for rows where all columns match, using np.all or np.logical_and.reduce. Finally, index into the input array to get the output.
Hence, two solutions -
a[np.isclose(a[:,:2],b).all(axis=1)]
a[np.logical_and.reduce( np.isclose(a[:,:2],b), axis=1)]
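With the arrays from the question, both give the same result:
import numpy as np

a = np.array([[1., 2., 3.],
              [3., 4., 2.],
              [8., 1., 3.]])
b = [8., 1.]

# Keep rows whose first two columns are close to b
print(a[np.isclose(a[:, :2], b).all(axis=1)])
[[8. 1. 3.]]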
Python 2.7, Pandas 0.18.
I have a DataFrame, and I have methods that select a subset of the rows via a criterion parameter. I'd like to know a more idiomatic way to write a criterion that matches all rows.
Here's a very simple example:
import pandas as pd
def apply_to_matching(df,criterion):
df.loc[criterion,'A'] = df[criterion]['A']*df[criterion]['B']
df = pd.DataFrame({'A':[1,2,3,4],'B':[10,100,1000,10000]})
criterion = (df['A']<3)
result = apply_to_matching(df,criterion)
print df
The output would be:
A B
0 10 10
1 200 100
2 3 1000
3 4 10000
because the criterion applies to only the first two rows.
I would like to know the idiomatic way to create a criterion that selects all rows of the DataFrame.
This could be done by adding a column of all true values to the DataFrame:
# Add a column
df['AllTrue']=True
criterion = df['AllTrue']
result = apply_to_matching(df,criterion)
print df.drop('AllTrue',axis=1)
The output is:
A B
0 10 10
1 200 100
2 3000 1000
3 40000 10000
but that approach adds a column to my DataFrame, which I have to filter out later to not get it in my output.
So, is there a more idiomatic way to do this in Pandas? One which does not require me to know anything about the column names, and not change the DataFrame?
When everything should be True, boolean indexing would require a Series of True values. With the code you have above, another way to look at it is that the criterion argument can also accept slices. Selecting all rows with a slice looks like df.loc[:, 'A']. Since you need to pass the criterion as an argument to the apply_to_matching function, use the slice builtin:
apply_to_matching(df, slice(None, None))
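Alternatively, if you prefer that criterion always be a boolean mask (so apply_to_matching never has to care whether it got a slice or a Series), a sketch using a Series of True built from the frame's own index also works, and it touches neither the columns nor the data:
import pandas as pd

# Boolean criterion that matches every row, without adding a column
criterion = pd.Series(True, index=df.index)
apply_to_matching(df, criterion)
print(df)
       A      B
0     10     10
1    200    100
2   3000   1000
3  40000  10000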