Run logical expressions against a pandas DataFrame - Python

I'm trying to select rows from a pandas DataFrame by applying a condition to a column (in the form of a logical expression).
A sample data frame looks like:
id userid code
0 645382311 12324234234
1 645382311 -2434234242
2 645382312 32536365654
3 645382312 12324234234
...
For example, I expect the following result when applying these logical expressions to the column 'code':
case 1: (12324234234 OR -2434234242) AND NOT 32536365654
case 2: (12324234234 AND -2434234242) OR NOT 32536365654
Both cases must give the result:
userid: 645382311
The logic above says:
For case 1 - give me only those userids that have at least one of the values (12324234234 OR -2434234242) and do not have 32536365654 anywhere in the data frame.
For case 2 - I need only those userids that either have both codes (12324234234 AND -2434234242) in the data frame, or do not have 32536365654 at all.
A statement like the one below returns an empty DataFrame:
flt = df[(df.code == 12324234234) & (df.code == -2434234242)]
print("flt: ", flt)
Result (and it makes sense):
flt: Empty DataFrame
I would appreciate any hints on how to handle such cases.

As a simple approach, I would transform your sample table into a boolean presence matrix, which would then allow you to perform the logic you need:
import pandas
sample = pandas.DataFrame([[645382311, 12324234234], [645382311, -2434234242], [645382312, 32536365654], [645382312, 12324234234]], columns=['userid', 'code'])
# Add a column of True values
sample['value'] = True
# Pivot to a boolean presence matrix, drop the MultiIndex level and make sure the dtype is bool
presence = sample.pivot(index='userid', columns='code').fillna(False)['value'].astype(bool)
# Perform desired boolean tests
case1 = (presence[12324234234] | presence[-2434234242]) & ~(presence[32536365654])
case2 = (presence[12324234234] & presence[-2434234242]) | ~(presence[32536365654])
The case variables will contain the boolean test result for each userid.
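If you need the matching userids rather than the boolean Series, one way (building on the presence matrix above) is to select the True entries of each result:
# userids satisfying each case; both give [645382311] for the sample data
case1_users = case1[case1].index.tolist()
case2_users = case2[case2].index.tolist()
print(case1_users, case2_users)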

Related

How to filter on a pandas dataframe using contains against a list of columns, if I don't know which columns are present?

I want to filter my dataframe to look for columns containing a known string.
I know you can do something like this:
summ_proc = summ_proc[
    summ_proc['data.process.name'].str.contains(indicator) |
    summ_proc['data.win.eventdata.processName'].str.contains(indicator) |
    summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) |
    summ_proc['syscheck.audit.process.name'].str.contains(indicator)
]
where I'm using the | operator to check against multiple columns. But there are cases where a certain column name isn't present. So 'data.process.name' might not be present every time.
I tried the following implementation:
summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
And that works. But I'm not sure how I can apply the OR operator to this lambda function.
I want all the rows where either data.process.name or data.win.eventdata.processName or data.win.eventdata.logonProcessName or syscheck.audit.process.name contains the indicator.
EDIT:
I tried the following approach, where I created individual frames and concatenated all the frames.
summ_proc1 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
summ_proc2 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.processName'].str.contains(indicator) if 'data.win.eventdata.processName' in summ_proc.columns else summ_proc)]
summ_proc3 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) if 'data.win.eventdata.logonProcessName' in summ_proc.columns else summ_proc)]
frames = [summ_proc1, summ_proc2, summ_proc3]
result = pd.concat(frames)
This works, but I'm curious if there's a better more pythonic approach? Or if this current method will cause more downstream issues?
It should work with something like this:
import numpy as np
columns = ['data.process.name', 'data.win.eventdata.processName']
# filter columns that are in summ_proc
available_columns = [c for c in columns if c in summ_proc.columns]
# array of Boolean values indicating if c contains indicator
ss = [summ_proc[c].str.contains(indicator) for c in available_columns]
# reduce without '|' by using 'np.logical_or'
indexer = np.logical_or.reduce(ss)
result = summ_proc[indexer]
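An equivalent, pandas-only way to combine the same per-column masks (a sketch; it assumes at least one of the listed columns is present, so that ss is non-empty):
import pandas as pd
# concatenate the masks column-wise and keep rows where any of them is True
indexer = pd.concat(ss, axis=1).any(axis=1)
result = summ_proc[indexer]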

How can I drop several rows from my Dataframe?

I have a dataframe (called my_df1) and want to drop several rows based on certain dates. How can I create a new dataframe (my_df2) without the dates '2020-05-01' and '2020-05-04'?
I tried the following which did not work as you can see below:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') | (mydf_1['Date'] != '2020-05-04')]
my_df2.head()
The problem is with your logical operator.
You should be using AND (&) here instead of OR (|), since you want to keep only the rows whose date is neither '2020-05-01' nor '2020-05-04'.
With |, every row satisfies at least one of the two "not equal" conditions, so nothing is filtered out.
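For example, the original filter works once | is replaced with &:
my_df2 = mydf_1[(mydf_1['Date'] != '2020-05-01') & (mydf_1['Date'] != '2020-05-04')]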
You can also use isin with the negation sign ~:
dates=['2020-05-01', '2020-05-04']
my_df2 = mydf_1[~mydf_1['Date'].isin(dates)]
The short explanation of your mistake with AND and OR was already addressed by kanmaytacker.
Following are a few additional recommendations:
Indexing in pandas:
By label: .loc
By integer position: .iloc
Selecting by label also works without .loc, but it is slower because it is composed of chained operations instead of a single internal operation. Also, with .loc you can select on more than one axis at a time.
# example with rows. Same logic for columns or additional axis.
df.loc[(df['a']!=4) & (df['a']!=1),:] # ".loc" is the only addition
>>>
a b c
2 0 4 6
Your indexer is a boolean array. This is true for NumPy and, as a consequence, for pandas too.
(df['a']!=4) & (df['a']!=1)
>>>
0 False
1 False
2 True
Name: a, dtype: bool
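The same boolean-mask indexing works directly on NumPy arrays, which is where pandas inherits the behaviour from (a minimal illustration; the values are made up to match the mask above):
import numpy as np
arr = np.array([4, 1, 0])
mask = (arr != 4) & (arr != 1)  # array([False, False,  True])
print(arr[mask])                # [0]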

Why does using "==" return a Series instead of bool in pandas?

I just can't figure out what "==" means at the second line:
- It is not a test, there is no if statement...
- It is not a variable declaration...
I've never seen this before; the thing is, data.categ == cat is a pandas Series and not a test...
for cat in data["categ"].unique():
    subset = data[data.categ == cat]  # Create the subsample
    print("-" * 20)
    print('Catégorie : ' + cat)
    print("moyenne:\n", subset['montant'].mean())
    print("mediane:\n", subset['montant'].median())
    print("mode:\n", subset['montant'].mode())
    print("VAR:\n", subset['montant'].var())
    print("EC:\n", subset['montant'].std())
    plt.figure(figsize=(5, 5))
    subset["montant"].hist(bins=30)  # Create the histogram
    plt.show()  # Display the histogram
It is testing each element of data.categ for equality with cat. That produces a vector of True/False values. This is passed as an indexer to data[], which returns the rows from data that correspond to the True values in the vector.
To summarize, the whole expression returns the subset of rows from data where the value of data.categ equals cat.
(It seems possible the whole operation could be done more elegantly using data.groupby('categ').apply(someFunc).)
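For reference, a minimal sketch of that groupby idea, using the column names from the question (mode is left out because it can return several values per group):
stats = data.groupby('categ')['montant'].agg(['mean', 'median', 'var', 'std'])
print(stats)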
It creates a boolean Series indicating where data.categ is equal to cat. With this boolean mask you can filter your dataframe; in other words, subset will hold all records where categ equals the value stored in cat.
This is an example using numeric data
import numpy as np
import pandas as pd

np.random.seed(0)
a = np.random.choice(np.arange(2), 5)
b = np.random.choice(np.arange(2), 5)
df = pd.DataFrame(dict(a = a, b = b))
df[df.a == 0].head()
# a b
# 0 0 0
# 2 0 0
# 4 0 1
df[df.a == df.b].head()
# a b
# 0 0 0
# 2 0 0
# 3 1 1
Yes, it is a test. Boolean expressions are not restricted to if statements.
It looks as if data is a data frame (pandas). The expression used as a data frame index is how pandas denotes a selector or filter. This says to select every row in which the field categ matches the variable cat (apparently a pre-defined variable). This collection of rows becomes a new data frame, subset.
data.categ == cat will return a boolean list that is used to filter your dataframe, keeping only the rows where the boolean is True.
Booleans are used in many situations, not only in if statements.
Here you are comparing data.categ against cat, the element you are currently iterating over from data["categ"].unique(); the rows where they are equal form the subset that the rest of the loop works on.

Pandas - how to filter a dataframe by regex comparisons on multiple column values

I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
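If you do need the second rule, one option (a sketch, not part of the original answer) is to coerce value to numbers first, turning non-numeric entries into NaN so they never match the rule:
import pandas as pd
value_num = pd.to_numeric(df.value, errors='coerce')  # 'True', 'blah', ... become NaN
rule2 = df.property.str.startswith('propB') & (value_num < 4)
df = df[~rule2]  # Keep everything that does NOT match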
For the first one:
df = df.drop(df[df.property.str.startswith('propA') & (df.value != 'True')].index)
and the other one (converting value to a number first, since it is stored as a string):
df = df.drop(df[df.property.str.startswith('propB') & (pd.to_numeric(df.value, errors='coerce') < 4)].index)

Python - Pandas - Groupby conditional on column values in group

I have a dataframe with the following structure with columns group_, vals_ and dates_.
I would like to perform a groupby operation on group_ and subsequently output for each group a statistic conditional on dates. For instance, the mean of all vals_ within a group whose associated date is below some date.
I tried
df_.groupby(group_).agg(lambda x: x[x['date_']< some_date][vals_].mean())
But this fails. I believe it is because x is not a dataframe but a series. Is this correct? Is it possible to achieve what I am trying to achieve here with groupby?
You can write it differently:
def summary(sub_df):
    bool_before = sub_df["date_"] < some_date
    bool_after = sub_df["date_"] > some_date
    before = sub_df.loc[bool_before, vals_].mean()
    after = sub_df.loc[bool_after, vals_].mean()
    overall = sub_df.loc[:, vals_].mean()
    return pd.Series({"before": before, "after": after, "overall": overall})
result = df_.groupby(group_).apply(summary)
The result is a data frame containing 3 mean values for before/after/overall.
If you require additional summary statistics, you can supply them within the summary function.
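A quick, self-contained usage sketch (the column names, values and some_date below are made up purely for illustration):
import pandas as pd

df_ = pd.DataFrame({
    "group_": ["a", "a", "b", "b"],
    "vals_": [1.0, 2.0, 3.0, 4.0],
    "date_": pd.to_datetime(["2020-01-01", "2020-03-01", "2020-01-15", "2020-02-20"]),
})
group_, vals_ = "group_", "vals_"
some_date = pd.Timestamp("2020-02-01")

result = df_.groupby(group_).apply(summary)
print(result)
#    before  after  overall
# a     1.0    2.0      1.5
# b     3.0    4.0      3.5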
