How to filter by boolean? - python

I have a series of booleans extracted and I would like to filter a dataframe from this in Pandas but it is returning no results.
Dataframe
Account mphone rphone BPHONE
0 14999201 3931812 8014059 9992222
1 12980801 4444444 3932929 4279999
2 9999999 3279999 4419999 3938888
Here are the series:
df['BPHONE'].str.startswith(tuple(combined_list))
0 False
1 True
2 False
Name: BPHONE, dtype: bool
df['rphone'].str.startswith(tuple(combined_list))
0 True
1 False
2 True
Name: rphone, dtype: bool
Now below, when I try to filter this, I am getting empty results:
df[(df['BPHONE'].str.startswith(tuple(combined_list))) & (df['rphone'].str.startswith(tuple(combined_list)))]
Account mphone rphone BPHONE
Lastly, when I just use one column, it seems to match and filter by row and not column. Any advice here on how to correct this?
df[(df['BPHONE'].str.startswith(tuple(combined_list)))]
Account mphone rphone BPHONE
1 12980801 4444444 3932929 4279999
I thought that this should just populate BPHONE along the column axis and not the row axis. How would I be able to filter this way?
The output wanted is the following:
Account mphone rphone BPHONE
14999201 3931812 8014059 np.nan
12980801 4444444 np.nan 4279999
9999999 3279999 4419999 np.nan
To explain this, rphone shows True, False, True, so only the numbers in the True rows should show. Where it is False, the value should not show, or should show as NaN.

The output you are expecting is not a filtered result but a conditionally replaced one.
condition = df['BPHONE'].str.startswith(tuple(combined_list))
Use np.where (with import numpy as np; the pd.np alias is deprecated and is not available in recent pandas versions):
df['BPHONE'] = np.where(condition, df['BPHONE'], np.nan)
OR
df.loc[~condition, 'BPHONE'] = np.nan
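A minimal sketch of the replacement on the question's sample data, applied to both phone columns. The contents of combined_list are not shown in the question, so the prefixes below are an assumption chosen to reproduce the boolean series above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Account": ["14999201", "12980801", "9999999"],
    "mphone": ["3931812", "4444444", "3279999"],
    "rphone": ["8014059", "3932929", "4419999"],
    "BPHONE": ["9992222", "4279999", "3938888"],
})

# Assumed prefixes; the real combined_list is not given in the question.
combined_list = ["801", "441", "427"]
prefixes = tuple(combined_list)

for col in ["rphone", "BPHONE"]:
    condition = df[col].str.startswith(prefixes)
    # Blank out the cells that do not start with any prefix.
    df.loc[~condition, col] = np.nan

print(df)
```

Every row is kept; only the non-matching cells are replaced, which is the shape of the wanted output.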

All filters you are applying are functioning correctly:
df['BPHONE'].str.startswith(tuple(combined_list))
0 False
1 True #Only this row will be retained
2 False
The combined filter will return:
df[(df['BPHONE'].str.startswith(tuple(combined_list))) & (df['rphone'].str.startswith(tuple(combined_list)))]
   First filter  Second filter  Combined filter
0  False         True           False            #Not retained
1  True          False          False            #Not retained
2  False         True           False            #Not retained
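The same point as a small sketch: & keeps a row only when both masks are True for that row, and these two masks never overlap. The two series are typed in by hand from the output shown in the question.

```python
import pandas as pd

bphone_mask = pd.Series([False, True, False])
rphone_mask = pd.Series([True, False, True])

combined = bphone_mask & rphone_mask  # True only where both are True
either = bphone_mask | rphone_mask    # True where at least one is True

print(combined.tolist())
print(either.tolist())
```

Indexing a DataFrame with combined therefore selects no rows, while either would select all three.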

Related

How to find error value on pandas dataframe?

After exporting a CSV file from Excel and reading it into a pandas DataFrame, I found one value that contains a symbol, such as 2$3.74836730957 when it should be 243.74836730957 (it seems a 4 was mistaken for a $). Is there any way to find values like this and change them into NaN in the DataFrame?
CSV file:
You can use pd.to_numeric in order to report boolean values that denote whether a particular column has only numerical values. In order to check all columns you can do
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
Now if you want to know which specific row(s) are not numerical you can use np.isreal:
df.applymap(np.isreal)
A B C D
item
r1 True True True True
r2 True True True True
r3 True True True False
...
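The question also asks how to turn such values into NaN, which the checks above only report. A minimal sketch, assuming the columns were read as strings; the sample frame and the names converted and bad_cells are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["1.5", "2.0", "3.25"],
     "D": ["10.1", "20.2", "2$3.74836730957"]},
    index=["r1", "r2", "r3"],
)

# Values that cannot be parsed as numbers become NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# Cells that were present in the original but failed to parse.
bad_cells = converted.isna() & df.notna()

print(converted)
print(bad_cells)
```

bad_cells marks the exact positions of the malformed entries, so they can be inspected before the coerced frame replaces the original.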

How to compress a pandas dataframe, based on their boolean column value?

Given a pandas frame like this:
In:
pokemon yes no ignore
Vulpix True False False
Nidorino False True False
Growlithe False False True
Krokorok False True False
Darumaka False False True
Klefki False True False
Croagunk True False False
What is the correct way to get, for each pokemon, the name of the column that is True as a row value?
Out:
pokemon Val
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
So far I tried with cross tab:
pd.crosstab(df, columns=['yes', 'no', 'ignore'])
However, I am getting a value error:
`ValueError: Shape of passed values`
What is the correct way of getting the previous output?
If those are one-hot encoded such that there is only ever a single 1/True on each row, set_index and dot with the columns.
df = df.set_index('pokemon')
result = df.dot(df.columns)
pokemon
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
dtype: object
The above is a Series. To get a DataFrame that matches your output:
result = df.dot(df.columns).to_frame('Val').reset_index()
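A self-contained sketch of the dot approach on the first three rows of the sample. The variable names flags, labels and out are only illustrative, and it assumes exactly one True per row as stated above.

```python
import pandas as pd

df = pd.DataFrame({
    "pokemon": ["Vulpix", "Nidorino", "Growlithe"],
    "yes": [True, False, False],
    "no": [False, True, False],
    "ignore": [False, False, True],
})

flags = df.set_index("pokemon")

# Each True contributes its column label, each False an empty string.
labels = flags.dot(flags.columns)

out = labels.to_frame("Val").reset_index()
print(out)
```

If a row could hold more than one True, the labels would be concatenated, so the one-hot assumption matters.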

Split strings in DataFrame and keep only certain parts

I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!
Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split('.', expand=True)
(parts[[1, 2]] == ['13', '1']).all(axis=1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
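One possible cause of the "list index out of range" error is an id with fewer than two periods somewhere in the full dataset; that is an assumption, since the sample rows all split cleanly. With expand=True the missing parts become missing values instead of raising, so the comparison simply yields False for such rows. The "malformed" id below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": ["3.13.1.7-2.1", "3.21.1.8-2.2", "malformed"]})

# A single-character separator is treated as a literal string.
parts = df["id"].str.split(".", expand=True)

# Rows lacking a second or third part compare as False.
df["flag"] = parts[1].eq("13") & parts[2].eq("1")
print(df)
```

This stays vectorized, which should help on millions of rows compared with a Python-level lambda.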
You can do something like this
df['flag'] = df['id'].apply(lambda x: True if x.split('.')[1] == '13' and x.split('.')[2]=='1' else False)
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False

Find rows in a DataFrame that match

Suppose I make a data frame with pandas:
df = pandas.DataFrame({"a":[1,2,3],"b":[3,4,5]})
Now I have a series
s = pandas.Series([1,3], index=['a','b'])
Finally, my question is: how can I tell whether s is a row in df, given that I need to consider performance?
PS: It would be best to get True or False when testing whether s is in df.
You want
df.eq(s).all(axis=1).any()
# True
We start with a row-wise comparison,
df.eq(s)
a b
0 True True
1 False False
2 False False
Find all rows which match fully,
df.eq(s).all(axis=1)
0 True
1 False
2 False
dtype: bool
Return True if any matches were found,
df.eq(s).all(axis=1).any()
# True
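The three steps can be wrapped in a small helper to get a plain True or False, as the question asks. The function name contains_row is hypothetical, and the second series is an extra example that does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

def contains_row(frame, row):
    # Compare every row to `row` (aligned on column labels),
    # then check whether any row matches in all columns.
    return bool(frame.eq(row).all(axis=1).any())

present = contains_row(df, pd.Series([1, 3], index=["a", "b"]))
absent = contains_row(df, pd.Series([1, 4], index=["a", "b"]))
print(present, absent)
```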

Python PANDAS: Apply Multi-Line Boolean Criteria Within Group?

I have a dataset with the following general format:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
2,thing_3,place_3
2,thing_7,place_2
2,thing_9,place_2
2,thing_4,place_5
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
What I am trying to accomplish is to apply two boolean criteria within a group that MAY have the criteria values spread across multiple records/lines within a group. And if these criteria exist, do not filter any records from the group. If not, filter out all records for a group.
This is a simplified example. The criteria sets are huge lists which is why I concatenate them with pipes and use str.contains() with regex=True instead of something simpler.
This is what I have come up with so far but I do not think I am even on the right track for handling the possibility of multi-line criteria within groups or returning all when found.
thing_criteria = df['thing_criteria_field'].str.contains('thing_1|thing_2|thing_3', regex=True)
place_criteria = df['place_criteria_field'].str.contains('place_1', regex=True)
df_result = df.groupby('id').filter(lambda x: (thing_criteria & place_criteria).all())
This is the result set I am trying to create from the sample dataset:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
Any advice would be most appreciated!
Try this:
# Build a DataFrame indicating whether each row meets
# each individual criterion
all_criteria = [thing_criteria, place_criteria]
cond = pd.DataFrame(all_criteria).T \
.assign(id=df['id'])
# Now group the rows by id and reduce the truth values
# .any(): test if any row in the group matches a single criterion
# .all(axis=1): test if all criteria are met in the group
# (this form does not depend on whether groupby passes the id column along)
match = cond.groupby('id').any().all(axis=1)
ids = match[match].index
# Finally, keep the rows whose id matches all criteria
df[df['id'].isin(ids)]
How any().all() works: let's say you have the following groups:
thing_criteria_field place_criteria_field id
0 True False 1
1 False False 1
2 False False 1
3 False True 1
-------------------------------------------------
any: True True ==> all: True
thing_criteria_field place_criteria_field id
4 False False 2
5 False False 2
6 False False 2
7 False False 2
-------------------------------------------------
any: False False ==> all: False
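A runnable sketch of the any-then-all reduction on the sample dataset. Building cond from a dict and grouping by df["id"] directly are my choices for a compact example, not necessarily the answer's exact construction; the pattern uses thing_1, thing_2 and thing_3.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "thing_criteria_field": [
        "thing_1", "thing_3", "thing_3", "thing_7",
        "thing_3", "thing_7", "thing_9", "thing_4",
        "thing_1", "thing_2", "thing_3", "thing_4",
    ],
    "place_criteria_field": [
        "place_2", "place_2", "place_2", "place_1",
        "place_3", "place_2", "place_2", "place_5",
        "place_1", "place_6", "place_6", "place_6",
    ],
})

# One boolean column per criterion, one row per record.
cond = pd.DataFrame({
    "thing": df["thing_criteria_field"].str.contains(
        "thing_1|thing_2|thing_3", regex=True),
    "place": df["place_criteria_field"].str.contains(
        "place_1", regex=True),
})

# any(): is each criterion met somewhere in the group?
# all(axis=1): are all criteria met for that group?
match = cond.groupby(df["id"]).any().all(axis=1)
ids = match[match].index

result = df[df["id"].isin(ids)]
print(result)
```

Group 2 is dropped because none of its rows has place_1, even though one row matches a thing pattern.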
