Find rows in a DataFrame that match - python

Suppose I make a DataFrame with pandas:
df = pandas.DataFrame({"a":[1,2,3],"b":[3,4,5]})
Now I have a series
s = pandas.Series([1,3], index=['a','b'])
Finally, my question: how can I check whether s is a row of df, keeping performance in mind?
PS: Ideally the test should return True or False depending on whether s is in df.

You want
df.eq(s).all(axis=1).any()
# True
We start with a row-wise comparison,
df.eq(s)
a b
0 True True
1 False False
2 False False
Find all rows which match fully,
df.eq(s).all(axis=1)
0 True
1 False
2 False
dtype: bool
Return True if any matches were found,
df.eq(s).all(axis=1).any()
# True
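As a minimal sketch, the check above can be wrapped in a helper. The names contains_row and contains_row_numpy are hypothetical, not pandas functions. The NumPy variant assumes s has a value for every column of df; it skips index alignment and may be faster on large frames, though that is worth measuring on your own data.

```python
import pandas as pd


def contains_row(df, s):
    """Return True if at least one row of df equals s in every column."""
    # eq aligns the index of s with the columns of df, then compares row by row
    return bool(df.eq(s).all(axis=1).any())


def contains_row_numpy(df, s):
    """Same test on raw NumPy arrays; assumes s has a value for each column of df."""
    target = s.reindex(df.columns).to_numpy()
    return bool((df.to_numpy() == target).all(axis=1).any())


df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
hit = pd.Series([1, 3], index=["a", "b"])   # equals row 0
miss = pd.Series([1, 4], index=["a", "b"])  # a matches row 0, b matches row 1

print(contains_row(df, hit))         # True
print(contains_row(df, miss))        # False
print(contains_row_numpy(df, hit))   # True
```

The miss case shows why all(axis=1) has to come before any(): each value of miss appears somewhere in df, but never together in the same row.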

Related

How to find error value on pandas dataframe?

After exporting a CSV file from Excel and reading it into a pandas DataFrame, I found one value containing a stray symbol: 2$3.74836730957, where it should be 243.74836730957 (it seems the 4 was mistaken for a $). Is there any way to find values like this and change them to NaN in the DataFrame?
CSV file:
You can use pd.to_numeric in order to report boolean values that denote whether a particular column has only numerical values. In order to check all columns you can do
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
Now if you want to know which specific row(s) are not numerical you can use np.isreal:
df.applymap(np.isreal)
A B C D
item
r1 True True True True
r2 True True True True
r3 True True True False
...
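To also replace the bad values with NaN, as the question asks, one option is to keep the coerced result of pd.to_numeric instead of only testing it. This is a sketch on made-up data: the column names and values below are assumptions, apart from the 2$3.74836730957 example from the question.

```python
import pandas as pd

# Hypothetical frame read from CSV as text; only the "2$3..." value comes from the question
df = pd.DataFrame({
    "A": ["1.5", "2.0", "3.25"],
    "B": ["243.7", "2$3.74836730957", "10"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Cells that were present in the original but became NaN are the bad ones
bad_cells = numeric.isna() & df.notna()
bad_rows = df[bad_cells.any(axis=1)]

print(bad_rows)   # the original rows that contain a non-numeric value
print(numeric)    # the cleaned frame, with NaN in place of the bad value
```

Comparing against df.notna() keeps values that were already missing in the file from being reported as errors.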

How to compress a pandas dataframe, based on their boolean column value?

Given a pandas frame like this:
In:
pokemon yes no ignore
Vulpix True False False
Nidorino False True False
Growlithe False False True
Krokorok False True False
Darumaka False False True
Klefki False True False
Croagunk True False False
What is the correct way to get, for each pokemon, the name of the column that is True as a row value?
Out:
pokemon Val
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
So far I tried with cross tab:
pd.crosstab(df, columns=['yes', 'no', 'ignore'])
However, I am getting a value error:
`ValueError: Shape of passed values`
What is the correct way of getting the previous output?
If those are one-hot encoded such that there is only ever a single 1/True on each row, set_index and dot with the columns.
df = df.set_index('pokemon')
df.dot(df.columns)
pokemon
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
dtype: object
The result above is a Series. To get a DataFrame that matches your output:
df = df.dot(df.columns).to_frame('Val').reset_index()
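An alternative sketch under the same one-hot assumption (exactly one True per row) uses idxmax(axis=1), which returns the label of the first True column in each row. This is a swapped-in technique, not the dot approach from the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "pokemon": ["Vulpix", "Nidorino", "Growlithe", "Krokorok",
                "Darumaka", "Klefki", "Croagunk"],
    "yes":    [True, False, False, False, False, False, True],
    "no":     [False, True, False, True, False, True, False],
    "ignore": [False, False, True, False, True, False, False],
})

# For boolean columns, idxmax(axis=1) picks the first column holding True in each row
out = (
    df.set_index("pokemon")
      .idxmax(axis=1)
      .rename("Val")
      .reset_index()
)
print(out)
```

Note that a row with no True at all would still get a label (the first column), so this is only safe when the one-hot assumption holds.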

How to filter by boolean?

I have a series of booleans extracted and I would like to filter a dataframe from this in Pandas but it is returning no results.
Dataframe
Account mphone rphone BPHONE
0 14999201 3931812 8014059 9992222
1 12980801 4444444 3932929 4279999
2 9999999 3279999 4419999 3938888
Here are the series:
df['BPHONE'].str.startswith(tuple(combined_list))
0 False
1 True
2 False
Name: BPHONE, dtype: bool
df['rphone'].str.startswith(tuple(combined_list))
0 True
1 False
2 True
Name: rphone, dtype: bool
Now below, when I try to filter this, I am getting empty results:
df[(df['BPHONE'].str.startswith(tuple(combined_list))) & (df['rphone'].str.startswith(tuple(combined_list)))]
Account mphone rphone BPHONE
Lastly, when I just use one column, it seems to match and filter by row and not column. Any advice here on how to correct this?
df[(df['BPHONE'].str.startswith(tuple(combined_list)))]
Account mphone rphone BPHONE
1 12980801 4444444 3932929 4279999
I thought that this should just populate BPHONE along the column axis and not the row axis. How would I be able to filter this way?
The output wanted is the following:
Account mphone rphone BPHONE
14999201 3931812 8014059 np.nan
12980801 4444444 np.nan 4279999
99999999 3279999 4419999 np.nan
To explain this, rphone shows True, False, True, so only the True numbers should show. Under False it should not show, or show as NAN.
The output you are expecting is not a filtered result but a conditionally replaced one.
condition = df['BPHONE'].str.startswith(tuple(combined_list))
Use np.where (with numpy imported as np; the pd.np alias has been removed from recent pandas versions):
df['BPHONE'] = np.where(condition, df['BPHONE'], np.nan)
OR
df.loc[~condition, 'BPHONE'] = np.nan
All filters you are applying are functioning correctly:
df['BPHONE'].str.startswith(tuple(combined_list))
0 False
1 True #Only this row will be retained
2 False
The combined filter will return:
df[(df['BPHONE'].str.startswith(tuple(combined_list))) & (df['rphone'].str.startswith(tuple(combined_list)))]
First filter Second filter Combined filter
0 False True False #Not retained
1 True False False #Not retained
2 False True False #Not retained
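The conditional replacement described above can be sketched end to end with Series.where, which keeps values where the condition is True and puts NaN elsewhere. The contents of combined_list are not shown in the question, so the prefixes below are hypothetical values chosen to reproduce the True/False patterns it reports.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["14999201", "12980801", "9999999"],
    "mphone":  ["3931812", "4444444", "3279999"],
    "rphone":  ["8014059", "3932929", "4419999"],
    "BPHONE":  ["9992222", "4279999", "3938888"],
})

# Hypothetical prefixes; the real combined_list is not given in the question
combined_list = ["427", "801", "441"]
prefixes = tuple(combined_list)

masked = df.copy()
for col in ["rphone", "BPHONE"]:
    keep = masked[col].str.startswith(prefixes)
    # where() keeps the value when keep is True and writes NaN otherwise
    masked[col] = masked[col].where(keep)

print(masked)
```

Every row is retained; only the individual cells that fail the prefix test are blanked, which is the difference between masking and filtering with df[condition].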

Python PANDAS: Apply Multi-Line Boolean Criteria Within Group?

I have a dataset with the following general format:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
2,thing_3,place_3
2,thing_7,place_2
2,thing_9,place_2
2,thing_4,place_5
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
What I am trying to accomplish is to apply two boolean criteria within a group that MAY have the criteria values spread across multiple records/lines within a group. And if these criteria exist, do not filter any records from the group. If not, filter out all records for a group.
This is a simplified example. The criteria sets are huge lists which is why I concatenate them with pipes and use str.contains() with regex=True instead of something simpler.
This is what I have come up with so far but I do not think I am even on the right track for handling the possibility of multi-line criteria within groups or returning all when found.
thing_criteria = (df['thing_criteria_field'].str.contains('thing_1|thing2|thing3', regex=True))
place_criteria = (df['place_criteria_field'].str.contains('place_1', regex=True))
df_result = df.groupby('id').filter(lambda x: (thing_criteria & place_criteria).all())
This is the result set I am trying to create from the sample dataset:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
Any advice would be most appreciated!
Try this:
# Build a dataframe indicating whether each row meets
# each individual criterion
all_criteria = [thing_criteria, place_criteria]
cond = pd.DataFrame(all_criteria).T \
.assign(id=df['id'])
# Now group them by id and reduce the truth values
# .any(): test if any row in the group matches a single criterion
# .all(): test if all criteria are met in the group
match = cond.groupby('id').apply(lambda x: x.iloc[:, :-1].any().all())
ids = match[match].index
# Finally, keep the rows whose id matches all criteria
df[df['id'].isin(ids)]
How any().all() works: let's say you have the following groups:
thing_criteria_field place_criteria_field id
0 True False 1
1 False False 1
2 False False 1
3 False True 1
-------------------------------------------------
any: True True ==> all: True
thing_criteria_field place_criteria_field id
4 False False 2
5 False False 2
6 False False 2
7 False False 2
-------------------------------------------------
any: False False ==> all: False
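The same any-then-all logic can also be sketched without groupby().apply, by broadcasting a per-group any back onto the rows with transform. This is an alternative formulation on the sample data from the question, not the answer's own code; the regex patterns are copied verbatim from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "thing_criteria_field": [
        "thing_1", "thing_3", "thing_3", "thing_7",
        "thing_3", "thing_7", "thing_9", "thing_4",
        "thing_1", "thing_2", "thing_3", "thing_4",
    ],
    "place_criteria_field": [
        "place_2", "place_2", "place_2", "place_1",
        "place_3", "place_2", "place_2", "place_5",
        "place_1", "place_6", "place_6", "place_6",
    ],
})

# Row-level criteria (patterns as written in the question)
thing_criteria = df["thing_criteria_field"].str.contains("thing_1|thing2|thing3", regex=True)
place_criteria = df["place_criteria_field"].str.contains("place_1", regex=True)

# Per group: is each criterion met by at least one row? Broadcast back to every row.
thing_in_group = thing_criteria.groupby(df["id"]).transform("any")
place_in_group = place_criteria.groupby(df["id"]).transform("any")

# Keep whole groups where every criterion is met somewhere in the group
result = df[thing_in_group & place_in_group]
print(result)
```

Because each criterion is reduced separately, the thing match and the place match may sit on different rows of the same group, which is the multi-line case the question describes.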

Find values of data frame in another dataframe in python

I have two data frames df_1 contains:
["TP","MP"]
and df_2 contains:
["This is case 12389TP12098","12378MP899 is now resolved","12356DCT is pending"]
I want to search for each value of df_1 in every entry of df_2
and return the entries that match. In this case, those are the two entries that contain TP or MP.
I tried something like this.
df_2.str.contains(df_1)
You need to do it separately for each element of df_1. Pandas will help you:
df_1.apply(df_2.str.contains)
Out:
0 1 2
0 True False False
1 False True False
That's a matrix of all combinations. You can pretty it up:
matches = df_1.apply(df_2.str.contains)
matches.index = df_1
matches.columns = df_2
matches
Out:
This is case 12389TP12098 12378MP899 is now resolved 12356DCT is pending
TP True False False
MP False True False
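If only the matching entries are needed rather than the full matrix, a single regex alternation is one option. This sketch assumes df_1 and df_2 are pandas Series (the question calls them data frames) and that the search terms should be matched literally, hence re.escape.

```python
import re

import pandas as pd

df_1 = pd.Series(["TP", "MP"])
df_2 = pd.Series([
    "This is case 12389TP12098",
    "12378MP899 is now resolved",
    "12356DCT is pending",
])

# Build "TP|MP", escaping each term so regex metacharacters are treated literally
pattern = "|".join(re.escape(term) for term in df_1)

# Boolean mask over df_2, then keep only the entries containing any term
matched = df_2[df_2.str.contains(pattern, regex=True)]
print(matched)
```

This returns the first two entries; the third contains neither TP nor MP as a substring.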
