I have two data frames. df_1 contains:
["TP","MP"]
and df_2 contains:
["This is case 12389TP12098","12378MP899 is now resolved","12356DCT is pending"]
I want to search for each value of df_1 in every entry of df_2
and return the entries that match. In this case, that is the two entries containing TP and MP.
I tried something like this:
df_2.str.contains(df_1)
You need to do it separately for each element of df_1. Pandas will help you:
df_1.apply(df_2.str.contains)
Out:
       0      1      2
0   True  False  False
1  False   True  False
That's a matrix of all combinations. You can pretty it up:
matches = df_1.apply(df_2.str.contains)
matches.index = df_1
matches.columns = df_2
matches
Out:
    This is case 12389TP12098  12378MP899 is now resolved  12356DCT is pending
TP                       True                       False                False
MP                      False                        True                False
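If you only need the matching entries rather than the full matrix, one option is to join the search values into a single alternation pattern. This is a minimal sketch that assumes df_1 and df_2 are pandas Series (as the .str accessor in the answer implies); the names pattern and matched are introduced here for illustration. re.escape is used because str.contains treats its argument as a regular expression by default.

```python
import re
import pandas as pd

df_1 = pd.Series(["TP", "MP"])
df_2 = pd.Series([
    "This is case 12389TP12098",
    "12378MP899 is now resolved",
    "12356DCT is pending",
])

# Build "TP|MP", escaping each value so regex metacharacters are matched literally.
pattern = "|".join(map(re.escape, df_1))

# Keep only the entries of df_2 that contain at least one of the values.
matched = df_2[df_2.str.contains(pattern)]
print(matched)
```

The same boolean mask can also be obtained from the matrix above by reducing it over the search values, for example with matches.any(axis=0) before the index and columns are relabelled.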
After exporting CSV files from Excel and reading them into a pandas DataFrame, I found a value containing a symbol, such as 2$3.74836730957 where it should be 243.74836730957 (it seems a 4 was mistaken for a $). Is there any way to find values like this and change them to NaN in the DataFrame?
CSV file:
You can use pd.to_numeric with errors='coerce' to report, for each column, whether it contains only numeric values. To check all columns at once:
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
To find out which specific cells are not numeric, you can use np.isreal (in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map):
df.applymap(np.isreal)
         A     B     C      D
item
r1    True  True  True   True
r2    True  True  True   True
r3    True  True  True  False
...
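The question also asks how to turn such values into NaN. Here is a minimal sketch of that step, assuming a small hypothetical frame: the column names and all values except the malformed one quoted in the question are made up for illustration, as are the names converted and bad.

```python
import pandas as pd

# Hypothetical sample; only "2$3.74836730957" comes from the question.
df = pd.DataFrame({
    "A": ["1.5", "2$3.74836730957", "3.0"],
    "B": [1, 2, 3],
})

# Convert every column; anything that cannot be parsed as a number becomes NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# Cells that held a value originally but failed to parse.
bad = converted.isna() & df.notna()

# Rows of the original frame that contain at least one such cell.
print(df[bad.any(axis=1)])
print(converted)
```

Comparing against df.notna() keeps values that were already missing in the CSV from being reported as malformed.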
I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!
Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split(r'\.', expand=True)
(parts[[1,2]] == ['13','1']).all(axis=1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
You can do something like this, comparing the second and third parts in one step:
df['flag'] = df['id'].apply(lambda x: x.split('.')[1:3] == ['13', '1'])
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
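On the "list index is out of range" error from the question: indexing the split list with [1] raises that error for any value that has no period in it, which is one plausible cause when the real data has millions of rows. The sketch below assumes that is what happened; the extra "nodots" row is hypothetical and added only to show the behaviour. With expand=True, missing parts become None, and comparing them yields False instead of raising.

```python
import pandas as pd

df = pd.DataFrame({"id": [
    "3.13.1.7-2.1", "3.21.1.8-2.2", "4.20.1.6-2.1",
    "4.8.1.2-2.0", "5.23.1.10-2.2",
    "nodots",  # hypothetical malformed id with no periods
]})

# One column per part; short ids are padded with None.
parts = df["id"].str.split(".", expand=True)

# True only where the second part is '13' and the third part is '1'.
df["flag"] = parts[1].eq("13") & parts[2].eq("1")
print(df)
```

Because this stays within vectorized string methods, it avoids calling a Python lambda per row, which may help at the scale mentioned in the question; it is worth timing on your own data.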
Suppose I make a data frame with pandas:
df = pandas.DataFrame({"a":[1,2,3],"b":[3,4,5]})
Now I have a Series:
s = pandas.Series([1,3], index=['a','b'])
Finally, my question: how can I tell whether s is a row of df, especially with performance in mind?
PS: Ideally the test returns True or False depending on whether s is in df.
You want
df.eq(s).all(axis=1).any()
# True
We start with a row-wise comparison,
df.eq(s)
a b
0 True True
1 False False
2 False False
Find all rows which match fully,
df.eq(s).all(axis=1)
0 True
1 False
2 False
dtype: bool
Return True if any matches were found,
df.eq(s).all(axis=1).any()
# True
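Since the question mentions performance, here is a sketch of the same test done on the underlying NumPy arrays. It assumes every label in s is a column of df; the helper name row_in_frame is made up for illustration. Whether it is actually faster than df.eq(s) depends on the data, so treat it as something to benchmark rather than a guaranteed improvement.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
s = pd.Series([1, 3], index=["a", "b"])

def row_in_frame(frame, row):
    """Return True if some row of `frame` equals `row` on the columns named in row.index."""
    # Select the columns in the same order as the Series labels, then compare raw arrays.
    values = frame[row.index].to_numpy()
    return bool((values == row.to_numpy()).all(axis=1).any())

print(row_in_frame(df, s))
```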
I have a dataset with the following general format:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
2,thing_3,place_3
2,thing_7,place_2
2,thing_9,place_2
2,thing_4,place_5
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
What I am trying to accomplish is to apply two boolean criteria within a group that MAY have the criteria values spread across multiple records/lines within a group. And if these criteria exist, do not filter any records from the group. If not, filter out all records for a group.
This is a simplified example. The criteria sets are huge lists which is why I concatenate them with pipes and use str.contains() with regex=True instead of something simpler.
This is what I have come up with so far but I do not think I am even on the right track for handling the possibility of multi-line criteria within groups or returning all when found.
thing_criteria = (df['thing_criteria_field'].str.contains('thing_1|thing2|thing3', regex=True))
place_criteria = (df['place_criteria_field'].str.contains('place_1', regex=True))
df_result = df.groupby('id').filter(lambda x: (thing_criteria & place_criteria).all())
This is the result set I am trying to create from the sample dataset:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
Any advice would be most appreciated!
Try this:
# Build a dataframe indicating whether each row meets
# each individual criterion
all_criteria = [thing_criteria, place_criteria]
cond = pd.DataFrame(all_criteria).T \
    .assign(id=df['id'])
# Now group by id and reduce the truth values
# .any(): test whether any row in the group matches a given criterion
# .all(): test whether every criterion is matched somewhere in the group
match = cond.groupby('id').apply(lambda x: x.iloc[:, :-1].any().all())
ids = match[match].index
# Finally, keep the rows whose id belongs to a matching group
df[df['id'].isin(ids)]
How any().all() works: let's say you have the following groups:
   thing_criteria_field  place_criteria_field  id
0                  True                 False   1
1                 False                 False   1
2                 False                 False   1
3                 False                  True   1
-------------------------------------------------
any:               True                  True   ==> all: True

   thing_criteria_field  place_criteria_field  id
4                 False                 False   2
5                 False                 False   2
6                 False                 False   2
7                 False                 False   2
-------------------------------------------------
any:              False                 False   ==> all: False
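The same any-then-all idea can also be written with groupby(...).transform("any"), which broadcasts each group's result back to its rows. This is an alternative sketch, not the approach from the answer above; it rebuilds the sample data from the question and keeps the question's patterns unchanged. The names thing_in_group and place_in_group are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "thing_criteria_field": [
        "thing_1", "thing_3", "thing_3", "thing_7",
        "thing_3", "thing_7", "thing_9", "thing_4",
        "thing_1", "thing_2", "thing_3", "thing_4",
    ],
    "place_criteria_field": [
        "place_2", "place_2", "place_2", "place_1",
        "place_3", "place_2", "place_2", "place_5",
        "place_1", "place_6", "place_6", "place_6",
    ],
})

# Row-level criteria, with the patterns exactly as given in the question.
thing_criteria = df["thing_criteria_field"].str.contains("thing_1|thing2|thing3", regex=True)
place_criteria = df["place_criteria_field"].str.contains("place_1", regex=True)

# For every row: does its group contain at least one match for this criterion?
thing_in_group = thing_criteria.groupby(df["id"]).transform("any")
place_in_group = place_criteria.groupby(df["id"]).transform("any")

# Keep whole groups in which both criteria are met somewhere.
df_result = df[thing_in_group & place_in_group]
print(df_result)
```

With a long list of criteria, the two transform lines could be replaced by a loop that ANDs the per-criterion masks together.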
I'm afraid the solution is obvious or the question is a duplicate, but I couldn't find an answer yet: I have a pandas data frame that contains long strings, and I need two substrings to be matched at the same time. I found the "or" version multiple times, but I haven't found the "and" solution yet.
Please assume the following data frame, where the interesting pieces of information, "element type" and "subpart type", are separated by a random string in between:
import pandas as pd
data = pd.DataFrame({"col1":["element1_random_string_subpartA"
, "element2_ran_str_subpartA"
, "element1_some_text_subpartB"
, "element2_some_other_text_subpartB"]})
I'd now like to filter for all rows that contain both element1 and subpartA.
data.col1.str.contains("element1|subpartA")
returns a boolean Series
True
True
True
False
which is the expected result. But I need an "And" combination and
data.col1.str.contains("element1&subpartA")
returns
False
False
False
False
although I'd expect
True
False
False
False
Expressing AND in a regex is not easy; it takes one lookahead per term:
m = data.col1.str.contains(r'(?=.*subpartA)(?=.*element1)')
It is simpler to chain both conditions with &, the element-wise AND:
m = data.col1.str.contains("subpartA") & data.col1.str.contains("element1")
print(m)
0 True
1 False
2 False
3 False
Name: col1, dtype: bool
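If there are more than two substrings, the chained & can be generalized by reducing a list of masks. A minimal sketch, assuming the terms are plain substrings (hence regex=False); the list name terms is introduced here for illustration.

```python
from functools import reduce
import operator

import pandas as pd

data = pd.DataFrame({"col1": [
    "element1_random_string_subpartA",
    "element2_ran_str_subpartA",
    "element1_some_text_subpartB",
    "element2_some_other_text_subpartB",
]})

# Substrings that must all be present in a row.
terms = ["element1", "subpartA"]

# One boolean mask per term, matched literally rather than as a regex.
masks = [data["col1"].str.contains(t, regex=False) for t in terms]

# Combine the masks with element-wise AND.
m = reduce(operator.and_, masks)
print(data[m])
```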