check if syntax in pandas column meets certain criteria

I have the dataframe underneath:
df = pd.DataFrame(np.array(['YM.296','MM.305','VO.081.019','VO.081.016','AM.081.002.001','AM081','SR.082','VO.081.012.001','VO.081.012.003']))
I want to know in which rows the value follows a pattern like 'XX.222.333' (example): two letters, then a period ('.'), then three digits, then another period, then three more digits.
The desired outcome looks as follows:
tf = pd.DataFrame(np.array([False,False,True,True,False,False,False,False, False]))
Is there a fast and pythonic way to do this?

You can do that with str.contains and a regex:
df[0].str.contains(r'^[A-Z]{2}\.\d{3}\.\d{3}$', regex=True)
Outputs:
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
Name: 0, dtype: bool
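The regex reads as follows: ^ anchors the start of the string, [A-Z]{2} matches two uppercase letters, \. matches a literal period, \d{3} matches three digits, and $ anchors the end of the string.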
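As an alternative sketch, Series.str.fullmatch (available in pandas 1.1 and later) requires the whole string to match, so the ^ and $ anchors can be dropped. The variable name mask is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(['YM.296', 'MM.305', 'VO.081.019', 'VO.081.016',
                   'AM.081.002.001', 'AM081', 'SR.082',
                   'VO.081.012.001', 'VO.081.012.003'])

# fullmatch only succeeds if the entire string fits the pattern,
# so no explicit ^ / $ anchors are needed.
mask = df[0].str.fullmatch(r'[A-Z]{2}\.\d{3}\.\d{3}')
print(mask.tolist())
# [False, False, True, True, False, False, False, False, False]
```

The result is a boolean Series, so df[mask] selects the matching rows directly.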

Related

Split strings in DataFrame and keep only certain parts

I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!

Let's use str.split to get the parts, then you can compare:
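# Split each id on literal periods into separate columns
parts = df['id'].str.split(r'\.', expand=True)
# True where the second part is '13' and the third part is '1'
(parts[[1, 2]] == ['13', '1']).all(axis=1)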
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
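One possible cause of the "list index out of range" error in the question, which the question does not confirm, is that some rows of the real dataset contain fewer than two periods. A minimal sketch that tolerates such rows uses the .str[i] accessor, which yields NaN instead of raising when a list is too short. The 'malformed' row and the flag column name are hypothetical additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1',
                          '4.8.1.2-2.0', '5.23.1.10-2.2',
                          'malformed']})  # hypothetical row with no periods

# Each element becomes a list of parts; a single-character
# separator is treated as a literal string.
parts = df['id'].str.split('.')

# .str[i] returns NaN where the list has no element i,
# and NaN never equals '13' or '1', so those rows come out False.
df['flag'] = parts.str[1].eq('13') & parts.str[2].eq('1')
print(df['flag'].tolist())
# [True, False, False, False, False, False]
```

Both comparisons are vectorized, which should matter on a dataset with millions of rows.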

You can do something like this:
df['flag'] = df['id'].apply(lambda x: True if x.split('.')[1] == '13' and x.split('.')[2] == '1' else False)
Output:
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False

You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False

Find rows in a DataFrame that match

Suppose I make a DataFrame with pandas:
df = pandas.DataFrame({"a":[1,2,3],"b":[3,4,5]})
Now I have a series
s = pandas.Series([1,3], index=['a','b'])
My question is: how can I check whether s is a row of df, keeping performance in mind?
P.S. Ideally the test returns True or False depending on whether s is in df.

You want
df.eq(s).all(axis=1).any()
# True
We start with a row-wise comparison,
df.eq(s)
a b
0 True True
1 False False
2 False False
Find all rows which match fully,
df.eq(s).all(axis=1)
0 True
1 False
2 False
dtype: bool
Return True if any matches were found,
df.eq(s).all(axis=1).any()
# True
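The same idea can be wrapped in a small helper. The sketch below assumes you only want to compare the columns named in the Series index, so extra columns in the frame are ignored. The function name contains_row and the extra column "c" are hypothetical.

```python
import pandas as pd

def contains_row(frame, row):
    """Return True if some row of `frame` equals `row`
    on the columns listed in row.index."""
    # Restrict to the columns of interest, compare element-wise,
    # require a full match per row, then check for any such row.
    return bool(frame[row.index].eq(row).all(axis=1).any())

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [7, 8, 9]})

print(contains_row(df, pd.Series([1, 3], index=["a", "b"])))  # True
print(contains_row(df, pd.Series([1, 4], index=["a", "b"])))  # False
```

Without the column restriction, eq aligns the Series on column labels, so a column missing from s would compare as False and no row could match fully.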

Python PANDAS: Apply Multi-Line Boolean Criteria Within Group?

I have a dataset with the following general format:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
2,thing_3,place_3
2,thing_7,place_2
2,thing_9,place_2
2,thing_4,place_5
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
What I am trying to accomplish is to apply two boolean criteria within a group that MAY have the criteria values spread across multiple records/lines within a group. And if these criteria exist, do not filter any records from the group. If not, filter out all records for a group.
This is a simplified example. The criteria sets are huge lists which is why I concatenate them with pipes and use str.contains() with regex=True instead of something simpler.
This is what I have come up with so far but I do not think I am even on the right track for handling the possibility of multi-line criteria within groups or returning all when found.
thing_criteria = df['thing_criteria_field'].str.contains('thing_1|thing_2|thing_3', regex=True)
place_criteria = df['place_criteria_field'].str.contains('place_1', regex=True)
df_result = df.groupby('id').filter(lambda x: (thing_criteria & place_criteria).all())
This is the result set I am trying to create from the sample dataset:
id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
Any advice would be most appreciated!

Try this:
# Build a dataframe indicating whether each row meets
# each of the individual criteria
all_criteria = [thing_criteria, place_criteria]
cond = pd.DataFrame(all_criteria).T \
.assign(id=df['id'])
# Now group them by id and reduce the truth values
# .any(): test whether any row in the group matches a given criterion
# .all(): test whether every criterion is met somewhere in the group
match = cond.groupby('id').apply(lambda x: x.iloc[:, :-1].any().all())
ids = match[match].index
# Finally, keep the rows whose id belongs to a matching group
df[df['id'].isin(ids)]
How any().all() works: let's say you have the following groups:
   thing_criteria_field  place_criteria_field  id
0                  True                 False   1
1                 False                 False   1
2                 False                 False   1
3                 False                  True   1
-------------------------------------------------
any:               True                  True   ==> all: True

   thing_criteria_field  place_criteria_field  id
4                 False                 False   2
5                 False                 False   2
6                 False                 False   2
7                 False                 False   2
-------------------------------------------------
any:              False                 False   ==> all: False
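For reference, here is a self-contained sketch of the same any-then-all logic on the sample data from the question, written with groupby(...).transform('any') instead of apply. It assumes the intended pattern is 'thing_1|thing_2|thing_3' (the question's pattern omits some underscores); the names thing, place, keep and result are illustrative.

```python
from io import StringIO

import pandas as pd

csv = """id,thing_criteria_field,place_criteria_field
1,thing_1,place_2
1,thing_3,place_2
1,thing_3,place_2
1,thing_7,place_1
2,thing_3,place_3
2,thing_7,place_2
2,thing_9,place_2
2,thing_4,place_5
3,thing_1,place_1
3,thing_2,place_6
3,thing_3,place_6
3,thing_4,place_6
"""
df = pd.read_csv(StringIO(csv))

# Row-level criteria (pattern assumed, see the note above)
thing = df['thing_criteria_field'].str.contains('thing_1|thing_2|thing_3')
place = df['place_criteria_field'].str.contains('place_1')

# Broadcast "any row in this group matches" back to every row,
# then require both criteria to be satisfied within the group.
keep = (thing.groupby(df['id']).transform('any')
        & place.groupby(df['id']).transform('any'))

result = df[keep]
print(result['id'].unique().tolist())
# [1, 3]
```

Group 2 is dropped because none of its rows contains place_1, even though one row matches the thing criterion.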

Using str.contains to look for two substrings with pandas in python

I'm afraid the solution is obvious or the question is a duplicate, but I couldn't find an answer yet: I have a pandas data frame that contains long strings, and I need two substrings to be matched at the same time. I found the "or" version multiple times, but I haven't found the "and" solution yet.
Please assume the following data frame, where the interesting information ("element type" and "subpart type") is separated by arbitrary text in between:
import pandas as pd
data = pd.DataFrame({"col1":["element1_random_string_subpartA"
, "element2_ran_str_subpartA"
, "element1_some_text_subpartB"
, "element2_some_other_text_subpartB"]})
I'd now like to filter for all lines that contain element1 and subpartA.
data.col1.str.contains("element1|subpartA")
returns a boolean Series
True
True
True
False
which is the expected result. But I need an "And" combination and
data.col1.str.contains("element1&subpartA")
returns
False
False
False
False
although I'd expect
True
False
False
False

A logical AND is awkward in regex; it needs lookaheads:
m = data.col1.str.contains(r'(?=.*subpartA)(?=.*element1)')
It is simpler to chain both conditions with &, the element-wise AND:
m = data.col1.str.contains("subpartA") & data.col1.str.contains("element1")
print (m)
0 True
1 False
2 False
3 False
Name: col1, dtype: bool
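When there are more than two substrings, the chained & can be generalized. The sketch below folds one str.contains mask per term with functools.reduce, and passes regex=False on the assumption that the terms are plain substrings rather than patterns. The names terms and mask are illustrative.

```python
import operator
from functools import reduce

import pandas as pd

data = pd.DataFrame({"col1": ["element1_random_string_subpartA",
                              "element2_ran_str_subpartA",
                              "element1_some_text_subpartB",
                              "element2_some_other_text_subpartB"]})

terms = ["element1", "subpartA"]  # every term must be present

# Build one boolean mask per term and AND them together element-wise.
mask = reduce(operator.and_,
              (data["col1"].str.contains(t, regex=False) for t in terms))
print(mask.tolist())
# [True, False, False, False]
```

data[mask] then returns only the rows that contain every term.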

pandas find strings in common among Series and return keywords

I would like to improve this previous question about searching strings in pandas Series based on a Series of keywords. My question now is how to get the keywords found in the DataFrame rows as a new column. The Keywords Series "w" is:
Skilful
Wilful
Somewhere
Thing
Strange
and the DataFrame "df" is:
User_ID;Tweet
01;hi all
02;see you somewhere
03;So weird
04;hi all :-)
05;next big thing
06;how can i say no?
07;so strange
08;not at all
The following solution worked well for masking the DataFrame:
import re
r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)
masked = map(bool, map(r.match, df['Tweet']))
df['Tweet_masked'] = masked
and return this:
User_ID Tweet Tweet_masked
0 1 hi all False
1 2 see you somewhere True
2 3 So weird False
3 4 hi all :-) False
4 5 next big thing True
5 6 how can i say no? False
6 7 so strange True
7 8 not at all False
Now I'm looking for a result like this:
User_ID;Tweet;Keyword
01;hi all;None
02;see you somewhere;somewhere
03;So weird;None
04;hi all :-);None
05;next big thing;thing
06;how can i say no?;None
07;so strange;strange
08;not at all;None
Thanks in advance for your support.

How about replacing
masked = map(bool, map(r.match, df['Tweet']))
with
masked = [m.group(1) if m else None for m in map(r.match, df['Tweet'])]
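A vectorized variant of the same idea, sketched with Series.str.extract: it returns the first keyword found in each tweet and NaN (rather than None) where nothing matches. The keywords are passed through re.escape on the assumption that they should be treated as literal text; the column name Keyword follows the desired output in the question.

```python
import re

import pandas as pd

w = pd.Series(["Skilful", "Wilful", "Somewhere", "Thing", "Strange"])
df = pd.DataFrame({
    "User_ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Tweet": ["hi all", "see you somewhere", "So weird", "hi all :-)",
              "next big thing", "how can i say no?", "so strange",
              "not at all"],
})

# One capture group containing all keywords as alternatives
pattern = "(" + "|".join(map(re.escape, w)) + ")"

# expand=False returns a Series because there is a single group
df["Keyword"] = df["Tweet"].str.extract(pattern, flags=re.IGNORECASE,
                                        expand=False)
print(df["Keyword"].dropna().tolist())
# ['somewhere', 'thing', 'strange']
```

The extracted text keeps the casing used in the tweet, not the casing of the keyword list.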
