How to exclude sentences containing a specific word - Python

I'm reading sentences from an Excel file (containing bio data) and want to extract the organizations where people are working. The file also contains sentences that specify where the person is studying.
For example:
I'm studying in 'x' institution (university)
I'm a student in 'y' college
I want to skip these types of sentences.
I am using a regular expression to match these sentences; if a sentence is related to a student, I want to skip it and write only the other lines to a separate Excel file.
My code is as below:
csvdata = pandas.read_csv("filename.csv", ",")
for data in csvdata:
    regEX = re.compile('|'.join([r'\bstudent\b', r'\bstudy[ing]\b']), re.I)
    matched_data = re.match(regEX, data)
    if matched_data is not None:
        continue
    else:
        # write the sentence to excel
        pass
But when I check the newly created Excel file, it still contains the sentences with 'student' and 'study'.
How can the regular expression be modified to get the desired result?

There are 2 things here:
1) Use re.search (re.match only searches at the string start)
2) The regex should be regEX=re.compile(r"\b(?:{})\b".format('|'.join([r'student',r'study(?:ing)?'])),re.I)
The [ing] is a character class and only matches one character: i, n, or g, while you intended to match an optional ing ending. A non-capturing group with a ? quantifier, (?:ing)?, matches one or zero occurrences of the sequence ing.
Also, \b(x|y)\b is a more efficient pattern than \bx\b|\by\b, as it involves fewer backtracking steps.
Here is just a demo of what this regex looks like:
import re
pat = r"\b(?:{})\b".format('|'.join([r'student',r'study(?:ing)?']))
print(pat)
# => \b(?:student|study(?:ing)?)\b
regEX=re.compile(pat,re.I)
s = "He is studying here."
mObj = regEX.search(s)
if mObj:
    print(mObj.group(0))
# => studying
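Putting it together for the CSV task, here is a minimal sketch, assuming the sentences live in a column named 'sentence' (a hypothetical column name):
import re
import pandas as pd

regEX = re.compile(r"\b(?:student|study(?:ing)?)\b", re.I)

df = pd.read_csv("filename.csv")
# Keep only the rows whose sentence does NOT match; na=False treats empty cells as non-matching.
kept = df[~df["sentence"].str.contains(regEX, na=False)]
kept.to_csv("filtered.csv", index=False)
Note that iterating a DataFrame directly (for data in csvdata:) yields column names, not rows, which is another reason the original loop never filtered anything.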

Related

Is there a way to find (potentially) multiple results with re.search?

While parsing the file names of TV shows, I would like to extract information from them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every variation of filename I've seen over the last few years. I'd love to be able to condense this to something I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but for the first entry it only displays the first episode number and not all three.
import re

def main():
    pattern = r'(.*)\.S(\d+)[E(\d+)]+'
    strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
    #print(strings)
    for string in strings:
        print(string)
        result = re.search(r"(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
        print(result.group(2))

if __name__ == "__main__":
    main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish e02 in "foo.s01e006e007" from that of "age007.s01e001"), and Python doesn't let you use variable-length lookbehind (to make sure s\d+ is before it without overlapping).
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
import re

text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
re.findall instead of re.search will return a list of all matches.
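For example, a quick sketch using just the episode part of the pattern:
import re

print(re.findall(r'e\d+', 'blah.s01e01e02e03', re.I))
# => ['e01', 'e02', 'e03']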
If you can make use of the PyPI regex module, you can use repeating capture groups in the pattern and then call .captures().
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
Or use .capturesdict() with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
Note that the notation [E(\d+)] that you used is a character class; it matches exactly one of the listed characters: E, (, a digit, + or ).
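A quick sketch to see that in action:
import re

# The class matches any single one of: E ( ) + or a digit.
print(re.findall(r'[E(\d+)]', 'E(9+)x'))
# => ['E', '(', '9', '+', ')']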

Extract text between two dots containing a specific word

I'm trying to extract a sentence between two dots. All sentences have inflam or Inflam in them, which is my specific word, but I don't know how to make that happen.
What I want is ".The bulk of the underlying fibrous connective tissue consists of diffuse aggregates of chronic inflammatory cells."
or
".The fibrous connective tissue reveals scattered vascular structures and possible chronic inflammation."
from a long paragraph.
What I have tried so far is this:
##title Extract microscopic-inflammation { form-width: "20%" }
def inflammation1(microscopic_description):
    PATTERNS = [
        "(?=\.)(.*)(?<=inflamm)",
        "(?=inflamm)(.*)(?<=.)",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, microscopic_description)
        if len(matches) > 0:
            break
    inflammation1 = ''.join([k for k in matches])
    return (inflammation1)

for index, microscopic_description in enumerate(texts):
    print(inflammation1(microscopic_description))
    print("#"*79, index)
This hasn't worked for me and gives me an error. When I separate my patterns and run them in different cells, they work. The problem is that they don't work together to give me the sentence between the "." before inflamm and the "." after it.
import re
string='' # replace with your paragraph
print(re.search(r"\.[\s\w]*\.",string).group()) #will print first matched string
print(re.findall(r"\.[\s\w]*\.",string)) #will print all matched strings
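If the match should also contain the keyword, here is a sketch along the same lines (the sample string is made up):
import re

string = "Some text.The fibrous connective tissue reveals chronic inflammation.More text."
# A dot, then a dot-free stretch that must contain 'inflamm', then the closing dot.
print(re.findall(r"\.[^.]*inflamm[^.]*\.", string, re.I))
# => ['.The fibrous connective tissue reveals chronic inflammation.']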
You can try by checking for the word in every sentence of the text.
for sentence in text.split("."):
    if word in sentence:
        print(sentence[1:])
Here you do exactly that and if you find the word, you print the sentence without the space in the start of it. You can modify it in any way you want.

Python Regex to extract multiple complex groups

I am trying to extract some groups of data from a text and validate if the input text is correct. In the simplified form my input text looks like this:
Sample=A,B;C,D;E,F;G,H;I&other_text
In which A-I are the groups I am interested in extracting.
In the generic form, Sample looks like this:
val11,val12;val21,val22;...;valn1,valn2;final_val
an arbitrary number of comma-separated pairs, themselves separated by semicolons, and one single value at the very end.
There must be at least two pairs before the final value.
The regular expression I came up with is something like this:
r'Sample=(\w),(\w);(\w),(\w);((\w),(\w);)*(\w)'
Assuming my desired groups are simply words (in reality they are more complex but this is out of the scope of the question).
It actually captures the whole text but fails to group the values correctly.
I am just assuming that your "values" are composed of any characters other than , and ;, i.e. [^,;]+. This clearly needs to be modified in the re.match and re.finditer calls to meet your actual requirements.
import re

s = 'Sample=val11,val12;val21,val22;val31,val32;valn1,valn2;final_val'
# verify if there is a match:
m = re.match(r'^Sample=([^,;]+),+([^,;]+)(;([^,;]+),+([^,;]+))+;([^,;]+)$', s)
if m:
    final_val = m.group(6)
    other_vals = [(m.group(1), m.group(2)) for m in re.finditer(r'([^,;]+),+([^,;]+)', s[7:])]
    print(final_val)
    print(other_vals)
Prints:
final_val
[('val11', 'val12'), ('val21', 'val22'), ('val31', 'val32'), ('valn1', 'valn2')]
You can do this with a regex that has an OR in it to decide which kind of data you are parsing. I spaced out the regex for commenting and clarity.
import re

data = 'val11,val12;val21,val22;valn1,valn2;final_val'
pat = re.compile(r'''
    (?P<pair>                 # either a comma-separated pair ending in a semicolon
        (?P<entry_1>[^,;]+) , (?P<entry_2>[^,;]+) ;
    )
    |                         # OR
    (?P<end_part>             # the ending token, which contains no comma or semicolon
        [^;,]+
    )''', re.VERBOSE)

results = []
for match in pat.finditer(data):
    if match.group('pair'):
        results.append(match.group('entry_1', 'entry_2'))
    elif match.group('end_part'):
        results.append(match.group('end_part'))

print(results)
This results in:
[('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2'), 'final_val']
You can do this without using regex, by using str.split.
An example:
words = list(map(lambda x: x.split(','), 'val11,val12;val21,val22;valn1,valn2;final_val'.split(';')))
This will result in the following list:
[
['val11', 'val12'],
['val21', 'val22'],
['valn1', 'valn2'],
['final_val']
]
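Since the format also needs validating (at least two pairs before the final value), here is a sketch (Python 3) that checks the overall shape first and then splits, reusing the [^,;]+ assumption from above:
import re

s = 'Sample=val11,val12;val21,val22;valn1,valn2;final_val'
# At least two 'pair;' repetitions, then one final value.
if re.fullmatch(r'Sample=(?:[^,;]+,[^,;]+;){2,}[^,;]+', s):
    *pairs, final_val = s[len('Sample='):].split(';')
    pairs = [tuple(p.split(',')) for p in pairs]
    print(pairs, final_val)
# => [('val11', 'val12'), ('val21', 'val22'), ('valn1', 'valn2')] final_val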

Counting phrases EXCEPT when they are preceded by another phrase in Python

Using pandas in Python 2.7 I am attempting to count the number of times a phrase (e.g., "very good") appears in pieces of text stored in a CSV file. I have multiple phrases and multiple pieces of text. I have succeeded in this first part using the following code:
for row in df_book.itertuples():
    index, text = row
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()
    for row in df_phrase.itertuples():
        index, phrase = row
        count = sum(1 for x in re.finditer(r"\b%s\b" % (re.escape(phrase)), normed))
        file.write("%s," % (count))
However, I don't want to count the phrase if it's preceded by a different phrase (e.g., "it is not"). Therefore I used a negative lookbehind assertion:
for row in df_phrase.itertuples():
    index, phrase = row
    for row in df_negations.itertuples():
        index, negation = row
        count = sum(1 for x in re.finditer(r"(?<!%s )\b%s\b" % (negation, re.escape(phrase)), normed))
The problem with this approach is that it records a value for each and every negation as pulled from the df_negations dataframe. So, if finditer doesn't find "it was not 'very good'", then it will record a 0. And so on for every single possible negation.
What I really want is just an overall count for the number of times a phrase was used without a preceding phrase. In other words, I want to count every time "very good" occurs, but only when it's not preceded by a negation ("it was not") on my list of negations.
Also, I'm more than happy to hear suggestions on making the process run quicker. I have 100+ phrases, 100+ negations, and 1+ million pieces of text.
I don't really do pandas, but this cheesy non-Pandas version gives some results with the data you sent me.
The primary complication is that the Python re module does not allow variable-width negative look-behind assertions. So this example looks for matching phrases, saving the starting location and text of each phrase, and then, if it found any, looks for negations in the same source string, saving the ending locations of the negations. To make sure that negation ending locations are the same as phrase starting locations, we capture the whitespace after each negation along with the negation itself.
Repeatedly calling functions in the re module is fairly costly. If you have a lot of text as you say, you might want to batch it up, e.g. by using 'non-matching-string'.join() on some of your source strings.
import re
from collections import defaultdict
import csv

def read_csv(fname):
    with open(fname, 'r') as csvfile:
        result = list(csv.reader(csvfile))
    return result

df_negations = read_csv('negations.csv')[1:]
df_phrases = read_csv('phrases.csv')[1:]
df_book = read_csv('test.csv')[1:]

negations = (str(row[0]) for row in df_negations)
phrases = (str(re.escape(row[1])) for row in df_phrases)

# Add a word to the negation pattern so it overlaps the
# next group.
negation_pattern = r"\b((?:%s)\W+)" % '|'.join(negations)
phrase_pattern = r"\b(%s)\b" % '|'.join(phrases)

counts = defaultdict(int)
for row in df_book:
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', row[0]).lower().strip()
    # Find the location and text of any matching good groups
    phrases = [(x.start(), x.group()) for x in
               re.finditer(phrase_pattern, normed)]
    if not phrases:
        continue
    # If we had matches, find the end locations of matching bad groups
    negated = set(x.end() for x in re.finditer(negation_pattern, normed))
    for start, text in phrases:
        if start not in negated:
            counts[text] += 1
        else:
            print("%r negated and ignored" % text)

for pattern, count in sorted(counts.items()):
    print(count, pattern)
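To see the position trick in isolation, here is a minimal sketch with made-up phrases and negations:
import re

normed = "it was not very good but the food was very good"
phrase_pattern = r"\b(very good)\b"
# The trailing \W+ makes each negation end exactly where the next phrase starts.
negation_pattern = r"\b((?:it was not|was not)\W+)"

starts = [m.start() for m in re.finditer(phrase_pattern, normed)]
negated = set(m.end() for m in re.finditer(negation_pattern, normed))
print(sum(1 for s in starts if s not in negated))
# => 1 (only the non-negated occurrence is counted)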

Using Python 2.4.3: Want to find the same regex multiple times in a text file

Super NOOB to Python (2.4.3): I am executing a function containing a regular expression which searches through a txt file that I'm importing. I am able to read and run re.search on the text file, and the output is correct. I need to run this for multiple occurrences (the regex occurs 48 times in the text). The code is as follows:
#!/usr/bin/python
import re

dataRead = open('pd_usage_14-04-23.txt', 'r')
dataWrite = open('test_write.txt', 'w')

text = (dataRead.read())  # reads and initializes text for conversion to string
s = str(text)  # converts text to string for reading

def user(str):
    re1 = '((?:[a-z][a-z]+))'  # Word 1
    re2 = '(\\s+)'  # White Space 1
    re3 = '((?:[a-z][a-z]+))'  # Word 2
    re4 = '(\\s+)'  # White Space 2
    re5 = '((?:[a-z][a-z]*[0-9]+[a-z0-9]*))'  # Alphanum 1
    rg = re.compile(re1+re2+re3+re4+re5, re.IGNORECASE|re.DOTALL)
    #alphanum1=rg.group(5)
    re.findall(rg, s, flags=0)
    #print "("+alphanum1+")"+"\n"
    #if m:
    #word1=m.group(1)
    #ws1=m.group(2)
    #word2=m.group(3)
    #ws2=m.group(4)
    #alphanum1=m.group(5)
    #print "("+alphanum1+")"+"\n"
    return

user(s)

dataRead.close()
dataWrite.close()
OUTPUT: g706454
THIS OUTPUT IS CORRECT! BUT...!
I need to run it multiple times, reading text that's further down.
I have 2 other definitions that also need to be run multiple times. I need all 3 to run consecutively, and then run again starting from the next line, to search and output newer data. All the logic I tried to implement returns the same output.
So I have something like this:
for count in range(0, 47):
    if stop_read:
        date(s)
        usage(s)
        user(s)
stop_read is a definition that finds the next line after the data I'm looking for (date, usage, user). I figured I could call it to say: if you hit stop_read, read the next line and run the definitions all over again.
Any help is greatly appreciated!
Here is what I do for a regex in Python 3; it should be similar in Python 2. This is for a multiline search.
regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)
Then later on in code I have something like:
myset.update([m.group(0) for m in regex.finditer(logmsg.text)])
Maybe you might want to update your Python if you can, 2.4 is old, old, and stale.
Looks like re.findall would solve your problem:
re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
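As a sketch, here is finditer applied to the same pattern shape as in the question, on made-up sample text (the group of interest is the fifth one):
import re

# Same shape as the pattern in the question: word, whitespace, word, whitespace, alphanum.
rg = re.compile(r'([a-z][a-z]+)(\s+)([a-z][a-z]+)(\s+)([a-z][a-z]*[0-9]+[a-z0-9]*)',
                re.IGNORECASE)
s = "alpha beta g706454\ngamma delta g706455"
for m in rg.finditer(s):
    print(m.group(5))
# => g706454
#    g706455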
