I am working on a Python automation script in which I want to extract a specific paragraph based on a regex match, but I am stuck on how to extract the paragraph. The following example shows my case:
Solution : (Consistent Pattern)
The paragraph I want to extract (Inconsistent Pattern)
Remote value: x (Consistent Pattern)
The following is the program that I am currently working on, and it would be great if anyone could enlighten me!
import re
test = r'Solution\s:'
test1 = 'Remote'
with open('<filepath>', 'r') as extract:
    lines = extract.readlines()
    for line in lines:
        x = re.search(test, line)
        y = re.search(test1, line)
        if x is not y:
            f4.write(line)
            print('good')
        else:
            print('stop')
This can be easily done using regular expressions, for example:
import re
text = r"""
Solution\s:
The paragraph I
want to extract
Remote
Some useless text here
Solution\s:
Another paragraph
I want to
extract
Remote
"""
m = re.findall(r"Solution\\s:(.*?)Remote", text, re.DOTALL | re.IGNORECASE)
print(m)
Where text represents some text of interest (read in from a file, for example) from which we wish to extract all portions between the sentinel patterns Solution\s: and Remote. Here we use an IGNORECASE search so that the sentinel patterns are recognised even if spelt with different capitalization.
The above code outputs:
['\nThe paragraph I\nwant to extract\n', '\nAnother paragraph\nI want to\nextract\n']
Read the Python re library documentation at https://docs.python.org/3/library/re.html for more details.
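Applied to the layout described in the question, where the real text has whitespace before the colon rather than a literal \s, the same idea might look like the sketch below. The helper name extract_solutions and the sample text are made up for illustration; in practice the string would come from reading the file.

```python
import re

# Hypothetical sample that mirrors the layout described in the question.
sample = (
    "Solution : \n"
    "The paragraph I want to extract\n"
    "Remote value: x\n"
    "Unrelated text\n"
    "Solution :\n"
    "Another paragraph\n"
    "spanning two lines\n"
    "Remote value: y\n"
)

def extract_solutions(text):
    """Return the text between each 'Solution :' line and the next line starting with 'Remote'."""
    pattern = re.compile(
        r"^Solution\s*:[^\n]*\n(.*?)^Remote",
        re.DOTALL | re.MULTILINE,  # DOTALL lets '.' cross newlines; MULTILINE anchors '^' per line
    )
    return [block.strip() for block in pattern.findall(text)]

paragraphs = extract_solutions(sample)
print(paragraphs)
```

Anchoring both sentinels to the start of a line is a design choice, so that the word "Remote" appearing in the middle of a paragraph does not end the match early.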
I need to grab specific details being parsed in from email bodies, in this case the emails are plain text and formatted like so:
imbad#regex.com
John Doe
+16073948374
2021-04-27T15:38:11+0000
14904
The above is an example output of print(body) parsed in from an email like so:
def parseEmail(popServer, msgNum):
    raw_message = popServer.retr(msgNum)[1]
    str_message = email.message_from_bytes(b'\n'.join(raw_message))
    body = str(str_message.get_payload())
So, if I needed to simply grab the email address and phone number from body object, how might I do that using regex?
I understand regex is most certainly overkill for this. However, I'm only repurposing an existing in-house utility that's already written to use regex for more complex queries, so it seems the simplest solution here would be to modify the regex to grab the desired text. Attempts to use str.partition() resulted in other, unrelated errors.
Thank you in advance.
You could use the following regex patterns:
For the email: /.+#.+\n/g
For the phone number: /^[+]\d+\n/gm
Remove the surrounding forward slashes and the trailing flags if using these with the Python re library, and pass re.MULTILINE for the phone number pattern instead.
Note that the email pattern uses only the global flag, whereas the phone number pattern also uses the multiline flag.
Simply loop over every body, capturing these details and storing them how you like.
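In Python, the two patterns above could be tried roughly as follows. This is only a sketch: the sample body is copied from the question (with # standing in for the at-sign, as in the sample), the variable names are made up, and the patterns are tightened slightly so that they do not include the trailing newline.

```python
import re

# Sample body from the question; '#' stands in for the at-sign as in the original post.
body = (
    "imbad#regex.com\n"
    "John Doe\n"
    "+16073948374\n"
    "2021-04-27T15:38:11+0000\n"
    "14904"
)

# re.MULTILINE makes '^' and '$' match at the start and end of every line.
email_match = re.search(r"^\S+#\S+$", body, re.MULTILINE)
phone_match = re.search(r"^\+\d+$", body, re.MULTILINE)

email_addr = email_match.group() if email_match else None
phone = phone_match.group() if phone_match else None
print(email_addr, phone)
```

There is no equivalent of the g flag in Python; re.search returns the first match, and re.findall or re.finditer return all of them.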
In the comments clarifying the question, you indicated that the e-mail address is always on the first line, and the phone number is always on the 3rd line. In that case, I would just split the lines instead of trying to match them with an RE.
lines = body.split("\n")
email = lines[0]
phone = lines[2]
To match those patterns on the 1st and the 3rd line you can use 2 capture groups using a single regex:
^([^\s#]+#[^\s#]+)\r?\n.*\r?\n(\+\d+)$
The pattern matches:
^ Start of string
([^\s#]+#[^\s#]+) Capture an email like pattern in group 1 (Just a single # on the first line)
\r?\n.*\r?\n Match (do not capture) the second line
(\+\d+) Capture a + and 1+ digits in group 2
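$ End of the line (the example below passes re.MULTILINE, so $ matches at the end of the third line rather than only at the end of the string)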
Example
import re
regex = r"^([^\s#]+#[^\s#]+)\r?\n.*\r?\n(\+\d+)$"
s = ("imbad#regex.com\n"
     "John Doe\n"
     "+16073948374\n"
     "2021-04-27T15:38:11+0000\n"
     "14904")
match = re.match(regex, s, re.MULTILINE)
if match:
    print(f"{match.group(1)}, {match.group(2)}")
Output
imbad#regex.com, +16073948374
Using Regex.
Ex:
import re
s = """imbad#regex.com
John Doe
+16073948374
2021-04-27T15:38:11+0000
14904"""
ptrn = re.compile(r"(\w+#\w+\.[a-z]+|\+\d{11}\b)")
print(ptrn.findall(s))
Output:
['imbad#regex.com', '+16073948374']
I have a text file in which each line looks something like this:
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
Each line has the keyword testcaseid followed by a test case ID (in this case blt12_0001 is the ID, and s3 and n4 are parameters). I want to extract blt12_0001 from the above line. Each test case ID will contain exactly one underscore '_'. What would be a regex for this case, and how can I store the test case ID in a variable?
You could make use of capturing groups:
testcaseid_([^_]+_[^_]+)
One of many possible ways in Python could be
import re
line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
for match in re.finditer(r'testcaseid_([^_]+_[^_]+)', line):
    print(match.group(1))
You can use this regex to capture your testcaseid given in your format,
(?<=testcaseid_)[^_]+_[^_]+
This essentially matches text that contains exactly one underscore and is preceded by testcaseid_, using a positive lookbehind. Here [^_]+ matches one or more characters other than an underscore, followed by _, and then [^_]+ again matches one or more characters other than _.
Check out this Python code,
import re
lines = ['GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4', 'GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s6_n9']
for s in lines:
    grp = re.search(r'(?<=testcaseid_)[^_]+_[^_]+', s)
    if grp:
        print(grp.group())
Output,
blt12_0001
blt12_0001
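To store the ID in a variable rather than print it, as the question asks, something like the following sketch could work. The function name extract_test_case_id is made up for illustration, and the pattern is the capturing-group one shown earlier.

```python
import re

# Capturing-group pattern from the first answer, compiled once for reuse.
TESTCASE_RE = re.compile(r"testcaseid_([^_]+_[^_]+)")

def extract_test_case_id(line):
    """Return the test case ID found in the line, or None if there is none."""
    match = TESTCASE_RE.search(line)
    return match.group(1) if match else None

line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
test_case_id = extract_test_case_id(line)
print(test_case_id)
```

When reading from a file, the same function could be called on each line and the non-None results collected into a list.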
Another option that might work would be:
import re
expression = r"[^_\r\n]+_[^_\r\n]+(?=(?:_[a-z0-9]{2}){2}$)"
string = '''
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
GeneralBKT_n24_-e_dee_testcaseid_blt81_0023_s4_n5
'''
print(re.findall(expression, string, re.M))
Output
['blt12_0001', 'blt81_0023']
I am learning Regex with Python and I am doing the baby names exercise of the Google Tutorial on Regex. The html file --baby1990.html-- is in a zipped file that can be downloaded here: https://developers.google.com/edu/python/set-up ('Download Google Python Exercises')
The year is placed within h3 tags. The HTML code is the following:
<h3 align="center">Popularity in 1990</h3>
I am using the following code to extract the year from the file:
f = open('C:/Users/ALEX/MyFiles/JUPYTER NOTEBOOKS/google-python-exercises/babynames/baby1990.html', 'r')
strings = re.findall(r'<h3 align="center">Popularity in (/d/d/d/d)</h3>', f.read())
I have tested the pattern with RegularExpressions101 website and it works.
However the 'strings' list returned is empty.
len(strings)
Out: 0
I think the best way to match a year in a contextual string is to use re.search or re.match.
For instance:
import re
tag = """<h3 align="center">Popularity in 1990</h3>"""
mo = re.search(r"Popularity in (\d{4})", tag)
year = mo.group(1) if mo else ""
print(year)
# -> 1990
Of course, if you want to find all matches, you need to use re.findall
…
To check your Python regex, you can also try it online at https://regex101.com/
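One likely reason the findall call in the question returns an empty list is that its pattern uses /d (a literal slash followed by the letter d) where \d (a digit) was presumably intended. A minimal comparison, using the h3 line quoted in the question as the input string:

```python
import re

# The h3 line quoted in the question, used here instead of the full file.
html = '<h3 align="center">Popularity in 1990</h3>'

# '/d' matches a literal slash followed by the letter 'd', so nothing is found.
broken = re.findall(r'<h3 align="center">Popularity in (/d/d/d/d)</h3>', html)

# '\d' matches a digit, so the year is captured.
fixed = re.findall(r'<h3 align="center">Popularity in (\d\d\d\d)</h3>', html)

print(broken, fixed)
```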
I'm trying to do some text file parsing where this pattern is repeated throughout the file:
VERSION.PROGRAM:program_name
VERSION.SUBPROGRAM:sub_program_name
My intent is, given a program_name, to retrieve the sub_program_name for each block of text I mentioned above.
I have the following function that finds if the text actually exists, but doesn't print the sub_program_name:
def find_subprogram(program_name):
    regex_string = r'VERSION.PROGRAM:%s\nVERSION.SUBPROGRAM:.' % program_name
    with open('file.txt', 'r') as f:
        match = re.search(regex_string, f.read(), re.DOTALL|re.MULTILINE)
    if match:
        print(match.group())
I will appreciate some help or tips.
Thanks
Your regex has a typo, it's looking for PRGRAM.
If you want to match a pattern that spans several lines, you don't need the MULTILINE modifier. All it does is make ^ and $ match at the start and end of every line, instead of only at the start and end of the whole string.
You also are not using valid regex matching techniques. You should look up how to properly use regex.
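To capture the subprogram name, put a group such as (.*) after VERSION.SUBPROGRAM: instead of the single . you have now. The %s is only the placeholder for the program name, not a wildcard.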
Here is an example
Using VERSION\.PROGRAM:YOURSTRING\nVERSION\.SUBPROGRAM:(.*) will match the groups properly
re.compile(r'VERSION\.PROGRAM:%s\nVERSION\.SUBPROGRAM:(.*)' % re.escape(yourstr))
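Put together, a version of the function from the question might look like the sketch below. It takes the text as an argument rather than opening file.txt so that it can be tried on a sample string; the sample data and the decision to return a list are assumptions made for illustration. Note that re.DOTALL is deliberately left out: with it, (.*) would run past the end of the line and swallow the rest of the text.

```python
import re

def find_subprogram(program_name, text):
    """Return every subprogram name that directly follows the given program name."""
    pattern = re.compile(
        r"VERSION\.PROGRAM:%s\r?\nVERSION\.SUBPROGRAM:(.*)" % re.escape(program_name)
    )
    # Without re.DOTALL, '.' stops at the newline, so the group holds one line only.
    return [name.strip() for name in pattern.findall(text)]

# Hypothetical file contents.
sample = (
    "VERSION.PROGRAM:alpha\n"
    "VERSION.SUBPROGRAM:alpha_sub\n"
    "VERSION.PROGRAM:beta\n"
    "VERSION.SUBPROGRAM:beta_sub\n"
)

result = find_subprogram("alpha", sample)
print(result)
```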
I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with Python? I used concordance(), but it only prints lines where the word matches.
Just a quick reminder: sentence breaking is actually a pretty complex thing. There are exceptions to the period rule, such as "Mr." or "Dr.", and there is also a variety of sentence-ending punctuation marks. But there are also exceptions to the exception (if the next word is capitalized and is not a proper noun, then "Dr." can end a sentence, for example).
If you're interested in this (it's a natural language processing topic) you could check out:
the natural language tool kit's (nltk) punkt module.
If you have each sentence in a string you can use find() on your word and if found return the sentence. Otherwise you could use a regex, something like this
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match is not None:
    sentence = match.group("sentence")
I haven't tested this, but something along those lines.
My test:
import re
text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match is not None:
    print(match.group("sentence"))
dutt did a good job answering this. just wanted to add a couple things
import re
text = "go directly to jail. do not cross go. do not collect $200."
pattern = r"\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match is not None:
    sentence = match.group("sentence")
obviously, you'll need to import the regex library (import re) before you begin. here is a teardown of what the regular expression actually does (more info can be found at the Python re library page)
\. # looks for a period preceding sentence.
(?P<sentence>...) # names the captured group "sentence".
.*? # selects all text (non-greedy) until the word "go".
again, the link to the library ref page is key.
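For returning every sentence that contains the word, rather than only the first match, a rough sketch along these lines might do. It splits naively after '.', '!' or '?' followed by whitespace, so it will mis-split at abbreviations such as "Mr." or "Dr." as noted above; the function name sentences_containing is made up for illustration.

```python
import re

def sentences_containing(text, word):
    """Return the sentences of text that contain word as a whole word."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \b keeps "go" from matching inside longer words such as "good".
    word_re = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    return [s for s in sentences if word_re.search(s)]

text = "go directly to jail. do not cross go. do not collect $200."
found = sentences_containing(text, "go")
print(found)
```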