Python - regex to grab specific lines from text - python

I need to grab specific details being parsed in from email bodies, in this case the emails are plain text and formatted like so:
imbad#regex.com
John Doe
+16073948374
2021-04-27T15:38:11+0000
14904
The above is an example output of print(body) parsed in from an email like so:
import email

def parseEmail(popServer, msgNum):
    raw_message = popServer.retr(msgNum)[1]
    str_message = email.message_from_bytes(b'\n'.join(raw_message))
    body = str(str_message.get_payload())
So, if I needed to simply grab the email address and phone number from body object, how might I do that using regex?
I understand regex is most certainly overkill for this. However, I'm only repurposing an existing in-house utility that's already written to use regex for more complex queries, so the simplest solution here seems to be to modify the regex to grab the desired text. Attempts to use str.partition() resulted in other, unrelated errors.
Thank you in advance.

You could use the following regex patterns:
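For the email: /.+#.+\n/g
For the phone number: /^[+]\d+\n/gm
Remove the delimiting forward slashes and the trailing flag letters when using these with Python's re library; pass re.MULTILINE as a flag argument instead of m, and use re.findall() in place of g.
Note that the email pattern uses only the global flag, while the phone number pattern also uses the multiline flag so that ^ matches at the start of each line.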
Simply loop over every body, capturing these details and storing them how you like.
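A minimal sketch of how these two patterns might look in Python, assuming the sample body from the question. The helper name extract_contact is hypothetical, and the patterns are anchored per line with re.MULTILINE rather than ending in \n, so that the last line of a body can also match:

```python
import re

# Per-line patterns; re.MULTILINE makes ^ and $ anchor at line boundaries.
EMAIL_RE = re.compile(r"^.+#.+$", re.MULTILINE)       # a line containing '#', as in the sample
PHONE_RE = re.compile(r"^[+]\d+\r?$", re.MULTILINE)   # a line made of '+' followed by digits

def extract_contact(body):
    """Return (email, phone); either is None if not found. Hypothetical helper."""
    email_match = EMAIL_RE.search(body)
    phone_match = PHONE_RE.search(body)
    return (
        email_match.group().strip() if email_match else None,  # strip() drops a stray '\r'
        phone_match.group().strip() if phone_match else None,
    )

body = "imbad#regex.com\nJohn Doe\n+16073948374\n2021-04-27T15:38:11+0000\n14904"
print(extract_contact(body))
```

To process several emails, call the helper once per body and store the tuples however you like.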

In the comments clarifying the question, you indicated that the e-mail address is always on the first line, and the phone number is always on the 3rd line. In that case, I would just split the lines instead of trying to match them with an RE.
lines = body.split("\n")
email = lines[0]
phone = lines[2]
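If some bodies might be shorter than expected or use \r\n line endings, a slightly more defensive version could look like the sketch below. The function name first_and_third_line is made up for illustration, and the fixed line positions are the assumption stated in the comments on the question:

```python
def first_and_third_line(body):
    """Return (email, phone) from lines 1 and 3, or None if the body is too short."""
    lines = body.splitlines()  # handles both '\n' and '\r\n'
    if len(lines) < 3:
        return None
    return lines[0].strip(), lines[2].strip()

body = "imbad#regex.com\nJohn Doe\n+16073948374\n2021-04-27T15:38:11+0000\n14904"
print(first_and_third_line(body))
```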

To match those patterns on the 1st and the 3rd line you can use 2 capture groups using a single regex:
^([^\s#]+#[^\s#]+)\r?\n.*\r?\n(\+\d+)$
The pattern matches:
^ Start of string
([^\s#]+#[^\s#]+) Capture an email like pattern in group 1 (Just a single # on the first line)
\r?\n.*\r?\n Match (do not capture) the second line
(\+\d+) Capture a + and 1+ digits in group 2
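$ End of the line (the example below passes re.MULTILINE; without it, $ would only match at the end of the whole string and the pattern would fail on this input)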
Example
import re
regex = r"^([^\s#]+#[^\s#]+)\r?\n.*\r?\n(\+\d+)$"
s = ("imbad#regex.com\n"
"John Doe\n"
"+16073948374\n"
"2021-04-27T15:38:11+0000\n"
"14904")
match = re.match(regex, s, re.MULTILINE)
if match:
    print(f"{match.group(1)}, {match.group(2)}")
Output
imbad#regex.com, +16073948374

Using Regex.
Ex:
import re
s = """imbad#regex.com
John Doe
+16073948374
2021-04-27T15:38:11+0000
14904"""
ptrn = re.compile(r"(\w+#\w+\.[a-z]+|\+\d{11}\b)")
print(ptrn.findall(s))
Output:
['imbad#regex.com', '+16073948374']

Related

Match only words (sometimes with dots separating) regex

I have a list like so:
example.com=120.0.0.0
ben.example.com=120.0.0.0
+ben.example=120.0.0.0
+ben.example.com.np=120.0.0.0
ben=120.0.0.0
ben-example.com=120.0.0.0
ben43.example.com=120.0.0.0
I need to find only the words (separated by dots).
No IPs, =, + and so on.
Some FQDNs have multiple dots, some none at all, and so on.
Is this possible?
If the regex works well, I want to get only these:
ben.example.com.np
ben.example
ben.example.com
example.com
ben
ben43.example.com
I want to parse the file into IPs and FQDNs via Python regex so I can work with it and check whether the IPs are available for the domain.
This is very straightforward
import re
fqdns = re.findall(r"[a-zA-Z\.-]{2,}", text, flags=re.M)
gives
['example.com', 'ben.example.com', 'ben.example', 'ben-example.com.np', 'ben']
The group matches all characters in the ranges a-z and A-Z, along with dot . and -. The {2,} means match at least 2 characters in a row, so it won't match the dots in the IPs.
EDIT: After I wrote this answer the parameters of the question changed slightly, as some of the URLs contained numbers. So, instead of using re.findall() to get all matches in a (potentially multi-line) input, you should use re.match().group() with a slightly altered regex and process the input line by line:
import re
with open("path/to/file", "r") as f:
    fqdns = [re.match(r"(?:[a-zA-Z\.\-0-9]{2,})", line).group() for line in f]
re.match(), in the absence of any flags, returns after the first match in the line. .group() is the way you access the matched string.
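One caveat: re.match() only tries the start of the line, so a line beginning with + returns None and calling .group() on it raises AttributeError. A sketch that tolerates the leading + and also captures the IP, since the question wants both, might look like this. It assumes every useful line has the form name=ip as in the sample list, and the names LINE_RE and parse_line are hypothetical:

```python
import re

# Optional leading '+', then the name, '=', and a dotted IPv4-looking value.
LINE_RE = re.compile(r"^\+?([A-Za-z0-9.-]+)=(\d{1,3}(?:\.\d{1,3}){3})$")

def parse_line(line):
    """Return (fqdn, ip) for a 'name=ip' line, or None if the line does not fit."""
    m = LINE_RE.match(line.strip())
    return m.groups() if m else None

sample = """example.com=120.0.0.0
ben.example.com=120.0.0.0
+ben.example=120.0.0.0
+ben.example.com.np=120.0.0.0
ben=120.0.0.0
ben-example.com=120.0.0.0
ben43.example.com=120.0.0.0
"""

pairs = [p for p in map(parse_line, sample.splitlines()) if p]
fqdns = [name for name, ip in pairs]
print(fqdns)
```

Hyphenated names are kept here because the character class in the answer allows them; drop the - from the class if they should be excluded.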

How to copy subsequent text after matching a pattern?

I have a text file in which each line looks something like this:
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
Each line has the keyword testcaseid followed by a test case ID (in this case blt12_0001 is the ID, and s3 and n4 are parameters). I want to extract blt12_0001 from the above line. Each test case ID contains exactly one underscore '_'. What would be a regex for this case, and how can I store the test case ID in a variable?
You could make use of capturing groups:
testcaseid_([^_]+_[^_]+)
See a demo on regex101.com.
One of many possible ways in Python could be
import re
line = "GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4"
for match in re.finditer(r'testcaseid_([^_]+_[^_]+)', line):
    print(match.group(1))
See a demo on ideone.com.
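To store the ID in a variable rather than print it, the same capturing group can be wrapped in a small helper. This is only a sketch; get_testcase_id is a made-up name, and it returns None when a line has no testcaseid_ part:

```python
import re

TESTCASE_RE = re.compile(r"testcaseid_([^_]+_[^_]+)")

def get_testcase_id(line):
    """Return the test case ID following 'testcaseid_', or None if absent."""
    m = TESTCASE_RE.search(line)
    return m.group(1) if m else None

testcase_id = get_testcase_id("GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4")
print(testcase_id)  # blt12_0001
```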
You can use this regex to capture your testcaseid given in your format,
(?<=testcaseid_)[^_]+_[^_]+
This matches text that contains exactly one underscore and is preceded by testcaseid_, using a positive lookbehind. Here [^_]+ matches one or more characters other than an underscore, followed by _, and then [^_]+ again matches one or more characters other than _.
Check out this Python code,
import re
lines = ['GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4', 'GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s6_n9']
for s in lines:
    grp = re.search(r'(?<=testcaseid_)[^_]+_[^_]+', s)
    if grp:
        print(grp.group())
Output,
blt12_0001
blt12_0001
Another option that might work would be:
import re
expression = r"[^_\r\n]+_[^_\r\n]+(?=(?:_[a-z0-9]{2}){2}$)"
string = '''
GeneralBKT_n24_-e_dee_testcaseid_blt12_0001_s3_n4
GeneralBKT_n24_-e_dee_testcaseid_blt81_0023_s4_n5
'''
print(re.findall(expression, string, re.M))
Output
['blt12_0001', 'blt81_0023']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Text containing emails but no space between words. How to extract emails?

I have a text containing only emails, but there is no space between each email.
Example : email1#file1.comemail2#file1.comemail3#dom1.net
I have applied re.findall(r'[\w\.-]+#[\w\.-]+', str(line)) and this is what I got
email1#file1.comemail
2#file1.comemail
3#dom1.net
Popular TLDs are .com, .net, .info and .org. So if I find one of them after #[\w\.-]+, I will insert a space after the TLD in the line and then extract the email.
But how do I check whether I have .com, .net, .info and so on?
One option (which can become quite cumbersome if you take a lot of variations into account like .com .net etc..) could be to use a non greedy +? match and list all the options that you would allow using an alternation.
[\w.-]+?#[\w.-]+?\.(?:com|net)
Note that repeating the character class [\w.-]+ would also allow for example .-.-.#.-.-..com
For example
import re
s = "email1#file1.comemail2#file1.comemail3#dom1.net"
regex = r"[\w.-]+?#[\w.-]+?\.(?:com|net)"
res = re.findall(regex, s)
print(res)
Result
['email1#file1.com', 'email2#file1.com', 'email3#dom1.net']
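If the list of allowed TLDs grows, the alternation could be built from a list instead of being typed by hand. The following is a sketch under that assumption; TLDS holds the four TLDs named in the question, and split_emails is a hypothetical helper name:

```python
import re

TLDS = ["com", "net", "info", "org"]  # the TLDs named in the question

# Escape each TLD and try longer ones first so a short one cannot shadow a longer one.
tld_alternation = "|".join(re.escape(t) for t in sorted(TLDS, key=len, reverse=True))
EMAIL_RE = re.compile(r"[\w.-]+?#[\w.-]+?\.(?:%s)" % tld_alternation)

def split_emails(text):
    """Return the run-together emails found in text (non-greedy, TLD-terminated)."""
    return EMAIL_RE.findall(text)

print(split_emails("email1#file1.comemail2#file1.comemail3#dom1.net"))
```

Splitting on a TLD is inherently ambiguous when the next local part starts with letters that extend a TLD, so the result is worth spot-checking on real data.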
You can use re.sub() to add a space after each TLD. As an example I specified .net, .org and .com, but you can add as many as you want. Note that the dots are escaped so they match a literal dot rather than any character.
Then you can apply your regex:
import re
text = 'email1#file1.comemail2#file1.comemail3#dom1.net'
new_text = re.sub(r'(\.com|\.net|\.org)', r'\1 ', text)
emails = re.findall(r'[\w\.-]+#[\w\.-]+', new_text)
print(emails)
OUTPUT
['email1#file1.com', 'email2#file1.com', 'email3#dom1.net']

How to extract a fragment that can appear multiple times in a string

I have emails with the following in the email body (from the email_body variable):
Body of the 1st email:
2nd email:
3rd email:
Same as the 2nd, just a different machine name.
These emails have attachments that also contain job names. I want to get the job name for every email only once.
for emailid in items:
    resp, data = conn.uid("fetch", emailid, "(RFC822)")
    if resp == 'OK':
        email_body = data[0][1].decode('utf-8')
        mail = email.message_from_string(email_body)
        # get all emails with words "PA1" or "PA2" in subject
        if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
            # search email body for machine name (string after word "MACHINE")
            regex1 = r'(?<!^)MACHINE:\s*(\S+)'
            a = re.findall(regex1, email_body)
            print(a)
Example of the message body from the 1st email for the MACHINE section, as retrieved by the Python code. This is the email_body variable that needs to be searched by the regex:
MACHINE: =^M
example_machine_1
Email body for 2nd email
MACHINE: example_machine_2^M
MACHINE: example_machine_2<br>^M
The difference is the line break in the 1st email body.
Current output
['example_machine_1', 'example_machine_1<br>']
['example_machine_2', 'example_machine_2<br>']
['=', '=']
As you can see, I'm getting duplicate jobs, and the job name from the 1st email is missing.
Desired output
['example_machine_3']
['example_machine_2']
['example_machine_1']
UPDATE
Thanks to @Predicate, I eliminated the duplicates for the 2nd and 3rd emails:
regex2 = r'(?<=MACHINE: )\b\w+\b|$'
I still have no idea how to get the job from the first email (line break).
Try this one, with defined word boundaries. \w matches letters, digits and underscores. \b marks a word boundary. \w does not match <, so the match ends before the <br> tag.
Try to be as specific as you can. If you know which characters will appear in your match, then use them in your regex. It will reduce the number of false positives and also speed up the search.
Variant 1:
regex1 = r'(?<=MACHINE: )\b\w+\b'
Variant 2:
This is also possible if the codes are in the format <some letters and digits><two digits>. To be more specific:
regex1 = r'(?<=MACHINE: )\b\w+\d{2}\b'
Variant 3:
If the same code appears multiple times, one way to handle it is to match only its last appearance. We create a capturing group (\w+\d{2}) and check that it does not appear again after the match with (?!.*\1):
regex1 = r'(?<=MACHINE: )\b(\w+\d{2})\b(?!.*\1)'
Variant 4 (after getting more info about the environment):
The 're' module does not support variable-length lookbehinds. It is better to use the regex module from PyPI, but you can use this trick, in which both lookbehind alternatives have the same length. Try it.
regex1 = r'(?<=MACHINE:\s=\s|..MACHINE:\s)\b(\w+)\b(?!.*\1)'
This matches the machine name in both emails, and only once.
Of course you can still be more specific if you know the structure of your codes and replace \w+ with \w+\d{2}. It's always good practice. But my regex should be enough for you. Also, you would probably need to compile the regex with the "single line" flag: regex1 = re.compile(r'<your regex>', re.DOTALL), and then call regex1.findall(...).
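An alternative that avoids lookbehinds altogether is to let the pattern consume an optional = plus the line break and capture only the name, then drop duplicates while keeping order. This is a sketch, not the approach from the answer above. It assumes the = at the end of the line is a soft line break, that ^M stands for a carriage return, and that every MACHINE: occurrence is of interest. The names MACHINE_RE and machine_names are hypothetical:

```python
import re

# 'MACHINE:', optional whitespace, optional '=' followed by whitespace/line break,
# then the machine name made of word characters (so '<br>' is never included).
MACHINE_RE = re.compile(r"MACHINE:\s*(?:=\s*)?(\w+)")

def machine_names(email_body):
    """Return the machine names found in email_body, each only once, in order."""
    return list(dict.fromkeys(MACHINE_RE.findall(email_body)))

body1 = "MACHINE: =\r\nexample_machine_1\r\n"
body2 = "MACHINE: example_machine_2\r\nMACHINE: example_machine_2<br>\r\n"
print(machine_names(body1))  # ['example_machine_1']
print(machine_names(body2))  # ['example_machine_2']
```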

Email ID matcher: a Python regular expression I can't figure out

I am trying to match a specific type of email address of the form username#siteaddress,
where username is a non-empty string of minimum length 5 built from the characters {a-z A-Z 0-9 . _}. The username cannot start with '.' or '_'. The site address is built from a prefix, which is a non-empty string of characters {a-z A-Z 0-9} (excluding the brackets), followed by one of the following suffixes: {".com", ".org", ".edu", ".co.in"}.
The following code doesn't work:
list=re.findall("[a-zA-Z0-9][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._]*#[a-zA-Z0-9][a-zA-Z0-9]*\.(com|edu|org|co\.in)",raw_input())
However, the following works fine when I add a '?:' in the last parenthesis, and I can't figure out why:
list=re.findall("[a-zA-Z0-9][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._]*#[a-zA-Z0-9][a-zA-Z0-9]*\.(?:com|edu|org|co\.in)",raw_input())
You shouldn't roll your own email address regex - it's a notoriously difficult thing to do correctly. See http://www.regular-expressions.info/email.html for a discussion on the topic.
To summarise that article, this is usually good enough: \b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
This one is even more precise (the author claims 99.99% of email addresses):
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
And this is the version that literally matches all possible RFC 5322 email addresses:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
| "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
# (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
| \[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])+)
\])
The last one is clearly overkill, but it gives you an idea of the complexity involved.
Your question is less about email-matching than about the behavior of findall, which varies depending on whether the regular expression contains capturing groups. Here's a simple example:
import re
text = '123.com 456.edu 999.com'
a = r'\d+\.(com|edu)' # A capturing group.
b = r'\d+\.(?:com|edu)' # A non-capturing group.
print(re.findall(a, text))  # Only the captures: ['com', 'edu', 'com']
print(re.findall(b, text))  # The full matches: ['123.com', '456.edu', '999.com']
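If you want to keep the capturing group and still get the full matches, re.finditer() is another option, since each match object exposes both. A small sketch using the same sample text:

```python
import re

text = '123.com 456.edu 999.com'
pattern = re.compile(r'\d+\.(com|edu)')  # capturing group kept on purpose

# group(0) is the whole match; group(1) is the first capturing group.
full_matches = [m.group(0) for m in pattern.finditer(text)]
suffixes = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)  # ['123.com', '456.edu', '999.com']
print(suffixes)      # ['com', 'edu', 'com']
```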
A quick scan through the regular expression documentation might be worthwhile for you. A few items that seem relevant here:
(?:...) # Non-capturing group.
...{4,} # Match something 4 or more times.
\w # Word character.
\d # Digit
\b[^\W_][\w.]{3,}[^\W_]#[^\W_]+\.(?:com|org|edu|co\.in)\b
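As a quick check, the pattern above can be tried with re.findall(). Because its only group is non-capturing, the full addresses come back. The sample addresses below are invented for illustration, and # is used as the separator exactly as written in this post:

```python
import re

EMAIL_RE = re.compile(r"\b[^\W_][\w.]{3,}[^\W_]#[^\W_]+\.(?:com|org|edu|co\.in)\b")

# Invented samples: two that follow the rules, one starting with '_', one too short.
text = "john.doe#example.com _bad1#site.org abc#x.edu user5#mail.co.in"

matches = EMAIL_RE.findall(text)
print(matches)
```

Note that this pattern also requires the username to end with a letter or digit, which is stricter than the rules stated in the question.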
