Regex in python for validating mail - python

I am learning regex for validating an email id, for which the regex I made is the following:
regex = r'^([a-zA-Z0-9_\\-\\.]+)#([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,})$'
Here someone#...com is reported as valid, but it should not be. How can I fix this?

I would recommend the regular expression suggested on this site, which correctly reports the email someone#...com as invalid. I quickly wrote up an example using their suggestion below. Happy coding!
>>> import re
>>> email = "someone#...com"
>>> regex = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
>>> print(re.match(regex, email))
None

The reason it matches someone#...com is that the dot is inside the character class #([a-zA-Z0-9_\\-\\.]+), which is repeated one or more times, so it can also match ...
What you can do is place the dot after the character class, and use that whole part in a repeating group.
If you put the - at the end you don't have to escape it.
Note that that character class at the start also has a dot.
^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$
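As a quick check of the pattern above, here is a minimal sketch. The sample addresses other than someone#...com are made up for illustration, and the # separator is kept exactly as it appears in the patterns in this thread.

```python
import re

# Pattern from the answer above: every domain label must be followed by a single dot
pattern = re.compile(r"^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$")

# Hypothetical sample inputs; only the first one comes from the question
samples = ["someone#...com", "someone#example.com", "someone#mail.example.com"]

# Map each sample to True (valid) or False (invalid)
results = {s: bool(pattern.match(s)) for s in samples}

for s, ok in results.items():
    print(s, "->", "valid" if ok else "invalid")
```

Only someone#...com should be reported as invalid here, because no label precedes the first dot after the #.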

Related

Regular expression, negative look-around for weird email matching

I'm trying to make an email matcher, but the text contains many things like this:
https://site_1.com#site_2.com/xxxxx
I decided to use a negative look-around to get rid of these. My attempt is as follows:
regex = r"([a-zA-Z0-9\._-]+(?!https?://.*)#[a-zA-Z0-9\._-]\.[a-zA-Z0-9])"
My idea is that the negative look-around will fail to match anything like https://xxxxx#, but clearly I'm wrong. I did the following:
email_search = re.compile(regex)
email_search.search("https://siteA.com#siteB.com")
And the result is a match; the matched string is //siteA.com#siteB.com
I more or less have to use re.search because I'm working with obfuscated text, but in my understanding the negative look-ahead should do the trick. Please show me what I did wrong and how to do it correctly. Any help is appreciated!

Use negative look-aheads to prevent certain inputs from matching (i.e. "preconditions"):
regex = r"(?!https?://)<actual email regex here>"
You can chain them:
regex = r"(?!<exclude this>)(?!<exclude that>)(?!<and that>)<actual regex here>"
Apart from that - so, so, so many email matching regexes have been made by now that I would discourage you from inventing yet another one. Pick one from the pile.
The better ones would not allow things like https://site_1.com#site_2.com/xxxxx from the start, so you would not have to work around defects in your own creation.
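A minimal sketch of the precondition idea, assuming a deliberately loose placeholder pattern (\S+#\S+) instead of a real email regex. One caveat that seems worth checking: re.search retries at later positions, so a look-ahead only guards the position where a match starts. The (?<!\S) look-behind below is an added assumption of mine that pins the start to a token boundary; it is not part of the answer above.

```python
import re

url = "https://siteA.com#siteB.com"

# Placeholder pattern (assumption): anything#anything, with no whitespace
loose = re.compile(r"\S+#\S+")

# Precondition from the answer: the match must not start with http:// or https://
guarded = re.compile(r"(?!https?://)\S+#\S+")

# Added assumption: also require that the match starts at the beginning of a token
anchored = re.compile(r"(?<!\S)(?!https?://)\S+#\S+")

print(loose.match(url))      # matches: no precondition at all
print(guarded.match(url))    # None: the look-ahead rejects position 0
print(guarded.search(url))   # still matches, starting one character later
print(anchored.search(url))  # None: later start positions are ruled out

mixed = "see https://siteA.com#siteB.com or bob#example.com"
found = anchored.findall(mixed)
print(found)
```

With the token-boundary look-behind in place, search and findall skip the URL and keep the standalone address.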

Python forcing hex escape into regex statement [duplicate]

I am not sure why this is not working. Perhaps I am missing something with Python regex.
Here is my regex and an example string of what I want it to match:
PHONE_REGEX = "<(.*)>phone</\1>"
EXAMPLE = "<bar>phone</bar>"
I tested this match in isolation and it failed. I used an online regex tester and it matched. Am I simply missing something that is particular to Python regex?
Thanks!

You have to mark the string as a raw string, due to the \ in there, by putting an r in front of the regex:
m = re.match(r"<(.*)>phone</\1>", "<bar>phone</bar>")
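To see why the raw string matters, the sketch below compares the two literals. In a plain string literal, \1 is an octal escape for the character with code 1, so the re module never sees a backreference. The variable names are only for illustration.

```python
import re

plain = "<(.*)>phone</\1>"   # "\1" is turned into chr(1) by the string literal
raw = r"<(.*)>phone</\1>"    # backslash and "1" survive, so re sees a backreference

example = "<bar>phone</bar>"

print(repr(plain))                      # shows \x01 where the backreference should be
plain_match = re.match(plain, example)  # no match: the text contains no chr(1)
raw_match = re.match(raw, example)      # match: group 1 must repeat in the closing tag

print(plain_match)
print(raw_match.group(1))
```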

regex for email parsing in python

I've been asked to write a regular expression that can catch multi-domain email addresses and to implement it in Python. I came up with the following regular expression (and code; the emphasis is on the regex, though), which I think is correct:
import re
regex = r'\b[\w|\.|-]+#([\w]+\.)+\w{2,4}\b'
input_string = "hey my mail is abc#def.ghi"
match=re.findall(regex,input_string)
print(match)
Now when I run this (using a very simple address), it doesn't catch it! Instead it shows an empty list as the output. Can somebody tell me where I went wrong in the regular expression literal?

Here's a simple one to start you off with:
regex = r'\b[\w.-]+?#\w+?\.\w+?\b'
re.findall(regex,input_string) # ['abc#def.ghi']
The problem with your original one is that you don't need the | operator inside a character class ([..]). Just write [\w|\.|-] as [\w.-] (If the - is at the end, you don't need to escape it).
Next there are way too many variations on legitimate domain names. Just look for at least one period surrounded by word characters after the # symbol:
#\w+?\.\w+?\b
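The sketch below runs the simplified pattern and, as an aside that is not part of the answer, two assumed multi-domain variants. They illustrate a re.findall detail: when the pattern contains a capturing group, findall returns the group's text rather than the whole match, so a non-capturing group (?:...) is used to get full addresses back.

```python
import re

input_string = "hey my mail is abc#def.ghi"

# Pattern from the answer: no "|" needed inside the character class
simple = r'\b[\w.-]+?#\w+?\.\w+?\b'

# Assumed multi-domain variants (illustrative only)
capturing = r'\b[\w.-]+#(\w+\.)+\w{2,4}\b'        # findall returns group 1
non_capturing = r'\b[\w.-]+#(?:\w+\.)+\w{2,4}\b'  # findall returns the whole match

simple_hits = re.findall(simple, input_string)
group_hits = re.findall(capturing, input_string)
full_hits = re.findall(non_capturing, input_string)

print(simple_hits)  # ['abc#def.ghi']
print(group_hits)   # ['def.']
print(full_hits)    # ['abc#def.ghi']
```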

REGEX: Parsing n digits with non numeric word boundaries

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script to parse some XML files, but have run into a bit of a speed bump. Here is an example of my XML:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant and non relevant xml code. Focus primarily on the CustomerID and OrderId.
My issue lies in parsing a string similar to the one above. I have a regexParse definition that works perfectly; however, it is not intuitive. I need to match only the part of the string that contains 44444444.
My current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits; 2) if the word boundary after them is non-numeric and matches CustomerId, return them.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you could either help me out with my issue or point me down the right path (a guide or something along those lines). Any help is appreciated.
Mods, if this is in the wrong area, apologies; I wanted to post this in the Python discussion because I am not sure whether Python regex supports this functionality.
Thanks again all,
darcmasta

import re

txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
# Capture each tag name together with the digits it encloses
pattern = r"<(\w+)>(\d+)<"
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a look-behind assertion in your regex to make it easy for a human to read:
import re
a = re.compile("(?<=OrderId>)\\d{6}")
a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
b = re.compile("(?<=CustomerId>)\\d{8}")
b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
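Building on the look-behind idea, here is a hedged sketch that also adds a look-ahead for the closing tag, so only digits sitting directly between <CustomerId> and </CustomerId> are returned. The helper name element_digits is hypothetical, and the look-ahead is an addition that the answer above does not use.

```python
import re

xml = ("<....></...><...></...><OrderId>123456</OrderId><...></...>"
       "<CustomerId>44444444</CustomerId><...></...><...></...>")

def element_digits(tag, text):
    """Hypothetical helper: digits directly enclosed by <tag> and </tag>."""
    # The look-behind has a fixed width, which Python's re module requires
    pattern = rf"(?<=<{re.escape(tag)}>)\d+(?=</{re.escape(tag)}>)"
    return re.findall(pattern, text)

customer_ids = element_digits("CustomerId", xml)
order_ids = element_digits("OrderId", xml)

print(customer_ids)  # ['44444444']
print(order_ids)     # ['123456']
```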
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
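The difference is easy to verify. This sketch uses {8} instead of the {16} from the question, on the assumption that the goal is the 8-digit CustomerId, and it writes \\d in the plain literal so that only the \b escape differs between the two patterns.

```python
import re

text = "<CustomerId>44444444</CustomerId>"

plain = '\b\\d{8}\b'   # each "\b" becomes one backspace character (\x08)
raw = r'\b\d{8}\b'     # re receives backslash + "b", i.e. a word boundary

print(len('\b'), len(r'\b'))   # 1 2

plain_hits = re.findall(plain, text)   # looks for literal backspaces around the digits
raw_hits = re.findall(raw, text)       # looks for word boundaries around the digits

print(plain_hits)  # []
print(raw_hits)    # ['44444444']
```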

