regex for email parsing in python - python

i'm asked to write regular expression which can catch multi-domain email addresses and implement it in python. so i came up with the following regular expression (and code;the emphasis is on the regex though), which i think is correct:
import re
regex = r'\b[\w|\.|-]+#([\w]+\.)+\w{2,4}\b'
input_string = "hey my mail is abc#def.ghi"
match=re.findall(regex,input_string)
print match
now when i run this (using a very simple mail) it doesn't catch it!!
instead it shows an empty list as the output. can somebody tell me where did i go wrong in the regular expression literal?

Here's a simple one to start you off with
regex = r'\b[\w.-]+?#\w+?\.\w+?\b'
re.findall(regex,input_string) # ['abc#def.ghi']
The problem with your original one is that you don't need the | operator inside a character class ([..]). Just write [\w|\.|-] as [\w.-] (If the - is at the end, you don't need to escape it).
Next there are way too many variations on legitimate domain names. Just look for at least one period surrounded by word characters after the # symbol:
#\w+?\.\w+?\b

Related

Regex in python for validating mail

I am learning regex for validating an email id, for which the regex I made is the following:
regex = r'^([a-zA-Z0-9_\\-\\.]+)#([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,})$'
Here someone#...com is showing as valid but it should not, how can fix this?
I would recommend the regular expression suggested on this site which properly shows that the email someone#...com is invalid, I quickly wrote up an example using their suggestion below, happy coding!
>>>import re
>>>email = "someone#...com"
>>>regex = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
>>>print(re.match(regex, email))
None
The reason it matches someone#...com is that the dot is in the character class here #([a-zA-Z0-9_\\-\\.]+) and is repeated 1 or more times and it can therefore also match ...
What you can do is place the dot after the character class, and use that whole part in a repeating group.
If you put the - at the end you don't have to escape it.
Note that that character class at the start also has a dot.
^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$
Regex demo

How to have 2 or more regular expressions with exclusive OR in Python?

I have a regex to find an email mailto:(.*)\"|(\S+#+\S*)|(.{1,40}#.{1,40}) on several HTML sources.
Having the string Email: <u><a href="mailto:test#test.com">email me the piece mailto:(.*)\" works great,
I would want it to stop there and return the value instead of continuing with the other expressions. Is there something like an XOR operator or a way to do this since I would have to add more pieces eventually?
I tried here: http://pythex.org/
Regex should do this naturally. To illustrate this point, one of the easy ways of matching a word with one exception is to precede your match with the exception and use an alternation.
For instance in my email program I need to sort all emails with the subject line /labels?/ to another folder. However some of my contacts never learned how to spell (apparently) and I also sort /lables?/ However I noticed that emails containing the subject line Available were also being picked up by this filter.
I could have done /\blables?/ but preferred instead to catch available and handle it separately, so I did:
`/available|(lables?)|(labels?)/`
This alternation will match available, but only match and capture lable, lables, label, or labels.
As Adam Smith said, the problem you're having is that your last alternation .{1,40}#.{1,40} begins matching well before the rest of your alternations, so it consumes the text and that's the match that returns.
To overcome this, you could modify the other partial patterns so that they match just as early, by prepending .*? e. g. .*?mailto:(.*)\"|.*?(\S+#+\S*)|(.{1,40}#.{1,40}).
Or, perhaps somewhat less convoluted, you could just search for one after the other:
string = 'Email: <u><a href="mailto:test#test.com">email me'
m = re.search('mailto:(.*)\"', string) \
or re.search('(\S+#+\S*)', string) \
or re.search('(.{1,40}#.{1,40})', string)
print(m.group(1))

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
if "E" in l:
p = l.index("E")
if isinstance((p+1), int () is True:
if isinstance((p+2), int () is True:
delp = p+3
a = p-3
del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
get in index where "E" and check if after "E" there are 2 integers.
Then delete everything after the second integer.
However this will not work if there is an "E" anyplace else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
thanks for everyone answers.
I was going in other direction but it is also not working.
if someone could see why it would be awesome. It is much better to learn by doing then just coping what others write :)
this is what I came up with:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
anyone can tell me why this isn't working. I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match('(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
# The '0' means we want the first match found
index_of_Exx = match_object.end(0)
truncated_filename = filename[:index_of_Exx]
# Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
...is because 'i' contains a character from the string 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which errors out.

Python Regex working different depending on the implementation?

I'm working on a file parser that needs to cut out comments from JavaScript code. The thing is it has to be smart so it won't take '//' sequence inside string as the beggining of the comment. I have following idea to do it:
Iterate through lines.
Find '//' sequence first, then find all strings surrounded with quotes ( ' or ") in line and then iterate through all string matches to check if the '//' sequence is inside or outside one of those strings. If it is outside of them it's obvious that it'll be a proper comment begining.
When testing code on following line (part of bigger js file of course):
document.getElementById("URL_LABEL").innerHTML="<a name=\"link\" href=\"http://"+url+"\" target=\"blank\">"+url+"</a>";
I've encountered problem. My regular expression code:
re_strings=re.compile(""" "
(?:
\\.|
[^\\"]
)*
"
|
'
(?:
[^\\']|
\\.
)*
'
""",re.VERBOSE);
for s in re.finditer(re_strings,line):
print(s.group(0))
In python 3.2.3 (and 3.1.4) returns the following strings:
"URL_LABEL"
"<a name=\"
" href=\"
"+url+"
" target=\"
">"
"</a>"
Which is obviously wrong because \" should not exit the string. I've been debugging my regex for quite a long time and it SHOULDN'T exit here. So i used RegexBuddy (with Python compatibility) and Python regex tester at http://re-try.appspot.com/ for reference.
The most peculiar thing is they both return same, correct results other than my code, that is:
"URL_LABEL"
"<a name=\"link\" href=\"http://"
"\" target=\"blank\">"
"</a>"
My question is what is the cause of those differences? What have I overlooked? I'm rather a beginer in both Python and regular expressions so maybe the answer is simple...
P.S. I know that finding if the '//' sequence is inside string quotes can be accomplished with one, bigger regex. I've already tried it and met the same problem.
P.P.S I would like to know what I'm doing wrong, why there are differences in behaviour of my code and regex test applications, not find other ideas how to parse JavaScript code.
You just need to use a raw string to create the regex:
re_strings=re.compile(r""" "
etc.
"
""",re.VERBOSE);
The way you've got it, \\.|[^\\"] becomes the regex \.|[^\"], which matches a literal dot (.) or anything that's not a quotation mark ("). Add the r prefix to the string literal and it works as you intended.
See the demo here. (I also used a raw string to make sure the backslashes appeared in the target string. I don't know how you arranged that in your tests, but the backslashes obviously are present; the problem is that they're missing from your regex.)
you cannot deal with matching quotes with regex ... in fact you cannot guarantee any matching pairs of anything(and nested pairs especially) ... you need a more sophisticated statemachine for that(LLVM, etc...)
source: lots of CS classes...
and also see : Matching pair tag with regex for a more detailed explanation
I know its not what you wanted to hear but its basically just the way it is ... and yes different implementations of regex can return different results for stuff that regex cant really do

REGEX: Parsing n digits with non numeric word boundaries

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some xml files, but have ran into a bit of a speed bump. I will show an example of my xml:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant and non relevant xml code. Focus primarily on the CustomerID and OrderId.
My issue lies in parsing a string, similar to the above statement. I have a regexParse definition that works perfectly. However it is not intuitive. I need to match only the part of the string that contains 44444444.
My Current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is 1) find 8 digits 2) if the some word boundary is non numeric after that matches CustomerId return it.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue, or point me in the right path (in words of a guide or something along the lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
import re
pattern = "<(\w+)>(\d+)<"
print re.findall(pattern,txt)
#output [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a look-back operator in your regex to make it easy for a human to read:
import re
a = re.compile("(?<=OrderId>)\\d{6}")
a.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['123456']
b = re.compile("(?<=CustomerId>)\\d{8}")
b.findall("<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>")
['44444444']
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').

Categories

Resources