EMAIL id matcher-python regular expression cant figure out - python

I am trying to match a specific type of email address of the form username#siteaddress,
where username is a non-empty string of minimum length 5 built from the characters {a-z A-Z 0-9 . _}. The username cannot start with '.' or '_'. The site address consists of a prefix, which is a non-empty string built from the characters {a-z A-Z 0-9} (excluding the brackets), followed by one of the following suffixes: {".com", ".org", ".edu", ".co.in"}.
The following code doesn't work:
list=re.findall("[a-zA-Z0-9][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._]*#[a-zA-Z0-9][a-zA-Z0-9]*\.(com|edu|org|co\.in)",raw_input())
However, the following works fine when I add '?:' inside the last parenthesis, and I can't figure out why:
list=re.findall("[a-zA-Z0-9][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._]*#[a-zA-Z0-9][a-zA-Z0-9]*\.(?:com|edu|org|co\.in)",raw_input())

You shouldn't roll your own email address regex - it's a notoriously difficult thing to do correctly. See http://www.regular-expressions.info/email.html for a discussion on the topic.
To summarise that article, this is usually good enough: \b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
This one is even more precise (the author claims 99.99% of email addresses):
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
And this is the version that literally matches all possible RFC 5322 email addresses:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
| "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
# (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
| \[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])+)
\])
The last one is clearly overkill, but it gives you an idea of the complexity involved.
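As a rough illustration of the "usually good enough" pattern, the sketch below runs it with Python's re module. Two assumptions: the sample text is made up, and the separator is written as # exactly as it appears in the patterns above. Because the character classes list only uppercase letters, the pattern needs the case-insensitive flag to match mixed-case input.

```python
import re

# The "usually good enough" pattern from above, copied verbatim.
SIMPLE_EMAIL = r"\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b"

# Made-up sample text, for illustration only.
sample = "Contact: John.Doe#Example.COM, or nobody#nowhere"

# With IGNORECASE the uppercase-only classes also accept lowercase letters.
with_flag = re.findall(SIMPLE_EMAIL, sample, re.IGNORECASE)

# Without the flag, the lowercase letters in the sample prevent any match.
without_flag = re.findall(SIMPLE_EMAIL, sample)

print(with_flag)
print(without_flag)
```

The second address is rejected in both runs because it has no dot-separated suffix after the separator.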

Your question is less about email-matching than about the behavior of findall, which varies depending on whether the regular expression contains capturing groups. Here's a simple example:
import re
text = '123.com 456.edu 999.com'
a = r'\d+\.(com|edu)'    # A capturing group.
b = r'\d+\.(?:com|edu)'  # A non-capturing group.
print(re.findall(a, text))  # Only the captures: ['com', 'edu', 'com']
print(re.findall(b, text))  # The full matches: ['123.com', '456.edu', '999.com']
A quick scan through the regular expression documentation might be worthwhile for you. A few items that seem relevant here:
(?:...) # Non-capturing group.
...{4,} # Match something 4 or more times.
\w # Word character.
\d # Digit
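Putting two of those items together, here is one possible way to shorten the pattern from the question. This is only a sketch: the name USERNAME_AT_SITE and the sample text are hypothetical, and the character classes are copied from the question rather than replaced with \w, since \w would also allow an underscore in the site prefix.

```python
import re

# {4,} replaces the four repeated classes plus the trailing "*",
# and (?:...) keeps findall returning whole matches instead of the suffix.
USERNAME_AT_SITE = re.compile(
    r"[a-zA-Z0-9][a-zA-Z0-9._]{4,}#[a-zA-Z0-9]+\.(?:com|edu|org|co\.in)"
)

# Made-up input: two valid addresses and two that break the rules.
sample = "alice.smith#example.com _bad#x.org bob#short.edu john_doe1#uni.co.in"

matches = USERNAME_AT_SITE.findall(sample)
print(matches)
```

Here `_bad` and `bob` are skipped because the username part is shorter than five characters once the leading-character rule is applied.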

\b[^\W_][\w.]{3,}[^\W_]#[^\W_]+\.(?:com|org|edu|co\.in)\b
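The pattern above is posted without explanation, so the following is just a quick way to try it out on made-up input. Note that it is slightly stricter than the rules in the question: [^\W_] at both ends means the username must also end with a letter or digit, and the \b anchors require a word boundary on each side.

```python
import re

# The pattern from above, copied verbatim.
STRICT = re.compile(r"\b[^\W_][\w.]{3,}[^\W_]#[^\W_]+\.(?:com|org|edu|co\.in)\b")

# Hypothetical sample: the middle address has a username that is too short.
sample = "hello_1#site.com ab#x.org john.doe#uni.co.in"

found = STRICT.findall(sample)
print(found)
```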

Related

Colliding regex for emails (Python)

I'm trying to grab both usernames (such as abc123#) and emails (such as abc123#company.com) in the same Python regex.
Here's an example statement:
abc123# is a researcher at abc123#company.com doing cool work.
Regex used:
For username:
re.match("^([A-Za-z])+([#]){1}$")
For email:
re.match("^([A-Za-z0-9-_])+(#company.com){1}$")
In most cases, the username gets grabbed but not the email address (I'm trying to grab them as two separate entities). Any ideas what's going on?
Actually you have a lot of groups and repetition counts and start/end boundaries in your regexes that are not really necessary. These 2 are just enough to find each in the input string.
For user: [A-Za-z0-9]+#
For email: [A-Za-z0-9-_]+#company.com
If, however, you want your groupings, these versions will work:
For user: ([A-Za-z0-9])+(#)
For email: ([A-Za-z0-9-_])+(#company.com)
Disclaimer: I have tested this only on Java, as I am not so familiar with Python.
In your patterns you use anchors ^ and $ to assert the start and end of the string.
Removing the anchors leaves this for the username pattern: ([A-Za-z])+([#]){1}
Here, you can omit the {1} and the capture groups. Note that in the example, abc123# has digits that you are not matching.
Still, using [A-Za-z0-9]+# will get a partial match in the email abc123#company.com. To prevent that, you can use a right-hand whitespace boundary.
The username pattern might look like
\b[A-Za-z0-9]+#(?!\S)
\b A word boundary
[A-Za-z0-9]+ Match 1+ occurrences of the listed (including the digits)
# Match literally
(?!\S) Negative lookahead, assert that there is no non-whitespace char to the right
For the email address, using a character class like [A-Za-z0-9-_] is quite strict.
If you want a broad match, you might use:
[^\s#]+#[^\s#]+\.[a-z]{2,}
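To see the two patterns from this answer side by side, the sketch below runs them over the example sentence from the question. The variable names are hypothetical; the patterns are copied as written above.

```python
import re

sentence = "abc123# is a researcher at abc123#company.com doing cool work."

# Username: must be followed by whitespace or the end of the string.
USERNAME = re.compile(r"\b[A-Za-z0-9]+#(?!\S)")

# Broad email match: no whitespace or separator on either side.
EMAIL = re.compile(r"[^\s#]+#[^\s#]+\.[a-z]{2,}")

usernames = USERNAME.findall(sentence)
emails = EMAIL.findall(sentence)

print(usernames)
print(emails)
```

Without the anchors ^ and $, neither pattern needs to cover the whole string, so both entities can be picked out of the same sentence.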

python - regex putting repeating patterns in a single group

I'm trying to parse a string in regex and am 99% there.
my test string is
1
1234 1111 5555 88945
172.255.255.255 from 172.255.255.255 (1.1.1.1)
Origin IGP, localpref 300, valid, external, best
rx pathid: 0, tx pathid: 0x0
my current regex pattern is:
(?P<as_path>(\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>\S+,\s{0,4})
I'm using regex101 to test and have a link to the test here: https://regex101.com/r/iGM8ye/1
First, I currently get a group 2 that I don't want. Could someone tell me why I'm getting this group and how to remove it?
Second, in the attributes I want to match the words "valid, external, best", but currently my pattern only matches "valid,". I thought adding the repeat within the group would have matched all three, but it hasn't.
How would I match the repeated "string, string, string," (string, comma, space) into one group?
Thanks
EDIT
Desired output
as_path : 1234 1111 5555 88945
peer_addr : 172.255.255.255
peer_rid : 1.1.1.1
local_pref : 300
attribs : valid, external, best
attribs may also be just "valid, external", or just "external", or another entry in the format (string, comma, space).
Try Regex: (?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>[\S]+,(?: [\S]+,?)*){0,4}
The regex in the question had a capturing group (group 2) for (\d{4,10}\s). It is now changed to a non-capturing group: (?:\d{4,10}\s)
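A small sketch of why the extra group appears, using only the as_path part of the pattern and a made-up input string. When a capturing group is repeated, Python keeps only the text from its last iteration.

```python
import re

# Hypothetical input: just the AS path, with a trailing space.
as_path_text = "1234 1111 5555 88945 "

# Inner group is capturing, so it shows up as group 2.
capturing = re.match(r"((\d{4,10}\s){1,20})", as_path_text)

# Inner group is non-capturing, so only group 1 exists.
non_capturing = re.match(r"((?:\d{4,10}\s){1,20})", as_path_text)

print(capturing.groups())
print(non_capturing.groups())
```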
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}(?:\.\d{0,3}){3}).*\((?P<peer_rid>\d{0,3}(?:\.\d{0,3}){3})\)\s+.*localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\S+(?:,\s+\S+){2})
You were getting group 2 because your as_path group contained a group. I changed that to a non-capturing group.
I changed attribs to \S+(?:,\s+\S+){2}
This will match any non-space character one or more times \S+, followed by the following exactly twice:
,\s+\S+ the comma character, followed by the space character one or more times, followed by any non-space character one or more times
I changed peer_addr and peer_rid to \d{0,3}(?:\.\d{0,3}){3} instead of \d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}. This is a preference, but shortens the expression.
Without that last modification, you can use the following regex, which performs slightly better anyway:
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\S+(?:,\s+\S+){2})
You can also improve performance by using more specific tokens, as the following suggests (notice I also added the x modifier to make it more legible):
(?P<as_path>\d{4,10}(?:\s\d{4,10}){0,19})\s+
(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})[^)]*
\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+
.*localpref\s(?P<local_pref>\d+),\s+
(?P<attribs>\w+(?:,\s+\w+){2})
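In Python the x modifier corresponds to the re.VERBOSE flag. The sketch below compiles the multi-line pattern above with that flag and runs it against the test string from the question; the names ROUTE, sample and fields are hypothetical, and the sample is assumed to have no leading indentation.

```python
import re

# The multi-line pattern from above; re.VERBOSE ignores the line breaks
# and indentation inside the pattern.
ROUTE = re.compile(r"""
    (?P<as_path>\d{4,10}(?:\s\d{4,10}){0,19})\s+
    (?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})[^)]*
    \((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+
    .*localpref\s(?P<local_pref>\d+),\s+
    (?P<attribs>\w+(?:,\s+\w+){2})
""", re.VERBOSE)

sample = (
    "1\n"
    "1234 1111 5555 88945\n"
    "172.255.255.255 from 172.255.255.255 (1.1.1.1)\n"
    "Origin IGP, localpref 300, valid, external, best\n"
    "rx pathid: 0, tx pathid: 0x0\n"
)

match = ROUTE.search(sample)
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(name, ":", value)
```

Because every group is either named or non-capturing, groupdict() returns exactly the five fields listed under "Desired output".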
You get that separate group because you are repeating a capturing group, where only the last iteration is kept as the captured value, in this case 88945. You could make it non-capturing instead by using (?:...)
For the second part, you could use an alternation to match exactly one of the options: (?:valid|external|best)
Your pattern might look like:
(?P<as_path>(?:\d{4,10}\s){1,20})\s+(?P<peer_addr>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3}).*\((?P<peer_rid>\d{0,3}\.\d{0,3}\.\d{0,3}\.\d{0,3})\)\s+.*localpref\s(?P<local_pref>\d+),\s(?P<attribs>(?:valid|external|best)(?:,\s{0,4}(?:valid|external|best))+)
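The question also mentions that the attribute list may hold one, two, or three entries. One possible way to handle that, sketched here under the assumption that every attribute is a single word, is to capture the whole comma-separated run in one group and split it afterwards. The names ATTRIBS and parse_attribs are hypothetical, and this is a simplified pattern covering only the localpref line.

```python
import re

# One word, then any number of ", word" repetitions on the same line.
ATTRIBS = re.compile(
    r"localpref\s(?P<local_pref>\d+),\s+(?P<attribs>\w+(?:,[ ]+\w+)*)"
)

def parse_attribs(line):
    """Return the attributes after localpref as a list of words."""
    match = ATTRIBS.search(line)
    if match is None:
        return []
    return [item.strip() for item in match.group("attribs").split(",")]

three = parse_attribs("Origin IGP, localpref 300, valid, external, best")
one = parse_attribs("Origin IGP, localpref 100, external")

print(three)
print(one)
```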

Python regex if match for more than one criteria in the same regex string

I am currently learning Python and doing some exercises, and I have the following problem. I take user input for a password, which should be at least 8 characters long and contain a capital letter, a small letter and a special character.
What I would like to understand is whether I can combine all of the above in one regex, as below, or whether I need to list each case separately (see below).
Using only one:
whole_check = re.compile(r'''(
    [A-Z]  # Check for capital letter
    \d     # Check for number
    \W     # Check for special character
)''', re.VERBOSE)
So how can I do multiple "if" checks here? For example:
if not [A-Z]:
    do something
if not \d:
    do something
The only other option I see is to define each category in a separate variable:
cap_letter = re.compile(r'[A-Z]')
small_letter = re.compile(r'[a-z]')
Thanks for clearing this for me.
See Regex for password policy. Generally the answer is: yes, you could put it into one regex, but you should consider not doing that, as it will be much easier to maintain and read/understand in a week if you don't do that :)
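To make the trade-off concrete, here is a minimal sketch of both approaches. Everything in it is an assumption for illustration: the names CHECKS, failed_checks and SINGLE are hypothetical, and "special character" is taken to mean anything that is not an ASCII letter or digit.

```python
import re

# Separate, named checks: easy to read and easy to report on individually.
CHECKS = {
    "an uppercase letter": re.compile(r"[A-Z]"),
    "a lowercase letter": re.compile(r"[a-z]"),
    "a special character": re.compile(r"[^A-Za-z0-9]"),
}

def failed_checks(password):
    """Return a description of every rule the password breaks."""
    failures = []
    if len(password) < 8:
        failures.append("at least 8 characters")
    for name, pattern in CHECKS.items():
        if pattern.search(password) is None:
            failures.append(name)
    return failures

# The same policy as a single regex, using one lookahead per rule.
SINGLE = re.compile(r"(?=.*[A-Z])(?=.*[a-z])(?=.*[^A-Za-z0-9]).{8,}")

good = failed_checks("Passw0rd!")
bad = failed_checks("password")
single_good = SINGLE.fullmatch("Passw0rd!") is not None
single_bad = SINGLE.fullmatch("password") is not None

print(good, bad, single_good, single_bad)
```

The single regex only answers yes or no, while the separate checks can tell the user which rule failed, which is one reason the split version tends to be easier to maintain.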

[FORKING]Python Regex - Re.Sub and Re.Findall Interesting Challenges

Not sure if this is something that should be a bounty. I just want to understand regex better.
I checked the responses in the Regex to match pattern.one skip newlines and characters until pattern.two and Regex to match if given text is not found and match as little as possible threads and read about Tempered Greedy Token Solutions and Explicit Greedy Alternation Solutions on RexEgg, but admittedly the explanations baffled me.
I spent the last day fiddling mainly with re.sub (and with findall) because re.sub's behaviour is odd to me.
Problem 1:
Given Strings below with characters followed by / how would I produce a SINGLE regex (using only either re.sub or re.findall) that uses alternating capture groups which must use [\S]+/ to get the desired output
>>> string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
>>> string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
>>> string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
Desired Output Given the Conditions(!!)
tax-march-donald-trump-protest-
CONDITIONS: Must use alternating capture groups which must capture ([\S]+) or ([\S]+?)/ to capture the other groups but ignore them if they don't contain -
I'M WELL AWARE that it would be better to use re.findall('([\-]*(?:[^/]+?\-)+)[\d]+', string) or something similar but I want to know if I can use [\S]+ or ([\S]+) or ([\S]+?)/ and tell regex that if those are captured, ignore the result if it contains / or doesn't contain - While also having used an alternating capture group
I KNOW I don't need to use [\S]+ or ([\S]+) but I want to see if there is an extra directive I can use to make the regex reject some characters those two would normally capture.
Posted per request:
(?:(?!/)[\S])*-(?:(?!/)[\S])*
https://regex101.com/r/azrwjO/1
Explained
(?: # Optional group
(?! / ) # Not a forward slash ahead
[\S] # Not whitespace class
)* # End group, do 0 to many times
- # A dash must exist
(?: # Optional group, same as above
(?! / )
[\S]
)*
You could use
/([-a-z]+)-\d+
and take the first capturing group, see a demo on regex101.com.
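For comparison, the sketch below runs both suggested patterns over the three strings from the question. The variable names are hypothetical. Neither result reproduces the trailing hyphen in the stated desired output: the lookahead pattern returns the whole path segment including the digits, and the capture pattern returns the text before the final hyphen.

```python
import re

string_1 = 'variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/'
string_2 = 'variety.com/2017/biz/the/life/of/madam/green/news/tax-march-donald-trump-protest-1202031487/'
string_3 = 'variety.com/2017/biz/the/life/of/news/tax-march-donald-trump-protest-1202031487/the/days/of/our/lives'
urls = [string_1, string_2, string_3]

# First answer: non-space characters that are not "/", with a "-" required.
lookahead = re.compile(r"(?:(?!/)[\S])*-(?:(?!/)[\S])*")

# Second answer: capture the hyphenated words before the trailing digits.
capture = re.compile(r"/([-a-z]+)-\d+")

segments = [lookahead.findall(url) for url in urls]
slugs = [capture.search(url).group(1) for url in urls]

print(segments)
print(slugs)
```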

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/ (I am aware that the demo uses \author and not \maketitle).
In python, I try the following:
import re
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect this has something to do with my pattern, and that the SRE engine fails on it.
EDIT: Is there something wrong with my regex? Can it be improved to actually work? Still I get the same results on my machine.
EDIT 2: Can this be fixed somehow so that the pattern supports an optional "followed by" part, using ?: groups or ?= lookaheads, so that one can capture both?
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
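The quoted example can be tried directly in Python. One detail to be aware of in this sketch: when an optional group does not take part in the match, Python's re module reports it as None rather than as an empty string.

```python
import re

pattern = re.compile(r"Set(Value)?")

short = pattern.fullmatch("Set")
long_form = pattern.fullmatch("SetValue")

# Group 1 did not participate in the first match, so it is None.
short_group = short.group(1)

# In the second match, group 1 holds the optional part.
long_group = long_form.group(1)

print(short.group(0), short_group)
print(long_form.group(0), long_group)
```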
