Colliding regex for emails (Python)

I'm trying to grab both usernames (such as abc123#) and emails (such as abc123#company.com) in the same Pythonic regex.
Here's an example statement:
abc123# is a researcher at abc123#company.com doing cool work.
Regex used:
For username:
re.match("^([A-Za-z])+([#]){1}$")
For email:
re.match("^([A-Za-z0-9-_])+(#company.com){1}$")
In most cases, the username gets grabbed but not the email address (I'm trying to grab them as two separate entities). Any ideas what's going on?

Actually, you have a lot of groups, repetition counts and start/end anchors in your regexes that are not really necessary. These two are enough to find each in the input string.
For user: [A-Za-z0-9]+#
For email: [A-Za-z0-9-_]+#company.com
If, however, you want your groupings, these versions will work:
For user: ([A-Za-z0-9])+(#)
For email: ([A-Za-z0-9-_])+(#company.com)
Disclaimer: I have tested this only on Java, as I am not so familiar with Python.

In your patterns you use anchors ^ and $ to assert the start and end of the string.
Removing the anchors leaves this for the username pattern: ([A-Za-z])+([#]){1}
Here, you can omit the {1} and the capture groups. Note that in the example, abc123# has digits that you are not matching.
Still, using [A-Za-z0-9]+# will get a partial match in the email abc123#company.com. To prevent that, you can use a right-hand whitespace boundary.
The username pattern might look like
\b[A-Za-z0-9]+#(?!\S)
\b A word boundary
[A-Za-z0-9]+ Match 1+ occurrences of the listed (including the digits)
# Match literally
(?!\S) Negative lookahead, assert that there is no non-whitespace character to the right
For the email address, using a character class like [A-Za-z0-9-_] is quite strict.
If you want a broad match, you might use:
[^\s#]+#[^\s#]+\.[a-z]{2,}
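As a quick check, here is a minimal sketch that runs both patterns from this answer against the example sentence from the question. The variable names are mine, and the naive pattern is included only to show the partial match inside the email.

```python
import re

text = "abc123# is a researcher at abc123#company.com doing cool work."

# Naive username pattern: also matches the start of the email address.
naive = re.findall(r"[A-Za-z0-9]+#", text)

# Username pattern with a right-hand whitespace boundary.
usernames = re.findall(r"\b[A-Za-z0-9]+#(?!\S)", text)

# Broad email pattern.
emails = re.findall(r"[^\s#]+#[^\s#]+\.[a-z]{2,}", text)

print(naive)      # ['abc123#', 'abc123#']
print(usernames)  # ['abc123#']
print(emails)     # ['abc123#company.com']
```

With the lookahead in place, the two patterns no longer collide: each one picks up only its own kind of token.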

Related

Python regex if match for more than one criteria in the same regex string

I am currently learning Python and doing some exercises, and I have the following problem. I take user input for a password, which should be at least 8 characters long and contain a capital letter, a small letter and a special character.
What I would like to understand is: can I combine all of the above in one regex as below, or do I need to list each and every case separately (see below)?
Using only one:
whole_check = re.compile(r'''(
[A-Z] #Check for capital letter
\d #Check for number
\W #check for special character)''', re.VERBOSE)
So how can I do multiple if-checks here? For example:
if not [A-Z]:
    do something
if not \d:
    do something
The only other option is if I define each category in a separate variable:
cap_letter = re.compile(r'[A-Z]')
small_letter = re.compile(r'[a-z]')
Thanks for clearing this for me.
See Regex for password policy. Generally the answer is: yes, you could put it into one regex, but you should consider not doing that, as it will be much easier to maintain and read/understand in a week if you don't do that :)
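To make the trade-off concrete, here is a rough sketch of both approaches. The names CHECKS, password_problems and SINGLE are hypothetical, and the rules are only the ones listed in the question (length 8, capital letter, small letter, special character, with \W standing in for "special"). The separate checks can tell you which rule failed; the single regex with lookaheads can only say yes or no.

```python
import re

# One small pattern per rule, so a failure can be reported by name.
CHECKS = {
    "an uppercase letter": re.compile(r"[A-Z]"),
    "a lowercase letter": re.compile(r"[a-z]"),
    "a special character": re.compile(r"\W"),
}

def password_problems(password):
    """Return a list of the rules the password does not satisfy."""
    problems = []
    if len(password) < 8:
        problems.append("at least 8 characters")
    for description, pattern in CHECKS.items():
        if not pattern.search(password):
            problems.append(description)
    return problems

# The same policy as a single regex: each lookahead asserts one rule.
SINGLE = re.compile(r"(?=.*[A-Z])(?=.*[a-z])(?=.*\W).{8,}")

print(password_problems("Abcdef1!"))         # []
print(password_problems("abc"))
print(bool(SINGLE.fullmatch("Abcdef1!")))    # True
```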

How can I avoid selecting email Ids with a particular domain name with regex

I have a list of e-mail ids among which I have to select only those which do not have ruba.com as domain name with regex. For example, if I have ads#gmail.com, dgh#rubd.com and ert#ruba.com, then my regular expression should select the first two ids. What should be the regular expression for this problem?
I have tried with two expressions:
[a-zA-Z0-9_.+-]+#[^(ruba)]+.[a-zA-Z0-9-.]+
and
[a-zA-Z0-9_.+-]+#[^r][^u][^b][^a]+.[a-zA-Z0-9-.]+
Neither of the two expressions fulfilled my requirement.
I assume that by email ID you mean the part before the # symbol, otherwise that would be a full email address.
.+(?=#)(?!#ruba\.com)
. the dot character is a special symbol for regex engines; it matches any character (except a newline, by default)
+ also known as the Kleene plus, says you want one or more instances of the preceding symbol, in our case .; basically you are saying "give me every char"
(?=#) is a positive lookahead, i.e. a special search feature that makes sure that what follows is #; I'm using it to take the cursor to the position of # and "stop" capturing, otherwise + would go on indefinitely
(?!#ruba\.com) is a negative lookahead, i.e. a special search feature that makes sure that what follows is not (!) #ruba\.com; I'm escaping the dot so that it is not confused with the match-any-character symbol I was talking about before
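A small sketch of how this pattern behaves on the three addresses from the question, assuming each address is tested on its own (the names local_part and results are mine):

```python
import re

local_part = re.compile(r".+(?=#)(?!#ruba\.com)")
addresses = ["ads#gmail.com", "dgh#rubd.com", "ert#ruba.com"]

results = {}
for address in addresses:
    m = local_part.search(address)
    # Keep the matched part before '#', or None when the domain is excluded.
    results[address] = m.group(0) if m else None

print(results)
# {'ads#gmail.com': 'ads', 'dgh#rubd.com': 'dgh', 'ert#ruba.com': None}
```

Note that both lookaheads are zero-width, so the match contains only the part before the # symbol.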
You could use a negative lookahead to ensure that you do not match the domain ruba.com.
The negative lookahead (?!ruba\.com$) rejects the domain that you want to exclude. Also, because emails typically have more than word characters (such as hyphens and periods), you would be better off using [\w\.\-] rather than just \w.
^[\w\.\-]+#(?!ruba\.com$)[\w\.\-]+\.(?:com|net|org|edu)$
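Here is a minimal sketch of filtering a list this way, assuming the goal is the one stated in the question (exclude exactly the domain ruba.com) and that each list entry is a single address. The name not_ruba is hypothetical.

```python
import re

# Anchored pattern: local part, '#', then any domain except exactly ruba.com.
not_ruba = re.compile(r"^[\w.\-]+#(?!ruba\.com$)[\w.\-]+\.(?:com|net|org|edu)$")

addresses = ["ads#gmail.com", "dgh#rubd.com", "ert#ruba.com"]
selected = [a for a in addresses if not_ruba.match(a)]

print(selected)  # ['ads#gmail.com', 'dgh#rubd.com']
```

The $ inside the lookahead keeps a longer domain that merely starts with ruba.com from being rejected by accident.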

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me with a regular expression pattern to match all words ending in s, and one for words that start with a and end with a (like ana)?
How do I write ending?
Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters" and (?:ing|s) is a non-capturing group of either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but at least every alphabetical character. Looking up in the python documentation on regexes you can find \w which is [a-zA-Z0-9_], which fits my impression of a word character. There you can also find \b which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b which matches a word. We still need to "formalize" ... which we want to have the meaning of "one or more characters", which directly translates to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things like the sing in singer, which should not match our regex. What is happening? As you probably already saw, the | can't know "which part it should or": the pattern is read as \b\w+ing or s\b. So we need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
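A quick way to see the precedence problem, as a small sketch (the test word singer is my own choice):

```python
import re

word = "singer"

# Without grouping, | splits the whole pattern: "\b\w+ing" OR "s\b".
ungrouped = re.findall(r"\b\w+ing|s\b", word)

# With a (non-capturing) group, the alternation applies only to the ending.
grouped = re.findall(r"\b\w+(?:ing|s)\b", word)

print(ungrouped)  # ['sing']
print(grouped)    # []
```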
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into a non-greedy version. If you know what the difference between greedy and non-greedy is, skip this paragraph. Consider the following: AaAAbA, and we want to match the things enclosed by the capital letter A. A naive try: A\w+A, so one or more word characters enclosed by A. This can match AaA, but it also matches all of AaAAbA, because A is still something that can be matched by \w. Without further configuration, the quantifiers *, + and ? all try to match as much as possible. Sometimes, like in the A example, you don't want that; you can then use a ? after the quantifier to signal that you want a non-greedy version, a version that matches as little as possible.
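The AaAAbA example can be tried directly; a short sketch:

```python
import re

text = "AaAAbA"

greedy = re.findall(r"A\w+A", text)   # + grabs as much as possible
lazy = re.findall(r"A\w+?A", text)    # +? grabs as little as possible

print(greedy)  # ['AaAAbA']
print(lazy)    # ['AaA', 'AbA']
```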
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character) you often need to be careful not to match too much.
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
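With re.findall the difference is easy to see, because findall returns the group contents whenever the pattern contains a capturing group. A small sketch:

```python
import re

text = "fishing words"

captured = re.findall(r"\b\w+(ing|s)\b", text)       # returns only the group
whole_words = re.findall(r"\b\w+(?:ing|s)\b", text)  # returns the full match

print(captured)     # ['ing', 's']
print(whole_words)  # ['fishing', 'words']
```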
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall("\b\w+(?:ing|s)\b", "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why, maybe someone else can explain that. Regexes are an arcane thing; only use them for tasks that are easy and easy to test.
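A likely explanation, assuming the pattern was passed as an ordinary (non-raw) string literal: in a normal Python string "\b" is the backspace character, so the regex engine never sees a word-boundary token. Writing the pattern as a raw string avoids that. A minimal sketch (in the plain version \w is written as \\w only to avoid an invalid-escape warning):

```python
import re

text = "fishing words"

# In a normal string literal, "\b" is a single backspace character (\x08).
assert "\b" == "\x08"

plain = "\b\\w+(?:ing|s)\b"    # backspace, \w+, group, backspace
raw = r"\b\w+(?:ing|s)\b"      # real word boundaries

plain_result = re.findall(plain, text)
raw_result = re.findall(raw, text)

print(plain_result)  # []
print(raw_result)    # ['fishing', 'words']
```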
Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/

EMAIL id matcher-python regular expression cant figure out

I am trying to match a specific type of email address of the form username#siteaddress,
where username is a non-empty string of minimum length 5 built from the characters {a-z A-Z 0-9 . _}. The username cannot start with '.' or '_'. The site address is built from a prefix, which is a non-empty string built from the characters {a-z A-Z 0-9} (excluding the brackets), followed by one of the following suffixes: {".com", ".org", ".edu", ".co.in"}.
The following code doesn't work:
list=re.findall("[a-zA-Z0-9][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._]*#[a-zA-Z0-9][a-zA-Z0-9]*\.(com|edu|org|co\.in)",raw_input())
However, the following works fine when I add a '?:' in the last parenthesis, and I can't figure out the reason:
list=re.findall("[a-zA-Z0-9][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._][a-zA-Z0-9._]*#[a-zA-Z0-9][a-zA-Z0-9]*\.(?:com|edu|org|co\.in)",raw_input())
You shouldn't roll your own email address regex - it's a notoriously difficult thing to do correctly. See http://www.regular-expressions.info/email.html for a discussion on the topic.
To summarise that article, this is usually good enough: \b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
This one is even more precise (the author claims 99.99% of email addresses):
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
And this is the version that literally matches all possible RFC 5322 email addresses:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
| "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
# (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
| \[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])+)
\])
The last one is clearly overkill, but it gives you an idea of the complexity involved.
Your question is less about email-matching than about the behavior of findall, which varies depending on whether the regular expression contains capturing groups. Here's a simple example:
import re
text = '123.com 456.edu 999.com'
a = r'\d+\.(com|edu)' # A capturing group.
b = r'\d+\.(?:com|edu)' # A non-capturing group.
print(re.findall(a, text)) # Only the captures: ['com', 'edu', 'com']
print(re.findall(b, text)) # The full matches: ['123.com', '456.edu', '999.com']
A quick scan through the regular expression documentation might be worthwhile for you. A few items that seem relevant here:
(?:...) # Non-capturing group.
...{4,} # Match something 4 or more times.
\w # Word character.
\d # Digit
Putting those pieces together, one possible pattern:
\b[^\W_][\w.]{3,}[^\W_]#[^\W_]+\.(?:com|org|edu|co\.in)\b
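A small sketch of this pattern with re.findall; the sample text is made up for illustration. Because the last group is non-capturing, findall returns the full addresses:

```python
import re

pattern = re.compile(r"\b[^\W_][\w.]{3,}[^\W_]#[^\W_]+\.(?:com|org|edu|co\.in)\b")

# Hypothetical input: two addresses that fit the rules, one username too short.
text = "contact john.doe_1#example.co.in or a1b2c#site.org but not ab#x.com"

found = pattern.findall(text)
print(found)  # ['john.doe_1#example.co.in', 'a1b2c#site.org']
```

Note that \b only anchors at a word boundary, so inside a larger text this can still match the tail of a username that begins with a '.'; anchor with ^ and $ if each input is a single address.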

multiple negative lookahead assertions

I can't figure out how to do multiple lookaround for the life of me. Say I want to match a variable number of numbers following a hash but not if preceded by something or followed by something else. For example I want to match #123 or #12345 in the following. The lookbehinds seem to be fine but the lookaheads do not. I'm out of ideas.
import re

matches = ["#123", "This is #12345",
           # But not
           "bad #123", "No match #12345", "This is #123-ubuntu",
           "This is #123 0x08"]
pat = '(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
for i in matches:
    print(i, re.search(pat, i))
You should have a look at the captures as well. I bet for the last two strings you will get:
#12
This is what happens:
The engine checks the two lookbehinds - they don't match, so it continues with the capturing group #[0-9]+ and matches #123. Now it checks the lookaheads. They fail as desired. But now there's backtracking! There is one variable in the pattern and that is the +. So the engine discards the last matched character (3) and tries again. Now the lookaheads are no problem any more and you get a match. The simplest way to solve this is to add another lookahead that makes sure that you go to the last digit:
pat = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'
Note the use of a raw string (the leading r) - it doesn't matter in this pattern, but it's generally a good practice, because things get ugly once you start escaping characters.
EDIT: If you are using or willing to use the regex package instead of re, you get possessive quantifiers which suppress backtracking:
pat = r'(?<!bad )(?<!No match )(#[0-9]++)(?! 0x0)(?!-ubuntu)'
It's up to you which you find more readable or maintainable. The latter will be marginally more efficient, though. (Credits go to nhahtdh for pointing me to the regex package.)
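To see the backtracking effect and the fix side by side, here is a short sketch using only the standard re module. The helper first_match is a hypothetical name of mine.

```python
import re

samples = ["#123", "This is #12345",
           "bad #123", "No match #12345",
           "This is #123-ubuntu", "This is #123 0x08"]

original = r'(?<!bad )(?<!No match )(#[0-9]+)(?! 0x0)(?!-ubuntu)'
fixed = r'(?<!bad )(?<!No match )(#[0-9]+)(?![0-9])(?! 0x0)(?!-ubuntu)'

def first_match(pattern, text):
    """Return the captured group of the first match, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

for s in samples:
    # original gives '#12' for the last two samples; fixed gives None.
    print(repr(s), first_match(original, s), first_match(fixed, s))
```

On Python 3.11 and later, the standard re module also accepts possessive quantifiers such as ++, so the possessive variant should work there without the regex package.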
