Python Regex giving false negative - python

I am trying to regex match phone numbers and I have come up with the following code:
pattern = re.compile("^(1?[2-9]\d{2}([.\-\s])?\d{3}\2\d{4}){1}$")
if pattern.match(phoneNumber):
    return True
This should match numbers such as:
12142142141
1214-444-4444
214.333.3333
However, this does not match ANY of the above examples. I have tested the pattern in a few different regex validators and it matches in all of them. I'm assuming the Python regex engine is different, but after searching around I cannot find the difference. Any suggestions?

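Following your code, I made a couple of small changes: a raw string so that \1 reaches the regex engine as a backreference, no outer group (so the separator is group 1), and an optional backreference so that numbers without a separator still match:
import re
def test(s):
    pattern = re.compile(r"^1?[2-9]\d{2}([.\-\s])?\d{3}\1?\d{4}$")
    return pattern.match(s) is not None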
print(test("12142142141")) #True
print(test("1214-444-4444")) #True
print(test("214.333.3333")) #True
print(test("214-333-3333")) #True
print(test("214.333-3333")) #False
All three test cases passed.
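For what it's worth, the reason the original pattern fails can be checked directly. This is only a sketch using the standard re module; the variable names are made up for illustration. In a non-raw string literal, Python reads \2 as an octal escape (the single character chr(2)), so the regex engine never sees a backreference. Even with a raw string, a backreference to a group that did not take part in the match fails, which is why numbers without a separator need the reference to be optional.

```python
import re

# In a normal string literal, "\2" is an octal escape: one character, chr(2).
plain_is_control_char = "\2" == chr(2)

# In a raw string the backslash survives, so re sees a real backreference.
raw_length = len(r"\2")  # two characters: a backslash and '2'

# The question's pattern, written as a raw string.
original = re.compile(r"^(1?[2-9]\d{2}([.\-\s])?\d{3}\2\d{4}){1}$")

# With a separator, group 2 is set and \2 can repeat it.
with_separator = original.match("1214-444-4444") is not None

# Without a separator, group 2 never participates, so \2 cannot match.
without_separator = original.match("12142142141") is not None

print(plain_is_control_char, raw_length, with_separator, without_separator)
```

So switching to a raw string fixes the separator cases, and making the backreference optional fixes the all-digits case.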

Try this regex:
^1?(?:(?:\d{10})|(?:\d{3}-\d{3}-\d{4})|(?:\d{3}\.\d{3}\.\d{4}))$
Most likely you do not want to allow mixed separator types, e.g. if a number uses no separator then it must use no separator everywhere (and the same for dot and hyphen). In this case, we can use an alternation to cover the three types of patterns.
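As a rough check of the alternation above in Python, here is a small sketch. The sample list and the names phone, samples and results are illustrative assumptions, not part of the answer.

```python
import re

# The alternation from the answer, written as a raw string.
phone = re.compile(
    r"^1?(?:(?:\d{10})|(?:\d{3}-\d{3}-\d{4})|(?:\d{3}\.\d{3}\.\d{4}))$"
)

samples = ["12142142141", "1214-444-4444", "214.333.3333", "214.333-3333"]

# Map each sample to whether the whole string matches.
results = {s: phone.match(s) is not None for s in samples}

for s, ok in results.items():
    print(s, ok)
```

Note that, unlike the pattern in the question, this one does not accept whitespace as a separator and does not restrict the first digit of the area code to 2-9.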

Related

regex ":[^]" not working in python re module

I have a problem with a regular expression using the re module.
With pattern = ":[^]" and string = ":r", bool(re.findall(string, pattern)) should return True. However, it returns False, as in pic1.
I verified this using https://regexr.com/ and it shows the match, as in pic2. So I believe the problem is in the re module.
How can I get the same result as pic2 in Python?
That's the expected behavior, because re.findall(':r', ':[^]') means: find the strings that match the pattern :r in the string :[^]. The first argument is the pattern and the second argument is the string/text in which to search.
In Python regex, [^...] is a negated character class: it matches any single character that is not listed after the caret ^. An empty [^] is not valid in Python's re (compiling it raises an "unterminated character set" error), whereas JavaScript regex treats [^] as "any character".
If you are looking for strings starting with : followed by one or more word characters (letters, digits, or underscore), the following should work:
>>> re.findall(':\w+',':r')
[':r']
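To make the argument order concrete, here is a small sketch. The sample texts are invented, and using . with re.DOTALL as a stand-in for JavaScript's [^] is my assumption about what the asker wanted, not something stated in the question.

```python
import re

text = ":r"

# Pattern first, text second.
word_matches = re.findall(r":\w+", text)

# A colon followed by any single character, newlines included.
any_char_matches = re.findall(r":.", ":r :\n", flags=re.DOTALL)

# Swapping the arguments searches for the literal pattern ':r' inside ':[^]'.
swapped = re.findall(text, ":[^]")

print(word_matches, any_char_matches, swapped)
```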

How to check if a string is word with regex?

I want to check whether a string (like 'hello') input by the user contains exactly one word and nothing else. It should be true only for strings made up of [a-zA-Z], with no whitespace, dot, underscore, or second word.
For example:
'hello' true
'hello_' false
'hello world' false
'h.e.l.l.o' false
I don't know how to write the regex. Need help.
There is no need to write a regex here. This is already builtin in Python with str.isalpha:
Return True if all characters in the string are alphabetic and there is at least one character, False otherwise.
So we can check it with:
if your_string.isalpha():
    pass
Note, however, that str.isalpha also accepts letters with diacritics. For example:
>>> 'ä'.isalpha()
True
This is not per se a problem, but it is something to take into account.
In case you do not want diacritics, you can for instance check that all characters have an ord(..) less than 128 as well:
if your_string.isalpha() and all(ord(c) < 128 for c in your_string):
    pass
The advantage of using builtins is that these are more self-explaining (isalpha() clearly suggests what it is doing), and furthermore it is very unlikely to contain any bugs (I am not saying that other approaches do contain bugs, but writing something yourself, typically means it is not tested very effectively, hence it can still not fully cover edge and corner cases).
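Put together as a function, the check above might look like the sketch below. The name is_ascii_word and the sample inputs are assumptions for illustration; the inputs are the examples from the question plus two edge cases.

```python
def is_ascii_word(s):
    # Letters only, at least one character, and restricted to ASCII.
    return s.isalpha() and all(ord(c) < 128 for c in s)

samples = ["hello", "hello_", "hello world", "h.e.l.l.o", "\u00e4", ""]

# "\u00e4" is the letter a with a diaeresis; isalpha() alone would accept it.
checks = {s: is_ascii_word(s) for s in samples}

for s, ok in checks.items():
    print(repr(s), ok)
```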
You can use the anchors ^ and $:
import re
s = "hello"
if re.findall('^[a-zA-Z]+$', s):
    pass  # string condition met
Performance comparisons between re.findall and re.match:
import timeit
s1 = """
import re
re.findall('^[a-zA-Z]+$', 'hello')
"""
print(timeit.timeit(stmt=s1,number=10000))
>>> 0.0147941112518
s2 = """
import re
re.match('^[a-zA-Z]+$', 'hello')
"""
print(timeit.timeit(stmt=s2,number=10000))
>>> 0.0134868621826
While re.match performs slightly better than re.findall, I prefer re.findall as 1) it is easier to view the results initially and 2) immediately store the results in a list.
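One caveat with the anchored pattern: $ also matches just before a trailing newline, so 'hello\n' passes the check. A possible alternative, sketched here with a hypothetical helper name is_word, is re.fullmatch, which requires the entire string to match.

```python
import re

WORD = re.compile(r"[a-zA-Z]+")

def is_word(s):
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return WORD.fullmatch(s) is not None

# The anchored version accepts a trailing newline because of how $ works.
anchored_accepts_newline = re.match(r"^[a-zA-Z]+$", "hello\n") is not None

print(is_word("hello"), is_word("hello\n"), anchored_accepts_newline)
```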

A more powerful method than Python's find? A regex issue?

I'm looking for a list of strings and their variations within a very large string.
What I want to do is find even the implicit matches between two strings.
For example, if my start string is foo-bar, I want the matching to find Foo-bAr, foo Bar, or even foo(bar.... Of course, foo-bar should also return a match.
EDIT: More specifically, I need the following matches.
The string itself, case insensitive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
How do I write an expression to meet these conditions?
I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.
Perhaps regex isn't the best solution?
Thanks for your help guys. I'm still learning to think in regex.
>>> def findString(inputStr, targetStr):
...     if convertToStringSoup(targetStr).find(convertToStringSoup(inputStr)) != -1:
...         return True
...     return False
...
>>> def convertToStringSoup(testStr):
...     testStr = testStr.lower()
...     testStr = testStr.replace(" ", "")
...     testStr = testStr.replace("(", "")
...     testStr = testStr.replace(")", "")
...     return testStr
...
>>>
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True
This should work according to your specification. Obviously, it could be optimized. You're asking about regex, which I'm thinking hard about, and I will hopefully edit this answer soon with something good. If this isn't too slow, though, keep it: regexps can be miserable, and readable is often better!
I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
Edit: I've been thinking about regex, and any regex you write would either need to have many clauses, or the text would have to be modified pre-regex like I did in this answer. I haven't benchmarked str.find() vs re.search(), but I imagine the former would be faster in this case.
I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:
EDIT: More specifically, I need the following matches.
The string itself, case insensitive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:
haystack.replace(' ', '').replace('(', '').replace(')', '').upper().find(needle.upper())
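If you do want a regex, for example to get match positions in the original text, one possible sketch is to allow any run of spaces or parentheses between consecutive characters of the needle. The helper name build_pattern is hypothetical, and this follows the three stated rules rather than the earlier examples.

```python
import re

def build_pattern(needle):
    # Allow any run of spaces or parentheses between consecutive characters.
    gap = r"[ ()]*"
    # re.escape keeps characters such as '-' or '.' in the needle literal.
    return re.compile(gap.join(re.escape(ch) for ch in needle), re.IGNORECASE)

pattern = build_pattern("hello")

found_plain = pattern.search("say HELLO there") is not None
found_split = pattern.search("hell( o)1") is not None
found_other = pattern.search("hell!o1") is not None

print(found_plain, found_split, found_other)
```

With a large list of needles, building one pattern per needle like this will probably be slower than stripping the haystack once and using find, so it is worth measuring before committing to it.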
Try this regex:
[fF][oO]{2}[- ()][bB][aA][rR]
Test:
>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'
Using a regex, a case-insensitive search matches upper/lower case variants, '[]' matches any one of the contained characters, and '|' lets you do multiple compares at once. Putting it all together, you can try:
import re
pairs = ['foo-bar', 'jane-doe']
# '-' comes first in the class so it is a literal hyphen, not a range operator.
regex = '|'.join(r'%s[- ()]%s' % tuple(p.split('-')) for p in pairs)
print(regex)
results = re.findall(regex, your_text_here, re.IGNORECASE)

Python regex for int with at least 4 digits

I am just learning regex and I'm a bit confused here. I've got a string from which I want to extract an int with at least 4 digits and at most 7 digits. I tried it as follows:
>>> import re
>>> teststring = 'abcd123efg123456'
>>> re.match(r"[0-9]{4,7}$", teststring)
Where I was expecting 123456, unfortunately this results in nothing at all. Could anybody help me out a little bit here?
@ExplosionPills is correct, but there would still be two problems with your regex.
First, $ matches the end of the string. I'm guessing you'd like to be able to extract an int in the middle of the string as well, e.g. abcd123456efg789 to return 123456. To fix that, you want this:
r"[0-9]{4,7}(?![0-9])"
            ^^^^^^^^^
The added portion is a negative lookahead assertion, meaning, "...not followed by any more numbers." Let me simplify that by the use of \d though:
r"\d{4,7}(?!\d)"
That's better. Now, the second problem. You have no constraint on the left side of your regex, so given a string like abcd123efg123456789, you'd actually match 3456789. So, you need a negative lookbehind assertion as well:
r"(?<!\d)\d{4,7}(?!\d)"
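Here is a quick sketch of that final pattern used with search rather than match. The sample strings come from the question and this answer; the names pattern and found are illustrative.

```python
import re

# 4 to 7 digits, not preceded and not followed by another digit.
pattern = re.compile(r"(?<!\d)\d{4,7}(?!\d)")

samples = ["abcd123efg123456", "abcd123456efg789", "abcd123efg123456789"]

found = {}
for s in samples:
    m = pattern.search(s)  # search scans the whole string, unlike match
    found[s] = m.group(0) if m else None

for s, value in found.items():
    print(s, "->", value)
```

The last sample yields nothing because its only long digit run has nine digits, which is outside the 4 to 7 range.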
.match will only match if the string starts with the pattern. Use .search.
You can also use:
re.findall(r"[0-9]{4,7}", teststring)
Which will return a list of all substrings that match your regex, in your case ['123456']
If you're interested in just the first matched substring, then you can write this as:
next(iter(re.findall(r"[0-9]{4,7}", teststring)), None)

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all, and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-capturing groups, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
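To see the difference side by side, here is a short sketch. The sample To: line is invented and keeps the addresses exactly as written in the question; the variable names are illustrative.

```python
import re

text = "To: support#company.com, 1234567#tickets.company.com"

# One capturing group: findall returns only that group for each match.
groups_only = re.findall(r"\w+#(tickets\.)?company\.com", text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\w+#(?:tickets\.)?company\.com", text)

# finditer gives match objects, so group(0) works even with capturing groups.
via_finditer = [
    m.group(0) for m in re.finditer(r"\w+#(tickets\.)?company\.com", text)
]

print(groups_only)
print(whole)
print(via_finditer)
```

The empty string in the first result is the optional group that did not participate in the first match, which is why it looks as if that address "isn't found at all".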
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
# Capture the whole address; the optional "tickets." part is non-capturing.
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))
