How to check if a string is word with regex?

How to check if a string is word with regex? - python

I want to check if a string (like 'hello') input by user only contains one word and nothing else. Like only true for those contains only [a-zA-Z] and no whitespace or dot or underscore or another word.
For example:
'hello' true
'hello_' false
'hello world' false
'h.e.l.l.o' false
I don't know how to write the regex. Need help.

There is no need to write a regex here. This is already builtin in Python with str.isalpha:
Return True if all characters in the string are alphabetic and there is at least one character, False otherwise.
So we can check it with:
if your_string.isalpha():
pass
Note however that:
Note: str.isalpha also includes diacritics, etc. For example:
>>> 'ä'.isalpha()
True
this is not per se a problem. But it can be something to take into account.
In case you do not want diacricics, you can for instance check that all characters have an ord(..) less than 128 as well:
if your_string.isalpha() and all(ord(c) < 128 for c in your_string):
pass
The advantage of using builtins is that these are more self-explaining (isalpha() clearly suggests what it is doing), and furthermore it is very unlikely to contain any bugs (I am not saying that other approaches do contain bugs, but writing something yourself, typically means it is not tested very effectively, hence it can still not fully cover edge and corner cases).

You can use the anchors ^ and $:
import re
s = "hello"
if re.findall('^[a-zA-Z]+$', s):
pass #string condition met
Performance comparisons between re.findall and re.search:
import timeit
s1 = """
import re
re.findall('^[a-zA-Z]+$', 'hello')
"""
print(timeit.timeit(stmt=s1,number=10000))
>>> 0.0147941112518
s2 = """
import re
re.match('^[a-zA-Z]+$', 'hello')
"""
print(timeit.timeit(stmt=s2,number=10000))
>>> 0.0134868621826
While re.match performs slightly better than re.findall, I prefer re.findall as 1) it is easier to view the results initially and 2) immediately store the results in a list.

Related

Python Regex to Remove Special Characters from Middle of String and Disregard Anything Else

Using the python re.sub, is there a way I can extract the first alpha numeric characters and disregard the rest form a string that starts with a special character and might have special characters in the middle of the string? For example:
re.sub('[^A-Za-z0-9]','', '#my,name')
How do I just get "my"?
re.sub('[^A-Za-z0-9]','', '#my')
Here I would also want it to just return 'my'.

re.sub(".*?([A-Za-z0-9]+).*", r"\1", str)
The \1 in the replacement is equivalent to matchobj.group(1). In other words it replaces the whole string with just what was matched by the part of the regexp inside the brackets. $ could be added at the end of the regexp for clarity, but it is not necessary because the final .* will be greedy (match as many characters as possible).
This solution does suffer from the problem that if the string doesn't match (which would happen if it contains no alphanumeric characters), then it will simply return the original string. It might be better to attempt a match, then test whether it actually matches, and handle separately the case that it doesn't. Such a solution might look like:
matchobj = re.match(".*?([A-Za-z0-9]+).*", str)
if matchobj:
print(matchobj.group(1))
else:
print("did not match")
But the question called for the use of re.sub.

Instead of re.sub it is easier to do matching using re.search or re.findall.
Using re.search:
>>> s = '#my,name'
>>> res = re.search(r'[a-zA-Z\d]+', s)
>>> if res:
... print (res.group())
...
my
Code Demo

This is not a complete answer. [A-Za-z]+ will give give you ['my','name']
Use this to further explore: https://regex101.com/

Python Regex giving false negative

I am trying to regex match phone numbers and I have come up with the following code:
pattern = re.compile("^(1?[2-9]\d{2}([.\-\s])?\d{3}\2\d{4}){1}$")
if pattern.match(phoneNumber):
return True
This should match numbers such as:
12142142141
1214-444-4444
214.333.3333
However, this will not match to ANY of the above examples. I have tested this on a few different regex validators and they are all successful on their. I'm assuming the python regex engine is different, but after searching around I cannot find the difference. Any suggestions?

Follow your code, made couple of small changes:
import re as re
def test(s):
pattern = re.compile("^1?[2-9]\d{2}([.\-\s])?\d{3}\\1?\d{4}$")
return pattern.match(s) is not None
print(test("12142142141")) #True
print(test("1214-444-4444")) #True
print(test("214.333.3333")) #True
print(test("214-333-3333")) #True
print(test("214.333-3333")) #False
All three test cases passed.

Try this regex:
^1?(?:(?:\d{10})|(?:\d{3}-\d{3}-\d{4})|(?:\d{3}\.\d{3}\.\d{4}))$
Most likely you do not want to allow mixed separator types, e.g. if a number uses no separator then it must use no separator everywhere (and the same for dot and hyphen). In this case, we can use an alternation to cover the three types of patterns.
Demo here:
Regex101

A more powerful method than Python's find? A regex issue?

I'm looking for a list of strings and their variations within a very large string.
What I want to do is find even the implicit matches between two strings.
For example, if my start string is foo-bar, I want the matching to find Foo-bAr foo Bar, or even foo(bar.... Of course, foo-bar should also return a match.
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
How do I write an expression to meet these conditions?
I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.
Perhaps regex isn't the best solution?
Thanks for your help guys. I'm still learning to think in regex.

>>> def findString(inputStr, targetStr):
... if convertToStringSoup(targetStr).find(convertToStringSoup(inputStr)) != -1:
... return True
... return False
...
>>> def convertToStringSoup(testStr):
... testStr = testStr.lower()
... testStr = testStr.replace(" ", "")
... testStr = testStr.replace("(", "")
... testStr = testStr.replace(")", "")
... return testStr
...
>>>
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True
should work according to your specification. Obviously, could be optimized. You're asking about regex, which I'm thinking hard about, and will hopefully edit this question soon with something good. If this isn't too slow, though, regexps can be miserable, and readable is often better!
I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
Edit: I've been thinking about regex, and any regex you do would either need to have many clauses or the text would have to be modified pre-regex like I did in this answer. I haven't benchmarked string.find() vs re.find(), but I imagine the former would be faster in this case.

I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:
haystack.replace(' ', '').replace('(', '').upper().find(needle.upper())

Try this regex:
[fF][oO]{2}[- ()][bB][aA][rR]
Test:
>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'

Using a regex, a case-insensitive search matches upper/lower case invariants, '[]' matches any contained characters and '|' lets you do multiple compares at once. Putting it all together, you can try:
import re
pairs = ['foo-bar', 'jane-doe']
regex = '|'.join(r'%s[ -\)]%s' % tuple(p.split('-')) for p in pairs)
print regex
results = re.findall(regex, your_text_here, re.IGNORECASE)

Why is the minimal (non-greedy) match affected by the end of string character '$'?

EDIT: remove original example because it provoked ancillary answers. also fixed the title.
The question is why the presence of the "$" in the regular expression effects the greedyness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note the when the "$" is removed, however, then the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT:
In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs:
"Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)

The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add .+? to the beginning of your pattern.
Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
EDIT:
More details... In this case:
re.search(r"a+?$", "baaaaaaaa")
the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.
But in this case:
re.search(r"a+?", "baaaaaaaa")
the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.

The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.
Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.
With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.
On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.
As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-then-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under under the hood, but are required to return the same results a DFA (text-directed) regex would.

Answer to original question:
Why does the first search() span
multiple "/"s rather than taking the
shortest match?
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.
Answer to revised question:
A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.
Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.

There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.
>>> import re
>>> string = "baaa"
>>>
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>>
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>>
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'
Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.
>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. # In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'
Edit: Removed text related to old versions of the question.

Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.
>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
butshouldseethis

grep variable in python

I need something like grep in python
I have done research and found the re module to be suitable
I need to search variables for a specific string

To search for a specific string within a variable, you can just use in:
>>> 'foo' in 'foobar'
True
>>> s = 'foobar'
>>> 'foo' in s
True
>>> 'baz' in s
False

Using re.findall will be the easiest way. You can search for just a literal string if that's what you're looking for (although your purpose would be better served by the string in operator and you'll need to escape regex characters), or else any string you would pass to grep (although I don't know the syntax differences between the two off the top of my head, but I'm sure there are differences).
>>> re.findall("x", "xyz")
['x']
>>> re.findall("b.d", "abcde")
['bcd']
>>> re.findall("a?ba?c", "abacbc")
['abac', 'bc']

It sounds like what you really want is the ability to print a large substring in a way that lets you easily see where a particular substring is. There are a couple of ways to approach this.
def grep(large_string, substring):
for line, i in enumerate(large_string.split('\n')):
if substring in line:
print("{}: {}".format(i, line))
This would print only the lines that have your substring. However, you would lose a bunch of context. If you want true grep, replace if substring in line with something that uses the re module to do regular expression matching.
def highlight(large_string, substring):
from colorama import Fore
text_in_between = large_string.split(substring)
highlighted_substring = "{}{}{}".format(Fore.RED, substring, Fore.RESET)
print(highlighted_substring.join(text_in_between))
This will print the whole large string, but with the substring you are looking for in red. Note that you'll need to pip install colorama for it to work. You can of course combine the two approaches.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to check if a string is word with regex? - python

Related

Python Regex to Remove Special Characters from Middle of String and Disregard Anything Else

Python Regex giving false negative

A more powerful method than Python's find? A regex issue?

Why is the minimal (non-greedy) match affected by the end of string character '$'?

grep variable in python

Categories

Resources