How to efficiently match regex in python

I am writing code to match the US phone number format.
So it should match:
123-333-1111
(123)111-2222
123-2221111
But should not match
1232221111
matchThreeDigits = r"(?:\s*\(?[\d]{3}\)?\s*)"
matchFourDigits = r"(?:\s*[\d]{4}\s*)"
phoneRegex = '('+ '('+ matchThreeDigits + ')' + '-?' + '('+ matchThreeDigits + ')' + '-?' + '(' + matchFourDigits + ')' +')';
matches = re.findall(re.compile(phoneRegex),line)
The problem is that I need to make sure at least one of () or '-' is present in the pattern (otherwise it could be a ten-digit number rather than a phone number). I don't want to do a second pattern search, for efficiency reasons. Is there any way to build this requirement into the regex pattern itself?

Something like this?
pattern = r'(\(?(\d{3})\)?(?P<A>-)?(\d{3})(?(A)-?|-)(\d{4}))'
Using it:
import re
regex = re.compile(pattern)
check = ['123-333-1111', '(123)111-2222', '123-2221111', '1232221111']
for number in check:
    match = regex.match(number)
    print(number, bool(match))
    if match:
        # show the digit groups, skipping unmatched groups and separators
        print('nums:', tuple(g for g in match.groups() if g and g.isalnum()))
>>>
123-333-1111 True
nums: ('123', '333', '1111')
(123)111-2222 True
nums: ('123', '111', '2222')
123-2221111 True
nums: ('123', '222', '1111')
1232221111 False
Note:
You requested an explanation of (?P<A>-) and (?(A)-?|-):
(?P<A>-) is a named capture group with the name A; the general form is (?P<NAME>...).
(?(A)-?|-) is a conditional group that checks whether the named group A matched. If it did, the YES pattern (-?) is tried; otherwise the NO pattern (-) is tried. The general form is (?(NAME)YES|NO).
All this can be easily learned if you do a simple help(re) in the Python interpreter, or a Google search for Python Regular Expressions....

You can use the following regex:
regex = r'(?:\d{3}-|\(\d{3}\))\d{3}-?\d{4}'
assuming that (123)1112222 is acceptable.
The | acts as an or, and \( and \) escape ( and ), respectively.
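As a quick check, the pattern above can be run against the examples from the question. This is only a sketch: the helper name is_phone_number is made up for illustration, and re.fullmatch is used on the assumption that the whole string should be a phone number, so trailing characters are rejected.

```python
import re

# Pattern from the answer above: either "123-" or "(123)", then three digits,
# an optional hyphen, and four digits.
PHONE_RE = re.compile(r'(?:\d{3}-|\(\d{3}\))\d{3}-?\d{4}')

def is_phone_number(text):
    """Return True if the whole string is a phone number in an accepted format."""
    return PHONE_RE.fullmatch(text) is not None

for candidate in ['123-333-1111', '(123)111-2222', '123-2221111',
                  '1232221111', '(123)1112222']:
    print(candidate, is_phone_number(candidate))
```

To find numbers inside a longer line, as the question does with findall, PHONE_RE.findall(line) or PHONE_RE.finditer(line) could be used instead of fullmatch.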

import re
phoneRegex = re.compile(r"(\({0,1}[\d]{3}\)(?=[\d]{3})|[\d]{3}-)([\d]{3}[-]{0,1}[\d]{4})")
numbers = ["123-333-1111", "(123)111-2222", "123-2221111", "1232221111", "(123)-111-2222"]
for number in numbers:
    print(bool(phoneRegex.match(number)))
Output
True
True
True
False
False
You can see an explanation to this regular expression here : http://regex101.com/r/bA4fH8

Related

In python, how to 'if finditer(...) has no matches'?

I would like to do something when finditer() does not find anything.
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
# <if matched_iter is empty (no match found)>:
#     do something
# else:
for m in matched_iter:
    print(m.group())
The best thing I could come up with is to keep track of found manually:
mi_no_find = re.finditer(r'\w+', "$$%%%%")  # not matching
found = False
for m in mi_no_find:
    print(m.group())
    found = True
if not found:
    print("Nothing found")
Related posts that don't answer:
Counting finditer matches: Number of regex matches (I don't need to count, I just need to know if there are no matches).
finditer vs match: different behavior when using re.finditer and re.match (says always have to loop over an iterator returned by finditer)
[edit]
- I have no interest in enumerating or counting total output. Only if found else not found actions.
- I understand I can put finditer into a list, but this would be inefficient for large strings. One objective is to have low memory utilization.

Updated 04/10/2020
Use re.search(pattern, string) to check if a pattern exists.
pattern = "1"
string = "abc"
if re.search(pattern, string) is None:
    print('do this because nothing was found')
Returns:
do this because nothing was found
If you want to iterate over the matches, put the re.finditer() loop inside the if re.search() block.
pattern = '[A-Za-z]'
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
Returns:
Found this thing: a
Found this thing: b
Found this thing: c
Therefore, if you wanted both options, use the else: clause with the if re.search() conditional.
pattern = "1"
string = "abc"
if re.search(pattern, string) is not None:
    for thing in re.finditer(pattern, string):
        print('Found this thing: ' + thing[0])
else:
    print('do this because nothing was found')
Returns:
do this because nothing was found
previous reply below (not sufficient, just read above)
If the .finditer() does not match a pattern, then it will not perform any commands within the related loop.
So:
Set the variable before the loop you are using to iterate over the regex returns
Call the variable after (And outside of) the loop you are using to iterate over the regex returns
This way, if nothing is returned from the regex call, the loop won't execute and your variable call after the loop will return the exact same variable it was set to.
Below, example 1 demonstrates the regex finding the pattern. Example 2 shows the regex not finding the pattern, so the variable within the loop is never set.
Example 3 shows my suggestion - where the variable is set before the regex loop, so if the regex does not find a match (and subsequently, does not trigger the loop), the variable call after the loop returns the initial variable set (Confirming the regex pattern was not found).
Remember to import the re module.
EXAMPLE 1 (Searching for the characters 'he' in the string 'hello world' will return 'he')
my_string = 'hello world'
pat = '(he)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# returns 'he'
EXAMPLE 2 (Searching for the characters 'ab' in the string 'hello world' does not match anything, so the 'for a in regex:' loop does not execute and does not assign the b variable any value.)
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
for a in regex:
    b = str(a.groups()[0])
    print(b)
# no return
EXAMPLE 3 (Searching for the characters 'ab' again, but this time setting the variable b to 'CAKE' before the loop, and calling the variable b after, outside of the loop returns the initial variable - i.e. 'CAKE' - since the loop did not execute).
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
print(b) # calls the variable after (and outside) the loop
# returns 'CAKE'
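It's also worth noting that these examples call a.groups()[0], so the pattern must contain a capturing group: use parentheses to mark the start and end of the group. Without a group, a.groups() is an empty tuple and indexing it raises IndexError (use a.group() in that case).
pattern = '(ab)' # works with a.groups()[0]
pattern = 'ab' # no capturing group, so a.groups() is empty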
To tie back to the initial question:
Since nothing found won’t execute the for loop (for a in regex), the user can preload the variable, then check it after the for loop for the original loaded value. This will allow for the user to know if nothing was found.
my_string = 'hello world'
pat = '(ab)'
regex = re.finditer(pat,my_string)
b = 'CAKE' # sets the variable prior to the for loop
for a in regex:
    b = str(a.groups()[0])
if b == 'CAKE':
    pass  # action taken if nothing is returned

If performance isn't an issue, simply use findall or list(finditer(...)), which return a list.
Otherwise, you can "peek" into the iterator with next: if it raises StopIteration there are no matches; otherwise loop as normal over the first match plus the rest. Though there are other ways to do it, this is the simplest to me:
import itertools
import re
pattern = "1"
string = "abc"
matched_iter = re.finditer(pattern, string)
try:
    first_match = next(matched_iter)
except StopIteration:
    print("No match!") # action for no match
else:
    for m in itertools.chain([first_match], matched_iter):
        print(m.group())

You can probe the iterator with next and then chain the results back together while excepting StopIteration which means the iterator was empty:
import itertools as it
matches = iter([])
try:
    probe = next(matches)
except StopIteration:
    print('empty')
else:
    for m in it.chain([probe], matches):
        print(m)
Regarding your solution you could check m directly, setting it to None beforehand:
matches = iter([])
m = None
for m in matches:
    print(m)
if m is None:
    print('empty')

It prints the original string if the pattern occurs fewer than n times in the string.
Otherwise it replaces the n-th occurrence.
For more reference: https://docs.python.org/2/howto/regex.html
import re
Input_Str = "FOOTBALL"
def replacing(Input_String, char_2_replace, replaced_char, n):
    pattern = re.compile(char_2_replace)
    if len(re.findall(pattern, Input_String)) >= n:
        # pick the n-th match (1-based)
        where = [m for m in pattern.finditer(Input_String)][n-1]
        before = Input_String[:where.start()]
        after = Input_String[where.end():]
        newString = before + replaced_char + after
    else:
        newString = Input_String
    return newString
print(replacing(Input_Str, 'L', 'X', 4))

I know this answer is late, but it is well suited to Python 3.8+.
You can use the new walrus operator := along with next(iterator[, default]) to handle 'no matches' in re.finditer(pattern, string, flags=0), somewhat like this:
import re
pattern_ = "1"
string_ = "abc"
def is_match():
    matches = re.finditer(pattern_, string_)  # create the iterator once
    was_found = False
    while (match := next(matches, None)) is not None:
        was_found = True
        yield match.group() # or just print it
    return was_found  # a generator's return value ends up in StopIteration.value

How can you group a very specific pattern with regex?

Problem:
https://coderbyte.com/editor/Simple%20Symbols
The str parameter will be composed of + and = symbols with
several letters between them (ie. ++d+===+c++==a) and for the string
to be true each letter must be surrounded by a + symbol. So the string
to the left would be false. The string will not be empty and will have
at least one letter.
Input:"+d+=3=+s+"
Output:"true"
Input:"f++d+"
Output:"false"
I'm trying to create a regular expression for the following problem, but I keep running into various problems. How can I produce something that returns the specified rules('+\D+')?
import re
plusReg = re.compile(r'[(+A-Za-z+)]')
plusReg.findall()
>>> []
Here I thought I could create my own class that searches for the pattern.
import re
plusReg = re.compile(r'([\\+,\D,\\+])')
plusReg.findall('adf+a+=4=+S+')
>>> ['a', 'd', 'f', '+', 'a', '+', '=', '=', '+', 'S', '+']
Here I thought the '\\+' would single out the plus symbol and read it as a char.
mo = plusReg.search('adf+a+=4=+S+')
mo.group()
>>>'a'
Here using the same shell, I tried using the search instead of findall, but I just ended up with the first letter which isn't even surrounded by a plus.
My end result is to group the string 'adf+a+=4=+S+' into ['+a+','+S+'] and so on.
edit:
Solution:
import re
def SimpleSymbols(str):
    # added padding, because if str = 'y+4==+r+'
    # then the program would return true when it should return false.
    string = '=' + str + '='
    # regex that matches a letter that *doesn't* have a + in front or back
    plusReg = re.compile(r'[^\+][A-Za-z].|.[A-Za-z][^\+]')
    # return "true" if the regex doesn't find any letter
    # without a + behind or in front
    if plusReg.search(string) is None:
        return "true"
    return "false"
print(SimpleSymbols(input()))
I borrowed some code from ekhumoro and Sanjay. Thanks

One approach is to search the string for any letters that are either: (1) not preceded by a +, or (2) not followed by a +. This can be done using lookahead and lookbehind assertions:
>>> rgx = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')
So if rgx.search(string) returns None, the string is valid:
>>> rgx.search('+a+') is None
True
>>> rgx.search('+a+b+') is None
True
but if it returns a match, the string is invalid:
>>> rgx.search('+ab+') is None
False
>>> rgx.search('+a=b+') is None
False
>>> rgx.search('a') is None
False
>>> rgx.search('+a') is None
False
>>> rgx.search('a+') is None
False
The important thing about look ahead/behind assertions is that they don't consume characters, so they can handle overlapping matches.
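Wrapped into a function for the original problem, the lookaround approach might look like the sketch below. The function name simple_symbols and the "true"/"false" string results are assumptions taken from the problem statement, not part of the answer above.

```python
import re

# A letter that is not preceded by '+', or a letter that is not followed by '+'.
BAD_LETTER = re.compile(r'(?<!\+)[a-zA-Z]|[a-zA-Z](?!\+)')

def simple_symbols(text):
    """Return "true" if every letter in text has a '+' on both sides."""
    return "false" if BAD_LETTER.search(text) else "true"

print(simple_symbols("+d+=3=+s+"))
print(simple_symbols("f++d+"))
```

Because the assertions also fail at the start and end of the string, no padding is needed for inputs such as 'y+4==+r+'.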

Something like this should do the trick:
import re
def is_valid_str(s):
    # the lookahead (?=\+) leaves the closing + available to the next letter, as in '+a+b+'
    return re.findall(r'[a-zA-Z]', s) == re.findall(r'\+([a-zA-Z])(?=\+)', s)
Usage:
In [10]: is_valid_str("f++d+")
Out[10]: False
In [11]: is_valid_str("+d+=3=+s+")
Out[11]: True

I think you are on the right track. The regular expression you have is correct, but it can be simplified to match just letters:
search_pattern = re.compile(r'\+[a-zA-Z]\+')
for upper- and lowercase letters. Now we can use this regex with the findall function:
results = re.findall(search_pattern, 'adf+a+=4=+S+') # returns ['+a+', '+S+']
Now the question needs you to return a boolean depending on whether the string is valid for the specified pattern, so we can wrap this all up into a function:
def is_valid_pattern(pattern_string):
    # lookahead so that one + can be shared by two letters, as in '+a+b+'
    search_pattern = re.compile(r'\+[a-zA-Z](?=\+)')
    letter_pattern = re.compile(r'[a-zA-Z]') # to search for all letters
    results = re.findall(search_pattern, pattern_string)
    letters = re.findall(letter_pattern, pattern_string)
    # if the number of letters equals the number of values
    # found with the pattern, we can say that it is a valid string
    return len(results) == len(letters)

You should be looking for what isn't there, as opposed to what is. Search for something like ([^\+][A-Za-z]|[A-Za-z][^\+]). The | in the middle is a logical or operator. Each side checks for a letter without a "+" on its left or right, respectively. If it finds something, the string fails. If it can't find anything, there are no instances of a letter not being surrounded by "+"s. Note that a letter at the very start or end of the string has no neighbor for [^\+] to match, so pad the string (for example with "=") before searching.

Replacing consecutive symbol with number digit in python using regex

Here's the case:
str = 'myfile.#.#####-########.ext'
I want to replace each run of # with the number 456, zero-padded to the length of the run,
so it should be:
str = 'myfile.456.00456-00000456.ext'
The second case:
str = 'myfile.%012d.tga'
Replace the pattern with the number 456 so it becomes:
str = 'myfile.000000000456.tga'
I can solve this with string methods by grabbing the pattern, counting the digits, then padding with zeros. Now I want to know how to do it using regex in Python. Can anyone help? Thanks a lot.

The second case is straightforward and does not require a regex; a regex would be overkill. I would suggest using string formatting:
'myfile.%012d.tga' % 456
Out[21]: 'myfile.000000000456.tga'
The first case is tricky but possible
>>> import re
>>> st = 'myfile.#.#####-########.ext'
>>> def repl(m):
...     return "{{0:0{}}}".format(len(m.group(1)))
...
>>> re.sub(r"(#+)", repl, st).format(456)
'myfile.456.00456-00000456.ext'

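Using re.sub without the format function. Note that this prepends zeros instead of padding to the field width, so both results below come out wider than the requested output.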
>>> s = 'myfile.#.#####-########.ext'
>>> re.sub(r'#+', lambda m: '456' if len(m.group()) == 1 else m.group()[:-1].replace('#', '0') + '456', s)
'myfile.456.0000456-0000000456.ext'
For the second case,
>>> s = 'myfile.%012d.tga'
>>> re.sub(r'%(\d+)d', lambda m: str('0' * int(m.group(1)))[:-1] + '456', s)
'myfile.00000000000456.tga'

Finally, thanks to everyone who answered my question. That 'lambda' thing really gave me the idea. Here's the answer to my question:
My first case, using '#':
s = 'myfile.##.####.########.ext'
print(re.sub('#+', lambda x: '456'.zfill(len(x.group())), s))
---> myfile.456.0456.00000456.ext
My second case, using the %0xd format:
s = 'myfile.%06d--%012d--%02d.ext'
print(re.sub('%0[0-9]+d', lambda x: x.group() % 456, s))
---> myfile.000456--000000000456--456.ext
Here's a simple combination of both cases above:
s = 'myfile.##.####.########.%06d---%02d.ext'
x = re.sub('#+', lambda x: '456'.zfill(len(x.group())), s)
print(re.sub('%0[0-9]+d', lambda x: x.group() % 456, x))
---> myfile.456.0456.00000456.000456---456.ext
Don't forget to 'import re' to use the regex.
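The two substitutions above can also be folded into a single pass with one alternation. This is only a sketch; the names PLACEHOLDER and fill_number are invented for illustration.

```python
import re

# One alternation covers both placeholder styles: runs of '#' and '%0Nd'.
PLACEHOLDER = re.compile(r'#+|%0\d+d')

def fill_number(template, number):
    """Replace every run of '#' or '%0Nd' field in template with number."""
    def repl(match):
        token = match.group()
        if token.startswith('#'):
            # pad to the length of the run of '#'
            return str(number).zfill(len(token))
        # '%0Nd' is already a valid printf-style format string
        return token % number
    return PLACEHOLDER.sub(repl, template)

print(fill_number('myfile.##.####.########.%06d---%02d.ext', 456))
```

A single callback keeps the number in one place and avoids running the second pattern over text the first substitution already produced.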

how to remove whitespace inside bracket?

I have the following string:
res = '(321, 3)-(m-5, 5) -(31,1)'
I want to remove the whitespace within the brackets, but I don't have any knowledge of regular expressions.
I've tried this, but it doesn't work:
import re
res = re.sub(r'\(.*\s+\)', '', res)

You can substitute a non-greedy wildcard match for characters in parentheses with a function that splits the match on whitespace and rejoins it.
>>> import re
>>> res = '(321, 3)-(m-5, 5) -(31,1)'
>>> re.sub(r'\(.*?\)', lambda x: ''.join(x.group(0).split()), res)
'(321,3)-(m-5,5) -(31,1)'

You could convert the string into a list, go through each letter and count if you are within brackets or not. In toRemove, you collect the positions of whitespaces, which you then remove from the list. Then you convert the list back to a string ...
res = '(321, 3)-(m-5, 5) -(31,1)'
r = list(res)
insideBracket = 0
toRemove = []
for pos,letter in enumerate(r):
    if letter == '(':
        insideBracket += 1
    elif letter == ')':
        insideBracket -= 1
    if insideBracket > 0:
        if letter == ' ':
            toRemove.append(pos)
for t in toRemove[::-1]:
    r.pop(t)
result = ''.join(r)
print(result)

I think regular expressions aren't quite powerful enough to do what you want here; you want to remove all whitespace that's found in between parentheses. The trouble is, solving this for the general case means you're doing a context-sensitive match on the string, and regular expressions are mostly context-insensitive, and so can't do your job. There are lookaheads and lookbehinds that can restrict matches to particular contexts, but they won't solve your problem in the general case either; as the re documentation says of lookbehind assertions:
The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Group references are not supported even if they match strings of some fixed length.
Because of this, I would match the parenthesis groups first:
>>> re.split(r'(\([^)]*\))', res)
['', '(321, 3)', '-', '(m-5, 5)', ' -', '(31,1)', '']
and then remove whitespace from them in a second step before joining everything back up into a single string:
>>> g = re.split(r'(\([^)]*\))', res)
>>> g[1::2] = [re.sub(r'\s+', '', x) for x in g[1::2]]
>>> ''.join(g)
'(321,3)-(m-5,5) -(31,1)'

How to match beginning of string or character in Python

I have a string consisting of parameter number _ parameter number:
dir = 'a1.8000_b1.0000_cc1.3000_al0.209_be0.209_c1.344_e0.999'
I need to get the number behind a parameter chosen, i.e.
par='be' -->need 0.209
par='e' -->need 0.999
I tried:
num1 = float(re.findall(par + '(\d+\.\d*)', dir)[0])
but for par='e' this will match 0.209 and 0.999, so I tried to match the parameter together with the beginning of the string or an underscore:
num1 = float(re.findall('[^_]'+par+'(\d+\.\d*)', dir)[0])
which didn't work for some reason.
Any suggestions? Thank you!

Your [^_] pattern matches any character that is not the underscore.
Use a (..|..) or grouping instead:
float(re.findall('(?:^|_)' + par + r'(\d+\.\d*)', dir)[0])
I used a (?:..) non-capturing group there so that it doesn't interfere with your original group indices.
Demo:
>>> import re
>>> dir = 'a1.8000_b1.0000_cc1.3000_al0.209_be0.209_c1.344_e0.999'
>>> par = 'e'
>>> re.findall('(?:^|_)' + par + r'(\d+\.\d*)', dir)
['0.999']
>>> par = 'a'
>>> re.findall('(?:^|_)' + par + r'(\d+\.\d*)', dir)
['1.8000']
To elaborate, when using a character group ([..]) and you start that group with the caret (^) you invert the character group, turning it from matching the listed characters to matching everything else instead:
>>> re.findall('[a]', 'abcd')
['a']
>>> re.findall('[^a]', 'abcd')
['b', 'c', 'd']
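Putting the (?:^|_) idea into a small helper might look like the sketch below. The function name get_param and the None result for a missing parameter are assumptions for illustration; re.escape is added so that a parameter name containing regex metacharacters is matched literally.

```python
import re

def get_param(spec, par):
    """Return the number following parameter par in spec, or None if absent."""
    # (?:^|_) anchors the name at the start of the string or after an underscore;
    # re.escape guards against regex metacharacters in the parameter name.
    match = re.search(r'(?:^|_)' + re.escape(par) + r'(\d+\.\d*)', spec)
    return float(match.group(1)) if match else None

spec = 'a1.8000_b1.0000_cc1.3000_al0.209_be0.209_c1.344_e0.999'
print(get_param(spec, 'be'))
print(get_param(spec, 'e'))
print(get_param(spec, 'zz'))
```

Using re.search instead of findall(...)[0] avoids an IndexError when the parameter is not present.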

without regex solution:
def func(par,strs):
    ind=strs.index('_'+par)+1+len(par)
    ind1=strs.find('_',ind) if strs.find('_',ind)!=-1 else len(strs)
    return strs[ind:ind1]
output:
>>> func('be',dir)
'0.209'
>>> func('e',dir)
'0.999'
>>> func('cc',dir)
'1.3000'

A solution without regex:
>>> def get_value(dir, parm):
...     return [float(t[len(parm):]) for t in dir.split('_') if t.startswith(parm)]
...
>>> get_value('a1.8000_b1.0000_cc1.3000_al0.209_be0.209_c1.344_e0.999', "be")
[0.209]
If there are multiple occurrences of the parameter in the string, all of them are evaluated.
And a version without casting to a float:
return [t[len(parm):] for t in dir.split('_') if t.startswith(parm)]

(?P<param>[a-zA-Z]*)(?P<version>[^_]*)
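The pattern above is given without usage, so here is one way it might be applied: iterate over the string with finditer and collect the named groups into a dictionary. The names TOKEN and parse_params are hypothetical, and converting the value to float is an assumption based on the question.

```python
import re

# Pattern from the line above: letters form the name, everything up to the
# next underscore forms the value.
TOKEN = re.compile(r'(?P<param>[a-zA-Z]*)(?P<version>[^_]*)')

def parse_params(spec):
    """Map each parameter name in spec to its value as a float."""
    values = {}
    for match in TOKEN.finditer(spec):
        if match.group('param'):  # skip the empty matches at underscores and at the end
            values[match.group('param')] = float(match.group('version'))
    return values

params = parse_params('a1.8000_b1.0000_cc1.3000_al0.209_be0.209_c1.344_e0.999')
print(params['be'], params['e'])
```

Parsing everything once is convenient when several parameters are needed from the same string.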
