Is there a way to determine how many capture groups there are in a given regular expression?
I would like to be able to do the following:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default

    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    return ('',) * num_of_groups(regexp)
This allows me to do stuff like:
first, last, phone = groups(r'(\w+) (\w+) ([\d\-]+)', 'John Doe 555-3456')
However, I don't know how to implement num_of_groups. (Currently I just work around it.)
EDIT: Following the advice from rslite, I replaced re.findall with re.search.
sre_parse seems like the most robust and comprehensive solution, but requires tree traversal and appears to be a bit heavy.
MizardX's regular expression seems to cover all bases, so I'm going to go with that.
import re

def num_groups(regex):
    return re.compile(regex).groups
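For example, the compiled pattern's groups attribute counts every capturing group, at any nesting depth:
>>> import re
>>> re.compile(r'(\d)(\d)(\d)').groups
3
>>> re.compile(r'(?:\d)(\d)').groups
1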
f_x = re.search(...)
len_groups = len(f_x.groups())
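Note that this only works after a successful match; if re.search returns None, there is no match object to ask. For example:
>>> import re
>>> f_x = re.search(r'(\w+) (\w+)', 'John Doe')
>>> len(f_x.groups())
2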
Something from inside sre_parse might help.
At first glance, maybe something along the lines of:
>>> import sre_parse
>>> sre_parse.parse('(\d)\d(\d)')
[('subpattern', (1, [('in', [('category', 'category_digit')])])),
('in', [('category', 'category_digit')]),
('subpattern', (2, [('in', [('category', 'category_digit')])]))]
I.e. count the items of type 'subpattern':
import sre_parse

def count_patterns(regex):
    """
    >>> count_patterns('foo: \d')
    0
    >>> count_patterns('foo: (\d)')
    1
    >>> count_patterns('foo: (\d(\s))')
    1
    """
    parsed = sre_parse.parse(regex)
    # On Python 3 the opcode is a named constant, not a string; compare
    # against sre_parse.SUBPATTERN there instead of 'subpattern'.
    return len([token for token in parsed if token[0] == 'subpattern'])
Note that we're only counting root-level patterns here, so the last example only returns 1. To change this, the tokens would need to be searched recursively.
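A rough sketch of such a recursive walk (hedged: sre_parse is undocumented and its node layout differs between Python versions, so this handles only the common node types and is illustrative rather than definitive):

import sre_parse

def count_all_patterns(regex):
    """Count capture-group nodes at any nesting depth (illustrative only)."""
    def walk(nodes):
        total = 0
        for op, av in nodes:
            if op == sre_parse.SUBPATTERN:
                total += 1 + walk(av[-1])        # av[-1] is the nested pattern
            elif op == sre_parse.BRANCH:
                total += sum(walk(b) for b in av[1])
            elif op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT):
                total += walk(av[2])             # av = (min, max, pattern)
        return total
    return walk(sre_parse.parse(regex))

With this, count_all_patterns('foo: (\d(\s))') returns 2. (In practice, re.compile(regex).groups gives the same count with far less effort.)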
First of all, if you only need the first result of re.findall, it's better to just use re.search, which returns a match object or None.
For the number of groups, you could count the opening parentheses '(', except those escaped with '\'. You could use another regex for that:
import re

def num_of_groups(regexp):
    rg = re.compile(r'(?<!\\)\(')
    return len(rg.findall(regexp))
Note that this doesn't work if the regex contains non-capturing groups, or if '(' is escaped as '[(]'. So it's not very reliable, but depending on the regexes you use it might help.
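A slightly stricter variant (still a sketch, not bulletproof) skips any '(' that starts a (?...) construct; note this also skips named groups (?P<name>...), which do capture, so it undercounts those:

import re

# Unescaped '(' not followed by '?'. Still miscounts '(' inside character
# classes like '[(]' or after an escaped backslash '\\(', and skips the
# capturing named-group syntax '(?P<name>...)'.
num_groups_re = re.compile(r'(?<!\\)\((?!\?)')

def num_of_groups(regexp):
    return len(num_groups_re.findall(regexp))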
Using your code as a basis:
def groups(regexp, s):
    """ Returns the first result of re.findall, or an empty default

    >>> groups(r'(\d)(\d)(\d)', '123')
    ('1', '2', '3')
    >>> groups(r'(\d)(\d)(\d)', 'abc')
    ('', '', '')
    """
    import re
    m = re.search(regexp, s)
    if m:
        return m.groups()
    # Bug: m is None on this branch, so m.groups() raises AttributeError.
    return ('',) * len(m.groups())
Might be wrong, but I don't think there is a way to find the number of groups that would have been returned had the regex matched. The only way I can think of to make this work the way you want is to pass the number of groups your particular regex expects as an argument.
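A minimal sketch of that idea (the extra expected_groups parameter is my own addition, not part of the original code):

import re

def groups(regexp, s, expected_groups):
    m = re.search(regexp, s)
    return m.groups() if m else ('',) * expected_groups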
To clarify though: When findall succeeds, you only want the first match to be returned, but when it fails you want a list of empty strings? Because the comment seems to show all matches being returned as a list.
Related
Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the '0', and so forth.
Is there a simple way to do this using regex?
The answer is rather simple. The .* pattern matches 0 or more characters, so for your purpose you would put it between all the characters, as in S.*x.*2.*0.*1.*6. If this pattern matches, the string obeys your condition.
For a general string you would insert the .* pattern between the characters, also taking care to escape special characters like literal dots and stars that would otherwise be interpreted by the regex engine.
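For instance, a small sketch that escapes each character with re.escape before joining:

import re

def subsequence_pattern(s):
    # Escape each character so literal dots, stars, etc. match themselves.
    return '.*'.join(re.escape(c) for c in s)

print(subsequence_pattern('a.b*'))  # prints: a.*\..*b.*\*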
This function might fit your need:
import re

def check_string(text, pattern):
    return re.match('.*'.join(pattern), text)
'.*'.join(pattern) creates a pattern with all your characters separated by '.*'. For instance:
>>> '.*'.join("Sx2016")
'S.*x.*2.*0.*1.*6'
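Usage (re.match returns a match object, which is truthy, or None):
>>> bool(check_string("StackExchange 2016", "Sx2016"))
True
>>> bool(check_string("StackExchange 2015", "Sx2016"))
False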
Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
...     result = [welcome.index(c) for c in string_to_find]
... except ValueError:
...     result = None
...
>>> print(result and result == sorted(result))
True
Actually, for a sequence of chars like Sx2016, the pattern that best serves your purpose is the more specific:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can perform this kind of check by defining a function like this:
import re

def contains_sequence(text, seq):
    pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
    return re.search(pattern, text)
This approach adds a layer of complexity, but brings a couple of advantages as well:
It's the fastest one, because the regex engine walks down the string only once, while the dot-star approach goes to the end of the string and backtracks each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 steps
It works on multiline input as well.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015', 'StackExchange2016', 'Stack\nExchange\n2015', 'Stach\nExchange\n2016']
>>> list(map(lambda x: x + ': yes' if contains_sequence(x, sequence) else x + ': no', inputs))
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']
I'm scraping reddit usernames using Python and I'm trying to extract the username from a URL. The URL looks like this:
https://www.reddit.com/user/ExampleUser
This is my code:
def extract_username(url):
    start = url.find('https://www.reddit.com/user/') + 28
    end = url.find('?', start)
    end2 = url.find("/", start)
    return url[start:end] and url[start:end2] and url[start:]
The first part works, but removing the question mark and forward slash doesn't. Maybe I'm using the "and" keyword wrong? Which means I sometimes get something like this:
ExampleUser/
ExampleUser/comments/
ExampleUser/submitted/
ExampleUser/gilded/
ExampleUser?sort=hot
ExampleUser?sort=new
ExampleUser?sort=top
ExampleUser?sort=controversial
I know I can use the API, but I'd like to learn how to do it without. I've also heard about regular expressions, but aren't they pretty slow?
You could use re module.
>>> s = "https://www.reddit.com/user/ExampleUser/comments/"
>>> import re
>>> re.search(r'https://www.reddit.com/user/([^/?]+)', s).group(1)
'ExampleUser'
[^/?]+ is a negated character class which matches any character except / or ?, one or more times. The capturing group () around it captures those matched characters. Later we could refer to the captured characters through backreferencing (like \1, which refers to group 1).
By defining a separate function.
>>> def extract_username(url):
...     return re.search(r'https://www.reddit.com/user/([^/?]+)', url).group(1)
...
>>> extract_username('https://www.reddit.com/user/ExampleUser')
'ExampleUser'
>>> extract_username('https://www.reddit.com/user/ExampleUser/submitted/')
'ExampleUser'
>>> extract_username('https://www.reddit.com/user/ExampleUser?sort=controversial')
'ExampleUser'
This removes anything which follows a '?' and then splits on '/', retrieving the fifth element, which is the user name:
>>> s = 'https://www.reddit.com/user/ExampleUser?sort=new'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
This also works on the other cases that you showed. For example:
>>> s = 'https://www.reddit.com/user/ExampleUser/comments/'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
>>> s = 'https://www.reddit.com/user/ExampleUser'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
Just for kicks, here's an example using find. Basically, you just want to take the minimum position where you find a delimiter, or the end if none is found at all:
def extract_username(url):
    username = url[len('https://www.reddit.com/user/'):]
    end = min([i for i in (len(username),
                           username.find('/'),
                           username.find('?')) if i >= 0])
    return username[:end]

for url in ('https://www.reddit.com/user/ExampleUser',
            'https://www.reddit.com/user/ExampleUser/submitted/',
            'https://www.reddit.com/user/ExampleUser?sort=controversial'):
    print(extract_username(url))
At the moment I have a string and I want to extract the contents of the parenthesis.
This is the string:
>>> string = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
This is the regular expression I am using and it yields the following:
>>> regex_output = re.findall(r'\((\d{3,4})m|([\d.:]+\d)\)', string)
>>> regex_output
[('600', ''), ('', '36.57')]
As I understand it, the empty strings appear because my regex has two capturing groups joined by alternation, and each match fills only one of them.
All I want is a list of two variables as:
['600','36.57']
I could make my new list from my current output, but that would defeat the purpose of using the regular expression. So is there a way of achieving my desired output by modifying my regex? Thanks.
>>> import re
>>> s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
You can search for the enclosing ( and )
>>> re.search('\((.*?)\)',s).group(1)
'600m 36.57'
Then split on the 'm ' characters
>>> re.search('\((.*?)\)',s).group(1).split('m ')
['600', '36.57']
You could also try the code below, which uses a positive lookbehind to match the number just after ( and a lookahead to match the decimal number just before ):
>>> import re
>>> s = "djdjfksjlfsdk (600m 36.57) fhksjhfhsdhfkjhks"
>>> m = re.findall(r'(?<=\()\d+|\d+[.:]\d+(?=\))', s, re.M)
>>> m
['600', '36.57']
I get some string like this: \input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}
I would like to capture all the paths: path1, path2, ... pathn. I tried the re module in Python; however, it does not support repeated captures (a repeated group keeps only its last match).
For example, r"\\mypath\{(\{[^\{\}\[\]]*\})*\}" will only return the last matched group: applying re.search with this pattern to r"\mypath{{path1}{path2}}" returns groups() as ("{path2}",).
Then I found an alternative way to do this:
>>> import re
>>> gpathRegexPat = r"(?:\\mypath\{)((\{[^\{\}\[\]]*\})*)(?:\})"
>>> gpathRegexCp = re.compile(gpathRegexPat)
>>> strpath = gpathRegexCp.search(r'\mypath{{sadf}{ad}}').groups()[0]
>>> strpath
'{sadf}{ad}'
>>> p = re.compile(r'\{([^\{\}\[\]]*)\}')
>>> p.findall(strpath)
['sadf', 'ad']
or:
>>> gpathRegexPat=r"\\mypath\{(\{[^{}[\]]*\})*\}"
>>> gpathRegexCp=re.compile(gpathRegexPat, flags=re.I|re.U)
>>> strpath=gpathRegexCp.search(r'\input{{whatever}{1}}\mypath{{sadf}{ad}}\shape{{0.2}{0.1}}').group()
>>> strpath
'\\mypath{{sadf}{ad}}'
>>> p.findall(strpath)
['sadf', 'ad']
At this point, I thought, why not just use findall on the original string? I might use:
gpathRegexPat=r"(?:\\mypath\{)(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?(?:\})": if the first (?:\{[^\{\}\[\]]*\})*? matches 0 times and the 2nd matches 1 time, it should capture sadf; if the first matches 1 time and the 2nd matches 0 times, it should capture ad. However, this regex only returns ['sadf'].
Without all those extra patterns ((?:\\mypath\{) and (?:\})), it actually works:
>>> p2=re.compile(r'(?:\{[^\{\}\[\]]*\})*?\{([^\{\}\[\]]*)\}(?:\{[^\{\}\[\]]*\})*?')
>>> p2.findall(strpath)
['sadf', 'ad']
>>> p2.findall('{adadd}{dfada}{adafadf}')
['adadd', 'dfada', 'adafadf']
Can anyone explain this behavior to me? Is there any smarter way to achieve the result I want?
re.findall("{([^{}]+)}", text)
should work. It returns:
['path1', 'path2', 'path3', 'pathn']
finally:
import re

my_path = r"\input{{whatever}{1}}\mypath{{path1}{path2}{path3}...{pathn}}\shape{{0.2}{0.3}}"
# get the \mypath part
my_path2 = [p for p in my_path.split("\\") if p.startswith("mypath")][0]
print(re.findall("{([^{}]+)}", my_path2))
or even better
re.findall("{(path\d+)}",text) #will only return things like path<num> inside {}
You are right. It is not possible to return repeated subgroups inside a group. To do what you want, you can use a regular expression to capture the group and then use a second regular expression to capture the repeated subgroups.
In this case that would be something like \\mypath\{((?:\{.*?\})*)\}; its capturing group will return {path1}{path2}{path3}...{pathn}.
Then to find the repeating patterns of {pathn} inside that string, you can simply use \{(.*?)\}. This will match anything within the braces. The .*? is a non-greedy version of .*, meaning it will return the shortest possible match instead of the longest possible match.
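A quick illustration of the difference:
>>> import re
>>> s = "{path1}{path2}{path3}"
>>> re.findall(r'\{(.*?)\}', s)   # non-greedy: stops at the first '}'
['path1', 'path2', 'path3']
>>> re.findall(r'\{(.*)\}', s)    # greedy: runs to the last '}'
['path1}{path2}{path3']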
Does Python have a built-in (meaning in the standard library) way to do a split on strings that produces an iterator rather than a list? I have in mind working on very long strings and not needing to consume most of the string.
Not directly splitting strings as such, but the re module has re.finditer() (and corresponding finditer() method on any compiled regular expression).
@Zero asked for an example:
>>> import re
>>> s = "The quick brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
...
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
Like s.Lott, I don't quite know what you want. Here is code that may help:
s = "This is a string."

for character in s:
    print(character)

for word in s.split(' '):
    print(word)
There are also s.index() and s.find() for finding the next character.
Later: Okay, something like this.
>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1
...
>>> for w in tokenizer(s, ' '):
...     print(w)
...
This
is
a
string.
If you don't need to consume the whole string, that's because you are looking for something specific, right? Then just look for that, with re or .find() instead of splitting. That way you can find the part of the string you are interested in, and split that.
There is no built-in iterator-based analog of str.split. Depending on your needs you could make a list iterator:
iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'
However, a tool from this third-party library likely offers what you want, more_itertools.split_at. See also this post for an example.
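A sketch of how that might look (assuming the more-itertools package is installed; split_at yields lists of elements, so join each back into a string):

from more_itertools import split_at

parts = split_at("abcdcba", lambda ch: ch == "b")  # yields the lists between separators
print([''.join(p) for p in parts])                 # ['a', 'cdc', 'a']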
Here's an isplit function, which behaves much like split - you can turn off the regex syntax with the regex argument. It uses the re.finditer function and returns the strings "in between" the matches.
import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)
    start = 0
    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end
    if s[start:]:
        yield s[start:]
_examples = ['', 'a', 'a b', ' a b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split())
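Example usage (note that, like str.split() with no arguments, empty strings between adjacent separators are dropped):
>>> list(isplit('  a  b c '))
['a', 'b', 'c']
>>> list(isplit('a,b,,c', ',', regex=False))
['a', 'b', 'c']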
Look at itertools. It contains things like takewhile, islice and groupby that allow you to slice an iterable -- a string is iterable -- into another iterable based on either indexes or a boolean condition of sorts.
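For example, takewhile can consume just the prefix of a long string lazily (the sample string here is hypothetical):

from itertools import takewhile

s = "key=value;much-more-text-we-never-need-to-scan"
# Stop at the first ';' without touching the rest of the string.
first_field = ''.join(takewhile(lambda ch: ch != ';', s))
print(first_field)  # key=value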
You could use something like SPARK (which has been absorbed into the Python distribution itself, though not importable from the standard library), but ultimately it uses regular expressions as well so Duncan's answer would possibly serve you just as well if it was as easy as just "splitting on whitespace".
The other, far more arduous option would be to write your own Python module in C to do it if you really wanted speed, but that's a far larger time investment of course.