Grabbing multiple patterns in a string using regex - python

In python I'm trying to grab multiple inputs from string using regular expression; however, I'm having trouble. For the string:
inputs = 12 1 345 543 2
I tried using:
match = re.match(r'\s*inputs\s*=(\s*\d+)+',string)
However, this only returns the value '2'. I'm trying to capture all the values '12','1','345','543','2' but not sure how to do this.
Any help is greatly appreciated!
EDIT: Thank you all for explaining why this is does not work and providing alternative suggestions. Sorry if this is a repeat question.

You could try something like:
re.findall("\d+", your_string).

You cannot do this with a single regex (unless you were using .NET), because each capturing group will only ever return one result even if it is repeated (the last one in the case of Python).
Since variable length lookbehinds are also not possible (in which case you could do (?<=inputs.*=.*)\d+), you will have to separate this into two steps:
match = re.match(r'\s*inputs\s*=\s*(\d+(?:\s*\d+)+)', string)
integers = re.split(r'\s+',match.group(1))
So now you capture the entire list of integers (and the spaces between them), and then you split that capture at the spaces.
The second step could also be done using findall:
integers = re.findall(r'\d+',match.group(1))
The results are identical.

You can embed your regular expression:
import re
s = 'inputs = 12 1 345 543 2'
print re.findall(r'(\d+)', re.match(r'inputs\s*=\s*([\s\d]+)', s).group(1))
>>>
['12', '1', '345', '543', '2']
Or do it in layers:
import re
def get_inputs(s, regex=r'inputs\s*=\s*([\s\d]+)'):
match = re.match(regex, s)
if not match:
return False # or raise an exception - whatever you want
else:
return re.findall(r'(\d+)', match.group(1))
s = 'inputs = 12 1 345 543 2'
print get_inputs(s)
>>>
['12', '1', '345', '543', '2']

You should look at this answer: https://stackoverflow.com/a/4651893/1129561
In short:
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).

Related

Is there a way to find (potentially) multiple results with re.search?

While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re
def main():
pattern = '(.*)\.S(\d+)[E(\d+)]+'
strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
#print(strings)
for string in strings:
print(string)
result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
print(result.group(2))
if __name__== "__main__":
main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish e02 in "foo.s01e006e007" from that of "age007.s01e001"), and Python doesn't let you use variable-length lookbehind (to make sure s\d+ is before it without overlapping).
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
re.findall instead of re.search will return a list of all matches
If you can make use of the PyPi regex module you could make use of repeating capture groups in the pattern, and then use .captures()
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
See a Python demo and a regex101 demo.
Or using .capturesdict () with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
See a Python demo.
Note that the notation [E(\d+)] that you used is a character class, that matches 1 or the listed characters like E ( a digit + )

Regex for Backward Look up

Hi I have a string that I want to parse with Python.
I am new to regex, so really appreciate help.
ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11_14
Output C9 : Always starts with letter and a digit
Output 7011: Always 4 or more digits
Output 1, 11, 14: Always at the end of the string. One or two digits. May have more than 3.
Update.
I was using [^_]+ and it parses everything '_'. I wanted just those matches.
you can use this regex
((?<=_)\d{4})|((?<=_)\w?\d{1})
https://regex101.com/r/0fhJFY/1
You might get along with
import re
def get_values(string):
rx = re.compile(r'_([A-Z]\d)_.*?_(\d{4,}(?=_)).*?((?:_\d{1,2})+)')
m = rx.search(string)
if m:
return (m.group(1), m.group(2), [item for item in m.group(3).split("_") if item])
print(get_values("ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11_14"))
# ('C9', '7011', ['1', '11', '14'])
See a demo for the expression on regex101.com.
i don't understand what you need exactly but the regex:
[^_]+_[^_]+::[^_]_[^_]+_[^_]+_([A-Z]\d)_[^_]+_(\d{4,})_(\d)_(\d+)_(\d+)
give the output you want for the string you provided.
To test and learn regex I advise you to visit site like this.

Python regular expression match number in string

I used regular expression in python2.7 to match the number in a string but I can't match a single number in my expression, here are my code
import re
import cv2
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
data = re.findall(re_matchData, s)
print data
and then print:
['858', '1790', '-156.25']
but when I change expression from
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
to
re_matchData = re.compile(r'\-?\d{0,10}\.?\d{1,10}')
then print:
['858', '1790', '-156.25', '2']
is there any confuses between d{1, 10} and d{0,10} ?
If I did wrong, how to correct it ?
Thanks for checking my question !
try this:
r'\-?\d{1,10}(?:\.\d{1,10})?'
use (?:)? to make fractional part optional.
for r'\-?\d{0,10}\.?\d{1,10}', it is \.?\d{1,10} who matched 2.
The first \d{1,10} matches from 1 to 10 digits, and the second \d{1,10} also matches from 1 to 10 digits. In order for them both to match, you need at least 2 digits in your number, with an optional . between them.
You should make the entire fraction optional, not just the ..
r'\-?\d{1,10}(?:\.\d{1,10})?'
I would rather do as follows:
import re
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{0,10}')
data = re_matchData.findall(s)
print data
Output:
['858', '1790', '-156.25', '2']

repeating previous regex

I have a line (and a arbitrary number of them)
0 1 1 75 55
I can get it by doing
x = re.search("\d+\s+\d+\s+(\d+)\s+(\d+)\s+(\d+)", line)
if x != None:
print(x.group(1))
print(x.group(2))
print(x.group(3))
But there must be a neater way to write this. I was looking at the docs for something to repeat the previous expression and found (exp){m times}.
So I try
x = re.search("(\d+\s+){5}", line)
and then expect x.group(1) to be 0, 2 to be 1, 3 to be 1 and so on
but x.group(1) ouputs 55 (the last number). Im sort of confused. Thanks.
Also on a side note. Do you guys have any recommendations for online tutorials (or free to download books) on regex?
Repetition of capturing groups does not work, and won't any time soon (in the sense of having the ability to individually actually access the matched parts) – you will just have to write the regex the long way or use a string method such as .split(), avoiding regex altogether.
Have you considered findall which repeats the search until the input string is exhausted and returns all matches in a list?
>>> import re
>>> line = '0 1 1 75 55'
>>> x = re.findall("(\d+)", line)
>>> print x
['0', '1', '1', '75', '55']
In your regular expression, there is only one group, since you have only one pair of parentheses. This group will return the last match, as you found out yourself.
If you want to use regular expressions, and you know the number of integers in a line in advance, I would go for
x = re.search("\s+".join(["(\d+)"] * 5), line)
in this case.
(Note that
x = re.search("(\d+\s+){5}", line)
requires a space after the last number.)
But for the example you gave I'd actually use
line = "0 1 1 75 55"
int_list = map(int, line.split())
import re
line = '0 1 2 75 55'
x = re.search('\\s+'.join(5*('(\\d+)',)), line)
if x:
print '\n'.join(x.group(3,4,5))
Bof
Or, with idea of Sven Marnach:
print '\n'.join(line.split()[2:5])

How do I regex match with grouping with unknown number of groups

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:
...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
...
I would like to capture the list of numbers that occurs after the first incidence of the line that starts with VALUE. i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.
I tried to use this as a regex:
VALUE (?:(\d+)\s)+
This matches the line, but it only captures the last value, so I just get ('119',).
What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():
s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
print [int(x) for x in a[1:]]
You can use a regular expression to see whether your input line matches your expected format (using the regex in your question), then you can run the above code without having to check for "VALUE" and knowing that the int(x) conversion will always succeed since you've already confirmed that the following character groups are all digits.
>>> import re
>>> reg = re.compile('\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9223', '119']
That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
Another option not described here is to have a bunch of optional capturing groups.
VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$
This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
You could just run you're main match regex then run a secondary regex on those matches to get the numbers:
matches = Regex.Match(log)
foreach (Match match in matches)
{
submatches = Regex2.Match(match)
}
This is of course also if you don't want to write a full parser.
I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:
VALUE((\s\d+)+)
This should result in three matches: [0] the whole line, [1] the stuff after value [2] the last space+value.
[0] and [2] can be ignored and then [1] can be used with the following:
\s(\d+)
Note: these regexps were not tested, I hope you get the idea though.
The reason why Greg's answer doesn't work for me is because the 2nd part of the parsing is more complicated and not simply some numbers separated by a space.
However, I would honestly go with Greg's solution for this question (it's probably way more efficient).
I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.
You can use re.match to check first and call re.split to use a regex as separator to split.
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+"%(sep)) # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s): # OR re.match(r"VALUE(\s+\d+)+", s)
... result = reg_sep.split(s)[1:] # OR re.split(r"\s+", s)[1:]
>>> result
['100', '234', '568', '9233', '119']
The separator "\s+" can be more complicated.

Categories

Resources