repeating previous regex - python

I have a line (and an arbitrary number of them):
0 1 1 75 55
I can get it by doing
x = re.search(r"\d+\s+\d+\s+(\d+)\s+(\d+)\s+(\d+)", line)
if x is not None:
    print(x.group(1))
    print(x.group(2))
    print(x.group(3))
But there must be a neater way to write this. I was looking at the docs for something to repeat the previous expression and found (exp){m times}.
So I try
x = re.search("(\d+\s+){5}", line)
and then expect x.group(1) to be 0, x.group(2) to be 1, x.group(3) to be 1, and so on,
but x.group(1) outputs 55 (the last number). I'm a bit confused. Thanks.
Also on a side note. Do you guys have any recommendations for online tutorials (or free to download books) on regex?

Repeating a capturing group does not give you access to each repetition individually, and that is unlikely to change any time soon: the group only keeps the text of its last repetition. You will just have to write the regex out the long way, or use a string method such as .split() and avoid regex altogether.
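To see that behaviour concretely, here is a small sketch (the variable names are mine, not from the question). It shows a repeated group keeping only its final repetition, while split() hands back every field:

```python
import re

line = "0 1 1 75 55"

# A repeated capturing group is overwritten on each repetition,
# so only the text from the final repetition survives.
m = re.search(r"(\d+\s*){5}", line)
last_only = m.group(1)

# str.split() returns every whitespace-separated field.
fields = line.split()
wanted = fields[2:5]

print(last_only)
print(wanted)
```

Slicing the list from split() is usually enough when the line contains nothing but the numbers.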

Have you considered findall, which repeats the search until the input string is exhausted and returns all matches in a list?
>>> import re
>>> line = '0 1 1 75 55'
>>> x = re.findall(r"(\d+)", line)
>>> print(x)
['0', '1', '1', '75', '55']

In your regular expression, there is only one group, since you have only one pair of parentheses. This group will return the last match, as you found out yourself.
If you want to use regular expressions, and you know the number of integers in a line in advance, I would go for
x = re.search("\s+".join(["(\d+)"] * 5), line)
in this case.
(Note that
x = re.search("(\d+\s+){5}", line)
requires a space after the last number.)
But for the example you gave I'd actually use
line = "0 1 1 75 55"
int_list = list(map(int, line.split()))
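As a rough sketch of the join-based pattern, here it is wrapped in a helper. The name capture_numbers and the count parameter are my own, purely for illustration:

```python
import re

def capture_numbers(line, count):
    """Return `count` whitespace-separated integers from line, or None."""
    # Builds e.g. (\d+)\s+(\d+)\s+(\d+) for count == 3:
    # one separate capturing group per number.
    pattern = r"\s+".join([r"(\d+)"] * count)
    m = re.search(pattern, line)
    return [int(g) for g in m.groups()] if m else None

print(capture_numbers("0 1 1 75 55", 5))
print(capture_numbers("0 1", 5))
```

Because every number gets its own pair of parentheses, each one is available through groups(), unlike with the {5} repetition.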

import re
line = '0 1 2 75 55'
x = re.search('\\s+'.join(5*('(\\d+)',)), line)
if x:
    print('\n'.join(x.group(3,4,5)))
Bof
Or, using Sven Marnach's idea:
print('\n'.join(line.split()[2:5]))

Related

Extracting multiple substrings from one string

I have the following string which I am parsing from another file :
"CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"
What I want to do is split them up into individual values, which I achieve using the .split() function. So I get them as an array:
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
Now I want to further split them into 3 segments, and store them in 3 other variables so I can write them to excel as such :
a = CHEM1
b = 5
c = GL
for the first array, then I will loop back for the second array:
a = CH3M2
b = 55
c = LB
and finally :
a = CHEM3954114
b = 50
c = KG
I am unsure how to go about that, as I am still new to Python. To the best of my knowledge I could call the split function several times, but I believe there has to be a better way to do it than that.
Thank you.
You should use the re package:
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")
for x1 in x:
    m = pattern.search(x1)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
FOLLOW UP:
The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case.
Essentially, there are 3 groups of characters you want to extract:
All the characters (letters and numbers) up to the ( - not included
The digits after the (
The letters after the digits extracted in the previous step - up to the ) - not included.
A group is anything enclosed in brackets (). In this specific case it may become confusing because, as stressed above, you also have brackets as part of the string itself. Those need to be escaped with a \ to distinguish them from the ones used as regular expression syntax.
The first group is ([^\(]+), which essentially means: match one or more characters which are not ( (the ^ is the negation, and the bracket ( needs to be escaped here, for the reasons described above). Note that characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more laser guided if you need (including, for example, only numbers and letters using [\w]+)
The second group is (\d+), which is essentially matching 1 or more (expressed with +) digits (expressed with \d).
The last group is (.+) - match any remaining characters, with the final \) making sure that you match any remaining characters up to the closing bracket.
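The same three-part pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the group names name, qty and unit are labels I made up, not something from the question.

```python
import re

CHEM_RE = re.compile(
    r"""
    (?P<name>[^(]+)   # everything up to, but not including, the opening bracket
    \(                # the literal opening bracket
    (?P<qty>\d+)      # one or more digits
    (?P<unit>.+)      # whatever is left ...
    \)                # ... up to the literal closing bracket
    """,
    re.VERBOSE,
)

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
parsed = []
for item in x:
    m = CHEM_RE.search(item)
    if m:
        parsed.append((m.group("name"), int(m.group("qty")), m.group("unit")))

print(parsed)
```

Named groups make the later unpacking a little easier to read than group(1), group(2), group(3).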
Using re.findall we can try:
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
for inp in x:
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)
# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
Considering the elements you provided in your question, I assume that there can not be '(' more than once in an element.
Here is the function I wrote.
def deconstruct(chem):
    name = chem[:chem.index('(')]
    qty = chem[chem.index('(') + 1:-1]
    mag, unit = "", ""
    for char in qty:
        if char.isalpha():
            unit += char
        else:
            mag += char
    # Use int(mag) instead of float(mag) if the magnitude is always a whole number.
    return {"name": name, "mag": float(mag), "unit": unit}
Usage:
x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
for chem in x:
    d = deconstruct(chem)
    print(d["name"], d["mag"], d["unit"])
Use re and create a list of dictionaries
import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
keys =['a', 'b', 'c']
y = []
for s in x:
    vals = re.sub(r'(.*?)\((\d*)(.*?)\)', r'\1 \2 \3', s).split()
    y.append(dict(zip(keys, vals)))
[print("a: %s\nb: %s\nc: %s\n" % (i['a'], i['b'], i['c'])) for i in y]
gives
a: CHEM1
b: 5
c: GL
a: CH3M2
b: 55
c: LB
a: CHEM3954114
b: 50
c: KG

re.findall Find numbers with dashes(-) and commas(,)

I have the value
x = '970.11 - 1,003.54'
I've tried many types of re.findall for example
re.findall('\d+',x)
['970', '11', '1', '003', '54']
although I would like for it to show
['970.11', '1,003.54']
\d is only digits. It won't match other characters even if we think they are part of numbers. You need to do that manually with something like:
import re
x = '970.11 - 1,003.54'
re.findall(r'[\d.,]+', x) # match runs of digits, . or ,
result:
['970.11', '1,003.54']
This is a pretty forgiving regex: it will also match a lot of things that probably aren't numbers (like ..,,4). Numbers can be tricky to match with a regex if you want something that works in the general case (.45, 11,000.2, 22. and so on). The more consistent your input, the easier it will be. And sometimes it's easier to match the non-number parts instead (like your -).
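If the forgiving version is too loose, a stricter pattern is possible. The sketch below assumes numbers use a comma as thousands separator and a dot as decimal point. That is my assumption about the input format, and NUMBER_RE is just an illustrative name:

```python
import re

NUMBER_RE = re.compile(
    r"""
    \d{1,3}(?:,\d{3})+(?:\.\d+)?   # comma-grouped thousands, e.g. 1,003.54
    | \d+(?:\.\d+)?                # plain integer or decimal, e.g. 970.11
    | \.\d+                        # leading-dot decimal, e.g. .45
    """,
    re.VERBOSE,
)

x = '970.11 - 1,003.54'
numbers = NUMBER_RE.findall(x)

# Drop the thousands separators before converting to float.
values = [float(n.replace(",", "")) for n in numbers]

print(numbers)
print(values)
```

All groups are non-capturing, so findall returns the whole matched text for each number.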
Try this one, it also works:
import re
re.findall('\d+\,?\d+\.*\d*',x)
Output:
['970.11', '1,003.54']
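Here the , is optional: it is consumed when it sits between digits and skipped otherwise. Note that both \d+ parts are mandatory, so this pattern needs at least two digits and will not match a single-digit number.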
If you want . as optional then you can make it like this:
In [48]: x
Out[48]: '970.11 - 1,003.54 2345'
In [49]: re.findall('\d+\,?\d+\.?\d+',x)
Out[49]: ['970.11', '1,003.54', '2345']
To get this you can use capturing groups. Using your example:
x = '970.11 - 1,003.54'
y = re.findall('([0-9.,]+)([ -]+)([0-9.,]+)',x)
print(y[0]) #prints ('970.11', ' - ', '1,003.54')
z = [y[0][0],y[0][2]]
print(z) #prints ['970.11', '1,003.54']
The regular expression in this case consists of three groups: the first and last each match one or more characters from 0123456789., (digits, dot and comma), and the middle one matches one or more spaces or dashes.

Python regular expression match number in string

I used a regular expression in Python 2.7 to match the numbers in a string, but my expression can't match a single-digit number. Here is my code:
import re
import cv2
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
data = re.findall(re_matchData, s)
print data
and then print:
['858', '1790', '-156.25']
but when I change expression from
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
to
re_matchData = re.compile(r'\-?\d{0,10}\.?\d{1,10}')
then print:
['858', '1790', '-156.25', '2']
Is there a difference between \d{1,10} and \d{0,10} that explains this?
If I did something wrong, how do I correct it?
Thanks for checking my question!
Try this:
r'\-?\d{1,10}(?:\.\d{1,10})?'
Use (?:...)? to make the fractional part optional.
With r'\-?\d{0,10}\.?\d{1,10}', it is the \.?\d{1,10} part that matches 2.
The first \d{1,10} matches from 1 to 10 digits, and the second \d{1,10} also matches from 1 to 10 digits. In order for them both to match, you need at least 2 digits in your number, with an optional . between them.
You should make the entire fraction optional, not just the dot:
r'\-?\d{1,10}(?:\.\d{1,10})?'
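A quick side-by-side run makes the difference visible. This is only a sketch, and the variable names are mine:

```python
import re

s = '858 1790 -156.25 2'

# Two mandatory \d{1,10} parts: every match needs at least two digits,
# so the lone "2" at the end is skipped.
two_digits_min = re.findall(r'-?\d{1,10}\.?\d{1,10}', s)

# Optional fractional part: a single digit is enough.
optional_fraction = re.findall(r'-?\d{1,10}(?:\.\d{1,10})?', s)

print(two_digits_min)
print(optional_fraction)
```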
I would rather do as follows:
import re
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{0,10}')
data = re_matchData.findall(s)
print data
Output:
['858', '1790', '-156.25', '2']

Grabbing multiple patterns in a string using regex

In python I'm trying to grab multiple inputs from string using regular expression; however, I'm having trouble. For the string:
inputs = 12 1 345 543 2
I tried using:
match = re.match(r'\s*inputs\s*=(\s*\d+)+',string)
However, this only returns the value '2'. I'm trying to capture all the values '12','1','345','543','2' but not sure how to do this.
Any help is greatly appreciated!
EDIT: Thank you all for explaining why this does not work and providing alternative suggestions. Sorry if this is a repeat question.
You could try something like:
re.findall("\d+", your_string).
You cannot do this with a single regex (unless you were using .NET), because each capturing group will only ever return one result even if it is repeated (the last one in the case of Python).
Since variable length lookbehinds are also not possible (in which case you could do (?<=inputs.*=.*)\d+), you will have to separate this into two steps:
match = re.match(r'\s*inputs\s*=\s*(\d+(?:\s*\d+)+)', string)
integers = re.split(r'\s+',match.group(1))
So now you capture the entire list of integers (and the spaces between them), and then you split that capture at the spaces.
The second step could also be done using findall:
integers = re.findall(r'\d+',match.group(1))
The results are identical.
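Put together, the two steps might look like the sketch below. The function name parse_inputs is hypothetical, and I loosened the inner repetition to (?:\s+\d+)* so that a single number after the = also matches. That is my change, not part of the answer above.

```python
import re

def parse_inputs(text):
    """Return the integers after 'inputs =' as strings, or [] if no match."""
    # Step 1: capture the whole run of integers (and the spaces between them).
    match = re.match(r'\s*inputs\s*=\s*(\d+(?:\s+\d+)*)', text)
    if match is None:
        return []
    # Step 2: pull the individual integers out of that capture.
    return re.findall(r'\d+', match.group(1))

print(parse_inputs('inputs = 12 1 345 543 2'))
```

Checking for None before calling group(1) avoids an AttributeError on lines that do not start with inputs =.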
You can embed your regular expression:
import re
s = 'inputs = 12 1 345 543 2'
print(re.findall(r'(\d+)', re.match(r'inputs\s*=\s*([\s\d]+)', s).group(1)))
>>>
['12', '1', '345', '543', '2']
Or do it in layers:
import re
def get_inputs(s, regex=r'inputs\s*=\s*([\s\d]+)'):
    match = re.match(regex, s)
    if not match:
        return False # or raise an exception - whatever you want
    else:
        return re.findall(r'(\d+)', match.group(1))
s = 'inputs = 12 1 345 543 2'
print(get_inputs(s))
>>>
['12', '1', '345', '543', '2']
You should look at this answer: https://stackoverflow.com/a/4651893/1129561
In short:
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).

How do I regex match with grouping with unknown number of groups

I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:
...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
...
I would like to capture the list of numbers that occurs after the first incidence of the line that starts with VALUE. i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.
I tried to use this as a regex:
VALUE (?:(\d+)\s)+
This matches the line, but it only captures the last value, so I just get ('119',).
What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():
s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])
You can use a regular expression to see whether your input line matches your expected format (using the regex in your question), then you can run the above code without having to check for "VALUE" and knowing that the int(x) conversion will always succeed since you've already confirmed that the following character groups are all digits.
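A minimal sketch of that validate-then-split idea follows. The helper name parse_value_line is made up, and the pattern is a slight variant of the one in the question (non-capturing, with the whitespace before each number) so that it can be used with fullmatch:

```python
import re

# Format check only: the keyword followed by one or more integers.
VALUE_LINE = re.compile(r"VALUE(?:\s+\d+)+")

def parse_value_line(line):
    """Return the integers on a VALUE line, or None if the format is wrong."""
    if VALUE_LINE.fullmatch(line.strip()) is None:
        return None
    # The format is already confirmed, so int() cannot fail here.
    return [int(tok) for tok in line.split()[1:]]

print(parse_value_line("VALUE 100 234 568 9233 119"))
print(parse_value_line("VALUE 12 abc"))
```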
>>> import re
>>> reg = re.compile('\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']
That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
Another option not described here is to have a bunch of optional capturing groups.
VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$
This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
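In Python the optional-groups version would behave roughly as sketched below (OPTIONAL_RE is an illustrative name). Groups that did not take part in the match come back as None, so they need filtering out:

```python
import re

OPTIONAL_RE = re.compile(r"VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$")

m = OPTIONAL_RE.match("VALUE 100 234 568")
all_groups = m.groups()

# Unused optional groups are reported as None; keep only the real captures.
present = [g for g in all_groups if g is not None]

print(all_groups)
print(present)
```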
You could just run your main match regex, then run a secondary regex on those matches to get the numbers:
matches = Regex.Match(log)
foreach (Match match in matches)
{
submatches = Regex2.Match(match)
}
This is of course also if you don't want to write a full parser.
I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:
VALUE((\s\d+)+)
This should result in three groups: [0] the whole match, [1] everything after VALUE, [2] the last space+value.
[0] and [2] can be ignored and then [1] can be used with the following:
\s(\d+)
Note: these regexps were not tested, I hope you get the idea though.
The reason why Greg's answer doesn't work for me is because the 2nd part of the parsing is more complicated and not simply some numbers separated by a space.
However, I would honestly go with Greg's solution for this question (it's probably way more efficient).
I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.
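For what it's worth, here is a sketch of the two-regex approach in Python against a multi-line log. The sample log text is invented for the example, and I used [ \t] instead of \s so that the outer match cannot run across line breaks (my choice, not part of the answer above):

```python
import re

log = """\
header
VALUE 100 234 568 9233 119
noise
VALUE 101 124 9223 4329 1559
"""

# Stage 1: find the first VALUE line and capture everything after the keyword.
outer = re.search(r"^VALUE((?:[ \t]+\d+)+)[ \t]*$", log, re.MULTILINE)

# Stage 2: pull the individual numbers out of that captured tail.
numbers = tuple(re.findall(r"\d+", outer.group(1))) if outer else ()

print(numbers)
```

re.search stops at the first matching line, which matches the "first incidence" requirement in the question.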
You can use re.match to check first and call re.split to use a regex as separator to split.
>>> import re
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+"%(sep)) # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s): # OR re.match(r"VALUE(\s+\d+)+", s)
...     result = reg_sep.split(s)[1:] # OR re.split(r"\s+", s)[1:]
...
>>> result
['100', '234', '568', '9233', '119']
The separator "\s+" can be more complicated.
