I wanted to create a regex which could match the following pattern:
5,000
2.5
25
This is the regex I have thus far:
re.compile('([\d,]+)')
How can I adjust for the .?
Easiest method would just be this:
re.compile('([\d,.]+)')
But this will allow inputs like .... This might be acceptable, since your original pattern allows ,,,. However, if you want to allow only a single decimal point you can do this:
re.compile(r'([\d,]+\.?\d*)')
Note that this won't allow inputs like .5; you'd need to use 0.5 instead.
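A quick way to see how the single-decimal-point pattern behaves is to run it against a few inputs. This is a minimal sketch that assumes the dot is escaped as `\.` and uses `fullmatch` so the whole string has to fit; the last two sample strings are made up for illustration.

```python
import re

# Digits and commas, then at most one literal dot, then optional digits.
pattern = re.compile(r'[\d,]+\.?\d*')

# The first three inputs come from the question; the last two are illustrative.
samples = ['5,000', '2.5', '25', '.5', '....']

# fullmatch() succeeds only if the entire string fits the pattern.
results = {s: pattern.fullmatch(s) is not None for s in samples}
print(results)
```

The inputs without a leading digit or comma are rejected because `[\d,]+` has to consume at least one character first.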
I think the perfect regex would be
re.compile(r'\d{1,2}[,.]\d{1,3}')
This way you match one or two digits followed by a comma or a full stop, and then one to three digits. Note that it requires a separator, so it will not match a plain integer such as 25.
You don't need the parentheses if you are not going to use the contents of the match later. Omitting them speeds up the process.
Here is a very big but powerful regex to capture anything that is a valid number:
import re
string = """
5,000
2.5
25
234,456,678.345
...
,,,
23,332.1
abc
45,2
0.5
"""
print(re.findall(r"(?:\d+(?:,?\d{3})*)+\.?(?:\d+)?", string))
output:
# Note that it will not capture "45,2" because it is invalid
# It instead does "45" and "2", which are each valid
['5,000', '2.5', '25', '234,456,678.345', '23,332.1', '45', '2', '0.5']
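If the goal is to validate a single string rather than pull numbers out of a larger text, a stricter variant may be easier to reason about. The sketch below is one possible approach, not the pattern from the answer above: `NUMBER` and `is_number` are hypothetical names, and it assumes commas are only allowed as thousands separators in groups of exactly three digits.

```python
import re

# Either comma-grouped digits (1-3 digits, then one or more ",ddd" groups)
# or a plain run of digits, followed by an optional fractional part.
NUMBER = re.compile(r'(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?')

def is_number(text):
    """Return True if the whole string looks like a valid number."""
    return NUMBER.fullmatch(text) is not None

for candidate in ['5,000', '234,456,678.345', '45,2', '...', '0.5']:
    print(candidate, is_number(candidate))
```

Because `fullmatch` is used, a string like 45,2 is rejected outright instead of being split into two separate matches.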
Related
I have the value
x = '970.11 - 1,003.54'
I've tried many types of re.findall for example
re.findall(r'\d+', x)
['970', '11', '1', '003', '54']
although I would like for it to show
['970.11', '1,003.54']
\d is only digits. It won't match other characters even if we think they are part of numbers. You need to do that manually with something like:
import re
x = '970.11 - 1,003.54'
re.findall('[\d\.,]+',x) # match numbers . or ,
result:
['970.11', '1,003.54']
This is a pretty forgiving regex: it will match a lot of things that probably aren't numbers (like ..,,4). Numbers can be tricky to match with a regex if you want something that works in the general case (such as .45, 11,000.2, or 22.). The more consistent your input, the easier it will be. Sometimes it's easier to match the non-numbers instead (like your - separator).
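Matching the separator instead of the numbers could look like the sketch below. It assumes the two values are always separated by a dash surrounded by whitespace, which holds for the example string but may not hold for other input; the `float` conversion is an optional extra step.

```python
import re

x = '970.11 - 1,003.54'

# Split on the separator (a dash with whitespace on both sides)
# rather than trying to describe what a number looks like.
parts = re.split(r'\s+-\s+', x)
print(parts)

# Optional: strip thousands separators and convert to floats.
values = [float(p.replace(',', '')) for p in parts]
print(values)
```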
Try this one, it also works:
import re
re.findall('\d+\,?\d+\.*\d*',x)
Output:
['970.11', '1,003.54']
Here the , is optional: it is matched when it sits between digits and skipped otherwise.
If you want the . to be optional as well (at most one), you can write it like this:
In [48]: x
Out[48]: '970.11 - 1,003.54 2345'
In [49]: re.findall('\d+\,?\d+\.?\d+',x)
Out[49]: ['970.11', '1,003.54', '2345']
To get this you can use regular expression groups. Using your example:
x = '970.11 - 1,003.54'
y = re.findall('([0-9.,]+)([ -]+)([0-9.,]+)',x)
print(y[0]) #prints ('970.11', ' - ', '1,003.54')
z = [y[0][0],y[0][2]]
print(z) #prints ['970.11', '1,003.54']
The regular expression in this case consists of three groups: the first and last each match one or more characters from the set [0-9.,] (digits, dots and commas), and the middle one matches one or more spaces or dashes.
I have a string format, let's say, where A = alphanumeric and N = integer, so the template is "AAAAAA-NNNN". The user will sometimes omit the dash, and sometimes the "NNNN" part is only three digits, in which case I need to pad it with a 0. The first digit of "NNNN" has to be 0, so any other digit in that position is the last character of "AAAAAA" rather than the first digit of "NNNN". So in essence, given the following inputs I want the following results:
Sample Inputs:
"SAMPLE0001"
"SAMPL1-0002"
"SAMPL3003"
"SAMPLE-004"
Desired Outputs:
"SAMPLE-0001"
"SAMPL1-0002"
"SAMPL3-0003"
"SAMPLE-0004"
I know how to check for this using regular expressions, but essentially I want to do the opposite. I was wondering if there is an easy way to do this other than a nested conditional that checks for all these variations. I am using Python and pandas, but either will suffice.
The regex pattern would be:
"[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]-\d\d\d\d"
or in abbreviated form:
"[a-zA-Z0-9]{6}-[\d]{4}"
It would be possible through two re.sub functions.
>>> import re
>>> s = '''SAMPLE0001
SAMPL1-0002
SAMPL3003
SAMPLE-004'''
>>> print(re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)))
SAMPLE-0001
SAMPL1-0002
SAMPL3-0003
SAMPLE-0004
Explanation:
re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s) is evaluated first. It inserts a hyphen after the sixth character of each line, but only if the next character is not already a hyphen.
re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)) then takes the output of the first call as its input and inserts a 0 right after the hyphen, but only when the hyphen is followed by exactly three digits and the end of the line.
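The nested call can be easier to follow when it is split into two named steps. This is the same pair of substitutions, just unrolled; `step1` and `step2` are names introduced here for illustration.

```python
import re

s = 'SAMPLE0001\nSAMPL1-0002\nSAMPL3003\nSAMPLE-004'

# Step 1: insert a hyphen after the sixth character of each line,
# unless a hyphen is already there.
step1 = re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)
print(step1)

# Step 2: insert a 0 after the hyphen when exactly three digits
# remain before the end of the line.
step2 = re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', step1)
print(step2)
```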
An alternative solution that uses str.join:
import re
inputs = ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003','SAMPLE-004']
outputs = []
for input_ in inputs:
    m = re.match(r'(\w{6})-?\d?(\d{3})', input_)
    outputs.append('-0'.join(m.groups()))
print(outputs)
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-0003', 'SAMPLE-0004']
We are matching the regex (\w{6})-?\d?(\d{3}) against the input strings and joining the captured groups with the string '-0'. This is very simple and fast.
Let me know if you need a more in-depth explanation of the regex itself.
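For input that may not always be well-formed, a small helper that validates before rewriting might be safer. This is a sketch under a few assumptions: `normalize` and `CODE` are hypothetical names, `\w` is used for the alphanumeric part (so underscores would also be accepted), and anything that does not fit the template raises a `ValueError`.

```python
import re

# Six word characters, an optional hyphen, then three or four digits.
CODE = re.compile(r'(\w{6})-?(\d{3,4})')

def normalize(code):
    """Return the code in AAAAAA-NNNN form, zero-padding the number."""
    m = CODE.fullmatch(code)
    if m is None:
        raise ValueError('unexpected format: %r' % code)
    prefix, digits = m.groups()
    # zfill pads a three-digit number with a leading zero.
    return '%s-%s' % (prefix, digits.zfill(4))

inputs = ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003', 'SAMPLE-004']
print([normalize(code) for code in inputs])
```

With pandas, the same function could presumably be applied to a column with `Series.map(normalize)`.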
To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
By wrapping everything between 2, and ,X in an outer (), you capture the whole run of numbers rather than just the last repetition. The inner group is changed to a non-capturing (?: ) so that findall returns only the outer capture.
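Since the question also asks for the shortest match, here is a hedged sketch of how that could be picked out. It uses a slightly tighter pattern than the one above (an assumption on my part): `[0-9]+` so multi-digit numbers are matched as a unit, at least one number required, and `re.MULTILINE` anchors so each line is checked on its own. The line `O,2,5,1,X` is a hypothetical addition to the sample data so that there is more than one match.

```python
import re

# Part of the sample data from the question, plus one hypothetical line.
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
O,2,5,1,X
X,9,2,6,8,5,3,1,X'''

# Outer group captures the whole comma-separated run of numbers.
pattern = re.compile(r'^O,2,((?:[0-9]+,)*[0-9]+),X$', re.MULTILINE)
matches = pattern.findall(s)
print(matches)

# Shortest by number of moves, not by string length; None if nothing matched.
shortest = min(matches, key=lambda m: m.count(',') + 1, default=None)
print(shortest)
```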
You don't have to use a regex:
split the string on commas into a list
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each item in items[2:-1] is a number (str.isdigit)
That's all.
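The steps above can be sketched as a small function. `moves_after` and its parameters are hypothetical names chosen for this example; it returns the middle numbers as strings, or `None` when the line does not fit.

```python
def moves_after(line, first='O', second='2', last='X'):
    """Return the numbers between the given prefix and suffix, or None."""
    items = line.split(',')
    if len(items) < 3:
        return None
    # Check the fixed parts: first two items and the last one.
    if items[0] != first or items[1] != second or items[-1] != last:
        return None
    # Everything between the prefix and the final item must be numeric.
    middle = items[2:-1]
    if not all(item.isdigit() for item in middle):
        return None
    return middle

print(moves_after('O,2,9,6,7,11,8,X'))
print(moves_after('O,4,1,8,6,7,9,5,3,X'))
```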
In python I'm trying to grab multiple inputs from string using regular expression; however, I'm having trouble. For the string:
inputs = 12 1 345 543 2
I tried using:
match = re.match(r'\s*inputs\s*=(\s*\d+)+',string)
However, this only returns the value '2'. I'm trying to capture all the values '12','1','345','543','2' but not sure how to do this.
Any help is greatly appreciated!
EDIT: Thank you all for explaining why this does not work and providing alternative suggestions. Sorry if this is a repeat question.
You could try something like:
re.findall("\d+", your_string).
You cannot do this with a single regex (unless you were using .NET), because each capturing group will only ever return one result even if it is repeated (the last one in the case of Python).
Since variable length lookbehinds are also not possible (in which case you could do (?<=inputs.*=.*)\d+), you will have to separate this into two steps:
match = re.match(r'\s*inputs\s*=\s*(\d+(?:\s*\d+)+)', string)
integers = re.split(r'\s+',match.group(1))
So now you capture the entire list of integers (and the spaces between them), and then you split that capture at the spaces.
The second step could also be done using findall:
integers = re.findall(r'\d+',match.group(1))
The results are identical.
You can embed your regular expression:
import re
s = 'inputs = 12 1 345 543 2'
print(re.findall(r'(\d+)', re.match(r'inputs\s*=\s*([\s\d]+)', s).group(1)))
>>>
['12', '1', '345', '543', '2']
Or do it in layers:
import re
def get_inputs(s, regex=r'inputs\s*=\s*([\s\d]+)'):
    match = re.match(regex, s)
    if not match:
        return False # or raise an exception - whatever you want
    else:
        return re.findall(r'(\d+)', match.group(1))
s = 'inputs = 12 1 345 543 2'
print(get_inputs(s))
>>>
['12', '1', '345', '543', '2']
You should look at this answer: https://stackoverflow.com/a/4651893/1129561
In short:
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
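The overriding behaviour is easy to observe directly. The sketch below uses a slightly adjusted version of the question's pattern (the capture is moved inside a non-capturing repeat so the captured text has no leading space), then recovers all the values from the full match with `findall`.

```python
import re

s = 'inputs = 12 1 345 543 2'

# The group (\d+) is repeated, but only its last capture is kept.
m = re.match(r'inputs\s*=(?:\s*(\d+))+', s)
last_only = m.group(1)
print(last_only)

# The full match still covers every number, so findall can extract them.
all_values = re.findall(r'\d+', m.group(0))
print(all_values)
```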
I want to do a regex match (in Python) on the output log of a program. The log contains some lines that look like this:
...
VALUE 100 234 568 9233 119
...
VALUE 101 124 9223 4329 1559
...
I would like to capture the list of numbers that occurs after the first incidence of the line that starts with VALUE. i.e., I want it to return ('100','234','568','9233','119'). The problem is that I do not know in advance how many numbers there will be.
I tried to use this as a regex:
VALUE (?:(\d+)\s)+
This matches the line, but it only captures the last value, so I just get ('119',).
What you're looking for is a parser, instead of a regular expression match. In your case, I would consider using a very simple parser, split():
s = "VALUE 100 234 568 9233 119"
a = s.split()
if a[0] == "VALUE":
    print([int(x) for x in a[1:]])
You can use a regular expression to check whether the input line matches the expected format (using the regex in your question). Then you can run the above code without checking for "VALUE", knowing that the int(x) conversion will always succeed, since you've already confirmed that the following character groups are all digits.
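Combining the two ideas (validate with a regex, then split) for a whole log might look like this. `first_values` and `LINE` are hypothetical names, and the validating pattern is a simplified stand-in for the one in the question: it assumes a matching line consists only of VALUE followed by whitespace-separated integers.

```python
import re

# A line counts as valid if it is "VALUE" followed by one or more integers.
LINE = re.compile(r'VALUE(?:\s+\d+)+')

def first_values(log):
    """Return the integers from the first valid VALUE line, or None."""
    for line in log.splitlines():
        line = line.strip()
        if LINE.fullmatch(line):
            # Safe to convert: the regex guarantees digits only.
            return [int(x) for x in line.split()[1:]]
    return None

log = '...\nVALUE 100 234 568 9233 119\n...\nVALUE 101 124 9223 4329 1559\n...'
print(first_values(log))
```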
>>> import re
>>> reg = re.compile('\d+')
>>> reg.findall('VALUE 100 234 568 9233 119')
['100', '234', '568', '9233', '119']
That doesn't validate that the keyword 'VALUE' appears at the beginning of the string, and it doesn't validate that there is exactly one space between items, but if you can do that as a separate step (or if you don't need to do that at all), then it will find all digit sequences in any string.
Another option not described here is to have a bunch of optional capturing groups.
VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$
This regex captures up to 5 digit groups separated by spaces. If you need more potential groups, just copy and paste more *(\d+)? blocks.
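In Python, groups that did not take part in the match come back as `None`, so they need to be filtered out. A minimal sketch with the five-group pattern above, run on a made-up line that has only three numbers:

```python
import re

pattern = re.compile(r'VALUE *(\d+)? *(\d+)? *(\d+)? *(\d+)? *(\d+)? *$')

# Illustrative input with fewer numbers than there are groups.
m = pattern.match('VALUE 100 234 568')
print(m.groups())

# Drop the groups that did not participate in the match.
values = [g for g in m.groups() if g is not None]
print(values)
```

The obvious limitation is the hard-coded upper bound: a line with more numbers than groups fails to match at all because of the trailing `$`.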
You could just run your main match regex, then run a secondary regex on those matches to get the numbers:
matches = Regex.Match(log)
foreach (Match match in matches)
{
submatches = Regex2.Match(match)
}
This is of course also if you don't want to write a full parser.
I had this same problem and my solution was to use two regular expressions: the first one to match the whole group I'm interested in and the second one to parse the sub groups. For example in this case, I'd start with this:
VALUE((\s\d+)+)
This should result in three groups: [0] the whole match, [1] everything after VALUE, and [2] the last space+value.
[0] and [2] can be ignored and then [1] can be used with the following:
\s(\d+)
Note: these regexps were not tested, I hope you get the idea though.
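For what it's worth, the two patterns can be tried out in Python as follows; this is only a quick check on the sample line from the question, and `outer` is a name introduced here.

```python
import re

line = 'VALUE 100 234 568 9233 119'

# First pass: group 1 is everything after VALUE, group 2 only the last repeat.
outer = re.match(r'VALUE((\s\d+)+)', line)
print(repr(outer.group(1)))
print(repr(outer.group(2)))

# Second pass: pull the individual numbers out of group 1.
numbers = re.findall(r'\s(\d+)', outer.group(1))
print(numbers)
```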
The reason why Greg's answer doesn't work for me is because the 2nd part of the parsing is more complicated and not simply some numbers separated by a space.
However, I would honestly go with Greg's solution for this question (it's probably way more efficient).
I'm just writing this answer in case someone is looking for a more sophisticated solution like I needed.
You can use re.match to validate the line first, then call re.split with a regex as the separator to split it.
>>> import re
>>> s = "VALUE 100 234 568 9233 119"
>>> sep = r"\s+"
>>> reg = re.compile(r"VALUE(%s\d+)+"%(sep)) # OR r"VALUE(\s+\d+)+"
>>> reg_sep = re.compile(sep)
>>> if reg.match(s): # OR re.match(r"VALUE(\s+\d+)+", s)
...     result = reg_sep.split(s)[1:] # OR re.split(r"\s+", s)[1:]
>>> result
['100', '234', '568', '9233', '119']
The separator "\s+" can be more complicated.