re.findall Find numbers with dashes(-) and commas(,)

I have the value
x = '970.11 - 1,003.54'
I've tried many variations of re.findall, for example
re.findall(r'\d+', x)
['970', '11', '1', '003', '54']
but I would like it to return
['970.11', '1,003.54']

\d matches only digits. It won't match other characters, even ones we think of as part of a number. You need to include those explicitly, with something like:
import re
x = '970.11 - 1,003.54'
re.findall(r'[\d.,]+', x)  # match runs of digits, . or ,
result:
['970.11', '1,003.54']
This is a pretty forgiving regex: it will match a lot of things that probably aren't numbers (like ..,,4). Numbers can be tricky to match with a regex if you want something that works in the general case (like .45, 11,000.2, 22., etc.). The more consistent your input, the easier it will be. And sometimes it's easier to match the non-numbers (like your -).
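As a rough sketch of the "match the non-numbers" idea: assuming the two values are always separated by a dash with whitespace on both sides (an assumption about the input, not something the question guarantees), splitting on the separator avoids having to describe the numbers at all.

```python
import re

x = '970.11 - 1,003.54'

# Split on a dash surrounded by whitespace; everything else is kept as-is.
parts = re.split(r'\s+-\s+', x)
print(parts)  # ['970.11', '1,003.54']
```

Requiring whitespace around the dash is a deliberate choice here, so that a leading minus sign on a number would not be treated as a separator.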

Try this one, it also works:
import re
x = '970.11 - 1,003.54'
re.findall(r'\d+,?\d+\.*\d*', x)
Output:
['970.11', '1,003.54']
Here the , is optional: it is included when it appears between digits and skipped otherwise.
If you want the . to be optional as well, you can write it like this:
In [48]: x
Out[48]: '970.11 - 1,003.54 2345'
In [49]: re.findall('\d+\,?\d+\.?\d+',x)
Out[49]: ['970.11', '1,003.54', '2345']
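One caveat, shown as a small sketch on a made-up input: the pattern above contains three \d+ parts, so it needs at least three digits and skips shorter numbers. The alternative pattern below reflects an assumption about what counts as a number here (digits, optional comma-separated groups of three, optional fraction); it is not the only reasonable choice.

```python
import re

text = '7 42 970.11 1,003.54'

# Three mandatory \d+ runs: numbers with fewer than three digits are skipped.
loose = re.findall(r'\d+,?\d+\.?\d+', text)
print(loose)    # ['970.11', '1,003.54']

# Optional groups instead: thousands groups and the fraction may each be absent.
grouped = re.findall(r'\d+(?:,\d{3})*(?:\.\d+)?', text)
print(grouped)  # ['7', '42', '970.11', '1,003.54']
```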

To get this you can use regular expression groups. Using your example:
import re
x = '970.11 - 1,003.54'
y = re.findall('([0-9.,]+)([ -]+)([0-9.,]+)',x)
print(y[0]) #prints ('970.11', ' - ', '1,003.54')
z = [y[0][0],y[0][2]]
print(z) #prints ['970.11', '1,003.54']
The regular expression here consists of three groups: the first and last each match one or more of the characters 0123456789., and the middle one matches one or more spaces or dashes.
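If the separator itself is not needed, one option (a sketch, not the answer's own code) is to make the middle group non-capturing with (?:...), so that findall returns only the two numbers:

```python
import re

x = '970.11 - 1,003.54'

# (?:...) groups without capturing, so each result tuple holds only the two numbers.
pairs = re.findall(r'([0-9.,]+)(?:[ -]+)([0-9.,]+)', x)
print(pairs)           # [('970.11', '1,003.54')]
print(list(pairs[0]))  # ['970.11', '1,003.54']
```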

Related

Python regular expression match number in string

I used a regular expression in Python 2.7 to match the numbers in a string, but my expression fails to match a single-digit number. Here is my code:
import re
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
data = re.findall(re_matchData, s)
print(data)
which prints:
['858', '1790', '-156.25']
but when I change the expression from
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{1,10}')
to
re_matchData = re.compile(r'\-?\d{0,10}\.?\d{1,10}')
it prints:
['858', '1790', '-156.25', '2']
Am I confusing something about the difference between \d{1,10} and \d{0,10}?
If I did something wrong, how do I correct it?
Thanks for checking my question!

Try this:
r'\-?\d{1,10}(?:\.\d{1,10})?'
Use (?:...)? to make the fractional part optional.
In r'\-?\d{0,10}\.?\d{1,10}', it is the \.?\d{1,10} part that matched the 2, while \d{0,10} matched nothing.

The first \d{1,10} matches from 1 to 10 digits, and the second \d{1,10} also matches from 1 to 10 digits. In order for them both to match, you need at least 2 digits in your number, with an optional . between them.
You should make the entire fraction optional, not just the dot:
r'\-?\d{1,10}(?:\.\d{1,10})?'
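A quick side-by-side check of the two patterns on the sample string from the question (a sketch; the variable names are made up for illustration):

```python
import re

s = '858 1790 -156.25 2'

# Original: two mandatory digit runs, so a lone '2' cannot match.
two_runs = re.findall(r'-?\d{1,10}\.?\d{1,10}', s)
# Fixed: the whole fractional part is one optional group.
optional_fraction = re.findall(r'-?\d{1,10}(?:\.\d{1,10})?', s)

print(two_runs)           # ['858', '1790', '-156.25']
print(optional_fraction)  # ['858', '1790', '-156.25', '2']
```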

I would rather do as follows:
import re
s = '858 1790 -156.25 2'
re_matchData = re.compile(r'\-?\d{1,10}\.?\d{0,10}')
data = re_matchData.findall(s)
print(data)
Output:
['858', '1790', '-156.25', '2']
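One difference worth noting, as a sketch on a made-up input: because the dot and the fraction digits are optional independently in this pattern, it also accepts a trailing dot, whereas the version with an optional (?:\.\d{1,10})? group does not.

```python
import re

text = 'Total: 2. Next: 3.5'

# Dot and fraction digits are optional separately: the sentence-ending dot is swallowed.
independent = re.findall(r'-?\d{1,10}\.?\d{0,10}', text)
print(independent)  # ['2.', '3.5']

# Dot and fraction digits are optional together: a dot needs digits after it.
grouped = re.findall(r'-?\d{1,10}(?:\.\d{1,10})?', text)
print(grouped)      # ['2', '3.5']
```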

Removing redundant regex?

Suppose I have a list of very simple regex represented as strings (by "very simple", I mean only containing .*). Every string in the list starts and ends with .*. For example, I could have
rs = ['.*a.*', '.*ab.*', '.*ba.*cd.*', ...]
What I would like to do is keep track of those patterns that are a subset of another. In this example, .*a.* matches everything .*ab.* does, and more. Hence, I consider the latter pattern to be redundant.
What I thought to do was to split the strings on .*, match up corresponding elements, and test if one startswith the other. More specifically, consider .*a.* and .*ab.*. Splitting these on .*
a = ['', 'a', '']
b = ['', 'ab', '']
and zipping them together gives
c = [('', ''), ('a', 'ab'), ('', '')]
And then,
all(elt[1].startswith(elt[0]) for elt in c)
returns True and so I conclude that .*ab.* is indeed redundant if .*a.* is included in the list.
Does this make sense and does it do what I am trying to do? Of course, this approach gets complicated for a number of reasons, and so my next question is, is there a better way to do this that anyone has encountered previously?

In the general case, you need to build the minimal DFA for each regex and compare them.
Here is a link to a discussion of the same problem:
How to tell if one regular expression matches a subset of another regular expression?

Assuming every letter combination is surrounded by .* and does not have it in the middle, the approach would almost work. Instead of startswith you need to check for contains, though.
reglist = ['.*a.*', '.*ab.*', '.*ba.*', '.*cd.*']
patterns = set(x.split('.*')[1] for x in reglist)
remove = []
for x in patterns:
    for y in patterns:
        if x in y and x != y:
            remove.append(y)
print(['.*{}.*'.format(x) for x in sorted(patterns - set(remove))])
gives you
['.*a.*', '.*cd.*']
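For patterns with several literal parts, such as .*ba.*cd.*, one possible sketch (not taken from the answers above) is to build a minimal "witness" string for one pattern and test the other pattern against it. It assumes the literals are plain text with no regex metacharacters and that the separator character \x00 never occurs in them; subsumes and drop_redundant are hypothetical helper names.

```python
import re

SEP = '\x00'  # assumed never to appear inside any pattern literal


def subsumes(general, specific):
    """Return True if `general` matches the witness string built from `specific`."""
    # Literal parts of the specific pattern, e.g. '.*ba.*cd.*' -> ['ba', 'cd'].
    parts = [p for p in specific.split('.*') if p]
    # Joining with SEP keeps a literal of `general` from matching across two parts.
    witness = SEP.join(parts)
    return re.fullmatch(general, witness) is not None


def drop_redundant(patterns):
    """Keep only patterns that no other pattern in the list subsumes."""
    unique = list(dict.fromkeys(patterns))  # remove exact duplicates, keep order
    return [p for p in unique
            if not any(q != p and subsumes(q, p) for q in unique)]


rs = ['.*a.*', '.*ab.*', '.*ba.*cd.*', '.*cd.*']
print(drop_redundant(rs))  # ['.*a.*', '.*cd.*']
```

The idea is that any string matching the specific pattern must contain its literal parts in order, so if the general pattern already matches those parts with nothing useful in between, it should match every such string as well.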

String Formatting/Template/Regular Expressions

I have a string format, say "AAAAAA-NNNN", where A is alphanumeric and N is an integer. Users sometimes omit the dash, and sometimes the "NNNN" part is only three digits, in which case I need to pad it with a 0. The first digit of "NNNN" has to be 0, so a nonzero digit in that position is the last character of "AAAAAA" rather than the first digit of "NNNN". So in essence, given the following inputs I want the following results:
Sample Inputs:
"SAMPLE0001"
"SAMPL1-0002"
"SAMPL3003"
"SAMPLE-004"
Desired Outputs:
"SAMPLE-0001"
"SAMPL1-0002"
"SAMPL3-0003"
"SAMPLE-0004"
I know how to check for this using regular expressions, but essentially I want to do the opposite. I was wondering if there is an easy way to do this other than a nested conditional that checks for all these variations. I am using Python and pandas, and a solution in either will suffice.
The regex pattern would be:
"[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]-\d\d\d\d"
or in abbreviated form:
"[a-zA-Z0-9]{6}-[\d]{4}"

This is possible with two re.sub calls.
>>> import re
>>> s = '''SAMPLE0001
SAMPL1-0002
SAMPL3003
SAMPLE-004'''
>>> print(re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)))
SAMPLE-0001
SAMPL1-0002
SAMPL3-0003
SAMPLE-0004
Explanation:
The inner call, re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s), runs first. It inserts a hyphen after the sixth character of each line, but only if the next character is not already a hyphen.
The outer call, re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', ...), takes that output as its input and inserts a 0 right after the hyphen when exactly three digits follow it before the end of the line.
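The two steps can also be written out separately, which makes the intermediate result visible. This is only a restatement of the nested call above with made-up variable names:

```python
import re

s = 'SAMPLE0001\nSAMPL1-0002\nSAMPL3003\nSAMPLE-004'

# Step 1: insert a hyphen after the sixth character of each line unless one is already there.
with_hyphen = re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)
print(with_hyphen.split('\n'))
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-003', 'SAMPLE-004']

# Step 2: insert a 0 after the hyphen when exactly three digits remain on the line.
padded = re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', with_hyphen)
print(padded.split('\n'))
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-0003', 'SAMPLE-0004']
```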

An alternative solution that uses str.join:
import re
inputs = ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003','SAMPLE-004']
outputs = []
for input_ in inputs:
    m = re.match(r'(\w{6})-?\d?(\d{3})', input_)
    outputs.append('-0'.join(m.groups()))
print(outputs)
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-0003', 'SAMPLE-0004']
We are matching the regex (\w{6})-?\d?(\d{3}) against the input strings and joining the captured groups with the string '-0'. This is very simple and fast.
Let me know if you need a more in-depth explanation of the regex itself.
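Note that re.match returns None for input that does not fit the pattern, which would make m.groups() raise an AttributeError. A more defensive variant might look like the sketch below. Here normalize is a hypothetical helper, and the use of fullmatch, [A-Za-z0-9] instead of \w, and 0? instead of \d? are assumptions based on the stated format (six alphanumerics, and a numeric part that must start with 0).

```python
import re

# Six alphanumerics, optional hyphen, optional leading 0, then the last three digits.
CODE = re.compile(r'([A-Za-z0-9]{6})-?0?(\d{3})')


def normalize(code):
    """Return the code as 'AAAAAA-0NNN', or None if it does not fit the format."""
    m = CODE.fullmatch(code)
    if m is None:
        return None
    return '-0'.join(m.groups())


for raw in ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003', 'SAMPLE-004', 'SAMPLE1234']:
    print(raw, '->', normalize(raw))
```

With fullmatch, an input such as 'SAMPLE1234' is rejected instead of silently losing a digit.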

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because there will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!

Use this:
O,2,((?:[0-9],?){0,}),X
See it in action: http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print(matches)
Outputs:
['9,6,7,11,8']
Explained:
By wrapping everything between 2, and ,X in (), you capture the whole run of numbers as one group. I then used (?: ) to make the inner group non-capturing, so findall returns only the outer one.
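Since the question also asks for the shortest sequence, a possible follow-up step (a sketch on a made-up, shortened input) is to pick the match with the fewest comma-separated numbers:

```python
import re

# Shortened, made-up sample in the same format as the question's data.
s = '''O,2,9,6,7,11,8,X
O,2,5,X
X,3,2,7,1,9,4,6,X'''

pattern = re.compile(r'O,2,((?:[0-9],?){0,}),X')
matches = pattern.findall(s)
print(matches)  # ['9,6,7,11,8', '5']

# Compare by number of moves rather than by string length.
shortest = min(matches, key=lambda m: len(m.split(',')))
print(shortest)  # 5
```

Counting the comma-separated items instead of characters keeps a two-digit number such as 11 from counting as longer than a single move.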

You don't have to use a regex:
split the string into a list on the commas
check that item 0 == 'O' and item 1 == '2'
check that the last item == 'X'
check that each of the items in between (items[2:-1]) is a number (str.isdigit)
that's all
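The steps above could be sketched as follows. moves_after is a hypothetical helper name, and its default arguments mirror the values used in the question:

```python
def moves_after(line, first='O', second='2', last='X'):
    """Return the numbers between the fixed prefix and suffix, or None if the line does not fit."""
    items = line.split(',')
    if len(items) < 3:
        return None
    # Check the fixed first, second and last items.
    if items[0] != first or items[1] != second or items[-1] != last:
        return None
    # Everything in between must be made of digits only.
    middle = items[2:-1]
    if not all(item.isdigit() for item in middle):
        return None
    return middle


print(moves_after('O,2,9,6,7,11,8,X'))     # ['9', '6', '7', '11', '8']
print(moves_after('O,4,1,8,6,7,9,5,3,X'))  # None
```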

Regex to pick up comma and period Python

I wanted to create a regex which could match the following pattern:
5,000
2.5
25
This is the regex I have thus far:
re.compile('([\d,]+)')
How can I adjust for the .?

Easiest method would just be this:
re.compile('([\d,.]+)')
But this will allow inputs like .... This might be acceptable, since your original pattern allows ,,,. However, if you want to allow only a single decimal point, you can do this:
re.compile(r'([\d,]+\.?\d*)')
Note that this won't allow inputs like .5; you'd need to use 0.5 instead.
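Inside a pattern, an unescaped . matches any character, so the decimal point has to be written as \. to be taken literally. A small sketch of the difference on the sample values (the variable names are made up):

```python
import re

text = '5,000 2.5 25'

# Unescaped dot: it also matches the space, gluing neighbouring values together.
unescaped = re.findall(r'[\d,]+.?\d*', text)
print(unescaped)  # ['5,000 2', '5 25']

# Escaped dot: only a literal decimal point is accepted.
escaped = re.findall(r'[\d,]+\.?\d*', text)
print(escaped)    # ['5,000', '2.5', '25']
```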

I think the perfect regex would be
re.compile(r'\d{1,2}[,.]\d{1,3}')
This matches one or two digits, followed by a comma or a full stop, and then one to three digits. Note that the separator is required, so a plain integer such as 25 will not match.
You don't need the parentheses if you are not going to use the contents of the group later. Omitting them can speed up matching slightly.

Here is a very big but powerful regex to capture anything that is a valid number:
import re
string = """
5,000
2.5
25
234,456,678.345
...
,,,
23,332.1
abc
45,2
0.5
"""
print(re.findall(r"(?:\d+(?:,?\d{3})*)+\.?(?:\d+)?", string))
output:
# Note that it will not capture "45,2" because it is invalid
# It instead does "45" and "2", which are each valid
['5,000', '2.5', '25', '234,456,678.345', '23,332.1', '45', '2', '0.5']
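To make the pattern easier to read, it can be written with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The sketch below is the same expression, only laid out differently; the name NUMBER and the shortened sample are made up for illustration.

```python
import re

NUMBER = re.compile(r"""
    (?: \d+ (?: ,? \d{3} )* )+   # digits, then any number of 3-digit groups, each with an optional comma
    \.?                          # optional decimal point
    (?: \d+ )?                   # optional digits after the decimal point
""", re.VERBOSE)

sample = "5,000 2.5 25 45,2 0.5"
print(NUMBER.findall(sample))  # ['5,000', '2.5', '25', '45', '2', '0.5']
```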
