I need a regex to parse a string that contains fractions and an operation [+, -, *, or /] and to return a 5-element tuple containing the numerators, denominators, and operation using the findall function in the re module.
Example: str = "15/9 + -9/5"
The output should be of the form [("15","9","+","-9","5")]
I was able to come up with this:
pattern = r'-?\d+|\s+\W\s+'
print(re.findall(pattern,str))
Which produces an output of ["15","9"," + ","-9","5"]. But after fiddling with this for some time, I cannot get this into a 5-element tuple, and I cannot match the operation without also matching the whitespace around it.
This pattern will work:
(-?\d+)\/(\d+)\s+([+\-*/])\s+(-?\d+)\/(\d+)
# let's walk through it
(-?\d+) #matches any group of digits that may or may not have a `-` sign to group 1
\/ #escape character to match `/`
(\d+) #matches any group of digits to group 2
\s+([+\-*/])\s+ #matches any '+,-,*,/' character and puts only that into group 3 (whitespace is not captured in group)
(-?\d+)\/(\d+) #exactly the same as group 1/2 for groups 4/5
demo for this:
>>> s = "15/9 + -9/5 6/12 * 2/3"
>>> re.findall(r'(-?\d+)\/(\d+)\s+([+\-*/])\s+(-?\d+)\/(\d+)',s)
[('15', '9', '+', '-9', '5'), ('6', '12', '*', '2', '3')]
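If the spacing around the operator can vary, a verbose form of the same pattern may be easier to read and maintain. This is only a sketch, not part of the answer above: the names FRACTION_EXPR and parse_fractions are made up for illustration, and it assumes that zero or more spaces around the operator (\s* instead of \s+) is acceptable.

```python
import re

# Verbose form of the pattern above. Whitespace inside the pattern is ignored
# because of re.VERBOSE; \s* (an assumption) allows zero or more spaces
# around the operator.
FRACTION_EXPR = re.compile(r"""
    (-?\d+) / (\d+)      # groups 1 and 2: first numerator and denominator
    \s* ([+\-*/]) \s*    # group 3: the operator only, whitespace not captured
    (-?\d+) / (\d+)      # groups 4 and 5: second numerator and denominator
""", re.VERBOSE)


def parse_fractions(text):
    """Return one 5-tuple of strings per 'a/b op c/d' expression found."""
    return FRACTION_EXPR.findall(text)


print(parse_fractions("15/9 + -9/5"))  # [('15', '9', '+', '-9', '5')]
```

Because the pattern has five capturing groups, findall returns a list of 5-tuples directly.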
A general way to tokenize a string based on a regexp is this:
import re
pattern = r"\s*(\d+|[/+*-])"
def tokens(x):
    return [m.group(1) for m in re.finditer(pattern, x)]
print(tokens("9 / 4 + 6 "))
Notes:
The regex begins with \s* to pass over any initial whitespace.
The part of the regex which matches the token is enclosed in parens to form a capture.
The different token patterns are in the capture separated by the alternation operation |.
Be careful about using \W since that will also match whitespace.
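To see the \W caveat in practice, here is a small sketch that compares \W with an explicit operator class on the same input. The variable names are only for illustration.

```python
import re

text = "9 / 4 + 6 "

# \W means "not a word character", so spaces match along with the operators.
non_word = re.findall(r"\W", text)
print(non_word)  # [' ', '/', ' ', ' ', '+', ' ', ' ']

# Listing the operators explicitly keeps whitespace out of the tokens.
TOKEN = re.compile(r"\s*(\d+|[/+*-])")
tokens = [m.group(1) for m in TOKEN.finditer(text)]
print(tokens)  # ['9', '/', '4', '+', '6']
```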
s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
I'm trying to reset the start scores to 0.0.0.0.0.0 so that it reads
scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
This works:
re.sub('start\:(\d+\.\d+\.\d+\.\d+\.\d+\.\d+)','0.0.0.0.0.0',s)
But I was looking for something more flexible to use in case the amount of scores change.
EDIT:
Actually my example does not work, because start is removed from the string
scores:0.0.0.0.0.0:final:55.34.13.63.44.34
but I would like to keep it:
scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
You may use this re.sub with a lambda if you want to replace dot-separated numbers with the same number of zeroes:
>>> import re
>>> s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
>>> re.sub(r'start:\d+(?:\.\d+)+', lambda m: re.sub(r'\d+', '0', m.group()), s)
'scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34'
The regex start:\d+(?:\.\d+)+ matches text that begins with start: followed by two or more numbers separated by dots.
In the lambda, each run of one or more digits is replaced with 0, so the output has the same number of zeroes as the input has numbers.
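The same idea can be wrapped in a small helper so it works for any label and any number of scores. This is a sketch under my own assumptions: zero_scores and its label parameter are hypothetical names, and the label is captured separately so that digits inside the label itself are left alone.

```python
import re


def zero_scores(s, label="start"):
    """Replace every number in the dotted run after `label:` with 0."""
    # Group 1 is the label with its colon, group 2 the dotted numbers.
    pattern = "(" + re.escape(label) + r":)(\d+(?:\.\d+)+)"
    return re.sub(
        pattern,
        lambda m: m.group(1) + re.sub(r"\d+", "0", m.group(2)),
        s,
    )


s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
print(zero_scores(s))
# scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
```

Like the pattern above, this needs at least two numbers after the label; a single score with no dot is not matched.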
You could replace all numbers before final with a zero:
re.sub(r'\d+(?=.*final)', '0', s)
Or perhaps more efficient if there were many more scores:
re.sub(r'\d+|(final.*)', lambda m: m[1] or '0', s)
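One thing to watch with the "everything before final" approach: it zeroes every number that has final somewhere after it, not just the start scores. The sketch below uses a made-up input with an extra number in front (the v2 prefix is my assumption, not from the question) to show the difference from a pattern anchored on start:.

```python
import re

s = "v2:scores:start:4.2.1:final:55.34.13"

# The lookahead version also zeroes the 2 in "v2".
lookahead_result = re.sub(r"\d+(?=.*final)", "0", s)
print(lookahead_result)  # v0:scores:start:0.0.0:final:55.34.13

# Anchoring on "start:" only touches the run of scores that follows it.
anchored_result = re.sub(
    r"(?<=start:)[\d.]+",
    lambda m: re.sub(r"\d+", "0", m.group()),
    s,
)
print(anchored_result)  # v2:scores:start:0.0.0:final:55.34.13
```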
Another lambda in re.sub:
import re
s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
pat = r'(?<=:start:)([\d.]+)(?=:final:)'
print(re.sub(pat, lambda m: '.'.join(['0'] * len(m.group(1).split('.'))), s))
# scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
A variant using the regex module from PyPI:
(?:\bstart:|\G(?!^))\.?\K\d+
The pattern matches
(?: Non capture group
\bstart: Match start:
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start of the string
) Close the non capture group
\.?\K Match an optional dot, clear the current match buffer (forget what is matched so far)
\d+ Match 1 or more digits (to be replaced with a 0)
import regex
s = 'scores:start:4.2.1.3.4.3:final:55.34.13.63.44.34'
pattern = r"(?:\bstart:|\G(?!^))\.?\K\d+"
print(regex.sub(pattern, '0', s))
Output
scores:start:0.0.0.0.0.0:final:55.34.13.63.44.34
import re
pattern =r"[1-9][0-9]{0,2}(?:,\d{3})?(?:,\d{3})?"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a)
Dears, what should I do to get the expected result?
Expected output:
['42', '1,234', '6,368,745']
Actual output:
['42', '1,234', '6,368,745', '12', '34,567', '123', '4']
I was trying to solve this quiz in a book.
How would you write a regex that matches a number with commas for every three digits? It must match the following:
• '42'
• '1,234'
• '6,368,745'
but not like the following:
• '12,34,567' (which has only two digits between the commas)
• '1234' (which lacks commas)
Your help would be much appreciated!
You may use
import re
pattern =r"(?<!\d,)(?<!\d)[1-9][0-9]{0,2}(?:,\d{3})*(?!,?\d)"
string = '42 1,234 6,368,745 12,34,567 1234'
a = re.findall(pattern,string)
print(a) # => ['42', '1,234', '6,368,745']
Regex details
(?<!\d,)(?<!\d) - no digit or digit + , allowed immediately to the left of the current location
[1-9][0-9]{0,2} - a non-zero digit followed with any zero, one or two digits
(?:,\d{3})* - 0 or more occurrences of a comma and then any three digits
(?!,?\d) - no digit or , + digit allowed immediately to the right of the current location.
You could use the following regular expression.
r'(?<![,\d])[1-9]\d{0,2}(?:,\d{3})*(?![,\d])'
with re.findall.
Python's regex engine performs the following operations.
(?<! begin negative lookbehind
[,\d] match ',' or a digit
) end negative lookbehind
[1-9] match a digit other than '0'
\d{0,2} match 0-2 digits
(?: begin non-capture group
,\d{3} match ',' then 3 digits
) end non-capture group
* execute non-capture group 0+ times
(?![,\d]) previous match is not followed by ',' or a digit
(negative lookahead)
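A quick way to check a pattern like this against the quiz requirements is to run it over each sample on its own and over the combined string. This is just a sketch; NUMBER, samples and results are names I picked for the example.

```python
import re

NUMBER = re.compile(r"(?<![,\d])[1-9]\d{0,2}(?:,\d{3})*(?![,\d])")

samples = ["42", "1,234", "6,368,745", "12,34,567", "1234"]

# fullmatch succeeds only when the whole sample is a valid number.
results = {text: NUMBER.fullmatch(text) is not None for text in samples}
for text, ok in results.items():
    print(text, "->", "match" if ok else "no match")

# On the combined string, findall should return only the valid numbers.
found = NUMBER.findall("42 1,234 6,368,745 12,34,567 1234")
print(found)  # ['42', '1,234', '6,368,745']
```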
I am trying to find an expression, but I want the value as a whole, not as separate values:
test = "code.coding = 'DS-2002433E172062D';"
find = re.compile('[A-Z,\d]')
hey= re.findall(find,str(test))
print(hey)
response = ['D', 'S', '2', '0', '0', '2', '6', '6', '8', '0', '3', 'E', '1', '7', '2', '0', '6', '2', 'D']
However I want it as DS-2002433E172062D
Your current regex, [A-Z,\d], matches a single character: an uppercase letter A-Z, a comma, or a digit. You want to capture your entire desired string in one match. Try:
[a-zA-Z0-9]{2}-[a-zA-Z0-9]+
import re
regx = r'[a-zA-Z0-9]{2}-[a-zA-Z0-9]+'
print(re.findall(regx, "code.coding = 'DS-2002433E172062D';"))
Output:
['DS-2002433E172062D']
Explanation:
[ // start of single character match
a-zA-Z0-9 // matches any letter or digit
]{2} // matches 2 times
- // matches the - character
[ // start of single character match
a-zA-Z0-9 // matches any letter or digit
]+ // matches between 1 and unlimited times (greedy)
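If only one code is expected per string, re.search may be a more direct fit than findall, since it gives back the matched text itself rather than a list. A minimal sketch, assuming the first match is the one wanted:

```python
import re

test = "code.coding = 'DS-2002433E172062D';"

# search stops at the first match; group() is the whole matched text.
match = re.search(r"[a-zA-Z0-9]{2}-[a-zA-Z0-9]+", test)
code = match.group() if match else None
print(code)  # DS-2002433E172062D
```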
As an improvement on the previous answer I often use a utility function like this:
Python Function:
def extractRegexPattern(theInputPattern, theInputString):
    """
    Extracts the given regex pattern.
    Param theInputPattern - the pattern to extract.
    Param theInputString - a String to extract from.
    """
    import re
    sourceStr = str(theInputString)
    prog = re.compile(theInputPattern)
    theList = prog.findall(sourceStr)
    return theList
Use like so:
test = "code.coding = 'DS-2002433E172062D';"
groups = extractRegexPattern(r"([A-Z,\d]+)+", test)
print(groups)
Outputs:
['DS', '2002433E172062D']
Regex needs () to make it a group:
You had:
'[A-Z,\d]'
Which means: any letter in the range A-Z or "," (comma) or the pattern \d (digits) as a single match ... and this yielded all of the matches separately (and subtly omits the dash which would have needed [-A-Z,\d] to be included).
For groups of the same type of matches try:
'([A-Z,\d]+)+'
Which means: any letter in the range A-Z or "," (comma) or the pattern \d (digits) one or more times per group match and one or more groups per input ... and this yields all of the matches as groups (and clearly omits the dash, which would have needed ([-A-Z,\d]+)+ to be included).
The point: use () (parentheses) to denote groups in regular expression.
Lastly, while testing your regex it may be helpful to use a debugger like http://pythex.org until you get the hang of the pattern at hand.
Edit:
Sorry, I missed the part about how you wanted it as DS-2002433E172062D
For that result using groups try:
'([-A-Z,\d]+)+'
Which means: a - (dash) or any letter in the range A-Z or , (comma) or the pattern \d (digits) one or more times per group match and one or more groups per input ... and this yields all of the matches as groups. Note that a literal - (dash) has to sit where it cannot be read as a range operator: first or last in the character class, or escaped. In my experience it is clearest to put the dash at the very beginning of the class.
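A small sketch to check how Python's re treats the dash in different positions inside the character class. The variable names are only for illustration, and no capturing group is used here, so findall returns the whole match.

```python
import re

test = "code.coding = 'DS-2002433E172062D';"

first = re.findall(r"[-A-Z,\d]+", test)    # dash first in the class
last = re.findall(r"[A-Z,\d-]+", test)     # dash last in the class
escaped = re.findall(r"[A-Z\-,\d]+", test)  # dash escaped in the middle

print(first)    # ['DS-2002433E172062D']
print(last)     # ['DS-2002433E172062D']
print(escaped)  # ['DS-2002433E172062D']
```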
Hope this helps
You are very close; just change your regex pattern:
pattern=r'[A-Z,\d]'
to
pattern=r'[A-Z,\d-]+'
full code:
import re
test = "code.coding = 'DS-2002433E172062D';"
pattern=r'[A-Z,\d-]+'
print(re.findall(pattern,test))
output:
['DS-2002433E172062D']
I'm using a simple regex (.*?)(\d+[.]\d+)|(.*?)(\d+) to match an int/float/double value in a string. When doing findall, the regex shows empty strings in the output. The empty strings get removed when I remove the | operator and do an individual match. I also tried this on regex101, and it doesn't show any empty strings. How can I remove these empty strings? Here's my code:
>>> import re
>>> match_float = re.compile(r'(.*?)(\d+[.]\d+)|(.*?)(\d+)')
>>> match_float.findall("CA$1.90")
[('CA$', '1.90', '', '')]
>>> match_float.findall("RM1")
[('', '', 'RM', '1')]
Since you defined 4 capturing groups in the pattern, they will always be part of the re.findall output unless you remove them (say, by using filter(None, ...)).
However, in the current situation, you may "shrink" your pattern to
r'(.*?)(\d+(?:\.\d+)?)'
Now, it will only have 2 capturing groups, and thus, findall will only output 2 items per tuple in the resulting list.
Details:
(.*?) - Capturing group 1 matching any zero or more chars other than line break chars, as few as possible up to the first occurrence of ...
(\d+(?:\.\d+)?) - Capturing group 2:
\d+ - one or more digits
(?:\.\d+)? - an optional non-capturing group that matches 1 or 0 occurrences of a . and 1+ digits.
See the Python demo:
import re
rx = r"(.*?)(\d+(?:[.]\d+)?)"
ss = ["CA$1.90", "RM1"]
for s in ss:
    print(re.findall(rx, s))
# => [('CA$', '1.90')] [('RM', '1')]
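For completeness, the filter(None, ...) route mentioned earlier could look roughly like this if the original four-group pattern has to stay. It is a sketch only, and findall_nonempty is a hypothetical helper name.

```python
import re

match_float = re.compile(r"(.*?)(\d+[.]\d+)|(.*?)(\d+)")


def findall_nonempty(pattern, text):
    """Run findall and drop the empty strings left by unused groups."""
    return [tuple(filter(None, groups)) for groups in pattern.findall(text)]


print(findall_nonempty(match_float, "CA$1.90"))  # [('CA$', '1.90')]
print(findall_nonempty(match_float, "RM1"))      # [('RM', '1')]
```

One caveat: this also drops a prefix that is legitimately empty, so "1.90" gives [('1.90',)] with a single element. The shortened two-group pattern avoids that ambiguity.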
I'm using python and regex (new to both) to find sequence of chars in a string as follows:
Grab the first instance of p followed by any number (It'll always be in the form of p_ _ where _ and _ will be integers). Then either find an 's' or a 'go' then all integers till the end of the string. For example:
ascjksdcvyp12nbvnzxcmgonbmbh12hjg23
should yield p12 go 12 23.
ascjksdcvyp12nbvnzxcmsnbmbh12hjg23
should yield p12 s 12 23.
I've only managed to get the p12 part of the string and this is what I've tried so far to extract the 'go' or 's':
decoded = (re.findall(r'([p][0-9]*)',myStr))
print(decoded)  # prints ['p12']
I know by doing something like
re.findall(r'[s]|[go]',myStr)
will give me all occurrences of s and g and o, but something like that is not what I'm looking for. And I'm not sure how I'd combine these regexes to get the desired output.
Use re.findall with pattern grouping:
>>> string = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 'go', '12', '23')]
>>> string = 'ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'
>>> re.findall(r'(p\d{2}).*(s|go)\D*(\d+)(?:\D*(\d+))*', string)
[('p12', 's', '12', '23')]
re.findall returns only the text matched by the capturing groups ()
p\d{2} matches p followed by exactly two digits
After that .* matches anything
Then, s|go matches either s or go
\D* matches any number of non-digits
\d+ indicates one or more digits
(?:) is a non-capturing group i.e. the match inside won't show up in the output, it is only for the sake of grouping tokens
Note:
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+?', string)
[('p12', 's', '12')]
>>> re.findall(r'(p\d{2}).*(s|go)(?:\D*(\d+))+', string)
[('p12', 's', '23')]
I would have liked to use one of the two patterns above, since matching the later digits is a repeated task, but a repeated capturing group keeps only one iteration: the non-greedy version stops after the first number and the greedy version keeps only the last. Hence we need to match the digits after s or go explicitly.
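The behaviour in the note can be seen in isolation with a tiny sketch: a capturing group inside a repetition only remembers its last iteration, while findall with a plain \d+ returns every number. The tail string is taken from the example above.

```python
import re

tail = "nbmbh12hjg23"

# The group is matched twice, but group(1) holds only the last iteration.
m = re.search(r"(?:\D*(\d+))+", tail)
last_only = m.group(1)
print(last_only)  # 23

# Without the repetition, findall returns each number separately.
numbers = re.findall(r"\d+", tail)
print(numbers)  # ['12', '23']
```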
First, try to match your line with a minimal pattern, as a test. Use (grouping) and (?:nongrouping) parens to capture the interesting parts and not capture the uninteresting parts. Store away what you care about, then chop off the remainder of the string and search for numbers as a second step.
import re
line = 'ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'  # sample input from the question
simple_test = r'^.*p(\d{2}).*?(?:s|go).*?(\d+)'
m = re.match(simple_test, line)
if m is not None:
    p_num = m.group(1)
    trailing_numbers = [m.group(2)]
    remainder = line[m.end():]  # everything after the first matched number
    trailing_numbers.extend(  # extend list by appending
        map(  # list from applying
            lambda m: m.group(1),  # get group(1) from match
            re.finditer(r"(\d+)", remainder)  # of each number in string
        )
    )
    print("P:", p_num, "Numbers:", trailing_numbers)
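The two-step idea can also be folded into one function that returns the pieces in the order the question asks for. This is a sketch with assumptions of mine: decode and HEAD are hypothetical names, and it takes the first s or go after the p-number (lazy .*?), which may or may not be the occurrence you want.

```python
import re

# Step 1: the p-number and the first 's' or 'go' that follows it.
HEAD = re.compile(r"(p\d{2}).*?(s|go)")


def decode(line):
    """Return [p-number, 's' or 'go', number, number, ...] or None."""
    m = HEAD.search(line)
    if m is None:
        return None
    # Step 2: every run of digits in the rest of the string.
    return [m.group(1), m.group(2)] + re.findall(r"\d+", line[m.end():])


print(decode('ascjksdcvyp12nbvnzxcmgonbmbh12hjg23'))  # ['p12', 'go', '12', '23']
print(decode('ascjksdcvyp12nbvnzxcmsnbmbh12hjg23'))   # ['p12', 's', '12', '23']
```

This handles any number of trailing integers, since the second step does not depend on how many there are.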