Does Python have a way to use a Match object as input to a string with backreferences, e.g.:
match = re.match("ab(.*)", "abcd")
print(re.some_replace_function(r"match is: \1", match))  # prints "match is: cd"
You could implement this yourself using the usual string replace functions, but I'm sure the obvious implementation will miss edge cases, resulting in subtle bugs.
You can use re.sub (instead of re.match) to search and replace strings.
To use back-references, the best practice is to use raw strings, e.g. r"\1", or double-escaped strings, e.g. "\\1":
import re
result = re.sub(r"ab(.*)", r"match is: \1", "abcd")
print(result)
# -> match is: cd
But if you already have a Match object, you can use the expand() method:
mo = re.match(r"ab(.*)", "abcd")
result = mo.expand(r"match is: \1")
print(result)
# -> match is: cd
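As a small sketch of the same idea, expand() also accepts the \g<name> form when the pattern uses named groups. The group names prefix and rest below are made up for illustration and are not part of the original question:

```python
import re

# Same pattern as above, but with (hypothetical) group names added.
mo = re.match(r"(?P<prefix>ab)(?P<rest>.*)", "abcd")

# Numbered backreference: group 2 is the part after "ab".
by_number = mo.expand(r"match is: \2")

# Named backreferences use the \g<name> syntax.
by_name = mo.expand(r"match is: \g<rest> (after \g<prefix>)")

print(by_number)
print(by_name)
```

The template syntax is the same one re.sub accepts for its replacement string, so a template written for one can be reused with the other.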
This is in python
Input string:
Str = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
Expected output
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
Here 'FU=' is the occurrence we are looking for, together with the value that follows FU=.
Return all occurrences of FU= (with the associated value for FU=) as a comma-separated string. They can occur anywhere within the string, and special characters are allowed.
Here is one approach.
>>> import re
>>> str_ = 'Y=DAT,X=ZANG,FU=FAT,T=TART,FU=GEM,RO=TOP,FU=MAP,Z=TRY'
>>> re.findall.__doc__[:58]
'Return a list of all non-overlapping matches in the string'
>>> re.findall(r'FU=\w+', str_)
['FU=FAT', 'FU=GEM', 'FU=MAP']
>>> ','.join(re.findall(r'FU=\w+', str_))
'FU=FAT,FU=GEM,FU=MAP'
Got it working
Python Code
import re
str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'
str2='FU='+',FU='.join(re.findall(r'FU=(.*?),', str_))
print(str2)
Gives the desired output:
'FU=_COG-GAB-CANE-,FU=FARE,FU=#-_MAP.com-'
Successfully gives me all the occurrences of FU= followed by values, irrespective of order and number of special characters.
It is a bit unclean, though, as I am manually adding FU= for the first occurrence.
It gets the work done, but please suggest a cleaner way of doing it if there is one.
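One possibly cleaner variant, offered as a sketch rather than the definitive answer: match the FU= prefix together with everything up to the next comma, so nothing has to be re-added by hand. It assumes values never contain a comma and that no other key ends in FU (a key like SNAFU= would also match). Unlike the pattern above, it does not need a trailing comma, so a FU= pair at the very end of the string is picked up too.

```python
import re

str_ = 'Y=DAT,X=ZANG,FU=_COG-GAB-CANE-,FU=FARE,T=TART,RO=TOP,FU=#-_MAP.com-,Z=TRY'

# [^,]* consumes everything up to (but not including) the next comma,
# so each match is a complete "FU=value" pair.
fu_pairs = re.findall(r'FU=[^,]*', str_)
result = ','.join(fu_pairs)
print(result)

# A pair at the end of the string has no trailing comma but is still found.
tail_result = ','.join(re.findall(r'FU=[^,]*', 'A=1,FU=last'))
print(tail_result)
```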
I have an input text like this (the actual text file contains tons of garbage characters surrounding these 2 strings too):
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print(content[value1_start:value1_end])
print(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how I can parse the text so that the output is:
value=xxx
value=yyy
Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx#38u952035983049;3^&^*(^%$value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy
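Summary of the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters
Note that any alphanumeric junk directly adjacent to the key or the value becomes part of the match, so this pattern relies on the surrounding junk being non-alphanumeric.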
For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
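Under the same assumption about the ** delimiters, a capture group could replace the manual slicing. This is only a sketch of an alternative: when the pattern contains exactly one group, re.findall returns the text of that group instead of the whole match.

```python
import re

content = ('(random_garbage_char_here)**value=xxx**;'
           '(random_garbage_char_here)**value=yyy**;'
           '(random_garbage_char_here)')

# The parentheses capture only the part between the ** delimiters,
# so no [2:-2] slicing is needed afterwards.
values = re.findall(r'\*\*(value=.*?)\*\*', content)

for value in values:
    print(value)
```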
You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (performance?) you prefer to use str methods, that is indeed possible. But you must start the search for the second string past the end of the first one:
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
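Generalizing that fix to any number of occurrences, a loop can carry the end position of one hit forward as the start of the next search. The following is a minimal sketch; the helper name find_values and the sample content string are assumptions made for illustration only.

```python
def find_values(content, key='value', terminator=';'):
    """Collect every 'key...terminator' slice using only str.find."""
    results = []
    start = content.find(key)
    while start != -1:
        end = content.find(terminator, start)
        if end == -1:
            # No terminator left: take the rest of the string.
            end = len(content)
        results.append(content[start:end])
        # Resume searching past the end of the hit we just recorded.
        start = content.find(key, end)
    return results


# Hypothetical sample input with junk around the two values.
content = '$%^value=xxx;&*(value=yyy;#@!'
found = find_values(content)
print(found)
```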
I am rather new to Python Regex (regex in general) and I have been encountering a problem. So, I have a few strings like so:
str1 = r'''hfo/gfbi/mytag=a_17014b_82c'''
str2 = r'''/bkyhi/oiukj/game/?mytag=a_17014b_82c&'''
str3 = r'''lkjsd/image/game/mytag=a_17014b_82c$'''
the & and the $ could be any symbol.
I would like to have a single regex (and replace) which replaces:
mytag=a_17014b_82c
to:
mytag=myvalue
from any of the above 3 strings. Would appreciate any guidance on how I can achieve this.
UPDATE: the string to be replaced is not always the same. So, a_17014b_82c could be anything in reality.
If the string to be replaced is constant you don't need a regex. Simply use replace:
>>> str1 = r'''hfo/gfbi/mytag=a_17014b_82c'''
>>> str1.replace('a_17014b_82c','myvalue')
'hfo/gfbi/mytag=myvalue'
Use re.sub:
>>> import re
>>> r = re.compile(r'(mytag=)(\w+)')
>>> r.sub(r'\1myvalue', str1)
'hfo/gfbi/mytag=myvalue'
>>> r.sub(r'\1myvalue', str2)
'/bkyhi/oiukj/game/?mytag=myvalue&'
>>> r.sub(r'\1myvalue', str3)
'lkjsd/image/game/mytag=myvalue$'
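A variation on the same substitution, sketched here as an alternative and not as part of the answer above, uses a lookbehind so that mytag= is required but is not part of the match. The replacement string then needs no backreference.

```python
import re

# (?<=mytag=) asserts that "mytag=" precedes the match without consuming it,
# so only the old value itself is replaced.
pattern = re.compile(r'(?<=mytag=)\w+')

samples = [
    r'''hfo/gfbi/mytag=a_17014b_82c''',
    r'''/bkyhi/oiukj/game/?mytag=a_17014b_82c&''',
    r'''lkjsd/image/game/mytag=a_17014b_82c$''',
]

replaced = [pattern.sub('myvalue', s) for s in samples]
for line in replaced:
    print(line)
```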
import re
r = re.compile(r'(mytag=)\w+(?=\W?$)')
r.sub(r'\1myvalue', str1)
This is based on @Ashwini's answer, with two small changes. First, the mytag=a_17014b_82c part must now be at the end of the input, optionally followed by one trailing symbol (that is what the (?=\W?$) lookahead checks), so that even inputs such as
str1 = r'''/bkyhi/mytag=blah/game/?mytag=a_17014b_82c&'''
will work fine, substituting the last mytag instead of the first.
The other small change is that we no longer capture the \w+ unnecessarily, since we aren't using it anyway. This is just for a bit of code clarity.
>>> import re
>>> s = 'this is a test'
>>> reg1 = re.compile('test$')
>>> match1 = reg1.match(s)
>>> print(match1)
None
In Kiki, this regex matches the test at the end of s. What am I missing? (I tried re.compile(r'test$') as well.)
Use
match1 = reg1.search(s)
instead. The match function only matches at the start of the string ... see the documentation here:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
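The difference described in that passage can be seen side by side in a short sketch using the string from the question. re.fullmatch (available since Python 3.4) is included as an extra for comparison; it is not mentioned in the quoted documentation.

```python
import re

s = 'this is a test'
pattern = re.compile(r'test$')

# match() only tries at position 0, where the text is "this...", so it fails.
at_start = pattern.match(s)

# search() scans forward and finds "test" at the end of the string.
anywhere = pattern.search(s)

# fullmatch() requires the pattern to cover the entire string, so it fails too.
whole = pattern.fullmatch(s)

print(at_start)
print(anywhere.span())
print(whole)
```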
Your regex does not match the full string. You can use search instead as Useless mentioned, or you can change your regex to match the full string:
'^this is a test$'
Or somewhat harder to read but somewhat less useless:
'^t[^t]*test$'
It depends on what you're trying to do.
This is because the match method returns None if it cannot find the expected pattern; if it does find the pattern, it returns a match object (of type _sre.SRE_Match in Python 2, re.Match in recent Python 3 versions).
So, if you want a Boolean (True or False) result from match, you must check whether the result is None.
You could check whether a text matches like this:
import re
string_to_evaluate = "Your text that needs to be examined"
expected_pattern = "pattern"
if re.match(expected_pattern, string_to_evaluate) is not None:
    print("The text is as you expected!")
else:
    print("The text is not as you expected!")
In Perl:
if ($test =~ /^id\:(.*)$/ ) {
    print $1;
}
In Python:
import re
test = 'id:foo'
match = re.search(r'^id:(.*)$', test)
if match:
    print(match.group(1))
In Python, regular expressions are available through the re library.
The r before the string indicates that it is a raw string literal, meaning that backslashes are not treated specially (otherwise every backslash would need to be escaped with another backslash in order for a literal backslash to make its way into the regex string).
I have used re.search here because this is the closest equivalent to Perl's =~ operator. There is another function re.match which does the same thing but only checks for a match starting at the beginning of the string (counter-intuitive to a Perl programmer's definition of "matching"). See this explanation for full details of the differences between the two.
Also note that there is no need to escape the : since it is not a special character in regular expressions.
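To make the raw-string point concrete, here is a small sketch with a pattern where the r prefix changes the outcome. The \b escape and the sample text are chosen for illustration and do not come from the answer above.

```python
import re

text = "a foo"

# Without the r prefix, Python turns "\b" into a single backspace character
# before the regex engine ever sees the string.
plain = "\bfoo"

# With the r prefix, the backslash survives, so the regex engine
# receives \b and treats it as a word boundary.
raw = r"\bfoo"

plain_len = len(plain)
raw_len = len(raw)

plain_match = re.search(plain, text)  # looks for backspace + "foo"
raw_match = re.search(raw, text)      # looks for "foo" at a word boundary

print(plain_len, raw_len)
print(plain_match)
print(raw_match.span())
```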
match = re.match("^id:(.*)$", test)
if match:
    print(match.group(1))
Use a compiled regular expression object (RegexObject), as described here:
http://docs.python.org/library/re.html#regular-expression-objects
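As a rough sketch of what using a compiled object might look like for the Perl snippet in the question: the pattern is compiled once and its search method is reused for each input. The names id_pattern and lines are hypothetical.

```python
import re

# Compile once; the resulting object has its own search/match/sub methods.
id_pattern = re.compile(r'^id:(.*)$')

# Hypothetical inputs, only some of which start with "id:".
lines = ['id:foo', 'name:bar', 'id:baz']

ids = []
for line in lines:
    match = id_pattern.search(line)
    if match:
        # group(1) plays the role of Perl's $1.
        ids.append(match.group(1))

print(ids)
```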
I wrote this Perl-to-Python regex converter when I had to rewrite a lot of Perl regexes as calls to Python's re package. It covers only some basic cases, but it might still be helpful:
import re

def convert_re(perl_re, string_var='column_name',
               test_value=None, expected_test_result=None):
    '''
    Returns a Perl substitution (s/.../.../) converted to a call to Python's `re.sub`
    '''
    match = re.match(r"(\w+)/(.+)/(.*)/(\w*)", perl_re)
    if not match:
        raise ValueError("Not a Perl regex? " + perl_re)
    if match.group(1) != 's':
        raise ValueError("This function is only for `s` Perl regexes (substitutions), i.e. s/a/b/")
    flags = match.group(4)
    if 'g' in flags:
        count = 0  # all matches
        flags = flags.replace('g', '')  # remove g
    else:
        count = 1  # first match only
    # map the remaining Perl flag letters (i, m, s, x) to re.I, re.M, re.S, re.X
    flags = '|'.join('re.' + f.upper() for f in flags)
    # change Perl group references in the replacement like $2 to Python references like \g<2>
    replacement = match.group(3)
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    python_code = "re.sub(r'{regexp}', r'{replacement}', {string}{count}{flags})".format(
        regexp=match.group(2),
        replacement=replacement,
        string=string_var,
        count=", count={}".format(count) if count else '',
        flags=", flags={}".format(flags) if flags else '',
    )
    if test_value:
        print("Testing Perl regular expression {} with value '{}':".format(perl_re, test_value))
        print("(generated equivalent Python code: {} )".format(python_code))
        # run the generated code in its own namespace so the result can be read back on Python 3
        namespace = {'re': re}
        exec('{}=r"{}"; test_result={}'.format(string_var, test_value, python_code), namespace)
        test_result = namespace['test_result']
        assert test_result == expected_test_result, "produced={} expected={}".format(test_result, expected_test_result)
        print("Test OK.")
    return string_var + " = " + python_code

print(convert_re(r"s/^[ 0-9-]+//", test_value=' 2323 col', expected_test_result='col'))
print(convert_re(r"s/[+-]/_/g", test_value='a-few+words', expected_test_result='a_few_words'))