in Perl:
if ($test =~ /^id\:(.*)$/ ) {
print $1;
}
In Python:
import re
test = 'id:foo'
match = re.search(r'^id:(.*)$', test)
if match:
print match.group(1)
In Python, regular expressions are available through the re library.
The r before the string indicates that it is a raw string literal, meaning that backslashes are not treated specially (otherwise every backslash would need to be escaped with another backslash in order for a literal backslash to make its way into the regex string).
I have used re.search here because this is the closest equivalent to Perl's =~ operator. There is another function re.match which does the same thing but only checks for a match starting at the beginning of the string (counter-intuitive to a Perl programmer's definition of "matching"). See this explanation for full details of the differences between the two.
Also note that there is no need to escape the : since it is not a special character in regular expressions.
match = re.match("^id:(.*)$", test)
if match:
print match.group(1)
Use a RegexObject like stated here:
http://docs.python.org/library/re.html#regular-expression-objects
I wrote this Perl to Python regex converter when I had to rewrite a bunch of Perl regex'es (a lot) to Python's re package calls. It covers some basic stuff, but might be still helpful in many ways:
def convert_re (perl_re, string_var='column_name',
test_value=None, expected_test_result=None):
'''
Returns Python regular expression converted to calls of Python `re` library
'''
match = re.match(r"(\w+)/(.+)/(.*)/(\w*)", perl_re)
if not match:
raise ValueError("Not a Perl regex? "+ perl_re)
if not match.group(1)=='s':
raise ValueError("This function is only for `s` Perl regexpes (substitutes), i.e s/a/b/")
flags = match.group(4)
if 'g' in flags:
count=0 # all matches
flags=flags.replace('g','') # remove g
else:
count=1 # one exact match only
if not flags:
flags=0
# change any group references in replacements like \2 to group references like \g<2>
replacement=match.group(3)
replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
python_code = "re.sub(r'{regexp}', r'{replacement}', {string}{count}{flags})".format(
regexp=match.group(2)
, replacement=replacement
, string=string_var
, count=", count={}".format(count) if count else ''
, flags=", flags={}".format(flags) if flags else ''
)
if test_value:
print("Testing Perl regular expression {} with value '{}':".format(perl_re, test_value))
print("(generated equivalent Python code: {} )".format(python_code))
exec('{}=r"{}"; test_result={}'.format(string_var, test_value, python_code))
assert test_result==expected_test_result, "produced={} expected={}".format(test_result, expected_test_result)
print("Test OK.")
return string_var+" = "+python_code
print convert_re(r"s/^[ 0-9-]+//", test_value=' 2323 col', expected_test_result='col')
print convert_re(r"s/[+-]/_/g", test_value='a-few+words', expected_test_result='a_few_words')
Related
I am a very beginner of programming and reading the book "Automate the boring stuff with Python'. In Chapter 7, there is a project practice: the regex version of strip(). My code below does not work (I use Python 3.6.1). Could anyone help?
import re
string = input("Enter a string to strip: ")
strip_chars = input("Enter the characters you want to be stripped: ")
def strip_fn(string, strip_chars):
if strip_chars == '':
blank_start_end_regex = re.compile(r'^(\s)+|(\s)+$')
stripped_string = blank_start_end_regex.sub('', string)
print(stripped_string)
else:
strip_chars_start_end_regex = re.compile(r'^(strip_chars)*|(strip_chars)*$')
stripped_string = strip_chars_start_end_regex.sub('', string)
print(stripped_string)
You can also use re.sub to substitute the characters in the start or end.
Let us say if the char is 'x'
re.sub(r'^x+', "", string)
re.sub(r'x+$', "", string)
The first line as lstrip and the second as rstrip
This just looks simpler.
When using r'^(strip_chars)*|(strip_chars)*$' string literal, the strip_chars is not interpolated, i.e. it is treated as a part of the string. You need to pass it as a variable to the regex. However, just passing it in the current form would result in a "corrupt" regex because (...) in a regex is a grouping construct, while you want to match a single char from the define set of chars stored in the strip_chars variable.
You could just wrap the string with a pair of [ and ] to create a character class, but if the variable contains, say z-a, it would make the resulting pattern invalid. You also need to escape each char to play it safe.
Replace
r'^(strip_chars)*|(strip_chars)*$'
with
r'^[{0}]+|[{0}]+$'.format("".join([re.escape(x) for x in strip_chars]))
I advise to replace * (zero or more occurrences) with + (one or more occurrences) quantifier because in most cases, when we want to remove something, we need to match at least 1 occurrence of the unnecessary string(s).
Also, you may replace r'^(\s)+|(\s)+$' with r'^\s+|\s+$' since the repeated capturing groups will keep on re-writing group values upon each iteration slightly hampering the regex execution.
#! python
# Regex Version of Strip()
import re
def RegexStrip(mainString,charsToBeRemoved=None):
if(charsToBeRemoved!=None):
regex=re.compile(r'[%s]'%charsToBeRemoved)#Interesting TO NOTE
return regex.sub('',mainString)
else:
regex=re.compile(r'^\s+')
regex1=re.compile(r'$\s+')
newString=regex1.sub('',mainString)
newString=regex.sub('',newString)
return newString
Str=' hello3123my43name is antony '
print(RegexStrip(Str))
Maybe this could help, it can be further simplified of course.
Does Python have a capability to use a Match object as input to a string with backreferences, eg:
match = re.match("ab(.*)", "abcd")
print re.some_replace_function("match is: \1", match) // prints "match is: cd"
You could implement yourself using usual string replace functions, but I'm sure the obvious implementation will miss edge cases resulting in subtle bugs.
You can use re.sub (instead of re.match) to search and replace strings.
To use back-references, the best practices is to use raw strings, e.g.: r"\1", or double-escaped string, e.g. "\\1":
import re
result = re.sub(r"ab(.*)", r"match is: \1", "abcd")
print(result)
# -> match is: cd
But, if you already have a Match Object , you can use the expand() method:
mo = re.match(r"ab(.*)", "abcd")
result = mo.expand(r"match is: \1")
print(result)
# -> match is: cd
I have a list of regexes in string form (created after parsing natural language text which were search queries). I want to use them for searching text now. Here is how I am doing it right now-
# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
mo=re.search(string_regex %gs1,sent,re.I)
if mo:
print(mo.group())
What I need is to be able to use these string regexes, but also have Python's raw literal notation on them, as we all should for regex queries. Now about these expressions - I have natural text search commands like -
LINE_CONTAINS foo(+)
Which I use pyparsing to convert to regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))' based on a grammar. I send a list of these human rules to the pyparsing code and it gives me back a list of ~100 of these regexes. These regexes are constructed in string format.
This is the MCVE version of the code that generates these strings that are supposed to act as regexes -
from pyparsing import *
import re
def parse_hrr(received_sentences):
UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
"""LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH
class Node(object):
def __init__(self, tokens):
self.tokens = tokens
def generate(self):
pass
class LiteralNode(Node):
def generate(self):
return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
def __repr__(self):
return repr(self.tokens[0])
class ConsecutivePhrases(Node):
def generate(self):
join_these=[]
tokens = self.tokens[0]
for t in tokens:
tg = t.generate()
join_these.append(tg)
seq = []
for word in join_these[:-1]:
if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
seq.append(word + "")
else:
seq.append(word + "\s+")
seq.append(join_these[-1])
result = "".join(seq)
return result
class AndNode(Node):
def generate(self):
tokens = self.tokens[0]
join_these=[]
for t in tokens[::2]:
tg = t.generate()
tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
join_these.append(tg_mod)
joined = ''.join(ele for ele in join_these)
full = '('+ joined+')'
return full
class OrNode(Node):
def generate(self):
tokens = self.tokens[0]
joined = '|'.join(t.generate() for t in tokens[::2])
full = '('+ joined+')'
return full
class LineTermNode(Node):
def generate(self):
tokens = self.tokens[0]
ret = ''
dir_phr_map = {
'LINE_CONTAINS': lambda a: r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456", #%gs1, sent, re.I)",
'PARA_STARTSWITH':
lambda a: ("r'(^" + a + "(?=[\W_]|$))' 457") if 'gene' in repr(a) #%gs1, s, re.I)"
else ("r'(^" + a + "(?=[\W_]|$))' 458")} #,s, re.I
for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
ret = dir_phr_map[line_dir](phr_term.generate())
return ret
## THE GRAMMAR
word = ~keyword + Word(alphas, alphanums+'-_+/()')
some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
phrase_item = some_words
phrase_expr = infixNotation(phrase_item,
[
(None, 2, opAssoc.LEFT, ConsecutivePhrases),
(AND, 2, opAssoc.LEFT, AndNode),
(OR, 2, opAssoc.LEFT, OrNode),
],
lpar=Suppress('{'), rpar=Suppress('}')
) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
(phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = line_term.setParseAction(LineTermNode)
###########################################################################################
mrrlist=[]
for t in received_sentences:
t = t.strip()
try:
parsed = line_contents_expr.parseString(t)
temp_regex = parsed[0].generate()
mrrlist.append(temp_regex)
return(mrrlist)
So basically, the code is stringing together the regex. Then I add the necessary parameters like re.search, %gs1 etc .to have the complete regex search query. I want to be able to use these string regexes for searching, hence I had earlier thought eval() would convert the string to its corresponding Python expression here, which is why I used it - I was wrong.
TL;DR - I basically have a list of strings that have been created in the source code, and I want to be able to use them as regexes, using Python's raw literal notation.
Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.
For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw string algorithm doesn't the extra backslashes since it never replaces escape sequences.
However, in this particular case the special treatment is not actually necessary, since the \( and \) are not meaningful escape sequences in a Python string. When Python sees an invalid escape sequence, it just includes it literally (backslash and all). So you could also use "\(foo\)" (without the r prefix) and it will work just fine too.
But it's not generally a good idea to rely upon backslashes being ignored however, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's a best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
Now to bring this around to the example code you've shown. I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason. You're using exec to undo that wrapping. But because only the inner strings are using raw string syntax, by the time you eval them you're too late to avoid Python's string parsing messing up your literals if you have any of the troublesome escape sequences (the outer string will have already parsed \b for instance and turned it into the ASCII backspace character \x08).
You should tear the exec code out and fix your literals to avoid the extra quotes. This should work:
regex_list=[r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))', # use raw literals, with no extra quotes!
r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
mo=re.search(string_regex %gs1,sent,re.I) # no eval here!
if mo:
print(mo.group())
This example works for me (it prints foo(+)). Note that you've got some extra unnecessary backslashes in your second pattern (before the spaces). Those are harmless, but might be adding even more confusion to a complicate subject (regex are notoriously hard to understand).
I'm looking for a squaremeter term in some kind of text using this RegExpression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this thing should only be matched if before OR after it, a string like "Wohnfläche"/"Wohnfl"/"Wfl" exists. In other words: the latter term is mandatory, however its positon is not.
Writing a RegEx for this is not the issue in general, my problem is how to write it most elegant. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
new search, kombined with 'or' statement (I'm using Python)
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)
You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
And check which capture group matched. It is just not possible to use the restrictive strings optionally in the regex on both sides, the will just be ignored.
See the regex demo
IDEONE demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
m = pat.search(x)
if m:
if m.group(1): # First alternative found a match
print("{}".format(m.group(1), " - ", m.group(2)))
else: # Second alternative "won"
print("{}".format(m.group(3), " - ", m.group(4)))
Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap ( if they did, you'd get a false positive ). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
import re
...
# pa: <area_regex>
# pt: <text_regex>
# pt_anchored: <text_regex>, anchored
#
pa = re.compile ( r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]' )
pt = re.compile ( r'[.]*[Wohnfläche|Wohnfl|Wfl]' )
pt_anchored = re.compile ( r'^[.]*[Wohnfläche|Wohnfl|Wfl]|[.]*[Wohnfläche|Wohnfl|Wfl]$' )
if pa.match(<teststring>) and pt.match(<teststring>):
print 'Match found: '
else:
print 'No match'
...
I'm trying to do a simple VB6 to c translator to help me port an open source game to the c language.
I want to be able to get "NpcList[NpcIndex]" from "With Npclist[NpcIndex]" using ragex and to replace it everywhere it has to be replaced. ("With" is used as a macro in VB6 that adds Npclist[NpcIndex] when ever it needs to until it founds "End With")
Example:
With Npclist[NpcIndex]
.goTo(245) <-- it should be replaced with Npclist[NpcIndex].goTo(245)
End With
Is it possible to use regex to do the job?
I've tried using a function to perfom another regex replace between the "With" and the "End With" but I can't know the text the "With" is replacing (Npclist[NpcIndex]).
Thanks in advance
I personally wouldn't trust any single-regex solution to get it right on the first time nor feel like debugging it. Instead, I would parse the code line-to-line and cache any With expression to use it to replace any . directly preceded by whitespace or by any type of brackets (add use-cases as needed):
(?<=[\s[({])\. - positive lookbehind for any character from the set + escaped literal dot
(?:(?<=[\s[({])|^)\. - use this non-capturing alternatives list if to-be-replaced . can occur on the beginning of line
import re
def convert_vb_to_c(vb_code_lines):
c_code = []
current_with = ""
for line in vb_code_lines:
if re.search(r'^\s*With', line) is not None:
current_with = line[5:] + "."
continue
elif re.search(r'^\s*End With', line) is not None:
current_with = "{error_outside_with_replacement}"
continue
line = re.sub(r'(?<=[\s[({])\.', current_with, line)
c_code.append(line)
return "\n".join(c_code)
example = """
With Npclist[NpcIndex]
.goTo(245)
End With
With hatla
.matla.tatla[.matla.other] = .matla.other2
dont.mind.me(.do.mind.me)
.next()
End With
"""
# use file_object.readlines() in real life
print(convert_vb_to_c(example.split("\n")))
You can pass a function to the sub method:
# just to give the idea of the regex
regex = re.compile(r'''With (.+)
(the-regex-for-the-VB-expression)+?
End With''')
def repl(match):
beginning = match.group(1) # NpcList[NpcIndex] in your example
return ''.join(beginning + line for line in match.group(2).splitlines())
re.sub(regex, repl, the_string)
In repl you can obtain all the information about the matching from the match object, build whichever string you want and return it. The matched string will be replaced by the string you return.
Note that you must be really careful to write the regex above. In particular using (.+) as I did matches all the line up to the newline excluded, which or may not be what you want(but I don't know VB and I have no idea which regex could go there instead to catch only what you want.
The same goes for the (the-regex-forthe-VB-expression)+. I have no idea what code could be in those lines, hence I leave to you the detail of implementing it. Maybe taking all the line can be okay, but I wouldn't trust something this simple(probably expressions can span multiple lines, right?).
Also doing all in one big regular expression is, in general, error prone and slow.
I'd strongly consider regexes only to find With and End With and use something else to do the replacements.
This may do what you need in Python 2.7. I'm assuming you want to strip out the With and End With, right? You don't need those in C.
>>> import re
>>> search_text = """
... With Np1clist[Npc1Index]
... .comeFrom(543)
... End With
...
... With Npc2list[Npc2Index]
... .goTo(245)
... End With"""
>>>
>>> def f(m):
... return '{0}{1}({2})'.format(m.group(1), m.group(2), m.group(3))
...
>>> regex = r'With\s+([^\s]*)\s*(\.[^(]+)\(([^)]+)\)[^\n]*\nEnd With'
>>> print re.sub(regex, f, search_text)
Np1clist[Npc1Index].comeFrom(543)
Npc2list[Npc2Index].goTo(245)