Regex translation from Perl to Python - python

I would like to rewrite a small Perl program in Python.
I am processing text files with it as follows:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-yyy.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-yyy.7z.001;file;
Here is the working Perl code:
#filename: perl_regex.pl
#!/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
I call it from the command line: perl_regex.pl input.txt
Explanation of the Perl-style regex:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Could anyone tell me how to get the same result in Python? I couldn't find anything equivalent, especially for the $1 and $2 variables.

The Python counterpart of the s/pattern/replace/ instruction is the re.sub(pattern, replace, string) function, or re.compile(pattern).sub(replace, string). In your case, you would do it like this:
_re_pattern = re.compile(r"^(.*?;.*?)(\w)")
result = _re_pattern.sub(r"\1;\2", line)
Note that $1 becomes \1. As with Perl, you need to iterate over your lines yourself, however you prefer (open the input file, splitlines, ...).
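For completeness, here is a minimal sketch of one way to wire that up, assuming the input file name is passed as the first command-line argument, mirroring the Perl invocation:
import re
import sys

# Sketch: read the file named on the command line, apply the substitution to
# each line, and echo the result.
_re_pattern = re.compile(r"^(.*?;.*?)(\w)")
with open(sys.argv[1]) as f:
    for line in f:
        sys.stdout.write(_re_pattern.sub(r"\1;\2", line))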

Python regular expressions are very similar to Perl's, except:
In Python there is no regular expression literal; a pattern is expressed as a string. I used an r'raw string literal' in the following code.
Backreferences are expressed as \1, \2, .. or \g<1>, \g<2>, ..
...
Use re.sub to replace.
import re
import sys

for line in sys.stdin:  # Explicitly iterate standard input line by line
    # `line` contains the trailing newline!
    line = re.sub(r'^(.*?;.*?)(\w)', r'\1;\2', line)
    # print(line)  # This would print an extra trailing newline.
    sys.stdout.write(line)  # Print the replaced string back.


Python Regex - non-greedy match does not work

I have a flat file with one C++ function name and part of its declaration like this:
virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const
void function_name2
void NameSpace2::NameSpace4::ClassName2::function_name3
function_name4
I am trying to extract the function names alone by using this line:
fn_name = re.match(":(.*?)\(?", lines)
I can understand why function_name2 and function_name4 do not match (because there is no leading :). But I am seeing that even for function_name1 and function_name3, it does not do a non-greedy match. The output of fn_name.group() is
:NameSpace2::ClassName1::function_name1
I have three questions:
I expected just the string "function_name1" to be extracted from line 1, but the non-greedy match does not seem to work. Why?
Why is line 3 not being extracted?
How do I get the function names from all the lines using a single regex?
Please help.
This works pretty well, with your example at least:
^(?:\w+ +)*(?:\w+::)*(\w+)
i.e., in Python code:
import re
function_name = re.compile(r'^(?:\w+ +)*(?:\w+::)*(\w+)', re.MULTILINE)
matches = function_name.findall(your_txt)
# -> ['function_name1', 'function_name2', 'function_name3', 'function_name4']
Takeaway: If you can do it with greedy matching, do it with greedy matching.
Note that \w is not quite correct for a C identifier, but writing down the technically correct character class is beside the point here. Find and use the correct set of characters instead of \w, as in the sketch below.
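For instance, a sketch with a stricter class for C/C++ identifiers (a letter or underscore, followed by letters, digits or underscores) swapped in for \w; your_txt is the same sample text as above:
import re

ident = r'[A-Za-z_][A-Za-z0-9_]*'
function_name = re.compile(rf'^(?:{ident} +)*(?:{ident}::)*({ident})', re.MULTILINE)
matches = function_name.findall(your_txt)
# -> same four names as above for the sample input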
1) Always use r" " strings for regexes.
2)
I am trying to extract the function names alone by using this line:
fn_name = re.match(":(.*?)\(?", lines)
The output of fn_name.group() is
:NameSpace2::ClassName1::function_name1
I'm not seeing that:
import re
line = "virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const"
fn_name = re.search(r":(.*?)\(?", line)
print(fn_name.group())
--output:--
:
In any case, if you want to see how non-greedy works, look at this code:
import re
line = "N----1----2"
greedy_pattern = r"""
N
.*
\d
"""
match_obj = re.search(greedy_pattern, line, flags=re.X)
print(match_obj.group())
non_greedy_pattern = r"""
N
.*?
\d
"""
match_obj = re.search(non_greedy_pattern, line, flags=re.X)
print(match_obj.group())
--output:--
N----1----2
N----1
The non-greedy version asks for all the characters matching .* up until the first digit that is encountered, while the greedy version will try to find the longest match for .* that is followed by a digit.
3) Warning! No regex zone!
func_names = [
    "virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const",
    "void function_name2",
    "void NameSpace2::NameSpace4::ClassName2::function_name3",
    "function_name4",
]

for func_name in func_names:
    name = func_name.rsplit("::", 1)[-1]
    pieces = name.rsplit(" ", 1)
    if pieces[-1] == "const":
        name = pieces[-2]
    else:
        name = pieces[-1]
    name = name.split('(', 1)[0]
    print(name)
--output:--
function_name1
function_name2
function_name3
function_name4
I expected just the string "function_name1" to be extracted from line 1, but the non-greedy match does not seem to work. Why?
This is the result from your regex ":(.*?)\(?"
I think your regex is "too lazy". It will match only : because (.*?) means match any characters "as few as possible", so the regex engine chooses to match zero characters. It does not match up to the \( as you expected because the trailing ? just makes it optional.
Why is line 3 not being extracted?
As far as I've tested your regex, it doesn't work for any of the lines, not only the third one.
How do I get the function names from all the lines using a single regex?
You can start from this minimal example
(?:\:\:|void\s+)(\w+)(?:\(|$)|(function_name4)
Here (?:\:\:|void\s+) stands for whatever leads your function name and (?:\(|$) stands for whatever follows it.
Note that function_name4 has to be listed explicitly, because it has no surrounding pattern to anchor on.
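As a hedged sketch of applying that pattern with findall (the alternation produces two groups per match, so take whichever one is non-empty):
import re

text = """virtual void NameSpace1::NameSpace2::ClassName1::function_name1(int arg1) const
void function_name2
void NameSpace2::NameSpace4::ClassName2::function_name3
function_name4"""

pattern = re.compile(r'(?:\:\:|void\s+)(\w+)(?:\(|$)|(function_name4)', re.MULTILINE)
names = [g1 or g2 for g1, g2 in pattern.findall(text)]
print(names)  # ['function_name1', 'function_name2', 'function_name3', 'function_name4']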
I've been stumped before by something similar when trying to capture the "N----1" from "N foo bar N----1----2". Adding a leading .* gave the desired result.
import re
line = "N foo bar N----1----2"
match_obj = re.search(r'(N.*?\d)', line)
print(match_obj.group(1))
match_obj = re.search(r'.*(N.*?\d)', line)
print(match_obj.group(1))
--output:--
N foo bar N----1
N----1

Is it possible to clean a verbose python regex before printing it?

The Setup:
Let's say I have the following regex defined in my script. I want to keep the comments there for future me because I'm quite forgetful.
RE_TEST = re.compile(r"""[0-9] # 1 Number
                         [A-Z] # 1 Uppercase Letter
                         [a-y] # 1 lowercase, but not z
                         z # gotta have z...
                         """,
                     re.VERBOSE)
print(magic_function(RE_TEST)) # returns: "[0-9][A-Z][a-y]z"
The Question:
Does Python (3.4+) have a way to convert that to the simple string "[0-9][A-Z][a-y]z"?
Possible Solutions:
This question ("strip a verbose python regex") seems to be pretty close to what I'm asking for and it was answered. But that was a few years ago, so I'm wondering if a new (preferably built-in) solution has been found.
In addition to the above, there are work-arounds such as using implicit string concatenation and then using the .pattern attribute:
RE_TEST = re.compile(r"[0-9]" # 1 Number
                     r"[A-Z]" # 1 Uppercase Letter
                     r"[a-y]" # 1 lowercase, but not z
                     r"z", # gotta have z...
                     re.VERBOSE)
print(RE_TEST.pattern) # returns: "[0-9][A-Z][a-y]z"
or just commenting the pattern separately and not compiling it:
# matches pattern "nXxz"
RE_TEST = "[0-9][A-Z][a-y]z"
print(RE_TEST)
But I'd really like to keep the compiled regex the way it is (1st example). Perhaps I'm pulling the regex string from some file, and that file is already using the verbose form.
Background
I'm asking because I want to suggest an edit to the unittest module.
Right now, if you run assertRegex(string, pattern) using a compiled pattern with comments and that assertion fails, then the printed output is somewhat ugly (the below is a dummy regex):
Traceback (most recent call last):
File "verify_yaml.py", line 113, in test_verify_mask_names
self.assertRegex(mask, RE_MASK)
AssertionError: Regex didn't match: '(X[1-9]X[0-9]{2}) # comment\n |(XXX[0-9]{2}) # comment\n |(XXXX[0-9E]) # comment\n |(XXXX[O1-9]) # c
omment\n |(XXX[0-9][0-9]) # comment\n |(XXXX[
1-9]) # comment\n ' not found in 'string'
I'm going to propose that the assertRegex and assertNotRegex methods clean the regex before printing it, by either removing the comments and extra whitespace or by printing it differently.
The following tested script includes a function that does a pretty good job converting an xmode regex string to non-xmode:
pcre_detidy(retext)
# Function pcre_detidy to convert xmode regex string to non-xmode.
# Rev: 20160225_1800
import re

def detidy_cb(m):
    if m.group(2): return m.group(2)
    if m.group(3): return m.group(3)
    return ""

def pcre_detidy(retext):
    decomment = re.compile(r"""(?#!py/mx decomment Rev:20160225_1800)
        # Discard whitespace, comments and the escapes of escaped spaces and hashes.
          ( (?: \s+ # Either g1of3 $1: Stuff to discard (3 types). Either ws,
            | \#.* # or comments,
            | \\(?=[\r\n]|$) # or lone escape at EOL/EOS.
            )+ # End one or more from 3 discardables.
          ) # End $1: Stuff to discard.
        | ( [^\[(\s#\\]+ # Or g2of3 $2: Stuff to keep. Either non-[(\s# \\.
          | \\[^# Q\r\n] # Or escaped-anything-but: hash, space, Q or EOL.
          | \( # Or an open parentheses, optionally
            (?:\?\#[^)]*(?:\)|$))? # starting a (?# Comment group).
          | \[\^?\]? [^\[\]\\]* # Or Character class. Allow unescaped ] if first char.
            (?:\\[^Q][^\[\]\\]*)* # {normal*} Zero or more non-[], non-escaped-Q.
            (?: # Begin unrolling loop {((special1|2) normal*)*}.
              (?: \[(?::\^?\w+:\])? # Either special1: "[", optional [:POSIX:] char class.
              | \\Q [^\\]* # Or special2: \Q..\E literal text. Begin with \Q.
                (?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
                (?:\\E|$) # \Q..\E literal text ends with \E or EOL.
              ) [^\[\]\\]* # End special: One of 2 alternatives {(special1|2)}.
              (?:\\[^Q][^\[\]\\]*)* # More {normal*} Zero or more non-[], non-escaped-Q.
            )* (?:\]|\\?$) # End character class with ']' or EOL (or \\EOL).
          | \\Q [^\\]* # Or \Q..\E literal text start delimiter.
            (?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
            (?:\\E|$) # \Q..\E literal text ends with \E or EOL.
          ) # End $2: Stuff to keep.
        | \\([# ]) # Or g3of3 $6: Escaped-[hash|space], discard the escape.
        """, re.VERBOSE | re.MULTILINE)
    return re.sub(decomment, detidy_cb, retext)
test_text = r"""
[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
"""
print(pcre_detidy(test_text))
This function detidies regexes written in pcre-8/pcre2-10 xmode syntax.
It preserves whitespace inside [character classes], (?#comment groups) and \Q...\E literal text spans.
RegexTidy
The above decomment regex is a variant of one I am using in my upcoming, yet-to-be-released RegexTidy application, which will not only detidy a regex as shown above (which is pretty easy to do), but will also go the other way and tidy a regex, i.e. convert it from non-xmode to xmode syntax, adding whitespace indentation to nested groups as well as comments (which is harder).
p.s. Before giving this answer a downvote on general principle because it uses a regex longer than a couple lines, please add a comment describing one example which is not handled correctly. Cheers!
Looking through the way sre_parse handles this, there really isn't any point where your verbose regex gets "converted" into a regular one and then parsed. Rather, your verbose regex is being fed directly to the parser, where the presence of the VERBOSE flag makes it ignore unescaped whitespace outside character classes, and from unescaped # to end-of-line if it is not inside a character class or a capture group (which is missing from the docs).
The outcome of parsing your verbose regex there is not "[0-9][A-Z][a-y]z". Rather it is:
[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]
In order to do a proper job of converting your verbose regex to "[0-9][A-Z][a-y]z" you could parse it yourself. You could do this with a library like pyparsing. The other answer linked in your question uses regex, which will generally not duplicate the behavior correctly (specifically, spaces inside character classes and # inside capture groups/character classes), and even just dealing with escaping is not as convenient as with a good parser.
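If you just want to see what the parser produces, a small exploratory sketch (sre_parse is an internal, undocumented module, renamed re._parser in Python 3.11+, so treat this as a peek under the hood rather than a supported API):
import re
import sre_parse  # internal module; may warn or move in newer Python versions

verbose_pattern = r"""[0-9] # 1 Number
                      [A-Z] # 1 Uppercase Letter
                      [a-y] # 1 lowercase, but not z
                      z     # gotta have z...
                   """
print(sre_parse.parse(verbose_pattern, re.VERBOSE))  # prints the parsed structure shown above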

How to match the string in $(....) in python

with open('templates/data.xml', 'r') as s:
    for line in s:
        line = line.rstrip() #removes trailing whitespace and '\n' chars
        if "\\$\\(" not in line:
            if ")" not in line:
                continue
        print(line)
        start = line.index("$(")
        end = line.index(")")
        print(line[start+2:end])
I need to match strings like $(hello), but right now this even matches (hello).
I'm really new to Python, so what am I doing wrong here?
Use the following regex:
\$\(([^)]+)\)
It matches $, followed by (, then anything up to the next ), and captures the characters between the parentheses.
Here we escaped the $, ( and ) because in a regex you don't want $ to be treated as the special character $, but as a literal "$" (and the same holds for ( and )). Note, however, that the inner parentheses are not escaped, since they are meant to capture the text between the outer parentheses.
Note that you don't need to escape the special characters when you're not using regex.
You can do:
>>> import re
>>> escaper = re.compile(r'\$\((.*?)\)')
>>> escaper.findall("I like to say $(hello)")
['hello']
I believe something along the lines of:
import re
data = "$(hello)"
matchObj = re.match( r'\$\(([^)]+)\)', data, re.M|re.I)
print(matchObj.group())
might do the trick.
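One follow-up note on that sketch: group() returns the whole match including the $( and ), while group(1) gives just the captured text:
import re

matchObj = re.match(r'\$\(([^)]+)\)', "$(hello)")
print(matchObj.group())   # $(hello)
print(matchObj.group(1))  # hello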
If you don't want to do it with regexes (I wouldn't necessarily; they can be hard to read).
Your for loop indentation is wrong.
"\$\(" means \$\( (you're escaping the brackets, not the $ and (.
You don't need to escape $ or (. Just do if "$(" not in line
You need to check the $( is found before ). Currently your code will match "foo)bar$(baz".
Rather than checking if $( and ) are in the string twice, it would be better to just do the .index() anyway and catch the exception. Something like this:
with open('templates/data.xml', 'r') as s:
    for line in s:
        try:
            start = line.index("$(")
            end = line.index(")", start)
            print(line[start+2:end])
        except ValueError:
            pass
Edit: That will only match one $() per line; you'll want to add a loop.
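A hedged sketch of doing that with findall instead, so several $(...) occurrences on one line are all caught (assuming the same templates/data.xml path as above):
import re

pattern = re.compile(r'\$\(([^)]+)\)')
with open('templates/data.xml', 'r') as s:
    for line in s:
        for name in pattern.findall(line):
            print(name)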

Regex in Python. NOT matches

I'll get straight to the point: I have a string like this (but with thousands of lines)
Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2
and I need to remove the lines that do not match a-z and ąčęėįšųūž, plus _, plus any integer (the 3rd and 4th lines match this). And this should be case insensitive. I think the regex should be
[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier
But what should a regex look like that matches lines which are NOT alpha (including the Lithuanian letters) plus underscore plus integer? I tried
re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)
but no good.
Thanks in advance, sorry if my english is not quite good.
As to making the matching case insensitive, you can use the I or IGNORECASE flags from the re module, for example when compiling your regex:
regex = re.compile("^[a-ząčęėįšųūž]+_\d+$", re.I)
As to removing the lines not matching this regex, you can simply construct a new string consisting of the lines that do match:
new_s = "\n".join(line for line in s.split("\n") if re.match(regex, line))
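As a quick sanity check against the sample lines from the question (just a sketch of the two lines above in action):
import re

s = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)
new_s = "\n".join(line for line in s.split("\n") if re.match(regex, line))
print(new_s)  # Achėmos_18
              # Ąžuolas_4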
First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:
abcdefg_nodigitshere
But you can subfilter that this way:
import re

mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)

for line in inputs.splitlines():
    if re.match(myreg, line):
        pass  # do x
    elif re.match(mydigre, line):
        pass  # do y
    else:
        pass  # line doesn't end with _\d+
Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.
all_lines = set([line for line in inputs.splitlines()])
alpha_lines = set([line for line in all_lines if re.match(myreg, line)])
nonalpha_lines = all_lines - alpha_lines
nonalpha_digi_lines = set([line for line in nonalpha_lines if re.match(mydigre, line)])
Not sure how python does modifiers, but to edit in-place, use something like this (case insensitive):
Edit: Note that some of these characters are UTF-8. To use the literal representation, your editor and language must support it; otherwise use the \u.. code in the character class (recommended).
s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;
where the regex is: r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
the replacement is ''
modifier is multiline and global.
Breakdown: modifiers are global and multiline
(?i) // case insensitive flag
^ // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$)) // look ahead, not this form of a line ?
.* // ok then select all except newline or eos
(?:\n|$) // select newline or end of string
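In Python the same in-place substitution could look roughly like this; re.sub replaces globally by default and re.M supplies the multiline modifier (the sample string is only for illustration):
import re

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2\n"
cleaned = re.sub(r'^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)', '', words,
                 flags=re.I | re.M)
print(cleaned)  # Achėmos_18
                # Ąžuolas_4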

My regex in python isn't recursing properly

I'm supposed to capture everything inside a tag and the lines after it, but it's supposed to stop the next time it meets a bracket. What am I doing wrong?
import re #regex
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
(\b(?:.|\s)*(?!\[)) # should read: anyword that doesn't precede a bracket
""", re.MULTILINE | re.VERBOSE)
haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
#[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print m
What I'm trying to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n', '[tab2]','help me\nwrite a better RE\n')]
edit:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
this seems to work but it's also trimming the brackets inside the content.
Python regex doesn't support recursion afaik.
EDIT: but in your case this would work:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
EDIT 2: yes, it doesn't work properly.
import re
regex = re.compile(r"""
(?:^|\n)\[ # tag's opening bracket
([^\]\n]*) # 1. text between brackets
\]\n # tag's closing bracket
(.*?) # 2. text between the tags
(?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
""", re.DOTALL | re.VERBOSE)
haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print regex.findall(haystack)
I do agree with viraptor though. Regex are cool but you can't check your file for errors with them. A hybrid perhaps? :P
tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))
result = {}
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
result[mo2.group(1)] = haystack[mo2.end(1)+1:].strip()
print result
EDIT 3: That's because the ^ character means negation only inside [^squarebrackets]. Everywhere else it means start of string (or start of line with re.MULTILINE). There's no good way to negatively match a whole string in a regex, only a single character.
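For what it's worth, a small sketch of the usual workaround when you need to stop before a whole substring rather than a single character, using a negative lookahead at each step (my illustration, not part of the answer above):
import re

text = "a[b] [tag2] rest"
print(re.match(r'[^\[]*', text).group())              # 'a'     (stops at any '[')
print(re.match(r'(?:(?!\[tag2\]).)*', text).group())  # 'a[b] ' (stops only before '[tag2]')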
First of all, why a regex if you're trying to parse? As you can see, you cannot find the source of the problem yourself, because a regex gives you no feedback. Also, you don't have any recursion in that RE.
Make your life simple:
def ini_parse(src):
    in_block = None
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            in_block = line[1:len(line)-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise Exception("content out of block")
    return contents
You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:
{'tab2': 'help me\nwrite a better RE\n\n',
'tab1': 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n\n'}
RE is much overused these days...
Does this do what you want?
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^[]*)
""", re.MULTILINE | re.VERBOSE)
This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:
m = sum(regex.findall(haystack), ())
