with open('templates/data.xml', 'r') as s:
for line in s:
line = line.rstrip() #removes trailing whitespace and '\n' chars
if "\\$\\(" not in line:
if ")" not in line:
continue
print(line)
start = line.index("$(")
end = line.index(")")
print(line[start+2:end])
I need to match the strings which are like $(hello). But now this even matches (hello).
Im really new to python. So what am i doing wrong here ?
Use the following regex:
\$\(([^)]+)\)
It matches $, followed by (, then anything until the last ), and catches the characters between the parenthesis.
Here we did escape the $, ( and ) since when you use a function that accepts a regex (like findall), you don't want $ to be treated as the special character $, but as the literal "$" (same holds for the ( and )). However, note that the inner parenthesis didn't get quoted since you want to capture the text between the outer parenthesis.
Note that you don't need to escape the special characters when you're not using regex.
You can do:
>>> import re
>>> escaper = re.compile(r'\$\((.*?)\)')
>>> escaper.findall("I like to say $(hello)")
['hello']
I believe something along the lines of:
import re
data = "$(hello)"
matchObj = re.match( r'\$\(([^)]+)\)', data, re.M|re.I)
print matchObj.group()
might do the trick.
If you don't want to do it with regexes (I wouldn't necessarily; they can be hard to read).
Your for loop indentation is wrong.
"\$\(" means \$\( (you're escaping the brackets, not the $ and (.
You don't need to escpae $ or (. Just do if "$(" not in line
You need to check the $( is found before ). Currently your code will match "foo)bar$(baz".
Rather than checking if $( and ) are in the string twice, it would be better to just do the .index() anyway and catch the exception. Something like this:
with open('templates/data.xml', 'r') as s:
for line in s:
try:
start = line.index("$(")
end = line.index(")", start)
print(line[start+2:end])
except ValueError:
pass
Edit: That will only match one $() per line; you'll want to add a loop.
Related
I have a string like this:
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
I need to get rid of everything until the first instance of the underline (inclusive) in regex.
I've tried this:
re.sub("(^.*\_),"", data)
but this get rids of everything before all underlines
ProcessCpuUsage
I need it to be:
jvmRuntimeModule_ProcessCpuUsag
Use this instead:
from string import find
data='WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
result = data[find(data, "_")+1:]
print result
re.sub("(^.*\_),"", data)
This makes . match every character in the line. Once it gets to the end, and can't match any more ".", it goes to the next token. Oops, that's a underscore! So, it backtracks back before the _ProcessCpuUsage, where it can match a underscore at the start, and then complete the match.
You should ask the . multiplier to be less greedy. You also do not need to capture the contents. Drop the parens. The backslash does nothing. Drop it. The leading line-start anchor also does nothing. Drop it.
re.sub(".*?_,", data)
You have become a victim of greedy matching. The expression matches the longest sequence that it possibly can.
I know there's a way to turn off greedy matching, but I never remember it. Instead there's a trick I use when there's a character I want to stop at. Instead of matching on every character with . I match on every character except the one I want to stop at.
re.sub("(^[^_]*\_", "", data)
This should do:
import re
def get_last_part(d):
m = re.match('[^_]*_(.*)', d)
if m:
return m.group(1)
else:
return None
print get_last_part('WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage')
you can use str.index:
>>> data = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
>>> data[data.index('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.split
>>> data.split('_',1)[1]
'jvmRuntimeModule_ProcessCpuUsage'
Using str.find:
>>> data[data.find('_')+1:]
'jvmRuntimeModule_ProcessCpuUsage'
Take a look at string methods Here
Try this regex:
result = re.sub("^.*?_", "", text)
What the regex ^.*?_ does:
^ .. Assert that the position is at the beginning of the string.
.*? .. Match every character that is not a linebreak character
between zero and unlimitted times as few times as possible.
- .. Match the character _
Try using split():
s = 'WebSpherePMI_jvmRuntimeModule_ProcessCpuUsage'
print(s.split('_',1)[1])
Result:
jvmRuntimeModule_ProcessCpuUsage
I'm working with a search&replace programming assignment. I'm a student and I'm finding the regex documentation a bit overwhelming (e.g. https://docs.python.org/2/library/re.html), so I'm hoping someone here could explain to me how to accomplish what I'm looking for.
I've used regex to get a list of strings from my document. They all look like this:
%#import fileName (regexStatement)
An actual example:
%#import script_example.py ( *out =(.|\n)*?return out)
Now, I'm wondering how I can split these up so I get the fileName and regexStatements as separate strings. I'd assume using a regex or string split function, but I'm not sure how to make it work on all kinds of variations of %#import fileName (regexstatement). Splitting using parentheses could hit the middle of the regex statement, or if a parentheses is part of the fileName, for instance. The assignment doesn't specify if it should only be able to import from python files, so I don't believe I can use ".py (" as a splitting point before the regex statement either.
I'm thinking something like a regex "%#import " to hit the gap after import, "\..* " to hit the gap after fileName. But I'm not sure how to get rid of the parentheses that encapsule the regex statement, or how to use all of it to actually split the string correctly so i have one variable storing fileName and one storing regexStatement for each entry in my list.
Thanks a lot for your attention!
If the filename can't contain spaces, just split your string on spaces with maxsplit 2:
>>> line.split(' ', 2)
['%#import', 'script_example.py', '( *out =(.|\n)*?return out)']
The maxsplit 2 makes it split only the first two spaces, and leave intact any spaces within the regex. Now you have the filename as the second element and the regex as the third. It's not clear from your statement whether the parentheses are part of the regex or not (i.e., as a capturing group). If not, you can easily remove them by trimming the first and last characters from that part.
If you assign the values like this:
filename, regex = line.split(' ', 2)[1:]
then you can strip the parentheses with:
regex = regex[1:-1]
That should do it nicely
^%#import (\S+) \((.*)\)
or, if the filename may have spaces:
^%#import ((?:(?! \().)+) \((.*)\)
Both expressions contain two groups, one for the file name and one for the contents of the parentheses. Run in multiline mode on the entire file or in normal mode if you work with single lines anyway.
This: ((?:(?! \().)+) breaks down as:
( # group start
(?: # non-capturing group
(?! # negative look-ahead: a position NOT followed by
\( # " ("
) # end look-ahead
. # match any char (this is part of the filename)
)+ # end non-capturing group, repeat
) # end group
The other bits of the expression should be self-explanatory.
import re
line = "%#import script_example.py ( *out =(.|\\n)*?return out)"
pattern = r'^%#import (\S+) \((.*)\)'
match = re.match(pattern, line)
if match:
print "match.group(1) '" + match.group(1) + "'"
print "match.group(2) '" + match.group(2) + "'"
else:
print "No match."
prints
match.group(1) 'script_example.py'
match.group(2) ' *out =(.|\n)*?return out'
For matching something like %#import script_example.py ( *out =(.|\n)*?return out) i suggest :
r'%#impor[\w\W ]+'
DEMO
note that :
\w match any word character [a-zA-Z0-9_]
\W match any non-word character [^a-zA-Z0-9_]
so you can use re.findall() for find all the matches :
import re
re.findall(r'%#impor[\w\W ]+', your_string)
I would like to rewrite a small Perl programm to Python.
I am processing text files with it as follows:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-yyy.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-yyy.7z.001;file;
Here is the working Perl code:
#filename: perl_regex.pl
#/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
It call it from the command line: perl_regex.pl input.txt
Explanation of the Perl-style regex:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Could anyone tell me, how to get the same result in Python? Especially for the $1 and $2 variables I couldn't find something alike.
The replace instruction for s/pattern/replace/ in python regexes is the re.sub(pattern, replace, string) function, or re.compile(pattern).sub(replace, string). In your case, you will do it so:
_re_pattern = re.compile(r"^(.*?;.*?)(\w)")
result = _re_pattern.sub(r"\1;\2", line)
Note that $1 becomes \1. As for perl, you need to iterate over your lines the way you want to do it (open, inputfile, splitlines, ...).
Python regular expression is very similar to Perl's, except:
In Python there's no regular expression literal. It should be expressed using string. I used r'raw string literal' in the following code.
Backreferences are expressed as \1, \2, .. or \g<1>, \g<2>, ..
...
Use re.sub to replace.
import re
import sys
for line in sys.stdin: # Explicitly iterate standard input line by line
# `line` contains trailing newline!
line = re.sub(r'^(.*?;.*?)(\w)', r'\1;\2', line)
#print(line) # This print trailing newline
sys.stdout.write(line) # Print the replaced string back.
I would like to have a regex pattern to match smileys ":)" ,":(" .Also it should capture repeated smileys like ":) :)" , ":) :(" but filter out invalid syntax like ":( (" .
I have this with me, but it matches ":( ("
bool( re.match("(:\()",str) )
I maybe missing something obvious here, and I'd like some help for this seemingly simple task.
I think it finally "clicked" exactly what you're asking about here. Take a look at the below:
import re
smiley_pattern = '^(:\(|:\))+$' # matches only the smileys ":)" and ":("
def test_match(s):
print 'Value: %s; Result: %s' % (
s,
'Matches!' if re.match(smiley_pattern, s) else 'Doesn\'t match.'
)
should_match = [
':)', # Single smile
':(', # Single frown
':):)', # Two smiles
':(:(', # Two frowns
':):(', # Mix of a smile and a frown
]
should_not_match = [
'', # Empty string
':(foo', # Extraneous characters appended
'foo:(', # Extraneous characters prepended
':( :(', # Space between frowns
':( (', # Extraneous characters and space appended
':((' # Extraneous duplicate of final character appended
]
print('The following should all match:')
for x in should_match: test_match(x);
print('') # Newline for output clarity
print('The following should all not match:')
for x in should_not_match: test_match(x);
The problem with your original code is that your regex is wrong: (:\(). Let's break it down.
The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and are used to apply regex operators on groups of characters at once. So, you're really saying:
( begin a group
:\( ... do regex stuff ...
')' end the group
The : isn't a regex reserved character, so it's just a colon. The \ is, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says
( begin a group
: a colon character
\( a left parenthesis character
) end the group
The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.
^ and $ mean "the beginning of the line" and "the end of the line" respectively. Now we have ...
^ beginning of line
(:\(|:\))+ ... do regex stuff ...
$ end of line
... so it only matches things that comprise the entire line, not simply occur in the middle of the string.
We know that ( and ) denote a grouping. + means "one of more of these". Now we have:
^ beginning of line
( start a group
:\(|:\) ... do regex stuff ...
) end the group
+ match one or more of this
$ end of line
Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:
^ beginning of line
( start a group
: a colon character
\( a left parenthesis character
| or
: a colon character
\) a right parenthesis character
) end the group
+ match one or more of this
$ end of line
I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.
Maybe something like:
re.match('[:;][)(](?![)(])', str)
Try (?::|;|=)(?:-)?(?:\)|\(|D|P). Haven't tested it extensively, but does seem to match the right ones and not more...
In [15]: import re
In [16]: s = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"
In [17]: re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',s)
Out[17]: [':)', '=)', ':(', ':-(', ':)', ':)', ':(', ':(', ':P', ';)']
I got the answer I was looking for from the comments and answers posted here.
re.match("^(:[)(])*$",str)
Thanks to all.
I'm suppose to capture everything inside a tag and the next lines after it, but it's suppose to stop the next time it meets a bracket. What am i doing wrong?
import re #regex
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
(\b(?:.|\s)*(?!\[)) # should read: anyword that doesn't precede a bracket
""", re.MULTILINE | re.VERBOSE)
haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
#[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print m
what im trying to get is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n', '[tab2]','help me\nwrite a better RE\n')]
edit:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
this seems to work but it's also trimming the brackets inside the content.
Python regex doesn't support recursion afaik.
EDIT: but in your case this would work:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
EDIT 2: yes, it doesn't work properly.
import re
regex = re.compile(r"""
(?:^|\n)\[ # tag's opening bracket
([^\]\n]*) # 1. text between brackets
\]\n # tag's closing bracket
(.*?) # 2. text between the tags
(?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
""", re.DOTALL | re.VERBOSE)
haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print regex.findall(haystack)
I do agree with viraptor though. Regex are cool but you can't check your file for errors with them. A hybrid perhaps? :P
tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))
result = {}
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
result[mo2.group(1)] = haystack[mo2.end(1)+1:].strip()
print result
EDIT 3: That's because ^ character means negative match only inside [^squarebrackets]. Everywhere else it means string start (or line start with re.MULTILINE). There's no good way for negative string matching in regex, only character.
First of all why a regex if you're trying to parse? As you can see you cannot find the source of the problem yourself, because regex gives no feedback. Also you don't have any recursion in that RE.
Make your life simple:
def ini_parse(src):
in_block = None
contents = {}
for line in src.split("\n"):
if line.startswith('[') and line.endswith(']'):
in_block = line[1:len(line)-1]
contents[in_block] = ""
elif in_block is not None:
contents[in_block] += line + "\n"
elif line.strip() != "":
raise Exception("content out of block")
return contents
You get error handling with exceptions and the ability to debug execution as a bonus. Also you get a dictionary as a result and can handle duplicate sections while processing. My result:
{'tab2': 'help me\nwrite a better RE\n\n',
'tab1': 'this is captured\nbut this is suppose to be captured too!\n#[this should be taken though as this is in the content]\n\n'}
RE is much overused these days...
Does this do what you want?
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^[]*)
""", re.MULTILINE | re.VERBOSE)
This gives a list of tuples (one 2-tuple per match). If you want a flattened tuple you can write:
m = sum(regex.findall(haystack), ())