repetition in regular expression in python

repetition in regular expression in python - python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for similar post, but couldnt find it :(

You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
^ ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.

re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more any characters besides $; this can potentially avoid more pathological backtracking behaviour.

Your regex is fine. re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in m = re.findall(r'$(.*?)$', line):
if m is not None:
print m.group(0)

Related

python re regex matching in string with multiple () parenthesis

I have this string
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
What is a regex to first match exactly "IP(dev.maintserial):Y(dev.maintkeys)"
There might be a different path inside the parenthesis, like (name.dev.serial), so it is not like there will always be one dot there.
I though of something like this:
re.search('(IP\(.*?\):Y\(.*?\))', cmd) but this will also match the single IP(k1) and Y(y1
My usage will be:
If "IP(*):Y(*)" in cmd:
do substitution of IP(dev.maintserial):Y(dev.maintkeys) to Y(dev.maintkeys.IP(dev.maintserial))
How can I then do the above substitution? In the if condition I want to do this change in order: from IP(path_to_IP_key):Y(path_to_Y_key) to Y(path_to_Y_key.IP(path_to_IP_key)) , so IP is inside Y at the end after the dot.

This should work as it is more restrictive.
(IP\([^\)]+\):Y\(.*?\))
[^\)]+ means at least one character that isn't a closing parenthesis.
.*? in yours is too open ended allowing almost anything to be in until "):Y("

Something like this?
r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)"
You can try it with your string. It matches the entire string IP(dev.maintserial):Y(dev.maintkeys) with groups dev.maintserial and dev.maintkeys.
The RE matches IP(, zero or more characters that are not a closing parenthesis ([^)]*), a period . (\.), one or more of any characters (.+), then ):Y(, ... (between the parentheses -- same as above), ).
Example Usage
import re
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
# compile regular expression
p = re.compile(r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)")
s = p.search(cmd)
# if there is a match, s is not None
if s:
print(f"{s[0]}\n{s[1]}\n{s[2]}")
a = "Y(" + s[2] + ".IP(" + s[1] + "))"
print(f"\n{a}")
Above p.search(cmd) "[s]can[s] through [cmd] looking for the first location where this regular expression [p] produces a match, and return[s] a corresponding match object" (docs). None is the return value if there is no match. If there is a match, s[0] gives the entire match, s[1] gives the first parenthesized subgroup, and s[2] gives the second parenthesized subgroup (docs).
Output
IP(dev.maintserial):Y(dev.maintkeys)
dev.maintserial
dev.maintkeys
Y(dev.maintkeys.IP(dev.maintserial))

You can use 2 negated character classes [^()]* to match any character except parenthesis, and omit the outer capture group for a match only.
To prevent a partial word match, you might start matching IP with a word boundary \b
\bIP\([^()]*\):Y\([^()]*\)
Regex demo

Regex to extract the string

I need help with regex to get the following out of the string
dal001.caxxxxx.test.com. ---> caxxxxx.test.com
caxxxx.test.com -----> caxxxx.test.com
So basically in the first example, I don't want dal001 or anything that starts with 3 letters and 3 digits and want the rest of the string if it starts with only ca.
In second example I want the whole string that starts only with ca.
So far I have tried (^[a-z]{3}[\d]+\.)?(ca.*) but it doesn't work when the string is
dal001.mycaxxxx.test.com.
Any help would be appreciated.

You can use
^(?:[a-z]{3}\d{3}\.)?(ca.*)
See the regex demo. To make it case insensitive, compile with re.I (re.search(rx, s, re.I), see below).
Details:
^ - start of string
(?:[a-z]{3}\d{3}\.)? - an optional sequence of 3 letters and then 3 digits and a .
(ca.*) - Group 1: ca and the rest of the string.
See the Python demo:
import re
rx = r"^(?:[a-z]{3}\d{3}\.)?(ca.*)"
strs = ["dal001.caxxxxx.test.com","caxxxx.test.com"]
for s in strs:
m = re.search(rx, s)
if m:
print( m.group(1) )

Use re.sub like so:
import re
strs = ['dal001.caxxxxx.test.com', 'caxxxx.test.com']
for s in strs:
s = re.sub(r'^[A-Za-z]{3}\d{3}[.]', '', s)
print(s)
# caxxxxx.test.com
# caxxxx.test.com

if you are using re:
import re
my_strings = ['dal001.caxxxxx.test.com', 'caxxxxx.test.com']
my_regex = r'^(?:[a-zA-Z]{3}[0-9]{3}\.)?(ca.*)'
compiled_regex = re.compile(r)
for a_string in my_strings:
if compiled_regex.match(a_string):
compiled_regex.sub(r'\1', a_string)
my_regex matches a string that starts (^ anchors to the start of the string) with [3 letters][3 digits][a .], but only optionally, and using a non-capturing group (the (?:) will not get a numbered reference to use in sub). In either case, it must then contain ca followed by anything, and this part is used as the replacement in the call to re.sub. re.compile is used to make it a bit faster, in case you have many strings to match.
Note on re.compile:
Some answers don't bother pre-compiling the regex before the loop. They have made a trade: removing a single line of code, at the cost of re-compiling the regex implicitly on every iteration. If you will use a regex in a loop body, you should always compile it first. Doing so can have a major effect on the speed of a program, and there is no added cost even when the number of iterations is small. Here is a comparison of compiled vs. non-compiled versions of the same loop using the same regex for different numbers of loop iterations and number of trials. Judge for yourself.

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?

How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.

You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'

This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]

your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis

You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?

You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))

How about this ? Example illusrated using a file:
f = open('abc.log','r')
content = f.readlines()
for line in content:
m = re.search(r"\[(.*?)\]", line)
print m.group(1)
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.

This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

non greedy Python regex from end of string

I need to search a string in Python 3 and I'm having troubles implementing a non greedy logic starting from the end.
I try to explain with an example:
Input can be one of the following
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
test2 = 'x-y-z_XX1234567890_84481.xml'
test3 = 'XX1234567890_84481.xml'
I need to find the last part of the string ending with
somestring_otherstring.xml
In all the above cases the regex should return XX1234567890_84481.xml
My best try is:
result = re.search('(_.+)?\.xml$', test1, re.I).group()
print(result)
Here I used:
(_.+)? to match "_anystring" in a non greedy mode
\.xml$ to match ".xml" in the final part of the string
The output I get is not correct:
_x-y-z_XX1234567890_84481.xml
I found some SO questions (link) explaining the regex starts from the left even with non greedy qualifier.
Could anyone explain me how to implement a non greedy regex from the right?

Your pattern (_.+)?\.xml$ captures in an optional group from the first underscore until it can match .xml at the end of the string and it does not take the number of underscores that should be between into account.
To only match the last part you can omit the capturing group. You could use a negated character class and use the anchor $ to assert the end of the line as it is the last part:
[^_]+_[^_]+\.xml$
Regex demo | Python demo
That will match
[^_]+ Match 1+ times not _
_ Match literally
[^_]+ Match 1+ times not _
\.xml$ Match .xml at the end of the string
For example:
import re
test1 = 'AB_x-y-z_XX1234567890_84481.xml'
result = re.search('[^_]+_[^_]+\.xml$', test1, re.I)
if result:
print(result.group())

Not sure if this matches what you're looking for conceptually as "non greedy from the right" - but this pattern yields the correct answer:
'[^_]+_[^_]+\.xml$'
The [^_] is a character class matching any character which is not an underscore.

You need to use this regex to capture what you want,
[^_]*_[^_]*\.xml
Demo
Check out this Python code,
import re
arr = ['AB_x-y-z_XX1234567890_84481.xml','x-y-z_XX1234567890_84481.xml','XX1234567890_84481.xml']
for s in arr:
m = re.search(r'[^_]*_[^_]*\.xml', s)
if (m):
print(m.group(0))
Prints,
XX1234567890_84481.xml
XX1234567890_84481.xml
XX1234567890_84481.xml
The problem in your regex (_.+)?\.xml$ is, (_.+)? part will start matching from the first _ and will match anything until it sees a literal .xml and whole of it is optional too as it is followed by ?. Due to which in string _x-y-z_XX1234567890_84481.xml, it will also match _x-y-z_XX1234567890_84481 which isn't the correct behavior you desired.

python. re.findall and re.sub with '^'

I try to change string like s='2.3^2+3^3-√0.04*2+√4',
where 2.3^2 has to change to pow(2.3,2), 3^3 - pow(3,3), √0.04 - sqrt(0.04) and
√4 - sqrt(4).
s='2.3^2+3^3-√0.04*2+√4'
patt1='[0-9]+\.[0-9]+\^[0-9]+|[0-9]+\^[0-9]'
patt2='√[0-9]+\.[0-9]+|√[0-9]+'
idx1=re.findall(patt1, s)
idx2=re.findall(patt2, s)
idx11=[]
idx22=[]
for i in range(len(idx1)):
idx11.append('pow('+idx1[i][:idx1[i].find('^')]+','+idx1[i][idx1[i].find('^')+1:]+')')
for i in range(len(idx2)):
idx22.append('sqrt('+idx2[i][idx2[i].find('√')+1:]+')')
for i in range(len(idx11)):
s=re.sub(idx1[i], idx11[i], s)
for i in range(len(idx22)):
s=re.sub(idx2[i], idx22[i], s)
print(s)
Temp results:
idx1=['2.3^2', '3^3']
idx2=['√0.04', '√4']
idx11=['pow(2.3,2)', 'pow(3,3)']
idx22=['sqrt(0.04)', 'sqrt(4)']
but string result:
2.3^2+3^3-sqrt(0.04)*2+sqrt(4)
Why calculating 'idx1' is right, but re.sub don't insert this value into string ?
(sorry for my english:)

Try this using only re.sub()
Input string:
s='2.3^2+3^3-√0.04*2+√4'
Replacing for pow()
s = re.sub("(\d+(?:\.\d+)?)\^(\d+)", "pow(\\1,\\2)", s)
Replacing for sqrt()
s = re.sub("√(\d+(?:\.\d+)?)", "sqrt(\\1)", s)
Output:
pow(2.3,2)+pow(3,3)-sqrt(0.04)*2+sqrt(4)
() means group capture and \\1 means first captured group from regex match. Using this link you can get the detail explanation for the regex.

I've only got python 2.7.5 but this works for me, using str.replace rather than re.sub. Once you've gone to the effort of finding the matches and constructing their replacements, this is a simple find and replace job:
for i in range(len(idx11)):
s = s.replace(idx1[i], idx11[i])
for i in range(len(idx22)):
s = s.replace(idx2[i], idx22[i])
edit
I think you're going about this in quite a long-winded way. You can use re.sub in one go to make these changes:
s = re.sub('(\d+(\.\d+)?)\^(\d+)', r'pow(\1,\3)', s)
Will substitute 2.3^2+3^3 for pow(2.3,2)+pow(3,3) and:
s = re.sub('√(\d+(\.\d+)?)', r'sqrt(\1)', s)
Will substitute √0.04*2+√4 to sqrt(0.04)*2+sqrt(4)
There's a few things going on here that are different. Firstly, \d, which matches a digit, the same as [0-9]. Secondly, the ( ) capture whatever is inside them. In the replacement, you can refer to these captured groups by the order in which they appear. In the pow example I'm using the first and third group that I have captured.
The prefix r before the replacement string means that the string is to be treated as "raw", so characters are interpreted literally. The groups are accessed by \1, \2 etc. but because the backslash \ is an escape character, I would have to escape it each time (\\1, \\2, etc.) without the r.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

repetition in regular expression in python - python

Related

python re regex matching in string with multiple () parenthesis

Regex to extract the string

how to make a list in python from a string and using regular expression [duplicate]

non greedy Python regex from end of string

python. re.findall and re.sub with '^'

Categories

Resources