Regular expression prints only the first character - python

Instead of printing the whole "final" string, it prints only "p". Can anyone help here?
import re
final = r'print "\n^^^###***===TP test result: $final_verdict===***###^^^\n";'
searchObj = re.compile(r'[\w\s\"\n\^\^\^\#\#\#\*\*\*\=\=\=\w+\s\w+\:\s\$\w+\=\=\=\#\#\#\*\*\*\^\^\^\n\"\;]')
print(searchObj)
y = searchObj.match(final)
if y:
    print("Found", y.group())
else:
    print("Nothing")
Result:
re.compile('[\\w\\s\\"\\n\\^\\^\\^\\#\\#\\#\\*\\*\\*\\=\\=\\=\\w+\\s\\w+\\:\\s\\$\\w+\\=\\=\\=\\#\\#\\#\\*\\*\\*\\^\\^\\^\\n\\"\\;]')
Found p

You've put square brackets around your regex, which means you defined a character class. You should remove them:
r'\w\s\"\n\^\^\^\#\#\#\*\*\*\=\=\=\w+\s\w+\:\s\$\w+\=\=\=\#\#\#\*\*\*\^\^\^\n\"\;'
By using a character class you say: any one of the characters between the square brackets. So [ab] means: a or b, not a followed by b.
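A quick way to see the difference is to try both forms on a tiny input. This is only a minimal sketch; the sample strings are made up for illustration and do not come from the question.

```python
import re

# A character class matches exactly ONE character out of the set.
one_char = re.match(r'[ab]', 'ab').group()

# Without brackets the pattern is a sequence: 'a' followed by 'b'.
sequence = re.match(r'ab', 'ab').group()

# A quantifier after the class repeats the "one of these" choice.
repeated = re.match(r'[ab]+', 'abba!').group()

print(one_char, sequence, repeated)  # a ab abba
```

This is why the original pattern returned only "p": the whole bracketed expression could match a single character and nothing more.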
Now, however, your string no longer matches (it is of course harder to match a sequence than a single character). You can improve the pattern to:
r'\w+\s\"\\n\^\^\^###\*\*\*===\w+\s\w+\s\w+:\s\$\w+===\*\*\*###\^\^\^\\n\";'
#   ^    ^        ^^^      ^^^        ^^^^^        ^^^      ^^^      ^
The carets on the second line mark the changes. First, the leading \w becomes \w+: match() anchors at the start of the string, so the pattern has to consume the whole word print, not just one character. Second, you do not need to escape # and =. Third, you wrote \n, which the regex engine interprets as a newline character, but you want to match a literal \n (two characters, a backslash followed by n), so you need to escape the backslash: \\n. Finally, you forgot that there are three words before the colon (:).
You can test and modify your regex on regex101.
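For reference, here is a runnable sketch of the question's code with a sequence pattern instead of a character class. It assumes the pattern starts with \w+ (rather than a single \w) so that match(), which anchors at the start of the string, can consume the whole word print; the variable name matched_text is made up for the example.

```python
import re

final = r'print "\n^^^###***===TP test result: $final_verdict===***###^^^\n";'

# No surrounding square brackets: this is a sequence, not a character class.
# \\n matches a literal backslash followed by the letter n.
pattern = re.compile(
    r'\w+\s\"\\n\^\^\^###\*\*\*===\w+\s\w+\s\w+:\s\$\w+===\*\*\*###\^\^\^\\n\";'
)

y = pattern.match(final)
matched_text = y.group() if y else None

if y:
    print("Found", matched_text)
else:
    print("Nothing")
```

With this pattern the match covers the entire string rather than just its first character.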

Related

python re regex matching in string with multiple () parenthesis

I have this string
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
What is a regex to first match exactly "IP(dev.maintserial):Y(dev.maintkeys)"
There might be a different path inside the parenthesis, like (name.dev.serial), so it is not like there will always be one dot there.
I thought of something like this:
re.search('(IP\(.*?\):Y\(.*?\))', cmd) but this match also includes the standalone IP(k1) and Y(y1)
My usage will be:
If "IP(*):Y(*)" in cmd:
do substitution of IP(dev.maintserial):Y(dev.maintkeys) to Y(dev.maintkeys.IP(dev.maintserial))
How can I then do the above substitution? In the if condition I want to change IP(path_to_IP_key):Y(path_to_Y_key) to Y(path_to_Y_key.IP(path_to_IP_key)), so that IP ends up inside Y, after the dot.
This should work as it is more restrictive.
(IP\([^\)]+\):Y\(.*?\))
[^\)]+ means at least one character that isn't a closing parenthesis.
.*? in yours is too open-ended: it allows almost anything, including other parentheses, until it reaches "):Y(".
Something like this?
r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)"
You can try it with your string. It matches the entire string IP(dev.maintserial):Y(dev.maintkeys) with groups dev.maintserial and dev.maintkeys.
The RE matches IP(, zero or more characters that are not a closing parenthesis ([^)]*), a period . (\.), one or more of any characters (.+), then ):Y(, ... (between the parentheses -- same as above), ).
Example Usage
import re
cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"
# compile regular expression
p = re.compile(r"IP\(([^)]*\..+)\):Y\(([^)]*\..+)\)")
s = p.search(cmd)
# if there is a match, s is not None
if s:
    print(f"{s[0]}\n{s[1]}\n{s[2]}")
    a = "Y(" + s[2] + ".IP(" + s[1] + "))"
    print(f"\n{a}")
Above p.search(cmd) "[s]can[s] through [cmd] looking for the first location where this regular expression [p] produces a match, and return[s] a corresponding match object" (docs). None is the return value if there is no match. If there is a match, s[0] gives the entire match, s[1] gives the first parenthesized subgroup, and s[2] gives the second parenthesized subgroup (docs).
Output
IP(dev.maintserial):Y(dev.maintkeys)
dev.maintserial
dev.maintkeys
Y(dev.maintkeys.IP(dev.maintserial))
You can use two negated character classes [^()]* to match any characters except parentheses, and omit the outer capture group if you only need the match.
To prevent a partial word match, you might start matching IP with a word boundary \b:
\bIP\([^()]*\):Y\([^()]*\)
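The question also asks how to do the substitution itself. One possible sketch, assuming capture groups are added around the two [^()]* parts (the pattern above has none), is to let re.sub swap the groups with backreferences in the replacement string:

```python
import re

cmd = "show run IP(k1) new Y(y1) add IP(dev.maintserial):Y(dev.maintkeys)"

# Group 1 is the path inside IP(...), group 2 is the path inside Y(...).
pattern = re.compile(r"\bIP\(([^()]*)\):Y\(([^()]*)\)")

# \2 and \1 in the replacement refer back to the captured groups.
result = pattern.sub(r"Y(\2.IP(\1))", cmd)

print(result)
# show run IP(k1) new Y(y1) add Y(dev.maintkeys.IP(dev.maintserial))
```

The standalone IP(k1) and Y(y1) are left alone because [^()]* cannot cross a closing parenthesis, so IP(k1) is never followed by ):Y( within the same match.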

Trouble with re.sub when used with boundary and string containing round brackets

I have the following string and I want to replace (abc+0.5)*3 with 10
Test_String= 'I am not able to replace (abc+0.5)*3'
I've tried the following
re.sub('\\b(abc\\+0.5)\\*3\\b','10',Test_String)
re.sub('\\b\\(abc\\+0.5\\)\\*3\\b','10',Test_String)
But nothing seems to work and I am using boundary as I want to replace the exact match.
Expected Output
I am not able to replace 10
Actual Output
I am not able to replace (abc+0.5)*3
What am I doing wrong?
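You should:
not double-escape: in a raw string, \\+ becomes \+
not use the word boundary \b before the opening parenthesis, as there is no word boundary between the space and ( (you can keep the closing one)
escape the dot as \. (optional, since an unescaped . matches any character, including a literal dot, but escaping it makes the match exact)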
import re
test_string = 'I am not able to replace (abc+0.5)*3'
res = re.sub(r'\(abc\+0\.5\)\*3\b', '10', test_string)  # \b is optional
print(res)  # I am not able to replace 10
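If the text to replace is not known in advance, escaping it by hand becomes error-prone. A possible alternative, sketched below, is to let re.escape build the pattern and to use the lookarounds (?<!\w) and (?!\w) in place of \b. The lookarounds are a substitute I am assuming here, not part of the answer above; they only require that no word character sits directly next to the match, so they also work next to parentheses.

```python
import re

test_string = 'I am not able to replace (abc+0.5)*3'
target = '(abc+0.5)*3'

# re.escape backslash-escapes every regex metacharacter in the literal text.
pattern = r'(?<!\w)' + re.escape(target) + r'(?!\w)'

res = re.sub(pattern, '10', test_string)
print(res)  # I am not able to replace 10

# A word character directly before the target blocks the replacement.
untouched = re.sub(pattern, '10', 'x(abc+0.5)*3')
print(untouched)  # x(abc+0.5)*3
```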

using Python re to check a string

I have a list of IDs, and I need to check whether these IDs are properly formatted. The correct format is as follows:
[O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
[A-N,R-Z][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9][A-Z][A-Z,0-9][A-Z,0-9][0-9]
The string can also be followed by a dash and a number. I have two problems with my code: 1) how do I limit the length of the string to exactly the number of characters specified by the search terms? and 2) how can I specify that there can be a "-[0-9]" following the string if it matches?
potential_uniprots=['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25', 'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
import re
def is_uniprot(ID):
    status=False
    uniprot1=re.compile(r'\b[O,P,Q]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot2=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    uniprot3=re.compile(r'\b[A-N,R-Z]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}[A-Z]{1}[A-Z,0-9]{1}[A-Z,0-9]{1}[0-9]{1}\b')
    if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
        status=True
    return status
correctIDs=[]
for prot in potential_uniprots:
    if is_uniprot(prot) == True:
        correctIDs.append(prot)
print(correctIDs)
Expression Fixes:
BEFORE READING:
All credit for the expression fixes goes to The fourth bird's comment. Please see that comment here or under the original post:
You can omit {1} and the commas from the character class (if you don't want to match commas). The patterns by themselves do not contain a quantifier and have word boundaries. So between these word boundaries, you are already matching an exact number of characters. To match an optional hyphen and digit, you can use an optional non-capturing group (?:-[0-9])?
You don't need the , separating the characters in the square brackets, because a character class matches any one of the characters listed inside it. For example, a regex such as [A-Z,0-9] is going to match an uppercase letter, a comma, or a digit, whereas a regex such as [A-Z0-9] is going to match an uppercase letter or a digit. Furthermore, you don't need the {1}, as the regex matches exactly one occurrence by default if no quantifier is specified. This means that you can just delete the {1} from the expression.
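Both points are easy to check directly. The snippet below is a small sketch with made-up variable names; it only confirms that the comma is treated as a literal member of the class and that {1} changes nothing.

```python
import re

with_comma = re.compile(r'[A-Z,0-9]')
without_comma = re.compile(r'[A-Z0-9]')

# Inside a character class the comma is just another allowed character.
comma_accepted = with_comma.fullmatch(',') is not None
comma_rejected = without_comma.fullmatch(',') is None

# {1} is redundant: both forms match exactly one digit and no more.
same_single = (re.fullmatch(r'[0-9]{1}', '7') is not None
               and re.fullmatch(r'[0-9]', '7') is not None)
same_double = (re.fullmatch(r'[0-9]{1}', '77') is None
               and re.fullmatch(r'[0-9]', '77') is None)

print(comma_accepted, comma_rejected, same_single, same_double)
```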
Checking Length?
There is a simple way to do this without regex, which is as follows:
string = "Q08F88"
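status = len(string) in (6, 10)  # the formats above are 6 or 10 characters long
But you can also make the regex itself enforce the length. Your patterns contain no open-ended quantifiers, so each one matches a fixed number of characters, and \b (word boundary), which you have already used, stops the match from starting or ending in the middle of a word. Note that \b still allows a match inside a longer string, for example right before a hyphen, so you can alternatively use ^ and $ at the beginning and end of the expression, respectively, to denote the beginning and end of the string.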
Consider this expression: ^abcd$ (only match strings that contain abcd and nothing else)
This means that it is only going to match the string:
abcd
And not:
eabcd
abcde
This is because ^ denotes the start of the string and $ denotes the end of the string.
In the end, you're left with this first expression:
(^[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](?:-[0-9])?$)
You can modify your other expressions easily as they follow the same structure as above.
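Putting these pieces together, a sketch of the whole check might look as follows. Only the first pattern is spelled out above; the second and third are my adaptation of the formats listed in the question, and the names SUFFIX, UNIPROT_PATTERNS and correct_ids are made up for the example. It uses fullmatch, which requires the pattern to cover the entire string, instead of writing ^ and $ explicitly.

```python
import re

# Optional trailing "-<digit>", as suggested in the quoted comment.
SUFFIX = r"(?:-[0-9])?"

# One pattern per format from the question (assumed transcription).
UNIPROT_PATTERNS = [
    re.compile(r"[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]" + SUFFIX),
    re.compile(r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]" + SUFFIX),
    re.compile(
        r"[A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]"
        r"[A-Z][A-Z0-9][A-Z0-9][0-9]" + SUFFIX
    ),
]


def is_uniprot(candidate):
    """Return True if the whole string matches one of the formats."""
    return any(p.fullmatch(candidate) for p in UNIPROT_PATTERNS)


potential_uniprots = ['D4S359N116-2', 'DFQME6AGX4', 'Y6IT25',
                      'V5PG90', 'A7TD4U7ZN11', 'C3KQY5-V']
correct_ids = [c for c in potential_uniprots if is_uniprot(c)]
print(correct_ids)  # ['D4S359N116-2', 'Y6IT25', 'V5PG90']
```

Here 'C3KQY5-V' is rejected because the suffix must be a digit, and 'A7TD4U7ZN11' is rejected because it fits none of the fixed-length formats.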
Code Suggestions
Your code looks great, but you could make a few minor fixes to improve readability and conventions. For example, you could change this:
if uniprot1.search(ID) or uniprot2.search(ID)or uniprot3.search(ID):
    status=True
return status
To this:
return bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
# -OR-
status = bool(uniprot1.search(ID) or uniprot2.search(ID) or uniprot3.search(ID))
return status
search() returns a match object on success and None on failure, so the or chain evaluates to either a match object or None. Wrapping it in bool() turns that into True or False, which matters here because the calling code compares the result with == True.

Does regex automatically ignore trailing whitespace?

Why do these two expressions return the same output?
phillip = '#awesome '
nltk.re_show('\w+|[^\w\s]+', phillip)
vs.
nltk.re_show('\w+|[^\w]+', phillip)
Both return:
{#}{awesome}
Why doesn't the second one return
{#}{awesome}{ }?
It appears that nltk right-strips whitespace from the string before applying the regex.
See the source code (or import inspect and run print(inspect.getsource(nltk.re_show))):
def re_show(regexp, string, left="{", right="}"):
    """docstring here -- I stripped it for brevity"""
    print(re.compile(regexp, re.M).sub(left + r"\g<0>" + right, string.rstrip()))
In particular, see the string.rstrip(), which strips all trailing whitespace.
For example, if you make sure that your phillip string does not end with whitespace:
nltk.re_show('\w+|[^\w]+', phillip + '.')
# {#}{awesome}{ .}
Not sure why nltk would do this, it seems like a bug to me...
\w matches [A-Za-z0-9_] (for ASCII text). Since the pattern is an alternation (one or more "word" characters OR one or more non-"word" characters), each match consumes a run of one kind of character and stops at the first character of the other kind.
If you do a global match on the unstripped string, you will see that there is another match containing the space (a non-"word" character).
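To confirm that the regex itself does not ignore the trailing space, you can bypass nltk and call the re module directly. This is a minimal sketch; the variable names are made up, and it only mirrors the rstrip() call seen in the re_show source above.

```python
import re

phillip = '#awesome '
pattern = re.compile(r'\w+|[^\w]+')

# Plain re: the trailing space is matched by the second alternative.
raw_matches = pattern.findall(phillip)

# Same pattern after stripping trailing whitespace, as re_show does.
stripped_matches = pattern.findall(phillip.rstrip())

print(raw_matches)       # ['#', 'awesome', ' ']
print(stripped_matches)  # ['#', 'awesome']
```

So the difference in output comes from the rstrip() call, not from the regular expression.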

Match series of (non-nested) balanced parentheses at end of string

How can I match one or more parenthetical expressions appearing at the end of string?
Input:
'hello (i) (m:foo)'
Desired output:
['i', 'm:foo']
Intended for a python script. Paren marks cannot appear inside of each other (no nesting), and the parenthetical expressions may be separated by whitespace.
It's harder than it might seem at first glance, at least so it seems to me.
import re
paren_pattern = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)")
def getParens(s):
    return paren_pattern.findall(s)
or even shorter:
getParens = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)").findall
Explanation:
\(                      # opening paren
([^()]*)                # content, captured into group 1
\)                      # closing paren
(?=                     # look ahead for...
  (?:\s*\([^()]*\))*    # a series of parens, separated by whitespace
  \s*                   # possibly more whitespace after
  $                     # end of string
)                       # end of look ahead
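The commented breakdown above can be used almost verbatim as the pattern by compiling it with re.VERBOSE, which ignores whitespace in the pattern and treats # as the start of a comment. The sketch below does that and adds one extra input of my own (with a parenthetical in the middle of the string) to show that only the trailing series is returned.

```python
import re

paren_pattern = re.compile(
    r"""
    \(                      # opening paren
    ([^()]*)                # content, captured into group 1
    \)                      # closing paren
    (?=                     # look ahead for...
      (?:\s*\([^()]*\))*    # a series of parens, separated by whitespace
      \s*                   # possibly more whitespace after
      $                     # end of string
    )                       # end of look ahead
    """,
    re.VERBOSE,
)

trailing = paren_pattern.findall('hello (i) (m:foo)')
mixed = paren_pattern.findall('hello (a) world (i) (m:foo)')

print(trailing)  # ['i', 'm:foo']
print(mixed)     # ['i', 'm:foo']
```

In the second input, (a) is skipped because the lookahead cannot reach the end of the string through parentheses and whitespace alone.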
You don't need to use regex:
def splitter(text):
    return [s.rstrip(" \t)") for s in text.split("(")][1:]
print(splitter('hello (i) (m:foo)'))
Note: this solution only works if your input is already known to be valid. See MizardX's solution that will work on any input.
