The input string could be any of the strings below.
image: xyz.com/elk_init_cos7:1.0.0-20.12.0
image: "xyz.com/ckaf/kafka:4.1.0-5.4.1-59"
import re
mat = re.search("image:\s*\"?(.+?/(.+?):(.+?))\"?", str)
if mat:
    print(mat.group(1))
    print(mat.group(2))
    print(mat.group(3))
Output:
xyz.com/ckaf/kafka:4
ckaf/kafka
4
If I use the regex "image:\s*"?(.+?/(.+?):(.+))"?", then the last group includes the trailing double quote: 4.1.0-5.4.1-59".
How can I get the last part of the string without the " at the end and still match the other input string?
The (.+?))\"? part of the pattern, when used at the end of the pattern, matches very few chars because .+? only has to match a single char; the engine then checks for a ", and if there is no " the match simply ends with the single char captured by (.+?). There is no obligation here to keep matching until a " char.
The (.+))\"? at the end of the pattern will match and capture text up to the end of the line, and \"? will match nothing (or, in other words, empty string).
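To see both behaviours side by side, here is a minimal sketch using the quoted sample line from the question (the variable names are mine, not from the original code):

```python
import re

line = 'image: "xyz.com/ckaf/kafka:4.1.0-5.4.1-59"'

# Lazy last group followed by an optional quote: stops after one character.
lazy = re.search(r'image:\s*"?(.+?/(.+?):(.+?))"?', line)
# Greedy last group: runs to the end of the line and swallows the quote.
greedy = re.search(r'image:\s*"?(.+?/(.+?):(.+))"?', line)

print(lazy.group(3))    # 4
print(greedy.group(3))  # 4.1.0-5.4.1-59"
```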
You want to match anything but a " char, one or more times here.
image:\s*\"?(.+?/(.+?):([^\"]+))
See the regex demo. I added \n in the online demo just to make sure the match does not go across lines; if the lines are standalone strings in your real scenario, you do not need it.
You may use the negated character classes in other places of your regex, too:
image:\s*\"?([^/]+/([^:]+):([^\"]+))
See this regex demo.
The [^/]+/([^:]+) part now matches one or more chars other than / (with [^/]+), then matches a / char, and then captures into Group 2 one or more chars other than a : char (with ([^:]+)). Group 1 still holds the whole image reference.
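As a quick check, the negated-class pattern can be run against both sample lines from the question. This is only a sketch: parse_image and IMAGE_RE are hypothetical names I introduced for illustration.

```python
import re

# Negated-character-class pattern from the answer above.
IMAGE_RE = re.compile(r'image:\s*"?([^/]+/([^:]+):([^"]+))')

def parse_image(line):
    """Return (full reference, repository path, tag) or None if no match."""
    m = IMAGE_RE.search(line)
    return m.groups() if m else None

unquoted = parse_image('image: xyz.com/elk_init_cos7:1.0.0-20.12.0')
quoted = parse_image('image: "xyz.com/ckaf/kafka:4.1.0-5.4.1-59"')

print(unquoted)  # ('xyz.com/elk_init_cos7:1.0.0-20.12.0', 'elk_init_cos7', '1.0.0-20.12.0')
print(quoted)    # ('xyz.com/ckaf/kafka:4.1.0-5.4.1-59', 'ckaf/kafka', '4.1.0-5.4.1-59')
```

Note that [^"]+ also matches newlines, so on multi-line input you would probably want [^"\n]+ instead.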
re.search("image:\s*\"?(.+?/(.+?):(.+?))\"?$", str)
When I placed $ at the end, it addressed my issue. Thank you for sharing your inputs.
I am trying to grab fary_trigger_post in the code below using Regex. However, I don't understand why it always includes " in the end of the matched pattern, which I don't expect.
Any idea or suggestion?
re.match(
r'-instance[ "\']*(.+)[ "\']*$',
'-instance "fary_trigger_post" '.strip(),
flags=re.S).group(1)
'fary_trigger_post"'
Thank you.
The (.+) is greedy and grabs ANY character until the end of the input. If you modified your input to include characters after the final double quote (e.g. '-instance "fary_trigger_post" asdf') you would find the double quote and the remaining characters in the output (e.g. fary_trigger_post" asdf). Instead of .+ you should try [^"\']+ to capture all characters except the quotes. This should return what you expect.
re.match(r'-instance[ "\']*([^"\']+)[ "\'].*$', '-instance "fary_trigger_post" '.strip(), flags=re.S).group(1)
Also, note that I modified the end of the expression to use .* which will match any characters following the last quote.
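A short sketch of the difference, using the input from the question plus the extended input mentioned above (variable names are mine):

```python
import re

text = '-instance "fary_trigger_post" '.strip()
extended = '-instance "fary_trigger_post" asdf'

greedy_re = r'-instance[ "\']*(.+)[ "\']*$'
negated_re = r'-instance[ "\']*([^"\']+)[ "\'].*$'

greedy = re.match(greedy_re, text, flags=re.S).group(1)
negated = re.match(negated_re, text, flags=re.S).group(1)
greedy_ext = re.match(greedy_re, extended, flags=re.S).group(1)
negated_ext = re.match(negated_re, extended, flags=re.S).group(1)

print(greedy)       # fary_trigger_post"
print(negated)      # fary_trigger_post
print(greedy_ext)   # fary_trigger_post" asdf
print(negated_ext)  # fary_trigger_post
```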
Here's what I'd use in your matching string, but it's hard to provide a better answer without knowing all your cases:
r'-instance\s+"(.+)"\s*$'
When you get group 1 (i.e. (.+)), the regex follows this match to the end of the string, since . can match any character 1 or more times (and it takes the maximum number of times). I would suggest using the following pattern:
'-instance[ "\']*(.+)["\']+ *$'
This requires the regex to match the closing quotes and any trailing spaces separately, so they won't be included in group 1.
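Here is a small sketch trying the two anchored patterns suggested above on the unstripped input. One thing it also shows, as far as I can tell: both patterns need a closing quote, so an unquoted value does not match at all.

```python
import re

with_quotes = '-instance "fary_trigger_post" '
without_quotes = '-instance fary_trigger_post'

required_quotes = re.compile(r'-instance\s+"(.+)"\s*$')
trailing_quotes = re.compile(r'-instance[ "\']*(.+)["\']+ *$')

a = required_quotes.match(with_quotes).group(1)
b = trailing_quotes.match(with_quotes).group(1)
print(a)  # fary_trigger_post
print(b)  # fary_trigger_post

# Neither pattern matches when the value is not quoted.
print(required_quotes.match(without_quotes))  # None
print(trailing_quotes.match(without_quotes))  # None
```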
I am quite new to regex and right now I have a problem formulating a regex to match a string where the first and last letters are different. I looked on the internet and found a regex that does just the opposite, i.e. it matches words that have the same starting and ending letter. Can anyone please help me understand whether I can negate this regex in some way, or create a new regex that matches my requirements? The regex that needs to be modified or changed is:
^\s|^[a-z]$|^([a-z]).*\1$
This matches these Strings :
aba,
a,
b,
c,
d,
" ",
cccbbbbbbac,
aaaaba
But I want it to match strings like:
aaabbcz,
zba,
ccb,
cbbbba
Can anyone please help me in this regard? Thank you.
Note: I will be using this with Python regex, so the regex should be compatible with Python.
You don't need a regex for this, just use
s[0] != s[-1]
where s is your string. If you must use a regex, you can use this:
^(.).*(?!\1).$
This looks for
^ : beginning of string
(.) : a character (captured in group 1)
.* : some number of characters
(?!\1). : a character which is not the character captured in group 1
$ : end of string
Regex demo on regex101
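A minimal sketch comparing the regex with the plain index check on the strings from the question. The bool(s) guard is my addition so the index check does not raise on an empty string.

```python
import re

DIFFERENT_ENDS = re.compile(r'^(.).*(?!\1).$')

samples = ['aaabbcz', 'zba', 'ccb', 'cbbbba', 'aba', 'a', 'aaaaba']

by_regex = [DIFFERENT_ENDS.match(s) is not None for s in samples]
by_index = [bool(s) and s[0] != s[-1] for s in samples]

for s, r, i in zip(samples, by_regex, by_index):
    print(s, r, i)
# The first four strings give True for both checks, the last three give False.
```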
This part of your pattern ^([a-z]).*\1$ only accounts for chars a-z, but you also want to exclude " "
You can rewrite that pattern by putting the part after the capture group inside a negative lookahead.
^(.)(?!.*\1$).+
^ Start of string
(.) Capture a single char (including spaces) in group 1
(?!.*\1$) Negative lookahead, assert that the string does not end with the same character
.+ Match 1+ characters so that the string has a minimum of 2 characters
See a regex demo.
If the string should start and end with a non-whitespace character, to prevent leading / trailing spaces, you can start the match with a non-whitespace character \S and also end the match with a non-whitespace character.
^(\S)(?!.*\1$).*\S$
See another regex demo.
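To illustrate how the two lookahead variants differ, here is a small sketch. The test strings are made up by me; the last two only differ in where the space sits.

```python
import re

any_char = re.compile(r'^(.)(?!.*\1$).+')
non_space = re.compile(r'^(\S)(?!.*\1$).*\S$')

tests = ['zba', 'aba', 'a', ' ', 'ab ', ' ab']

any_results = [bool(any_char.match(s)) for s in tests]
non_space_results = [bool(non_space.match(s)) for s in tests]

for s, x, y in zip(tests, any_results, non_space_results):
    print(repr(s), x, y)
# 'ab ' and ' ab' pass the first pattern but are rejected by the \S variant.
```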
I got this regex:
(\s|'|\")((?=.*[0-9])(?=.*[a-zA-Z]))([a-z0-9]{8})(\s|'|\")
to search for strings of length 8 having one lower case character and one digit. The string needs to be enclosed by space, quote or double quote.
What does not work in the expression: something like this would be accepted:
"1234567a'. If string starts with ' it should end with ', when starting with " it should end by " etc.
I am not very strong at regexes so let me ask if there is a better way to enforce same character for begin and end without repeating regex 3 times?
If you want to match the same char at the end of the string as the one at its start, you may use a backreference to the char once it is captured into a capturing group.
Besides, to make sure you match at the start of the string, add ^ anchor at the start of the string and $ anchor at the end of string:
r'''^([\s'"])(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]{8}\1$'''
See the regex demo
The ([\s'"]) is a capturing group with ID 1, so, the \1 backreference at the end matches the same text as is stored in Group 1 memory buffer.
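A quick sketch of the backreference in action. The candidate strings are my own, based on the example in the question.

```python
import re

token = re.compile(r'''^([\s'"])(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]{8}\1$''')

candidates = [
    '"1234567a"',    # same delimiter on both sides
    "'1234567a'",
    ' 1234567a ',
    '"1234567a\'',   # opens with " but closes with '
    '"12345678"',    # no letter
    '"abcdefgh"',    # no digit
]

results = [bool(token.match(c)) for c in candidates]
for c, ok in zip(candidates, results):
    print(repr(c), ok)
# Only the first three candidates match.
```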
I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that
Contain a specific string
Are of unknown length
May be followed by either
punctuation
whitespace
or the end of string.
For example, for each of these strings, I've marked in brackets what I'd like to return.
"#handle what is your problem?" [RETURN '#handle']
"what is your problem #handle?" [RETURN '#handle']
"#123handle what is your problem #handle123?" [RETURN '#123handle', '#handle123']
This is what I have so far:
>>> import re
>>> re.findall(r'(#.*handle.*?)\W','hi #123handle, hello #handle123')
['#123handle']
# This misses the handles that are followed by end-of-string
I tried modifying to include an or character allowing the end-of-string character. Instead, it just returns the whole string.
>>> re.findall(r'(#.*handle.*?)(?=\W|$)','hi #123handle, hello #handle123')
['#123handle, hello #handle123']
# This looks like it is too greedy and ends up returning too much
How can I write an expression that will satisfy both conditions?
I've looked at a couple other places, but am still stuck.
It seems you are trying to match strings starting with #, then having 0+ word chars, then handle, and then again 0+ word chars.
Use
r'#\w*handle\w*'
or - to avoid matching #+word chars in emails:
r'\B#\w*handle\w*'
See the Regex 1 demo and the Regex 2 demo (the \B non-word boundary requires a non-word char or start of string to be right before the #).
Note that .* is a greedy dot pattern that matches any characters other than newline, as many as possible. \w* also matches as many characters as possible (0 or more), but only from the [a-zA-Z0-9_] set if the re.UNICODE flag is not used (and it is not used in your code). In Python 3, \w in a str pattern matches Unicode word characters by default unless re.ASCII is set.
Python demo:
import re
p = re.compile(r'#\w*handle\w*')
test_str = "#handle what is your problem?\nwhat is your problem #handle?\n#123handle what is your problem #handle123?\n"
print(p.findall(test_str))
# => ['#handle', '#handle', '#123handle', '#handle123']
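The effect of the leading \B can be sketched with a string that contains an email-like token. The sample text is hypothetical, not from the question.

```python
import re

text = "mail me at user#handle@example.com or ping #handle123"

without_boundary = re.findall(r'#\w*handle\w*', text)
with_boundary = re.findall(r'\B#\w*handle\w*', text)

print(without_boundary)  # ['#handle', '#handle123']
print(with_boundary)     # ['#handle123']
```

With \B, the # right after the word char r in user#handle is skipped, because there is a word boundary between r and #.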
Matches only handles that contain this range of characters -> /[a-zA-Z0-9_]/.
import re
s = "#123handle what is your problem #handle123?"
print(re.findall(r'\B(#[\w\d_]+)', s))
# => ['#123handle', '#handle123']
s = '#The quick brown fox#jumped over the LAAZY #_dog.'
print(re.findall(r'\B(#[\w\d_]+)', s))
# => ['#The', '#_dog']
Given the following simple regular expression, whose goal is to capture the text between quote characters:
regexp = '"?(.+)"?'
When the input is something like:
"text"
The capturing group(1) has the following:
text"
I expected the group(1) to have text only (without the quotes). Could somebody explain what's going on and why the regular expression is capturing the " symbol even when it's outside the capturing group #1. Another strange behavior that I don't understand is why the second quote character is captured but not the first one given that both of them are optional. Finally I fixed it by using the following regex, but I would like to understand what I'm doing wrong:
regexp = '"?([^"]+)"?'
Quantifiers in regular expressions are greedy: they try to match as much text as possible. Because your last " is optional (you wrote "? in your regular expression), the .+ will match it.
Using [^"] is one acceptable solution. The drawback is that your string cannot contain " characters (which may or may not be desirable, depending on the case).
Another is to make " required:
regexp = '"(.+)"'
Another one is to make the + non-greedy, by using +?. However you also need to add anchors ^ and $ (or similar, depending on the context), otherwise it'll match only the first character (t in the case of "test"):
regexp = '^"?(.+?)"?$'
This regular expression allows " characters to be in the middle of the string, so that "t"e"s"t" will result in t"e"s"t being captured by the group.
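The variants discussed above can be compared in one go. This is just a sketch; the dictionary labels are mine.

```python
import re

patterns = {
    'greedy, optional quotes': r'"?(.+)"?',
    'negated class': r'"?([^"]+)"?',
    'required quotes': r'"(.+)"',
    'lazy, unanchored': r'"?(.+?)"?',
    'lazy, anchored': r'^"?(.+?)"?$',
}

results = {label: re.search(p, '"text"').group(1) for label, p in patterns.items()}
for label, captured in results.items():
    print(label, '->', captured)

# Inner quotes survive with the anchored lazy pattern.
inner = re.search(patterns['lazy, anchored'], '"t"e"s"t"').group(1)
print(inner)  # t"e"s"t
```

Only the first pattern keeps the trailing quote, and only the unanchored lazy one stops after a single character.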
why the regular expression is capturing the " symbol even when it's outside the capturing group #1
The "?(.+)"? pattern contains a greedy dot matching subpattern. A . can match a ", too. The "? is an optional subpattern. It means that if the previous subpattern is greedy (and .+ is a greedy subpattern) and can match the subsequent subpattern (and . can match a "), the .+ will take over that optional value.
The negated character class is a correct way to match any characters but a certain one/range(s) of characters. [^"] will never match a ", so the last " will never get matched with this pattern.
why the second quote character is captured but not the first one given that both of them are optional
The first "? comes before the greedy dot matching pattern. The engine sees the " (if it is in the string) and matches the quote with the first "?.
.+ is greedy. It'll collect everything including the ". Your final "? doesn't require that a quote be present, hence .+ includes the quote.
The first quote isn't captured because it's matched by the "?
The regexp is greedy by default: it will try to match as much as possible, as soon as possible.
Since your capturing group contains .+, it will match the closing quote before the "? is tried. Then, on exiting the group, the engine is already at the end of your line, where the optional "? matches nothing.
.+ matches any character for as long as it can (including the "). When it reaches the end of the input, the "? still matches because the " is optional.
You should use "non greedy":
"(.+?)"