Need expression not to match after a colon appears - python

So I have a list of names and wanted to filter out the ones in proper format. For reference, the format I need is IP::hostname. This is the regex formula I currently have:
^\d+(\.|\:)\d+\.\d+\.\d+::.+\w$
However, I need to modify it so that it does not match if there are any colons (:) in or after the hostname.
This matches which is correct:
10.179.12.241::CALMGTVCSRM0210
This matches but should not:
10.179.12.241::CALMGTVCSRM0210:as
Any help on how to modify my expression to not match any colons after the host name would be appreciated

The .+ pattern matches 1 or more chars other than line break chars, as many as possible, and thus matches colons allowing them. You need a negated character class, [^:]*, that will match 0+ chars other than a colon.
You may fix your regex (and enhance it a bit) using
^\d+[.:]\d+\.\d+\.\d+::[^:]*\w$
See the regex demo
To make sure you match only a valid IP, you'd rather use
^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}::[^:]*\w$
See another regex demo (IP regex source). The (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a single octet from 0 to 255 and (?:\.<octet_pattern>){3} matches three repetitions of a dot and an octet pattern.
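As a quick check of the three patterns discussed above, here is a minimal Python sketch. The first two names are the samples from the question; the third (with the octet 999) and the helper keep are made up for illustration only.

```python
import re

# Original pattern from the question: .+ happily consumes extra colons.
LOOSE = re.compile(r"^\d+(\.|\:)\d+\.\d+\.\d+::.+\w$")

# Fixed pattern: [^:]* refuses any colon after the :: separator.
FIXED = re.compile(r"^\d+[.:]\d+\.\d+\.\d+::[^:]*\w$")

# Stricter variant: each octet must be in the range 0..255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
STRICT = re.compile(r"^" + OCTET + r"(?:\." + OCTET + r"){3}::[^:]*\w$")

names = [
    "10.179.12.241::CALMGTVCSRM0210",
    "10.179.12.241::CALMGTVCSRM0210:as",
    "999.179.12.241::CALMGTVCSRM0210",  # hypothetical entry with an invalid octet
]


def keep(pattern, candidates):
    """Return only the candidates that the compiled pattern accepts."""
    return [c for c in candidates if pattern.match(c)]


print(keep(LOOSE, names))   # accepts all three
print(keep(FIXED, names))   # drops the one with the trailing ':as'
print(keep(STRICT, names))  # additionally drops the 999 octet
```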

Related

Capture strings inside escaped quotes

I have 3 strings in this format
Bank: {"955974044748481":["BANK_A"]}
{"reason": "Bank: {"455049295219902":["BANK_B"]}"}
{"reason": "Bank: {\\"1876212592475597\\":[\\"BANK_C\\"]}"}
I need to extract the bank_id and bank_name from these strings using a single regex in a presto SQL statement.
I have tried this regex but it only captures the first two and not the last one which has escape characters. https://regex101.com/r/ejW68x/1
Bank: {"(.*)":\["(.*)"\]}
What's the right way to capture all 3 variations?
How about something like this:
Bank:.*{(?:\\\\)?"([^{"]*?)(?:\\\\)?":\[(?:\\\\)?"(.*?)(?:\\\\)?"\]}
Demo.
Or to make sure the \\ are only matched in pairs:
Bank:.*{((?:\\\\)?)"([^{"]*?)\1":\[((?:\\\\)?)"(.*?)\3"\]}
Demo.
Note that in the second case, your captures will be in groups #2 and #4.
Update:
Your new test strings would still be matched by the above patterns. You may just replace Bank:.* with Bank:[ ] if you like. Demo1 - Demo2.
Explanation: (changes to your pattern)
Added (?:\\\\)? --> An optional non-capturing group to match the two backslash characters.
Replaced your first capturing group (.*) with ([^{"]*?) to avoid matching double-quote and { characters (this is especially necessary for your first test strings). Also, converted it from greedy to lazy (by adding ?) to avoid capturing the escaping characters (\\) if present.
Made the second capturing group lazy as well (.*?) for the same reason.
In the second pattern, (?:\\\\)? was added to a capturing group so that a backreference can be used (i.e., \1 and \3). The purpose of this is to only match if both the double-quote characters are escaped (preceded by \\).
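The question is about Presto SQL, but the backreference variant can be tried out in Python's re module first. This is only a sketch: the braces are escaped here for readability, the helper extract and the last "mismatched" string are made up for illustration, and Presto's regex engine may differ in details.

```python
import re

# Second pattern from the answer, with the literal braces escaped.
# Groups 1 and 3 capture an optional pair of backslashes; \1 and \3
# require the same escaping in front of the closing quote.
BANK = re.compile(r'Bank:.*\{((?:\\\\)?)"([^{"]*?)\1":\[((?:\\\\)?)"(.*?)\3"\]\}')

samples = [
    'Bank: {"955974044748481":["BANK_A"]}',
    '{"reason": "Bank: {"455049295219902":["BANK_B"]}"}',
    r'{"reason": "Bank: {\\"1876212592475597\\":[\\"BANK_C\\"]}"}',
]


def extract(text):
    """Return (bank_id, bank_name) from groups 2 and 4, or None if no match."""
    m = BANK.search(text)
    return (m.group(2), m.group(4)) if m else None


for s in samples:
    print(extract(s))

# Hypothetical input where only the opening quote is escaped: rejected.
print(extract(r'Bank: {\\"123":["X"]}'))
```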

python regex match a group or not match it

I want to match the string:
from string as string
It may or may not contain as.
The current code I have is
r'(?ix) from [a-z0-9_]+ [as ]* [a-z0-9_]+'
But this code matches a single a or s. So something like from string a little will also be in the result.
I wonder what is the correct way of doing this.
You may use
(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+
See the regex demo
Note that you use x "verbose" (free spacing) modifier, and all spaces in your pattern became formatting whitespaces that the re engine omits when parsing the pattern. Thus, I suggest using \s+ to match 1 or more whitespaces. If you really want to use single regular spaces, just omit the x modifier and use the regular space. If you need the x modifier to insert comments, escape the regular spaces:
r'(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+'
Also, to match a sequence of chars, you need to use a grouping construct rather than a character class. Here, (?:as\s+)? defines an optional non-capturing group that matches 1 or 0 occurrences of as + space substring.
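To see the effect of the x modifier on plain spaces, here is a small sketch. The names PLAIN, VERBOSE and BROKEN are just labels chosen for this example.

```python
import re

# Non-verbose form from the answer: \s+ stands in for the spaces.
PLAIN = re.compile(r"(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+")

# Verbose form: literal spaces must be escaped or the engine drops them.
VERBOSE = re.compile(r"(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+")

# Unescaped spaces under (?x) vanish, so this is really 'from[a-z0-9_]+'.
BROKEN = re.compile(r"(?ix) from [a-z0-9_]+")

for text in ("from string as string", "FROM string string"):
    print(text, bool(PLAIN.fullmatch(text)), bool(VERBOSE.fullmatch(text)))

# No match: the pattern wants a word character right after 'from'.
print(BROKEN.search("from string"))
```

Note that as is optional, so an input like from string a little still matches up to from string a; the regex alone cannot tell that a is not a name.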

Find first ReGex pattern following a different pattern

Objective: find a second pattern and consider it a match only if it is the first time the pattern was seen following a different pattern.
Background:
I am using Python-2.7 Regex
I have a specific Regex match that I am having trouble with. I am trying to get the text between the square brackets in the following sample.
Sample comments:
[98 g/m2 Ctrl (No IP) 95 min 340oC ]
[ ]
I need the line:
98 g/m2 Ctrl (No IP) 95 min 340oC
The problem is that the undetermined number of whitespace characters, tabs, and newlines between the search pattern Sample comments: and the match I want is giving me trouble.
Best Attempt:
I am able to match the first part easily,
match = re.findall(r'Sample comments:[.+\n+]+', string)
But I can't get the match to the length I want to grab the portion between the square brackets,
match = re.findall(r'Sample comments:[.+\n+]+\[(.+)\]', string)
My Thinking:
Is there a way to use ReGex to find the first instance of the pattern \[(.+)\] after a match of the pattern Sample comments:? Or is there a more robust way to find the bit between the square braces in my example case.
Thanks,
Michael
I suggest using
r'Sample comments:\s*\[(.*?)\s*]'
See the regex and IDEONE demo
The point is the \s* matches zero or more whitespace, both vertical (linebreaks) and horizontal. See Python re reference:
\s
When the UNICODE flag is not specified, it matches any whitespace character, this is equivalent to the set [ \t\n\r\f\v]. The LOCALE flag has no extra effect on matching of the space. If UNICODE is set, this will match the characters [ \t\n\r\f\v] plus whatever is classified as space in the Unicode character properties database.
Pattern details:
Sample comments: - a sequence of literal chars
\s* - 0 or more whitespaces
\[ - a literal [
(.*?) - Group 1 (returned by re.findall) capturing 0+ any chars but a newline as few as possible up to the first...
\s* - 0+ whitespaces and
] - a literal ] (note it does not have to be escaped outside the character class).
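Here is a short sketch of the pattern breakdown above in action. The exact whitespace in TEXT is an assumption (the question does not show it), and ALT is the negated-class variant suggested in another answer to the same question.

```python
import re

# Assumed input: blank lines, spaces and a tab before the first bracket.
TEXT = "Sample comments:\n\n \t [98 g/m2 Ctrl (No IP) 95 min 340oC ]\n[ ]\n"

# \s* skips any mix of spaces, tabs and newlines before the first '['.
PATTERN = re.compile(r"Sample comments:\s*\[(.*?)\s*]")

# Variant built from negated character classes instead of \s* and .*?
ALT = re.compile(r"Sample comments:[^\[]*\[[ \t]*([^\]]*?)[ \t]*\]")

print(PATTERN.findall(TEXT))
print(ALT.findall(TEXT))
```

Both calls return a one-element list with the trailing space before ] stripped; the second, empty pair of brackets is ignored because it does not directly follow Sample comments:.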
Not sure if I understand your problem correctly, but re.findall('Sample comments:[^\\[]*\\[([^\\]]*)\\]', string) seems to work.
Or maybe re.findall('Sample comments:[^\\[]*\\[[ \t]*([^\\]]*?)[ \t]*\\]', string) if you want to strip the final spaces from your line?

search for string embedded in {} after keyword

How can I get the string embedded in {} after a keyword, where the number of characters between the keyword and the braces {} is unknown. e.g.:
includegraphics[x=2]{image.pdf}
the keyword would be includegraphics and the string to be found is image.pdf, but the text in between [x=2] could have anything between the two [].
So I want to ignore all characters between the keyword and { or I want to ignore everything between []
Use re.findall
>>> sample = 'includegraphics[x=2]{image.pdf}'
>>> re.findall('includegraphics.*?{(.*?)}',sample)
['image.pdf']
Explanation:
The re module deals with regular expressions in Python. Its findall method is useful to find all occurrences of a pattern in a string.
A regular expression for the pattern you are interested in is 'includegraphics.*?{(.*?)}'. Here . symbolizes "any character", while the * means 0 or more times. The question mark makes this a non-greedy operation. From the documentation:
The *, +, and ? qualifiers are all greedy; they match as much
text as possible. Sometimes this behaviour isn’t desired; if the RE
<.*> is matched against <H1>title</H1>, it will match the entire
string, and not just <H1>. Adding ? after the qualifier makes it
perform the match in non-greedy or minimal fashion; as few characters
as possible will be matched. Using .*? in the previous expression will
match only <H1>.
Please note that while in your case using .*? should be fine, in general it's better to use more specialized character classes such as \w for alphanumerics and \d for digits, when you know what the content is going to consist of in advance.
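The difference between greedy and non-greedy matching is easy to see with a sketch like the one below. The two-graphic string two is a made-up example, the braces are escaped for clarity, and the last pattern is just one possible "more specialized" version using negated character classes.

```python
import re

html = "<H1>title</H1>"
print(re.findall(r"<.*>", html))   # greedy: one match spanning the whole string
print(re.findall(r"<.*?>", html))  # lazy: each tag on its own

# Hypothetical line with two graphics, to show why laziness matters.
two = "includegraphics[x=2]{a.pdf} and includegraphics[y=3]{b.pdf}"

greedy = re.findall(r"includegraphics.*\{(.*)\}", two)
lazy = re.findall(r"includegraphics.*?\{(.*?)\}", two)

# Negated classes: optional [...] block, then everything up to the first '}'.
strict = re.findall(r"includegraphics(?:\[[^\]]*\])?\{([^}]*)\}", two)

print(greedy)
print(lazy)
print(strict)
```

The greedy version skips straight to the last {, so it only reports the final file name, while the lazy and negated-class versions report both.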
Use re.search
re.search(r'includegraphics\[[^\[\]]*\]\{([^}]*)\}', s).group(1)

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried
print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
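A small sketch of what * does and does not mean, assuming Python 3 (where a bare * with nothing in front of it is rejected at compile time):

```python
import re

# A '*' with nothing before it to repeat is not a valid pattern.
try:
    re.compile("(*)")
    error = None
except re.error as exc:
    error = exc
print("compile error:", error)

# '*' repeats the previous item: 'ab*' is an 'a' followed by zero or more 'b'.
print(bool(re.fullmatch(r"ab*", "a")))
print(bool(re.fullmatch(r"ab*", "abbb")))

# '.*' is the "match anything" idiom, but only within a single line by default.
print(bool(re.fullmatch(r".*", "anything at all")))
print(bool(re.fullmatch(r".*", "two\nlines")))
```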
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
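If you do want only the part in between, a variant with a capturing group might look like the sketch below. The function name text_between is made up for this example, SAMPLE is the file content from the question, and the lazy .*? is a design choice so that the match stops at the first xyz rather than the last one.

```python
import re

SAMPLE = "\n".join([
    "sjaskdjajldlj_abc:",
    "cdf_asjdl_dlsf1:",
    "dfsflks %jdkeajd",
    "sdjfls:",
    "adkfld %dk_.(%sfj)sdaj, %kjdflajfs",
    "afjdfj _ajhfkdjf",
    "zjddjh -15afjkkd",
    "xyz",
])


def text_between(prefix, suffix, text):
    """Return the text between prefix and suffix, or None if not found."""
    # re.escape makes prefix and suffix literal; (.*?) captures what lies between.
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None


between = text_between("_abc:", "xyz", SAMPLE)
print(between)

# Without re.DOTALL the dot stops at the first newline, so nothing is found.
print(re.search(r"_abc:(.*)xyz", SAMPLE))
```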
