Capture strings inside escaped quotes - python

I have 3 strings in this format
Bank: {"955974044748481":["BANK_A"]}
{"reason": "Bank: {"455049295219902":["BANK_B"]}"}
{"reason": "Bank: {\\"1876212592475597\\":[\\"BANK_C\\"]}"}
I need to extract the bank_id and bank_name from these strings using a single regex in a presto SQL statement.
I have tried this regex but it only captures the first two and not the last one which has escape characters. https://regex101.com/r/ejW68x/1
Bank: {"(.*)":\["(.*)"\]}
What's the right way to capture all 3 variations?

How about something like this:
Bank:.*{(?:\\\\)?"([^{"]*?)(?:\\\\)?":\[(?:\\\\)?"(.*?)(?:\\\\)?"\]}
Demo.
Or to make sure the \\ are only matched in pairs:
Bank:.*{((?:\\\\)?)"([^{"]*?)\1":\[((?:\\\\)?)"(.*?)\3"\]}
Demo.
Note that in the second case, your captures will be in groups #2 and #4.
Update:
Your new test strings would still be matched by the above patterns. You may just replace Bank:.* with Bank:[ ] if you like. Demo1 - Demo2.
Explanation: (changes to your pattern)
Added (?:\\\\)? --> An optional non-capturing group to match the two backslash characters.
Replaced your first capturing group (.*) with ([^{"]*?) to avoid matching double-quote and { characters (this is especially necessary for your first test strings). Also, converted it from greedy to lazy (by adding ?) to avoid capturing the escaping characters (\\) if present.
Made the second capturing group lazy as well (.*?) for the same reason.
In the second pattern, (?:\\\\)? was added to a capturing group so that a backreference can be used (i.e., \1 and \3). The purpose of this is to only match if both the double-quote characters are escaped (preceded by \\).
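To check the second pattern outside Presto, it can be run through Python's re module. This is only a sketch under a few assumptions: the helper name extract_bank is made up for illustration, the braces are escaped to be explicit, and Presto's regex flavor may differ from Python's in details.

```python
import re

# Second pattern from the answer; braces are escaped here to be explicit.
# Groups 1 and 3 hold the optional pair of backslashes, groups 2 and 4 the data.
BANK_RE = re.compile(r'Bank:.*\{((?:\\\\)?)"([^{"]*?)\1":\[((?:\\\\)?)"(.*?)\3"\]\}')

def extract_bank(text):
    """Return (bank_id, bank_name), or None if the text does not match."""
    m = BANK_RE.search(text)
    return (m.group(2), m.group(4)) if m else None

# The three variations from the question, written as raw strings.
samples = [
    r'Bank: {"955974044748481":["BANK_A"]}',
    r'{"reason": "Bank: {"455049295219902":["BANK_B"]}"}',
    r'{"reason": "Bank: {\\"1876212592475597\\":[\\"BANK_C\\"]}"}',
]
results = [extract_bank(s) for s in samples]
print(results)
```

Because the backreferences \1 and \3 repeat whatever the opening quote was preceded by, a string with only one of the two quotes escaped would not match.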


How to ignore comments inside string literals

I'm doing a lexer as a part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how could we implement comments inside string literals.
Our string literals start and end with exclamation mark. e.g. !this is a string literal!
Our comments start and end with three periods. e.g. ...This is a comment...
Removing comments from string literals was relatively straightforward. Just match the string literal via /!.*!/ and remove the comment via regex. If there are three consecutive periods but no closing periods, throw an error.
However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.
What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?
Examples:
!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !
This is my current code that can't ignore exclamation marks inside comments:
def t_STRING_LITERAL(t):
    r'![^!\\]*(?:\\.[^!\\]*)*!'
    # remove the escape characters from the string
    t.value = re.sub(r'\\!', "!", t.value)
    # remove single line comments
    t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
    return t
Perhaps this might be another option.
Match 0+ times any character except a backslash, dot or exclamation mark using the first negated character class.
Then, when you reach a character that the first character class does not match, use an alternation to match either:
repeat 0+ times matching either a dot that is not directly followed by 2 dots
or match from 3 dots to the next first match of 3 dots
or match only an escaped character
To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, then use the backreference to \1 to match.
For example
(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!
Explanation
(?<!\\)! Match ! not directly preceded by \
[^!\\.]* Match 0+ times any char except ! \ or .
(?: Non capture group
(?:\.(?!\.\.) Match a dot not directly followed by 2 dots
| Or
(?=(\.{3}.*?\.{3}))\1 Assert and capture in group 1 from ... to the nearest ...
| Or
\\. Match an escaped char
) Close group
[^!\\.]* Match 0+ times any char except ! \ or .
)*! Close non capture group and repeat 0+ times, then match !
Regex demo
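A minimal Python sketch of how this pattern could be used to recover the value of a literal: match it, strip the comments, then unescape the exclamation marks. The helper name literal_value and the order of the clean-up steps are assumptions for illustration, not part of the answer, and this is not wired into the lexer from the question.

```python
import re

# Pattern from the answer: a literal delimited by unescaped "!",
# where "..." comments may contain unescaped "!" characters.
STRING_LITERAL = re.compile(
    r'(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!'
)

def literal_value(source):
    """Return the cleaned value of the first string literal, or None."""
    m = STRING_LITERAL.search(source)
    if m is None:
        return None
    body = m.group(0)[1:-1]                    # drop the delimiting "!"
    body = re.sub(r'\.{3}.*?\.{3}', '', body)  # strip comments first
    return body.replace('\\!', '!')            # then unescape "\!"

examples = [
    r'!Normal string!',
    r'!String with escaped \! exclamation mark!',
    r'!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]
values = [literal_value(e) for e in examples]
print(values)
```

Comments are removed before unescaping so that an exclamation mark inside a comment never reaches the unescape step.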
Look at this regex to match string literals: https://regex101.com/r/v2bjWi/2.
(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!.
It is surrounded by two (?<!\\)! meaning unescaped exclamation mark,
It consists of alternating escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.) and non-exclamation marks [^!].
Note that this is about as much as you can achieve with a regular expression. Any additional request, and it will not be sufficient any more.
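For what it's worth, here is a small Python check of this second pattern against the last example from the question. It is only a sketch; the trailing period shown after the pattern above is treated as sentence punctuation and left out, which is an assumption.

```python
import re

# Pattern from the answer, without the sentence-ending period.
LITERAL = re.compile(
    r'(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!'
)

text = r'!String \! with both ... comments can have unescaped exclamation marks!!!... !'
m = LITERAL.search(text)

whole = m.group(0)            # the complete literal, delimiters included
comment = m.group('comment')  # text of the comment (only the last one is kept
                              # if a literal contains several comments)
print(whole == text, repr(comment))
```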

Need expression not to match after a colon appear

So I have a list of names and wanted to filter out the ones in proper format. For reference, the format I need is IP::hostname. This is the regex formula I currently have:
^\d+(\.|\:)\d+\.\d+\.\d+::.+\w$
However, I need to modify it so that if there are any colons (:) in or after the hostname, for it to not match the expression:
This matches which is correct:
10.179.12.241::CALMGTVCSRM0210
This matches but should not:
10.179.12.241::CALMGTVCSRM0210:as
Any help on how to modify my expression to not match any colons after the host name would be appreciated
The .+ pattern matches 1 or more chars other than line break chars, as many as possible, and thus matches colons allowing them. You need a negated character class, [^:]*, that will match 0+ chars other than a colon.
You may fix your regex (and enhance it a bit) using
^\d+[.:]\d+\.\d+\.\d+::[^:]*\w$
                       ^^^^^
See the regex demo
To make sure you want to match a valid IP you'd rather use
^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}::[^:]*\w$
See another regex demo (IP regex source). The (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a single octet from 0 to 255 and (?:\.<octet_pattern>){3} matches three repetitions of a dot and an octet pattern.
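A short Python sketch of the stricter pattern, built from the octet sub-pattern so the repetition is easier to read. The names OCTET, IP_HOST and is_valid are illustrative only.

```python
import re

# A single octet from 0 to 255, as in the answer.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Four octets, "::", then a hostname without colons that ends in a word character.
IP_HOST = re.compile(r'^' + OCTET + r'(?:\.' + OCTET + r'){3}::[^:]*\w$')

def is_valid(name):
    """True if name looks like IP::hostname with no colon in the hostname."""
    return IP_HOST.match(name) is not None

print(is_valid('10.179.12.241::CALMGTVCSRM0210'))     # True
print(is_valid('10.179.12.241::CALMGTVCSRM0210:as'))  # False
```

The second name is rejected because [^:]* cannot cross the colon before "as", so the end-of-string anchor is never reached.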

unexpected result for python re.sub() with non-capturing character

I cannot understand the following output :
import re
re.sub(r'(?:\s)ff','fast-forward',' ff')
'fast-forward'
According to the documentation :
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.
So why is the whitespace included in the captured occurrence, and then replaced, since I added a non-capturing tag before it?
I would like to have the following output :
' fast-forward'
The non-capturing group still matches and consumes the matched text. Note that consuming means adding the matched text to the match value (the memory buffer allotted for the whole matched substring) and advancing the regex index accordingly. So, (?:\s) puts the whitespace into the match value, and it is replaced together with the ff.
You want to use a look-behind to check for a pattern without consuming it:
re.sub(r'(?<=\s)ff','fast-forward',' ff')
See the regex demo.
An alternative to this approach is using a capturing group around the part of the pattern one needs to keep and a replacement backreference in the replacement pattern:
re.sub(r'(\s)ff',r'\1fast-forward',' ff')
         ^  ^      ^^
Here, (\s) saves the whitespace in Group 1 memory buffer and \1 in the replacement retrieves it and adds to the replacement string result.
See the Python demo:
import re
print('"{}"'.format(re.sub(r'(?<=\s)ff','fast-forward',' ff')))
# => " fast-forward"
A non-capturing group still matches the pattern it contains. What you wanted to express was a look-behind, which does not match its pattern but simply asserts it is present before your match.
Although, if you are to use a look-behind for whitespace, you might want to consider using a word boundary metacharacter \b instead. It matches the empty string between a \w and a \W character, asserting that your pattern is at the beginning of a word.
import re
re.sub(r'\bff\b', 'fast-forward', ' ff') # ' fast-forward'
Adding a trailing \b will also make sure that you only match 'ff' as a whole word, not at the beginning of a longer word such as in 'ffoo'.
See the demo.
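The approaches discussed in this thread can be compared side by side. The sketch below uses made-up variable names and two extra inputs ('ff' at the start of the string and ' ffoo') that are not from the question, to show where the look-behind and the word boundary differ.

```python
import re

repl = 'fast-forward'

consumed   = re.sub(r'(?:\s)ff', repl, ' ff')        # whitespace is part of the match
lookbehind = re.sub(r'(?<=\s)ff', repl, ' ff')       # whitespace is only asserted
backref    = re.sub(r'(\s)ff', r'\1' + repl, ' ff')  # whitespace captured and restored
boundary   = re.sub(r'\bff\b', repl, ' ff')          # word boundaries on both sides

# Differences show up at the start of the string and inside longer words.
at_start_lookbehind = re.sub(r'(?<=\s)ff', repl, 'ff')
at_start_boundary   = re.sub(r'\bff\b', repl, 'ff')
prefix_lookbehind   = re.sub(r'(?<=\s)ff', repl, ' ffoo')
prefix_boundary     = re.sub(r'\bff\b', repl, ' ffoo')

print(repr(consumed), repr(lookbehind), repr(backref), repr(boundary))
print(repr(at_start_lookbehind), repr(at_start_boundary))
print(repr(prefix_lookbehind), repr(prefix_boundary))
```

The look-behind requires an actual whitespace character before ff, so it leaves a bare 'ff' untouched, while \b also matches at the very start of the string.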

python regex match a group or not match it

I want to match the string:
from string as string
It may or may not contain as.
The current code I have is
r'(?ix) from [a-z0-9_]+ [as ]* [a-z0-9_]+'
But this code matches a single a or s. So something like from string a little will also be in the result.
I wonder what is the correct way of doing this.
You may use
(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+
See the regex demo
Note that you use x "verbose" (free spacing) modifier, and all spaces in your pattern became formatting whitespaces that the re engine omits when parsing the pattern. Thus, I suggest using \s+ to match 1 or more whitespaces. If you really want to use single regular spaces, just omit the x modifier and use the regular space. If you need the x modifier to insert comments, escape the regular spaces:
r'(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+'
Also, to match a sequence of chars, you need to use a grouping construct rather than a character class. Here, (?:as\s+)? defines an optional non-capturing group that matches 1 or 0 occurrences of as + space substring.
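A brief Python sketch of the points above. The constant names are illustrative, and the third pattern is a shortened variant added only to show what the x flag does to unescaped spaces.

```python
import re

# Suggested pattern: \s+ between tokens, optional "as" as a group.
PLAIN = re.compile(r'(?i)from\s+[a-z0-9_]+\s+(?:as\s+)?[a-z0-9_]+')

# Same idea with the x flag kept; literal spaces must be escaped.
VERBOSE = re.compile(r'(?ix) from\ [a-z0-9_]+\ (?:as\ )?[a-z0-9_]+')

# Unescaped spaces are dropped under the x flag,
# so this pattern is effectively "from[a-z0-9_]+".
UNESCAPED = re.compile(r'(?ix) from [a-z0-9_]+')

with_as = PLAIN.fullmatch('from string as string')
without_as = PLAIN.fullmatch('FROM string string')
verbose_ok = VERBOSE.fullmatch('from string as string')
space_lost = UNESCAPED.search('from string')   # no match: the space is not in the pattern
glued = UNESCAPED.search('fromstring')         # matches, since no space is required

print(bool(with_as), bool(without_as), bool(verbose_ok), space_lost, bool(glued))
```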

Regex, better way

How do you separate a regex, that could be matched multiple times within a string, if the delimiter is within the string, ie:
Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)
With the regex: "'.+'(\S+)"
It would match everything from 'Bang ... to (BBB) instead of matching 'Bang bang swing'(BBS) and 'Bing Bong Bin'(BBB) separately.
I have a manner of making this work with regex: '[A-z0-9-/?|q~`!##$%^&*()_-=+ ]+'(\S+)
But this is excessive, and honestly I hate that it even works correctly.
I'm fairly new to regexes, and beginning with Python's implementation of them is apparently not the smartest way to start.
To get a substring from one character up to another character, where neither can appear in-between, you should always consider using negated character classes.
The [negated] character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don't want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
So, you can use
'[^']*'\([^()]*\)
See regex demo
Here,
'[^']*' - matches ' followed by 0 or more characters other than ' and then followed by a ' again
\( - matches a literal ( (it must be escaped)
[^()]* - matches 0 or more characters other than ( and ) (they do not have to be escaped inside a character class)
\) - matches a literal ) (must be escaped outside a character class).
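The difference between the greedy dot and the negated character class can be seen on the sample string from the question. In this sketch the greedy pattern is written with escaped parentheses, which is an assumption about what the question intended.

```python
import re

text = "Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)"

# Greedy dot: runs from the first quote to the last "(...)" in the string.
greedy = re.findall(r"'.+'\(\S+\)", text)

# Negated classes: each match stops at the nearest quote and parenthesis.
negated = re.findall(r"'[^']*'\([^()]*\)", text)

print(greedy)
print(negated)
```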
If you might have 1 or more single quotes inside the quoted text before the (...) part, you will need an unrolled lazy matching regex:
'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)
See regex demo.
Here, the '[^']*(?:'(?!\([^()]*\))[^']*)*' is matching the same as '.*?' with DOTALL flag, but is much more efficient due to the linear regex execution. See more about unrolling regex technique here.
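A small sketch of where the unrolled pattern makes a difference. The input with an apostrophe inside the quoted text is a made-up example, not taken from the question.

```python
import re

SIMPLE   = r"'[^']*'\([^()]*\)"
UNROLLED = r"'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)"

# Hypothetical input: the quoted part itself contains a single quote.
text = "'It's here'(X)"

simple_match = re.search(SIMPLE, text).group(0)      # starts at the inner quote
unrolled_match = re.search(UNROLLED, text).group(0)  # keeps the whole quoted part

# On the sample from the question, the unrolled pattern
# still yields the two separate matches.
sample = "Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)"
unrolled_all = re.findall(UNROLLED, sample)

print(simple_match, unrolled_match, unrolled_all)
```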
EDIT:
When input strings are not complex and short, lazy dot matching turns out more efficient. However, when complexity grows, lazy dot matching may cause issues.
How about this regular expression
'.+?'\(\S+\)
