Allowing escape sequences in my regular expression - python

I'm trying to create a regular expression which finds occurences of $VAR or ${VAR}. If something like \$VAR or \${VAR} was given, it would not match. If it were given something like \\$VAR or \\${VAR} or any multiple of 2 \'s, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?

The second groupings of double slashes are written as a redundant character class [\\\\]*, matching one or more backslashes, but should be a repeating group ((?:\\\\)*) matching one or more sets of two backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)

To write a regular expression that finds $ unless it is escaped using E unless it in turn is also escaped EE:
import re
values = dict(BLOB='some value')
def repl(m):
return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using E instead of '\\\\' exposes your embed language without thinking about backslashes in Python string literals and regular expression patterns.

Related

Regex find content in between single quotes, but only if contains certain word

I want to get the content between single quotes, but only if it contains a certain word (i.e 'sample_2'). It additionally should not match ones with white space.
Input example: (The following should match and return only: ../sample_2/file and sample_2/file)
['asdf', '../sample_2/file', 'sample_2/file', 'example with space', sample_2, sample]
Right now I just have that matched the first 3 items in the list:
'(.\S*?)'
I can't seem to find the right regex that would return those containing the word 'sample_2'
If you want specific words/characters you need to have them in the regular expression and not use the '\S'. The \S is the equivalent to [^\r\n\t\f\v ] or "any non-whitespace character".
import re
teststr = "['asdf', '../sample_2/file', 'sample_2/file', 'sample_2 with spaces','example with space', sample_2, sample]"
matches = re.findall(r"'([^\s']*sample_2[^\s]*?)',", teststr)
# ['../sample_2/file', 'sample_2/file']
Based on your wording, you suggest the desired word can change. In that case, I would recommend using re.compile() to dynamically create a string which then defines the regular expression.
import re
word = 'sample_2'
teststr = "['asdf', '../sample_2/file', 'sample_2/file', ' sample_2 with spaces','example with space', sample_2, sample]"
regex = re.compile("'([^'\\s]*"+word+"[^\\s]*?)',")
matches = regex.findall(teststr)
# ['../sample_2/file', 'sample_2/file']
Also if you haven't heard of this tool yet, check out regex101.com. I always build my regular expressions here to make sure I get them correct. It gives you the references, explanation of what is happening and even lets you test it right there in the browser.
Explanation of regex
regex = r"'([^\s']*sample_2[^\s]*?)',"
Find first apostrophe, start group capture. Capture anything except a whitespace character or the corresponding ending apostrophe. It must see the letters "sample_2" before accepting any non-whitespace character. Stop group capture when you see the closing apostrophe and a comma.
Note: In python, a string " or ' prepositioned with the character 'r' means the text is compiled as a regular expression. Strings with the character 'r' also do not require double-escape '\' characters.

how to make a list in python from a string and using regular expression [duplicate]

I have a sample string <alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card] ...>, created=1324336085, description='Customer for My Test App', livemode=False>
I only want the value cus_Y4o9qMEZAugtnW and NOT card (which is inside another [])
How could I do it in easiest possible way in Python?
Maybe by using RegEx (which I am not good at)?
How about:
import re
s = "alpha.Customer[cus_Y4o9qMEZAugtnW] ..."
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
For me this prints:
cus_Y4o9qMEZAugtnW
Note that the call to re.search(...) finds the first match to the regular expression, so it doesn't find the [card] unless you repeat the search a second time.
Edit: The regular expression here is a python raw string literal, which basically means the backslashes are not treated as special characters and are passed through to the re.search() method unchanged. The parts of the regular expression are:
\[ matches a literal [ character
( begins a new group
[A-Za-z0-9_] is a character set matching any letter (capital or lower case), digit or underscore
+ matches the preceding element (the character set) one or more times.
) ends the group
\] matches a literal ] character
Edit: As D K has pointed out, the regular expression could be simplified to:
m = re.search(r"\[(\w+)\]", s)
since the \w is a special sequence which means the same thing as [a-zA-Z0-9_] depending on the re.LOCALE and re.UNICODE settings.
You could use str.split to do this.
s = "<alpha.Customer[cus_Y4o9qMEZAugtnW] active_card=<alpha.AlphaObject[card]\
...>, created=1324336085, description='Customer for My Test App',\
livemode=False>"
val = s.split('[', 1)[1].split(']')[0]
Then we have:
>>> val
'cus_Y4o9qMEZAugtnW'
This should do the job:
re.match(r"[^[]*\[([^]]*)\]", yourstring).groups()[0]
your_string = "lnfgbdgfi343456dsfidf[my data] ljfbgns47647jfbgfjbgskj"
your_string[your_string.find("[")+1 : your_string.find("]")]
courtesy: Regular expression to return text between parenthesis
You can also use
re.findall(r"\[([A-Za-z0-9_]+)\]", string)
if there are many occurrences that you would like to find.
See also for more info:
How can I find all matches to a regular expression in Python?
You can use
import re
s = re.search(r"\[.*?]", string)
if s:
print(s.group(0))
How about this ? Example illusrated using a file:
f = open('abc.log','r')
content = f.readlines()
for line in content:
m = re.search(r"\[(.*?)\]", line)
print m.group(1)
Hope this helps:
Magic regex : \[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
This snippet should work too, but it will return any text enclosed within "[]"
re.findall(r"\[([a-zA-Z0-9 ._]*)\]", your_text)

Python pattern match a string

I am trying to pattern match a string, so that if it ends in the characters 'std' I split the last 6 characters and append a different prefix.
I am assuming I can do this with regular expressions and re.split, but I am unsure of the correct notation to append a new prefix and take last 6 chars based on the presence of the last 3 chars.
regex = r"([a-zA-Z])"
if re.search(regex, "std"):
match = re.search(regex, "std")
#re.sub(r'\Z', '', varname)
You're confused about how to use regular expressions here. Your code is saying "search the string 'std' for any alphanumeric character".
But there is no need to use regexes here anyway. Just use string slicing, and .endswith:
if my_string.endswith('std'):
new_string = new_prefix + mystring[-6:]
No need for a regex. Just use standard string methods:
if s.endswith('std'):
s = s[:-6] + new_suffix
But if you had to use a regex, you would substitute a regex, you would substitute the new suffix in:
regex = re.compile(".{3}std$")
s = regex.sub(new_suffix, s)

repetition in regular expression in python

I've got a file with lines for example:
aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj
I need to take what is inside $$ so expected result is:
$bb$
$ddd$
$ggg$
$iii$
My result:
$bb$
$ggg$
My solution:
m = re.search(r'$(.*?)$', line)
if m is not None:
print m.group(0)
Any ideas how to improve my regexp? I was trying with * and + sign, but I'm not sure how to finally create it.
I was searching for similar post, but couldnt find it :(
You can use re.findall with r'\$[^$]+\$' regex:
import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']
See Python demo
Note that you need to escape $s and remove the capturing group for the re.findall to return the $...$ substrings, not just what is inside $s.
Pattern details:
\$ - a dollar symbol (literal)
[^$]+ - 1 or more symbols other than $
\$ - a literal dollar symbol.
NOTE: The [^$] is a negated character class that matches any char but the one(s) defined in the class. Using a negated character class here speeds up matching since .*? lazy dot pattern expands at each position in the string between two $s, thus taking many more steps to complete and return a match.
And a variation of the pattern to get only the texts inside $...$s:
re.findall(r'\$([^$]+)\$', line)
^ ^
See another Python demo. Note the (...) capturing group added so that re.findall could only return what is captured, and not what is matched.
re.search finds only the first match. Perhaps you'd want re.findall, which returns list of strings, or re.finditer that returns iterator of match objects. Additionally, you must escape $ to \$, as unescaped $ means "end of line".
Example:
>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
One more improvement would be to use [^$]* instead of .*?; the former means "zero or more any characters besides $; this can potentially avoid more pathological backtracking behaviour.
Your regex is fine. re.search only finds the first match in a line. You are looking for re.findall, which finds all non-overlapping matches. That last bit is important for you since you have the same start and end delimiter.
for m in m = re.findall(r'$(.*?)$', line):
if m is not None:
print m.group(0)

Use python 3 regex to match a string in double quotes

I want to match a string contained in a pair of either single or double quotes. I wrote a regex pattern as so:
pattern = r"([\"\'])[^\1]*\1"
mytext = '"bbb"ccc"ddd'
re.match(pattern, mytext).group()
The expected output would be:
"bbb"
However, this is the output:
"bbb"ccc"
Can someone explain what's wrong with the pattern above? I googled and found the correct pattern to be:
pattern = r"([\"\'])[^\1]*?\1"
However, I don't understand why I must use ?.
In your regex
([\"'])[^\1]*\1
Character class is meant for matching only one character. So your use of [^\1] is incorrect. Think, what would have have happened if there were more than one characters in the first capturing group.
You can use negative lookahead like this
(["'])((?!\1).)*\1
or simply with alternation
(["'])(?:[^"'\\]+|\\.)*\1
or
(?<!\\)(["'])(?:[^"'\\]+|\\.)*\1
if you want to make sure "b\"ccc" does not matches in string bb\"b\"ccc"
You should use a negative lookahead assertion. And I assume there won't be any escaped quotes in your input string.
>>> pattern = r"([\"'])(?:(?!\1).)*\1"
>>> mytext = '"bbb"ccc"ddd'
>>> re.search(pattern, mytext).group()
'"bbb"'
You can use:
pattern = r"[\"'][^\"']*[\"']"
https://regex101.com/r/dO0cA8/1
[^\"']* will match everything that isn't " or '

Categories

Resources