How is it possible to encode """ (triple quotes) into a raw string? - python

How do I encode """ in a raw python string?
The following does not seem to work:
string = r"""\"\"\""""
since when trying to match """ with a regular expression, I have to double-escape the character ":
Returns an empty list:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\"\"\")
""", re.S|re.X)
result = re.findall(regEx, string)
in this case result is an empty list.
This same regular expression returns ['"""'] when I load a string with """ from file content.
Returns double-escaped quotations:
string = r"""\"\"\""""
regEx = re.compile(r"""
(\\"\\"\\")
""", re.S|re.X)
result = re.findall(regEx, string)
now result is equal to ['\\"\\"\\"'].
I want it to be equal to ['"""'].

In general, there are three options:
1. Don't use the r prefix. That's just a convenience to avoid excessive use of double-backslashes in regexes. It isn't required.
2. Use r'…', inside which the " character isn't special.
3. Mix and match r"…" and '…' literals, e.g. pattern = '"""' + r"\s*\d\d-'\d\d'-\d\d\s*" + '"""'
In this instance, you can do both 1 and 2: single quotes and no r prefix.

The simplest way is to just do '"""'.
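As a quick check of the advice above, here is a minimal sketch (the sample text is made up for illustration). It suggests why the pattern in the question finds nothing: the raw literal holds backslashes as well as quotes, while a plain single-quoted literal holds just the three quotes.

```python
import re

# The literal from the question: in a raw string, \" keeps the backslash,
# so this is six characters (three backslash-quote pairs), not three quotes.
raw = r"""\"\"\""""
print(len(raw))                  # 6

# Made-up sample text that really contains three consecutive quotes.
sample = 'docstring starts with """ here'

# Inside single quotes the " character needs no escaping at all.
regex = re.compile(r'(""")')
print(regex.findall(sample))     # ['"""']
print(regex.findall(raw))        # [] since backslashes separate the quotes
```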

Related

Using variables in re.findall() regex function

I have a list of regex patterns like k[a-z]p[a-z]+a
and a list of words that can fit into these patterns. Now, the problem is that,
when I use:
re.findall(r'k[a-z]p[a-z]+a', list)
Everything works properly, but when I replace the raw expression with a variable like:
pattern = "r'" + pattern + "'"
and then try:
re.findall(pattern, list)
or
re.findall(str(pattern), list)
It no longer works. How could I fix it?
Thanks!
Spike
You are overthinking it. The r prefix is not part of the pattern string itself, it merely indicates that the following string should not use escape codes for certain characters.
This will work without adjusting your pattern:
re.findall(pattern, list)
If your pattern contains backslashes, you can add the r prefix to the string literal where the pattern is defined, so they do not need doubling. Suppose you want to search for a different regex, then use
pattern = r'k\wp\wa'
re.findall(pattern, list)
and you don't need to escape it. Since pattern in itself is a perfectly ordinary string, you can concatenate it with other strings:
start = 'a'
middle = 'b'
end = 'c'
pattern = start + r'\w' + middle + r'\w' + end
re.findall(pattern, list)
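To make this concrete, here is a small sketch. The word list and the use of re.fullmatch to test each word are assumptions for illustration, not part of the original question. It shows a pattern held in a variable being passed to re unchanged, and what goes wrong when the text r'…' is glued onto it.

```python
import re

# Hypothetical inputs: patterns arrive as ordinary strings, e.g. read from a file.
patterns = ['k[a-z]p[a-z]+a']
words = ['kapsula', 'kupola', 'kopa', 'tapa']

for pattern in patterns:
    # Pass the string straight to re; no r prefix or extra quoting is needed.
    hits = [w for w in words if re.fullmatch(pattern, w)]
    print(pattern, hits)         # k[a-z]p[a-z]+a ['kapsula', 'kupola']

# Wrapping the pattern in "r'...'" changes what it matches: the regex now
# requires a literal r and quote characters to appear in the text.
wrapped = "r'" + patterns[0] + "'"
print(re.findall(wrapped, 'kapsula'))    # []
```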

Python pattern match a string

I am trying to pattern match a string, so that if it ends in the characters 'std' I split the last 6 characters and append a different prefix.
I am assuming I can do this with regular expressions and re.split, but I am unsure of the correct notation to append a new prefix and take last 6 chars based on the presence of the last 3 chars.
regex = r"([a-zA-Z])"
if re.search(regex, "std"):
    match = re.search(regex, "std")
    #re.sub(r'\Z', '', varname)
You're confused about how to use regular expressions here. Your code is saying "search the string 'std' for any ASCII letter".
But there is no need to use regexes here anyway. Just use string slicing and .endswith:
if my_string.endswith('std'):
    new_string = new_prefix + my_string[-6:]
No need for a regex. Just use standard string methods:
if s.endswith('std'):
    s = s[:-6] + new_suffix
But if you had to use a regex, you would substitute the new suffix in:
regex = re.compile(r".{3}std$")
s = regex.sub(new_suffix, s)
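The two approaches above can be compared side by side in a short sketch. The function names and sample strings are hypothetical, and the comparison assumes inputs of at least six characters, since the regex needs three characters before 'std'.

```python
import re

def replace_tail(s, new_suffix):
    """Replace the last six characters when s ends in 'std' (string methods)."""
    if s.endswith('std'):
        return s[:-6] + new_suffix
    return s

def replace_tail_re(s, new_suffix):
    """Same idea with a regex; a function replacement is used so that any
    backslashes in new_suffix are inserted literally."""
    return re.sub(r'.{3}std$', lambda m: new_suffix, s)

print(replace_tail('sample_abcstd', 'NEW'))      # sample_NEW
print(replace_tail_re('sample_abcstd', 'NEW'))   # sample_NEW
print(replace_tail('sample_abcxyz', 'NEW'))      # sample_abcxyz (unchanged)
```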

Allowing escape sequences in my regular expression

I'm trying to create a regular expression which finds occurences of $VAR or ${VAR}. If something like \$VAR or \${VAR} was given, it would not match. If it were given something like \\$VAR or \\${VAR} or any multiple of 2 \'s, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?
The run of backslashes before the $ is written as a character class, [\\\\]*, which matches any number of backslashes (odd or even). It should be a repeating group, ((?:\\\\)*), which matches zero or more pairs of backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)
To write a regular expression that finds $ unless it is escaped using E unless it in turn is also escaped EE:
import re
values = dict(BLOB='some value')
def repl(m):
    return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using E instead of '\\\\' lets you express the escaping rules of your embedded language without having to think about backslashes in Python string literals and regular expression patterns.
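As a sanity check, here is a self-contained sketch of the same pattern with the E and N placeholders already expanded, run against the cases listed in the question. The expand function and the values mapping are illustrative names, not part of any library.

```python
import re

values = {'BLOB': 'some value'}   # hypothetical variable table

# (?<!\\)      the match may not start right after a backslash
# (?:\\\\)*    zero or more PAIRS of backslashes (escaped backslashes)
# \$           the dollar sign itself
# \w+|\{\w+\}  VAR or {VAR}
PATTERN = re.compile(r'(?<!\\)(?P<before>(?:\\\\)*)\$(?P<name>\w+|\{\w+\})')

def expand(line):
    def repl(m):
        # Keep the backslash pairs and substitute the variable's value.
        return m.group('before') + values[m.group('name').strip('{}')]
    return PATTERN.sub(repl, line)

print(expand('$BLOB'))        # some value
print(expand(r'\$BLOB'))      # \$BLOB        (escaped, left alone)
print(expand(r'\\$BLOB'))     # \\some value  (backslash pair, then a match)
print(expand(r'\\\$BLOB'))    # \\\$BLOB      (odd count, left alone)
```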

Using "r" with variables in re.sub

I need to do this:
text = re.sub(r'\]\n', r']', text)
But with find and replace as variables:
find = '\]\n'
replace = ']'
text = re.sub(find, replace, text)
Where should I put r (raw)? It is not a string.
The r'' is part of the string literal syntax:
find = r'\]\n'
replace = r']'
text = re.sub(find, replace, text)
The syntax is in no way specific to the re module. However, specifying regular expressions is one of the main use cases for raw strings.
Short answer: you should keep the r together with the string.
The r prefix is part of the string syntax. With r, Python doesn't interpret backslash sequences such as \n, \t etc inside the quotes. Without r, you'd have to type each backslash twice in order to pass it to re.sub.
r'\]\n'
and
'\\]\\n'
are two ways to write the same string.
Keep r'...'
find = r'\]\n'
replace = r']'
text = re.sub(find, replace, text)
or go with
find = '\\]\\n'
replace = ']'
text = re.sub(find, replace, text)
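A short sketch can confirm that the two spellings really are the same string object contents, and that the variable works in re.sub like any other string. The sample text is made up for illustration.

```python
import re

find = r'\]\n'        # raw literal: backslash, ], backslash, n
also = '\\]\\n'       # ordinary literal with each backslash doubled
print(find == also)   # True
print(len(find))      # 4

replace = ']'
text = 'a]\nb]\nc'    # made-up sample: a ] followed by a real newline, twice
print(repr(re.sub(find, replace, text)))   # 'a]b]c'
```

The regex engine, not Python, interprets the \n in the pattern here, which is why the four-character string still matches a real newline.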

How do I remove \n found between double quotes in a string?

Good day,
I am totally new to Python and I am trying to do something with a string.
I would like to remove any \n characters found between double quotes ( " ) only, from a given string :
str = "foo,bar,\n\"hihi\",\"hi\nhi\""
The desired output must be:
foo,bar
"hihi", "hihi"
Edit:
The desired output must be similar to that string:
after = "foo,bar,\n\"hihi\",\"hihi\""
Any tips?
A simple stateful filter will do the trick.
in_string = False
input_str = 'foo,bar,\n"hihi","hi\nhi"'
output_str = ''
for ch in input_str:
    if ch == '"': in_string = not in_string
    if ch == '\n' and in_string: continue
    output_str += ch
print(output_str)
This should do:
def removenewlines(s):
    inquotes = False
    result = []
    for chunk in s.split("\""):
        # str.replace returns a new string, so the result must be assigned
        if inquotes: chunk = chunk.replace("\n", "")
        result.append(chunk)
        inquotes = not inquotes
    return "\"".join(result)
Quick note: Python strings can use '' or "" as delimiters, so it's common practice to use one when the other is inside your string, for readability. Eg: 'foo,bar,\n"hihi","hi\nhi"'. On to the question...
You probably want the python regexp module: re.
In particular, the substitution function is what you want here. There are a bunch of ways to do it, but one quick option is to use a regexp that identifies the double-quoted substrings, then calls a helper function to strip any \n out of them...
import re
def helper(match):
    return match.group().replace("\n","")
input = 'foo,bar,\n"hihi","hi\nhi"'
result = re.sub('(".*?")', helper, input, flags=re.S)
>>> str = "foo,bar,\n\"hihi\",\"hi\nhi\""
>>> re.sub(r'".*?"', lambda x: x.group(0).replace('\n',''), str, flags=re.S)
'foo,bar,\n"hihi","hihi"'
>>>
Short explanation:
re.sub is a substitution engine. It takes a regular expression, a substitution function or expression, a string to work on, and other options.
The regular expression ".*?" catches strings in double quotes that don't in themselves contain other double quotes (it has a small bug, because it wouldn't catch strings which contain escaped double-quotes).
lambda x: ... is an expression which can be used wherever a function can be used.
The substitution engine calls the function with the match object.
x.group(0) is "the whole matched string", which also includes the double quotes.
x.group(0).replace('\n','') is the matched string with every '\n' replaced by ''.
The flag re.S tells re.sub that '\n' is a valid character to catch with a dot.
Personally I find longer functions that say the same thing more tiring and less readable, in the same way that in C I would prefer i++ to i = i + 1. It's all about what one is used to reading.
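The small bug mentioned above (escaped double quotes inside a quoted string) can be worked around with a slightly longer pattern. The following is a sketch under the assumption that a backslash escapes the next character inside quotes; the pattern and function name are illustrative, not taken from the answers.

```python
import re

# A double-quoted field in which backslash + any character counts as one unit,
# so \" does not end the field. re.S lets "." match an escaped newline too.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"', re.S)

def strip_newlines_in_quotes(s):
    # Remove newlines only inside each quoted field that the pattern matches.
    return QUOTED.sub(lambda m: m.group(0).replace('\n', ''), s)

sample = 'foo,bar,\n"hihi","hi\nhi"'
print(repr(strip_newlines_in_quotes(sample)))    # 'foo,bar,\n"hihi","hihi"'

escaped = 'a,"say \\"hi\nthere\\"",b'
print(repr(strip_newlines_in_quotes(escaped)))   # 'a,"say \\"hithere\\"",b'
```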
This regex works (assuming that quotes are correctly balanced):
import re
result = re.sub(r"""(?x) # verbose regex
    \n            # Match a newline
    (?!           # only if it is not followed by
        (?:
            [^"]*"    # an even number of quotes
            [^"]*"    # (and any other non-quote characters)
        )*        # (yes, zero counts, too)
        [^"]*
        \Z        # until the end of the string.
    )""",
    "", str)
Something like this
Break the CSV data into columns.
>>> m=re.findall(r'(".*?"|[^"]*?)(,\s*|\Z)',s,re.M|re.S)
>>> m
[('foo', ','), ('bar', ',\n'), ('"hihi"', ','), ('"hi\nhi"', ''), ('', '')]
Replace just the field instances of '\n' with ''.
>>> [ field.replace('\n','') + sep for field,sep in m ]
['foo,', 'bar,\n', '"hihi",', '"hihi"', '']
Reassemble the resulting stuff (if that's really the point.)
>>> "".join(_)
'foo,bar,\n"hihi","hihi"'
