I discovered that when searching for a filename containing an apostrophe (') with the Google Drive API, I needed to escape the apostrophe with a backslash (\), e.g.:
# file_name is "tim's file"
file_name = file_name.replace("'", "\\'")
# file_name is "tim\'s file"
response = service.files().list(q = "name='" + file_name + "'").execute() #works
The docs mention that the backslash also needs special treatment.
My question is: what is the general solution to this problem of special characters in the filename? Are there other characters that similarly need to be escaped?
TL;DR: No, there isn't a generic way to handle escaping ' and \ in Google Drive queries (and possibly other Google APIs). Each API provider (Microsoft, Amazon, Twitter, etc.) has its own filename/string-escaping rules, so writing an escaper for each would be tedious. Ideally, this escaping would be part of the API client each provider ships.
My question is: what is the general solution to this problem of special characters in the filename?
This is separate from the issue of sanitising strings for actual filenames because local filesystems don't follow the same rules as GDrive.
Are there other characters that similarly need to be escaped?
As far as I can tell, GDrive only needs the apostrophe (') and backslash (\) escaped, as you pointed out. As for the actual request, there's:
Note: These examples use the unencoded q parameter, where name = 'hello' is encoded as name+%3d+%27hello%27. Client libraries handle this encoding automatically.
That part is probably being handled by google-api-python-client.
As for the two specific replacements you need:
file_name = r"tim's file\has slashes"
print(file_name)
# tim's file\has slashes
print(file_name.replace('\\', '\\\\').replace("'", "\\'"))
# tim\'s file\\has slashes
# or, better
print(file_name.replace('\\', '\\\\').replace("'", r"\'"))
# tim\'s file\\has slashes
# using raw strings also for the backslash replacement
print(file_name.replace('\\', r'\\').replace("'", r"\'"))
# tim\'s file\\has slashes
Note that there's no point using a raw string for the find part of the first replacement, because a backslash immediately before the closing quote has to be escaped anyway: r'\' is not a valid Python string (SyntaxError: EOL while scanning string literal). However, r'\\' means two backslashes, because in a raw string the first backslash doesn't escape the second. I.e. '\\' vs r'\\' == 1 backslash vs 2 backslashes. If you want 3, or any other odd number of backslashes, at the end of a string, a raw literal cannot express it; use an ordinary escaped literal such as '\\\\\\' instead.
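A quick way to check the literal forms discussed above is to look at the lengths of the resulting strings. This is only a sketch; the variable names are illustrative.

```python
# Plain vs raw literals: the same source text yields different strings.
one = '\\'     # escape sequence: a single backslash
two = r'\\'    # raw string: both backslashes are kept

print(len(one), len(two))  # 1 2

# A raw literal cannot end in an odd number of backslashes, so the last
# one is appended from an ordinary escaped literal.
three = r'\\' + '\\'
print(len(three))  # 3
```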
Btw, replacement order is important because if you did it in reverse, then the backslash added for the apostrophe would then get escaped further:
print(file_name.replace("'", r"\'").replace('\\', r'\\')) # WRONG!
# tim\\'s file\\has slashes
And do use f-strings for the query, it's much more readable:
f"name='{file_name}'"
# "name='tim\\'s file\\\\has slashes'"
print(f"name='{file_name}'")
# name='tim\'s file\\has slashes'
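The two replacements and the f-string can be wrapped in a small helper. This is a minimal sketch: `escape_drive_query_value` is a hypothetical name, not part of google-api-python-client, and it assumes that apostrophe and backslash are the only characters that need escaping.

```python
def escape_drive_query_value(value: str) -> str:
    """Escape backslashes and apostrophes for a single-quoted query value."""
    # Backslashes first, so the ones added for apostrophes are not doubled.
    return value.replace("\\", "\\\\").replace("'", "\\'")


file_name = r"tim's file\has slashes"
query = f"name='{escape_drive_query_value(file_name)}'"
print(query)  # name='tim\'s file\\has slashes'
```

The resulting `query` string would then be passed as `q=query` to `service.files().list(...)`, as in the question.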
If you are passing the filename inside apostrophes, the apostrophes are what you need to escape. This is the proper way to do it:
file_name = file_name.replace("'", "\\'") # "\'" alone is just "'", so the backslash itself must be escaped
Here's why:
>>> print('\'')
'
>>> print('\\'')
File "<stdin>", line 1
print('\\'')
^
SyntaxError: EOL while scanning string literal
You can also build the query like this:
response = service.files().list(q = f"name='{file_name}'").execute()
It is a lot easier and cleaner.
EDIT: I have read in the docs that a backslash (\) should be escaped as well, so also replace \ with \\.
I am getting a filename from an API in this format, containing a mix of / and \.
infilename = 'c:/mydir1/mydir2\mydir3\mydir4\123xyz.csv'
When I try to parse the directory structure, a \ followed by a character is converted into a single character.
Is there a way to get each component correctly?
What I already tried:
path.normpath didn't help.
infilename = 'c:/mydir1/mydir2\mydir3\mydir4\123xyz.csv'
os.path.normpath(infilename)
out:
'c:\\mydir1\\mydir2\\mydir3\\mydir4Sxyz.csv'
Put r before the string literal to make it a raw string (i.e. backslashes are not processed as escape sequences).
e.g.
infilename = r'C:/blah/blah/blah.csv'
More details here:
https://docs.python.org/3.6/reference/lexical_analysis.html#string-and-bytes-literals
It only shows up once in your example (\123 became S), but writing this:
infilename = 'c:/mydir1/mydir2\mydir3\mydir4\123xyz.csv'
isn't a good idea, because a backslash followed by certain lowercase letters (and a few uppercase ones), or by digits, is interpreted as an escape sequence. Notorious examples are \t and \b; there are others. For instance:
infilename = 'c:/mydir1/mydir2\thedir3\bigdir4\123xyz.csv'
doubly fails because 2 chars are interpreted as "tab" and "backspace".
When dealing with literal Windows-style paths (or regexes), you have to use the raw prefix and, better still, normalize the path to get rid of the mixed slashes.
infilename = os.path.normpath(r'c:/mydir1/mydir2\mydir3\mydir4\123xyz.csv')
However, the raw prefix only applies to literals. If the returned string appears, when printing repr(string), as 'the\terrible\\dir', then tab chars have already been put in the string, and there's nothing you can do except a lousy post-processing.
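The post-processing mentioned above could look roughly like the sketch below. `restore_backslashes` is a hypothetical helper, and it is best-effort only: it assumes every control character in the string came from a single-letter escape, and it cannot recover octal or hex escapes (for example `\123`, which has already become `S`).

```python
# Map each control character produced by a single-letter escape back to
# the backslash-plus-letter text it most likely came from.
_CONTROL_TO_TEXT = str.maketrans({
    "\a": r"\a", "\b": r"\b", "\f": r"\f", "\n": r"\n",
    "\r": r"\r", "\t": r"\t", "\v": r"\v",
})


def restore_backslashes(mangled: str) -> str:
    """Best-effort repair; cannot undo octal or hex escapes such as \\123."""
    return mangled.translate(_CONTROL_TO_TEXT)


mangled = "c:/mydir1/mydir2\thedir3\bigdir4"  # tab and backspace inside
print(restore_backslashes(mangled))  # c:/mydir1/mydir2\thedir3\bigdir4
```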
Instead of splitting on \ try splitting on \\. In a string literal the backslash itself has to be escaped, so a single \ character is written as \\.
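For getting each component of a path that mixes / and \, one option not mentioned in the answers is `pathlib.PureWindowsPath`, which applies Windows path rules on any platform. A minimal sketch, assuming the string really contains backslashes (the raw prefix is needed here only because the path is typed as a literal):

```python
from pathlib import PureWindowsPath

# Raw literal only because this is source code; a string received from an
# API at runtime already holds real backslashes.
infilename = r'c:/mydir1/mydir2\mydir3\mydir4\123xyz.csv'

parts = PureWindowsPath(infilename).parts
print(parts)
# ('c:\\', 'mydir1', 'mydir2', 'mydir3', 'mydir4', '123xyz.csv')
```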
I know that \f is a form feed. I want to access my folder the following way:
os.chdir("C:\Python27\BGT_Python\skills\fuzzymatching")
The folder 'fuzzymatching' starts with the \f symbol which breaks the string.
What's the easiest way to get around these types of symbols?
Add an r character in front of the string:
os.chdir(r"C:\Python27\BGT_Python\skills\fuzzymatching")
See the Python docs.
In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A "quote" is the character used to open the string, i.e. either ' or ".)
and
Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C.
For completeness, I'll add:
os.chdir("C:/Python27/BGT_Python/skills/fuzzymatching")
About the only part of Windows that actually requires backslashes is the command line.
This should work:
os.chdir("C:\Python27\BGT_Python\skills\\fuzzymatching")
I just added a \ to escape \f.
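The suggestions in this thread can be compared directly. The sketch below checks that the raw and fully escaped spellings are the same string, and shows what the unprotected \f does; no directory is actually changed.

```python
raw = r"C:\Python27\BGT_Python\skills\fuzzymatching"
escaped = "C:\\Python27\\BGT_Python\\skills\\fuzzymatching"
forward = "C:/Python27/BGT_Python/skills/fuzzymatching"

print(raw == escaped)                     # True
print(forward.replace("/", "\\") == raw)  # True

# Without protection, "\f" collapses into one form-feed character.
broken = "C:\\Python27\\BGT_Python\\skills\fuzzymatching"
print("\f" in broken, len(raw) - len(broken))  # True 1
```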
I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
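For contrast, here is a sketch of a plain `re` pattern in which the end of the string is locally determinable: every backslash always consumes the following character, so there is no special case for the last position. The name `LOCAL_QUOTED` is illustrative, and this is not the pattern pyparsing generates.

```python
import re

# Every backslash consumes the next character, so the closing quote is
# always the first quote that is not part of a backslash pair.
LOCAL_QUOTED = re.compile(r"'(?:[^'\\\n\r]|\\.)*'")

ok = LOCAL_QUOTED.search(r"foo = 'ab\'cd\\'")
print(ok.group(0))  # 'ab\'cd\\'

# An unterminated string is rejected instead of being silently re-read.
print(LOCAL_QUOTED.search(r"'ab\'xxxxxxx"))  # None
```

With this rule a string that should end in a backslash has to be written with a doubled backslash, as in Python itself.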
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
    print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being unescaped to "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
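The substitution used in that workaround can be tried on its own, without pyparsing. In this sketch `unescape` is a hypothetical wrapper around the same `re.sub` call, applied to the token value shown in the output above (ab'cd followed by two backslashes).

```python
import re


def unescape(body: str) -> str:
    """Replace each backslash-plus-character pair with the character alone."""
    return re.sub(r"\\(.)", r"\g<1>", body)


token = "ab'cd\\\\"      # what QuotedString returned: two trailing backslashes
print(unescape(token))   # ab'cd\
```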
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.
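The raw-string behaviour mentioned above can be checked without crashing the script by handing the source text to the built-in `compile()`. A small sketch; `compiles` is an illustrative helper name.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False


print(compiles('r"foo\\"'))     # False: raw literal ends in one backslash
print(compiles('r"bar\\\\"'))   # True: two trailing backslashes are fine
print(len(eval('r"bar\\\\"')))  # 5: both backslashes stay in the value
```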
I am new to Python - but not to programming, and on a bit of a steep learning curve.
I have a programme that reads several input files - the first input file contains (amongst other things) the path and name of the other files.
I can open the file and read the name OK. If I print the string it looks like this
'Z:\\python\\rb_data.dat\n'
All my "\" become "\\". I think I can fix this by using the "r" prefix to convert it to a literal.
My question is: how do I attach the prefix to a string variable?
This is what I want to do :
modat = open('z:\\python\mot1 input.txt') # first input file containing names of other file
rbfile = modat.readline() # get new file name
rbdat = open(rbfile) # open new file
The \\ is an escape sequence for the backslash character \. String literals are enclosed in either ' or ". Because there are some characters that you might want in the string but cannot type directly, such as the quotation marks themselves, escape sequences let you enter them. They usually have the form \x, where x stands for the thing you want to enter. Now, because all escape sequences start with a backslash, the backslash itself also becomes a special character that you cannot put directly into a string literal. So you need to escape it too.
That means that the string literal '\\' actually represents a string with a single character: the backslash. Raw strings, which are string literals with an r in front of the opening quotation mark, ignore (most) escape sequences. So r'\\x' is actually the string in which two backslashes are followed by an x, identical to the string described by the non-raw string literal '\\\\x'.
All this only applies to string literals though. The string itself holds no information about whether it was created with a raw string literal or not, or whether any escape sequence was needed. It just contains all the characters that make up the string.
That also means that as soon as you get a string from somewhere, for example by reading it from a file, then you don’t need to worry about escaping something in there to make sure that it’s a correct string. It just is.
So in your code, when you open the file at z:\python\mot1 input.txt, you need to specify that filename as a string first. So you have to use a string literal, either with escaping the backslashes, or by using a raw string.
Then, when you read the new filename from that file, you already have a real string, and don’t need to bother with anything more. Assuming that it was correctly written to the file, you can just use it like that.
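To see this in practice, the sketch below writes a path to a temporary file and reads it back. The doubled backslashes appear only in the repr of the string, not in the string itself. It also strips the trailing newline that readline() keeps, which would otherwise end up in the filename passed to open(). The file name index.txt is made up for the demonstration.

```python
import os
import tempfile

# Write a one-line "index" file containing a path with single backslashes.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.txt")
    with open(index_path, "w") as fh:
        fh.write("Z:\\python\\rb_data.dat\n")

    with open(index_path) as fh:
        line = fh.readline()

print(repr(line))        # 'Z:\\python\\rb_data.dat\n'  (repr doubles them)
print(line.count("\\"))  # 2  (the string itself has two backslashes)

rbfile = line.strip()    # drop the trailing newline before calling open()
print(rbfile)            # Z:\python\rb_data.dat
```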
The backslash \ in Python strings (and in code blocks on StackOverflow!) means, effectively, "treat the next character differently". As it is reserved for this purpose, when you actually have a backslash in your strings, it must be "escaped" by a preceding backslash:
>>> myString = "\\" # the first one "escapes" the second
>>> myString = "\" # no escape, so...
SyntaxError: EOL while scanning string literal
>>> print("\\") # when we actually print out the string
\
The short story is, you can basically ignore this in your strings. If you pass rbfile to open, Python will interpret it correctly.
Why not use os.path.normcase, like this:
import os
with open(r'z:\python\mot1 input.txt') as f:
    for line in f:
        if line.strip():
            if os.path.isfile(os.path.normcase(line.strip())):
                with open(line.strip()) as f2:
                    # do something with
                    # f2
                    pass
From the documentation of os.path.normcase:
Normalize the case of a pathname. On Unix and Mac OS X, this returns
the path unchanged; on case-insensitive filesystems, it converts the
path to lowercase. On Windows, it also converts forward slashes to
backward slashes.
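The Windows behaviour described in that excerpt can be tried on any platform through `ntpath`, the module that `os.path` resolves to on Windows. A small sketch with a made-up input line:

```python
import ntpath  # the Windows flavour of os.path, importable on any platform

line = "Z:/Python/Mot1 Input.txt\n"
normalized = ntpath.normcase(line.strip())
print(normalized)  # z:\python\mot1 input.txt
```

Note that the result is lowercased as well, so it is better suited to comparing paths than to displaying them.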
I have this code:
import os
path = os.getcwd()
final = path +'\xulrunner.exe ' + path + '\application.ini'
print(final)
I want output like:
C:\Users\me\xulrunner.exe C:\Users\me\application.ini
But instead I get an error that looks like:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape
I don't want the backslashes to be interpreted as escape sequences, but as literal backslashes. How can I do it?
Note that if the string should only contain a backslash - more generally, should have an odd number of backslashes at the end - then raw strings cannot be used. Please use "How can I get a string with a single backslash in it?" to close questions that are asking for a string with just a backslash in it. Use "How to write string literals in python without having to escape them?" when the question is specifically about wanting to avoid the need for escape sequences.
To answer your question directly, put r in front of the string.
final= path + r'\xulrunner.exe ' + path + r'\application.ini'
But a better solution would be os.path.join:
final = os.path.join(path, 'xulrunner.exe') + ' ' + \
os.path.join(path, 'application.ini')
(the backslash there is escaping a newline, but you could put the whole thing on one line if you want)
I will mention that you can use forward slashes in file paths: Windows accepts them as separators in nearly all file APIs, so Python can open such paths as they are. So
final = path + '/xulrunner.exe ' + path + '/application.ini'
should work. But it's still preferable to use os.path.join because that makes it clear what you're trying to do.
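To see what the join produces under Windows rules regardless of the machine you run it on, `ntpath.join` can stand in for `os.path.join`. In this sketch the value of `path` is a stand-in for whatever `os.getcwd()` returns.

```python
import ntpath  # Windows rules for os.path, usable on any platform

path = r"C:\Users\me"  # stand-in for os.getcwd() in this sketch
final = ntpath.join(path, "xulrunner.exe") + " " + ntpath.join(path, "application.ini")
print(final)  # C:\Users\me\xulrunner.exe C:\Users\me\application.ini
```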
You can escape the backslash. Use \\ and you get just one backslash.
You can escape the backslash with another backslash (\\), but it won’t look nicer. To solve that, put an r in front of the string to signal a raw string. A raw string will ignore all escape sequences, treating backslashes as literal text. It cannot contain the closing quote unless it is preceded by a backslash (which will be included in the string), and it cannot end with a single backslash (or odd number of backslashes).
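The rule about quotes inside raw strings is easy to misread, so here is a short check. The backslash in front of the quote keeps the literal open, but it also stays in the resulting string.

```python
with_quote = r'it\'s'               # backslash AND apostrophe are both kept
print(with_quote, len(with_quote))  # it\'s 5

print(r'\x' == '\\x')               # True: no escape processing in the raw form
```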
Another simple (and arguably more readable) approach is a raw string combined with str.format replacement fields, like so:
import os
path = os.getcwd()
final = r"{0}\xulrunner.exe {0}\application.ini".format(path)
print(final)
or using the os path method (and a microfunction for readability):
import os
def add_cwd(path):
    return os.path.join( os.getcwd(), path )
xulrunner = add_cwd("xulrunner.exe")
inifile = add_cwd("application.ini")
# in production you would use xulrunner+" "+inifile
# but the purpose of this example is to show a version where you could use any character
# including backslash
final = r"{} {}".format( xulrunner, inifile )
print(final)