unicode_escape without deprecation warning

unicode_escape without deprecation warning - python

In Python 3.8, the following:
import codecs
codecs.decode("hello\\,world", "unicode_escape")
...produces a deprecation warning:
<ipython-input-7-28f185d30178>:1: DeprecationWarning: invalid escape sequence '\,'
codecs.decode("hello\\,", "unicode_escape", errors="strict")
Is this going to become an error in a future version of Python? Where is the reference for this?
If No, is there a way to not display this warning, if Yes, what can I use instead?
Edit: I am (obviously) not interested in fixing this for a string I would have written literally in my Python script. This is purely an example, in my use case strings come from external files, I cannot change them, and some contains invalid escape sequences such as this.

The problem is that you've got two layers of de-escaping occurring here. The first one is the regular string literal escaping, converting "hello\\,world" to a string with raw characters of hello\,world. Then codecs.decode tries to decode it with unicode_escape, which sees it as an attempt to escape , with \, which is an invalid escape.
The fix for your code as written is to use a raw string so the first level of escaping doesn't happen:
codecs.decode(r"hello\\,world", "unicode_escape")
# ^ Now it's a raw string, and both backslashes are in fact backslashes
If your data comes from elsewhere with invalid escapes, you can suppress the warning for now (see the warnings module for details), but it will eventually cause an exception in a future version of Python, so the long term solution is "Don't provide invalid data". Sorry that's not super-helpful.
The reason you get this warning is that, historically, Python has been kind of lax about spurious escapes. So if you wrote, say, a Windows path of "C:\yes" it said "hey, \y doesn't mean anything, so we'll just assume they wanted a literal backslash", while an equivalent path of "C:\no" saw \n and thought "Yup, they want a newline there".
This taught bad habits (not using raw strings when you should, because it usually worked without them) and created confusion when those habits bit you (why isn't it working this time!?!). So in the future, escapes like \y will be treated as errors, so that you end up writing r'C:\yes' and are used to doing so so you don't get bit by r'C:\no'. The warning is reminding you that this is bad code, and will eventually stop working (for good reason; as your own comment notes, you're okay with it ending up with either no backslashes or two backslashes, starting with just one, which is an insane set of options to accept without knowing the single, correct, desired result).
Alternative solution
If your goal is to "fix" bad strings, the best solution is probably to just write your own simple stripping regex, e.g.:
import re
bad_escape_re = re.compile(r'\\(?=[^\n\\\'"abfnrtv0-7xNuU])')
and then use it to strip unrecognized escapes a la:
good_string = bad_escape_re.sub('', bad_string)
which when run like so:
bad_escape_re.sub('', r'\a\b\c\d\e\f\g\,\.\n\t\x12')
produces a string with the repr '\\a\\bcde\\fg,.\\n\\t\\x12'. Note that's it's not perfect, and if you need it to validate the extended escapes to distinguish valid uses from invalid ones (\[0-7], \x, \N, \u and \U), it gets more complicated, but those are also cases where it's invariably heuristic and there is no good solution; without a human to interpret, \xab is legal and \xag is not, but it's entirely likely the former wasn't intended as an escape either.

Related

re.escape returns unusable directory

using re.escape() on this directory:
C:\Users\admin\code
Should theoratically return this, right?
C:\\Users\\admin\\code
However, what I actually get is this:
C\:\\Users\\admin\\code
Notice the backslash immediately after C. This makes the string unusable, and trying to use directory.replace('\', '') just bugs out Python because it can't deal with a single backslash string, and treats everything after it as string.
Any ideas?
Update
This was a dumb question :p

No it should not. It's help says "Escape all the characters in pattern except ASCII letters, numbers and '_'"
What you are reporting you are getting is after calling the print function on the resulting string. In console, if you type directory and press enter, it would give something like: C\\:\\\\Users\\\\admin\\\\code. When using directory.replace('\\','') it would replace all backslashes. For example: directory.replace('\\','x') gives Cx:xxUsersxxadminxxcode. What might work in this case is replacing both the backslash and colon with ':' i.e. directory.replace('\\:',':'). This will work.
However, I will suggest doing something else. A neat way to work with Windows directories in Python is to use forward slash. Python and the OS will work out a way to understand your paths with forward slashes. Further, if you aren't using absolute paths, as far as the paths are concerned, your code will be portable to Unix-style OSes.
It also seems to me that you are calling re.escape unnecessarily. If the printing the directory is giving you C:\Users\admin\code then it's a perfectly fine directory to use already. And you don't need to escape it. It's already done. If it wasn't escaped print('C:\Users\admin\code') would give something like C:\Usersdmin\code since \a has special meaning (beep).

Parsing blocks as Python

I am writing a lexer + parser in JFlex + CUP, and I wanted to have Python-like syntax regarding blocks; that is, indentation marks the block level.
I am unsure of how to tackle this, and whether it should be done at the lexical or sintax level.
My current approach is to solve the issue at the lexical level - newlines are parsed as instruction separators, and when one is processed I move the lexer to a special state which checks how many characters are in front of the new line and remembers in which column the last line started, and accordingly introduces and open block or close block character.
However, I am running into all sort of trouble. For example:
JFlex cannot match empty strings, so my instructions need to have at least one blanck after every newline.
I cannot close two blocks at the same time with this approach.
Is my approach correct? Should I be doing things different?

Your approach of handling indents in the lexer rather than the parser is correct. Well, it’s doable either way, but this is usually the easier way, and it’s the way Python itself (or at least CPython and PyPy) does it.
I don’t know much about JFlex, and you haven’t given us any code to work with, but I can explain in general terms.
For your first problem, you're already putting the lexer into a special state after the newline, so that "grab 0 or more spaces" should be doable by escaping from the normal flow of things and just running a regex against the line.
For your second problem, the simplest solution (and the one Python uses) is to keep a stack of indents. I'll demonstrate something a bit simpler than what Python does.
First:
indents = [0]
After each newline, grab a run of 0 or more spaces as spaces. Then:
if len(spaces) == indents[-1]:
pass
elif len(spaces) > indents[-1]:
indents.append(len(spaces))
emit(INDENT_TOKEN)
else:
while len(spaces) != indents[-1]:
indents.pop()
emit(DEDENT_TOKEN)
Now your parser just sees INDENT_TOKEN and DEDENT_TOKEN, which are no different from, say, OPEN_BRACE_TOKEN and CLOSE_BRACE_TOKEN in a C-like language.
Of you’d want better error handling—raise some kind of tokenizer error rather than an implicit IndexError, maybe use < instead of != so you can detect that you’ve gone too far instead of exhausting the stack (for better error recovery if you want to continue to emit further errors instead of bailing at the first one), etc.
For real-life example code (with error handling, and tabs as well as spaces, and backslash newline escaping, and handling non-syntactic indentation inside of parenthesized expressions, etc.), see the tokenize docs and source in the stdlib.

Some sensitive filenames cause failure of loading data

In using keras.model.load_weights, by the way, the weight file is saved in a hdf5 format, I come across some situations where the folder names that have initial r or t, cause the error: errno = 22, error message = 'invalid argument', flags = 0, o_flags = 0.
I want to know if there are some specified rules on the filenames which should be avoided and otherwise would lead to such reading error in python, or the situation I encountered is only specific to keras.

It would greatly help debug this if you include examples of such filenames that give you trouble. However, I have a good idea on what is probably happening here.
This problems seem to appear on folders that start with r or t on their names. Also, as they are folders, on their full path name they are preceded by a \ character (for example "\thisFolder", or similar). This is true in the case of a Windows environment, as they use \ for separating paths contrary to *nix systems that use the regular slash /.
Considering these things, seems that perhaps you are experiencing this as \r and \t are both special characters that mean Carriage Return and Tabulation, respectively. If this is the case many file openers will have trouble processing such file name.
Even more, I would not be surprised if you got the same errors on folders that begin with n or other letters that when concatenated to a backslash give special characters (\n is new line, \s is a white space, etc.).
To overcome this seems that you will need to escape your backslash character before passing it as a filename. In python, an escaped backslash is "\\"
. In addition, you can also opt to pass a Raw string instead, by adding the r prefix to your string, something like r"\a\raw\string". More information on escaping and raw string can be found on this question and answers.
I want to know if there are some specified rules on the filenames which should be avoided and otherwise would lead to such reading error in python,
As mentioned, you should avoid this with characters that have a special meaning with a backslash. I suggest you check here to see the characters Python accepts like this, so you can refrain from using such characters (or well use raw strings and forget about this problem).

Why is escaping of single quotes inconsistent on file read in Python?

Given two nearly identical text files (plain text, created in MacVim), I get different results when reading them into a variable in Python. I want to know why this is and how I can produce consistent behavior.
For example, f1.txt looks like this:
This isn't a great example, but it works.
And f2.txt looks like this:
This isn't a great example, but it wasn't meant to be.
"But doesn't it demonstrate the problem?," she said.
When I read these files in, using something like the following:
f = open("f1.txt","r")
x = f.read()
I get the following when I look at the variables in the console. f1.txt:
>>> x
"This isn't a great example, but it works.\n\n"
And f2.txt:
>>> y
'This isn\'t a great example, but it wasn\'t meant to be. \n"But doesn\'t it demonstrate the problem?," she said.\n\n'
In other words, f1 comes in with only escaped newlines, while f2 also has its single quotes escaped.
repr() shows what's going on. first for f1:
>>> repr(x)
'"This isn\'t a great example, but it works.\\n\\n"'
And f2:
>>> repr(y)
'\'This isn\\\'t a great example, but it wasn\\\'t meant to be. \\n"But doesn\\\'t it demonstrate the problem?," she said.\\n\\n\''
This kind of behavior is driving me crazy. What's going on and how do I make it consistent? If it matters, I'm trying to read in plain text, manipulate it, and eventually write it out so that it shows the properly escaped characters (for pasting into Javascript code).

Python is giving you a string literal which, if you gave it back to Python, would result in the same string. This is known as the repr() (short for "representation") of the string. This may not (probably won't, in fact) match the string as it was originally specified, since there are so many ways to do that, and Python does not record anything about how it was originally specified.
It uses double quotes around your first example, which works fine because it doesn't contain any double quotes. The second string contains double quotes, so it can't use double quotes as a delimiter. Instead it uses single quotes and uses backslashes to escape the single quotes in the string (it doesn't have to escape the double quotes this way, and there are more of them than there are single quotes). This keeps the representation as short as possible.
There is no reason for this behavior to drive you crazy and no need to try to make it consistent. You only get the repr() of a string when you are peeking at values in Python's interactive mode. When you actually print or otherwise use the string, you get the string itself, not a reconstituted string literal.
If you want to get a JavaScript string literal, the easiest way is to use the json module:
import json
print json.dumps('I said, "Hello, world!"')

Both f1 and f2 contain perfectly normal, unescaped single quotes.
The fact that their repr looks different is meaningless.
There are a variety of different ways to represent the same string. For example, these are all equivalent literals:
"abc'def'ghi"
'abc\'def\'ghi'
'''abc'def'ghi'''
r"abc'def'ghi"
The repr function on a string always just generates some literal that is a valid representation of that string, but you shouldn't depend on exactly which one it generate. (In fact, you should rarely use it for anything but debugging purposes in the first place.)
Since the language doesn't define anywhere what algorithm it uses to generate a repr, it could be different for each version of each implementation.
Most of them will try to be clever, using single or double quotes to avoid as many escaped internal quotes as possible, but even that isn't guaranteed. If you really want to know the algorithm for a particular implementation and version, you pretty much have to look at the source. For example, in CPython 3.3, inside unicode_repr, it counts the number of quotes of each type; then if there are single quotes but no double quotes, it uses " instead of '.
If you want "the" representation of a string, you're out of luck, because there is no such thing. But if you want some particular representation of a string, that's no problem. You just have to know what format you want; most formats, someone's already written the code, and often it's in the standard library. You can make C literal strings, JSON-encoded strings, strings that can fit into ASCII RFC822 headers… But all of those formats have different rules from each other (and from Python literals), so you have to use the right function for the job.

using an alternative string quotation syntax in python

Just wondering...
I find using escape characters too distracting. I'd rather do something like this (console code):
>>> print ^'Let's begin and end with sets of unlikely 2 chars and bingo!'^
Let's begin and end with sets of unlikely 2 chars and bingo!
Note the ' inside the string, and how this syntax would have no issue with it, or whatever else inside for basically all cases. Too bad markdown can't properly colorize it (yet), so I decided to <pre> it.
Sure, the ^ could be any other char, I'm not sure what would look/work better. That sounds good enough to me, tho.
Probably some other language already have a similar solution. And, just maybe, Python already have such a feature and I overlooked it. I hope this is the case.
But if it isn't, would it be too hard to, somehow, change Python's interpreter and be able to select an arbitrary (or even standardized) syntax for notating the strings?
I realize there are many ways to change statements and the whole syntax in general by using pre-compilators, but this is far more specific. And going any of those routes is what I call "too hard". I'm not really needing to do this so, again, I'm just wondering.

Python has this use """ or ''' as the delimiters
print '''Let's begin and end with sets of unlikely 2 chars and bingo'''
How often do you have both of 3' and 3" in a string

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.