I am new to Python. I am trying to instantiate a grid from DEM data. Then I will try to create a flow direction map from raster data. But when I write the following line:
grid = Grid.from_raster("C:\Users\ogun_\Masaüstü\DEM", data_name = 'dem')
I have got this error.
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
When I try the r or R functions, they don't work for this code.
Strings in Python allow "escape codes" which start with a \. For example, \n signifies a new-line character.
In your case, you've specified \U as part of \Users. Python tries to interpret that as the start of a Unicode escape (which is what a \U escape code normally denotes) and expects eight hex digits to follow, hence the "truncated \UXXXXXXXX escape" message.
You can solve this either by escaping each backslash (which makes the string "C:\\Users\\ogun_\\Masaüstü\\DEM") or by using a "raw string", which doesn't process escape codes (which makes the string r"C:\Users\ogun_\Masaüstü\DEM").
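For illustration, here is a minimal sketch of the two spellings; the path is the one from the question, so the actual Grid.from_raster call is left commented out since that file only exists on the asker's machine:
# Both spellings denote the same Windows path.
escaped = "C:\\Users\\ogun_\\Masaüstü\\DEM"    # backslashes escaped
raw = r"C:\Users\ogun_\Masaüstü\DEM"           # raw string: no escape processing
print(escaped == raw)                          # True
# grid = Grid.from_raster(raw, data_name='dem')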
I suspect the latter might be what you mean by "r functions". These are not functions, but lexical markers. They're much lower-level than functions and can be thought of as part of the quoting itself. If you tried to call them as functions, that would be why it didn't work.
You can read more about marking a string as raw here.
NOTE: This also differs between Python 2 and 3. In Python 2, strings aren't Unicode by default, so you don't see this unless you ask for a unicode string. In Python 3, strings are Unicode by default, so this happens unless you explicitly ask for bytes with a byte string.
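A quick Python 3 sketch of that difference (the values are illustrative only):
text = "C:\\Users"              # str: Unicode text; escape sequences are processed
data = b"C:\\Users"             # bytes: raw 8-bit data, closer to Python 2's default str
print(type(text), type(data))   # <class 'str'> <class 'bytes'>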
I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
I hoped that this would convert it to my required unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So I tried adding \u to my given string. I also tried adding \\u, but that shows up as two backslashes. Adding an r before \u also ends up showing two backslashes instead of the unicode literal. I also tried decode-unicode, but it didn't work either.
Also, it'd be great if someone could explain the concept of raw strings, \u, etc.
Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last characters, parse what remains as a hex value, and get the Unicode character with that code point. In Python, that reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
Now for the reason of your error.
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the Unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains a literal backslash-u sequence (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
And in your initial code, when you write '\u', the parser sees the start of an escaped character and tries to decode it immediately... but cannot find the hex digits after it, so it throws the exception. If you really want to go that way, you have to double the backslash (\) to escape it and store it as-is in the string, then use codecs.decode(..., encoding='unicode_escape') to decode the string as shown in #ndclt's answer. But I do not advise you to do so.
References can be found in the standard Python library documentation: the chr function and the codecs module.
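As a small sketch contrasting the two approaches on one of the values from the question (assuming Python 3):
import codecs

cell = '<0938>'
via_chr = chr(int(cell[1:-1], 16))                                # parse the hex, get the character
via_codecs = codecs.decode('\\u' + cell[1:-1], 'unicode_escape')  # build a literal \uxxxx, then decode
print(via_chr, via_codecs, via_chr == via_codecs)                 # स स True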
In order to convert all your codes into Unicode, here is a one-liner:
import codecs
import pandas as pd
(
# create a series with the prefix "\u" to add to the existing column
pd.Series([r'\u'] * len(df['col1']))
# str.strip deletes the "<" and ">" from your column
# str.cat concatenates the prefix created before to the existing column
.str.cat(df['col1'].str.strip('<>'))
# then you apply a conversion from the raw string to normal string.
.apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, you have to create the prefix as a raw string. Otherwise, Python expects the \u to be followed immediately by a valid hex escape (the error you got in your code).
Edit: I am adding the explanation from Serge Ballesta's post:
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the Unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains a literal backslash-u sequence (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
His solution is more elegant than mine.
I am working with Russian words written in the Cyrillic orthography. Everything is working fine except that many (but not all) of the Cyrillic characters are encoded as two characters when stored in a str. For instance:
>>>print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>>print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using Python 2.7.
There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to pass 'ignore' as the errors argument to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
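A minimal Python 2 sketch of that workaround; the stray 0xff byte is made up to force an invalid sequence:
raw = '\xd1\x91\xff'                  # valid UTF-8 for "ё" plus one invalid byte
text = raw.decode('utf-8', 'ignore')  # the errors argument; 'replace' is another option
print repr(text)                      # u'\u0451' -- the bad byte was discarded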
These are actually different encodings:
>>>print ["ё"]
['\xd1\x91']
>>>print [u"ё"]
[u'\u0451']
What you're seeing are the __repr__s of the elements in the lists, not the __str__ versions of the unicode objects.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to coerce the two-byte UTF-8 sequences into unicode objects by decoding them:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.
To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything; e.g., if the data comes from a web page, see A good way to get the charset/encoding of an HTTP response in Python.
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
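A small Python 2 sketch of the unicode_literals effect (the coding header is needed because the source line contains a non-ASCII character):
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = "ё"           # a unicode literal even without the u prefix
print repr(s)     # u'\u0451'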
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё
I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out Unicode strings and the other gives ASCII byte strings. So that's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode('latin_1').encode('utf_8'))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. It could later be encoded as 8-bit ASCII or as multi-byte Chinese characters, but until it's time to convert it to an encoded variable, it's just a Unicode string of characters (see the sketch below).
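A short Python 2 sketch of those two kinds of string; the bytes shown are simply "café" encoded as UTF-8:
encoded = 'caf\xc3\xa9'               # str: raw bytes, here UTF-8
decoded = encoded.decode('utf-8')     # unicode: actual characters
print type(encoded), type(decoded)    # <type 'str'> <type 'unicode'>
print len(encoded), len(decoded)      # 5 bytes vs. 4 characters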
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (i.e., characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
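A rough Python 2 sketch of that advice. Because latin1 accepts every byte value, an automated guess has to try utf8 first and fall back to latin1; guess_text is a hypothetical helper name:
def guess_text(raw_bytes):
    try:
        return raw_bytes.decode('utf-8')     # try UTF-8 first: it rejects invalid sequences
    except UnicodeDecodeError:
        return raw_bytes.decode('latin1')    # latin1 never fails, so it is the fallback

print repr(guess_text('\xc3\xa9'))   # u'\xe9' -- valid UTF-8 for "é"
print repr(guess_text('\xe9'))       # u'\xe9' -- not UTF-8, falls back to latin1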
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().
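For example, a Python 2 sketch of those diagnostics, with made-up sample values:
samples = ['caf\xc3\xa9', u'caf\xe9']
for s in samples:
    print type(s), repr(s)
# <type 'str'> 'caf\xc3\xa9'
# <type 'unicode'> u'caf\xe9'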
Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8
Down the road it's output to XML through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is output correctly in CDATA nodes, but that one hardcoded string keeps ... I don't see why an ASCII codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
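A minimal Python 2 sketch that reproduces the implicit conversion and then avoids it by decoding the bytestring explicitly; the label variable is made up for illustration:
# -*- coding: utf-8 -*-
name = u"Par Catégorie"
label = "Catégorie: "                     # a bytestring, UTF-8 because of the coding header

try:
    combined = label + name               # Python decodes label with ASCII and fails
except UnicodeDecodeError as e:
    print e

combined = label.decode('utf-8') + name   # decode explicitly first; this works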
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.
I'm having a problem when trying to apply a regular expression to some strings encoded in latin-1 (ISO-8859-1).
What I'm trying to do is send some data via HTTP POST from a page encoded in ISO-8859-1 to my python application and do some parsing on the data using regular expressions in my python script.
The web page uses jQuery to send the data to the server, and I'm grabbing the text from the page using the .text() method. Once the data is sent to the server, I try to parse it with a regular expression that looks like this: re.compile(r"^[\s,]*(\d*\s*\d*\/*\d)[\s,]*"). Unfortunately the \s in my regular expression is not matching my data, and I traced the problem down to the fact that the HTML page uses &nbsp;, which gets encoded to 0xA0 (non-breaking space) and sent to the server. For some reason, it seems, my script is not interpreting that character as whitespace and is not matching. According to the Python re documentation it looks like this should work, so I must have an encoding issue here.
I then wanted to try converting the string into unicode and passing it to the regular expression, so I tried to see what would happen when I converted the string: print(unicode(data, 'iso-8859-1')).
Unfortunately I got this error:
UnicodeEncodeError at /script/
'ascii' codec can't encode character u'\xa0' in position 122: ordinal not in range(128)
I'm confused though - I'm obviously not trying to use ASCII decoding - is python trying to decode using ASCII even though I'm obviously passing another codec?
Try this instead:
print(repr(unicode(data, 'iso-8859-1')))
By printing a unicode object you're implicitly trying to convert it to the default encoding, which is ASCII. Using repr will escape it into an ASCII-safe form, and it'll also be easier for you to figure out what's going on for debugging.
Are you using Python 3.X or 2.X? It makes a difference. Actually looks like 2.X but you confused me by using print(blahblah) :-)
Answer to your last question: Yes, ASCII by default when you do print(). On 3.X: Use print(ascii(foo)) for debugging, not print(foo). On 2.X use repr(), not ascii().
Your original problem with the no-break space should go away if (a) the data is unicode and (b) you use the re.UNICODE flag with re.compile().
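A small Python 2 sketch of that combination, using the regular expression from the question and a made-up payload containing a 0xA0 byte:
import re

data = '\xa0 12/3'                         # latin-1 bytes with a leading no-break space
text = data.decode('iso-8859-1')           # (a) the data is now unicode
pattern = re.compile(r"^[\s,]*(\d*\s*\d*\/*\d)[\s,]*", re.UNICODE)   # (b) re.UNICODE flag
m = pattern.match(text)
if m:
    print repr(m.group(1))                 # u'12/3' -- \xa0 now counts as whitespace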