Python - Decode string already in bytes [duplicate] - python

This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 3 years ago.
I'm trying that a Python program reads a word made by another Python program which was encoded to UTF-8 and saved on a txt file.
For example, the string it gets might be:
b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
being this a normal string, like doing this:
word_string = "b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'"
How do I make the script see this is a bytes string and not a normal string? I know this can be done like
word_bytes = b'\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
but if I have the content of that variable 'word_bytes' already written in a file, how can I get it and make the program understand it just has to decode it? Because I try to decode it and it says it's a string and can't be decoded. Any help?
Thanks in advance!
UPDATE: So just to put here to anyone who gets the string from a file on at least Windows (I'm using Windows 7), with tripleee's answer, it will encode and put double backslashes on the bytes part, and when it decodes, it will just remove one of the backslashes, putting it as it was before. So the way to get it from a file and decode it is the following:
s = '\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'.encode().decode('unicode_escape') [having the bytes part between '' been gotten from a file using the open(file,"r") function, in my case]
s.encode('latin-1').decode('utf-8') [or ISO-8859-1, as it seems it's the same thing]
EDIT: tripleee's answer is almost what I wanted to know (50% missing), but it's already a way, so thank you! But how could I do it not knowing the encoding (because in this case, I didn't know the encoding was latin-1 and I can't put all the encodings there)? Like I would do by just putting a b before the bytes string like in 'word_bytes' variable (possibly it might encode with the right encoding automatically? I wanted to do that too but possibly with a funcion to a variable that has already the bytes part).

If you have the bytes in a variable already, you are all set. If you have the bytes in a string, I'm assuming you basically have a sequence of characters where the code point value of each is equivalent to the byte value it's supposed to hold. This happens to be the definition of the Latin-1 encoding - it feels a bit dirty, but the trick is to encode your string as Latin-1, then decode back as UTF-8.
>>> s = '\xd1\x84\xd0\xbe\xd1\x80\xd1\x83\xd0\xbc'
>>> s.encode('latin-1').decode('utf-8')
'форум'

you can identify if the string is in bytes using
def identifystring(string):
if isinstance(string, str):
print ("ordinary string")
elif isinstance(string, unicode):
print ("unicode string")
else:
print ("no string")

Related

Python 3 Decoding Strings

I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
expected from string
'"my quote"'
Last point of note is that I'm working with Python 3 so raw is unicode, and thus is already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode, Python 3 does that for you, but you need the correct control character for the double quote "
If however you have a different character set, it appears you have Windows-1252, then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string you have to encode it first, I found the latin-1 encoding to work:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
I don't know if you mean to this, but this works:
some_binary = a = b"\x94my quote\x94"
result = some_binary.decode()
And you got the result...
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
Did you try it like this? I think you need to call decode as a method of the byte class, and pass utf-8 as the argument. Add b in front of the string too.
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)

Python - Unicode & double backslashes [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed 7 months ago.
I scrapped a webpage with BeautifulSoup.
I got great output except parts of the list look like this after getting the text:
list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
My question now is how to get rid or replace these double backslashes with the special characters they are.
If i print the first the first element of the example list the output looks like
print list[0]
that\u2019s
I already read a lot of other questions / threads about this topic but I ended up being even more confused, as I am a beginner considering unicode / encoding / decoding.
I hope that someone could help me with this issue.
Thanks!
MG
Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method - using the special codec "unicode_escape". It "sees" the "physical" backlashes and decodes those sequences proper unicode characters:
data = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
result = [part.decode('unicode_escape') for part in data]
To aAnyone getting here using Python3: in that version can not apply the "decode" method to the str objects delivered by beautifulsoup - one has to first re-encode those to byte-string objects, and then decode with the uncode_escape codec. For these purposes it is usefull to make use of the latin1 codec as the transparent encoding: all bytes in the str object are preserved in the new bytes object:
result = [part.encode('latin1').decode('unicode_escape') for part in data]
the problem here is that the site ended up double encoding those unicode arguments, just do the following:
ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']
ls = map(lambda x: x.decode('unicode-escape'), ls)
now you have a list with properly unicode encoded strings:
for a in ls:
print a

How to write a unicode object into a file in Python?

I try to write a "string" to a file and get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)
I tried the following methods:
print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')
None of them work. I have the same error message.
What is the idea behind encoding and decoding? If I have a unicode object can I write it to the file directly or I need to transform it to a string?
How can I find out what codding is used? How can I know if it is utf-8 or ascii or something else?
ADDED
I think I have just managed to save a string into a file. print >>f, txt as well as print >>f, txt.decode('utf-8') did not work but print >>f, txt.encode('utf-8') works. I get no error message and I see Chinese characters in my file.
I recently posted another answer that addresses this very issue. Key quote:
For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
Without knowing what is in your txt object I can't be more specific.
I think you need to use codecs library:
import codecs
file = codecs.open("test.txt", "w", "utf-8")
file.write(u'\xcd')
file.close()
Works fine.
The Story of Encoding/Decoding:
In the past, there were only about ~60 characters available in computers (including upper-case and lower-case letters + numbers + some special characters). So only 1 byte was enough to assign a unique number to each letter. Assigning numbers to letters for storing in memory is called encoding. This one byte encoding that is used in python by default is named ASCII.
With growth of computers in the world, we need to have more letters and characters in computer. So 1 byte is not enough. Different encoding schemes appeared. Unicode is one of the famous. The character that you are trying to store in your file is a Unicode character and it need 2 bytes, So you must explicitly indicate to Python that you don't want to use the default encoding, i.e. the ASCII (because you need 2 bytes for this character).

How to find right encoding in python? [duplicate]

This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 5 years ago.
I'm trying to get rid of diacritics in my textfile. I converted a pdf to text with a tool, not made by myself. I wasn't able to understand which encoding they use. The text is written in Nahuatl, orthographically familiar with Spanish.
I transformed the text into a list of strings. No I'm trying to do the following:
# check whether there is a not-ascii character in the item
def is_ascii(word):
check = string.ascii_letters + "."
if word not in check:
return False
return True
# if there is a not ascii-character encode the string
def to_ascii(word):
if is_ascii(word) == False:
newWord = word.encode("utf8")
return newWord
return word
What I want to get is a unicode-version of my string. It doesn't work so far and I tried several encodings like latin1, cp1252, iso-8859-1. What I get is Can anybody tell me what I did wrong?
How can I find out the right encoding?
Thank you!
EDIT:
I wrote to the people that developed the converter (pdf-txt) and they said they were using unicode already. So John Machin was right with (1) in his answer.
As I wrote in some comment that wasn't clear to me, because in the Eclipse debugger the list itself showed some signs in unicodes, others not. And if I looked at the items seperately they were all decoded in some way, so that I actually saw unicode.
Thank you for your help!
Edit your question to show the version of Python you are using. Guessing the version from your code is not possible. Whether you are using Python 3.X or 2.X matters a lot. Following remarks assume Python 2.x.
You already seem to have determined that you have UTF-8 encoded text. Try the_text.decode('utf8'). Note decode, NOT encode.
If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
If the above does not work, show us the result of print repr(the_text).
Note that it is counter-productive trying to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8. Leaving some data as str objects and other as unicode is messy in Python 2.x and won't work in Python 3.X
In any case, your first function doesn't do what you think it does; it returns False for any input string whose length is 2 or more. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
Note that latin1 and iso-8859-1 are the same encoding. As latin1 encodes the first 256 codepoints in Unicode in the same order, then it is impossible to get UnicodeDecodeError raised by text.decode('latin1'). "No error" is this case has exactly zero diagnostic value.
Update in response to this comment from OP:
I use Python 2.7. If I use text.decode("utf8") it raises the following
error: UnicodeEncodeError: 'latin-1' codec can't encode character
u'\u2014' in position 0: ordinal not in range(256).
That can happen two ways:
(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object so Python 2.X tries to encode it using the default encoding (latin-1 ???).
(2) Possibly in two different statements, first foo = text.decode('utf8') where text is an str object encoded in UTF-8, and this statement doesn't raise an error, followed by something like print foo and your sys.stdout.encoding is latin-1 (???).
I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!
Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).
I ask again: can you make your file available for analysis?
By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.
If you have read some bytes and want to interpret them as an unicode string, then you have to use .decode() rather than encode().
Like #delnan said in the comment, I hope you know the encoding. If not, the guesswork should go easy once you fix the function used.
BTW even if there are only ASCII characters in that word, why not .decode() it too? You'd have the same data type (unicode) everywhere, which will make your program simpler.

Python String Comparison--Problems With Special/Unicode Characters

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database spits out me Unicode strings and one gives ASCII. So that's probably the problems. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode(latin_1).encode(utf_8))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII strings, or it could be encoded to 32-bit Chinese characters. But until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (ie, characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().

Categories

Resources