Which encoding? - python

Does anybody know how the string 'Krummh%C3%B6rn' is encoded?
The plain text is "Krummhörn".
I need to decode strings like this one in Python and tried urllib.unquote('Krummh%C3%B6rn').
The result: 'Krummh\xc3\xb6rn'

That's UTF-8 in URL encoding.
print(urllib.unquote('Krummh%C3%B6rn').decode('utf-8'))
prints the string as you'd expect it to look.
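
On Python 3 the same thing can be done in one step, since urllib.parse.unquote decodes percent-escapes as UTF-8 by default. A minimal sketch, assuming Python 3 (the variable names are only illustrative):

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = 'Krummh%C3%B6rn'

# unquote() turns the percent-escapes into bytes and decodes them
# as UTF-8 (its default encoding), returning a str.
town = unquote(quoted)

# unquote_to_bytes() stops at the raw bytes, which mirrors the
# Python 2 result shown in the question.
raw = unquote_to_bytes(quoted)

print(town)
print(raw)
```

If the escapes were produced from a different encoding, unquote accepts an encoding argument.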

You're halfway there. Take that result and decode it as UTF-8.

Looks like URL encoding

Related

Python encoding problem when reading but not when typing

I'm reading some strings from a text file.
Some of these strings have some "strange" characters, e.g. "\xc3\xa9comiam".
If I copy that string and paste it into a variable, I can convert it to readable characters:
string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam
but if I read it from the file, it doesn't work:
with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam
It seems the solution must be pretty easy, but I can't find it.
What can I do?
Thanks!
Those are not Unicode escapes. As its name suggests, raw_unicode_escape handles sequences like \u00e9 but not \xe9.
What you have is a UTF-8 encoded sequence. The way to decode that is to get it into a bytes sequence, which can then be decoded to a Unicode string.
# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))
The 'latin-1' trick is a dirty secret which simply converts every byte to a character with the same character code.
For your file, you could open it in binary mode so you don't have to explicitly convert it to bytes, or you could simply apply the same conversion to the strings you read.
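
Both options can be sketched roughly as follows. This assumes the file really contains UTF-8 bytes; the helper names and the temporary demo file are made up for illustration:

```python
import os
import tempfile


def fix_mojibake(s):
    """Reinterpret a str whose code points are really UTF-8 byte values."""
    # latin-1 maps code points 0-255 one-to-one onto byte values.
    return s.encode('latin-1').decode('utf-8')


def read_utf8_lines(path):
    """Read a file in binary mode and decode every line as UTF-8."""
    with open(path, 'rb') as f:
        return [raw.decode('utf-8').rstrip('\r\n') for raw in f]


# Demo file holding the UTF-8 bytes for "écomiam" (illustrative only).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'\xc3\xa9comiam\n')
    demo_path = tmp.name

lines = read_utf8_lines(demo_path)
os.remove(demo_path)

repaired = fix_mojibake("\xc3\xa9comiam")
print(lines, repaired)
```

Passing encoding='utf-8' to open() in text mode would have the same effect as decoding each binary line by hand.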
Thanks everyone for your help,
I think I've found a solution (not very elegant, but it does the trick).
print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
Thanks!
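
The output in the question suggests the file holds the literal characters backslash, x, c, 3 and so on, rather than the bytes themselves, which would explain why the unicode_escape step is needed. A minimal sketch under that assumption (the sample string stands in for a line read from the file):

```python
# A raw string literal keeps the backslashes, like text read from the file.
line = r"\xc3\xa9comiam"

# 1. unicode_escape interprets the \xNN escapes, giving U+00C3 and U+00A9.
unescaped = line.encode('ascii').decode('unicode_escape')

# 2. latin-1 turns those code points back into the bytes C3 A9.
# 3. Those bytes are then decoded as the UTF-8 they really are.
text = unescaped.encode('latin-1').decode('utf-8')

print(text)
```

The encode('ascii') step will raise if the line already contains non-ASCII characters, so this only fits files that are escaped throughout.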

Decoding a string with python with portuguese characters

So I have this text that I pulled from the internet, and some of the words are not using the correct characters, like this one: "experiências". Is there any function in Python that could take strings like that and turn them into the correct Portuguese version, like experiência?
Thanks !
What you "pulled" was UTF-8 text that had been wrongly decoded with a Western European encoding, probably CP1252. You must encode it back to a bytes object and then decode it correctly.
"experiências".encode("cp1252").decode()
# 'experiências'
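
When some strings are damaged and others are not, the round trip can be wrapped so that clean text passes through unchanged. A hedged sketch, assuming CP1252 was the wrong codec; the function name is made up:

```python
def repair_cp1252_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as CP1252.

    Returns the input unchanged when the round trip is not possible,
    which usually means the text was not damaged in this way.
    """
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


damaged = 'experi\u00c3\u00aancias'   # how "experiências" looks when damaged
clean = 'experi\u00eancias'           # the correct spelling

print(repair_cp1252_mojibake(damaged))
print(repair_cp1252_mojibake(clean))
```

Clean text survives because a lone byte such as 0xEA followed by an ASCII letter is not valid UTF-8, so the decode fails and the original string is returned.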

Python 3: How to convert a bytearray to an ASCII string

I have the following bytearray
bytearray(b'S\x00t\x00a\x00n\x00d\x00a\x00r\x00d\x00F\x00i\x00r\x00m\x00a\x00t\x00a\x00.\x00i\x00n\x00o\x00')
It should spell out StandardFirmata.ino however, I can't figure out how to decode it.
Here is what I have tried:
print(str(board.sysex_list)) #Appears to just return a string that looks identical
print(board.sysex_list.decode()) # Returns just S
Is there a simple way to do this?
Wrong encoding.
>>> bytearray(b'S\x00t\x00a\x00n\x00d\x00a\x00r\x00d\x00F\x00i\x00r\x00m\x00a\x00t\x00a\x00.\x00i\x00n\x00o\x00').decode('utf-16le')
'StandardFirmata.ino'
But that's not ASCII.
The issue was that I was not specifying an encoding. All I had to do was change decode() to decode('utf-16-le').
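
Put together as a small script, using the bytearray from the question (the variable names are illustrative):

```python
data = bytearray(
    b'S\x00t\x00a\x00n\x00d\x00a\x00r\x00d\x00F\x00i\x00r\x00m\x00a\x00'
    b't\x00a\x00.\x00i\x00n\x00o\x00'
)

# Every character is followed by a zero byte, which is what
# little-endian UTF-16 looks like for plain ASCII text.
name = data.decode('utf-16-le')

# If ASCII bytes are really needed, encode the resulting str.
ascii_bytes = name.encode('ascii')

print(name)
print(ascii_bytes)
```

A plain decode() uses UTF-8 and keeps the NUL characters in the result, which is probably why only "S" appeared when it was printed.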

Reading JSON: what encoding is "\u00c5\u0082"? How do I get it to a unicode object?

One of the values in a JSON file I'm parsing is Wroc\u00c5\u0082aw. How can I turn this string into a unicode object that yields "Wrocław" (which is the correct decoding in this case)?
It looks like whatever process generated that JSON took UTF-8-encoded text and mistook it for Latin-1-encoded text. To fix the error, run the same process in reverse:
>>> u'Wroc\u00c5\u0082aw'.encode('iso-8859-1').decode('utf-8')
u'Wroc\u0142aw'
>>> import unicodedata
>>> unicodedata.name(u'\u0142')
'LATIN SMALL LETTER L WITH STROKE'
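
The same repair on Python 3, starting from the JSON text itself. A minimal sketch; the one-value JSON document is a stand-in for the real file:

```python
import json

# Raw string literal, so the \uXXXX escapes reach the JSON parser intact.
json_text = r'"Wroc\u00c5\u0082aw"'

value = json.loads(json_text)          # str containing U+00C5 and U+0082

# Reverse the mistake: back to the original bytes, then decode as UTF-8.
fixed = value.encode('iso-8859-1').decode('utf-8')

print(fixed)
```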
It looks like your JSON doesn't have the right encoding, because neither \u00c5 nor \u0082aw yields the characters you're expecting in any encoding.
But you could perhaps try encoding this value as UTF-8 or UTF-16.

Python JSON New York Times API

Really new to Python and getting data from the web, so here it goes.
I have been able to pull data from the NYT api and parse the JSON output into a CSV file. However, depending on my search, I may get the following error when I attempt to write a row to the CSV.
UnicodeEncodeError: 'charmap' codec can't encode characters in position 20-21: character maps to <undefined>
This URL has the data that I am trying to parse into a CSV. (I de-selected "Print pretty results")
I am pretty sure the error is occurring near title:"Spitzer......."
I have tried to search the web, but I can't seem to get an answer. I don't know a lot about encoding, but I am guessing the data I retrieve from the JSON records is encoded in some way.
Any help you can provide will be greatly appreciated.
Many thanks in advance,
Brock
You need to check your HTTP headers to see what char encoding they are using when returning the results. My bet is that everything is encoded as utf-8 and when you try to write to CSV, you are implicitly encoding output as ascii.
The ' they are using is not in the ascii char set. You can catch the UnicodeError exception.
Follow the golden rules of encodings:
Decode early into unicode (data.decode('utf-8', 'ignore')).
Use unicode internally.
Encode late, during output (data.encode('ascii', 'ignore')).
You can probably set your CSV writer to use utf-8 encodings when writing.
Note: You should really see what encoding they are giving you before blindly using utf-8 for everything.
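
On Python 3, one way to make the CSV writer use UTF-8 is to open the output file with an explicit encoding. A rough sketch; the row contents and the temporary file are made up for illustration:

```python
import csv
import os
import tempfile

rows = [
    ['title', 'section'],
    ['Spitzer\u2019s Return', 'Business'],  # made-up row with a curly apostrophe
]

fd, csv_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)

# An explicit encoding avoids the platform default (often 'charmap' on
# Windows); newline='' is what the csv module expects for its files.
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read the file back with the same encoding to check the round trip.
with open(csv_path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

os.remove(csv_path)
print(read_back)
```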
Every piece of textual data is encoded. It's hard to tell what the problem is without any code, so the only advice I can give now is: Try decoding the response before parsing it ...
resp = do_request()
## look on the nyt site if they mention the encoding used and use it instead.
decoded = resp.decode('utf-8')
parsed = parse( decoded )
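
With the standard json module, that outline might look like the sketch below. The raw_body value is a made-up stand-in for the bytes of a real response, and UTF-8 is an assumption to be checked against the response headers:

```python
import json

# Stand-in for the raw body of an HTTP response (made up for illustration).
raw_body = b'{"title": "Caf\xc3\xa9 society"}'

decoded = raw_body.decode('utf-8')   # bytes -> str, decode early
parsed = json.loads(decoded)         # str -> Python objects

print(parsed['title'])
```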
It appears to be trying to decode '\/', which the JSON uses wherever a slash appears. This can be avoided by using the str function.
str('http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html')
'http:\\/\\/www.nytimes.com\\/2010\\/02\\/17\\/business\\/global\\/17barclays.html'
from there you can use replace.
str('http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html').replace('\\', "")
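
If the value goes through a JSON parser, the escaped slashes are handled automatically, because \/ is a valid JSON escape for /. A minimal sketch with the standard json module; wrapping the URL in quotes to form a one-value JSON document is only for illustration:

```python
import json

# Raw string literal so the backslashes reach the parser unchanged.
raw_value = r'"http:\/\/www.nytimes.com\/2010\/02\/17\/business\/global\/17barclays.html"'

url = json.loads(raw_value)
print(url)
```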
Be careful about nytimes API -- it does not provide you the full body text.
