Encoding/converting a string with spanish or polish characters

Encoding/converting a string with spanish or polish characters - python

1) How do I convert a variable with a string like "wdzi\xc4\x99czno\xc5\x9bci" into "wdzięczności"?
2) Also how do I convert string variable with characters like "Â±", "Ä™", "Ä†" into correct letters?
I emphasise "variable" because all I've got from googling was examples with " u'some string' " and the like and I can't get anything like that to work.
I use "# -*- coding: utf-8 -*-" in second line of my script and I still crash into these problems.
Also I was said that simple print should output correctly - but it does not.

In Python 2.7 IDLE, I get this output:
>>> print "wdzi\xc4\x99czno\xc5\x9bci".decode('utf-8')
wdzięczności
Your first string appears to be a UTF-8 byte string, so all that's necessary is to decode it into a Unicode string. When Python prints that string, it will encode it back to the proper encoding based on your environment.
If you're using Python 3 then you have a string that has been decoded improperly and will need a little more work to fix the damage.
>>> print("wdzi\xc4\x99czno\xc5\x9bci".encode('iso-8859-1').decode('utf-8'))
wdzięczności

Related

Spanish characters not being displayed on the terminal in python

I downloaded Spanish text from NLTK in python using
spanish_sents=nltk.corpus.floresta.sents()
when printing the sentences in the terminal the corresponding Spanish characters
are not rendered. For example printing spanish_sents[1] produces characters like u'\xe9' and if I encode it using utf-8 as in
print [x.encode("utf-8") for x in sapnish_sents[1]]
it produces '\xc3\xa9' and encoding in latin3
print [x.encode("latin3") for x in sapnish_sents[1]]
it produces '\xe9'
How can I configure my terminal to print the glyphs for these points? Thanks

Just an initial remark, Latin3 or ISO-8859-3 is indeed denoted as South European, but it was designed to cover Turkish, Maltese and Esperanto. Spanish is more commonly encoded in Latin1 (ISO-8859-1 or West European) or Latin9 (ISO-8859-15).
I can confirm that the letter é has the unicode code point U+00E9, and is represented as '\xe9' in both Latin1 and Latin3. And it is encoded as '\xc3\xc9' in UTF8, so all your conversions are correct.
But the real question How can I configure my terminal... ? is hard to answer without knowing what the terminal is...
if it is a true teletype or old vt100 without accented characters: you cannot (but I do not think you use that...)
if you use a Windows console, declare the codepage 1252 (very near to Latin1): chcp 1252 and use Latin1 encoding (or even better 'cp1252')
if you use xterm (or any derivative) on Linux or any other Unix or Unix-like, declare an utf8 charset with export LANG=en_US.UTF8 (choose your own language if you do not like american english, the interesting part here is .UTF8) and use UTF8 encoding - alternatively declare a iso-8859-1 charset (export LANG=en_US.ISO-8859-1) and use Latin1 encoding

What you are looking at, is the representation of strings, because printing lists is only for debugging purposes.
For printing lists, use .join:
print ', '.join(sapnish_sents[1])

My guess is that there are a few things going on. First, you're iterating through a str (is sapnish_sents[1] one entire entry? What happens when you print that). Second, you're not getting full characters because you're iterating through a str (a unicode character takes more "space" than an ASCII character, so addressing a single index will look weird). Third you are trying to encode when you probably mean to decode.
Try this:
print sapnish_sents[1].decode('utf-8')
I just ran the following in my terminal to help give context:
>>> a = '®†\¨ˆø' # Storing non-ASCII characters in a str is ill-advised;
# I do this as an example because it's what I think your question is
# really asking
>>> a # a now looks like a bunch of gibberish if I just output
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> print a # Well, this looks normal.
®†\¨ˆø
>>> print repr(a) # Just demonstrating how the above works
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> a[0] # We're only looking at one character, which is represented by all this stuff.
'\xc2'
>>> print a[0] # But because it's not a complete unicode character, the terminal balks
?
>>> print a.decode('utf-8') # Look familiar?
®†\¨ˆø
>>> print a.decode('utf-8')[0] # Our first character!
®

Python's .format() minilanguage and Unicode

I'm trying to use some of the simple unicode characters in a command line program I'm writing, but drawing these things into a table becomes difficult because Python appears to be treating single-character symbols as multi-character strings.
For example, if I try to print(u"\u2714".encode("utf-8")) I see the unicode checkmark. However, if I try to add some padding to that character (as one might in tabular structure), Python seems to be interpreting this single-character string as a 3-character one. All three of these lines print the same thing:
print("|{:1}|".format(u"\u2714".encode("utf-8")))
print("|{:2}|".format(u"\u2714".encode("utf-8")))
print("|{:3}|".format(u"\u2714".encode("utf-8")))
Now I think I understand why this is happening: it's a multibyte string. My question is, how do I get Python to pad this string appropriately?

Make your format strings unicode:
from __future__ import print_function
print(u"|{:1}|".format(u"\u2714"))
print(u"|{:2}|".format(u"\u2714"))
print(u"|{:3}|".format(u"\u2714"))
outputs:
|✔|
|✔ |
|✔ |

Don't encode('utf-8') at that point do it latter:
>>> u"\u2714".encode("utf-8")
'\xe2\x9c\x94'
The UTF-8 encoding is three bytes long. Look at how format works with Unicode strings:
>>> u"|{:1}|".format(u"\u2714")
u'|\u2714|'
>>> u"|{:2}|".format(u"\u2714")
u'|\u2714 |'
>>> u"|{:3}|".format(u"\u2714")
u'|\u2714 |'
Tested on Python 2.7.3.

Converting utf-8 encoded string to just plain text in python 3

So I've been getting all caught up in unicode and utf-8 as i have a script which grabs images and their titles off the web. Works great, except when their title has special characters (eg. Jökulsárlón.)
it comes out as unicode :-
J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n
So i want a way to turn that string into plain text- whether is turning them into nearest 'normal' letters (like plain o instead of ö) or printing those actual symbols (rather than \xc3 etc.) I've tried a billion different ways, but a lot of the things i've been reading havent worked for me in python 3.
Thanks in advance

It's indeed UTF-8 but they're bytes:
>>> b = b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b
b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b.decode('utf-8')
'Jökulsárlón'
As this is Python 3.x, this is a Unicode string.

J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n is not unicode. It may be UTF-8 though.
To turn them into Unicode you have to decode them. s.decode('utf-8') if it were UTF-8, for example.
Before printing or writing you have to encode them again. If you encode to ASCII, the encode method accepts an option that tells it what to do with code points that cannot be represented in the given encoding.
For example: print(s.encode('ascii', errors='ignore')
errors accepts more options.

If your string is <class 'str'> and it prints literally J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n, then the last line below will decode it:
>>> s='J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> type(s)
<class 'str'>
>>> s
'J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> s.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
'Jökulsárlón'
How it got that convoluted is unknown. If this isn't the solution, then update your question with the type of the variable holding the string (type(s) for example) and the exact value as shown above for my example.

How to transliterate Cyrillic to Latin using Python 2.7? - not correct translation output

I am trying to transliterate Cyrillic to Latin from an excel file. I am working from the bottom up and can not figure out why this isn't working.
When I try to translate a simple text string, Python outputs "EEEEE EEE" instead of the correct translation. How can I fix this to give me the right translation?? I have been trying to figure this out all day!
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'Добрый Ден'
print text.translate(tr)
>>EEEEEE EEE
I appreciate the help!

Your source input is wrong. However you entered your source and text literals, Python did not read the right unicode codepoints.
Instead, I strongly suspect something like the PYTHONIOENCODING variable has been set with the error handler set to replace. This causes Python to replace all codepoints that it does not recognize with question marks. All cyrillic input is treated as not-recognized.
As a result, the only codepoint in your translation map is 63, the question mark, mapped to the last character in symbols[1] (which is expected behaviour for the dictionary comprehension with only one unique key):
>>> unichr(63)
u'?'
>>> unichr(69)
u'E'
The same problem applies to your text unicode string; it too consists of only question marks. The translation mapping replaces each with the letter E:
>>> u'?????? ???'.translate({63, 69})
u'EEEEEE EEE'
You need to either avoid entering Cyrillic literal characters or fix your input method.
In the terminal, this is a function of the codec your terminal (or windows console) supports. Configure the correct codepage (windows) or locale (POSIX systems) to input and output an encoding that supports Cyrillic (UTF-8 would be best).
In a Python source file, tell Python about the encoding used for string literals with a codec comment at the top of the file.
Avoiding literals means using Unicode escape sequences:
symbols = (
u'\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0437\u0438\u0439\u043a\u043b\u043c'
u'\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u044a\u044b\u044c\u044d'
u'\u0410\u0411\u0412\u0413\u0414\u0415\u0401\u0417\u0418\u0419\u041a\u041b\u041c'
u'\u041d\u041e\u041f\u0420\u0421\u0422\u0423\u0424\u0425\u042a\u042b\u042c\u042d',
u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"
)
tr = {ord(a):ord(b) for a, b in zip(*symbols)}
text = u'\u0414\u043e\u0431\u0440\u044b\u0439 \u0414\u0435\u043d'
print text.translate(tr)

Python unicode string with UTF-8?

I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e, both are valid characters in their own right, u'\xc3\xb3' == Ã³, but they're not what's supposed to be there)
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?

a) Try to put it through the method below.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'

You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Encoding/converting a string with spanish or polish characters - python

Related

Spanish characters not being displayed on the terminal in python

Python's .format() minilanguage and Unicode

Converting utf-8 encoded string to just plain text in python 3

How to transliterate Cyrillic to Latin using Python 2.7? - not correct translation output

Python unicode string with UTF-8?

Categories

Resources