I used lxml to parse some web page as below:
>>> doc = lxml.html.fromstring(htmldata)
>>> element in doc.cssselect(sometag)[0]
>>> text = element.text_content()
>>> print text
u'Waldenstr\xf6m'
Why it prints u'Waldenstr\xf6m' but not "Waldenström" here?
After that, I tried to add this text to a MySQL table with UTF-8 character set and utf8_general_ci collatio, Users is a Django model:
>>> Users.objects.create(last_name=text)
'ascii' codec can't encode character u'\xf6' in position 9: ordinal not in range(128)
What I was doing wrong here? How can I get the the correct data "Waldenström" and write it to database?
you want text.encode('utf8')
>>> print text
u'Waldenstr\xf6m'
There is a difference between displaying something in the shell (which uses the repr) and printing it (which just spits out the string):
>>> u'Waldenstr\xf6m'
u'Waldenstr\xf6m'
>>> print u'Waldenstr\xf6m'
Waldenström
So, I'm not sure your snippet above is really what happened. If it definitely is, then your XHTML must contain exactly that string:
<div class="something">u'Waldenstr\xf6m'</div>
(maybe it was incorrectly generated by Python using a string's repr() instead of its str()?)
If this is right and intentional, you would need to parse that Python string literal into a simple string. One way of doing that would be:
>>> r= r"u'Waldenstr\xf6m'"
>>> print r[2:-1].decode('unicode-escape')
Waldenström
If the snippet at the top is actually not quite right and you are simply asking why Python's repr escapes all non-ASCII characters, the answer is that printing non-ASCII to the console is unreliable across various environments so the escape is safer. In the above examples you might have received ?s or worse instead of the ö if you were unlucky.
In Python 3 this changes:
>>> 'Waldenstr\xf6m'
'Waldenström'
Related
For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003foo\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-saavy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure if the newer version of Python changed the unicode name, but here only worked with the underscore.
Anyway, this is it.
At some point you will run into issues when you encounter special characters like Chinese characters or emoticons in a string you want to decode i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
>>> <foo>
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string will ensure that if the user enters ' \\" ' (spaces added for visual clarity) in the string it would not disrupt the evaluator;
The dash at the end is a failsafe in case the user's string ends with a ' \" ' . Before we assign the result we slice the inserted dash with [:-1]
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
I downloaded Spanish text from NLTK in python using
spanish_sents=nltk.corpus.floresta.sents()
when printing the sentences in the terminal the corresponding Spanish characters
are not rendered. For example printing spanish_sents[1] produces characters like u'\xe9' and if I encode it using utf-8 as in
print [x.encode("utf-8") for x in sapnish_sents[1]]
it produces '\xc3\xa9' and encoding in latin3
print [x.encode("latin3") for x in sapnish_sents[1]]
it produces '\xe9'
How can I configure my terminal to print the glyphs for these points? Thanks
Just an initial remark, Latin3 or ISO-8859-3 is indeed denoted as South European, but it was designed to cover Turkish, Maltese and Esperanto. Spanish is more commonly encoded in Latin1 (ISO-8859-1 or West European) or Latin9 (ISO-8859-15).
I can confirm that the letter é has the unicode code point U+00E9, and is represented as '\xe9' in both Latin1 and Latin3. And it is encoded as '\xc3\xc9' in UTF8, so all your conversions are correct.
But the real question How can I configure my terminal... ? is hard to answer without knowing what the terminal is...
if it is a true teletype or old vt100 without accented characters: you cannot (but I do not think you use that...)
if you use a Windows console, declare the codepage 1252 (very near to Latin1): chcp 1252 and use Latin1 encoding (or even better 'cp1252')
if you use xterm (or any derivative) on Linux or any other Unix or Unix-like, declare an utf8 charset with export LANG=en_US.UTF8 (choose your own language if you do not like american english, the interesting part here is .UTF8) and use UTF8 encoding - alternatively declare a iso-8859-1 charset (export LANG=en_US.ISO-8859-1) and use Latin1 encoding
What you are looking at, is the representation of strings, because printing lists is only for debugging purposes.
For printing lists, use .join:
print ', '.join(sapnish_sents[1])
My guess is that there are a few things going on. First, you're iterating through a str (is sapnish_sents[1] one entire entry? What happens when you print that). Second, you're not getting full characters because you're iterating through a str (a unicode character takes more "space" than an ASCII character, so addressing a single index will look weird). Third you are trying to encode when you probably mean to decode.
Try this:
print sapnish_sents[1].decode('utf-8')
I just ran the following in my terminal to help give context:
>>> a = '®†\¨ˆø' # Storing non-ASCII characters in a str is ill-advised;
# I do this as an example because it's what I think your question is
# really asking
>>> a # a now looks like a bunch of gibberish if I just output
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> print a # Well, this looks normal.
®†\¨ˆø
>>> print repr(a) # Just demonstrating how the above works
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> a[0] # We're only looking at one character, which is represented by all this stuff.
'\xc2'
>>> print a[0] # But because it's not a complete unicode character, the terminal balks
?
>>> print a.decode('utf-8') # Look familiar?
®†\¨ˆø
>>> print a.decode('utf-8')[0] # Our first character!
®
So I've been getting all caught up in unicode and utf-8 as i have a script which grabs images and their titles off the web. Works great, except when their title has special characters (eg. Jökulsárlón.)
it comes out as unicode :-
J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n
So i want a way to turn that string into plain text- whether is turning them into nearest 'normal' letters (like plain o instead of ö) or printing those actual symbols (rather than \xc3 etc.) I've tried a billion different ways, but a lot of the things i've been reading havent worked for me in python 3.
Thanks in advance
It's indeed UTF-8 but they're bytes:
>>> b = b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b
b'J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n'
>>> b.decode('utf-8')
'Jökulsárlón'
As this is Python 3.x, this is a Unicode string.
J\xc3\xb6kuls\xc3\xa1rl\xc3\xb3n is not unicode. It may be UTF-8 though.
To turn them into Unicode you have to decode them. s.decode('utf-8') if it were UTF-8, for example.
Before printing or writing you have to encode them again. If you encode to ASCII, the encode method accepts an option that tells it what to do with code points that cannot be represented in the given encoding.
For example: print(s.encode('ascii', errors='ignore')
errors accepts more options.
If your string is <class 'str'> and it prints literally J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n, then the last line below will decode it:
>>> s='J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> type(s)
<class 'str'>
>>> s
'J\\xc3\\xb6kuls\\xc3\\xa1rl\\xc3\\xb3n'
>>> s.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
'Jökulsárlón'
How it got that convoluted is unknown. If this isn't the solution, then update your question with the type of the variable holding the string (type(s) for example) and the exact value as shown above for my example.
I'm using Python 2.6.5 and when I run the following in the Python shell, I get:
>>> print u'Andr\xc3\xa9'
André
>>> print 'Andr\xc3\xa9'
André
>>>
What's the explanation for the above? Given u'Andr\xc3\xa9', how can I display the above value properly in an html page so that it shows André instead of André?
'\xc3\xa9' is the UTF-8 encoding of the unicode character u'\u00e9' (which can also be specified as u'\xe9'). So you can use u'Andr\u00e9' or u'Andr\xe9'.
You can convert from one to the other:
>>> 'Andr\xc3\xa9'.decode('utf-8')
u'Andr\xe9'
>>> u'Andr\xe9'.encode('utf-8')
'Andr\xc3\xa9'
Note that the reason print 'Andr\xc3\xa9' gave you the expected result is only because your system's default encoding is UTF-8. For example, on Windows I get:
>>> print 'Andr\xc3\xa9'
André
As for outputting HTML, it depends on which web framework you use and what encoding you output in the HTML page. Some frameworks (e.g. Django) will convert unicode values to the correct encoding automatically, while others will require you to do so manually.
Try this:
>>> unicode('Andr\xc3\xa9', 'utf-8')
u'Andr\xe9'
>>> print u'Andr\xe9'
André
That may answer your question.
EDIT: or see the above answer
I am not sure, but I would guess that different codecs are applied by the print operation. Probably some utf-8 vs. unicode issue.
For HTML, you would need to encode certain characters using the HTML syntax for unicode.
I think that the Python codecs module might be able to help you.
When I tried to get the content of a tag using "unicode(head.contents[3])" i get the output similar to this: "Christensen Sk\xf6ld". I want the escape sequence to be returned as string. How to do it in python?
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is uncode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use de "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
I suspect that it's acutally working correctly. By default, Python displays strings in ASCII encoding, since not all terminals support unicode. If you actually print the string, though, it should work. See the following example:
>>> u'\xcfa'
u'\xcfa'
>>> print u'\xcfa'
Ïa
Given a byte string with Unicode escapes b"\N{SNOWMAN}", b"\N{SNOWMAN}".decode('unicode-escape) will produce the expected Unicode string u'\u2603'.