I'm trying to run the following:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
But I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
6987: ordinal not in range(128)
From what I've found online, this should be because the encoding needs to be set to utf-8, but my issue is that it already is utf-8.
sys.getdefaultencoding()
Out[43]: 'utf-8'
Also, it looks like my file is in utf-8, so I'm really confused.
Also, the following code works :
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
Is there a way to solve this?
Thanks!
EDIT:
When I run the code in my console it works, but not when I run it in Spyder, as provided by Anaconda (https://www.continuum.io/downloads).
Do you know what could be going wrong?
The text file contains some non-ASCII characters on a line somewhere. Somehow, on your setup, the default file encoding is set to ascii instead of utf-8, so do the following and specify the file's encoding explicitly:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line.strip()) for line in open(path, encoding="utf-8")]
(Doing this is a good idea anyway even when the default works)
I tried to run this program with one additional line at the top:
# -*- coding: utf-8 -*-
It fetches the lines and shows the output (with u''-prefixed strings; a conversion is probably required after this, as sketched below), but it didn't throw the error you mentioned.
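For Python 2, here is a minimal sketch of the kind of conversion meant above, assuming the records parse into dicts of unicode values (the 'tz' key is only an illustrative guess at the file's contents):

# -*- coding: utf-8 -*-
import json

path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]

# json.loads() returns unicode strings on Python 2; encode them to
# UTF-8 bytes only at the point where they are printed or written out.
for rec in records[:5]:
    value = rec.get('tz', u'')      # 'tz' is a hypothetical key for illustration
    print(value.encode('utf-8'))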
This code should write some text to file.
When I'm trying to write my text to the console, everything works. But when I try to write the text into the file, I get UnicodeEncodeError. I know that this is a common problem which can be solved using a proper decode or encode, but I tried it and I am still getting the same UnicodeEncodeError. What am I doing wrong?
I've attached an example.
print "(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2])
(StarBuy s.r.o.,Inzertujte s foto, auto-moto, oblečenie, reality, prácu, zvieratá, starožitnosti, dovolenky, nábytok, všetko pre deti, obuv, stroj....
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".decode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u010d' in position 50: ordinal not in range(128)
Where could be the problem?
To write Unicode text to a file, you could use the io.open() function:
#!/usr/bin/env python
from io import open
with open('utf8.txt', 'w', encoding='utf-8') as file:
    file.write(u'\u010d')
io.open() is the built-in open() on Python 3.
Note: you should not use the binary file mode ('b') if you want to write text.
The # coding: utf8 comment that defines the source code encoding has nothing to do with it.
If you see sys.setdefaultencoding() outside of site.py or Python's own tests, assume the code is broken.
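For comparison, on Python 3 the built-in open() accepts the same encoding argument, so the equivalent is simply:

#!/usr/bin/env python3
# Python 3: open() is io.open(), so the encoding can be passed directly
# and unicode literals need no u prefix.
with open('utf8.txt', 'w', encoding='utf-8') as file:
    file.write('\u010d')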
@ned-batchelder is right. You have to declare that the system default encoding is "utf-8". The coding comment # -*- coding: utf-8 -*- doesn't do this.
To declare the system default encoding, you have to import the sys module and call sys.setdefaultencoding('utf-8'). However, sys is imported at interpreter startup and site.py removes its setdefaultencoding method, so you have to reload the module before you can call it.
So you will need to add the following code at the beginning:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
You may need to explicitly declare that Python should use UTF-8 encoding.
The answer to this SO question explains how to do that: Declaring Encoding in Python
For Python 2:
Declare document encoding on top of the file (if not done yet):
# -*- coding: utf-8 -*-
Replace .decode with .encode:
with open("test.txt","wb") as f:
f.write("(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)".encode("utf-8")%(dict.get('name'),dict.get('description'),dict.get('ico'),dict.get('city'),dict.get('ulCislo'),dict.get('psc'),dict.get('weby'),dict.get('telefony'),dict.get('mobily'),dict.get('faxy'),dict.get('emaily'),dict.get('dic'),dict.get('ic_dph'),dict.get('kategorie')[0],dict.get('kategorie')[1],dict.get('kategorie')[2]))
I am trying to write about 1000 rows to a .xlsx file from my Python application. The data is basically a combination of integers and strings. I am getting an intermittent error while running the wbook.close() command. The error is the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in Unicode, so I am wondering why the decoder is being invoked at all. Has anyone noticed this problem?
0xc3 is "À". So what you need to do is change the encoding. Use the decode() method.
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
As Alex Hristov points out, you have some non-ASCII data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
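Here is a minimal sketch of the Python 2 pattern those examples describe: decode any UTF-8 byte strings to unicode before handing them to worksheet.write() (the file name and sample data below are made up for illustration):

# -*- coding: utf-8 -*-
import xlsxwriter

workbook = xlsxwriter.Workbook('unicode_demo.xlsx')   # hypothetical file name
worksheet = workbook.add_worksheet()

rows = ['plain ascii', 'oblečenie', 'nábytok']        # UTF-8 byte strings
for row_num, value in enumerate(rows):
    # Decode byte strings to unicode before writing (Python 2);
    # on Python 3 every str is already unicode and needs no decoding.
    if isinstance(value, str):
        value = value.decode('utf-8')
    worksheet.write(row_num, 0, value)

workbook.close()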
I am reading a page source from a webpage, then parsing a value from that source.
There I am facing a problem with special characters.
In my Python controller file I am using # -*- coding: utf-8 -*-.
But the webpage source I am reading uses charset=iso-8859-1.
So when I read the page content without specifying any encoding, it throws an error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte
When I use string.decode("iso-8859-1").encode("utf-8"), it parses the data without any error, but it displays the value as 'F\u00fcnke' instead of 'Fünke'.
Please let me know how I can solve this issue.
I would greatly appreciate any suggestions.
Encoding is a PITA in Python 3 for sure (and in Python 2 in some cases as well).
Try checking these links out, they might help you:
Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)
http://docs.python.org/2/library/codecs.html
Also, it would be nice to see the code behind "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use utf-8 (on Windows, for instance). Your # -*- coding: utf-8 -*- only tells Python what encoding to expect within the source code itself, not in the data the code is going to parse or analyze.
For instance, if I write:
# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, I print out the time)
print(time.strftime('%H:%M:%S'))
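Tying that back to the question: decode the downloaded bytes with the charset the page actually declares, and only encode again at the point where you write bytes out. A minimal Python 2 sketch, where the URL is just a placeholder:

# -*- coding: utf-8 -*-
import urllib2

url = 'http://example.com/page-declaring-iso-8859-1'   # placeholder URL
raw_bytes = urllib2.urlopen(url).read()

# The page says charset=iso-8859-1, so decode with that codec, not utf-8.
text = raw_bytes.decode('iso-8859-1')                  # unicode, e.g. u'F\xfcnke'

print(text.encode('utf-8'))                            # encode only for output

If a value still shows up as 'F\u00fcnke', check whether you are printing a list or dict rather than the string itself: their repr() shows unicode escapes even when the underlying data is correct.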
I am getting
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 104: ordinal not in range(128)
I am using IntegerProperty, StringProperty, DateTimeProperty.
That's because 0xb0 (decimal 176) is not a valid character code in ASCII (which defines only values between 0 and 127).
Check where you got that string from and use the proper encoding.
If you need further help, post the code.
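For example, 0xb0 is the degree sign in Latin-1, so one hedged guess at the fix is to decode the byte string with the encoding it actually arrived in (latin-1 here is an assumption):

# Python 2: mixing this byte string with unicode triggers an implicit
# ascii decode, which fails on 0xb0.
raw = 'temperature: 20\xb0C'

# Decoding with the real source encoding gives a proper unicode object.
text = raw.decode('latin-1')        # u'temperature: 20\xb0C', i.e. 20°C
print(text.encode('utf-8'))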
You are trying to put Unicode data (probably text with accents) into an ASCII string.
You can use Python's codecs module to open a text file with UTF-8 encoding and write the Unicode data to it.
The .encode method may also help (for example, u"õ".encode('utf-8')).
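A minimal sketch of both suggestions together (the file name is arbitrary):

# -*- coding: utf-8 -*-
import codecs

# codecs.open() returns a file object that accepts unicode directly
# and encodes it to UTF-8 on write.
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(u'õ')

# Alternatively, encode to bytes yourself before writing or printing.
print(u'õ'.encode('utf-8'))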
Python defaults to ASCII encoding - if you are dealing with chars outside of the ASCII range, you need to specify that in your code.
One way to do this is by defining the encoding at the top of your code.
This snippet sets the source encoding at the top of the file to Latin-1 (which includes 0xb0):
#!/usr/bin/python
# -*- coding: latin-1 -*-
import os, sys
...
See PEP 263 for more info on source encoding declarations.
When I write my foreign-language "flashcard" programs, I always use Python 3.x, as its native encoding is UTF-8. Your encoding problems will generally be far less frequent.
If you're working on a program that many people will share, however, you may want to consider using encode and decode with Python 2.x, but only when storing and retrieving data elements in persistent storage. Encode your non-ASCII characters, silently manipulate those encoded byte representations in memory, and save them in that encoded form. Finally, use decode when fetching the strings from persistent storage, but for end-user display only. This eliminates the need to constantly encode and decode your strings throughout the program.
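A minimal sketch of that encode-on-store / decode-on-fetch pattern in Python 2 (the file name is arbitrary):

# -*- coding: utf-8 -*-

card = u'caf\xe9'                        # unicode while in memory

# Store: encode exactly once, at the persistence boundary.
with open('cards.dat', 'wb') as f:
    f.write(card.encode('utf-8'))

# Fetch: decode exactly once, when reading back for display.
with open('cards.dat', 'rb') as f:
    restored = f.read().decode('utf-8')

print(restored.encode('utf-8'))          # encode again only for console output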
@jcoon also has a pretty standard response to this problem.