Python 3.6.1 - Printing string as human readable text, special characters - python

I'm building a little django 1.1 app (though I believe this issue to be specific to Python) where I've come to use commands to control the flow of getting and categorizing data. I also wish to print a sort of summary using a third command. I am using macOS 10.12.3
My problem comes from getting text data in and printing it to the console or a document using
> or >>
in the console.
I'm running these scripts using an alias of Python 3.6.1
I'm using the Tweepy api, but that should hopefully not be relevant.
These snippets should illustrate the problem I'm hoping to solve:
print(type(data))
print(type(data.text))
try:
    print(data.text)
except UnicodeEncodeError:
    print("no printing today :(")
print(type(data.text.encode('UTF-8')))
print(data.text.encode('UTF-8'))
this outputs:
<class 'tweepy.models.Status'>
<class 'str'>
no printing today :(
<class 'bytes'>
b'kontroll p\xc3\xa5 ... v\xc3\xa5pen.'
The ugly things there should both be the character 'å'.
This is the error that would be thrown:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 223: ordinal not in range(128)
It says 'ascii' codec, but doing (in my Python 3.6.1 script):
print(sys.getdefaultencoding())
outputs:
utf-8
Running
print(sys.getdefaultencoding())
again in Python 2.7.10 outputs:
ascii
So the thrown error matches what 2.7.10 outputs. I am not discounting the possibility that I could be wrong about what the default encoding does.
I have also tried
export LOCALE="no_NB.UTF-8"
in an attempt to see if this could be caused by my system (unless I'm misunderstanding what this does). I did not write this to any file, thinking it would persist through the current session.
Is the wrong encoder being used somehow? Could it be my terminal encoding? How can I write my special characters to the terminal and file? Are strings really this hard to get right?
Any help is greatly appreciated!!

Setting
export LC_ALL=no_NO.UTF-8
export LANG=no_NO.UTF-8
in my .bash_profile now allows me to see the characters I want in my terminal and it is also successfully echoed to a file.

Related

UnicodeEncodeError: 'ascii' codec can't encode character in print function

My company uses a database and I am writing a script that interacts with it. There is already a script that sends a query to the database and returns the results.
I am working in a Unix environment. My script calls that existing script to get some data from the database and redirects the query result to a file. When I then try to read this file, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9741: ordinal not in range(128)
I know that python is not able to read file because of the encoding of the file. The encoding of the file is not ascii that's why the error is coming. I tried checking the encoding of the file and tried reading the file with its own encoding.
The code that I am using is-
os.system("Query.pl \"select title from bug where (ste='KGF-A' AND ( status = 'Not_Approved')) \">patchlet.txt")
encoding_dict3={}
encoding_dict3=chardet.detect(open("patchlet.txt", "rb").read())
print(encoding_dict3)
# Open the patchlet.txt file for storing the last part of titles for latest ACF in a list
with codecs.open("patchlet.txt",encoding='{}'.format(encoding_dict3['encoding'])) as csvFile:
    readCSV = csv.reader(csvFile,delimiter=":")
    for row in readCSV:
        if len(row)!=0:
            if len(row) > 1:
                j=len(row)-1
                patchlets_in_latest.append(row[j])
            elif len(row) ==1:
                patchlets_in_latest.append(row[0])
patchlets_in_latest_list=[]
# calling the strip_list_noempty function for removing newline and whitespace characters
patchlets_in_latest_list=strip_list_noempty(patchlets_in_latest)
# converting list of titles to a set to remove any duplicate entry if present
patchlets_in_latest_set= set(patchlets_in_latest_list)
# Finding duplicate entries in list
duplicates_in_latest=[k for k,v in Counter(patchlets_in_latest_list).items() if v>1]
# Printing imp info for logs
print("list of titles of patchlets in latest list are : ")
for i in patchlets_in_latest_list:
    print(str(i))  # <-- the UnicodeEncodeError is raised here
print("No of patchlets in latest list are : {}".format(str(len(patchlets_in_latest_list))))
Here Query.pl is the Perl script that fetches the query result from the database. The encoding that I am getting for "patchlet.txt" (the file used for storing the result from HSD) is:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
Even when I provide that same encoding for reading the file, I still get the error.
Please help me in resolving this error.
EDIT:
I am using python3.6
EDIT2:
While outputting the result I am getting the error and there is one line in the file which is having some unknown character. The line looks like:
Some failure because of which vtrace cannot be used along with some trace.
I am using gvim, and in gvim the "vtrace" looks like "~Vvtrace". I then checked the database manually for this character. The character is "–", which according to my keyboard is neither a hyphen nor an underscore. These kinds of characters are creating the problem.
Also I am working on linux environment.
EDIT 3:
I have added more code that can help in tracing the error. Also I have highlighted a "print" statement (print(str(i))) where I am getting the error.
Problem
Based on the information in the question, the program is processing non-ASCII input data, but is unable to output non-ASCII data.
Specifically, this code:
for i in patchlets_in_latest_list:
    print(str(i))
Results in this exception:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013'
This behaviour was common in Python2, where calling str on a unicode object would cause Python to try to encode the object as ASCII, resulting in a UnicodeEncodeError if the object contained non-ASCII characters.
In Python3, calling str on a str instance doesn't trigger any encoding. However, calling the print function on a str will encode the str to sys.stdout.encoding. sys.stdout.encoding defaults to the value returned by locale.getpreferredencoding. This is generally determined by your Linux user's locale settings (the LC_ALL, LC_CTYPE and LANG environment variables).
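To see this mechanism in isolation, here is a minimal Python 3 sketch that stands in for a terminal by wrapping an in-memory buffer in io.TextIOWrapper. The stream names and the sample string are made up for illustration. The only point is that print encodes with the encoding of the stream it writes to, not with sys.getdefaultencoding().

```python
import io

text = "Hello \u2013 World"

# A stream that behaves like stdout in a 'C'/'POSIX' (ASCII) locale.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(text, file=ascii_stream)
    ascii_stream.flush()
    ascii_failed = False
except UnicodeEncodeError:
    # Same exception as in the question: the en dash has no ASCII encoding.
    ascii_failed = True

# The same print call succeeds when the stream encodes as UTF-8.
utf8_buffer = io.BytesIO()
utf8_stream = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
print(text, file=utf8_stream)
utf8_stream.flush()
utf8_bytes = utf8_buffer.getvalue()
```

If the first print fails and the second succeeds, the text itself is fine and only the output encoding needs attention.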
Solution
If we assume that your program is not overriding normal encoding behaviour, the problem should be fixed by ensuring that the code is being executed by a Python3 interpreter in a UTF-8 locale.
Be 100% certain that the code is being executed by a Python3 interpreter: print sys.version_info from within the program.
Try setting the PYTHONIOENCODING environment variable when running your script: PYTHONIOENCODING=UTF-8 python3 myscript.py
Check your locale using the locale command in the terminal (or echo $LANG). If it doesn't end in UTF-8, consider changing it. Consult your system administrators if you are on a corporate machine.
If your code runs in a cron job, bear in mind that cron jobs often run with the 'C' or 'POSIX' locale, which could be using ASCII encoding, unless a locale is explicitly set. Likewise, if the script is run under a different user, check their locale settings.
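The checks above can be gathered into a few lines that you drop at the top of the script while debugging. This is only a sketch: the function name describe_io_environment is hypothetical, and it uses nothing beyond the standard sys and locale modules.

```python
import locale
import sys


def describe_io_environment():
    """Collect the values that decide how print() encodes text."""
    return {
        "python_version": tuple(sys.version_info[:3]),
        # Encoding print() actually uses; None if stdout was replaced
        # by an object without an encoding attribute.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale (LC_ALL, LC_CTYPE, LANG).
        "preferred_encoding": locale.getpreferredencoding(False),
        # Unrelated to print(); always 'utf-8' on Python 3.
        "default_encoding": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_io_environment().items():
        print(f"{name}: {value}")
```

Running it once from your shell and once from cron (or as the other user) should show whether stdout_encoding differs between the two environments.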
Workaround
If changing the environment is not feasible, you can work around the problem in Python by encoding to ASCII with an error handler, then decoding back to str.
There are four useful error handlers in your particular situation, their effects are demonstrated with this code:
>>> s = 'Hello \u2013 World'
>>> s
'Hello – World'
>>> handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'namereplace']
>>> print(str(s))
Hello – World
>>> for h in handlers:
...     print(f'Handler: {h}:', s.encode('ascii', errors=h).decode('ascii'))
...
Handler: ignore: Hello  World
Handler: replace: Hello ? World
Handler: xmlcharrefreplace: Hello &#8211; World
Handler: namereplace: Hello \N{EN DASH} World
The ignore and replace handlers lose information: you can't tell which character was dropped or replaced with a question mark.
The xmlcharrefreplace and namereplace handlers do not lose information, but the replacement sequences may make the text less readable to humans.
It's up to you to decide which tradeoff is acceptable for the consumers of your program's output.
If you decided to use the replace handler, you would change your code like this:
for i in patchlets_in_latest_list:
    replaced = i.encode('ascii', errors='replace').decode('ascii')
    print(replaced)
wherever you are printing data that might contain non-ASCII characters.
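If you print in many places, it may be less error-prone to wrap the round trip in one small helper. The sketch below assumes a helper named ascii_safe, which is not part of the original code, and defaults to the namereplace handler so that no information is lost.

```python
def ascii_safe(text, errors="namereplace"):
    """Return text with every non-ASCII character handled by the given error handler."""
    return text.encode("ascii", errors=errors).decode("ascii")


sample = "Hello \u2013 World"
print(ascii_safe(sample, "replace"))
print(ascii_safe(sample, "xmlcharrefreplace"))
print(ascii_safe(sample))
```

In the loop above you would then write print(ascii_safe(i)) and change the handler in a single place if the consumers of the log prefer a different tradeoff.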

How to read excel Unicode characters using Python

I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".
My code has not changed, but I migrated from Eclipse Juno to LiClipse, together with a migration to a different Python release (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.
When I read the Excel file, my code crashes when extracting values and converting them to strings using str(). The console output is the following:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)
Being more concrete I perform the following:
Read the Excel:
xlApp = Dispatch("Excel.Application")
excel = xlApp.Workbooks.Open(excel_location)
in an internal loop I extract the value of the cell:
cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)
and finally, if I try to convert cell_value to str, crashes:
print str(cell_value)
If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.
What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.
Any idea how can I progress?
This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.
The source of the error is print: In Python 2.x, print doesn't try to assume what encoding your terminal supports. It just plays safe and assumes that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).
Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.
Combining the two is a recipe for trouble. When Python has to turn the Unicode text into bytes, it falls back to the ascii codec and suddenly finds a character it cannot represent: the u'\xe1' in the error message is the code point U+00E1 (á), which lies outside the 0 to 127 range.
That's when the ascii encoder says: Sorry, can't help you there.
That means you can work with this value, compare it and such, but you can't print it. A simple fix for the printing problem is:
s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii
If your terminal supports UTF-8, then you need to tell Python about it:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
You should also have a look at sys.stdout.encoding which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.
Related:
Python 2 Unicode howto
Pragmatic Unicode, or, How do I stop the pain?
Setting the correct encoding when piping stdout in Python
.Cells(row,col) returns a Range object. You probably want the text from the cell:
cell = xl.ActiveSheet.Cells(1,2).Text
or
cell = xl.ActiveSheet.Range('B1').Text
The resulting value will be a Unicode string. To convert to bytes that you can write to a file, use .encode(encoding), for example:
bytes = cell.encode('utf8')
The example below uses a spreadsheet whose cell B1 contains Chinese text:
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Workbooks.Open(r'book1.xlsx')
cell = xl.ActiveSheet.Cells(1,2)
cell_value = cell.Text
print repr(cell)
print repr(cell_value)
print cell_value
Output (Note, Chinese will only print if console/IDE supports the characters):
<win32com.gen_py.Microsoft Excel 14.0 Object Library.Range instance at 0x129909424>
u'\u4e2d\u56fd\u4eba'
中国人
What is described here is a hack that you should not use as a long-term solution. Looking at the comments, it could crash the terminal.
Finally I found a solution, helped by the suggestion that @Huan-YuTseng provided. The solutions offered by others might work in other contexts, but not in this one.
So, what happened is that I migrated from the Eclipse Juno version (as Pydev stopped working due to a Java upgrade that I can't perform on this computer) to the LiClipse direct package (I did not upgrade a downloaded Eclipse version).
In my LiClipse version (1.4.0.201502042042) the Console output is not utf-8 by default. So I needed to change the output encoding, either in LiClipse or from my code. Fortunately, there was another question about a similar problem that helped me. You can see more details here, but essentially what you need to do is to include the following code at the beginning of your code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
And everything works. The solution is there in the answer from @AarongDigulla, but it is actually the very last one mentioned.
However, I need to say that LiClipse flags an error on the sys.setdefaultencoding statement, although it does not cause any issue during execution... no idea what's happening. That stopped me from testing this solution earlier. Maybe there is something wrong in LiClipse (it is allowing me to execute code with errors!)
Use 'UTF-8 with BOM', which in Python is the utf_8_sig codec, for Unicode characters. This also avoids garbled results when the data is opened in an Excel sheet.
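As a rough illustration of that suggestion, here is a Python 3 sketch (the thread itself is about Python 2) that writes a small CSV file with the utf-8-sig codec and reads it back. The file name and the sample rows are invented for the example.

```python
import csv
import os
import tempfile

rows = [["name", "city"], ["José", "Bogotá"]]  # made-up sample data

path = os.path.join(tempfile.mkdtemp(), "example.csv")  # hypothetical file name

# utf-8-sig writes the three-byte BOM before the first character.
with open(path, "w", encoding="utf-8-sig", newline="") as handle:
    csv.writer(handle).writerows(rows)

with open(path, "rb") as handle:
    raw = handle.read()

# Reading with utf-8-sig strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as handle:
    round_tripped = list(csv.reader(handle))
```

The BOM is what lets Excel recognise the file as UTF-8 when it is opened by double-clicking. Reading the same file with plain utf-8 would leave a stray U+FEFF at the start of the first cell.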

UnicodeEncodeError - works in Spyder but not when executed from terminal

I'm using BeautifulSoup to Parse some html, with Spyder as my editor (both brilliant tools by the way!). The code runs fine in Spyder, but when I try to execute the .py file from terminal, I get an error:
file = open('index.html','r')
soup = BeautifulSoup(file)
html = soup.prettify()
file1 = open('index.html', 'wb')
file1.write(html)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 5632: ordinal not in range(128)
I'm running OPENSUSE on a linux server, with Spyder installed using zypper.
Does anyone have any suggestions what the problem might be?
Many thanks.
That is because you must encode the result before outputting it (i.e. before writing it to the file):
file1.write(html.encode('utf-8'))
See every file has an attribute file.encoding. To quote the docs:
file.encoding
The encoding that this file uses. When Unicode strings
are written to a file, they will be converted to byte strings using
this encoding. In addition, when the file is connected to a terminal,
the attribute gives the encoding that the terminal is likely to use
(that information might be incorrect if the user has misconfigured the
terminal). The attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the file uses
the system default encoding for converting Unicode strings.
See the last sentence? soup.prettify returns a Unicode object and given this error, I'm pretty sure you're using Python 2.7 because its sys.getdefaultencoding() is ascii.
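An alternative to encoding by hand is to let the file object do it. The following sketch uses io.open, which accepts an encoding argument on Python 2.6+ and on Python 3. The HTML string and the temporary path are placeholders, not the asker's data.

```python
import io
import os
import tempfile

# u'\xa9' is the copyright sign mentioned in the traceback.
html = u"<p>\u00a9 2014 Example</p>"

path = os.path.join(tempfile.mkdtemp(), "index.html")  # placeholder path

# The file object encodes the Unicode text as UTF-8 while writing.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(html)

with io.open(path, "rb") as handle:
    raw = handle.read()
```

With this approach the file is opened in text mode ('w', not 'wb'), and the code no longer depends on the default encoding of the interpreter or of the environment it was started from.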
Hope this helps!

Django shell encoding error (Debian only, Ubuntu fine)

Good day
Can somebody explain what is going on behind the Django manage.py shell console?
The problem is the following. I'm developing a Django app which uses urllib to fetch some HTML pages and parse information out of them. That information is in Russian, so it should be Unicode (an address string in this case). Next, my script feeds it to a third-party module, which fails because it is not a valid Unicode string (I'm trying to geodecode a point from an address).
I tried to print the string (parsed address in this case) to console with print address command but it fails:
File "<console>", line 1, in <module>
... some useless stacktrace ...
print address
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Now comes the interesting part.
I have 2 computers: workstation with Ubuntu and Python 2.7.2 and Debian Lenny VPS with Python 2.7.2. I start parser the same way on both machines: by executing python manage.py shell and calling my function from it.
First I got the same error on both installations, but then I noticed that my python encoding is set to 'ascii' (import sys; sys.getdefaultencoding()). And when I put
import sys; reload(sys).setdefaultencoding('utf-8')
into settings.py, the problem is solved on Ubuntu. Now I get proper output there, e.g. г. Челябинск, ул. Кирова, д. 27, КТК Набережный, but this does not work on Debian.
If I delete this print address statement, then I get unreadable geolocation errors, but again only on Debian. Ubuntu is working just fine:
Failed to geodecode address [г. ЧелÑбинÑк, Ñл. 1-ой ÐÑÑилеÑки, 17/1, ÑÑнок ÐÑÑак, 1-з]
No amount of unicode(address).encode('utf-8') magic can help this.
So I just can't get it. What's the differences between machines that cause me so much trouble?
If you run the following python script, you'll see what's happening:
# -*- coding: utf-8 -*-
a = r"Челябинск"
print "Encode from UTF-8 to UTF-8:",unicode(a,'utf-8').encode('utf-8')
print "Encode from ISO8859-1 to UTF-8:",unicode(a,'iso8859-1').encode('utf-8')
The output is:
Encode from UTF-8 to UTF-8: Челябинск
Encode from ISO8859-1 to UTF-8: ЧелÑбинÑк
In essence you're taking a string encoded (already) as UTF-8 and re-encoding it (a second time, as if it were ISO8859-1) into UTF-8.
It's worth checking what the default encoding of the machine is in each case.
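For readers on Python 3, the same double encoding can be reproduced and undone with a few lines. This is a sketch with illustrative variable names. It assumes the damage was done exactly as described above (UTF-8 bytes misread as ISO8859-1 and encoded again).

```python
original = "Челябинск"

utf8_bytes = original.encode("utf-8")
# Wrong step: treating UTF-8 bytes as ISO8859-1 text.
misread = utf8_bytes.decode("iso8859-1")
# Encoding the misread text again produces the doubly encoded bytes.
double_encoded = misread.encode("utf-8")

# The damage is reversible as long as no bytes were dropped on the way.
repaired = double_encoded.decode("utf-8").encode("iso8859-1").decode("utf-8")
```

If repaired equals the original text, the data on the Debian machine is most likely being decoded with the wrong codec somewhere between urllib and the geocoding call.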
If anyone can add to this answer then please do.

python: unicode in Windows terminal, encoding used?

I am using the Python interpreter in Windows 7 terminal.
I am trying to wrap my head around unicode and encodings.
I type:
>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'
Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?
I continue, and type:
>>> us=unicode(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'
Question2: I tried using the latin-1 encoding on good luck to turn the string into an unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal has used to encode my string?
Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me. print(s) does the job.
I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows
Unicode is not an encoding. You encode into byte strings and decode into Unicode:
>>> '\x89'.decode('cp437')
u'\xeb'
>>> u'\xeb'.encode('cp437')
'\x89'
>>> u'\xeb'.encode('utf8')
'\xc3\xab'
The windows terminal uses legacy code pages for DOS. For US Windows it is:
>>> import sys
>>> sys.stdout.encoding
'cp437'
Windows applications use windows code pages. Python's IDLE will show the windows encoding:
>>> import sys
>>> sys.stdout.encoding
'cp1252'
Your results may vary.
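To make the difference between the code pages concrete, this Python 3 sketch encodes the same character with a handful of codecs and then asks which of them map the byte 0x89 back to 'ë'. The list of candidate codecs is just a guess at what a Western European Windows machine might use.

```python
char = "ë"  # U+00EB, LATIN SMALL LETTER E WITH DIAERESIS

# One character, several byte representations.
for codec in ("cp437", "cp850", "cp1252", "latin-1", "utf-8"):
    print(f"{codec:8} {char.encode(codec)!r}")

# Going the other way: which codecs turn the byte 0x89 back into 'ë'?
candidates = [
    codec
    for codec in ("cp437", "cp850", "cp1252", "latin-1")
    if b"\x89".decode(codec, errors="replace") == char
]
```

Only the DOS code pages survive the second check, which matches the '\x89' seen in the question. The value 0xEB only shows up in cp1252 and latin-1, where it coincides with the Unicode code point.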
Avoid Windows Terminal
I'm not going out on a limb by saying that the 'terminal' (more appropriately, the 'DOS prompt') that ships with Windows 7 is absolute junk. It was bad in Windows 95, NT, XP, Vista, and 7. Maybe they fixed it with Powershell, I don't know. However, it is indicative of the kind of problems that were plaguing OS development at Microsoft at the time.
Output to a file instead
Set the PYTHONIOENCODING environment variable and then redirect the output to a file.
set PYTHONIOENCODING=utf-8
./myscript.py > output.txt
Then using Notepad++ you can then see the UTF-8 version of your output.
Install win-unicode-console
win-unicode-console can fix your problems. You should try it out
pip install win-unicode-console
If you are interested in a thorough discussion of the issue of Python and command-line output, check out Python issue 1602. Otherwise, just use the win-unicode-console package.
py -m run script.py
Runs it per script or you can follow their directions to add win_unicode_console.enable() to every invocation by adding it to usercustomize or sitecustomize.
In case others get this page when searching
Easiest way is to set the codepage in the terminal first
CHCP 65001
then run your program.
working well for me.
For power shell start it with
powershell.exe -NoExit /c "chcp.com 65001"
Its from python: unicode in Windows terminal, encoding used?
Read through this python HOWTO about unicode after you read this section from the tutorial
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
To answer your first question, they are different because only when using u'' are you creating a unicode string.
2nd question:
sys.getdefaultencoding()
returns the default encoding
But to quote from link:
Python users who are new to Unicode sometimes are attracted by default encoding returned by sys.getdefaultencoding(). The first thing you should know about default encoding is that you don't need to care about it. Its value should be 'ascii' and it is used when converting byte strings StrIsNotAString to unicode strings.
You've answered question 1 as you ask it: the first string is an encoded byte-string, but the second is not an encoding at all, it refers to a unicode code-point, which for "LATIN SMALL LETTER E WITH DIAERESIS" is hex eb.
Now, the question of what the first encoding is is an interesting one. I would normally expect it to be either utf-8, or, since you're on Windows, ISO-8859-1 or Win-1252 (which aren't exactly the same thing, but close enough). However, the normal representation of that letter in utf-8 is c3 ab and in Win-1252 it's actually the same as the unicode code-point - ie hex eb. So, it's a bit of a mystery.
It appears you are using code page CP850, which makes sense as this is the historical code page for DOS which has been carried forward to the terminal window.
>>> s
'\x89'
>>> us=unicode(s,'CP850')
>>> us
u'\xeb'
Actually, unicode object has no
'encoding'. You should read up on
Unicode in python to avoid constant
confusion. This presentation looks
adequate -
http://farmdev.com/talks/unicode/ .
You are on a Russian version of
Windows, right? Your terminal uses
cp1251.
As you've figured out:
>>> a = "ё"
>>> a
'\xf1'
>>> print a
ё
Do you open any file when get such errors?
If so, try to open it with
import codecs
f = codecs.open('filename.txt','r','utf-8')
