Encoding string in UTF8 from list - python

I'm having trouble encoding strings as UTF-8.
In this script I read data from an Excel file
and print it out in a loop. The problem is that
strings with special characters show up wrong.
I keep getting 'PatrÄ«cija' instead of 'Patrīcija'
and can't find a solution for this problem.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import xlrd
import datetime
def todaysnames():
    todaysdate = datetime.datetime.strftime(datetime.date.today(), "%d.%m")
    book = xlrd.open_workbook("vardadienas.xls")
    sheet = book.sheet_by_name('Calendar')
    for rownr in range(sheet.nrows):
        if sheet.cell(rownr, 0).value == todaysdate:
            string = (sheet.cell(rownr, 1).value)
            string = string.encode(encoding="UTF-8",errors="strict")
            names = string.split(', ')
            return names
names = todaysnames()
for name in names:
    print name

Changing the encoding to iso8859_13 (Baltic languages) fixed it.

I think your problem may be caused by print. xlrd returns Unicode text, which your script encodes as UTF-8. Depending on the encoding of your console, print may not be able to display those bytes correctly. I've noticed this sometimes on a French Windows (where the encoding is cp1252).
The following question, "Python, Unicode, and the Windows console", explains how to print Unicode characters on the Windows console. I haven't tried it myself, but it looks good.
I hope it helps.
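The symptom from the question can be reproduced without Excel. This is a minimal Python 3 sketch, assuming the console interprets the printed bytes as cp1252; the variable names are illustrative only.

```python
# Reproduce the garbled output: UTF-8 bytes interpreted as cp1252.
original = "Patr\u012bcija"  # 'Patrīcija'

utf8_bytes = original.encode("utf-8")   # what the script hands to print
garbled = utf8_bytes.decode("cp1252")   # what a cp1252 console would show
# garbled == 'PatrÄ«cija'

# Reversing both steps recovers the text, which supports the diagnosis.
recovered = garbled.encode("cp1252").decode("utf-8")
```

If the console really does use a different code page, encoding to that code page (as the asker did with iso8859_13) or switching the console to UTF-8 are the two usual ways out.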

Related

Python not picking up UTF-8 encoding

I am having some trouble encoding ascii characters to UTF-8, or a string is not picking up the encoding.
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
#!/usr/bin/python
# -*- coding: UTF-8 -*-
def remove_non_ascii_1(text):
    text.encode('utf-8')
    for i in text:
        return ''.join(i for i in text if i=='£')
In Python 2.7 I get the error
SyntaxError: Non-ASCII character '\xc2' in file on line 16, but no encoding declared; see SyntaxError: Non-ASCII character '\xc2' in file.
With the Unicode replacement
return ''.join(i for i in text if i=='\xc2')
the error is
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Sample text :
row from a csv file reading in
[u'06/11/2020', u'ABC', u'32154E', u'3214', u'DEF', u'Cash Purchase', u'Final', u'', u'20.00%', u'ABC', u'Sold From Pickup', u'New ', u'10.00%', u'0', u'15%', u'\xa3469.84', u'Jonathan Jerome', u'3', u'\xa3140.95', u'2%', u'\xa393.97', u'\xa39,396.83', u'', u'\xa35,638.00', u'30/06/2020', u'4', u'Boiler-Amended']
I want to remove the \xa3 or £ in the currency fields.
Two things first:
Don't use Python 2 any more, for the reasons mentioned here!
Don't mix different encodings in Python 2.
TL;DR Python 3 improved so many things regarding encodings that it simply isn't worth it.
Whole story: read here
OK, with this out of the way, let's start fixing your code.
As Klaus D. already mentioned, you do not save the result of text.encode('utf-8'). This leads to a warning when comparing seemingly equal characters (£ and £): one is a unicode character read from your file, while the other is a plain byte-string literal in your source (despite the -*- coding: UTF-8 -*- declaration; that only states what your code file is encoded in and does not turn str literals into unicode).
Edit: Also, when comparing to the character you need to compare to a unicode char, so you can either convert it or simply tell the interpreter to treat it as unicode in the first place (that's why I added a leading 'u' in front of your '£').
To fix this, simply save your result into text again after you call text.encode('utf-8').
Additionally, the "shebang" and the encoding info should always be at the very top of a file, so that as soon as you open the file you know what you are dealing with. Python only honors the coding declaration on the first or second line, which is why you got the "no encoding declared" error.
Something else I would correct is the first for-loop. It is unnecessary because you return out of the function anyway after you handle the first element.
This means the completely "corrected" code is this.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
def remove_non_ascii_1(text):
    text.encode('utf-8')
    return ''.join(i for i in text if i==u'£')
PS: You should really think again about whether def remove_non_ascii_1(text) is necessary. By the looks of it, your input is already a list of unicode strings, which you probably read directly from the file. This means you don't need to correct the encoding, though the comparison with '£' could stay. You might just want to rename the method ;)
Hope this helped and cleared up any confusion about Python 2 encodings :D
If you print your list as a whole now, you will see it still contains \xa3 and not the actual '£', but if you print the elements separately it works fine. This is because printing a list shows the repr() of each element, and the repr() of a unicode string escapes non-ASCII characters.
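For the stated goal of removing the pound sign from the currency fields, a Python 3 sketch could look like the following. It swaps the character-by-character comparison for str.replace; strip_pound is a hypothetical helper name, and the row is a shortened copy of the sample data.

```python
def strip_pound(value):
    """Return value without any pound sign (U+00A3)."""
    return value.replace("\u00a3", "")

# Shortened version of the sample row from the question.
row = ["06/11/2020", "ABC", "\u00a3469.84", "\u00a39,396.83", ""]
cleaned = [strip_pound(field) for field in row]
# cleaned == ['06/11/2020', 'ABC', '469.84', '9,396.83', '']
```

In Python 3 every str is Unicode text, so no encode step or u prefix is needed for the comparison.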
Python 3 greatly improved unicode text handling. If you have to use Python 2.7, I would recommend using the codecs library when reading text files, since it helps with Python 2.7 unicode issues:
import codecs
fp = codecs.open("file", "r", encoding="utf-8")
In your case I noticed that you are using unicodecsv as a drop-in csv replacement. In this case, you can pass the parameter encoding="utf-8" when reading the csv file into a list:
r = csv.reader(f, encoding='utf-8')
For just removing non-Ascii characters I would recommend checking this good answer on StackOverflow
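As a rough Python 3 counterpart to the advice above, the sketch below decodes a CSV explicitly as UTF-8 while reading and then drops every non-ASCII character. The in-memory bytes stand in for a file on disk, and ascii_only is a hypothetical helper, not part of csv or unicodecsv.

```python
import csv
import io

# Stand-in for a UTF-8 encoded CSV file on disk (hypothetical content).
raw = "name,price\nBoiler,\u00a3469.84\n".encode("utf-8")

# Decode explicitly while reading, the job codecs.open() or unicodecsv
# does in Python 2.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

def ascii_only(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")

ascii_rows = [[ascii_only(cell) for cell in row] for row in rows]
# ascii_rows == [['name', 'price'], ['Boiler', '469.84']]
```

Note that errors="ignore" silently discards data, so it only fits cases where losing accented letters and symbols is acceptable.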

Writing DataFrame to encoded JSON Newline Delimited

In Python 2.7, I have a Pandas DataFrame with several unicode columns, integer columns, etc. I need to write it, encoded as UTF-8, to a newline-delimited JSON file.
I tried this, but it only works in Python 3, not Python 2.7.
with io.open('myjson.json','w',encoding='utf-8') as f:
    f.write(df.to_json(orient="records", lines=True, force_ascii=False))
This is my attempt's result, but as you can see it's not encoded as UTF-8.
{"account_id":"support","case_id":7697,"message":"\u0633\u0628 \u0627\u0644\u0644\u0647\u0627\u0644\u0644\u0647 \u0627\u0644\u0639","created_at":1536606086392,"agent":"108915"}
{"account_id":"support","case_id":7697924,"message":"\u0647\u0627\u064a","created_at":1536601516354,"agent":"108915"}
I think it has something to do with this. But I'm not sure.
Other research I've done shows that if I put this in my code it works. But I also read that this isn't recommended.
import sys
reload(sys)
sys.setdefaultencoding('utf8')
edit - I missed the 2.7 part - I usually use 3.5 or higher. In any case, w/ python 2.7, I was able to convert the unicode string to utf-8 using codecs:
import codecs
codecs.unicode_escape_decode(a['message'])[0].encode("utf-8")
'\xd8\xb3\xd8\xa8 \xd8\xa7\xd9\x84\xd9\x84\xd9\x87\xd8\xa7\xd9\x84\xd9\x84\xd9\x87 \xd8\xa7\xd9\x84\xd8\xb9'
Old answer -
It looks like pandas .to_json() has a default setting of force_ascii=True, which escapes non-ASCII characters as \uXXXX sequences.
From docs:
to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression=None, index=True)
Try setting it to False:
df.to_json(force_ascii=False)
'{"agent":{"0":"108915"},"created_at":{"0":1536606086392},"message":{"0":"سب اللهالله الع"}}'
Edit - Forgot you were looking for newline delimited,
>>> df.to_json(force_ascii=False, orient="records")
[{"agent":"108915","created_at":1536606086392,"message":"سب اللهالله الع"}]
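One way to sidestep version differences in to_json is to write the lines with the standard json module. This is only a sketch in Python 3: the records are hypothetical (with pandas they could come from df.to_dict(orient="records")), and the temporary path stands in for the real output file.

```python
import io
import json
import os
import tempfile

# Hypothetical records; with pandas these could come from
# df.to_dict(orient="records").
records = [
    {"account_id": "support", "case_id": 7697924, "message": "\u0647\u0627\u064a"},
    {"account_id": "support", "case_id": 7697, "message": "ok"},
]

path = os.path.join(tempfile.mkdtemp(), "myjson.json")

# ensure_ascii=False keeps the Arabic text instead of \uXXXX escapes;
# the file object takes care of the UTF-8 encoding.
with io.open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the raw bytes back to check what was actually written.
with io.open(path, "rb") as f:
    written = f.read()
```

Each record ends up on its own line, which is the newline-delimited layout the question asks for.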

What encoding is this and how can I decode it in Python?

I have a filename that contains %ed%a1%85%ed%b7%97.svg and want to decode that to its proper string representation in Python 3. I know the result will be 𡗗.svg but the following code does not work:
import urllib.parse
import codecs
input = '%ed%a1%85%ed%b7%97.svg'
unescaped = urllib.parse.unquote(input)
raw_bytes = bytes(unescaped, "utf-8")
decoded = codecs.escape_decode(raw_bytes)[0].decode("utf-8")
print(decoded)
will print ������.svg. It does work, however, when input is a string like %e8%b7%af.svg for which it will correctly decode to 路.svg.
I've tried to decode this with online tools such as https://mothereff.in/utf-8 by replacing % with \x leading to \xed\xa1\x85\xed\xb7\x97.svg. The tool correctly decoded this input to 𡗗.svg.
What happens here?
You need the correct encoding for a command-line console/terminal (one that supports and is configured for UTF-8) to display the correct characters.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
PEP 263 -- Defining Python Source Code Encodings: https://www.python.org/dev/peps/pep-0263/
https://stackoverflow.com/questions/3883573/encoding-error-in-python-with-chinese-characters#3888653
"""
from urllib.parse import unquote
urlencoded = '%ed%a1%85%ed%b7%97'
char = unquote(urlencoded, encoding='gbk')
char1 = unquote(urlencoded, encoding='big5_hkscs')
char2 = unquote(urlencoded, encoding='gb18030')
print(char)
print(char1)
print(char2)
# 怼呿窏
# 瞴�窾�
# 怼呿窏
This is quite an exotic Unicode character, and I was wrong about the encoding: it's not a simplified Chinese character but a traditional one, and quite far into the mapping as well: U+215D7, in CJK Unified Ideographs Extension B.
The code point listed and the other values made me suspect this was poorly encoded, so it took me a while.
Someone helped me figure out how the encoding got into that form. You need a few encoding transforms to revert it to its original value.
from urllib.parse import unquote_to_bytes
cjk = unquote_to_bytes(urlencoded).decode('utf-8', 'surrogatepass').encode('utf-16', 'surrogatepass').decode('utf-16')
print(cjk)
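Why this works: the six bytes are not one UTF-8 sequence but two three-byte sequences, each encoding one half of a UTF-16 surrogate pair (ED A1 85 is U+D845, ED B7 97 is U+DDD7), and that pair stands for U+215D7. A sketch of the same transform wrapped in a function follows; decode_surrogate_escaped is a hypothetical name, and it assumes the input uses this surrogate-per-sequence form or plain UTF-8.

```python
from urllib.parse import unquote_to_bytes

def decode_surrogate_escaped(name):
    """Decode a percent-encoded name whose astral characters were
    written as two separately encoded UTF-16 surrogates."""
    raw = unquote_to_bytes(name)
    # 'surrogatepass' lets the lone surrogates through the UTF-8 decoder.
    with_surrogates = raw.decode("utf-8", "surrogatepass")
    # A UTF-16 round trip joins each surrogate pair into one code point.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

fixed = decode_surrogate_escaped("%ed%a1%85%ed%b7%97.svg")
plain = decode_surrogate_escaped("%e8%b7%af.svg")
# fixed == '\U000215d7.svg' (one code point plus the extension)
# plain == '\u8def.svg'
```

Ordinary UTF-8 input such as %e8%b7%af passes through unchanged, so the function can be applied to both kinds of filename.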

write special characters into excel table by python package pyExcelerator/xlwt

Task:
I generate formatted Excel tables from CSV files using the Python package pyExcelerator (comparable with xlwt). I need to be able to write less-than-or-equal-to (≤) and greater-than-or-equal-to (≥) signs.
So far:
I can save my table as csv-files with UTF-8 encoding, so that I can view the special characters in my text editor, by adding the following line to my python source code:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
Problem:
However, there is no option to choose UTF-8 as font in pyExcelerator's Font class. The only options are:
CHARSET_ANSI_LATIN = 0x00
CHARSET_SYS_DEFAULT = 0x01
CHARSET_SYMBOL = 0x02
CHARSET_APPLE_ROMAN = 0x4D
CHARSET_ANSI_JAP_SHIFT_JIS = 0x80
CHARSET_ANSI_KOR_HANGUL = 0x81
CHARSET_ANSI_KOR_JOHAB = 0x82
CHARSET_ANSI_CHINESE_GBK = 0x86
CHARSET_ANSI_CHINESE_BIG5 = 0x88
CHARSET_ANSI_GREEK = 0xA1
CHARSET_ANSI_TURKISH = 0xA2
CHARSET_ANSI_VIETNAMESE = 0xA3
CHARSET_ANSI_HEBREW = 0xB1
CHARSET_ANSI_ARABIC = 0xB2
CHARSET_ANSI_BALTIC = 0xBA
CHARSET_ANSI_CYRILLIC = 0xCC
CHARSET_ANSI_THAI = 0xDE
CHARSET_ANSI_LATIN_II = 0xEE
CHARSET_OEM_LATIN_I = 0xFF
Do any of these character sets contain the less-than-or-equal-to and greater-than-or-equal-to signs? If so, which one?
Which python encoding name corresponds to these sets?
Is there another way for generating these special characters?
This should help with writing UTF-8 chars using pyexcelerator or xlwt:
wb = xlwt.Workbook(encoding='utf-8')
edit:
It seems this doesn't work for pyexcelerator, but I haven't confirmed it.
You may be overthinking the problem. The font shouldn't play into the matter, although character encoding might.
In any case, I was able to use xlwt to create an excel spreadsheet with less-than-equal and greater-than-equal signs with the following script:
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Test Sheet')
lte = u'\u2264'
gte = u'\u2265'
ws.write(0,0,lte+gte)
wb.save('foo.xls')
Note that # -*- coding: utf-8 -*- is not required because the special characters are written as Unicode escape sequences. In general, I recommend using unicode where possible.
It's also possible to use utf-8 and type the characters directly into the Python code. This would be exactly the same except for how the characters are entered:
#-*- coding: utf-8 -*-
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Test Sheet')
lte = u'≤'
gte = u'≥'
ws.write(0,0,lte+gte)
wb.save('foo.xls')
Note, however, that you must be using an editor that is aware that you are saving the Python code as UTF-8. If your editor encodes the file in any other way, the special characters will not be parsed properly when loaded by the Python interpreter.
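The equivalence of the two spellings, and the question of which character sets contain the signs, can be checked with plain Python. This is a Python 3 sketch that does not touch xlwt; can_encode is a hypothetical helper, and the codec names are Python's own, not the Excel charset constants.

```python
# Escape sequence and literal character produce the same string.
lte_escaped = "\u2264"
gte_escaped = "\u2265"
same = (lte_escaped == "≤") and (gte_escaped == "≥")

def can_encode(text, codec):
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# cp1252 (Windows Western European) has no slot for either sign,
# while UTF-8 can represent every Unicode character.
in_cp1252 = can_encode(lte_escaped + gte_escaped, "cp1252")
in_utf8 = can_encode(lte_escaped + gte_escaped, "utf-8")
```

The same helper can be pointed at other Python codec names to probe further legacy character sets.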
(1) Re: """
I can save my table as csv-files with UTF-8 encoding, so that I can view the special characters in my text editor, by adding the following line to my python source code:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Being able to write a file with characters encoded in UTF-8 is NOT dependent on what encoding is used in the source of the program that writes the file!
(2) UTF-8 is an encoding, not a font. Those charsets in an Excel FONT record are a blast from the past AFAIK. I've not heard from any xlwt user who ever thought it necessary to use other than the default for the charset. Just feed unicode objects to xlwt as demonstrated by Jason ... if you have an appropriate font on your system (see if you can display the characters in OpenOffice Calc), you should be OK.
(3) Any particular reason for using pyExcelerator instead of xlwt?

Unicode problems in PyObjC

I am trying to figure out PyObjC on Mac OS X, and I have written a simple program to print out the names in my Address Book. However, I am having some trouble with the encoding of the output.
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
from AddressBook import *
ab = ABAddressBook.sharedAddressBook()
people = ab.people()
for person in people:
    name = person.valueForProperty_("First") + ' ' + person.valueForProperty_("Last")
    name
When I run this program, the output looks something like this:
...snip...
u'Jacob \xc5berg'
u'Fernando Gonzales'
...snip...
Could someone please explain why the strings are in unicode, but the content looks like that?
I have also noticed that when I try to print the name I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc5' in position 6: ordinal not in range(128)
# -*- coding: UTF-8 -*-
only affects the way Python decodes comments and string literals in your source, not the way standard output is configured, etc, etc. If you set your Mac's Terminal to UTF-8 (Terminal, Preferences, Settings, Advanced, International dropdown) and emit Unicode text to it after encoding it in UTF-8 (print name.encode("utf-8")), you should be fine.
If you run the code in your question in the interactive console the interpreter will print the repr of "name" because of the last statement of the loop.
If you change the last line of the loop from just "name" to "print name" the output should be fine. I've tested this with Terminal.app on a 10.5.7 system.
Just writing the variable name sends repr(name) to the standard output, and repr() escapes all non-ASCII characters in unicode values.
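The difference between the escaped form and the encoded text can be shown without PyObjC. This is a Python 3 sketch; ascii() is used because it behaves like Python 2's repr() for unicode objects, and the name is taken from the question's output.

```python
name = "Jacob \u00c5berg"

# ascii() gives the escaped form, like Python 2's repr() of a unicode object.
escaped = ascii(name)
# escaped == "'Jacob \\xc5berg'"

# Encoding to ASCII fails the same way the Python 2 print statement did.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields two bytes for the A-ring.
utf8_bytes = name.encode("utf-8")
```

The string itself is intact in both cases; only the way it is rendered or encoded for output differs.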
print tries to convert u'Jacob \xc5berg' to ASCII, which doesn't work. Try writing it to a file.
See Print Fails on the python wiki.
That means you're using legacy, limited or misconfigured console. If you're just trying to play with unicode at interactive prompt move to a modern unicode-aware console. Most modern Python distributions come with IDLE where you'll be able to print all unicode characters.
Convert it to a unicode string through:
print unicode(name)
