Write special characters into an Excel table with the Python package pyExcelerator/xlwt

Task:
I generate formatted Excel tables from CSV files by using the Python package pyExcelerator (comparable with xlwt). I need to be able to write less-than-or-equal-to (≤) and greater-than-or-equal-to (≥) signs.
So far:
I can save my table as csv-files with UTF-8 encoding, so that I can view the special characters in my text editor, by adding the following line to my python source code:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
Problem:
However, there is no option to choose UTF-8 as font in pyExcelerator's Font class. The only options are:
CHARSET_ANSI_LATIN = 0x00
CHARSET_SYS_DEFAULT = 0x01
CHARSET_SYMBOL = 0x02
CHARSET_APPLE_ROMAN = 0x4D
CHARSET_ANSI_JAP_SHIFT_JIS = 0x80
CHARSET_ANSI_KOR_HANGUL = 0x81
CHARSET_ANSI_KOR_JOHAB = 0x82
CHARSET_ANSI_CHINESE_GBK = 0x86
CHARSET_ANSI_CHINESE_BIG5 = 0x88
CHARSET_ANSI_GREEK = 0xA1
CHARSET_ANSI_TURKISH = 0xA2
CHARSET_ANSI_VIETNAMESE = 0xA3
CHARSET_ANSI_HEBREW = 0xB1
CHARSET_ANSI_ARABIC = 0xB2
CHARSET_ANSI_BALTIC = 0xBA
CHARSET_ANSI_CYRILLIC = 0xCC
CHARSET_ANSI_THAI = 0xDE
CHARSET_ANSI_LATIN_II = 0xEE
CHARSET_OEM_LATIN_I = 0xFF
Do any of these character sets contain the less-than-or-equal-to and greater-than-or-equal-to signs? If so, which one?
Which python encoding name corresponds to these sets?
Is there another way for generating these special characters?

This should help with writing UTF-8 chars using pyexcelerator or xlwt:
wb = xlwt.Workbook(encoding='utf-8')
edit:
It seems this is not working for pyexcelerator, but I haven't confirmed it.

You may be overthinking the problem. The font shouldn't play into the matter, although character encoding might.
In any case, I was able to use xlwt to create an excel spreadsheet with less-than-equal and greater-than-equal signs with the following script:
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Test Sheet')
lte = u'\u2264'
gte = u'\u2265'
ws.write(0,0,lte+gte)
wb.save('foo.xls')
Note that # -*- coding: utf-8 -*- is not required because the special characters are written as Unicode escape sequences, so the source file stays pure ASCII. In general, I recommend using unicode where possible.
It's also possible to use utf-8 and type the characters directly into the Python code. This would be exactly the same except for how the characters are entered:
#-*- coding: utf-8 -*-
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Test Sheet')
lte = u'≤'
gte = u'≥'
ws.write(0,0,lte+gte)
wb.save('foo.xls')
Note, however, that you must be using an editor that is aware that you are saving the Python code as UTF-8. If your editor encodes the file in any other way, the special characters will not be parsed properly when loaded by the Python interpreter.
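To illustrate the editor caveat above, here is a minimal Python 3 sketch (not part of the original answer) of what happens when UTF-8 bytes are read back with a different encoding. cp1252 is only an assumed example of a "wrong" encoding; the same effect occurs with other single-byte code pages.

```python
# Sketch: the same bytes, interpreted with the right and the wrong encoding.
lte = '\u2264'                      # LESS-THAN OR EQUAL TO
utf8_bytes = lte.encode('utf-8')    # what a UTF-8-aware editor saves to disk

# Decoding with the encoding that was actually used round-trips cleanly.
roundtrip = utf8_bytes.decode('utf-8')

# Decoding with a different single-byte encoding (cp1252 is an assumed
# example) raises no error but yields three unrelated characters.
garbled = utf8_bytes.decode('cp1252')

print(len(utf8_bytes), len(garbled), roundtrip == lte, garbled == lte)
```

The point of the sketch is that the failure is silent: nothing raises, the text is simply wrong.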

(1) Re: """
I can save my table as csv-files with UTF-8 encoding, so that I can view the special characters in my text editor, by adding the following line to my python source code:
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Being able to write a file with characters encoded in UTF-8 is NOT dependent on what encoding is used in the source of the program that writes the file!
(2) UTF-8 is an encoding, not a font. Those charsets in an Excel FONT record are a blast from the past AFAIK. I've not heard from any xlwt user who ever thought it necessary to use other than the default for the charset. Just feed unicode objects to xlwt as demonstrated by Jason ... if you have an appropriate font on your system (see if you can display the characters in OpenOffice Calc), you should be OK.
(3) Any particular reason for using pyExcelerator instead of xlwt?
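Point (1) can be checked with a short Python 3 sketch. The source below is pure ASCII (the signs are written as escapes), yet the file it writes is UTF-8. The file name and the sample row are placeholders chosen for the illustration.

```python
import csv
import os
import tempfile

# The source file contains only ASCII; the characters come from escapes.
rows = [['x \u2264 5', 'y \u2265 7']]

# Placeholder output path for the illustration.
path = os.path.join(tempfile.mkdtemp(), 'signs.csv')

# The encoding of the *output* is chosen here, independently of the source.
with open(path, 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(rows)

# Read the file back as raw bytes to see what was actually written.
with open(path, 'rb') as f:
    raw = f.read()

print(raw)
```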

Related

What encoding is this and how can I decode it in Python?

I have a filename that contains %ed%a1%85%ed%b7%97.svg and want to decode that to its proper string representation in Python 3. I know the result will be 𡗗.svg but the following code does not work:
import urllib.parse
import codecs
input = '%ed%a1%85%ed%b7%97.svg'
unescaped = urllib.parse.unquote(input)
raw_bytes = bytes(unescaped, "utf-8")
decoded = codecs.escape_decode(raw_bytes)[0].decode("utf-8")
print(decoded)
will print ������.svg. It does work, however, when input is a string like %e8%b7%af.svg for which it will correctly decode to 路.svg.
I've tried to decode this with online tools such as https://mothereff.in/utf-8 by replacing % with \x leading to \xed\xa1\x85\xed\xb7\x97.svg. The tool correctly decoded this input to 𡗗.svg.
What happens here?
To display the correct characters you need the right encoding, plus a console or terminal that supports UTF-8 and is configured for it:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
PEP 263 -- Defining Python Source Code Encodings: https://www.python.org/dev/peps/pep-0263/
https://stackoverflow.com/questions/3883573/encoding-error-in-python-with-chinese-characters#3888653
"""
from urllib.parse import unquote
urlencoded = '%ed%a1%85%ed%b7%97'
char = unquote(urlencoded, encoding='gbk')
char1 = unquote(urlencoded, encoding='big5_hkscs')
char2 = unquote(urlencoded, encoding='gb18030')
print(char)
print(char1)
print(char2)
# 怼呿窏
# 瞴�窾�
# 怼呿窏
This is quite an exotic Unicode character, and I was wrong about the encoding: it is not a simplified Chinese character but a traditional one, and it sits far out in the code space at U+215D7, in CJK Unified Ideographs Extension B.
The code point and the other values made me suspect the string had been poorly encoded, so it took me a while.
Someone helped me figure out how the encoding ended up in that form. You need a few encoding transforms to get back to the original value:
from urllib.parse import unquote_to_bytes
cjk = unquote_to_bytes(urlencoded).decode('utf-8', 'surrogatepass').encode('utf-16', 'surrogatepass').decode('utf-16')
print(cjk)
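The chain of transforms above can be unpacked step by step. The following Python 3 sketch is an illustration rather than part of the original answer: it shows that the six bytes are two UTF-16 surrogates, each encoded as if it were a character of its own, and recombines them with the standard surrogate-pair arithmetic.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes('%ed%a1%85%ed%b7%97')

# Each 3-byte group decodes to one UTF-16 surrogate, not a real character.
# 'surrogatepass' lets the UTF-8 codec accept them instead of raising.
halves = raw.decode('utf-8', 'surrogatepass')
high, low = (ord(c) for c in halves)
print(hex(high), hex(low))          # 0xd845 0xddd7

# Standard UTF-16 surrogate-pair arithmetic recombines the two halves.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))              # 0x215d7

cjk = chr(code_point)
```

This also explains why the plain `decode("utf-8")` in the question fails: strict UTF-8 does not allow surrogates to be encoded individually.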

Encode Strings in Python

I want to run my code on terminal but it shows me this error :
SyntaxError: Non-ASCII character '\xd8' in file streaming.py on line
72, but no encoding declared; see http://python.org/dev/peps/pep-0263/
for detail
I tried to encode the Arabic string using this :
# -*- coding: utf-8 -*-
st = 'المملكة العربية السعودية'.encode('utf-8')
It's very important for me to run it on the terminal so I can't use IDLE.
The problem is since you are directly pasting your characters in to a python file, the interpreter (Python 2) attempts to read them as ASCII (even before you encode, it needs to define the literal), which is illegal. What you want is a unicode literal if pasting non-ASCII bytes:
x=u'المملكة العربية السعودية' #Or whatever the corresponding bytes are
print x.encode('utf-8')
You can also try to set the entire source file to be read as utf-8:
#!/usr/bin/python
# -*- coding: utf-8 -*-
and don't forget to make the file executable. Lastly, you can import unicode_literals from __future__ to get the Python 3 behavior:
from __future__ import unicode_literals
at the top of the file, so string literals are Unicode by default. Note that \xd8 appears as phi in my terminal, so make sure the encoding is correct.

Convert string from Latin-1 to UTF-8 and back to Latin-1

A system (not under my control) sends a latin-1 encoded string (such as Öland) which I can convert to utf-8 but not back to latin-1.
Consider this code:
text = '\xc3\x96land' # This is what the external system sends
iso = text.encode(encoding='latin-1') # this is my best guess
print(iso.decode('utf-8'))
print(u"Öland".encode(encoding='latin-1'))
This is the output:
Öland
b'\xd6land'
Now, how do I mimic the system?
Obviously '\xc3\x96land' is not '\xd6land'
If your external system sends it to you, then you should decode it first rather than encode it, since it arrives already encoded.
You don't have to encode what is already encoded!
hey=u"Öland".encode('latin-1')
print hey
gives output like this ?land
print hey.decode('latin-1')
gives output like this Öland
It turns out the external system sends the data in UTF-8 already.
Converting the string forth and back works like this now:
#!/usr/bin/env python3.4
# -*- coding: utf-8 -*-
text = '\xc3\x96land'
encoded = text.encode(encoding='raw_unicode_escape')
print(encoded)
utf8 = encoded.decode('utf-8')
print(utf8)
mimic = utf8.encode('utf-8', 'unicode_escape')
print(mimic)
And the output
b'\xc3\x96land'
Öland
b'\xc3\x96land'
Thank you for your support!
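For comparison, a sketch of the same round trip that uses latin-1 as the byte-preserving step instead of raw_unicode_escape. This is a swapped-in variant, not the poster's method; it relies on Latin-1 mapping code points 0 to 255 onto the identical byte values.

```python
# What the question's literal holds: two code points, U+00C3 and U+0096,
# i.e. UTF-8 bytes that were mistakenly turned into characters one by one.
text = '\xc3\x96land'

# Latin-1 maps code points 0-255 to the same byte values, so this recovers
# the original bytes without changing them.
raw = text.encode('latin-1')
fixed = raw.decode('utf-8')

# Going back: the same two steps in reverse order.
mimic = fixed.encode('utf-8').decode('latin-1')

print(raw, fixed == '\xd6land', mimic == text)
```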

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
    return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading the PEP that the error points you to. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal into your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
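The caveat can be reproduced with a small Python 3 sketch (an illustration, not part of the original answer). It compiles the same UTF-8 source bytes twice, once with a correct declaration and once with a wrong one; the file name 'demo' is a placeholder.

```python
# The pound sign saved as UTF-8 is the two bytes C2 A3.
utf8_source = "x = '\u00a3'\n".encode('utf-8')

# Correct declaration: the literal is one character.
good = {}
exec(compile(b'# coding: utf-8\n' + utf8_source, 'demo', 'exec'), good)

# Wrong declaration: no error, but each byte becomes its own character.
bad = {}
exec(compile(b'# coding: latin-1\n' + utf8_source, 'demo', 'exec'), bad)

print(len(good['x']), len(bad['x']))
```

Both versions compile without complaint; only the length and content of the resulting string reveal that the second declaration was wrong.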
Adding the following two lines in the script solved the issue for me.
#!/usr/bin/python
# coding=utf-8
Hope it helps!
You're probably trying to run a Python 3 file with a Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you are indeed working on a Python 2 script, a solution not yet mentioned on this page is to resave the file as UTF-8 with a BOM. That adds three special bytes to the start of the file, which explicitly tell the Python interpreter (and your text editor) the file's encoding.
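The three bytes in question can be inspected from Python itself. A minimal Python 3 sketch, with a made-up one-line source string as the sample data:

```python
import codecs

# The UTF-8 byte order mark is these three bytes.
print(codecs.BOM_UTF8)                       # b'\xef\xbb\xbf'

# The 'utf-8-sig' codec prepends the BOM when encoding...
data = 'price = "\u00a3"\n'.encode('utf-8-sig')
print(data.startswith(codecs.BOM_UTF8))      # True

# ...and strips it again when decoding.
text = data.decode('utf-8-sig')
```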

using £ in function and write it to csv in Python

I want to remove the pound sign from a string that is parsed from a URL using BeautifulSoup, and I got the following error for the pound sign.
SyntaxError: Non-ASCII character '\xa3' in file
I tried to put this # -*- coding: utf-8 -*- at the start of the class but still got the error.
This is the code. After I get the float number, I want to write it to a CSV file.
mainTag = SoupStrainer('table', {'class':'item'})
soup = BeautifulSoup(resp,parseOnlyThese=mainTag)
tag= soup.findAll('td')[3]
price = tag.text.strip()
pr = float(price.lstrip(u'£').replace(',', ''))
The problem is likely one of encoding, and bytes vs characters. What encoding was the CSV file created with? What sequence of bytes is in the file where the £ symbol occurs? What are the bytes contained in the variable price? You'll need to replace the bytes that actually occur in the string. One piece of the puzzle is the contents of the data in your source code. That's where the # -*- coding: utf-8 -*- marker at the top of the source is significant: it tells python how to interpret the bytes in a string literal. It is possible you will need (or want) to decode the bytes from the CSV file to create a Unicode string before replacing the character.
I will point out that the documentation for the csv module in Python 2.7 says:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
The examples section includes the following code demonstrating decoding the bytes provided by the csv module to Unicode strings.
import csv
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

# helper used above, from the same examples section (Python 2 only):
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
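The helper above is specific to Python 2. As a rough sketch of the same task in Python 3, where the csv module works on text directly, the price can be cleaned and written as follows. The sample price string and the column label are assumptions made for the illustration, and io.StringIO stands in for a real file opened with encoding='utf-8' and newline=''.

```python
import csv
import io

# Assumed sample of the scraped cell text (not from the original post).
price = '\u00a31,234.50'

# Strip the currency sign and the thousands separator, then convert.
pr = float(price.lstrip('\u00a3').replace(',', ''))

# Python 3's csv module works on text, so no encoder helper is needed.
# io.StringIO stands in for a file opened with encoding='utf-8', newline=''.
buf = io.StringIO(newline='')
csv.writer(buf).writerow(['item', pr])
print(buf.getvalue())
```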
