Python webpage source read with special characters - python

I am reading a page source from a webpage, then parsing a value from that source.
There I am facing a problem with special characters.
In my Python controller file I am using # -*- coding: utf-8 -*-.
But the webpage source I am reading uses charset=iso-8859-1.
When I read the page content without specifying any encoding, it throws an error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte
When I use string.decode("iso-8859-1").encode("utf-8"), the data is parsed without any error, but the value is displayed as 'F\u00fcnke' instead of 'Fünke'.
Please let me know how I can solve this issue.
I would greatly appreciate any suggestions.

Encoding is a PITA in Python 3 for sure (and in Python 2 in some cases as well).
Try checking these links out, they might help you:
Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)
http://docs.python.org/2/library/codecs.html
It would also be nice to see the code behind "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use UTF-8 (Windows, for instance). Your # -*- coding: utf-8 -*- only tells Python which characters to expect within the source code itself, not the encoding of the data the code is going to parse or analyze.
For instance, I write:
# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, i print out the time)
print(time.strftime('%H:%M:%S'))
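For completeness, here is a minimal sketch of the decode step itself, assuming Python 2 with urllib2 and that the page really is ISO-8859-1 (the URL is a placeholder). The '\u00fc' escape usually means you were looking at the repr of the unicode object (for example by printing a whole list) rather than printing the string itself:
import urllib2

raw = urllib2.urlopen('http://example.com/page.html').read()  # raw bytes
text = raw.decode('iso-8859-1')   # a unicode object, e.g. u'F\xfcnke'

print(text)        # prints Fünke on a UTF-8 console
print(repr(text))  # prints the escaped form u'F\xfcnke'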

Related

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
    return '£'
I want to print some stuff with a pound sign in front of it. When I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- coding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because every byte sequence is technically valid in that encoding), but it might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
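To see why, here is a small illustration (not from the original answer) of how the same two bytes decode under different encodings; the byte values are just the UTF-8 encoding of the pound sign:
pound_utf8 = b'\xc2\xa3'             # '£' as saved by a UTF-8 editor

print(pound_utf8.decode('utf-8'))    # £  (correct)
print(pound_utf8.decode('latin-1'))  # Â£ (no error, but mojibake)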
Adding the following two lines at the top of the script solved the issue for me:
#!/usr/bin/python
# coding=utf-8
Hope it helps!
You're probably trying to run a Python 3 file with the Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and on most Linux distributions.
But in case you're indeed working on a Python 2 script, a solution not yet mentioned on this page is to re-save the file in the UTF-8 with BOM encoding. That adds three special bytes to the start of the file which explicitly inform the Python interpreter (and your text editor) about the file's encoding.
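If your editor cannot save with a BOM, here is a rough sketch of adding one yourself (the filename is a placeholder; codecs.BOM_UTF8 is the three-byte marker):
import codecs

with open('script.py', 'rb') as f:
    data = f.read()

if not data.startswith(codecs.BOM_UTF8):
    with open('script.py', 'wb') as f:
        f.write(codecs.BOM_UTF8 + data)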

How do I direct output to a file when there are UTF-8 characters?

I have a python script that grabs a bunch of recent tweets from the twitter API and dumps them to screen. It works well, but when I try to direct the output to a file something strange happens and a print statement causes an exception:
> ./tweets.py > tweets.txt
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 61: ordinal not in range(128)
I understand that the problem is with a UTF-8 character in one of the tweets that doesn't translate well to ASCII, but what is a simple way to dump the output to a file? Do I fix this in the python script or is there a way to coerce it at the commandline?
BTW, the script was written in Python2.
Without modifying the script, you can just set the environment variable PYTHONIOENCODING=utf8 and Python will assume that encoding when redirecting to a file.
References:
https://docs.python.org/2.7/using/cmdline.html#envvar-PYTHONIOENCODING
https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHONIOENCODING
You may need to encode the unicode object with .encode('utf-8').
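For example, a minimal Python 2 sketch (the tweet text is a placeholder): either encode each unicode string before printing it, or wrap sys.stdout once so that everything printed afterwards is encoded as UTF-8 even when redirected to a file:
import codecs
import sys

tweet = u'\u2018quoted\u2019 text'   # stand-in for a tweet from the API
print(tweet.encode('utf-8'))         # option 1: encode each string yourself

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print(tweet)                         # option 2: the wrapped stdout encodes unicode for you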
In your Python file, add this as the first line:
# -*- coding: utf-8 -*-
If your script needs to run standalone, add it as the second line, after the shebang:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
Here is the document: PEP 0263

How does "coding: pyxl" work in Python?

pyxl and interpy use a very interesting trick to extend Python's syntax: the coding: declaration from PEP 263.
# coding: pyxl
print <html><body>Hello World!</body></html>
or
# coding: interpy
package = "Interpy"
print "Enjoy #{package}!"
How could I write my own coding: if I wanted to? And could I use more than one?
I'm Syrus, the creator of interpy.
Thanks to codecs, # coding: your_codec_name in Python gives us a chance to preprocess the file before it is compiled to bytecode.
This is how it works:
Reading the file contents
At first, Python reads the file and stores its content. As the content could be encoded in a strange format, Python tries to decode it. Here is where the magic happens.
If no coding declaration is found, Python will try to decode the content with the default codec: ASCII or UTF-8, depending on the Python version. This is why you have to write # coding: utf-8 when using non-ASCII characters (á, ñ, Ð, ...) in Python 2, because ASCII is the default there.
Decoding file contents
If we register a custom codec (both encoder and decoder), and a file tells Python it is using our codec (via # coding: codec_name), then Python will decode the file with our codec.
Registering our codec
To register the codec without needing an import, we create a path configuration file (.pth) which registers the codec before any non-main module is executed.
Transforming file contents
Once the decoder of our codec is called, we can modify the output however we want, but... how do we recognize Python syntax (tokens) inside this content?
Simply call the Python tokenizer with the file contents and modify the desired tokens.
In the case of interpy, it changes the behavior only when Python strings are found in the file content.
Sending back transformed (decoded) contents
Once we transform the content, we send it back to the Python compiler to be compiled to bytecode.
Hope you find this useful!
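To make this concrete, here is a heavily simplified sketch of the idea (the codec name 'mycodec' and the transform are made up for illustration). Real projects such as pyxl also wrap the stream reader / incremental decoder so the transform applies whichever entry point the tokenizer uses, and they trigger the registration from a .pth file (a line like "import mycodec_register" in site-packages runs at interpreter startup):
import codecs
import encodings

# Reuse the standard UTF-8 codec for everything except decoding.
utf8 = encodings.search_function('utf8')

def transform(source):
    # Whatever source-to-source rewrite the codec performs.
    return source.replace('PRINT', 'print')

def decode(input, errors='strict'):
    text, length = utf8.decode(input, errors)
    return transform(text), length

def search_function(name):
    if name != 'mycodec':
        return None
    return codecs.CodecInfo(
        name='mycodec',
        encode=utf8.encode,
        decode=decode,
        incrementalencoder=utf8.incrementalencoder,
        incrementaldecoder=utf8.incrementaldecoder,
        streamreader=utf8.streamreader,
        streamwriter=utf8.streamwriter,
    )

codecs.register(search_function)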

UnicodeDecodeError error writing .xlsx file using xlsxwriter

I am trying to write about 1000 rows to a .xlsx file from my Python application. The data is basically a combination of integers and strings. I am getting an intermittent error when running the wbook.close() command. The error is the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15:
ordinal not in range(128)
My data does not have anything in unicode. I am wondering why the decoder is being invoked at all. Has anyone noticed this problem?
0xc3 is 'Ã' in Latin-1 (and the first byte of many two-byte UTF-8 sequences). So what you need to do is fix the encoding: decode the byte string yourself with the decode() method.
string.decode('utf-8')
Also depending on your needs and uses you could add
# -*- coding: utf-8 -*-
at the beginning of your script, but only if you are sure that the encoding will not interfere and break something else.
As Alex Hristov points out, you have some non-ASCII data in your code that needs to be encoded as UTF-8 for Excel.
See the following examples from the docs which each have instructions on handling UTF-8 with XlsxWriter in different scenarios:
Example: Simple Unicode with Python 2
Example: Simple Unicode with Python 3
Example: Unicode - Polish in UTF-8
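A minimal sketch in that spirit (Python 2; the filename and cell values are placeholders). The key point is to hand XlsxWriter unicode objects rather than undecoded byte strings:
# -*- coding: utf-8 -*-
import xlsxwriter

workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()

rows = [u'Fünke', 'café'.decode('utf-8'), 42]   # unicode strings plus an int
for row, value in enumerate(rows):
    worksheet.write(row, 0, value)

workbook.close()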

Standard Python libraries and Unicode

I have been reading left right and centre about unicode and python. I think I understand what encoding/decoding is, yet as soon as I try to use a standard library method manipulating a file name, I get the infamous:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19:
ordinal not in range(128)
In this case \xe9 stands for 'é', and it doesn't matter if I call it from an os.path.join() or a shutil.copy(); it throws the same error. From what I understand it has to do with the default encoding of Python. I tried to change it with:
# -*- coding: utf-8 -*-
Nothing changes. If I type:
sys.setdefaultencoding('utf-8')
it tells me:
ImportError: cannot import name setdefaultencoding
What I really don't understand is why it works when I type it in the terminal, '\xe9' and all. Could someone please explain to me why this is happening/how to get around it?
Thank you
Filenames on *nix cannot be manipulated as unicode. The filename must be encoded to match the charset of the filesystem and then used.
You should manually decode the filename with the correct encoding (latin-1?) before calling os.path.join.
btw: # -*- coding: utf-8 -*- refers to the string literals in your .py file
effbot has some good info on this.
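A minimal Python 2 sketch of that advice (the paths are placeholders, and latin-1 is only a guess at the right encoding):
import os
import shutil

raw_name = 'r\xe9sum\xe9.txt'               # byte string containing 0xe9 ('é')
name = raw_name.decode('latin-1')           # u'r\xe9sum\xe9.txt'

src = os.path.join(u'/tmp/incoming', name)  # unicode parts now join cleanly
dst = os.path.join(u'/tmp/archive', name)
shutil.copy(src, dst)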
You should not touch the default encoding. It is best practice, and highly recommended, to keep it as 'ascii' and convert your data properly to UTF-8 on the output side.
