RIGHT-TO-LEFT char \u200f causing problems in python - python

I am reading a web page with urllib; the page is served as UTF-8 and contains the RIGHT-TO-LEFT MARK character (U+200F):
http://www.charbase.com/200f-unicode-right-to-left-mark
But when I try to write all of it into a UTF-8 text file:
with codecs.open("results.html", "w", "utf-8") as outFile:
    outFile.write(htmlResults)
I get the error:
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 264: ordinal not in range(128)"
How do I fix that?

If htmlResults is of type str then you need to figure out what encoding it is in if you're going to decode it to Unicode (only Unicode can be encoded). E.g. if htmlResults is encoded as iso-8859-1 (i.e. latin-1), then
tmp = htmlResults.decode('iso-8859-1')
would create a Unicode string in tmp which you could then write to a file:
with codecs.open("results.html", "w", "utf-8") as outFile:
    tmp = htmlResults.decode('iso-8859-1')
    outFile.write(tmp)
If htmlResults is encoded as utf-8 then you do not need to do any decoding/encoding:
with open('results.html', 'w') as fp:
    fp.write(htmlResults)
(the with statement will close the file for you).
This is unrelated to how browsers interpret the file, however; that is determined by the Content-Type header the web server serves the file with, along with any associated meta tags. E.g. if the file is HTML5 you ought to have this near the top of the head tag:
<meta charset="UTF-8">
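In Python 3 terms, the same decode-then-write flow looks like this (a minimal sketch; the byte string here merely simulates what urllib would hand back for a UTF-8 page containing U+200F):

```python
# Simulated response bytes, as urllib would return them: UTF-8 encoded
# text containing the RIGHT-TO-LEFT MARK (U+200F).
raw = "shalom\u200f".encode("utf-8")

# Decode the bytes to text first, then write the text out with an
# explicit encoding; no implicit ASCII step is ever involved.
text = raw.decode("utf-8")
with open("results.html", "w", encoding="utf-8") as out_file:
    out_file.write(text)
```

The key point is that decoding (bytes to text) and encoding (text to bytes) each happen exactly once, with the codec named explicitly.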

Related

Python - Reading CSV UnicodeError

I have exported a CSV from Kaggle - https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis. However, when I attempt to iterate through the file, I receive unicode errors concerning certain characters that cannot be encoded.
File "C:\Program Files\Python35\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 264: character maps to <undefined>
I have enabled utf-8 encoding while opening the file, which I assumed would have decoded the ASCII characters. Evidently not.
My Code:
with open("sentimentDataSet.csv", "r", encoding="utf-8", errors='ignore', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        if row:
            print(row)
            if row[sentimentCsvColumn] == sentimentScores(row[textCsvColumn]):
                accuracyCount += 1
print(accuracyCount)
That's an encode error as you're printing the row, and has little to do with reading the actual CSV.
Your Windows terminal is in CP850 encoding, which can't represent everything.
There are a few things you can do here:
- Set the PYTHONIOENCODING environment variable to a combination that degrades gracefully: running set PYTHONIOENCODING=cp850:replace before starting Python makes it print a question mark in place of any character CP850 cannot represent.
- Change your terminal encoding to UTF-8: run chcp 65001 before starting Python.
- Encode the thing by hand before printing: print(str(data).encode('ascii', 'replace'))
- Don't print the thing.
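The hand-encoding option can be demonstrated in isolation (a sketch; '\u2026' is the HORIZONTAL ELLIPSIS from the traceback, which CP850 cannot represent):

```python
row_text = "positive\u2026"  # U+2026 is not representable in CP850

# Encoding with errors='replace' substitutes '?' for anything the
# target codec cannot represent, so printing can no longer fail.
safe = row_text.encode("cp850", "replace").decode("cp850")
print(safe)  # prints "positive?"
```

On Python 3.7+, sys.stdout.reconfigure(errors='replace') achieves much the same effect without touching environment variables.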

Not able to read file due to unicode error in python

I'm trying to read a file and when I'm reading it, I'm getting a unicode error.
def reading_File(self, text):
    url_text = "Text1.txt"
    with open(url_text) as f:
        content = f.read()
Error:
content = f.read()  # Read the whole file
  File "/home/soft/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 404: ordinal not in range(128)
Why is this happening? The same code runs properly on Windows, but fails on my Linux system.
According to the question, the same code runs properly on Windows but fails on Linux.
Since we know from the question and some of the other answers that the file's contents are neither ASCII nor UTF-8, it's a reasonable guess that the file is encoded with one of the 8-bit encodings common on Windows.
As it happens, 0x92 maps to the character 'RIGHT SINGLE QUOTATION MARK' in the cp125* encodings used in US and Latin/European locales.
So the file should probably be opened like this:
# Python 3
with open(url_text, encoding='cp1252') as f:
    content = f.read()

# Python 2
import codecs
with codecs.open(url_text, encoding='cp1252') as f:
    content = f.read()
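You can verify the mapping directly (a quick sketch):

```python
raw = b"It\x92s a test"        # 0x92, the byte from the traceback
text = raw.decode("cp1252")    # 0x92 -> U+2019, RIGHT SINGLE QUOTATION MARK
print(text)                    # prints "It's a test" with a curly apostrophe
```

Trying raw.decode('ascii') or raw.decode('utf-8') on the same bytes raises UnicodeDecodeError, which matches the traceback above.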
There can be two reasons for that to happen:
The file contains text encoded with an encoding different from 'ascii' and, according to your comments on other answers, 'utf-8'.
The file doesn't contain text at all, it is binary data.
In case 1 you need to figure out how the text was encoded and use that encoding to open the file:
open(url_text, encoding=your_encoding)
In case 2 you need to open the file in binary mode:
open(url_text, 'rb')
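A common way to handle both cases is to try text first and fall back to bytes (a sketch; read_any is a hypothetical helper, not part of the question):

```python
def read_any(path):
    """Return str if the file decodes as UTF-8, otherwise the raw bytes."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()          # case 1: it was UTF-8 text
    except UnicodeDecodeError:
        with open(path, "rb") as f:
            return f.read()          # case 2: treat it as binary data
```

This only distinguishes UTF-8 text from everything else; a cp1252 text file would still come back as bytes and need case-1 handling with the right encoding.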
It looks like the default encoding in effect is ASCII, while on Python 3 it should be UTF-8; the file can be opened with an explicit encoding:
open(file, encoding='utf-8')
Check your system default encoding,
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
If it's not UTF-8, reset your system's locale encoding:
export LANGUAGE=en_US.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
You can use codecs.open to fix this issue with the correct encoding:
import codecs
with codecs.open(filename, 'r', 'utf8') as ff:
    content = ff.read()

Unicode, ASCII, and Regex Won't Work

So I am using Python 3.6.something and I've been trying to figure out this completely intuitive Unicode/ASCII nightmare. I am trying to save the text from a webpage into a file, and parse it using Regex later.
When I try to read the file and parse it, I need to find the pattern:
Note 1 –
Which is apparently different from:
Note 1 -
I keep getting the error:
SyntaxError: Non-UTF-8 code starting with '\x96' in file C:\Users\Steve\eclipse-workspace\scraper\BeautifulSoupTest.py on line 28, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
on the regex that I am trying to run. This is really strange, since '\x96' is a Unicode character from what I've seen online. Something is going on with Unicode or ASCII and I have no clue what it is. I also can't remove the '\x96' character with replace(); it gives the same error. Can anyone help?
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

def downloadCleanText(url, year):
    urlObject = urlopen(url)
    rawHTML = urlObject.read()
    cleanedText = BeautifulSoup(rawHTML, 'html.parser').body.getText()
    outputFile = open(str(year) + '.txt', 'w')
    outputFile.write(cleanedText)
    outputFile.close()

def pullNote1(year):
    inputFile = open(str(year) + '.txt', 'r')
    inData = inputFile.read()
    outData = re.findall('Note 1 –(.*?)Note 2 ', inData)
    print(outData)
    inputFile.close()

downloadCleanText('https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/a10-k20179302017.htm#s2A826F0B8B5755F787D29B5B8C8C7D16', 2000)
pullNote1(2000)
No, 0x96 is not an ASCII codepoint. The ASCII standard defines only 7 bit codepoints, so from 0x00 through to 0x7F. Nor is 0x96 a valid UTF-8 byte sequence.
You most likely have saved your source code as Windows Codepage 1252, where 0x96 is the U+2013 EN DASH codepoint (all codepages between 1250 and 1258 do, but 1252 is the most widely used). So, following the exception message you can make the error go away by adding:
# encoding: cp1252
at the top of the file. Or you could configure your editor to save the file as UTF-8 instead (at which point the byte sequence 0xE2 0x80 0x93 will be written to represent that codepoint).
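You can check both encodings of that codepoint yourself:

```python
dash = "\u2013"                  # EN DASH
print(dash.encode("cp1252"))     # b'\x96' -- the byte from the error message
print(dash.encode("utf-8"))      # b'\xe2\x80\x93'
```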
Alternatively, use only ASCII characters in your source code by using a \uhhhh escape sequence in your string literal:
outData = re.findall('Note 1 \u2013(.*?)Note 2 ', inData)
You may want to read up on Unicode and Python; I strongly recommend Ned Batchelder's Pragmatic Unicode.

How to open a binary file stored in Google App Engine?

I have generated a binary file with the word2vec, stored the resulting .bin file to my GCS bucket, and ran the following code in my App Engine app handler:
gcs_file = gcs.open(filename, 'r')
content = gcs_file.read().encode("utf-8")
# call word2vec with content so it doesn't need to read a file itself,
# as we don't have a filesystem in GAE
Fails with this error:
content = gcs_file.read().encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 15: ordinal not in range(128)
A similar decode error happens if I try gcs_file.read(), or gcs_file.read().decode("utf-8").encode("utf-8").
Any ideas on how to read a binary file from GCS?
Thanks
If it is binary then it will not take a character encoding, which is what UTF-8 is. UTF-8 is just one possible binary encoding of Unicode text (string data). You need to go back and read up on what UTF-8 and ASCII represent and how they are used.
If the data is not text encoded with a specific encoding, it is not going to magically decode, which is why you are getting that error: 0xf6 is not a valid ASCII value.
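In plain Python terms, binary data should be read in binary mode and kept as bytes, with no encode/decode step at all (a sketch using a local stand-in file, since gcs.open is GAE-specific):

```python
# Create a small stand-in "model" file containing non-ASCII bytes.
with open("model.bin", "wb") as f:
    f.write(b"\x00\x01\xf6\xff")

# Read it back in binary mode: the result is a bytes object and no
# codec is ever consulted, so no UnicodeDecodeError can occur.
with open("model.bin", "rb") as f:
    content = f.read()
print(type(content))  # prints <class 'bytes'>
```

With the GAE client, gcs_file.read() already returns raw bytes; the fix is most likely simply to drop the .encode("utf-8") call.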

unicode error with codecs when reading a pdf file in python

I am trying to read a PDF file with the following content:
%PDF-1.4\n%âãÏÓ
If I read it with open, it works, but if I try codecs.open(filename, encoding="utf8", mode="rb") to get a unicode string, I get the following exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Do you know a way to get a unicode string from the content of this file?
PS: I am using python 2.7
PDFs are made of binary data, not text. They cannot be meaningfully represented as Unicode strings.
For what it's worth, you can get a Unicode string containing those particular characters by treating the PDF as ISO8859-1 text:
f = codecs.open(filename, encoding="ISO8859-1", mode="rb")
But at that point, you're better off just using normal open and reading bytes. Unicode is for text, not data.
The issue of trying to interpret arbitrary binary data as text aside, 0xe2 is â in Latin-1, not UTF-8. You're using the wrong codec.
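Latin-1 (ISO8859-1) works here because it maps every byte 0x00-0xFF to the Unicode codepoint with the same value, so decoding can never fail and the round trip is lossless (a quick sketch):

```python
data = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3"  # the header bytes from the question

text = data.decode("latin-1")          # always succeeds, one char per byte
assert text.encode("latin-1") == data  # and encoding back is lossless

print(text[-4:])  # the trailing bytes appear as the accented characters above
```

That property makes Latin-1 a common trick for carrying binary data through text-only APIs, but as the answer says, for actual processing you are better off with bytes from a plain open(filename, 'rb').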
