BeautifulSoup Unable to Parse Unexpected Encodings

Apologies in advance if this post is not well written; I'm extremely new to Python. I have a fairly simple problem with Python 3 and BeautifulSoup. I'm trying to parse a CSV file without knowing the encoding of each line, because each line contains raw data from several sources. Before I even parse the file, I'm using BeautifulSoup in an attempt to clean it up (I'm not sure whether this is a good idea):
from bs4 import BeautifulSoup

def main():
    try:
        soup = BeautifulSoup(open('files/sdk_breakout_1027.csv'))
    except Exception as e:
        print(str(e))
When I run this, however, I encounter the following error:
'ascii' codec can't decode byte 0xed in position 287: ordinal not in range(128)
My traceback points to this line in the CSV as the source of the problem:
500i(í£ : Android OS : 4.0.4
What is a better way to go about this? I just want to convert all rows in this CSV to a uniform encoding so I can parse it later.
Thanks for your help.

Guessing the encoding of random data will never be perfect, but if you know something about your data source, you may be able to do that.
Alternatively, you can open as UTF-8 and either ignore or replace errors:
import csv
with open("filename", encoding="utf8", errors="replace") as f:
    for row in csv.reader(f):
        print(", ".join(row))
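To see what the error handlers actually do before applying them to a whole file, here is a minimal sketch on a single byte string. The sample bytes are made up for illustration: they are "café" encoded as Latin-1, which is not valid UTF-8.

```python
raw = b"caf\xe9"     # "café" encoded as Latin-1, which is not valid UTF-8

# The default (strict) handler raises an exception on the bad byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# "replace" substitutes U+FFFD for the undecodable byte; "ignore" drops it.
replaced = raw.decode("utf-8", errors="replace")
ignored = raw.decode("utf-8", errors="ignore")
print("replace:", ascii(replaced))
print("ignore: ", ignored)
```

Both handlers lose information, so they are only appropriate when the damaged characters do not matter for later processing.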

You can't parse a CSV file with BeautifulSoup, only HTML or XML.
If you want to use the charset guessing from BeautifulSoup on its own, you can. See the Unicode, Dammit section of the docs. If you have a complete list of all of the encodings that might have been used, but just don't know which one in that list was actually used, pass that list to Dammit.
There's a different charset-guessing library known as chardet that you also might want to try. (Note that Dammit will use chardet if you have it installed, so you might not need to try it separately.)
But both of these just make educated guesses; the documentation explains all the myriad ways they can fail.
Also, if each line is encoded differently (which is an even bigger mess than usual), you will have to Dammit or chardet each line as if it were a separate file. With much less text to work with, the guessing is going to be much less accurate, but there's nothing you can do about that if each line really is potentially in a different encoding.
Putting it all together, it would look something like this:
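import csv
from bs4 import UnicodeDammit

encodings = ['utf-8', 'latin-1', 'cp1252', 'shift-jis']

def dammitize(f):
    # Decode each raw line separately, guessing among the candidate encodings
    for line in f:
        yield UnicodeDammit(line, encodings).unicode_markup

with open('foo.csv', 'rb') as f:
    for row in csv.reader(dammitize(f)):
        do_something_with(row)  # placeholder for your own processing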
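If you would rather not depend on BeautifulSoup for this, the same per-line idea can be sketched with the standard library alone. This is not what Dammit does internally; it simply tries a fixed list of encodings in order. The candidate list and the helper names decode_line and decoded_lines are assumptions made for the example.

```python
import csv
import io

# Candidate encodings, tried in order. This list is an assumption for the
# example; latin-1 goes last because it accepts every possible byte.
CANDIDATES = ("utf-8", "cp1252", "latin-1")

def decode_line(raw, candidates=CANDIDATES):
    """Return raw bytes decoded with the first candidate that succeeds."""
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Only reached if the candidate list has no catch-all such as latin-1.
    return raw.decode("utf-8", errors="replace")

def decoded_lines(binary_file):
    """Yield each line of a binary file as text."""
    for raw in binary_file:
        yield decode_line(raw)

# Two lines in different encodings, standing in for a real file.
sample = io.BytesIO("a,\u00e9\n".encode("utf-8") + "b,\u00e9\n".encode("cp1252"))
rows = list(csv.reader(decoded_lines(sample)))
print(ascii(rows))  # ascii() keeps the output printable on any console
```

The order of the candidates matters: a strict encoding such as UTF-8 should come first, because permissive single-byte encodings will happily decode almost anything into the wrong characters.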

Related

UnicodeDecodeError "charmap" when trying to import a .json file into python [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
    f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.

I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).
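Note that in Python 3, .encode("utf-8") returns bytes, so print shows the b'...' representation rather than readable text. If the goal is just to print without crashing, one possible alternative is to escape only the characters the target encoding lacks. This is a minimal sketch on a plain string; the helper name printable is made up for the example.

```python
def printable(text, encoding):
    """Return text with characters the encoding lacks shown as escapes."""
    # encode() substitutes escape sequences for unsupported characters, and
    # decode() turns the result back into str so print() shows text, not b'...'.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

sample = "H\u00e9llo \u2019quoted\u2019"

# cp1252 contains both non-ASCII characters, so the text is unchanged.
print(printable(sample, "cp1252") == sample)

# ASCII contains neither, so both are replaced by escape sequences.
print(printable(sample, "ascii"))
```

In real use you would probably pass sys.stdout.encoding as the encoding, since that is what print uses for the console.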

In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
This works because the file is then written as UTF-8, which can represent every Unicode character. Without the encoding argument, open() uses the platform's default encoding and raises an error when it encounters a character that is not supported by that encoding.

set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
import sys
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252

I hit the same error on Python 3.7 on Windows 10 while saving the response of a GET request. The response from the URL was encoded as UTF-8, so it is worth checking the encoding and passing it explicitly when you open the file; trivial issues like this waste a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)
When I added encoding="utf-8" to the open call, the file was saved with the correct response:
with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)

I faced the same encoding issue, which can occur when you print the data, read or write it, or open a file. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it.
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with (......,encoding="utf-8")
with open(filename_csv , 'w', newline='',encoding="utf-8") as csv_file:

For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)

There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get a UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
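To make this concrete, here is a small sketch that encodes the same Russian word with two legacy code pages and with UTF-8. The sample word is an arbitrary choice for the example.

```python
text = "Привет"  # Russian, written in Cyrillic

# cp1251 is the legacy Windows code page that covers Cyrillic.
cyrillic_bytes = text.encode("cp1251")
print(cyrillic_bytes)

# cp1252 (Western European) has no Cyrillic letters, so encoding fails.
try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# An error handler trades the exception for data loss.
lossy_bytes = text.encode("cp1252", errors="replace")
print(lossy_bytes)

# UTF-8 can represent every Unicode character.
print(text.encode("utf-8"))
```

With cp1251 each letter takes one byte, with UTF-8 each Cyrillic letter takes two, and with cp1252 there is simply no byte available for any of them.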
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom is that UTF-8 text viewed as Windows code page 1252 shows, for example, Héllö as HÃ©llÃ¶.
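The Héllö example can be reproduced in a few lines, which is a handy way to confirm that a given piece of mojibake really is UTF-8 read as cp1252. This is only a sketch of the mechanism; the round trip back works as long as no bytes were lost along the way.

```python
original = "H\u00e9ll\u00f6"   # Héllö

# Encode as UTF-8, then (wrongly) decode the bytes as cp1252.
garbled = original.encode("utf-8").decode("cp1252")
print(ascii(garbled))

# As long as no bytes were lost, reversing the two steps recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```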
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
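The table lookup can also be done programmatically: decode the suspicious bytes with a handful of candidate codecs and see which ones produce the text you expect. The candidate list below is an assumption for the example, not an exhaustive set.

```python
raw = b"H\x8ell\x9a"
expected = "H\u00e9ll\u00f6"   # what the text is supposed to say

# A few single-byte codecs to test; extend the list with whatever is
# plausible for your data source.
candidates = ["cp1252", "cp437", "cp850", "latin-1", "mac_roman"]

matches = []
for name in candidates:
    try:
        decoded = raw.decode(name)
    except UnicodeDecodeError:
        continue  # this codec has no character for one of the bytes
    if decoded == expected:
        matches.append(name)

print(matches)
```

Of these candidates only mac_roman maps both bytes to the expected characters, which agrees with the table lookup described above.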
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
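To check what your own interpreter defaults to, something like the following should do. It only reads settings and changes nothing; sys.flags.utf8_mode requires Python 3.7 or later.

```python
import locale
import sys

# Encoding that open() uses when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)
print("open() default:", default_encoding)

# Encoding that print() uses for the console or a redirected stream.
print("stdout:", sys.stdout.encoding)

# UTF-8 mode (Python 3.7+) is switched on by PYTHONUTF8=1 or -X utf8.
print("UTF-8 mode:", sys.flags.utf8_mode)
```

If the first two lines report cp1252 (or another code page) rather than utf-8, any open() or print() call without an explicit encoding is limited to that code page.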
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
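A short sketch of what utf-8-sig does on disk. The temporary file location is just for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")  # throwaway location

with open(path, "w", encoding="utf-8-sig") as f:
    f.write("H\u00e9ll\u00f6")

with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])          # the three BOM bytes

# Reading with utf-8-sig strips the BOM again; plain utf-8 would keep it
# as an invisible U+FEFF character at the start of the text.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()
print(ascii(text))
```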
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.

From Python 3.7 onwards,
Set the environment variable PYTHONUTF8 to 1.
The following commands also set some other useful system environment variables:
setx /m PYTHONUTF8 1
rem Lets CMD run Python files without typing the extension
setx PATHEXT "%PATHEXT%;.PY"
rem Sets the default Python version for the py launcher
setx /m PY_PYTHON 3.10
Source

I got the same error, and passing encoding="utf-8" solved it.
This generally happens when the text contains a character that the encoding in use cannot represent.
with open("text.txt", "w", encoding='utf-8') as f:
    f.write(data)
This will solve your problem.

If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'.
Example:
import pandas as pd
csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))

decoding issue while parsing JSON [python]

I am reading a JSON file in Python which has lots of fields and values (~8000 records).
Env: windows 10, python 3.6.4;
code:
import json
json_data = json.load(open('json_list.json'))
print (json_data)
With this I get an error. Below is the stack trace:
json_data = json.load(open('json_list.json'))
File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
return loads(fp.read(),
File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>
Along with this I have tried
import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
    print (json_data)
with this my program runs for a long time then hangs with no output.
I have searched almost all topics related to this and could not find a solution.
Note: The JSON data is a valid one as when I see it on Postman/any REST client it doesn't report any anomalies.
Any help on this or alternative solution on how can I load my JSON data (any way by converting it to string then back to JSON etc) will be of great help.
Here is what the file looks like around the reported error:
>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
b'":"2017-04-10","storage_size_gb":"84.747')

The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.
Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.
The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.
Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
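For the affected records, undoing the extra round of encoding could look roughly like this. The bytes are copied from the dump in the question; the sketch assumes the superfluous conversion treated the data as Latin-1 rather than cp1252, which fits here because the byte 0x89 came out as C2 89 (cp1252 would have produced the three-byte sequence for the per mille sign instead).

```python
raw = b'INGL\xc3\x83\xc2\x89S'   # bytes from the dump in the question

# Step 1: the file really is UTF-8, so decode it as such.
once = raw.decode("utf-8")

# Step 2: undo the superfluous Latin-1 to UTF-8 conversion. Each character
# in `once` stands for one byte of the earlier, correct UTF-8 data.
fixed = once.encode("latin-1").decode("utf-8")

print(ascii(once))
print(ascii(fixed))
```

Applying this to records that were not double-encoded will either raise an exception or corrupt them, so it should only be run on the strings that show the symptom.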

Best way to convert unicode in csv to plain text?

I have a large csv file that contains unicode characters which are causing errors in a Python script I am trying to run. My process for removing them so far has been quite tedious. I run my script and as soon as it hits a unicode character, I get an error:
'ascii' codec can't encode character u'\xef' in position 197: ordinal not in range(128)
Then I Google u'\xef' and try to figure out what the character actually is (Does anyone know of a website with a list of these definitions?). I'm using that information to build a dictionary and I have a second Python script that converts the unicode characters to regular text:
import csv
import glob
import os

# Python 2 code (iteritems, binary-mode csv files)
unicode_dict = {"\xb0":"deg", "\xa0":" ", "\xbd":"1/2", "\xbc":"1/4", "\xb2":"^2", "\xbe":"3/4"}
for f in glob.glob(r"C:\Folder1\*.csv"):
    in_csv = f
    out_csv = f.replace(".csv", "_2.csv")
    write_f = open(out_csv, "wb")
    writer = csv.writer(write_f)
    with open(in_csv, 'rb') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            new_row = []
            for s in row:
                for k, v in unicode_dict.iteritems():
                    s = s.replace(k, v)
                new_row.append(s)
            writer.writerow(new_row)
    write_f.close()
    os.remove(in_csv)
    os.rename(out_csv, in_csv)
Then I have to run the code again, get another error, and look up the next unicode character on Google. There must be a better way, right?

Read http://www.joelonsoftware.com/articles/Unicode.html . Carefully.
Then, you'll understand that you need to know which encoding your file is in. If you've been able to find out what \xbd means, maybe the place where you found it also mentions which encoding it belongs to.
Then, use io.open(in_csv, 'r', encoding='yourencodinghere') instead of the vanilla open call. (The mode has to be text mode; binary mode does not accept an encoding argument.)
Then, apparently the csv module in Python 2 doesn't handle Unicode, sigh. Use one of the approaches from SBillion's answer to work around it.

You should have a look at this for a way to handle Unicode via utf-8 in csv files with the standard python library:
https://docs.python.org/2/library/csv.html#csv-examples
But if you prefer, you can use this external unicode-compliant module: https://pypi.python.org/pypi/unicodecsv/0.9.0
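For readers on Python 3, where the csv module works with Unicode text directly, the whole task could be sketched as below. It also shows unicodedata.name(), which answers the "what is this character?" question from the original post without a web search. The replacement table is the one from the question; the sample data is made up, and an in-memory buffer stands in for a file opened with the right encoding and newline="".

```python
import csv
import io
import unicodedata

# unicodedata.name() reports the official name of a character.
for ch in "\xef\xb0\xbd":
    print(ascii(ch), unicodedata.name(ch))

# Replacement table taken from the question; str.translate() applies
# all substitutions in one pass.
table = str.maketrans({"\xb0": "deg", "\xa0": " ", "\xbd": "1/2",
                       "\xbc": "1/4", "\xb2": "^2", "\xbe": "3/4"})

# In-memory stand-in for a file opened with
# open(path, encoding=..., newline="").
source = io.StringIO("size,angle\n2\xbd in,45\xb0\n")
rows = [[cell.translate(table) for cell in row] for row in csv.reader(source)]
print(rows)
```

Whether replacing the characters is a good idea at all depends on the script that consumes the file; writing the output as UTF-8 usually removes the need.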

How to decode file in Python-3.x?

For my project I need to parse an XML file, and I use lxml to do it. The file is encoded in cp1251, but of course, to parse it with lxml I need to decode it into UTF-8, and I don't know how to do that. I tried searching for this, but all the solutions were for Python 2.7 or didn't work.
If I try to write something like
inp = open("business.xml", "r", encoding='cp1251').decode('utf-8')
or
inp.decode('utf-8')
I get
builtins.AttributeError: '_io.TextIOWrapper' object has no attribute 'decode'
I have Python 3.2.
Any help is welcome,
thank you.

open() decodes the file for you. You are already receiving Unicode data.
For lxml you need to open the file in binary mode, and let the XML parser deal with encoding. Do not do this yourself.
from lxml import etree

with open("business.xml", "rb") as inp:
    tree = etree.parse(inp)
XML files include a header to indicate what encoding they use, and the parser adjusts to that. If the header is missing, the parser can safely assume UTF-8.
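Here is a small self-contained demonstration of the parser honouring the declared encoding. To keep it runnable without extra packages it uses the standard library's xml.etree.ElementTree instead of lxml, and the document is built in memory rather than read from business.xml; both are substitutions made for the example.

```python
import xml.etree.ElementTree as ET

# Build a small cp1251-encoded document in memory; the XML declaration
# names the encoding so the parser can decode the bytes itself.
xml_text = '<?xml version="1.0" encoding="windows-1251"?><name>Привет</name>'
raw = xml_text.encode("cp1251")

root = ET.fromstring(raw)      # bytes in, already-decoded text out
print(ascii(root.text))
```

The element text comes back as an ordinary Python string, so there is no separate decoding step left to do.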

How to parse unicode strings with minidom?

I'm trying to parse a bunch of XML files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XML files parse fine, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?

Try to decode it:
> print u'abcdé'.encode('utf-8')
> abcdÃ©
> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé

In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.

Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
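A minimal sketch of the byte-string route, written for Python 3. The document is a made-up one-liner containing U+2019, the character from the error message in the question.

```python
from xml.dom import minidom

# UTF-8 bytes containing U+2019 (right single quotation mark), the
# character from the error message in the question.
raw = '<?xml version="1.0" encoding="utf-8"?><note>it\u2019s</note>'.encode("utf-8")

doc = minidom.parseString(raw)           # pass bytes, not a decoded string
text = doc.documentElement.firstChild.data
print(ascii(text))
```

With a file on disk, minidom.parse(path) achieves the same thing without reading the bytes yourself.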

I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This will wrap the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
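On Python 3 the same effect can probably be had more simply by opening the output file in text mode with an explicit encoding. This is a sketch under that assumption, using a made-up document and a throwaway file path.

```python
import os
import tempfile
from xml.dom import minidom

dom = minidom.parseString("<note>it\u2019s</note>")

path = os.path.join(tempfile.mkdtemp(), "out.xml")   # throwaway location

# The file object does the encoding; the encoding= argument of writexml()
# only controls what the XML declaration says, so keep the two in sync.
with open(path, "w", encoding="utf-8") as fh:
    dom.writexml(fh, encoding="utf-8")

with open(path, "rb") as fh:
    written = fh.read()
print(written)
```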

I encounter this error a few times, and my hacky way of dealing with it is just to do this:
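def getCleanString(word):
    clean = ""  # do not name this "str": that would shadow the built-in used below
    for character in word:
        try:
            # In Python 2, str() raises UnicodeEncodeError for non-ASCII characters
            clean = clean + str(character)
        except UnicodeError:
            pass  # skip characters that cannot be converted
    return clean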
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
