Writing utf-8 to file but parsing incorrectly - python

I am reading in text (stored as paper_dict['Abstract']) from a website that is encoded in utf-8 and trying to write it out to a utf-8 encoded file.
But the ' (apostrophe) character is coming out as â€™ or â instead. If I manually encode the text as utf-8 then it is shown as \xe2\x80\x99 or \xc3\xa2\xc2\x80\xc2\x99.
I keep getting this same issue, regardless of the method I've tried using to write the text to a file. Here is one example:
import io
import re

import requests
from bs4 import BeautifulSoup

f = io.open('file.txt', encoding='utf-8', mode='a+')
base = 'https://www.federalreserve.gov'
path = '/econres/notes/feds-notes/index.htm'
response = requests.get(base + path, verify=False)
page = BeautifulSoup(response.text, 'html.parser')
links = page.find_all('a', href=re.compile("/econres/notes/feds-notes/"))
for a in links:
    paper_dict = {}
    paper_dict['Abstract'] = a.find_next('p').find_next('p').text
    print(paper_dict['Abstract'], file=f)
or
f.write(paper_dict['Abstract'])
The particular example I've been looking at is the note titled "SOMA's Unrealized Loss: What does it mean?" which has a description of "This Note discusses the various valuation measures of the Fed’s securities holdings, what these values mean, and the expected evolution of the value of the SOMA portfolio." But in my output file, instead of "Fed's" it says "Fedâs"

I think that your file contains the correct UTF-8 encoded strings. The problem probably comes from the fact that you later read it as if it were latin1 (iso-8859-1) encoded.
Also note that the apostrophe (') is the Unicode character U+0027, or the ASCII character of code 0x27, but in the HTML page you get, Fed’s contains a different character: a RIGHT SINGLE QUOTATION MARK, which is the Unicode character U+2019.
Now everything can be explained:
"Fed’s".encode('utf8') is the following byte string: b'Fed\xe2\x80\x99s'. If you try to read (decode) it as latin1, you get:
>>> "Fed’s".encode('utf8').decode('latin1')
'Fedâ\x80\x99s'
because â is the Unicode character U+00E2, i.e. the iso-8859-1 character of code 0xe2. And in the Latin-1 character set, both '\x80' and '\x99' are non-printing characters, so you get:
>>> print("Fed’s".encode('utf8').decode('latin1'))
Fedâs
So your output file is correct; it is simply the way you display it that is wrong. You should use a UTF-8-capable text editor, like the excellent vim (gvim) or Notepad++ (google for them if you do not know them).
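If mojibake like this has already leaked into a string, the round trip can usually be reversed; a small sketch of the repair:

```python
# Reversing the mojibake: latin1 maps code points 0-255 one-to-one back to
# bytes, so re-encoding as latin1 recovers the original UTF-8 byte string.
mojibake = "Fed’s".encode('utf8').decode('latin1')   # 'Fedâ\x80\x99s'
repaired = mojibake.encode('latin1').decode('utf8')
print(repaired)  # Fed’s
```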

Related

What encoding is this and how can I decode it in Python?

I have a filename that contains %ed%a1%85%ed%b7%97.svg and want to decode that to its proper string representation in Python 3. I know the result will be 𡗗.svg but the following code does not work:
import urllib.parse
import codecs
input = '%ed%a1%85%ed%b7%97.svg'
unescaped = urllib.parse.unquote(input)
raw_bytes = bytes(unescaped, "utf-8")
decoded = codecs.escape_decode(raw_bytes)[0].decode("utf-8")
print(decoded)
will print ������.svg. It does work, however, when input is a string like %e8%b7%af.svg for which it will correctly decode to 路.svg.
I've tried to decode this with online tools such as https://mothereff.in/utf-8 by replacing % with \x leading to \xed\xa1\x85\xed\xb7\x97.svg. The tool correctly decoded this input to 𡗗.svg.
What happens here?
You need the correct encoding, and a command-line console/terminal that supports and is configured for UTF-8, to display the correct characters.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
PEP 263 -- Defining Python Source Code Encodings: https://www.python.org/dev/peps/pep-0263/
https://stackoverflow.com/questions/3883573/encoding-error-in-python-with-chinese-characters#3888653
"""
from urllib.parse import unquote
urlencoded = '%ed%a1%85%ed%b7%97'
char = unquote(urlencoded, encoding='gbk')
char1 = unquote(urlencoded, encoding='big5_hkscs')
char2 = unquote(urlencoded, encoding='gb18030')
print(char)
print(char1)
print(char2)
# 怼呿窏
# 瞴�窾�
# 怼呿窏
This is quite an exotic Unicode character, and I was wrong about the encoding: it is not a simplified Chinese character but a traditional one, and quite far into the mapping as well, U+215D7 in CJK UNIFIED IDEOGRAPHS EXTENSION B.
But the code point listed and the other values made me suspicious that this was poorly encoded, so it took me a while.
Someone helped me figure out how the encoding got into that form. You need to do a few encoding transforms to revert it back to its original value:
from urllib.parse import unquote_to_bytes
cjk = unquote_to_bytes(urlencoded).decode('utf-8', 'surrogatepass').encode('utf-16', 'surrogatepass').decode('utf-16')
print(cjk)
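Put together as a runnable sketch: the percent-encoded bytes are the UTF-8 encodings of the two halves of a UTF-16 surrogate pair (CESU-8-style), which is why a strict UTF-8 decode rejects them:

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes('%ed%a1%85%ed%b7%97.svg')
try:
    raw.decode('utf-8')            # strict UTF-8 refuses surrogate halves
except UnicodeDecodeError:
    pass
# 'surrogatepass' lets them through; a UTF-16 round trip re-pairs the two
# halves into the single code point U+215D7:
fixed = (raw.decode('utf-8', 'surrogatepass')
            .encode('utf-16', 'surrogatepass')
            .decode('utf-16'))
print(fixed)  # 𡗗.svg
```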

How to write both Chinese characters and English characters into a file (Python 3)?

I wrote a script to scrape the titles of a YouTube playlist page
Everything works fine, according to print statements, until I try to write the titles into a text file, at which point I get "UnicodeEncodeError: 'charmap' codec can't encode characters in position..."
I've tried adding "encoding='utf8'" when I open the file, and while that fixes the error, all the Chinese characters are replaced by random, gibberish characters
I also tried encoding the output string with 'replace', then decoding it, but that also just replaces all the special characters with question marks
Here is my code:
from bs4 import BeautifulSoup as BS
import urllib.request
import re

playlist_url = input("gib nem: ")
with urllib.request.urlopen(playlist_url) as response:
    playlist = response.read().decode('utf-8')
soup = BS(playlist, "lxml")
title_attrs = soup.find_all(attrs={"data-title": re.compile(r".*")})
titles = [tag["data-title"] for tag in title_attrs]
titles_str = '\n'.join(titles)#.encode('cp1252','replace').decode('cp1252')
print(titles_str)
with open("playListNames.txt", "a") as f:
    f.write(titles_str)
And here is the sample playlist I've been using to test:
https://www.youtube.com/playlist?list=PL3oW2tjiIxvSk0WKXaEiDY78KKbKghOOo
Using an encoding will fix your problem. Windows defaults to an ANSI encoding that on US Windows is Windows-1252. It doesn't support Chinese. You should use utf8 or utf-8-sig as the encoding. Some Windows editors prefer the latter and assume ANSI otherwise.
with open('playListNames.txt', 'w', encoding='utf-8-sig') as f:
    f.write(titles_str)
The documentation is clear about file encoding:
encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent (whatever locale.getpreferredencoding() returns),
but any text encoding supported by Python can be used. See the codecs
module for the list of supported encodings.
To answer the questions from your last comment:
You can find out the preferred encoding on Windows with
import locale
locale.getpreferredencoding()
If playListNames.txt was created with open('playListNames.txt', 'w') then the value returned by locale.getpreferredencoding() was used for encoding.
If the file was created manually then the encoding depends on the editor's default/preferred encoding.
Refer to How to convert a file to utf-8 in Python? or How do I convert an ANSI encoded file to UTF-8 with Notepad++? [closed].
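The conversion those links describe is short; a minimal sketch (hypothetical filename), assuming the source file is Windows-1252: read it with the old encoding, then rewrite it as UTF-8.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'ansi.txt')
with open(path, 'w', encoding='cp1252') as f:
    f.write('Fed’s')                      # U+2019 stored as the single byte 0x92
with open(path, encoding='cp1252') as f:  # decode with the *old* encoding
    text = f.read()
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)                         # now stored as E2 80 99
with open(path, 'rb') as f:
    print(f.read())  # b'Fed\xe2\x80\x99s'
```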

Converting character codes to unicode [Python]

So I have a large csv of french verbs that I am using to make a program, in the csv, verbs with accent characters contain codes instead of the actual accents:
être appears as être, for example (at least when I open the file in Excel)
Here is the csv:
https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv
In Chrome and Firefox at least, the codes are converted to the correct accents. I was wondering if, once the string is imported into python and assigned to a variable, i.e.
...
for row in reader:
    inf_lst.append(row[0])
verb = inf_lst[2338]
# (verb = être)
if there was a straightforward/built in method for printing it out with correct unicode to give "être"?
I am aware that you could do this by replacing the ê's with ê's in each string, but since this would have to be done for each different possible accent, I was wondering if there was an easier way.
Thanks,
You can use unicode encoding by prefixing a string with 'u'.
>>> foo = u'être'
>>> print foo
être
It all comes down to the character encoding of the data. It's possible that it is utf-8 encoded and you are viewing it in a Windows tool that is using your local code page, which gives a different display for the stream. How to read/write files is covered in the csv doc examples.
You've given us a zipped, utf-8 encoded web page, and the requests module is good at handling that sort of thing. So you could read the csv with:
>>> import requests
>>> import csv
>>> resp=requests.get("https://raw.githubusercontent.com/ianmackinnon/inflect/master/french-verb-conjugation.csv",
... stream=True)
>>> try:
... inf_lst = list(csv.reader(resp.iter_lines(decode_unicode=True)))
... finally:
... del resp
...
>>> len(inf_lst)
5362
You have a UTF-8-encoded file. Excel likes that encoding to start with a byte order mark character (U+FEFF) or it assumes the default ANSI encoding for your version of Windows instead. To get UTF-8 with BOM, use a tool like Notepad++. Open the file in Notepad++. On the Encoding menu, select "Encode in UTF-8-BOM" and save. Now it will display correctly in Excel.
To write a file that Excel can open, use the encoding utf-8-sig and write Unicode strings:
import io
with io.open('out.csv','w',encoding='utf-8-sig') as f:
f.write(u'être')
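A quick check of what utf-8-sig actually writes (hypothetical filename): the three signature bytes EF BB BF ahead of the UTF-8 data, which is what lets Excel detect the encoding.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'out.csv')
with io.open(path, 'w', encoding='utf-8-sig') as f:
    f.write(u'être')
with open(path, 'rb') as f:   # read back raw bytes to inspect the BOM
    data = f.read()
print(data)  # b'\xef\xbb\xbf\xc3\xaatre'
```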

using £ in function and write it to csv in Python

I want to remove the pound sign from a string that is parsed from a url using BeautifulSoup, and I got the following error for the pound sign:
SyntaxError: Non-ASCII character '\xa3' in file
I tried putting # -*- coding: utf-8 -*- at the top of the file but still got the error.
This is the code. After I get the float number, I want to write it to a csv file.
from bs4 import BeautifulSoup, SoupStrainer

mainTag = SoupStrainer('table', {'class': 'item'})
soup = BeautifulSoup(resp, parseOnlyThese=mainTag)
tag = soup.findAll('td')[3]
price = tag.text.strip()
pr = float(price.lstrip(u'£').replace(',', ''))
The problem is likely one of encoding, and bytes vs characters. What encoding was the CSV file created with? What sequence of bytes is in the file where the £ symbol occurs? What are the bytes contained in the variable price? You'll need to replace the bytes that actually occur in the string. One piece of the puzzle is the contents of the data in your source code. That's where the # -*- coding: utf-8 -*- marker at the top of the source is significant: it tells python how to interpret the bytes in a string literal. It is possible you will need (or want) to decode the bytes from the CSV file to create a Unicode string before replacing the character.
I will point out that the documentation for the csv module in Python 2.7 says:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
The examples section includes the following code demonstrating decoding the bytes provided by the csv module to Unicode strings.
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
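On Python 3 none of this dance is needed: the csv module works on text directly, so you open the file with an explicit encoding and strip the sign per cell. A sketch with made-up stand-in data:

```python
import csv
import io

# In-memory stand-in for a scraped CSV (hypothetical values):
buf = io.StringIO('item,price\r\nwidget,"£1,234.56"\r\n')
rows = list(csv.reader(buf))
# Strip the pound sign and the thousands separator, then parse:
price = float(rows[1][1].lstrip('£').replace(',', ''))
print(price)  # 1234.56
```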

Replacing a weird single-quote (’) with blank string in Python

I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2 aka &#8217;). But when I run that line of code, I get this error:
SyntaxError: Non-ASCII character '\xe2' in file
EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()
# replace bad characters
raw = raw.replace('’', "")
print(raw)
Even after the above code is executed, the unwanted character still exists in the print result. I tried the suggestions in the below answers as well. Pretty sure it's an encoding issue, but I just don't know how to fix it, so of course any help is much appreciated.
The problem here is with the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII[1] octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're trying to search and replace for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.
Fixing this is as simple as adding appropriate calls to encode and decode:
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')
# replace bad characters
raw = raw.replace(u'’', u"'")
print(raw.encode("ascii"))
[1] By "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
You have to declare the encoding of your source file.
Put this as one of the first two lines of your code:
# encoding: utf-8
If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:
data = open('aa_meetings.csv').read()
assert '\x92' in data
chars = data.decode('cp1252')
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', '')
assert u'\u2019' not in fixed
The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
Using unicode literals as I have here is an easy way to avoid this mistake. However, you can encode the character directly if you write it as u'’':
Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'
You can do string.replace('\xe2', "'") to replace them with the normal single-quote.
I was getting such Non-ASCII character '\xe2' errors repeatedly with my Python scripts, despite replacing the single quotes. It turns out the non-ASCII character really was a doubled en dash (––). I replaced it with a regular double dash (--) and that fixed it. [Both will look the same on most screens. Depending on your font settings, the problematic one might look a bit longer.]
For anyone encountering the same issue in their Python scripts (in their lines of code, not in data loaded by your script):
Option 1: get rid of the problematic character
Re-type the line by hand. (To make sure you did not copy-paste the problematic character by mistake.)
Note that commenting the line out will not work.
Check whether the problematic character really is the one you think.
Option 2: change the encoding
Declare an encoding at the beginning of the script, as Roberto pointed out:
# encoding: utf-8
Hope this helps someone.
