I want to remove the pound sign from a string parsed from a URL using BeautifulSoup, and I got the following error for the pound sign:
SyntaxError: Non-ASCII character '\xa3' in file
I tried putting # -*- coding: utf-8 -*- at the start of the class, but I still get the error.
This is the code. After I get the float number, I want to write it to a CSV file.
mainTag = SoupStrainer('table', {'class': 'item'})   # only parse the table we care about
soup = BeautifulSoup(resp, parseOnlyThese=mainTag)
tag = soup.findAll('td')[3]
price = tag.text.strip()
pr = float(price.lstrip(u'£').replace(',', ''))      # the non-ASCII £ in this literal triggers the SyntaxError
The problem is likely one of encoding, and of bytes versus characters. What encoding was the CSV file created with? What sequence of bytes is in the file where the £ symbol occurs? What bytes does the variable price contain? You'll need to replace the bytes that actually occur in the string. One piece of the puzzle is the encoding of the data in your source code. That's where the # -*- coding: utf-8 -*- marker at the top of the source is significant: it tells Python how to interpret the bytes in a string literal. It is possible you will need (or want) to decode the bytes from the CSV file to create a Unicode string before replacing the character.
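For instance, a minimal sketch (Python 2), assuming price ended up holding UTF-8-encoded bytes rather than a unicode string; the literal value is made up for illustration:
# -*- coding: utf-8 -*-
price = '\xc2\xa3469.84'                 # '\xc2\xa3' is the UTF-8 encoding of £
text = price.decode('utf-8')             # decode the bytes to a unicode string first
pr = float(text.lstrip(u'\xa3').replace(u',', u''))   # u'\xa3' is £ as a unicode escape
print pr                                 # 469.84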
I will point out that the documentation for the csv module in Python 2.7 says:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
The examples section includes the following code demonstrating decoding the bytes provided by the csv module to Unicode strings.
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    # helper from the same docs example: re-encode each unicode line as UTF-8
    for line in unicode_csv_data:
        yield line.encode('utf-8')
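A hypothetical usage sketch, assuming a UTF-8 file prices.csv whose fourth column holds values like £469.84 (both the filename and the column index are assumptions):
import io
with io.open('prices.csv', encoding='utf-8') as f:   # io.open yields unicode lines
    for row in unicode_csv_reader(f):
        pr = float(row[3].lstrip(u'\xa3').replace(u',', u''))
        print pr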
I am having trouble encoding ASCII characters to UTF-8, or a string is not picking up the encoding.
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata
#!/usr/bin/python
# -*- coding: UTF-8 -*-

def remove_non_ascii_1(text):
    text.encode('utf-8')
    for i in text:
        return ''.join(i for i in text if i=='£')
In Python 2.7 I get the error
SyntaxError: Non-ASCII character '\xc2' in file on line 16, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
With the Unicode replacement
return ''.join(i for i in text if i=='\xc2')
the error is
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Sample text:
a row read in from a CSV file
[u'06/11/2020', u'ABC', u'32154E', u'3214', u'DEF', u'Cash Purchase', u'Final', u'', u'20.00%', u'ABC', u'Sold From Pickup', u'New ', u'10.00%', u'0', u'15%', u'\xa3469.84', u'Jonathan Jerome', u'3', u'\xa3140.95', u'2%', u'\xa393.97', u'\xa39,396.83', u'', u'\xa35,638.00', u'30/06/2020', u'4', u'Boiler-Amended']
I want to remove the \xa3 or £ from the currency fields.
First, two things up front:
Don't use Python 2 any more, for the reasons mentioned here!
Don't mix different encodings in Python 2.
TL;DR Python 3 just improved so many things regarding encodings that it simply isn't worth it.
Whole story: read here
OK, with that out of the way, let's start fixing your code.
As Klaus D. already mentioned, you do not save the result of text.encode('utf-8'). This leads to an encoding warning when comparing seemingly equal characters (£ and £): one carries the encoding of the file you read, while the other is a plain ASCII str literal. (Your -*- coding: UTF-8 -*- line does not help here; it only declares what your code file is encoded in, it does not change how the interpreter parses str literals.)
Edit: Also, when comparing to the character, you need to compare against a unicode char, so you can either convert it or simply tell the interpreter to treat the literal as unicode in the first place (that's why I added a leading u in front of your '£').
To fix this, simply save the result back into text when you call text.encode('utf-8').
Additionally, the "shebang" and the encoding declaration should always be at the very top of the file, so that as soon as you open the file you know what you are dealing with.
Something else I would correct is the for-loop: it is unnecessary, because you return out of the function on the first iteration anyway.
This means the completely "corrected" code is this:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import unicodecsv as csv
import re
import pyodbc
import sys
import unicodedata

def remove_non_ascii_1(text):
    text = text.encode('utf-8')
    return ''.join(i for i in text if i == u'£')
PS: You should really think again about whether def remove_non_ascii_1(text) is necessary at all. By the looks of it, you already input a list of unicode strings, which you probably read directly from the file. That means you don't need to correct the encoding, though the comparison against u'£' could stay. You might just want to rename the method ;)
Hope this helped and cleared up possible unclarities about Python 2 encodings :D
If you print your list as a whole now, you will see it still contains \xa3 and not the actual '£', but if you print the elements separately it works fine. This is because printing the list renders each element with its __repr__(), which escapes non-ASCII characters instead of encoding them for your terminal...
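A tiny sketch of that behaviour (Python 2):
row = [u'\xa3469.84', u'ABC']
print row        # [u'\xa3469.84', u'ABC'] -- the list repr escapes non-ASCII
print row[0]     # £469.84 -- a single element is encoded for the terminal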
Python 3 greatly improved Unicode text handling. If you have to use Python 2.7, I would recommend using the codecs library when reading text files, since it helps with Python 2.7 Unicode issues:
import codecs
fp = codecs.open("file", "r", encoding="utf-8")
In your case I noticed that you are using unicodecsv as a drop-in csv replacement. In that case, you can pass the parameter encoding='utf-8' when reading the csv file into a list:
r = csv.reader(f, encoding='utf-8')
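A self-contained sketch of that, assuming a UTF-8 file named data.csv (the filename is an assumption):
import unicodecsv as csv
with open('data.csv', 'rb') as f:        # unicodecsv expects a binary file object
    for row in csv.reader(f, encoding='utf-8'):
        print row                        # every cell is already a unicode string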
For just removing non-ASCII characters, I would recommend checking this good answer on StackOverflow.
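The usual sketch for that approach, assuming text is a unicode string, is to keep only code points below 128:
def remove_non_ascii(text):
    # drops £, ’, and every other non-ASCII character
    return ''.join(ch for ch in text if ord(ch) < 128)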
I have a filename that contains %ed%a1%85%ed%b7%97.svg and want to decode that to its proper string representation in Python 3. I know the result will be 𡗗.svg but the following code does not work:
import urllib.parse
import codecs
input = '%ed%a1%85%ed%b7%97.svg'
unescaped = urllib.parse.unquote(input)
raw_bytes = bytes(unescaped, "utf-8")
decoded = codecs.escape_decode(raw_bytes)[0].decode("utf-8")
print(decoded)
will print ������.svg. It does work, however, when input is a string like %e8%b7%af.svg for which it will correctly decode to 路.svg.
I've tried to decode this with online tools such as https://mothereff.in/utf-8 by replacing % with \x leading to \xed\xa1\x85\xed\xb7\x97.svg. The tool correctly decoded this input to 𡗗.svg.
What happens here?
You need the correct encoding, and a console/terminal that supports and is configured for UTF-8, to display the correct characters.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
PEP 263 -- Defining Python Source Code Encodings: https://www.python.org/dev/peps/pep-0263/
https://stackoverflow.com/questions/3883573/encoding-error-in-python-with-chinese-characters#3888653
"""
from urllib.parse import unquote
urlencoded = '%ed%a1%85%ed%b7%97'
char = unquote(urlencoded, encoding='gbk')
char1 = unquote(urlencoded, encoding='big5_hkscs')
char2 = unquote(urlencoded, encoding='gb18030')
print(char)
print(char1)
print(char2)
# 怼呿窏
# 瞴�窾�
# 怼呿窏
This is quite an exotic Unicode character, and I was wrong about the encoding: it's not a simplified Chinese character but a traditional one, and it sits quite far into the mapping as well: U+215D7, CJK UNIFIED IDEOGRAPHS EXTENSION B.
But the code point listed and the other values made me suspect this was poorly encoded, so it took me a while.
Someone helped me figure out how the encoding got into that form: you need to do a few encoding transforms to revert it back to its original value.
from urllib.parse import unquote_to_bytes
cjk = unquote_to_bytes(urlencoded).decode('utf-8', 'surrogatepass').encode('utf-16', 'surrogatepass').decode('utf-16')
print(cjk)
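Step by step, here is what that one-liner does, using the same urlencoded value as above (Python 3):
from urllib.parse import unquote_to_bytes
raw = unquote_to_bytes('%ed%a1%85%ed%b7%97')   # b'\xed\xa1\x85\xed\xb7\x97'
pair = raw.decode('utf-8', 'surrogatepass')    # '\ud845\uddd7', a UTF-16 surrogate pair
print(pair.encode('utf-16', 'surrogatepass').decode('utf-16'))  # 𡗗 (U+215D7)
In other words, the filename was percent-encoded from the UTF-8 bytes of the two surrogate code points instead of the real character (a form sometimes called CESU-8), and the surrogatepass round-trip undoes that.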
I am reading in text (stored as paper_dict['Abstract']) from a website that is encoded in utf-8 and trying to write it out to a utf-8 encoded file.
But the ' (apostrophe) character is coming out as â€™ or â instead. If I manually encode the text as utf-8, it is shown as \xe2\x80\x99 or \xc3\xa2\xc2\x80\xc2\x99.
I keep getting this same issue, regardless of the method I've tried using to write the text to a file. Here is one example:
import io
import re
import requests
from bs4 import BeautifulSoup

f = io.open('file.txt', encoding='utf-8', mode='a+')
base = 'https://www.federalreserve.gov'
path = '/econres/notes/feds-notes/index.htm'
response = requests.get(base + path, verify=False)
page = BeautifulSoup(response.text, 'html.parser')
links = page.find_all('a', href=re.compile("/econres/notes/feds-notes/"))
for a in links:
    paper_dict = {}
    paper_dict['Abstract'] = a.find_next('p').find_next('p').text
    print(paper_dict['Abstract'], file=f)
or
f.write(paper_dict['Abstract'])
The particular example I've been looking at is the note titled "SOMA's Unrealized Loss: What does it mean?" which has a description of "This Note discusses the various valuation measures of the Fed’s securities holdings, what these values mean, and the expected evolution of the value of the SOMA portfolio." But in my output file, instead of "Fed's" it says "Fedâs"
I think that your file contains the correct UTF-8 encoded strings. The problem probably comes from the fact that you later read it as if it were latin1 (iso-8859-1) encoded.
And be aware that the APOSTROPHE (') is the Unicode character U+0027 (ASCII code 0x27), but in the HTML page you get, Fed’s contains a different character, a RIGHT SINGLE QUOTATION MARK, which is the Unicode character U+2019.
Now everything can be explained:
"Fed’s".encode('utf8') is the following byte string: b'Fed\xe2\x80\x99s'. If you try to read (decode) it as latin1, you get:
>>> "Fed’s".encode('utf8').decode('latin1')
'Fedâ\x80\x99s'
because â is the unicode character U+00E2 or the iso-8859-1 character of code 0xe2. And in the Latin1 character set, both '\x80' and '\x99' are non printing characters, so you get:
>>> print("Fed’s".encode('utf8').decode('latin1'))
Fedâs
So your output file is correct; simply the way you display it is wrong. You should use a UTF-8 enabled text editor, like the excellent vim (gvim) or notepad++ (google for them if you do not know them).
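As an aside, if you already have mojibake text in memory, a common recovery sketch (assuming exactly one latin1-for-UTF-8 mix-up, as here) is to reverse the wrong decode:
broken = u'Fedâ\x80\x99s'                        # what the latin1 mis-read produced
fixed = broken.encode('latin1').decode('utf-8')  # back to bytes, then decode properly
print(fixed)                                     # Fed’s, with the proper U+2019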
I get a JSON object from a URL which has values like this:
title:'\u05d9\u05d7\u05e4\u05d9\u05dd'
I need to print these values as readable text, however I'm not able to convert them, as they are taken as literal strings and not unicode objects.
doing unicode(myStr) does not work
doing a = u'%s' % myStr does not work
Both leave the string escaped and return the same sequence of characters.
Does anyone know how I can do this conversion in Python?
Maybe the right approach is to change the encoding of the response; how do I do that?
You should use the json module to load the JSON data into a Python object. It will take care of this for you, and you'll have Unicode strings. Then you can encode them to match your output device, and print them.
JSON strings always use ", not ', so '\u05d9\u05d7\u05e4\u05d9\u05dd' is not a JSON string.
If you load a valid JSON text then all Python strings in it are Unicode, so you don't need to decode anything. To display them you might need to encode them using a character encoding suitable for your terminal.
Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
d = json.loads(u'''{"title": "\u05d9\u05d7\u05e4\u05d9\u05dd"}''')
print d['title'].encode('utf-8') # -> יחפים
Note: it is a coincidence that the source encoding (specified in the first line) is equal to the output encoding (the last line) they are unrelated and can be different.
If you'd like to see fewer \uxxxx sequences in a json text, you could use ensure_ascii=False:
Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
L = ['יחפים']
json_text = json.dumps(L) # default encoding for input bytes is utf-8
print json_text # all non-ASCII characters are escaped
json_text = json.dumps(L, ensure_ascii=False)
print json_text # output as-is
Output
["\u05d9\u05d7\u05e4\u05d9\u05dd"]
["יחפים"]
If you have a string like this outside of your JSON object for some reason, you can decode the string using raw_unicode_escape to get the unicode string you want:
>>> '\u05d9\u05d7\u05e4\u05d9\u05dd'.decode('raw_unicode_escape')
u'\u05d9\u05d7\u05e4\u05d9\u05dd'
>>> print '\u05d9\u05d7\u05e4\u05d9\u05dd'.decode('raw_unicode_escape')
יחפים
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2, aka &#8217;). But when I run that line of code, I get this error:
SyntaxError: Non-ASCII character '\xe2' in file
EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()
# replace bad characters
raw = raw.replace('’', "")
print(raw)
Even after the above code is executed, the unwanted character still shows up in the print result. I tried the suggestions in the answers below as well. I'm pretty sure it's an encoding issue, but I just don't know how to fix it, so any help is much appreciated.
The problem here is the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII[1] octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're searching for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.
Fixing this is as simple as adding appropriate calls to encode and decode:
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')
# replace bad characters
raw = raw.replace(u'’', u"'")
print(raw.encode("ascii"))
[1] by "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
You have to declare the encoding of your source file.
Put this as one of the first two lines of your code:
# encoding: utf-8
If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
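A two-line sketch of the effect in Python 2: with the declaration the parser accepts the non-ASCII bytes in the literal, while without it the very same line raises the SyntaxError above.
# encoding: utf-8
s = '’'           # three UTF-8 bytes in a Python 2 str
print repr(s)     # '\xe2\x80\x99'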
This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:
data = open('aa_meetings.csv').read()
assert '\x92' in data
chars = data.decode('cp1252')
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', '')
assert u'\u2019' not in fixed
The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
Using unicode literals as I have here is an easy way to avoid this mistake. However, you can also write the character directly as u'’', as the interpreter session below shows:
Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'
You can do string.replace('\xe2', "'") to replace them with the normal single-quote.
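Be careful, though: if the text is UTF-8-encoded bytes, '\xe2' is only the first of the three bytes of ’, so a sketch that replaces the full sequence is safer:
raw = 'Fed\xe2\x80\x99s'
print raw.replace('\xe2\x80\x99', "'")   # Fed's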
I was getting such Non-ASCII character '\xe2' errors repeatedly with my Python scripts, despite replacing the single quotes. It turned out the non-ASCII character really was a double en dash (−−). I replaced it with a regular double dash (--) and that fixed it. [Both look the same on most screens; depending on your font settings, the problematic one might look slightly longer.]
For anyone encountering the same issue in their Python scripts (in the lines of code themselves, not in data loaded by the script):
Option 1: get rid of the problematic character
Re-type the line by hand. (To make sure you did not copy-paste the problematic character by mistake.)
Note that commenting the line out will not work.
Check whether the problematic character really is the one you think (see the sketch below).
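A quick sketch for that last check (Python 2), assuming the script is named script.py: print the line number and value of every non-ASCII byte.
with open('script.py', 'rb') as f:
    for lineno, line in enumerate(f, 1):
        for ch in line:
            if ord(ch) > 127:
                print lineno, repr(ch)   # e.g. 16 '\xe2'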
Option 2: change the encoding
Declare an encoding at the beginning of the script, as Roberto pointed out:
# encoding: utf-8
Hope this helps someone.