Using Python and Beautiful Soup, I have created a script that scrapes the name, address and phone number of businesses from a website and saves the output into three columns of a CSV file.
The script works fine until it reaches a business name like the following:
u'\nLevel 12, 280 George Street SYDNEY\xa0 NSW\xa0 2000. . Sydney. NSW 2000\n'
The problem is the "\xa0" part. The error message states:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 35: ordinal not in range(128)
I have a vague idea of what this error means but have no idea how to deal with it. Any ideas?
Thanks
Edit:
My script is as follows:
import bs4
import csv
import requests

page = requests.get('http://accountantlist.com.au/x123-Accountants-in-Sydney.aspx?Page=0')
soup = bs4.BeautifulSoup(page.content)

for company in soup.select('table#ctl00_ContentPlaceHolder1_dgLawyers tr > td > table'):
    name = company.a.text
    address = company.find_all('tr')[1].text
    phone = company.tr.find_all('td')[1].text
    with open('/home/kwal0203/Desktop/eggs.csv', 'a') as csvfile:
        s = csv.writer(csvfile)
        s.writerow([name, address, phone])
You need to encode the text to UTF-8 while writing to the CSV file, as Python 2's built-in csv module doesn't support Unicode. For example:
def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i) < 128)

name = remove_non_ascii(company.a.text)
address = remove_non_ascii(company.find_all('tr')[1].text)
phone = remove_non_ascii(company.tr.find_all('td')[1].text)

with open('/home/kwal0203/Desktop/eggs.csv', 'a') as csvfile:
    s = csv.writer(csvfile)
    s.writerow([data.encode("utf-8") for data in [name, address, phone]])
Or you can install unicodecsv, which supports Unicode by default. You can install it like this:
pip install unicodecsv
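A minimal sketch of using it as a drop-in replacement (assuming the same name, address and phone variables as above):

import unicodecsv

# unicodecsv accepts unicode strings directly and encodes them for you
# (UTF-8 by default), so no manual .encode() calls are needed.
with open('/home/kwal0203/Desktop/eggs.csv', 'ab') as csvfile:
    s = unicodecsv.writer(csvfile, encoding='utf-8')
    s.writerow([name, address, phone])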
Related
I've written a web scraper that scrapes NBA box score data off of basketball-reference. The specific webpage that my error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0107' in position 11: character maps to <undefined>
is occurring on is here. Lastly, the specific player data that is tripping it up and throwing this specific UnicodeEncodeError is this one (although I am sure the error is more generalized and will be produced with any character that contains an obscure accent mark).
The minimal reproducible code:
def get_boxscore_basic_table(tag):  # used to only get specific tables
    tag_id = tag.get("id")
    tag_class = tag.get("class")
    return (tag_id and tag_class) and ("basic" in tag_id and "section_wrapper" in tag_class and not "toggleable" in tag_class)

import requests
from bs4 import BeautifulSoup
import lxml
import csv
import re

website = 'https://www.basketball-reference.com/boxscores/202003110MIA.html'
r = requests.get(website).text
soup = BeautifulSoup(r, 'lxml')
tables = soup.find_all(get_boxscore_basic_table)

in_file = open('boxscore.csv', 'w', newline='')
csv_writer = csv.writer(in_file)

column_names = ['Player','Name','MP','FG','FGA','FG%','3P','3PA','3P%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS','+/-']
csv_writer.writerow(column_names)

for table in tables:
    rows = table.select('tbody tr')
    for row in rows:
        building_player = []  # temporary container to hold player and stats
        player_name = row.th.text
        if 'Reserves' not in player_name:
            building_player.append(player_name)
            stats = row.select('td.right')
            for stat in stats:
                building_player.append(stat.text)
            csv_writer.writerow(building_player)  # writing to csv

in_file.close()
What is the best way around this?
I've seen some stuff online about changing the encoding, and specifically about using the .encode('utf-8') method on the string before writing to the csv, but it seems that this .encode() method, although it stops an error from being thrown, has several of its own problems. For instance, player_name.encode('utf-8') before writing to csv turns the name 'Willy Hernangómez' into 'b'Willy Hernang\xc3\xb3mez'' within my csv... not exactly a step in the right direction.
Any help with this and an explanation as to what is happening would be much appreciated!
Use
in_file = open('boxscore.csv', 'w', newline='', encoding='utf-8')
instead of
in_file = open('boxscore.csv', 'w', newline='')
and keep everything else the same. Make sure you open the file in Excel with UTF-8 encoding.
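If Excel still garbles the accented names when you double-click the file, one option (a sketch, not part of the original answer) is to write a UTF-8 byte-order mark so Excel detects the encoding automatically:

# 'utf-8-sig' prepends a BOM, which Excel uses to recognise the file as UTF-8.
in_file = open('boxscore.csv', 'w', newline='', encoding='utf-8-sig')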
I am trying to scrape data from a table on a website.
from bs4 import BeautifulSoup as soup  # the calls below assume this alias

page_soup = soup(html, 'html.parser')
stat_table = page_soup.find_all('table')
stat_table = stat_table[0]

with open('stats.txt', 'w') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            q.write(cell.text)
However, when I try to write the file, I get this error message: 'ascii' codec can't encode character '\xa0' in position 19: ordinal not in range(128).
I understand that it should be encoded with .encode('utf-8'), but
cell.text.encode('utf-8')
doesn't work.
Any help would be greatly appreciated. Using Python 3.6
The file encoding is determined by the current environment, which in this case defaults to ASCII. You can specify the file encoding directly:
with open('stats.txt', 'w', encoding='utf8') as q:
    pass
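Applied to the loop from the question, that would look something like this (a sketch; the rest of the code is unchanged):

with open('stats.txt', 'w', encoding='utf8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            q.write(cell.text)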
My CSV was originally created by Excel. Anticipating encoding anomalies, I opened and re-saved the file with UTF-8 BOM encoding using Sublime Text.
Imported into the notebook:
filepath = "file:///Volumes/PASSPORT/Inserts/IMAGETRAC/csv/universe_wcsv.csv"
uverse = sc.textFile(filepath)
header = uverse.first()
data = uverse.filter(lambda x:x<>header)
Formatted my fields:
fields = header.replace(" ", "_").replace("/", "_").split(",")
Structured the data:
import csv
from StringIO import StringIO
from collections import namedtuple

Products = namedtuple("Products", fields, verbose=True)

def parse(row):
    reader = csv.reader(StringIO(row))
    row = reader.next()
    return Products(*row)

products = data.map(parse)
If I then do products.first(), I get the first record as expected. However, if I want to, say, see the count by brand and so run:
products.map(lambda x: x.brand).countByValue()
I still get a UnicodeEncodeError-related Py4JJavaError:
File "<ipython-input-18-4cc0cb8c6fe7>", line 3, in parse
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 125: ordinal not in range(128)
How can I fix this code?
The csv module in legacy Python versions doesn't support Unicode input. Personally, I would recommend using the Spark csv data source:
df = spark.read.option("header", "true").csv(filepath)
fields = [c.strip().replace(" ", "_").replace("/", "_") for c in df.columns]
df.toDF(*fields).rdd
For most applications Row objects should work as well as namedtuple (Row extends tuple and provides similar attribute getters), but you can easily convert one into the other.
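For example, a minimal sketch of such a conversion (reusing the Products namedtuple defined in the question; since Row extends tuple, it unpacks directly):

# Convert each Row back into the Products namedtuple.
products = df.toDF(*fields).rdd.map(lambda row: Products(*row))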
You could also try reading the data without decoding:
uverse = sc.textFile(filepath, use_unicode=False)
and decoding fields manually after initial parsing:
(data
    .map(parse)
    .map(lambda prod: Products(*[x.decode("utf-8") for x in prod])))
Related question: Reading a UTF8 CSV file with Python
I'm not entirely sure what I need to do about this error. I assumed that it had to do with needing to add .encode('utf-8'), but I'm not entirely sure whether that's what I need to do, nor where I should apply it.
The error is:
line 40, in <module>
    writer.writerows(list_of_rows)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 17: ordinal not in range(128)
This is the base of my python script.
import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://dummysite'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', {'class': 'table'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('[','').replace(']','')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Name", "Location"])
writer.writerows(list_of_rows)
The Python 2.x csv library is broken. You have three options, in order of complexity:
Use the fixed library https://github.com/jdunck/python-unicodecsv (pip install unicodecsv) as a drop-in replacement (Edit: see the note below about UTF-16). Example:
with open("myfile.csv", 'rb') as my_file:
r = unicodecsv.DictReader(my_file, encoding='utf-8')
Read the CSV manual regarding Unicode: https://docs.python.org/2/library/csv.html (See examples at the bottom)
Manually encode each item as UTF-8:
for cell in row.findAll('td'):
    text = cell.text.replace('[','').replace(']','')
    list_of_cells.append(text.encode("utf-8"))
Edit: I found that python-unicodecsv is also broken when reading UTF-16; it complains about any 0x00 bytes.
Instead, use https://github.com/ryanhiebert/backports.csv, which more closely resembles the Python 3 implementation and uses the io module.
Install:
pip install backports.csv
Usage:
from backports import csv
import io
with io.open(filename, encoding='utf-8') as f:
    r = csv.reader(f)
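Writing works the same way; a minimal sketch (using a hypothetical out.csv, Unicode strings, and newline='' as the Python 3-style csv docs recommend):

from backports import csv
import io

with io.open('out.csv', 'w', encoding='utf-8', newline='') as f:
    w = csv.writer(f)
    w.writerow([u"Name", u"Location"])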
The issue lies with the csv library in Python 2.
From the unicodecsv project page
Python 2’s csv module doesn’t easily deal with unicode strings, leading to the dreaded “‘ascii’ codec can’t encode characters in position …” exception.
If you can, just install unicodecsv
pip install unicodecsv
import unicodecsv
writer = unicodecsv.writer(csvfile)
writer.writerow(row)
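Applied to the script in the question, that would look roughly like this (a sketch; only the writer setup changes):

import unicodecsv

outfile = open("./test.csv", "wb")
writer = unicodecsv.writer(outfile, encoding='utf-8')
writer.writerow(["Name", "Location"])
writer.writerows(list_of_rows)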
I found the easiest option, in addition to Alastair's excellent suggestions, to be using Python 3 instead of Python 2. All it required in my script was changing 'wb' in the open statement to simply 'w', in accordance with Python 3's syntax.
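A sketch of that one change under Python 3 (adding newline='' is optional but avoids blank lines in the CSV on Windows):

# Python 3: open in text mode and let the csv module handle Unicode itself.
outfile = open("./test.csv", "w", newline='')
writer = csv.writer(outfile)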
I've been trying to scrape data from a website and write out the data that I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as "Burger King®, Hans Café", it doesn't like writing them to the file, so my error handling prints them to the screen as is, without any further errors.
I've tried the encode and decode functions and the various encodings but to no avail.
Please find an excerpt of the current code that I've written below:
import urllib2, sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup, NavigableString, SoupStrainer
from string import maketrans
import codecs

f = codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',', u';')
            if count_detail < 4:
                stream = stream + ","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum) + "," + br_name_addr + "," + stream.decode(enc_s5) + os.linesep)
    except:
        print "Unicode error ->" + str(storenum) + "," + branch_name_address + "," + stream
Your f.write() line doesn't make sense to me: stream will be a unicode, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try:
f.write(unicode(stnum)+br_name_addr+u","+stream+os.linesep)
... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:
line = unicode(stnum)+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
Have you checked the encoding of the file you're writing to, and made sure the characters can be represented in that encoding? Try setting the character encoding explicitly to UTF-8 (or something else) so the characters show up.