I am more than a bit tired, but here goes:
I am doing some HTML scraping in Python 2.6.5 with BeautifulSoup on an Ubuntu box.
Reason for Python 2.6.5: BeautifulSoup doesn't work well under 3.1.
I try to run the following code:
# data retrieval from HTML files from DETHERM
# -*- coding: utf-8 -*-
import sys, os, re, csv
from BeautifulSoup import BeautifulSoup

sys.path.insert(0, os.getcwd())
raw_data = open('download.php.html', 'r')
soup = BeautifulSoup(raw_data)

for numdiv in soup.findAll('div', {"id": "sec"}):
    currenttable = numdiv.find('table', {"class": "data"})
    if currenttable:
        numrow = 0
        numcol = 0
        data_list = []
        for row in currenttable.findAll('td', {"class": "dataHead"}):
            numrow = numrow + 1
        for ncol in currenttable.findAll('th', {"class": "dataHead"}):
            numcol = numcol + 1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            if col2.index('±'):
                col2 = col2[:col2.index('±')]
            print(col2.encode("utf-8"))
    ref = numdiv.find('a')
    niceref = ''.join(ref.findAll(text=True))
Now, due to the ± signs, I get the following error when trying to run the code with:
python code.py
Traceback (most recent call last):
File "detherm-wtest.py", line 25, in
if col2.index('±'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
How do I solve this? Putting a u in front, so we have '±' -> u'±', results in:
Traceback (most recent call last):
File "detherm-wtest.py", line 25, in
if col2.index(u'±'):
ValueError: substring not found
The current code file encoding is UTF-8.
Thank you.
Byte strings like "±" (in Python 2.x) are encoded in the source file's encoding, which might not be what you want. If col2 is really a Unicode object, you should use u"±" instead, as you already tried. Note that somestring.index raises an exception if it doesn't find an occurrence, whereas somestring.find returns -1. Therefore, this
if col2.index('±'):
    col2 = col2[:col2.index('±')]
print(col2.encode("utf-8"))
should be
if u'±' in col2:
    col2 = col2[:col2.index(u'±')]
print(col2.encode("utf-8"))
so that the if statement doesn't lead to an exception.
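To make the difference concrete, here is a minimal self-contained sketch (with a made-up measurement string, not the asker's DETHERM data) of why the membership test is the safer guard:

```python
# `in` answers "is the substring there?" without raising,
# while .index raises ValueError when the substring is absent.
s = u'1.234 \u00b1 0.005'   # u'\u00b1' is the ± sign

if u'\u00b1' in s:
    value = s[:s.index(u'\u00b1')].strip()
else:
    value = s.strip()

print(value)  # -> 1.234
```

Using the escaped form u'\u00b1' also sidesteps any source-file encoding questions entirely.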
Related
I want to scrape the name and price from this website:
https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2
Both name and price are within div tags.
Name:
Price
Printing name works fine, but printing Price gives me an error:
Traceback (most recent call last):
File "c:\File.py", line 37, in <module>
print(price.text)
File "C:\Python37\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 0: character maps to <undefined>
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
response = requests.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
soup = BeautifulSoup(response.text, 'html.parser')

for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    print(name.text)
What is the difference between the two?
Why does one of them give me an error while the other does not?
It is yielding that error because Python is having trouble with that currency sign. The Indian rupee sign is interpreted differently depending on the locale and is not in Python's default charmap on your system. If we change your last print statement to print(str(price.text.encode("utf-8"))), we get results that look like this:
b'\xe2\x82\xb961,990'
b'\xe2\x82\xb940,000'
b'\xe2\x82\xb963,854'
b'\xe2\x82\xb934,990'
b'\xe2\x82\xb948,990'
b'\xe2\x82\xb952,990'
b'\xe2\x82\xb932,990'
b'\xe2\x82\xb954,990'
b'\xe2\x82\xb952,990'
Since this output isn't very pretty and probably isn't usable, I would personally truncate that symbol before printing. If you really want Python to print the Indian rupee symbol, you can add it to your charmap. Follow the steps from this post to add customizations to the charmap.
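If you want usable numbers rather than a prettier print, one option (a sketch, assuming the scraped text always starts with the rupee sign and uses comma thousands separators) is to strip the symbol before converting:

```python
# Sketch: turn a scraped price like '₹61,990' into an int.
# u'\u20b9' is the rupee sign; this assumes it leads the string.
raw = u'\u20b961,990'
digits = raw.lstrip(u'\u20b9').replace(',', '')
price = int(digits)
print(price)  # -> 61990
```

Once the value is an int, printing and arithmetic both work regardless of the console's charmap.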
The following code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
uClient = uReq('http://www.google.com')
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html.decode('utf-8', 'ignore'), 'lxml')
print(page_soup.find_all('p'))
...produces the following error:
C:\>python ws1.py
Traceback (most recent call last):
File "ws1.py", line 10, in <module>
print(page_soup.find_all('p'))
File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 40: character maps to <undefined>
I have searched, in vain, for a solution, as every post I have read suggests using a specific encoding, none of which has eradicated the problem.
Any help would be appreciated.
Thank you.
You're trying to print a Unicode string that contains characters that can't be represented in the encoding used by your console.
It appears you're using the Windows command line, which means your problem could be solved simply by switching to Python 3.6 - it bypasses the console encoding altogether and sends Unicode straight to Windows.
If that's not possible, you can encode the string yourself and specify that unprintable characters should be replaced with an escape sequence. Then you must decode it again so that print will work properly.
Note that find_all returns a ResultSet (a list), so wrap it in str() before encoding:

import sys

bstr = str(page_soup.find_all('p')).encode(sys.stdout.encoding, errors='backslashreplace')
print(bstr.decode(sys.stdout.encoding))
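As a self-contained illustration of the same technique (with a made-up string standing in for the soup output, and 'ascii' standing in for whatever sys.stdout.encoding is on your console), backslashreplace keeps every character visible even when the target encoding cannot represent it:

```python
# Sketch: encode with errors='backslashreplace', then decode again
# so the result is a printable str with escape sequences in place
# of the characters the encoding could not represent.
text = u'caf\xe9 costs \u20b9100'
bstr = text.encode('ascii', errors='backslashreplace')
printable = bstr.decode('ascii')
print(printable)  # -> caf\xe9 costs \u20b9100
```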
When I use urllib in Python3 to get the HTML code of a web page, I use this code:
def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read().decode('utf-8')
    print(html)
    return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page, however the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python2 code:
def getHTML(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(url)
    html = response.read()
    return html
You don't need to decode('utf-8').
The following should return the fetched html.
def getHTML(url):
    request = Request(url)
    request.add_header('User-Agent', 'Mozilla/5.0')
    html = urlopen(request).read()
    return html
There, found your error: the parsing was done just fine, and everything was evaluated alright. But read the Traceback carefully:
Traceback (most recent call last):
  File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
    getHTML('hltv.org/team/7900/spirit-academy')
  File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
    print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement, as you can see in the traceback: print(html).
This is a somewhat common exception; it's just telling you that, with your current system encoding, some of the text cannot be printed to the console. One simple solution is to use print(html.encode('ascii', 'ignore')) to ignore all the unprintable characters. You can still do everything else with html; it's just that you can't print it.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
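A tiny self-contained example of the trade-off (with a made-up snippet rather than the real page): 'ignore' silently drops what can't be encoded, while 'replace' leaves a visible marker:

```python
# Sketch: the two simplest error handlers for console-unsafe text.
html = u'<p>Team \u00a9 Spirit</p>'   # \u00a9 is the © sign

dropped = html.encode('ascii', 'ignore').decode('ascii')
marked = html.encode('ascii', 'replace').decode('ascii')

print(dropped)  # the © is silently removed
print(marked)   # the © becomes '?'
```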
By the way, the re module can search byte strings. Copy this exactly as-is and it will work:
import re
print(re.findall(b'hello', b'hello world'))
I am trying to create an HTML Parser in Python 3.4.2 on a Macbook Air(OS X):
plaintext.py:
from html.parser import HTMLParser
import urllib.request, formatter, sys
website = urllib.request.urlopen("http://www.profmcmillan.com")
data = website.read()
website.close()
format = formatter.AbstractFormatter(formatter.DumbWriter(sys.stdout))
ptext = HTMLParser(format)
ptext.feed(data)
ptext.close()
But I get the following error:
Traceback (most recent call last):
File "/Users/deannarobertazzi/Documents/plaintext.py", line 9, in <module>
ptext.feed(data)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/html/parser.py", line 164, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
I looked at the Python documentation and apparently the way you parse HTML data in Python 3 is vastly different from doing such a thing in Python 2. I don't know how to modify my code so that it works for Python 3. Thank you.
In 2.x, implicit conversions only worked if all the bytes were in the ASCII range [0-127].
>>> u'a' + 'b'
u'ab'
>>> u'a' + '\xca'
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
u'a' + '\xca'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 0: ordinal not in range(128)
What often happened, and why this behavior was dropped, is that code would work when tested with ASCII data (such as Prof. McMillan's site seems to be today) and fail later, for instance if Prof. McMillan were to add a title with a non-ASCII char, or if another source were used that was not all-ASCII.
The doc for HTMLParser.feed(data) says that the data must be 'text', which in 3.x means a unicode string. So bytes from the web must be decoded to unicode. Decoding the site with utf-8 works today because ascii is a subset of utf-8. However, the page currently has
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
So if a non-ASCII char were to be added, and the encoding not changed, utf-8 would not work. There is really no substitute for paying attention to the encoding of bytes. How to discover or guess the encoding of a web page (assuming that there is only one encoding used) is a separate subject.
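One way to pay that attention, sketched below with a hypothetical sniff_charset helper: look in the raw bytes for a meta charset declaration before decoding. (Real code should prefer the HTTP Content-Type header when present, and a library like chardet when neither source declares anything.)

```python
import re

def sniff_charset(raw_bytes, default='utf-8'):
    # Hypothetical helper: look for charset=... in the first 2 KB,
    # where meta tags normally appear. Falls back to a default.
    m = re.search(rb'charset=["\']?([\w-]+)', raw_bytes[:2048], re.IGNORECASE)
    return m.group(1).decode('ascii') if m else default

page = (b'<META HTTP-EQUIV="Content-Type" '
        b'CONTENT="text/html; charset=windows-1252">')
enc = sniff_charset(page)
print(enc)  # -> windows-1252
text = page.decode(enc)
```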
I'm trying to scrape the NDTV website for news titles. This is the page I'm using as a HTML source. I'm using BeautifulSoup (bs4) to handle the HTML code, and I've got everything working, except my code breaks when I encounter the hindi titles in the page I linked to.
My code so far is :
import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll('li')
for link_tag in li:
    hypref = link_tag.find('a').contents[0]
    strhyp = str(hypref)
    fptr.write(strhyp)
    fptr.write("\n")
The error I get is :
Traceback (most recent call last):
File "./ScrapeTemplate.py", line 30, in <module>
strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
I got the same error even when I didn't include the from_encoding parameter. I initially used it as fromEncoding, but python warned me that it was deprecated usage.
How do I fix this? From what I've read I need to either avoid the hindi titles or explicitly encode it into non-ascii text, but I don't know how to do that. Any help would be greatly appreciated!
What you see is a NavigableString instance (which is derived from the Python unicode type):
(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
You need to convert it to UTF-8 yourself, instead of relying on str's implicit ASCII conversion:
strhyp = hypref.encode('utf-8')
http://joelonsoftware.com/articles/Unicode.html
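Alternatively, instead of encoding each string by hand, you could open the output file with an explicit encoding. A sketch using io.open (available since Python 2.6; the file path here is made up for illustration):

```python
import io
import os
import tempfile

# Sketch: io.open with encoding='utf-8' accepts unicode strings
# directly, so no per-string .encode('utf-8') is needed.
path = os.path.join(tempfile.gettempdir(), 'ndtv_titles.txt')

with io.open(path, 'w', encoding='utf-8') as fptr:
    fptr.write(u'NDTV \u0939\u093f\u0902\u0926\u0940\n')  # a Hindi title

with io.open(path, 'r', encoding='utf-8') as fptr:
    back = fptr.read()
```

Reading the file back with the same encoding round-trips the Hindi text intact.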