python's beautiful soup module giving error - python

I am using the following code in an attempt to do webscraping .
import sys , os
import requests, webbrowser,bs4
from PIL import Image
import pyautogui
p = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
n = open("exml.txt" , 'wb')
for i in p.iter_content(1000) :
n.write(i)
n.close()
n = open("exml.txt" , 'r')
soupy= bs4.BeautifulSoup(n,"html.parser")
elems = soupy.select('img[src]')
for u in elems :
print (u)
so what I am intending to do is to extract all the image links that is there in the xml response obtained from the page .
(Please correct me If I am wrong in thinking that requests.get returns the whole static html file of the webpage that opens on entering the URL)
However in the line :
soupy= bs4.BeautifulSoup(n,"html.parser")
I am getting the following error :
Traceback (most recent call last):
File "../../perl/webscratcher.txt", line 24, in <module>
soupy= bs4.BeautifulSoup(n,"html.parser")
File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
markup = markup.read()
File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 24662: character maps to <undefined>
I am clueless about the error and the "Appdata" folder is empty .
How to proceed further ?
Post Trying suggestions :
I changed the extension of the filename to py and this error got removed . However on the following line :
soupy= bs4.BeautifulSoup(n,"lxml") I am getting the following error :
Traceback (most recent call last):
File "C:\perl\webscratcher.py", line 23, in
soupy= bs4.BeautifulSoup(p,"lxml")
File "C:\Users\PREMRAJ\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4_init_.py", line 192, in init
elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
How to tackle this ?

You are over-complicating things. Pass the bytes content of a Response object directly into the constructor of the BeautifulSoup object, instead of writing it to a file.
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
soup = BeautifulSoup(response.content, 'lxml')
for element in soup.select('img[src]'):
print(element)

Okay so you you might want to do a review on working with BeautifulSoup. I referenced an old project of mine and this is all you need for printing them. Check the BS documents to find the exact syntax you want with the select method.
This will print all the img tags from the html
import requests, bs4
site = 'http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1'
p = requests.get(site).text
soupy = bs4.BeautifulSoup(p,"html.parser")
elems = soupy.select('img[src]')
for u in elems :
print (u)

Related

UnicodeDecodeError error with every urllib.request

When I use urllib in Python3 to get the HTML code of a web page, I use this code:
def getHTML(url):
request = Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
html = urlopen(request).read().decode('utf-8')
print(html)
return html
However, this fails every time with the error:
Traceback (most recent call last):
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('https://www.hltv.org/team/7900/spirit-academy')
File "/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The page is in UTF-8 and I am decoding it properly according to the urllib docs. The page is not gzipped or in another charset from what I can tell.
url.info().get_charset() returns None for the page, however the meta tags specify UTF-8. I have no problems viewing the HTML in any program.
I do not want to use any external libraries.
Is there a solution? What is going on? This works fine with the following Python2 code:
def getHTML(url):
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open(url)
html = response.read()
return html
You don't need to decode('utf-8')
The following should return the fetched html.
def getHTML(url):
request = Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
html = urlopen(request).read()
return html
There, found your error, the parsing was done just fine, everything was evaluated alright. But when you read the Traceback carefully:
Traceback (most recent call last): File
"/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 56, in <module>
getHTML('hltv.org/team/7900/spirit-academy') File
"/Users/chris/Documents/Code/Python/HLTV Parser/getTeams.py", line 53, in getHTML
print(html)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10636-10638: ordinal not in range(128)
[Finished in 1.14s]
The error was caused by the print statement, as you can see, this is in the traceback print(html).
This is somewhat common exception, it's just telling you that with your current system encoding, some of the text cannot be printed to the console. One simple solution will be to add print(html.encode('ascii', 'ignore')) to ignore all the unprintable characters. You still can do all the other stuff with html, it's just that you can't print it.
See this if you want a better "fix": https://wiki.python.org/moin/PrintFails
btw: re module can search byte strings. Copy this exactly as-is, will work:
import re
print(re.findall(b'hello', b'hello world'))

grabbing the api dictionary json with python

I was following this tutorial on api grabbing with python:
https://www.youtube.com/watch?v=pxofwuWTs7c
The url gives:
{"date":"1468500743","ticker":{"buy":"27.96","high":"28.09","last":"27.97","low":"27.69","sell":"27.97","vol":"41224179.11399996"}}
I tried to follow the video and grab the 'last' data.
import urllib2
import json
url = 'https://www.okcoin.cn/api/v1/ticker.do?symbol=ltc_cny'
json_obj=urllib2.urlopen(url)
data= json.load(json_obj)
for item in data['ticker']:print item['last']
After typing the last line python returns:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: string indices must be integers
I think you just misread the payload returned by the server. In this case the ticker key is not of type list in the dictionary converted by the json module.
So You should do the following
import urllib2
import json
url = 'https://www.okcoin.cn/api/v1/ticker.do?symbol=ltc_cny'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
print data['ticker']['last']

Python Beautifulsoup : file.write(str) method get TypeError: write() argument must be str, not BeautifulSoup

I wrote below codes:
from bs4 import BeautifulSoup
import sys # where is the sys module in the source code folder ?
try:
import urllib.request as urllib2
except ImportError:
import urllib2
print (sys.argv) #
print(type(sys.argv)) #
#baseUrl = "https://ecurep.mainz.de.xxx.com/ae5/"
baseUrl = "http://www.bing.com"
baseUrl = "http://www.sohu.com/"
print(baseUrl)
url = baseUrl
page = urllib2.urlopen(url) #urlopen is a function, function is also an object
soup = BeautifulSoup(page.read(), "html.parser") #NameError: name 'BeautifulSoup' is not defined
html_file = open("Output.html", "w")
soup_string = str(soup)
print(type(soup_string))
html_file.write(soup_string) # TypeError: write() argument must be str, not BeautifulSoup
html_file.close()
which compiler give below errors:
C:\hzg>py Py_logDownload2.py 1
['Py_logDownload2.py', '1']
<class 'list'>
http://www.sohu.com/
<class 'str'>
Traceback (most recent call last):
File "Py_logDownload2.py", line 25, in <module>
html_file.write(soup_string) # TypeError: write() argument must be str, not
BeautifulSoup
File "C:\Users\ADMIN\AppData\Local\Programs\Python\Python35\lib\encodings\
cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 376-377:
character maps to <undefined>
but soup_string is obviously a str, so why the first error given by compiler ?
also don't know why the second one appear.
what is more confusion, if I change code to:
baseUrl = "https://ecurep.mainz.de.xxx.com/ae5/"
#baseUrl = "http://www.bing.com"
#baseUrl = "http://www.sohu.com/"
it will compile and have no error. (I have change the original name to "xxx").
Can anyone help to debug this?
Update :
I used to write Java codes and am a newbie for Python. With the help of you, I think I have made some progress with debugging Python: when debugging Java, always try to solve the first error; while in Python, ayways try to solve the last one.
Try to open your file with utf-8
import codecs
f = codecs.open("test", "w", "utf-8")
You can ignore the coding error (not recommended) by
f = codecs.open("test", "w", "utf-8", errors='ignore')

Error for Python 3.4.1 regarding string pattern and bytes-like object

I'm new to Python and stack overflow.
I'm trying to follow a tutorial on youtube (outdated I'm guessing based on the error I get) regarding fetching stock prices.
Here is the following program:
import urllib.request
import re
html = urllib.request.urlopen('http://finance.yahoo.com/q?uhb=uh3_finance_vert_gs_ctrl2&fr=&type=2button&s=AAPL')
htmltext = html.read()
regex = '<span id="yfs_l84_aapl">.+?</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Since this is Python 3, I had to research on urllib.request and use those methods instead of a simple urllib.urlopen.
Anyways, when I run it, I get the following error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 13, in <module>
price = re.findall(pattern, htmltext)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I realize the error and attempted to fix it by adding the following:
codec = html.info().get_param('charset', 'utf8')
htmltext = html.decode(codec)
But it gives me another error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 9, in <module>
htmltext = html.decode(codec)
AttributeError: 'HTTPResponse' object has no attribute 'decode'
Hence, after spending reasonable amount of time, I don't know what to do. All I want to do is get the price for AAPL so I can further continue to build a general program to fetch prices for an array of stocks and use the prices in future programs.
Any help is appreciated. Thanks!
You are barking up the right tree. Try decoding the actual HTML byte string rather than the urlopen HTTPResponse:
htmltext = html.read()
codec = html.info().get_param('charset', 'utf8')
htmltext = htmltext.decode(codec)
price = re.findall(pattern, htmltext)

Handling Indian Languages in BeautifulSoup

I'm trying to scrape the NDTV website for news titles. This is the page I'm using as a HTML source. I'm using BeautifulSoup (bs4) to handle the HTML code, and I've got everything working, except my code breaks when I encounter the hindi titles in the page I linked to.
My code so far is :
import urllib2
from bs4 import BeautifulSoup
htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"
fptr = open(FileName, "w")
fptr.seek(0)
page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")
li = soup.findAll( 'li')
for link_tag in li:
hypref = link_tag.find('a').contents[0]
strhyp = str(hypref)
fptr.write(strhyp)
fptr.write("\n")
The error I get is :
Traceback (most recent call last):
File "./ScrapeTemplate.py", line 30, in <module>
strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
I got the same error even when I didn't include the from_encoding parameter. I initially used it as fromEncoding, but python warned me that it was deprecated usage.
How do I fix this? From what I've read I need to either avoid the hindi titles or explicitly encode it into non-ascii text, but I don't know how to do that. Any help would be greatly appreciated!
What you see is a NavigableString instance (which is derived from the Python unicode type):
(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
You need to convert to utf-8 using
hypref.encode('utf-8')
strhyp = hypref.encode('utf-8')
http://joelonsoftware.com/articles/Unicode.html

Categories

Resources