I am doing Python Challenge level 9 with BeautifulSoup.
url = "http://www.pythonchallenge.com/pc/return/good.html".
bs4.version == '4.3.2'.
There are two comments in its page source. The output of soup should be as follows.
However, when BeautifulSoup is applied, the second comment is missing.
It seems kinda weird. Any hint? Thanks!
import requests
from bs4 import BeautifulSoup
url = "http://www.pythonchallenge.com/pc/return/good.html"
page = requests.get(url, auth = ("huge", "file")).text
print page
soup = BeautifulSoup(page)
print soup
Beautiful Soup is a wrapper around an HTML parser. The default parser is very strict, and when it encounters malformed HTML it silently drops the elements it has trouble with.
You should instead install the package 'html5lib' and use that as your parser, like so:
soup = BeautifulSoup(page, 'html5lib')
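If you want to verify that both comments actually survive parsing, you can search for Comment nodes explicitly. Here is a minimal self-contained sketch using the stdlib html.parser on stand-in markup (the real comments in good.html will of course differ):

```python
from bs4 import BeautifulSoup, Comment

# Stand-in markup with two comments, mimicking the structure of good.html
html = "<html><body><!-- first --><p>hello</p><!-- second --></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Comments are NavigableString subclasses; filter for them explicitly
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([c.strip() for c in comments])  # ['first', 'second']
```

If the second comment is missing from this list when you parse the real page, the parser dropped it, and switching parsers is the right move.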
I'm trying to get the length of a YouTube video with BeautifulSoup and by inspecting the site I can find this line: <span class="ytp-time-duration">6:14:06</span> which seems to be perfect to get the duration, but I can't figure out how to do it.
My script:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.youtube.com/watch?v=_uQrJ0TkZlc")
soup = BeautifulSoup(response.text, "html5lib")
mydivs = soup.findAll("span", {"class": "ytp-time-duration"})
print(mydivs)
The problem is that the output is []
From the documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/), the explicit attrs form would be:
soup.findAll('span', attrs={"class": u"ytp-time-duration"})
But your syntax is just a shortcut for that, so the selector is not the problem. What is more likely is that the span.ytp-time-duration element is not present in the HTML the server returns: YouTube adds it with JavaScript after the page loads, so the soup never sees it.
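This is easy to reproduce offline: an element that JavaScript injects after load simply is not in the HTML your request receives, so BeautifulSoup finds nothing. A minimal sketch (the markup here is a stand-in, not YouTube's real response):

```python
from bs4 import BeautifulSoup

# Stand-in for the initial HTML the server sends: the duration span is
# injected later by JavaScript, so it does not appear here at all.
initial_html = "<html><body><div id='movie_player'></div></body></html>"
soup = BeautifulSoup(initial_html, "html.parser")

duration = soup.find("span", class_="ytp-time-duration")
print(duration)  # None -- and find_all likewise returns []
```

To get such values you generally need either a browser-driving tool (e.g. Selenium) or the data embedded in the page's scripts, not the rendered DOM.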
I have a script that I've used for several years. One particular page on the site loads and returns soup, but all my finds return no result. This is old code that has worked on this site in the past. Instead of searching for a specific <div> I simplified it to look for any table, tr or td, with find or findAll. I've tried various methods of opening the page, including lxml - all with no results.
My interests are in the player_basic and player_records divs.
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
import urllib2
url = "http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456"
#html = urllib2.urlopen(url).read()
html = urllib2.urlopen(url,"lxml")
soup = BeautifulSoup(html)
#div = soup.find('div', {"class":"player_basic"})
#div = soup.find('div', {"class":"player_records"})
item = soup.findAll('td')
print item
You're not reading the response. Also note that 'lxml' is a parser name for BeautifulSoup, not an argument for urlopen — passing it as the second positional argument sends it as POST data. Try this:
import urllib2
url = 'http://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=60456'
response = urllib2.urlopen(url)
html = response.read()
Then you can use it with BeautifulSoup. If it still does not work, there are strong reasons to believe there is malformed HTML in that page (missing closing tags, etc.), since the parsers BeautifulSoup uses (especially html.parser) are not very tolerant of that.
UPDATE: try using the lxml parser:
soup = BeautifulSoup(html, 'lxml')
tds = soup.find_all('td')
print len(tds)
# 142
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html5lib')
After running it several times, the process gets stuck at BeautifulSoup(html.read(), 'html5lib'). I have tried switching the parser to 'lxml' and 'html.parser', but the problem persists. Is there a bug in BeautifulSoup? How can I solve this?
update
I add some logs inside the program, like this
print('open the url')
html = urlopen(url)
print('create BeautifulSoup Object')
bs = BeautifulSoup(html.read(), 'html5lib')
The console prints create BeautifulSoup Object and then just stays there with a blinking cursor.
I've encountered the same problem, and I found out that the program got stuck at html.read(), which may be because the urlopen() resource is not closed correctly when the response has errors.
You can change like this:
with urlopen(url) as html:
    html = html.read()
bs = BeautifulSoup(html, "lxml")
Or you can use the requests package, which is more robust than urllib, like this:
import requests
html = requests.get(url).text
bs = BeautifulSoup(html, "lxml")
Hope this solves your problem.
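Another way to guard against indefinite hangs is to pass a timeout, so the read fails fast instead of blocking forever. A self-contained sketch (the data: URL here stands in for the real page so it runs without a network, and the timeout value is just an assumption to tune):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# data: URL stands in for the real page so this sketch is self-contained;
# timeout= makes urlopen raise instead of hanging forever on a bad response.
with urlopen("data:text/html,<td>a</td><td>b</td>", timeout=10) as resp:
    html = resp.read()

soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td")
print(len(tds))  # 2
```

requests.get() accepts the same kind of timeout= parameter, which is worth setting on any long-running scraper.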
I was trying to scrape a number of large Wikipedia pages like this one.
Unfortunately, BeautifulSoup is not able to handle content this large, and it truncates the page.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, because I think it is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(htmlContent, 'html5lib')
However, if you prefer, you can also use lxml as follows:
import lxml.html
doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
I suggest you get the html content and then pass it to BS:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    soup = BeautifulSoup(r.content)
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print a.text
else:
    print r.status_code
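The same #catlinks extraction can be checked offline against stand-in markup (a single hypothetical link here; the real page has many more):

```python
from bs4 import BeautifulSoup

# Stand-in for r.content: a trimmed-down #catlinks block with one link
html = '<div id="catlinks"><a href="/wiki/Category:Game_theory">Game theory</a></div>'
soup = BeautifulSoup(html, "html.parser")

links_div = soup.find("div", id="catlinks")
print([a.text for a in links_div.find_all("a")])  # ['Game theory']
```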
I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this web site, it prints an empty list. What could be the problem? I am running Ubuntu 12.04.
Actually there are quite a few quirks in BeautifulSoup's parser support that can surface as puzzling errors. I had a similar issue when working under Apache with the lxml parser.
So just try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
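If you want to see quickly which parsers are usable in your environment before committing to one, a small probe like this helps (a sketch; lxml and html5lib are third-party installs and may or may not be present, while html.parser ships with Python):

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>sample</p>"

# Try each parser name; bs4 raises FeatureNotFound for ones not installed
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
        print(parser, "->", soup.p.text)
    except FeatureNotFound:
        print(parser, "is not installed")
```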
It looks like you have a few mistakes in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
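For reference, the bs4 (BeautifulSoup 4) equivalent on Python 3 looks like this; the markup below is a stand-in so the sketch runs without hitting the site:

```python
from bs4 import BeautifulSoup

# Stand-in markup; swap in the page fetched from the real URL as needed
html = '<html><body><a href="/one">one</a> <a href="/two">two</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")  # find_all replaces BeautifulSoup 3's findAll
print([a["href"] for a in links])  # ['/one', '/two']
```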