Using BeautifulSoup to parse HTML, but it gets stuck creating the BeautifulSoup object - python

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html5lib')
After running several times, the process gets stuck at BeautifulSoup(html.read(), 'html5lib'). I have tried changing the parser to 'lxml' and 'html.parser', but the problem persists. Is there a bug in BeautifulSoup? How can I solve this problem?
Update
I added some log statements inside the program, like this:
print('open the url')
html = urlopen(url)
print('create BeautifulSoup Object')
bs = BeautifulSoup(html.read(), 'html5lib')
The console prints create BeautifulSoup Object and just stays there with a blinking cursor.

I've encountered the same problem, and I found that the program actually got stuck at html.read(), which may be because the urlopen() resource is not closed correctly when the response has errors.
You can change it like this:
with urlopen(url) as html:
    html = html.read()
bs = BeautifulSoup(html, "lxml")
Or you can choose to use the requests package, which is nicer to work with than urllib:
import requests
html = requests.get(url).text
bs = BeautifulSoup(html, "lxml")
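One caveat worth adding: neither urlopen() nor requests.get() applies a timeout by default, so a server that accepts the connection but never finishes responding will hang forever, which matches the "blinking cursor" symptom. A minimal sketch with an explicit timeout (the 10-second value is just an illustrative choice):
import requests
from bs4 import BeautifulSoup

# timeout raises requests.exceptions.Timeout instead of hanging forever
html = requests.get(url, timeout=10).text
bs = BeautifulSoup(html, "lxml")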
Hope this solves your problem.

Related

How do I log data from a live website using beautiful soup

Hello, I am trying to use Beautiful Soup and requests to log the data coming from an anemometer which updates live every second. The link to the website is here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in a screenshot from inspecting the HTML in my browser.
I have written the code below to try to print out the data; however, when I run it, it prints None. I think this means the soup object doesn't in fact contain the whole HTML page? Upon printing soup.prettify() I cannot find the same id=js-2-text that I find when inspecting the HTML in my browser. If anyone has any ideas why this might be, or how to fix it, I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded dynamically from an external URL, so BeautifulSoup never sees it in the initial page. You can try the API URL the page is connecting to:
import requests
from bs4 import BeautifulSoup

# The CGI endpoint the page polls for its readings
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
# The response wraps a comma-separated reading in a <csv> element
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445  # convert m/s to knots
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
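Since the goal is to log the reading every second, here is a minimal polling sketch built on the answer above (the wind_log.csv file name and the 1-second interval are just illustrative choices):
import csv
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"

with open("wind_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "direction", "metres_per_second"])
    while True:
        response = requests.post(api_url, data={"input_id": "1"}, timeout=5)
        soup = BeautifulSoup(response.content, "html.parser")
        _, direction, metres_per_second, *_ = soup.csv.text.split(",")
        writer.writerow([datetime.now().isoformat(), direction, metres_per_second])
        f.flush()  # keep the log file current while the loop runs
        time.sleep(1)  # the site updates roughly once per second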

How do I clean up Beautiful soup's output

I am trying to scrape a book from a website, and while parsing it with Beautiful Soup I noticed some errors. For example, this sentence:
"You have more… direct control over your skaa here. How many woul "Oh, a half dozen or so,"
The "more…" and " woul" are both errors that occurred somewhere in the script.
Is there anyway to automatically clean mistakes like this up?
Example code of what I have is below.
import requests
from bs4 import BeautifulSoup
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())
trin = soup.tr.get_text()
final = str(trin)
print(final)
You need to unescape the HTML entities, as detailed here. To apply that in your situation, however, and retain the text, you can use stripped_strings:
import requests
from bs4 import BeautifulSoup
import html
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'lxml')
for r in soup.select_one('table tr').stripped_strings:
    s = html.unescape(r)
    print(s)
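For reference, html.unescape() is what converts entity references back into plain characters; a small sketch (the &hellip; entity is an assumption about what produced the stray "more…"):
import html

# Named character references become the characters they stand for
print(html.unescape("more&hellip; direct control"))  # -> more… direct control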

BeautifulSoup scraping Bitcoin Price issue

I am still new to Python, and especially BeautifulSoup. I've been reading up on this stuff for a few days, playing around with a bunch of different code, and getting mixed results. However, on this page is the Bitcoin price I would like to scrape. The price is located in:
<span class="text-large2" data-currency-value="">$16,569.40</span>
Meaning, I'd like my script to print only the line where the value is. My current code prints the whole page, which doesn't look very nice since it's a lot of data. Could anybody please help me improve my code?
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find('text-large2', attrs={'class': 'stripe'})
for row in soup.findAll('div'):
    for cell in row.findAll('tr'):
        print cell.text
And this is a snip of the output I get after running the code. It doesn't look very nice or readable.
#SourcePairVolume (24h)PriceVolume (%)Updated
1BitMEXBTC/USD$3,280,130,000$15930.0016.30%Recently
2BithumbBTC/KRW$2,200,380,000$17477.6010.94%Recently
3BitfinexBTC/USD$1,893,760,000$15677.009.41%Recently
4GDAXBTC/USD$1,057,230,000$16085.005.25%Recently
5bitFlyerBTC/JPY$636,896,000$17184.403.17%Recently
6CoinoneBTC/KRW$554,063,000$17803.502.75%Recently
7BitstampBTC/USD$385,450,000$15400.101.92%Recently
8GeminiBTC/USD$345,746,000$16151.001.72%Recently
9HitBTCBCH/BTC$305,554,000$15601.901.52%Recently
Try this:
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find("div", {"class": "col-xs-6 col-sm-8 col-md-4 text-left"}).find("span", {"class": "text-large2"})
for i in div:
    print i
This prints 16051.20 for me.
Later edit: if you put the above code in a function and loop it, it will constantly update. I get different values now.
This works, but I think you are using an older version of BeautifulSoup; try pip install bs4 in Command Prompt or PowerShell:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'text-large2'})
print(''.join(value.stripped_strings))
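Following the earlier suggestion to put the lookup in a function and loop it, a minimal sketch (the 30-second interval is an arbitrary choice, and it assumes the page keeps using the text-large2 class):
import time

import requests
from bs4 import BeautifulSoup

def get_btc_price():
    response = requests.get('https://coinmarketcap.com/currencies/bitcoin/', timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    value = soup.find('span', {'class': 'text-large2'})
    return ''.join(value.stripped_strings)

# Poll for a fresh price; Ctrl+C to stop
while True:
    print(get_btc_price())
    time.sleep(30)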

bs4: the second comment <!-- --> is missing

I am doing Python Challenge level 9 with BeautifulSoup.
The URL is "http://www.pythonchallenge.com/pc/return/good.html" and bs4.__version__ == '4.3.2'.
There are two comments in its page source, so both should show up in the parsed soup.
However, when BeautifulSoup is applied, the second comment is missing.
It seems kind of weird. Any hint? Thanks!
import requests
from bs4 import BeautifulSoup
url = "http://www.pythonchallenge.com/pc/return/good.html"
page = requests.get(url, auth = ("huge", "file")).text
print page
soup = BeautifulSoup(page)
print soup
Beautiful Soup is a wrapper around an HTML parser. The default parser is very strict, and when it encounters malformed HTML it silently drops the elements it had trouble with.
You should instead install the package 'html5lib' and use that as your parser, like so:
soup = BeautifulSoup(page, 'html5lib')
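If you want to see the difference for yourself, here is a small sketch comparing parsers on deliberately malformed markup (exact results vary with parser versions, but html5lib generally recovers more of the document):
from bs4 import BeautifulSoup
from bs4.element import Comment

# An unterminated tag sits right before the second comment
markup = "<html><body><!-- first --><p <!-- second --></body></html>"

for parser in ("html.parser", "html5lib"):
    soup = BeautifulSoup(markup, parser)
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    print(parser, [c.strip() for c in comments])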

BeautifulSoup does not work for some web sites

I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this website, it prints an empty list. What could be the problem? I am running Ubuntu 12.04.
Actually, there are quite a few bugs in BeautifulSoup which might raise some obscure errors. I had a similar issue when working on Apache using the lxml parser.
So just try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
It looks like you have a few mistakes in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
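For anyone on current versions, a rough Python 3 equivalent using requests and bs4 (assuming the site still serves its links in plain HTML):
import requests
from bs4 import BeautifulSoup

url = "http://www.shoptop.ru/"
page = requests.get(url, timeout=10).text
soup = BeautifulSoup(page, "html.parser")
links = soup.find_all('a')  # find_all replaces BeautifulSoup 3's findAll
print(links)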
