I'm trying to scrape this site, and I want to check all of the anchor tags.
I have imported BeautifulSoup 4.3.2, and here is my code:
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "http://www.civicinfo.bc.ca/bids?pn=1"
Html = urlopen(url).read()
Soup = BeautifulSoup(Html, 'html.parser')
Content = Soup.find_all('a')
My problem is that Content is always empty (i.e. Content = []). Does anyone have any ideas?
According to the documentation, html.parser is not very lenient (particularly before certain versions of Python), so you're likely looking at some malformed HTML.
What you want to do works if you use lxml instead of html.parser.
From the documentation:
That said, there are things you can do to speed up Beautiful Soup. If
you’re not using lxml as the underlying parser, my advice is to start.
Beautiful Soup parses documents significantly faster using lxml than
using html.parser or html5lib.
So the relevant code would be:
Soup = BeautifulSoup(Html, 'lxml')
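Putting it together, a minimal sketch of the whole flow with that change (it assumes Python 3 and that the lxml package is installed, e.g. pip install lxml):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.civicinfo.bc.ca/bids?pn=1"
Html = urlopen(url).read()

# lxml copes with the malformed markup that html.parser chokes on
Soup = BeautifulSoup(Html, 'lxml')

# Content should now hold the anchor tags; print each link's href and text
Content = Soup.find_all('a')
for anchor in Content:
    print(anchor.get('href'), anchor.get_text(strip=True))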
Related
I'm trying to scrape a website with BeautifulSoup and have written the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://gematsu.com/tag/media-create-sales")
soup = BeautifulSoup(page.text, 'html.parser')
try:
    content = soup.find('div', id='main')
    print(content)
except:
    print("Exception")
However, this returns a NoneType, even though the div exists with the correct ID on the website. Is there anything I'm doing wrong?
I can see the div with the id main on the page, and I also find the div main when I print soup.
This is briefly covered in BeautifulSoup's documentation:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers
[ ... ]
Here’s the same document parsed with Python’s built-in HTML parser:
BeautifulSoup("<a></p>", "html.parser")
Like html5lib, this parser ignores the closing </p> tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.
The issue you are experiencing is likely due to malformed HTML that html.parser is not able to handle appropriately, which results in id="main" being dropped when BeautifulSoup parses the HTML. By changing the parser to either html5lib or lxml, BeautifulSoup handles the malformed HTML differently than html.parser does.
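A quick sketch of that change applied to the code from the question (assuming html5lib is installed with pip install html5lib; lxml would work the same way):

import requests
from bs4 import BeautifulSoup

page = requests.get("https://gematsu.com/tag/media-create-sales")

# html5lib builds a well-formed tree from the malformed markup,
# so the div with id="main" survives parsing
soup = BeautifulSoup(page.text, 'html5lib')

content = soup.find('div', id='main')
print(content is not None)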
I'm trying to use/learn beautifulsoup4 to scrape some basic data from a website, specifically the information contained within the html record below:
<li class="is-first-in-list css-9999999" data-testid="record-999999999">
I have around 285,000 records, all with a unique data-testid. However, while I can obtain the information from classes and tags I am familiar with, custom attributes like this are still evading me.
I've tried variations of:
for link in soup.find_all("data-testid"):
print() #changed to include data-testid.text/innertext/leftblank etc etc
The remainder of my code appears to work, as I can extract tags and href data without issue (including printing these in the terminal); it's just the custom attributes that return nothing. I'm sure the solution is mind-bogglingly simple, I've just failed to come up with it yet!
Is this what you are trying to do?
from bs4 import BeautifulSoup
html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">"""
soup = BeautifulSoup(html, features='html.parser')
for link in soup.select("li"):
print(link.get('data-testid'))
Prints
record-999999999
With class select
from bs4 import BeautifulSoup
html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">
<li class="hello css-9999999" data-testid="record-8888888">
<li class="0mr3 css-9999999" data-testid="record-777777">"""
soup = BeautifulSoup(html, features='html.parser')
for link in soup.select("li.is-first-in-list"):
print(link.get('data-testid'))
Prints
record-999999999
Similar to #0m3r but with a few tweaks
from bs4 import BeautifulSoup
from lxml import etree
html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">"""
soup = BeautifulSoup(html, features='lxml')
for link in soup.find_all("li", class_="is-first-in-list css-9999999"):
print(link.get('data-testid'))
Generally, I find lxml is a lot faster than html.parser.
That said, there are things you can do to speed up Beautiful Soup. If you’re not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.
from https://www.crummy.com/software/BeautifulSoup/bs4/doc/, which matters especially when you have 285,000 elements to loop through. Also, the class_ argument restricts the match by class, so you're not sifting through every li element.
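If the class list varies from record to record, another option (not from the answers above, just a sketch) is to match on the data-testid attribute itself, either with a CSS attribute selector or with find_all's attrs argument:

from bs4 import BeautifulSoup

html = """<li class="is-first-in-list css-9999999" data-testid="record-999999999">
<li class="hello css-9999999" data-testid="record-8888888">"""

soup = BeautifulSoup(html, features='html.parser')

# any <li> that carries a data-testid attribute, regardless of its classes
for link in soup.select('li[data-testid]'):
    print(link.get('data-testid'))

# equivalent with find_all: soup.find_all('li', attrs={'data-testid': True})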
I'm a newbie to Python. I just installed it on Windows and am trying out HTML scraping.
Here's my test code:
from bs4 import BeautifulSoup
html = 'text <a href="Transfert.php?Filename=myfile_x86&version=5&param=13">Download</a> text'
print(html)
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
This code returns the collected, but broken, link:
Transfert.php?Filename=myfile_x86&version=5¶m=13
How can I fix it?
You are feeding the parser invalid HTML; the correct way to include & in a URL inside an HTML attribute is to escape it as &amp;.
Simply change each & to &amp;:
html = 'text <a href="Transfert.php?Filename=myfile_x86&amp;version=5&amp;param=13">Download</a> text'
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
Transfert.php?Filename=myfile_x86&version=5&param=13
It works with html5lib and lxml because some parsers can handle broken HTML better than others. As mentioned by Goyo in the comments, you can't prevent other people from writing broken HTML :)
This is a great answer to your question that explains it in detail: https://stackoverflow.com/a/26073147/4796844.
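For completeness, a small sketch of that difference (it assumes lxml and html5lib are installed): the more lenient parsers recover the intended URL even from the original, unescaped markup, while html.parser turns &para into ¶ as reported in the question.

from bs4 import BeautifulSoup

html = 'text <a href="Transfert.php?Filename=myfile_x86&version=5&param=13">Download</a> text'

# compare how each parser reads the unescaped ampersands in the href
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(html, parser)
    print(parser, soup.a.get('href'))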
I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.
The HTML comes from this page: http://www.wvdnr.gov/
It contains multiple errors, like multiple <html></html>, <title> outside the <head>, etc...
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.
So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?
Even if the correct answer is "use another parser" (thanks #alecxe), I have another workaround. For some reason, this works too:
soup = BeautifulSoup(document, "html5lib")
soup = BeautifulSoup(soup.prettify(), "html5lib")
print(soup.find_all('a'))
which returns the same link list as:
soup = BeautifulSoup(document, "html.parser")
When it comes to parsing not-well-formed, tricky HTML, the choice of parser is very important:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
html.parser worked for me:
from bs4 import BeautifulSoup
import requests
document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
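The same comparison can be wrapped in a small loop, which is a convenient way to sanity-check a badly formed page before committing to a parser (a sketch; lxml and html5lib must be installed separately):

from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content

# count the anchors each parser manages to recover from the dirty HTML
for parser in ('html5lib', 'lxml', 'html.parser'):
    soup = BeautifulSoup(document, parser)
    print(parser, len(soup.find_all('a')))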
See also:
Differences between parsers.