Python 'BeautifulSoup()' function - what does it actually do? - python

Python newbie here. I know two methods of fetching a URL and passing it to BeautifulSoup.
Method #1 USING REQUESTS
from bs4 import BeautifulSoup
import requests
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print soup.prettify()
Method #2 USING URLLIB/URLLIB2
from bs4 import BeautifulSoup
import urllib2
f = urllib2.urlopen(url)
page = f.read() #Some people skip this step.
soup = BeautifulSoup(page)
print soup.prettify()
I have the following questions:
What exactly does the BeautifulSoup() function do? In one place it needs page.content plus 'html.parser', and in the other it only takes urllib2.urlopen(url).read() (as in the second example). This is easy to memorize but hard to understand. I have checked the official documentation, which was not very helpful. (Please also comment on html.parser and page.content: why not just html and page, as in the second example?)
In Method #2 above, what difference does it make if I skip the f.read() call?
For experts, these questions might be very simple, but I would really appreciate help on these. I have googled quite a lot but still not getting the answers.
Thanks !

BeautifulSoup does not open URLs. It takes HTML, and gives you the ability to prettify the output (as you have done).
In both method #1 and #2 you are fetching the HTML using another library (either requests or urllib2) and then handing the resulting HTML to Beautiful Soup.
This is why you need to read the content in method #2.
Therefore, I think you are looking in the wrong spot for documentation. You should be looking at how to use requests or urllib2 (I recommend requests myself).
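To make that concrete, here is a minimal sketch (it assumes the requests and bs4 packages are installed; the URL is a placeholder) showing that BeautifulSoup only ever sees the HTML you hand it:
from bs4 import BeautifulSoup
import requests

url = 'https://example.com'  # placeholder URL

# requests does the network work; BeautifulSoup never touches the network.
page = requests.get(url)

# page.content is the raw bytes of the response body (page.text would be the
# decoded string); BeautifulSoup accepts either one as its first argument.
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())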

BeautifulSoup is a Python package that helps you parse HTML.
The first argument it requires is just raw HTML (or XML) text, so it doesn't matter which package delivers it, as long as it is valid markup.
The second argument, 'html.parser' in your first example, tells BeautifulSoup which parser to use to actually parse the data. The usual choices are html.parser (from the standard library), lxml and html5lib. They do basically the same job but with different speed and leniency trade-offs.
If you omit the second argument, BeautifulSoup picks the best parser it finds installed (lxml if available, otherwise html.parser) and emits a warning telling you which one it chose.
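As a quick illustration, here is a minimal sketch of naming the parser explicitly (it assumes the optional lxml and html5lib packages are installed alongside the stdlib parser):
from bs4 import BeautifulSoup

html = '<p>Hello<p>World'  # deliberately sloppy markup

# Same input, three parsers; each may repair broken markup slightly differently.
print(BeautifulSoup(html, 'html.parser').prettify())
print(BeautifulSoup(html, 'lxml').prettify())      # needs the lxml package
print(BeautifulSoup(html, 'html5lib').prettify())  # needs the html5lib package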
To your last question I'm not entirely sure, but I believe there is no fundamental difference: BeautifulSoup also accepts an open file-like object and will call read() on it for you, so skipping f.read() usually works. Being explicit just makes it clearer what is actually being parsed.
Like @Klaus said in a comment, you should really read the docs here

Related

No output after using requests and BeautifulSoup in PyCharm

I was trying to get some headlines from the New York Times website. I have two questions.
Question 1:
This is my code, but it gives me no output. Does anyone know what I'd have to change?
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
a = soup.find_all(class_="balancedHeadline")
for story_heading in a:
    print(story_heading)
My second question:
As the HTML is not the same for all headlines (there's a different class for the big headlines and the smaller ones, for example), how would I handle all those different classes in my code so that it outputs all of the headlines?
Thanks in advance!
BeautifulSoup is a robust parsing library, but, unlike your browser, it does not evaluate JavaScript.
The elements with the balancedHeadline class you were looking for are not present in the downloaded HTML document. They get added later, once the page's assets have downloaded and its JavaScript functions have run, so you won't be able to find that class with your current technique.
The answer to your second question is in the docs. A regex or a function would work, but you might find that passing in a list of class names is simpler for your application (see the sketch below).
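A minimal sketch of the list approach (the class names here are placeholders; inspect the page to find the ones that actually carry headlines):
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.nytimes.com')
soup = BeautifulSoup(r.text, 'html.parser')

# find_all accepts a list of attribute values, so several headline classes
# can be matched in a single call.
for heading in soup.find_all(class_=['story-heading', 'balancedHeadline']):
    print(heading.get_text(strip=True))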

BeautifulSoup not parsing past the title tag

I am trying to parse a page
http://gwyneddathletics.com/custompages/sport/mlacrosse/stats/2014/ml0402gm.htm
and when I try to findAll('b') I get no results, same with 'tr'. I cannot find anything beyond the initial title tag.
Also, when I do soup = BeautifulSoup(markup) and print the soup, I get the entire page with an extra at the end of the output
I am using python 2.6 with BeautifulSoup 3.2.0. Why is my soup not parsing the page correctly?
It's likely that the parser BeautifulSoup is using really doesn't like the markup on the page; I have had similar issues in the past. I did a quick test on your input and found that if you upgrade to the newest BeautifulSoup (the package is called bs4) it just works. bs4 also supports Python 2.6, and the backwards-incompatible changes between it and BeautifulSoup (the 3.x series) are tiny. See here if you need to check out how to port.
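Roughly, the upgrade looks like this (a sketch assuming bs4 has been installed, e.g. via pip install beautifulsoup4; Python 2 syntax to match your setup):
# Old: BeautifulSoup 3.x
# from BeautifulSoup import BeautifulSoup
# soup = BeautifulSoup(markup)

# New: bs4, with an explicit parser
from bs4 import BeautifulSoup
import urllib2

url = 'http://gwyneddathletics.com/custompages/sport/mlacrosse/stats/2014/ml0402gm.htm'
markup = urllib2.urlopen(url).read()
soup = BeautifulSoup(markup, 'html.parser')

print soup.findAll('b')[:5]  # should now return results past the title tag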

How to parse a wikipedia page in Python?

I've been trying to parse a wikipedia page in Python and have been quite successful using the API.
But, somehow the API documentation seems a bit too skeletal for me to get all the data.
As of now, I'm doing a requests.get() call to
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=China&format=json&exintro=1
But this only returns the first paragraph, not the entire page. I've tried allpages and search but to no avail. A better explanation of how to get the data from a wiki page would be of real help: all the data, and not just the introduction returned by the previous query.
You seem to be using the query action to get the content of the page. According to its API specs, it returns only a part of the data. The proper action seems to be parse.
Here is a sample
import urllib2
req = urllib2.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
content = req.read()
# content in json - use json or simplejson to get relevant sections.
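For instance, a minimal sketch of pulling the rendered HTML out of that JSON (this assumes the default legacy response layout, where the HTML sits under parse.text['*']):
import json
import urllib2

url = ("http://en.wikipedia.org/w/api.php"
       "?action=parse&page=China&format=json&prop=text")
data = json.loads(urllib2.urlopen(url).read())

# In the legacy JSON format the rendered page HTML lives under
# data['parse']['text']['*']; it can then be handed to an HTML parser.
html = data['parse']['text']['*']
print html[:500]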
Have you considered using Beautiful Soup to extract the content from the page?
While I haven't used it for Wikipedia, others have, and having used it to scrape other pages I can say it is an excellent tool.
If someone is looking for a Python 3 answer, here you go:
import urllib.request
req = urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
print(req.read())
I'm using python version 3.7.0b4.

How can I fetch the page source of a webpage using Python?

I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?
import urllib2
urllib2.urlopen('http://stackoverflow.com').read()
That's the simple answer, but you should really look at BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/
Some options are:
urllib
urllib2
httplib
httplib2
HTMLParser
Beautiful Soup
All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
I would suggest you use BeautifulSoup
#for HTML parsing
from BeautifulSoup import BeautifulSoup
import urllib2
doc = urllib2.urlopen('http://google.com').read()
soup = BeautifulSoup(doc)
soup.contents[0].name
After this you can parse pretty much anything out of the document. See the documentation, which has detailed examples of how to do it.
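For example, continuing with the same BeautifulSoup 3 API as above (a small sketch; google.com is just a placeholder target):
# Pull the href of every link out of the document parsed above.
for anchor in soup.findAll('a'):
    print anchor.get('href')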
All the answers here are true, and BeautifulSoup is great. However, when the source HTML is dynamically created by JavaScript, which is often the case these days, you'll need an engine that renders the final HTML first and only then fetch it, or else most of the content will be missing.
As far as I know, the easiest way is simply to use a browser's engine for this. In my experience, Python + Selenium + Firefox is the path of least resistance.
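A minimal sketch of that approach (it assumes the selenium package plus a Firefox/geckodriver install; the URL is a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()           # needs geckodriver on the PATH
try:
    driver.get('https://example.com')  # placeholder URL
    html = driver.page_source          # HTML *after* JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)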

Crawler with Python?

I'd like to write a crawler using Python. This means: I've got the URL of a website's home page, and I'd like my program to crawl through the whole site, following only links that stay within that website. How can I do this easily and FAST? I tried BeautifulSoup already, but it is really CPU-consuming and quite slow on my PC.
I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml, which is much faster than BeautifulSoup and probably the fastest parser available for Python. This link shows a performance test of different HTML parsers for Python. Personally I'd refrain from using the scrapy wrapper.
I haven't tested it, but this is probably what you're looking for; the first part is taken straight from the mechanize documentation. The lxml documentation is also quite helpful, especially this and this section.
import mechanize
import lxml.html
br = mechanize.Browser()
response = br.open("somewebsite")
for link in br.links():
    print link
    br.follow_link(link)  # takes EITHER a Link instance OR keyword args
    print br
    br.back()
# you can also display the links with lxml
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
    print link
You can also get elements via root.xpath(), as in the sketch below. A simple wget might even be the easiest solution.
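For example (continuing from the root built with lxml.html.fromstring above):
# Grab every href attribute in the document with a single XPath query.
for href in root.xpath('//a/@href'):
    print href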
Hope this helps.
I like using mechanize. It's fairly simple: you download it and create a browser object, and with that object you can open a URL. You can use "back" and "forward" functions as in a normal browser, and you can iterate through the forms on the page and fill them out if need be.
You can iterate through all the links on the page too. Each link object holds the URL and so on, and can be followed like a click.
Here is an example:
Download all the links (related documents) on a webpage using Python
Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the urls it finds but you can modify it to do what you want. Perhaps you'd want to parse the html with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you want.
Have a look at scrapy (and related questions). As for performance... very difficult to make any useful suggestions without seeing the code.
