How can I fetch the page source of a webpage using Python?

I wish to fetch the source of a webpage and parse individual tags myself. How can I do this in Python?

import urllib2
urllib2.urlopen('http://stackoverflow.com').read()
That's the simple answer, but you should really look at BeautifulSoup
http://www.crummy.com/software/BeautifulSoup/
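(Note: urllib2 is Python 2 only. On Python 3 the same one-liner lives in urllib.request; a minimal sketch, assuming Python 3:)
import urllib.request

html = urllib.request.urlopen('http://stackoverflow.com').read()
print(html[:200])  # first 200 bytes of the raw page source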

Some options are:
urllib
urllib2
httplib
httplib2
HTMLParser
Beautiful Soup
All except httplib2 and Beautiful Soup are in the Python Standard Library. The pages for each of the packages above contain simple examples that will let you see what suits your needs best.
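For instance, a minimal fetch with httplib2 might look roughly like this (a sketch, assuming httplib2 has been installed separately, e.g. with pip; Python 2 print syntax to match the rest of the thread):
import httplib2

h = httplib2.Http()
resp, content = h.request('http://stackoverflow.com', 'GET')  # returns (response headers, body)
print resp.status     # HTTP status code, e.g. 200
print content[:200]   # beginning of the raw HTML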

I would suggest you use BeautifulSoup
#for HTML parsing
from BeautifulSoup import BeautifulSoup
import urllib2
doc = urllib2.urlopen('http://google.com').read()
soup = BeautifulSoup(doc)
soup.contents[0].name
After this you can parse pretty much anything out of the document. See the documentation, which has detailed examples of how to do it.
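For example, a small sketch in the same style (BeautifulSoup 3 import, Python 2) that pulls every link out of the fetched page; the URL is only there for illustration:
from BeautifulSoup import BeautifulSoup
import urllib2

doc = urllib2.urlopen('http://google.com').read()
soup = BeautifulSoup(doc)
for tag in soup.findAll('a'):      # every <a> tag in the document
    print tag.get('href')          # its href attribute, or None if it has none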

All the answers here are valid, and BeautifulSoup is great. However, when the source HTML is created dynamically by JavaScript, which is usually the case these days, you'll need an engine that first renders the final HTML and only then fetch it; otherwise most of the content will be missing.
As far as I know, the easiest way is simply to use a browser's engine for this. In my experience, Python + Selenium + Firefox is the path of least resistance.
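A minimal sketch of that path, assuming the selenium package and Firefox are installed (newer Selenium/Firefox combinations also need geckodriver on your PATH):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://stackoverflow.com')
html = driver.page_source          # the HTML after JavaScript has run
driver.quit()
# html can now be handed to BeautifulSoup just like a urllib2 response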

Related

What does the Python 'BeautifulSoup()' function actually do?

Python newbie here. I know two ways of opening a URL and passing the page to BeautifulSoup.
Method #1 USING REQUESTS
from bs4 import BeautifulSoup
import requests
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print soup.prettify()
Method #2 USING URLLIB/URLLIB2
from bs4 import BeautifulSoup
import urllib2
f = urllib2.urlopen(url)
page = f.read() #Some people skip this step.
soup = BeautifulSoup(page)
print soup.prettify()
I have the following questions:
What exactly does the BeautifulSoup() function do? In one place it needs page.content and 'html.parser', and in another it only takes urllib2.urlopen(url).read() (as in the second example). This is easy to memorize but hard to understand. I have checked the official documentation and did not find it very helpful. (Please also comment on 'html.parser' and page.content: why not just html and page, as in the second example?)
In Method #2 above, what difference does it make if I skip the f.read() call?
These questions might be very simple for experts, but I would really appreciate help with them. I have googled quite a lot but still haven't found the answers.
Thanks!
BeautifulSoup does not open URLs. It takes HTML and gives you the ability to parse it (and, as you have done, prettify the output).
In both method #1 and method #2 you fetch the HTML with another library (either requests or urllib2) and then hand the resulting HTML to Beautiful Soup.
This is why you need to read the content in method #2.
So I think you are looking in the wrong place for documentation: you should be reading up on requests or urllib2 (I recommend requests myself).
BeautifulSoup is a Python package that helps you parse HTML.
The first argument it takes is just raw HTML (or XML) text, so it doesn't matter which package delivers it, as long as it is markup BeautifulSoup can parse.
The second argument, 'html.parser' in your first example, tells BeautifulSoup which underlying parser to use. The common choices are 'html.parser' (from the standard library), 'lxml' and 'html5lib'; they do basically the same job but differ in speed and in how forgiving they are of broken markup.
If you omit the second argument, BeautifulSoup picks the best parser installed on your system (it prefers lxml when available) and usually prints a warning telling you which one it chose.
On your last question: there is no fundamental difference between calling f.read() yourself and handing BeautifulSoup the file-like object so that it reads it for you; being explicit just makes it clearer what is actually being parsed.
As @Klaus said in a comment, you should really read the docs here.
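To make both points concrete, here is a small sketch (Python 2, urllib2 and bs4, an arbitrary URL chosen only for illustration) showing that you can pass either the already-read string or the file-like object, as long as you name a parser explicitly:
from bs4 import BeautifulSoup
import urllib2

url = 'http://stackoverflow.com'

f = urllib2.urlopen(url)
soup_from_file = BeautifulSoup(f, 'html.parser')        # BeautifulSoup calls read() for you

page = urllib2.urlopen(url).read()
soup_from_string = BeautifulSoup(page, 'html.parser')   # same document, read explicitly

print soup_from_file.title.string
print soup_from_string.title.string                     # both print the same <title> text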

Crawling web data using Python: HTML error

I want to crawl data using Python. I have tried again and again, but it doesn't work and I can't find the error in my code.
I wrote my code like this:
import re
import requests
from bs4 import BeautifulSoup
url='http://news.naver.com/main/ranking/read.nhn?mid=etc&sid1=111&rankingType=popular_week&oid=277&aid=0003773756&date=20160622&type=1&rankingSectionId=102&rankingSeq=1'
html=requests.get(url)
#print(html.text)
a=html.text
bs=BeautifulSoup(a,'html.parser')
print(bs)
print(bs.find('span',attrs={"class" : "u_cbox_contents"}))
I want to crawl the reply (comment) data on the news page. As you can see, I tried searching for this:
span, class="u_cbox_contents" in bs
but Python only prints "None":
None
So I checked bs using print(bs) and looked through the variable's contents, but there is no span with class="u_cbox_contents".
Why is this happening? I really don't know why. Please help me, and thanks for reading.
Requests will fetch the URL's contents, but will not execute any JavaScript.
I performed the same fetch with cURL, and I can't find any occurrence of u_cbox_contents in the HTML code. Most likely, it's injected using JavaScript, which explains why BeautifulSoup can't find it.
If you need the page's code as it would be rendered in a "normal" browser, you could try Selenium. Also have a look at this SO question.
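A rough sketch of that approach for this page, assuming selenium, Firefox and bs4 are installed (the fixed sleep is only a crude stand-in for a proper wait):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'http://news.naver.com/main/ranking/read.nhn?mid=etc&sid1=111&rankingType=popular_week&oid=277&aid=0003773756&date=20160622&type=1&rankingSectionId=102&rankingSeq=1'

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)                                   # give the comment JavaScript time to run
bs = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(bs.find('span', attrs={"class": "u_cbox_contents"}))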

how to get all the urls of a website using a crawler or a scraper?

I have to get many URLs from a website and then copy them into an Excel file.
I'm looking for an automatic way to do that. The website is structured with a main page containing about 300 links, and inside each link there are 2 or 3 links that are interesting for me.
Any suggestions?
If you want to develop your solution in Python then I can recommend Scrapy framework.
As far as inserting the data into an Excel sheet is concerned, there are ways to do it directly (see for example: Insert row into Excel spreadsheet using openpyxl in Python), but you can also write the data into a CSV file and then import it into Excel.
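As a rough illustration of that route (a sketch with a hypothetical spider name and an example start URL; API details depend on your Scrapy version), a spider like this collects the links on a page and can be exported straight to CSV for Excel:
import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'                                  # hypothetical spider name
    start_urls = ['http://www.example.com/']        # replace with the site's main page

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield {'url': response.urljoin(href)}   # one row per link found on the page
Running it with "scrapy runspider link_spider.py -o links.csv" writes a CSV file that Excel can open directly.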
If the links are in the html... You can use beautiful soup. This has worked for me in the past.
import urllib2
from bs4 import BeautifulSoup
page = 'http://yourUrl.com'
opened = urllib2.urlopen(page)
soup = BeautifulSoup(opened)
for link in soup.find_all('a'):
    print(link.get('href'))
Have you tried selenium or urllib? urllib is faster than selenium.
http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html
You can use Beautiful Soup for the parsing:
[http://www.crummy.com/software/BeautifulSoup/]
More information in the docs here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
I won't suggest Scrapy, because you don't need it for the work you described in your question.
For example, this code uses the urllib2 library to open the Google homepage and find all the links in that page, returned as a list:
import urllib2
from bs4 import BeautifulSoup
data=urllib2.urlopen('http://www.google.com').read()
soup=BeautifulSoup(data)
print soup.find_all('a')
For handling excel files take a look at http://www.python-excel.org

Python Scraping fb comments from a website

I have been trying to scrape Facebook comments using Beautiful Soup on the web page below.
import BeautifulSoup
import urllib2
import re
url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
fb_comment = soup("div", {"class":"postText"}).find(text=True)
print fb_comment
The output is empty. However, I can clearly see that the Facebook comment is inside those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering whether the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), then run BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
The parts of the page you are looking for are not included in the source file; use a browser and you can see this for yourself by viewing the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.

Crawler with Python?

I'd like to write a crawler using Python. This means: I've got the URL of a website's home page, and I'd like my program to crawl through the whole site following links that stay within that site. How can I do this easily and FAST? I tried BeautifulSoup already, but it is really CPU-consuming and quite slow on my PC.
I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml: it is much faster than BeautifulSoup and probably the fastest HTML parser available for Python. This link shows a performance test of different HTML parsers for Python. Personally I'd refrain from using the scrapy wrapper.
I haven't tested it, but this is probably what you're looking for; the first part is taken straight from the mechanize documentation. The lxml documentation is also quite helpful; especially take a look at this and this section.
import mechanize
import lxml.html
br = mechanize.Browser()
response = br.open("somewebsite")
for link in br.links():
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    print br
    br.back()
# you can also display the links with lxml
html = response.read()
root = lxml.html.fromstring(html)
for link in root.iterlinks():
    print link
You can also get elements via root.xpath(), as in the rough sketch below. A simple wget might even be the easiest solution.
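For example, a rough xpath sketch (Python 2, assuming lxml is installed; any already-fetched page will do):
import urllib2
import lxml.html

html = urllib2.urlopen('http://www.google.com').read()
root = lxml.html.fromstring(html)
for href in root.xpath('//a/@href'):   # every href attribute of every <a> tag
    print href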
Hope this helps.
I like using mechanize. It's fairly simple: you download it and create a Browser object. With this object you can open a URL, and you can use "back" and "forward" functions just as in a normal browser. You can iterate through the forms on the page and fill them out if need be.
You can also iterate through all the links on the page; each link object has the URL etc., and you can click on it.
here is an example:
Download all the links(related documents) on a webpage using Python
Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the URLs it finds, but you can modify it to do whatever you want. Perhaps you'd want to parse the HTML with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you want.
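A stripped-down sketch of the concurrent-fetch idea (not the full recursive scraper; assuming eventlet is installed, Python 2 syntax):
import eventlet
from eventlet.green import urllib2     # cooperative (non-blocking) version of urllib2

urls = ['http://www.google.com', 'http://stackoverflow.com']

def fetch(url):
    return url, urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for url, body in pool.imap(fetch, urls):
    print url, len(body)               # pages are fetched concurrently, results arrive in order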
Have a look at scrapy (and related questions). As for performance... very difficult to make any useful suggestions without seeing the code.
