Python: HTML missing when scraping a website

I tried to scrape a website using code like
import requests
requests.get("myurl.com").content
But some important elements from the website were missing. How can I get the whole website content with Python 3, just like I would use the inspector in Firefox or other browsers?

Why don't you try Scrapy, Selenium, or even Splash? They are powerful scraping tools, and Selenium and Splash can execute the JavaScript that a plain requests call misses.

You can use Beautiful Soup, a Python library for parsing HTML, for this purpose. Simply import it at the top:
from bs4 import BeautifulSoup
Then add these lines to your code:
data = requests.get("myurl.com").text
soup = BeautifulSoup(data, 'html.parser')
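Once the soup is built, elements are pulled out with find() and find_all(). A minimal, self-contained sketch (the HTML string and class name here are made up for illustration; note that Beautiful Soup only sees what was in the response, so JavaScript-rendered elements still won't appear):

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for requests.get(...).text; the "price" class is hypothetical
html = '<div><span class="price">42</span><span class="price">7</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every matching element
values = [span.text for span in soup.find_all('span', class_='price')]
print(values)  # → ['42', '7']
```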

Related

Beautiful Soup in Python: extracting different data with the same classes

I would like to use Beautiful Soup to extract different data from the same web page, but apparently all the data have the same HTML markup.
The web page is https://www.ine.es
What I am trying to get is: 47.329.981, -0,5 and -22,1.
I don't know if it is possible when they have the same class (the only difference is the images).
Many thanks.
Please check this out:
import requests
from bs4 import BeautifulSoup

url = "https://www.ine.es"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
details = soup.find_all("span", attrs={"class": "dato"})
for i in details:
    print(i.text)
Output: 47.329.981 -0,5 -22,1 15,33 55,54
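Elements that share a class can still be told apart by their position in find_all()'s result list. A self-contained sketch with made-up markup mimicking that page's structure:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the ine.es page
html = '''
<span class="dato">47.329.981</span>
<span class="dato">-0,5</span>
<span class="dato">-22,1</span>
'''
soup = BeautifulSoup(html, 'html.parser')
datos = soup.find_all('span', attrs={'class': 'dato'})

# Same class, but a stable position in the result list
population, quarterly, annual = (d.text for d in datos)
print(population)  # 47.329.981
```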

How to get all the URLs of a website using a crawler or a scraper?

I have to get many URLs from a website and then copy them into an Excel file. I'm looking for an automatic way to do that. The website is structured with a main page containing about 300 links, and inside each link there are 2 or 3 links that are interesting to me.
Any suggestions?
If you want to develop your solution in Python then I can recommend Scrapy framework.
As far as inserting the data into an Excel sheet is concerned, there are ways to do it directly, see for example here: Insert row into Excel spreadsheet using openpyxl in Python , but you can also write the data into a CSV file and then import it into Excel.
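The CSV route can be sketched like this (the links and filename are placeholders); the resulting file opens directly in Excel:

```python
import csv

# Hypothetical scraped links; 'links.csv' is a placeholder filename
links = ['https://example.com/a', 'https://example.com/b']
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url'])            # header row
    writer.writerows([l] for l in links)  # one link per row
```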
If the links are in the HTML, you can use Beautiful Soup. This has worked for me in the past:
import urllib2
from bs4 import BeautifulSoup

page = 'http://yourUrl.com'
opened = urllib2.urlopen(page)
soup = BeautifulSoup(opened, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Have you tried Selenium or urllib? urllib is faster than Selenium.
http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html
You can use Beautiful Soup for parsing:
http://www.crummy.com/software/BeautifulSoup/
More information is in the docs here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
I won't suggest Scrapy, because you don't need it for the work you described in your question.
For example, this code uses the urllib2 library to open the Google homepage and find all the links in that output, in the form of a list:
import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(data, 'html.parser')
print soup.find_all('a')
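Note that urllib2 exists only in Python 2; in Python 3 the same call lives in urllib.request. A rough equivalent (a data: URL stands in for a real page here so the sketch runs without network access; swap in a real URL in practice):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# data: URL used instead of a real page so this runs offline
html = urlopen("data:text/html,<a%20href='/one'>1</a><a%20href='/two'>2</a>").read()
soup = BeautifulSoup(html, 'html.parser')

# Extract every link's href, as in the urllib2 example above
print([a.get('href') for a in soup.find_all('a')])  # → ['/one', '/two']
```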
For handling excel files take a look at http://www.python-excel.org

Python: scraping Facebook comments from a website

I have been trying to scrape Facebook comments using Beautiful Soup on the website pages below.
import BeautifulSoup
import urllib2
import re
url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
fb_comment = soup("div", {"class":"postText"}).find(text=True)
print fb_comment
The output is an empty set. However, I can clearly see that the Facebook comment is within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering whether the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said, it is all because of JavaScript.
But if you really need that information, you can still retrieve it with Selenium (http://seleniumhq.org), then use Beautiful Soup on that output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
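Pulling that tag's href back out of the fetched HTML is straightforward with Beautiful Soup; here the markup from the answer is inlined so the sketch is self-contained:

```python
from bs4 import BeautifulSoup

# The <fb:comments> tag from above, inlined for a self-contained example
html = ('<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" '
        'num_posts="25" width="630"></fb:comments>')
soup = BeautifulSoup(html, 'html.parser')

# html.parser treats "fb:comments" as an ordinary tag name
tag = soup.find('fb:comments')
print(tag['href'])  # → http://techcrunch.com/2012/05/15/facebook-lightbox/
```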
The parts of the page you are looking for are not included in the source file; use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.

Raw HTML vs. DOM scraping in python using mechanize and beautiful soup

I am attempting to write a program that, as an example, will scrape the top price off of this web page:
http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults
First, I am easily able to retrieve the HTML by doing the following:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import mechanize
webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
br = mechanize.Browser()
data = br.open(webpage).get_data()
soup = BeautifulSoup(data)
print soup
However, the raw HTML does not contain the price. The browser does... its thing (clarification here might help me also)... and retrieves the price from elsewhere while it constructs the DOM tree.
I was led to believe that mechanize would act just like my browser and return the DOM tree, which I am also led to believe is what I see when I look at, for example, Chrome's Developer Tools view of the page. (If I'm incorrect about this, how do I go about getting at wherever that price information is stored?) Is there something I need to tell mechanize to do in order to see the DOM tree?
Once I can get the DOM tree into python, everything else I need to do should be a snap. Thanks!
Mechanize and Beautiful Soup are unbeatable tools for web scraping in Python.
But you need to understand what each is meant for:
Mechanize: mimics browser functionality on a web page.
BeautifulSoup: an HTML parser that works well even when the HTML is not well-formed.
Your problem seems to be JavaScript: the price is populated via an AJAX call. Mechanize, however, does not execute JavaScript, so any content that results from JavaScript will remain invisible to it.
Take a look at this: http://github.com/davisp/python-spidermonkey/tree/master
It provides a wrapper around mechanize and Beautiful Soup with JavaScript execution.
Answering my own question because in the years since asking this I have learned a lot. Today I would use Selenium Webdriver to do this job. Selenium is exactly the tool I was looking for back in 2012 for this type of web scraping project.
https://www.seleniumhq.org/download/
http://chromedriver.chromium.org/

Parsing data stored in URLs via BeautifulSoup?

I'm attempting to access the URLs of the different fish family from this website: http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon
I'd like to be able to run a script that opens the links of the given website to then be able to parse the information stored within the pages. I'm fairly new to web scraping, so any help will be appreciated. Thanks in advance!
This is what I have so far:
import urllib2
import re
from bs4 import BeautifulSoup
import time
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
page = urllib2.urlopen('http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon').read()
soup = BeautifulSoup(page)
soup.prettify()
for fish in soup.findAll('a', href=True):
    print fish['href']
Scrapy is the perfect tool for this. It is a python web scraping framework.
http://doc.scrapy.org/en/latest/intro/tutorial.html
You can pass in your URL with your search term, and create rules for crawling.
For example, using a regex you would add a rule to scrape all links with the path /Summary and then extract the information using XPath or Beautiful Soup.
Additionally, you can set up a rule to automatically handle the pagination, i.e. in your example URL it could automatically follow the Next link.
Basically, a lot of what you are trying to do comes packaged for free in Scrapy. I would definitely take a look into it.
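The link-filtering rule described above can be sketched in plain Python with a regex (the example links are made up for illustration; in Scrapy you would put the pattern in a LinkExtractor instead):

```python
import re

# Hypothetical crawled links; the rule keeps only paths under /Summary
links = [
    '/Summary/SpeciesSummary.php?id=236',
    '/ComNames/CommonNameSearchList.php?CommonName=Salmon',
    '/Summary/FamilySummary.php?ID=76',
]
summary_links = [l for l in links if re.match(r'/Summary/', l)]
print(summary_links)  # → the two /Summary/ links
```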
If you're just writing a one-off script to grab all the data from this site, you can do something like:
fish_url_base = "http://www.fishbase.org/ComNames/%s"
fish_urls = [fish_url_base % a['href'] for a in soup.find_all('a')]
This gives you a list of links to traverse, which you can pass to urllib2.urlopen and BeautifulSoup:
for url in fish_urls:
    fish_soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Do something with your fish_soup
(Note 1: I haven't tested this code; you might need to adjust the base URL to fit the href attributes so you get to the right site.)
(Note 2: I see you're using bs4 but calling findAll on the soup. findAll was the BS3 name; bs4 still accepts it as a legacy alias, but find_all is the preferred spelling.)
(Note 3: If you're doing this for practical rather than learning purposes / fun, there are more efficient ways of scraping, such as scrapy also mentioned here.)
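Regarding Note 1: rather than hand-building URLs with a format string, relative hrefs can be resolved against the page URL with urllib.parse.urljoin (the relative path below is a made-up example of what an href on the page might look like):

```python
from urllib.parse import urljoin

base = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
# Hypothetical relative href as it might appear on the page;
# urljoin resolves it against the base page's directory
print(urljoin(base, '../Summary/SpeciesSummary.php?id=236'))
# → http://www.fishbase.org/Summary/SpeciesSummary.php?id=236
```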
