I am attempting to write a program that, as an example, will scrape the top price off of this web page:
http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults
First, I am easily able to retrieve the HTML by doing the following:
from BeautifulSoup import BeautifulSoup
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'

br = mechanize.Browser()
data = br.open(webpage).get_data()  # raw HTML exactly as the server returned it
soup = BeautifulSoup(data)
print soup
However, the raw HTML does not contain the price. The browser does its thing (clarification here might help me also) and retrieves the price from elsewhere while it constructs the DOM tree.
I was led to believe that mechanize would act just like my browser and return the DOM tree, which I am also led to believe is what I see when I look at, for example, Chrome's Developer Tools view of the page (if I'm incorrect about this, how do I go about getting at wherever that price information is stored?). Is there something that I need to tell mechanize to do in order to see the DOM tree?
Once I can get the DOM tree into python, everything else I need to do should be a snap. Thanks!
Mechanize and Beautiful Soup are unbeatable tools for web scraping in Python.
But you need to understand what each is meant for:
Mechanize: mimics browser functionality on a webpage (forms, links, cookies).
BeautifulSoup: an HTML parser that works well even when the HTML is not well-formed.
Your problem seems to be JavaScript. The price is populated via an AJAX call made from JavaScript. Mechanize, however, does not execute JavaScript, so any content that results from JavaScript will remain invisible to mechanize.
Take a look at this: http://github.com/davisp/python-spidermonkey/tree/master
It wraps mechanize and Beautiful Soup and adds JavaScript execution.
Answering my own question because in the years since asking this I have learned a lot. Today I would use Selenium Webdriver to do this job. Selenium is exactly the tool I was looking for back in 2012 for this type of web scraping project.
https://www.seleniumhq.org/download/
http://chromedriver.chromium.org/
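To illustrate, a minimal sketch of that approach might look like the following (chromedriver must be installed; the fixed sleep is a crude stand-in for whatever load condition the real page needs):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()   # needs chromedriver on your PATH
driver.get('http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults')
time.sleep(10)                # crude wait for the JavaScript to populate prices
html = driver.page_source     # the rendered DOM, scripts included
driver.quit()

soup = BeautifulSoup(html)    # parse the rendered DOM as before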
Related
I am scraping the links for each of the properties that appear on this website, https://www.firstmallorca.com/en/search, so I can scrape them further and collect more detailed data.
My problem is that the parsed HTML (I am using the html5lib parser), from which I scrape the data, seems to differ in some areas from the HTML I see in the browser's DevTools. To demonstrate this:
1. This is the last link I select. In the browser, its href="/en/sales/penthouse-in-santa-ponsa/102512" (screenshot).
2. I print the parsed HTML from the Beautiful Soup object with bs4Object.prettify() and copy the whole output into Notepad++.
3. Then, in the notepad, I look for the same element as in point 1. I find it, and its href="/en/sales/finca-in-portocolom/159515", which is different from what I see on the actual webpage (screenshot).
I do not understand the nature of what's happening. At point 3, I was expecting to see href="/en/sales/penthouse-in-santa-ponsa/102512" instead of href="/en/sales/finca-in-portocolom/159515".
It seems to me like I am scraping some other, similar webpage, not the one I see through the browser.
The website loads its content via JavaScript, which your parser does not execute.
This is a task for Selenium.
The selenium package is used to automate interaction with a web browser from Python.
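A minimal sketch of that approach, assuming the /en/sales/... href pattern quoted in the question:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.firstmallorca.com/en/search')
time.sleep(10)                                  # let the JS-rendered listings load
soup = BeautifulSoup(driver.page_source, 'html5lib')
driver.quit()

# Collect property links; the /en/sales/ prefix comes from the question above.
links = [a['href'] for a in soup.find_all('a', href=True)
         if a['href'].startswith('/en/sales/')]
print(links)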
I have been trying to scrape some data using Beautiful Soup from https://www.eia.gov/coal/markets/. However, when I parse the contents, some of the data does not show up at all. Those data fields are visible in the Chrome inspector but not in the soup. The thing is, they do not seem to be text elements; I think they are fed from an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
Chrome inspector: (screenshot)
Beautiful Soup parsed content: (screenshot)
@DMart is correct. The data you are looking for is being populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. It looks like Selenium is your best bet.
See this thread for more information.
There is not enough detail in your question, but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to see if you're using Selenium to do it (you tagged the question as such), which may indicate you're using page_source, which does not guarantee you the entire completed source of the page you're looking at.
If you're using requests, it's even more unlikely you're capturing the entire page's completed source code.
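If you are already on Selenium, an explicit wait for the element you need, before reading page_source, is more reliable than a fixed sleep. A sketch, using the snl_dpst table id that appears in the accepted answer below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.eia.gov/coal/markets/')
# Block until the table exists in the DOM, up to 15 seconds.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.ID, 'snl_dpst'))
)
html = driver.page_source   # now includes the JS-rendered table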
The data is loaded via AJAX, so it is not available in the initial document. If you go to the Network tab in Chrome dev tools, you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response, and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
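A minimal sketch of that, since the shape of the JSON is undocumented and worth inspecting before writing any parsing code:

import requests
from pprint import pprint

# The XHR endpoint spotted in the Network tab of Chrome dev tools.
url = "https://www.eia.gov/coal/markets/coal_markets_json.php"
resp = requests.get(url)
resp.raise_for_status()
pprint(resp.json(), depth=2)   # peek at the top of the structure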
Thank you all!
Opening the page with Selenium using a webdriver and then parsing the page source with Beautiful Soup worked:

from selenium import webdriver
from bs4 import BeautifulSoup as BS

driver = webdriver.Chrome()
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source                 # source after the JavaScript has run
soup = BS(html)
table = soup.find("table", {'id': 'snl_dpst'})
rows = table.find_all("tr")
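From there, a short follow-up can flatten the table into rows of cell text (a sketch; the exact column layout of the table is not shown here):

for row in rows:
    # Collect the stripped text of every header/data cell in this row.
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    if cells:
        print(cells)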
I am trying to scrape a website. I have tried two methods, but neither provides the full page source that I am looking for. I am trying to scrape the news titles from the website URL provided below.
URL: "https://www.todayonline.com/"
These are the two methods I have tried, both without success.
Method 1: Beautiful Soup
import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page)
soup                  # returns HTML with JavaScript text
soup.find_all('h3')   # returns an empty list []
Method 2: Selenium + BeautifulSoup
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver", options=options)
driver.get(tdy_url)
time.sleep(10)
html = driver.page_source
soup = BeautifulSoup(html)
soup.find_all('h3')   # returns fewer than a quarter of the 'h3' tags in the full page
Please help. I have tried scraping other news websites and they were much easier. Thank you.
The news data on the website you are trying to scrape is fetched from the server using JavaScript (via XHR -- XMLHttpRequest). It happens dynamically, while the page is loading or being scrolled, so this data is not included in the page returned by the server.
In the first example, you are getting only the page returned by the server -- without the news, but with the JS that is supposed to fetch them. Neither requests nor BeautifulSoup can execute JS.
However, you can try to reproduce the requests that fetch the news titles from the server with Python requests. Do the following steps:
1. Open the DevTools of your browser (usually by pressing F12 or Ctrl+Shift+I) and take a look at the requests that fetch news titles from the server. Sometimes this is even easier than web scraping with BeautifulSoup. (screenshot from Firefox omitted)
2. Copy the request link (right-click -> Copy -> Copy link) and pass it to requests.get(...).
3. Get .json() of the request. It will return a dict that is easy to work with. To better understand the structure of the dict, I recommend using pprint instead of plain print. Note you have to do from pprint import pprint before using it.
Here is an example of code that gets the titles of the main news on the page:

import requests

nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
    .json()["nodes"]
for node in nodes:
    print(node["node"]["title"])
If you want to scrape a group of news under a caption, you need to change the number after news_feed/ in the request URL (to get it, just filter the requests by "news_feed" in the DevTools and scroll the news page down).
Sometimes websites have protection against bots (although the one you are trying to scrape doesn't). In such cases, you might need some additional steps as well.
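For example, many sites only want a browser-like User-Agent header before they will answer; a minimal sketch (the header value is just an example string):

import requests

headers = {
    # An example desktop browser User-Agent; any realistic value works here.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
response = requests.get("https://www.todayonline.com/api/v3/news_feed/7",
                        headers=headers)
print(response.status_code)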
You can access data via API (check out the Network tab):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()
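From there, the titles can be extracted the same way as in the previous answer (this continues from the data variable above and assumes the same nodes/node/title structure):

titles = [item["node"]["title"] for item in data["nodes"]]
print(titles[:5])   # first few headlines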
I will suggest a fairly simple approach:
import requests
from bs4 import BeautifulSoup as bs

# The site's Google News sitemap serves the headlines as plain XML,
# so no JavaScript execution is needed.
page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]
print(news)
output
['DBS named world’s best bank by New York-based financial publication',
'Russia has very serious questions to answer on Navalny - UK',
"Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
'Three militants killed after fatal attack on policeman in Tunisia',
.....]
Also, you can check the XML page for more information if required.
P.S. Always check for compliance before scraping any website :)
There are different ways of gathering the content of a webpage that contains JavaScript:
Using Selenium with the Firefox web driver (see the sketch below)
Using a headless browser with PhantomJS
Making an API call using a REST client or the Python requests library
You have to do your research first.
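As a sketch of the first option, a headless Firefox session via Selenium might look like this (geckodriver must be installed; the headless flag matches the style used elsewhere on this page):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True                # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get("https://www.todayonline.com/")
html = driver.page_source              # the DOM after JavaScript has run
driver.quit()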
I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the HTML using urllib2.
However, I'm running into a problem where, using the code below, the HTML I get back is not correct, as the page seems to take a second to redirect (you can verify this by visiting the URL); instead I get the code from the page that initially, briefly appears.
Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?
import urllib2
from bs4 import BeautifulSoup
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()
Edit: The answer below is thorough; however, in the end, what solved my problem was this:
https://stackoverflow.com/a/3210737/1157283
Interesting: the problem isn't a redirect; it's that the page modifies its content using JavaScript. urllib2 doesn't have a JS engine, it just GETs data. If you disable JavaScript in your browser, you will notice it loads basically the same content as what urllib2 returns.
import urllib2
from BeautifulSoup import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
value = bostonPage.read()              # read the response body once
soup = BeautifulSoup(value)
open('test.html', 'w').write(value)    # save what urllib2 actually received
Opening test.html with JavaScript disabled in your browser (easiest in Firefox: Content -> uncheck Enable JavaScript) generates an identical result set.
So what can we do? Well, first we should check if the site offers an API; scraping tends to be frowned upon.
http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/Hotel API's?
It looks like they might, though with some restrictions.
But if we still need to scrape it, JavaScript and all, then we can use Selenium (http://seleniumhq.org/); it's mainly used for testing, but it's easy and has fairly good docs.
I also found this Scraping websites with Javascript enabled? and this http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Hope that helps.
As a side note:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)
I have been trying to scrape Facebook comments using Beautiful Soup on the page below.
import urllib2
import BeautifulSoup

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)

fb_comment = soup.find("div", {"class": "postText"})   # the comment container
print fb_comment
The output is a null set. However, I can clearly see the Facebook comment within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering if the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), then use BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
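As a hedged sketch only: at the time of this answer, the Graph API exposed plugin comments for a URL through an endpoint like the one below; current Graph API versions require an access token, so treat the URL and the response shape as assumptions to verify:

import json
import urllib2

# Historical Graph API endpoint for plugin comments (an assumption; modern
# versions of the API require authentication).
url = ('https://graph.facebook.com/comments/'
       '?ids=http://techcrunch.com/2012/05/15/facebook-lightbox/')
data = json.load(urllib2.urlopen(url))
print(data)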
The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by viewing the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.