Simulate clicking a link when scraping with Python and BeautifulSoup

After reading for years, this is my first SO question. Thanks in advance for the help!
I'm looking to scrape content from articles on the Forbes website. Take this page as an example: http://www.forbes.com/sites/katevinton/2015/09/22/google-microsoft-qualcomm-and-baidu-announce-joint-investment-cloudflare/. When an article is loaded directly, the page source is a mess of JavaScript that is hard to parse. However, when I click on the 'print' button, it appends "/print/" to the URL and gives me a page I have no problem parsing with BeautifulSoup.
When I enter the URL with "/print/" appended, it redirects to the non-"/print/" page; I only get the actual "/print/" page when I click on the button. Thus, my question is: how can I simulate clicking that print button programmatically to get the BeautifulSoup-scrapable page? Poking around, people seem to recommend mechanize for simulating browser actions, but I'm not sure what I'd be trying to do with it in this case. Or is there a better way to scrape this data entirely?
I appreciate any help you can offer!

You need to request it with the referer set, so something like this would work:
import requests

url = "http://www.forbes.com/sites/samsungbusiness/2015/09/23/how-your-car-is-becoming-the-next-hot-tech-gadget/print/"
# the site only serves the print view when the request looks like it came
# from the article page itself, so set the Referer to the non-print URL
print(requests.get(url, headers={"referer": url.replace("print/", "")}).text)

Related

Beautiful Soup not picking up some data from the website

I have been trying to scrape some data using Beautiful Soup from https://www.eia.gov/coal/markets/. However, when I parse the contents, some of the data does not show up at all. Those data fields are visible in the Chrome inspector but not in the soup. The thing is, they do not seem to be text elements; I think they are fed from an external database. I have attached the screenshots below. Is there any other way to scrape that data?
Thanks in advance.
(Screenshots omitted: the Chrome inspector showing the data fields, and the parsed Beautiful Soup content missing them.)
@DMart is correct. The data you are looking for is being populated by JavaScript; have a look at line 1629 in the page source. Beautiful Soup doesn't act as a client browser, so there is nowhere for the JS to execute. It looks like Selenium is your best bet.
See this thread for more information.
There isn't enough detail in your question, but this information is probably dynamically loaded and you're not fetching the entire page source.
Without your code it's tough to tell whether you're using Selenium (you tagged the question as such), which may indicate you're using page_source; that does not guarantee you the entire completed source of the page you're looking at.
If you're using requests, it's even more unlikely you're capturing the entire page's completed source code.
The data is loaded via AJAX, so it is not available in the initial document. If you go to the Network tab in Chrome dev tools you will see that the site reaches out to https://www.eia.gov/coal/markets/coal_markets_json.php. I searched for some of the numbers in the response and it looks like the data you are looking for is there.
This is a direct JSON response from the backend. It's better than Selenium if you can get it to work.
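If you want to try that route, a minimal sketch might look like this (the User-Agent header and the shape of the JSON payload are assumptions; inspect the response before building on it):
import requests

# JSON endpoint spotted in the Network tab of Chrome dev tools
url = "https://www.eia.gov/coal/markets/coal_markets_json.php"
# some backends reject requests without a browser-like User-Agent (assumption)
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
data = resp.json()
# the payload structure is an assumption -- inspect it before parsing in earnest
print(type(data))
print(list(data)[:10] if isinstance(data, dict) else data[:2])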
Thank you all!
Opening the page using selenium using a webdriver and then parsing the page source using beautiful soup worked.
from selenium import webdriver
from bs4 import BeautifulSoup as BS

driver = webdriver.Chrome()  # or whichever driver you have installed
driver.get('https://www.eia.gov/coal/markets/')
html = driver.page_source  # the source after the JS has run
soup = BS(html, "html.parser")
table = soup.find("table", {'id': 'snl_dpst'})
rows = table.find_all("tr")

Stuck with infinite scrolling using Python, Requests and BeautifulSoup

I've been successfully scraping (in Python) articles from a couple of news websites from my country, basically by parsing the main page, fetching the hrefs and accessing them to parse the articles. But I just hit a wall with https://www.clarin.com/. I am only getting a very limited number of elements because of the infinite scrolling. I researched a lot but I couldn't find the right resource to overcome this, though of course it is more than likely that I am doing it wrong.
From what I see in the devtools, the request that loads more content returns a JSON file, but I don't know how to fetch it automatically in order to parse it. I would appreciate some quick guidance on what to learn to do this. I hope I made some sense; this is my base code:
import requests
from bs4 import BeautifulSoup

source = requests.get("https://www.clarin.com/")
html = BeautifulSoup(source.text, "lxml")
This is an example request URL I am seeing in Chrome devtools:
https://www.clarin.com/ondemand/eyJtb2R1bGVDbGFzcyI6IkNMQUNsYXJpbkNvbnRhaW5lckJNTyIsImNvbnRhaW5lcklkIjoidjNfY29sZnVsbF9ob21lIiwibW9kdWxlSWQiOiJtb2RfMjAxOTYyMjQ4OTE0MDgzIiwiYm9hcmRJZCI6IjEiLCJib2FyZFZlcnNpb25JZCI6IjIwMjAwNDMwXzAwNjYiLCJuIjoiMiJ9.json
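One approach worth trying: the last path segment of that URL is base64-encoded JSON describing the module to load, including what looks like a page counter ("n"). A rough sketch follows; whether the server accepts a modified counter, and the shape of the response, are assumptions:
import base64
import json
import requests

# the last path segment of the ondemand URL, without the ".json" suffix
token = ("eyJtb2R1bGVDbGFzcyI6IkNMQUNsYXJpbkNvbnRhaW5lckJNTyIsImNvbnRhaW5lcklkIjoi"
         "djNfY29sZnVsbF9ob21lIiwibW9kdWxlSWQiOiJtb2RfMjAxOTYyMjQ4OTE0MDgzIiwiYm9h"
         "cmRJZCI6IjEiLCJib2FyZFZlcnNpb25JZCI6IjIwMjAwNDMwXzAwNjYiLCJuIjoiMiJ9")
# pad to a multiple of 4 so b64decode accepts it
params = json.loads(base64.b64decode(token + "=" * (-len(token) % 4)))
print(params)  # note the "n" field, which looks like a page counter

# guess: bump "n" and re-encode to request the next chunk (assumption)
params["n"] = str(int(params["n"]) + 1)
new_token = base64.b64encode(
    json.dumps(params, separators=(",", ":")).encode()).decode().rstrip("=")
resp = requests.get("https://www.clarin.com/ondemand/%s.json" % new_token)
print(resp.status_code)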

Scraping a paginated website: Scraping page 2 gives back page 1 results

I am using the get method of the requests library in Python to scrape information from a website which is organized into pages (i.e. paginated with numbers at the bottom).
Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian
I am able to extract the data that I need from the first page, but when I feed my code the URL for the second page, I get the same data from the first page. After carefully analyzing my code, I am certain the issue is not my code logic but the way the second page URL is structured.
So my question is: how can I get my code to work as I want? I suspect it is a question of parameters but I am not 100% sure. If indeed it is parameters that I need to pass to the request, I would appreciate some guidance on how to break them down. My page 2 link is attached below.
Thanks.
Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
Note: The pages are not really links per se.
It looks like the platform is ASP.NET and the pagination links are driven by JS. I seriously doubt you will have it easy with Python, since BeautifulSoup is an HTML parser/extractor, so if you really want to use this site, I would suggest looking into Selenium or even PhantomJS, since they fully replicate the browser.
But in this particular case you are lucky, because there's a legacy website version which doesn't use modern bells and whistles :)
http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
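For instance, a sketch along these lines (the parameter names come straight from the legacy URL above; printing every href is just for illustration):
import requests
from bs4 import BeautifulSoup

# the legacy site takes the page number as a plain query parameter
base = "http://legacy.realfood.tesco.com/recipes/search.html"
for page in range(1, 6):
    params = {"st": "vegetarian", "cr": "False",
              "page": page, "srt": "searchRelevance"}
    resp = requests.get(base, params=params)
    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.find_all("a", href=True):
        print(page, link["href"])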
It looks like the pagination of this site is handled by the query parameters passed in the second URL you posted, i.e.:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
The query string is URL-encoded: %3D is = and %26 is &. It might be more readable like this:
q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
For example, if you wanted to pull back the fifth page of Vegetarian Recipes the URL would look like this:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'
You can keep incrementing the page number until you get a page with no results.
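Note that these parameters live in the URL fragment (everything after #!), which is handled by the site's JavaScript and never sent to the server, so plain requests.get() will keep returning page 1. Here is a sketch with Selenium instead; the result selector and the fixed sleep are assumptions to adjust after inspecting the page:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

template = ("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
            "#!q='selectedobjecttype%3DRECIPES%26page%3D{}%26perpage%3D30"
            "%26DietaryOption%3DVegetarian'")

driver = webdriver.Chrome()
page = 1
while True:
    driver.get(template.format(page))
    time.sleep(2)  # crude wait for the JS to render; a WebDriverWait is more robust
    soup = BeautifulSoup(driver.page_source, "html.parser")
    results = soup.select(".recipe-tile")  # selector is an assumption
    if not results:
        break
    print(page, len(results))
    page += 1
driver.quit()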
What about this?
from bs4 import BeautifulSoup
import urllib.request

for numb in range(1, 11):
    url = ("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian"
           "#!q='selectedobjecttype%3DRECIPES%26page%3D{}%26perpage%3D30"
           "%26DietaryOption%3DVegetarian'".format(numb))
    resp = urllib.request.urlopen(url)
    soup = BeautifulSoup(resp, "html.parser",
                         from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])
Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...

How to scrape the HTML content of a webpage after rendering is completed in Python [duplicate]

This question already has answers here:
scrape html generated by javascript with python
(5 answers)
Closed 6 years ago.
I'm currently on a mission to scrape popular joke websites. One example is a website called jokes.cc.com. If you visit the website and briefly hover your cursor over the 'Get Random Joke' button on the left of the page, you will notice the link it points to is jokes.cc.com/#.
If you wait for a while, it changes to a proper link within the website which displays the actual joke: jokes.cc.com/*legit joke link*.
If you analyze the HTML of the page, you will notice that there is a link (<a>) with class=random_link whose href attribute stores the link to the random joke the page wants to redirect you to. You can check this after the page has completely loaded; basically, the '#' is replaced by a legit link.
Now, here is my code for scraping the HTML, as I have done with static websites until now, using the BeautifulSoup library:
import urllib.request
from bs4 import BeautifulSoup

urlToRead = "http://jokes.cc.com"
handle = urllib.request.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
# find the exact position of the joke link in the page
print(soup.find_all('a', {'class': 'random_link'})[0])
Output: #
This is the expected output as I have come to realize that the page has not completely rendered.
How do I scrape the page after waiting a while, or after the rendering is complete? Will I need to use an external library like Mechanize? I'm unsure how to do that, so any help/guidance is appreciated.
EDIT: I was finally able to resolve my issue by using PhantomJS along with Selenium in Python. Here is the code which fetches the page after rendering is complete.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # Selenium driving PhantomJS
driver.get('http://jokes.cc.com/')
# fetch the HTML source code after rendering is complete
soupFromJokesCC = BeautifulSoup(driver.page_source, "html.parser")
# locate the link in the HTML
randomJokeLink = soupFromJokesCC.find_all('div', {'id': 'random_joke'})[0].find_all('a')[0]['href']
# now go to that page and scrape the joke from there
print(randomJokeLink)  # It works :D
The data you're after is generated by JavaScript running dynamically on page load. BeautifulSoup does not have a JavaScript engine, so it doesn't matter how long you wait: the link will never change. There are Python libraries which can scrape and understand JavaScript, but your best bet is probably to dig in and work out how the JS on the website actually works. If they have a data feed of jokes that a random joke is pulled from, for example, it might be in a format such as JSON, which Python can parse very easily. This would make your application much more lightweight than including a full-blown scripting engine.
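To make that concrete, consuming such a feed would only take a few lines. The endpoint below is hypothetical, not a real jokes.cc.com URL; it only illustrates the shape of the approach:
import json
import urllib.request

FEED_URL = "http://jokes.cc.com/some/discovered/feed.json"  # hypothetical endpoint

with urllib.request.urlopen(FEED_URL) as resp:
    jokes = json.load(resp)  # the payload shape is likewise an assumption

for joke in jokes:
    print(joke)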

Python Scraping fb comments from a website

I have been trying to scrape Facebook comments using Beautiful Soup from the page below.
from bs4 import BeautifulSoup
import urllib.request

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib.request.urlopen(url)
soup = BeautifulSoup(fd, "html.parser")
fb_comment = soup.find("div", {"class": "postText"})
print(fb_comment)
The output is a null set. However, I can clearly see the Facebook comment within those tags in the inspect element view of the TechCrunch site. (I am a little new to Python and was wondering if the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), then use BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
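At the time, the Graph API exposed comments keyed by page URL roughly as sketched below; nowadays the endpoint requires an access token and may have changed, so treat this as a starting point rather than a working recipe:
import requests

page_url = "http://techcrunch.com/2012/05/15/facebook-lightbox/"
# old-style Graph API lookup of comments by URL (may be deprecated)
resp = requests.get("https://graph.facebook.com/comments/",
                    params={"ids": page_url})
print(resp.json())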
The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.
