Extracting Disqus comments using Python and Beautiful Soup

This question is similar to the one asked here, but the answer was not of much help.
I am trying to extract comments from a webpage which uses Disqus; however, I am not able to access the comments section.
This is what I have so far (it's not much):
import urllib2
from bs4 import BeautifulSoup

site = "http://www.timesofmalta.com/articles/view/20161207/local/daphne-caruana-galizia-among-politicos-28-most-influential.633146"
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, "html.parser")

# The page title comes through fine; the Disqus comments do not.
title = soup.title.text
print title
Any hints as to how I could attempt to tackle this?

I had the same issue while trying to scrape an infinite-scroll page. After trying a million things, including Beautiful Soup, I realized that the best way to tackle this problem was to debug with Chrome: find the URL of the request that goes out as the dynamic content is loaded, and then work out how to parameterize it so you can call it in different ways.
So, for example, if you have the Chrome debugging console open when you trigger your infinite scroll, you will see an HTTP request (probably an HTTP GET) go out. If the URL has a structure such as:
http://www.yourlink.com/get_comments/product/page_offset_numbertoload/
you will be able to build that HTTP request with Python and send it, and the response will hold the data you are looking for. Good luck!
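For example, a minimal sketch of that idea with the requests library (the endpoint and its offset parameter are hypothetical placeholders modeled on the pattern above; substitute whatever URL you actually observe):

import requests

# Hypothetical endpoint; substitute the URL from the browser's network tab.
endpoint = "http://www.yourlink.com/get_comments/product/{offset}/"

comments = []
for offset in range(0, 100, 25):  # walk the pages 25 items at a time
    resp = requests.get(endpoint.format(offset=offset),
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    comments.extend(resp.json())  # assuming the endpoint returns a JSON list

print(len(comments))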

Related

How can I fix the find_all error while web scraping?

I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I tried using the soup.find_all() method, and to see if it works I printed the result, but it gives me an empty list. Can you help me fix it? Click here to see the specific part of the HTML code.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content, "lxml")
trendings = soup.find_all("ytd-video-renderer",
                          attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and contains scripts to load data. Whenever you make a request using requests.get("https://www.youtube.com/feed/explore"), it loads the initial source file that only contains things like head and meta information, plus the scripts themselves. In a real browser, you would then wait while those scripts load data from the server. BeautifulSoup does not catch the interactions with the DOM via JavaScript. That's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the raw response.
For dynamic webpages, I think you should use Selenium (or maybe Scrapy), as sketched below.
For YouTube, you can use its API as well.
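A rough sketch of the Selenium route (assuming Chrome with chromedriver installed; the tag and class names are the ones from the question, so they may still need adjusting):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.youtube.com/feed/explore")
time.sleep(5)  # crude wait for the scripts to render the video list

soup = BeautifulSoup(driver.page_source, "lxml")
trendings = soup.find_all("ytd-video-renderer",
                          attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(len(trendings))
driver.quit()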

Wait until page is fully loaded and then read its content with urllib2/3 [duplicate]

I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the html using urllib2.
However, I'm running into a problem: using the code below, the HTML I get back is not correct, as the page seems to take a second to redirect (you can verify this by visiting the URL). Instead I get the code from the page that briefly appears first.
Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?
import urllib2
from bs4 import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage, "html.parser")
print soup.prettify()
Edit: The answer is thorough; however, in the end what solved my problem was this:
https://stackoverflow.com/a/3210737/1157283
Interestingly, the problem isn't a redirect: the page modifies its content using JavaScript, but urllib2 doesn't have a JS engine, it just GETs data. If you disable JavaScript in your browser, you will note that it loads basically the same content as what urllib2 returns.
import urllib2
from BeautifulSoup import BeautifulSoup

bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
html = bostonPage.read()             # read the raw response once
soup = BeautifulSoup(html)
open('test.html', 'w').write(html)   # save the fetched HTML for comparison
Comparing test.html against the page in your browser with JS disabled (easiest in Firefox: Content -> uncheck Enable JavaScript) generates identical result sets.
So what can we do? Well, first we should check if the site offers an API, since scraping tends to be frowned upon:
http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/Hotel APIs?
It looks like they might, though with some restrictions.
But if we still need to scrape it, JavaScript and all, then we can use Selenium (http://seleniumhq.org). It's mainly used for testing, but it's easy to use and has fairly good docs. A minimal example follows below.
I also found this: Scraping websites with Javascript enabled? and this: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Hope that helps.
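For illustration, a minimal Selenium sketch that waits for the page to finish rendering before handing the HTML to BeautifulSoup (the ".listing" selector is a hypothetical placeholder for whatever TripAdvisor actually renders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://www.tripadvisor.com/HACSearch?geo=34438")
# Block until at least one result element is present (up to 10 seconds).
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".listing")))
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.text)
driver.quit()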
As a side note, the same fetch from an interactive session:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)

Simulate clicking a link when scraping with Python and BeautifulSoup

After reading for years, this is my first SO question. Thanks in advance for the help!
I'm looking to scrape content from articles on the Forbes website. Take this as an example page: http://www.forbes.com/sites/katevinton/2015/09/22/google-microsoft-qualcomm-and-baidu-announce-joint-investment-cloudflare/. When an article is loaded directly, the page source is a mess of JavaScript that is hard to parse. However, when I click on the 'print' button, it appends "/print/" to the URL and gives me a page I have no problem parsing with BeautifulSoup.
When I enter the URL with "/print/" appended, it redirects to the non-"/print/" page; I only get to the actual "/print/" page when I click the button. Thus, my question is: how can I simulate clicking that print button programmatically to get to the BeautifulSoup-scrapable page? Poking around, people seem to recommend mechanize for simulating browser actions, but I'm not sure what I'd be trying to do with it in this case. Or is there a better way to scrape this data entirely?
I appreciate any help you can offer!
You need to request it with the referer set, so something like this would work:
import requests

url = "http://www.forbes.com/sites/samsungbusiness/2015/09/23/how-your-car-is-becoming-the-next-hot-tech-gadget/print/"
# Setting the Referer to the non-/print/ article URL stops the server from
# redirecting the /print/ request back to the regular page.
print requests.get(url, headers={"referer": url.replace("print/", "")}).content
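From there the print page should parse cleanly; for example, feeding the same response to Beautiful Soup as a quick sanity check:

import requests
from bs4 import BeautifulSoup

url = "http://www.forbes.com/sites/samsungbusiness/2015/09/23/how-your-car-is-becoming-the-next-hot-tech-gadget/print/"
response = requests.get(url, headers={"referer": url.replace("print/", "")})
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text)  # should show the article title rather than a redirect page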

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the Earnings Per Share data through BeautifulSoup 4 from this page.
When I parse the data, the table information is missing with the default, lxml, and html5lib parsers. I believe this has something to do with JavaScript, and I have been trying to use PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it's been a very big headache for me today. Below is a quick example. The financeWrap div includes the table information, but BeautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup

url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = BeautifulSoup(response.content, 'html5lib')
financial_tables = soup_key_ratios.find("div", {"id": "financeWrap"})
print financial_tables
# Output: <div id="financeWrap">
# </div>
The issue is that you're trying to get data that comes in through Ajax on the website. If you go to the link you provided and look at the source via the browser, you'll see that the content with the data isn't there.
However, if you use a network inspector, such as Firebug, you will see that there are Ajax requests made to another URL, which is something you can fetch and parse yourself via BeautifulSoup (perhaps; I haven't tried it or looked at the structure of the data).
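Once you spot the Ajax URL in the network inspector, the fetch itself is straightforward; a sketch (the endpoint below is a made-up placeholder, not a real Morningstar URL):

import requests

# Made-up placeholder; substitute the Ajax URL you observe in the network tab.
ajax_url = "http://financials.morningstar.com/ajax/SOME_ENDPOINT?t=AAPL"
response = requests.get(ajax_url)
response.raise_for_status()
print(response.text[:500])  # inspect the payload to decide how to parse it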
Keep in mind that this is quite possibly against the website's ToS.

Python Scraping fb comments from a website

I have been trying to scrape Facebook comments using Beautiful Soup on pages like the one below.
import BeautifulSoup
import urllib2

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
fd = urllib2.urlopen(url)
soup = BeautifulSoup.BeautifulSoup(fd)
# grab the first comment div, if any
fb_comment = soup.find("div", {"class": "postText"})
print fb_comment
The output is empty. However, I can clearly see the Facebook comment within those tags when I inspect the element on the TechCrunch site. (I am a little new to Python and was wondering if the approach is correct and where I am going wrong.)
As Christopher and Thiefmaster said: it is all because of JavaScript.
But if you really need that information, you can still retrieve it thanks to Selenium (http://seleniumhq.org), then use BeautifulSoup on its output.
Facebook comments are loaded dynamically using AJAX. You can scrape the original page to retrieve this:
<fb:comments href="http://techcrunch.com/2012/05/15/facebook-lightbox/" num_posts="25" width="630"></fb:comments>
After that you need to send a request to some Facebook API that will give you the comments for the URL in that tag.
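A sketch of that first step, pulling the tag out of the static page with BeautifulSoup (the follow-up Facebook API request is omitted, since the exact endpoint depends on the API version):

import urllib2
from bs4 import BeautifulSoup

url = 'http://techcrunch.com/2012/05/15/facebook-lightbox/'
soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")
# Find the <fb:comments> placeholder that Facebook's JS widget fills in.
tag = soup.find('fb:comments')
if tag is not None:
    print(tag.get('href'))  # the URL whose comments are loaded via AJAX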
The parts of the page you are looking for are not included in the source file. Use a browser and you can see this for yourself by opening the page source.
You will need to use something like pywebkitgtk to have the JavaScript executed before passing the document to BeautifulSoup.
