Scraping the comments section of a Web page - python

I am attempting to scrape the comment counter from a Web page. The code is presented below.
When I ask it to print letters, the output is an empty list. Why is that happening?
import urllib2
from bs4 import BeautifulSoup
r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup2 = BeautifulSoup(r2)
letters = soup2.find_all("div",class_="fyre-comment-count")
print letters

The list is empty because there are no comments in the HTML you fetched: div#livefyre-comment is empty, and div.fyre-comment-count does not exist.
Up in the page's header, there is a suspicious script tag pulling JavaScript from http://cdn.livefyre.com/Livefyre.js. I don't know what Livefyre is, but I assume it sucks comments out of a database somewhere and inserts them into div#livefyre-comment or its surrounding div.article-comments. Presumably, div.fyre-comment-count will also appear somewhere in the DOM once the script is done.
This sort of... design decision is increasingly common on Web sites. To see what a Web page really looks like, browse it with both JavaScript and cookies off (and be prepared for the occasional "500 Internal Server Error" from sites that never imagined such hooliganism was possible).
I don't know enough about screen scraping to tell you where to go from here. You might be able to piece together a URL to fetch the comments (and their count) directly from Livefyre. I'd start by perusing the JavaScript functions they provide, and the data-settings attribute of div#livefyre-comment, which appears to be a JSON dictionary full of relevant parameters.
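For example, a hedged sketch of that starting point, reusing the soup2 object from the question (the id and attribute names come from eyeballing the page and may have changed since):
import json
# div#livefyre-comment carries a data-settings attribute that appears to be a
# JSON dictionary of Livefyre parameters (site id, article id, and so on).
widget = soup2.find("div", id="livefyre-comment")
if widget is not None and widget.has_attr("data-settings"):
    settings = json.loads(widget["data-settings"])
    print(settings.keys())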

Your code is very close; you just missed a few things. Check the code below.
from bs4 import BeautifulSoup
import urllib2
r2 = urllib2.urlopen("http://www.ign.com/articles/2016/01/03/steam-surpasses-12-million-concurrent-users").read()
soup = BeautifulSoup(r2, 'html.parser')
for line in soup.find_all("div", class_="fyre-comment-count"):
    comments = ''.join(line.find_all(text=True))
    print (comments)

Related

Scraping a paginated website: Scraping page 2 gives back page 1 results

I am using the get method of the requests library in python to scrape information from a website which is organized into pages (i.e paginated with numbers at the bottom).
Page 1 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian
I am able to extract the data I need from the first page, but when I feed my code the URL for the second page, I get the same data as the first page. After carefully analyzing my code, I am certain the issue is not my code logic but the way the second-page URL is structured.
So my question is: how can I get my code to work as I want? I suspect it is a question of parameters, but I am not 100% sure. If it is indeed parameters that I need to pass to the request, I would appreciate some guidance on how to break them down. My page 2 link is attached below.
Thanks.
Page 2 link: https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
Note: The pages are not really links per se.
It looks like the platform is ASP.NET and the pagination links are driven by JS. I seriously doubt you will have it easy with Python, since BeautifulSoup is an HTML parser/extractor, so if you really want to use this site, I would suggest looking into Selenium or even PhantomJS, since they fully replicate the browser.
But in this particular case you are lucky, because there's a legacy website version which doesn't use modern bells and whistles :)
http://legacy.realfood.tesco.com/recipes/search.html?st=vegetarian&cr=False&page=3&srt=searchRelevance
It looks like the pagination of this site is handled by the query parameters passed in the second URL you posted, i.e.:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D2%26perpage%3D30%26DietaryOption%3DVegetarian'
The query string is URL-encoded: %3D is = and %26 is &. It might be more readable like this:
q='selectedobjecttype=RECIPES&page=2&perpage=30&DietaryOption=Vegetarian'
For example, if you wanted to pull back the fifth page of Vegetarian Recipes the URL would look like this:
https://realfood.tesco.com/search.html?DietaryOption=Vegetarian#!q='selectedobjecttype%3DRECIPES%26page%3D5%26perpage%3D30%26DietaryOption%3DVegetarian'
You can keep incrementing the page number until you get a page with no results.
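A rough sketch of that loop with requests, using the legacy endpoint from the previous answer (HTTP clients never send the #! fragment of the modern URL to the server, so the plain page parameter is the safer route; the fixed page range is an assumption):
import requests
from bs4 import BeautifulSoup
base = ("http://legacy.realfood.tesco.com/recipes/search.html"
        "?st=vegetarian&cr=False&srt=searchRelevance&page={}")
for page in range(1, 6):  # bump the range as needed, or stop when a page comes back empty
    soup = BeautifulSoup(requests.get(base.format(page)).text, 'html.parser')
    for link in soup.find_all('a', href=True):  # adapt the selector to the recipe links you need
        print(page, link['href'])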
What about this?
from bs4 import BeautifulSoup
import urllib.request
for numb in ('1', '10'):
    # Note: numb is not used in the URL below; to actually paginate you would
    # substitute it into a paginated URL such as the legacy endpoint shown above.
    resp = urllib.request.urlopen("https://realfood.tesco.com/search.html?DietaryOption=Vegetarian")
    soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])
Hopefully it works for you. I can't test it because my office blocks these kinds of things. I'll try it when I get home tonight to see if it does what it should do...

How to scrape ASP webpage in Python?

In this video, I give you a look at the dataset I want to scrape/take from the web. Very sorry about the audio, but I did the best with what I have. It is hard for me to describe what I am trying to do: I see a page with thousands of pages, and it obviously has tables, but pd.read_html doesn't work! Then it hit me: this page has a form to be filled out first....
https://opir.fiu.edu/instructor_eval.asp
Going to this link will allow you to select a semester, and in doing so, will show thousands upon thousands of tables. I attempted to use the URL after selecting a semester, hoping to read the HTML, but no such luck. I still don't know what I'm even looking at (like, is it a webpage, or is it ASP? What even IS ASP?). If you follow the video link, you'll see that it gives an ugly error if you select spring semester, copy the link, and put it in the search bar. Some SQL error.
In my last post, I did a brute-force attempt to get them by just clicking and dragging for 10+ minutes, then pasting into Excel. That's an awful way of doing it, and it wasn't even particularly useful when I imported that Excel sheet into Python, because the data was very difficult to work with. Very unstructured. So I thought, hey, why not scrape with bs4? Not that easy either, it seems, as the URL won't work. After filtering to spring semester, the URL just won't work, not for you, and not if you paste it into Python for bs4 to use...
So I'm sort of at a loss here as to how to reasonably work with this data. I want to scrape it with bs4 and put it into dataframes to be manipulated later. However, as it is ASP or whatever it is, I can't find a way to do so yet :\
ASP stands for Active Server Pages; it means the page is generated by a server-side script (usually VBScript), so this shouldn't concern you, since you want to scrape data from the rendered page.
In order to get a valid response from /instructor_evals/instr_eval_result.asp you have to submit a POST request with the form data of /instructor_eval.asp, otherwise the page returns an error message.
If you submit the correct data with urllib you should be able to get the tables with bs4.
from urllib.request import urlopen, Request
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://opir.fiu.edu/instructor_evals/instr_eval_result.asp'
# Form fields from /instructor_eval.asp; 'Term' selects the semester and the
# rest are left blank or wildcarded.
data = {'Term': '1171', 'Coll': '%', 'Dept': '', 'RefNum': '', 'Crse': '', 'Instr': ''}
# Passing a data payload makes urlopen issue a POST request.
r = urlopen(Request(url, data=urlencode(data).encode()))
html = r.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
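If the goal is DataFrames, one option (assuming pandas and lxml are installed) is to hand each scraped table back to pandas:
import pandas as pd
# read_html returns a list of DataFrames, one per <table> in the markup,
# so take the first element for each scraped table.
frames = [pd.read_html(str(table))[0] for table in tables]
print(frames[0].head())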
By the way, this error message is a strong indication that the page is vulnerable to SQL injection, which is a very nasty bug, and I think you should inform the admin about it.

Router Access - Beautiful Soup - Python 3.5

I have a router that I want to log in to and retrieve information from using a Python script. I'm a newbie to Python but want to learn and explore more with it. Here is what I have written so far:
from requests.auth import HTTPBasicAuth
import requests
from bs4 import BeautifulSoup
response = requests.get('http://192.168.1.1/Settings.html/', auth=HTTPBasicAuth('Username', 'Password'))
html = response.content
soup = BeautifulSoup(html, "html.parser")
print (soup.prettify())
I have two questions which are:
When I run the script the first time, I receive an authentication error. On running the script a second time it seems to authenticate fine and retrieve the HTML. Is there a better method?
With BS I want to retrieve only the code I require. I can't see a tag to point BS at. At the start of the HTML there is a list of variables whose data I want to scrape, for example:
var Device Pin = '12345678';
It's much easier to retrieve the information using a single script instead of logging onto the web interface each time. The variable sits within a script type="text/javascript" block.
Is BS the correct tool for the job? Can I just scrape that one line from the list of variables?
Any help, as always, is very much appreciated.
As far as I know, BeautifulSoup does not handle JavaScript. In this case, it's simple enough to just use regular expressions:
import re
# response.content is bytes under Python 3, so decode it before matching.
m = re.search(r"var Device Pin\s+= '(\d+)'", html.decode('utf-8', 'ignore'))
pin = m.group(1)
Regarding the authentication problem, you can wrap your call in try/except and redo the call if it doesn't work the first time.
I'd run a packet sniffer, tcpdump or Wireshark, to see the interaction between your script and your router. Viewing the interactions may help determine why you're unable to authenticate on the first pass. As a workaround, run the auth section in a for loop that tries N times to authenticate before failing.
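A rough sketch of that retry idea, reusing the requests call from the question (the attempt count of 3 is an arbitrary choice):
import requests
from requests.auth import HTTPBasicAuth
response = None
for attempt in range(3):  # try up to 3 times before giving up
    response = requests.get('http://192.168.1.1/Settings.html/',
                            auth=HTTPBasicAuth('Username', 'Password'))
    if response.status_code == 200:  # authenticated and fetched the page
        break
else:
    raise RuntimeError("authentication failed after 3 attempts")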
Regarding scraping, you may want to consider lxml with the BeautifulSoup parser so you can use XPath. See: Can we use XPath with BeautifulSoup?
XPath would allow you to easily pull a single value, text, attribute, etc. from the HTML, if lxml can parse it.
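A minimal sketch of that approach, reusing the response from the question above (the XPath targets inline script blocks and is an assumption; adjust it to the page's actual markup):
from lxml.html import soupparser  # lxml's BeautifulSoup-backed parser
import re
# Parse the fetched page, walk the inline <script> blocks, and regex out the pin.
root = soupparser.fromstring(response.text)
for script_text in root.xpath('//script[@type="text/javascript"]/text()'):
    m = re.search(r"var Device Pin\s+= '(\d+)'", script_text)
    if m:
        print(m.group(1))
        break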

Wait until page is fully loaded and then reading its content with urllib2/3 [duplicate]

I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the HTML using urllib2.
However, I'm running into a problem where, using the code below, the HTML I get back is not correct, as the page seems to take a second to redirect (you can verify this by visiting the URL); instead I get the code from the page that briefly appears first.
Is there some behavior or parameter to set to make sure the page has completely finished loading/redirecting before getting the website content?
import urllib2
from bs4 import BeautifulSoup
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
soup = BeautifulSoup(bostonPage)
print soup.prettify()
Edit: The answer is thorough; however, in the end what solved my problem was this:
https://stackoverflow.com/a/3210737/1157283
Interestingly, the problem isn't a redirect: the page modifies its content using JavaScript, but urllib2 doesn't have a JS engine, it just GETs data. If you disable JavaScript in your browser, you will notice it loads basically the same content as what urllib2 returns.
import urllib2
from BeautifulSoup import BeautifulSoup
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
value = bostonPage.read()
soup = BeautifulSoup(value)
open('test.html', 'w').write(value)
Comparing test.html against the page in a browser with JS disabled (easiest in Firefox: Content -> uncheck Enable JavaScript) produces identical result sets.
So what can we do? Well, first we should check if the site offers an API; scraping tends to be frowned upon.
http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/Hotel API's?
It looks like they might, though with some restrictions.
But if we still need to scrape it, with JS and all, then we can use Selenium (http://seleniumhq.org/); it's mainly used for testing, but it's easy to use and has fairly good docs.
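If you go the Selenium route, a minimal sketch might look like this (assuming the selenium package and a matching browser driver are installed; the fixed 5-second wait is a crude assumption, a WebDriverWait on a specific element would be more robust):
import time
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get("http://www.tripadvisor.com/HACSearch?geo=34438")
time.sleep(5)              # crude: give the client-side JavaScript time to render
html = driver.page_source  # the DOM after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())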
I also found this Scraping websites with Javascript enabled? and this http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
hope that helps.
As a side note:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>>
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)

How to setup BeautifulSoup to avoid false results?

In using BeautifulSoup I am seeing many cases where the information sought is definitely in the HTML input, yet BeautifulSoup fails to find it. This is a problem because there are also cases where the information genuinely isn't there, so it is impossible to know whether an empty search result means BeautifulSoup failed or the information simply isn't present.
Here's a simple example:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
parsed_html = BeautifulSoup(html)
html = parsed_html.find(id="SalesRank")
I've run tests with dozens of URLs of pages that do have this id and, to my dismay, get seemingly random results. Sometimes some of the URLs will produce a search hit and other times none.
In sharp contrast to this, if I run a simple string search I get the correct result every time:
url_obj = urllib2.urlopen(url)
html = url_obj.read()
url_obj.close()
index = html.find("SalesRank")
# Slice off a chunk-o-html from there
# Then use regex to grab what we are after
This works every time. The prior BeautifulSoup example fails in a seemingly random fashion, on the same URLs. What's alarming is that I can run the BeautifulSoup code twice in a row on the same set of URLs and get different responses. The simple string search code is 100% consistent and accurate in its results.
Is there a trick to setting up BeautifulSoup in order to ensure it is as consistent and reliable as a simple string search?
If not, is there an alternative library that is rock solid reliable and repeatable?
Nowadays, page loads are more complex and often involve a series of asynchronous calls, a lot of client-side JavaScript logic, DOM manipulation, etc. The page you see in the browser usually is not the page you get via requests or urllib2. Additionally, the site can have defensive mechanisms in place: for example, it can check the User-Agent header, ban your IP after multiple consecutive requests, etc. This is really site-specific and there is no "silver bullet" here.
Besides, the way BeautifulSoup parses the page depends on the underlying parser. See: Differences between parsers.
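One concrete step is to name the parser explicitly instead of letting BeautifulSoup pick whichever one happens to be installed; a small sketch, reusing the html string from the question ('lxml' and 'html5lib' require extra packages):
from bs4 import BeautifulSoup
# Pinning the parser removes one source of machine-to-machine variation;
# html5lib is the most forgiving of broken markup, lxml the fastest.
for parser in ('html.parser', 'lxml', 'html5lib'):
    parsed_html = BeautifulSoup(html, parser)
    print("%s -> %s" % (parser, parsed_html.find(id="SalesRank")))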
The most reliable way to achieve "what you see in the browser is what you get in the code" is to use a real browser, headless or not. For example, the selenium package would be useful here.
