Scraping site that uses AJAX - python

I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user has to press "Show more" to get 10 more reviews (this also appends #add10 to the end of the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by appending #add1000 (where 1000 is the number of additional reviews) to the site's address. The problem is that in my spider I get only the first 10 reviews with site_url#add1000, just as with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one made by the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})
But every time I get a 404 error. Can anyone explain what I'm doing wrong?

What you need is a headless browser for this, since the requests module cannot handle AJAX well.
One such tool is Selenium, e.g.:
driver.find_element_by_id("show more").click() # This is just an example case
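For instance, a minimal sketch that keeps clicking "Show more" until no more reviews load. The URL, element id, and CSS selector below are hypothetical placeholders; adjust them to the actual page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Firefox()
driver.get("http://example.com/reviews")    # hypothetical review page URL

# keep clicking "Show more" until the button stops appearing
while True:
    try:
        button = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.ID, "show-more")))   # hypothetical element id
        button.click()
    except TimeoutException:
        break    # no clickable button within 5 seconds: assume everything is loaded

reviews = driver.find_elements(By.CSS_SELECTOR, ".review")      # hypothetical selector
print(len(reviews), "reviews collected")
driver.quit()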

Normally, when you scroll down the page, Ajax sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL of this JSON/XML file. Normally, you can open Firefox, go to Tools / Web Developer / Web Console, monitor the network activity, and easily catch this JSON/XML request.
Once you find this file, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use a few different clients and IPs. Be nice to the server and it won't block you.
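For example, once you have spotted the XHR in the network tab, replaying it can be as simple as the sketch below. The endpoint and parameters are hypothetical placeholders for whatever you find in the web console:

import requests

# hypothetical endpoint and parameters copied from the browser's network tab
url = "http://domain/ajaxlst"
params = {"par1": "x", "par2": "y"}
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0"}

resp = requests.get(url, params=params, headers=headers)
resp.raise_for_status()

# if the endpoint returns JSON:
data = resp.json()

# if it returns an HTML/XML fragment instead, parse it with bs4:
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(resp.text, "html.parser")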

Related

Logging and potentially blocking XHR Requests by javascript using selenium

I have a single-page application which composes XHR requests on the fly. It is used to implement pagination for a list of links I want to click on using Selenium.
The page only provides a "Goto next page" link. When the next-page link is clicked, a JavaScript function creates an XHR request and updates the page content.
When I click one of the links in the list I get redirected to a new page (again through JavaScript with obfuscated request generation). Though this is exactly the behaviour I want, when going back to the previous page I have to start over from the beginning (i.e. start at page 0 and click through to page n).
A few solutions came to mind:
Block the second XHR request fired when clicking a link in the list, store it, and replay it later. This way I can skim through the pages but keep my links for later replay.
Somehow 'inject' the first XHR request that does the pagination, to save myself from clicking through all the pages again.
I also tried some simple proxies, but HTTPS is causing trouble for me, and I'm wondering if there is a simple solution I might have missed.
browsermob-proxy integrates easily and will allow you to capture all the requests made. It should also allow you to block certain calls from returning.
It does sound like you are scraping a site, so it might be worth parsing the data the XHR calls send and mimicking them.
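A minimal sketch of wiring browsermob-proxy into Selenium to capture (and optionally block) the XHR traffic might look like this. The binary path, URLs, and regex are placeholders, and for HTTPS you usually have to relax certificate checks so the proxy can intercept the traffic:

from browsermobproxy import Server
from selenium import webdriver

# start the BrowserMob Proxy server (the binary path is a placeholder)
server = Server("/path/to/browsermob-proxy/bin/browsermob-proxy")
server.start()
proxy = server.create_proxy()

# point a Chrome instance at the proxy
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=%s" % proxy.proxy)
options.add_argument("--ignore-certificate-errors")   # often needed for HTTPS through the proxy
driver = webdriver.Chrome(options=options)

# record all requests into a HAR while paginating
proxy.new_har("pagination")
driver.get("https://example.com/list")                 # hypothetical start page

# ... click through the pages with Selenium here ...

# inspect the captured XHR calls so they can be replayed later
for entry in proxy.har["log"]["entries"]:
    print(entry["request"]["method"], entry["request"]["url"])

# optionally stop matching calls from returning (regex and status code are placeholders)
# proxy.blacklist("https?://example\\.com/obfuscated/.*", 404)

driver.quit()
server.stop()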

Python: Using requests for a webpage that redirects

Update: I did find the information I needed in the API; not really an answer to this specific question, but a solution for my software.
I'm trying to login to a webpage, navigate to another page, and parse an HTML table.
If you use the browser to go to the target page without being logged in, it takes you to the default landing page and you have to navigate to the target page anyway. That is why I have two URL calls.
import requests

payload = {'username': 'USER', 'password': 'PASSWORD'}
with requests.Session() as s:
    p = s.post('login_url', data=payload)
    r = s.get('target_url')
When you navigate to the login page, it normally goes to another page to check your browser before reaching the login page itself. This is the response I get from 'p':
<span data-translate="checking_browser">Checking your browser before accessing</span> website.</h1>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs">Please allow up to 5 seconds…</p>
...which is just the page telling you to wait to be redirected before you can log in. Is there a way to handle this so that it waits for the page where it can log in? I will need to make this call about every 20 minutes in my code, so even better if I can stay logged in and only request the target page.
Ideal solution: Log in one time at beginning of program and stay logged in.
Better solution: Re-log in each time but avoid the five second wait time to change pages.
Acceptable solution: Wait the five seconds to be redirected prior to login each time.
This "checking your browser" message looks like a CloudFlare feature which is designed to stop people from accessing the site in this way - you will need to run some javascript from the server to pass this barrier (the idea being that someone accessing the site in a browser will have the javascript run automatically - if they're using a bot to scrape the site it'll fail)
. If the site has an API, switching to use that would be my first suggestion.
Otherwise, there are packages to help you get around this issue, but since the barriers are explicitly to prevent this kind of use, they're liable to stop working when CloudFlare makes changes.
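As one example of such a package (assuming the block really is CloudFlare's challenge page, and bearing in mind the library can break whenever CloudFlare changes), cloudscraper exposes a requests-like session that tries to solve the JavaScript challenge for you. A minimal sketch, reusing the placeholder URLs from the question:

import cloudscraper

# create_scraper() returns a requests.Session-like object
scraper = cloudscraper.create_scraper()

payload = {'username': 'USER', 'password': 'PASSWORD'}
p = scraper.post('login_url', data=payload)    # placeholder URLs from the question
r = scraper.get('target_url')
print(r.status_code)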

Post API search with Python

I'm trying to scrape all news items from this website: http://www.uvm.dk/aktuelt. They don't show up in the page source.
I've tried using Firefox's Live HTTP Headers and Chrome's developer tools but still can't figure out what goes on behind the scenes. I'm sure it's pretty simple :-)
I have this information, but how do I use it to scrape the wanted news?
http://www.uvm.dk/api/search
Request Method: POST
Connection: keep-alive
PageId=8938bc1b-a673-4513-80d1-e1714ca93d7c&Term=&Years%5B%5D=2017&WorkAreaIds=&SubjectIds=&TemplateIds=&NewsListIds%5B%5D=Emner&TimeSearch%5BEvaluation%5D=&FlagSearch%5BEvaluation%5D=Alle&DepartmentNames=&Letters=&RootItems=&Language=da&PageSize=10&Page=1
Can anyone help?
Not a direct answer, but some hints.
Your approach with Live HTTP Headers is a good one. Open the sidebar before loading the home page and clear everything. Then load the home page and an article. There will usually be a ton of HTTP requests because of images, CSS, and JS, but you'll be able to find the few useful ones. Usually the very first is the home page and one somewhere below is the article's main request. Another interesting one is the request fired when you click "next page".
I like to decouple downloading (HTTP) and scraping (HTML or JSON or similar).
I download to a file with a first script and scrape with a second one.
First, because I want to be able to adjust the scraping without downloading again and again. Second, because I prefer bash+curl for downloading and python+lxml for scraping. If the scraping step needs to feed information back into downloading, my scraping script prints it to the console.
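For the uvm.dk example above, the download step might boil down to replaying the POST you already captured. A minimal sketch, assuming the endpoint answers with JSON containing the news items (inspect the actual response to confirm the structure):

import requests

url = "http://www.uvm.dk/api/search"
data = {
    "PageId": "8938bc1b-a673-4513-80d1-e1714ca93d7c",
    "Term": "",
    "Years[]": "2017",
    "WorkAreaIds": "",
    "SubjectIds": "",
    "TemplateIds": "",
    "NewsListIds[]": "Emner",
    "TimeSearch[Evaluation]": "",
    "FlagSearch[Evaluation]": "Alle",
    "DepartmentNames": "",
    "Letters": "",
    "RootItems": "",
    "Language": "da",
    "PageSize": "10",
    "Page": "1",        # bump this to walk through the result pages
}

resp = requests.post(url, data=data)
resp.raise_for_status()
print(resp.json())      # assumption: the API returns JSON; adjust if it returns HTML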

Urllib Counts as WebPage Hit?

I'm studying opening pages through Python (3.3).
import urllib.request

url = 'http://www.google.com'
page = urllib.request.urlopen(url)
Does the above code count that as one hit to Google or does this?
os.system('start chrome.exe google.com')
The first one scrapes the page while the second one actually opens the page in a browser. I was just wondering if it made a difference page hit wise?
Both do very different things.
Using urllib.request.urlopen makes a single HTTP request.
Your second example will do the same, and then the browser will parse the document it receives and request subsequent resources (images/JavaScript/CSS/whatever). So loading google.com in your browser triggers many hits.
Try it yourself by looking in your browser's developer tools (usually the Network section) while you load a page.

Direct link to comments that are being loaded asynchronously?

I am playing around with change.org and trying to download a couple of comments on a petition. For this, I would like to know where the comments are pulled from when the user clicks "load more reasons". For an example, look here:
http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food
Looking at the XHR requests in Chrome, I see requests being sent to http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments. Of course, the page number varies with the number of times comments are loaded.
However, this link leads to a blank page when I try it in a browser. Is this because of some missing data in the URL, or is it the result of some authentication step within the JavaScript that makes the request in the first place?
Any pointers will be appreciated. Thanks!
EDIT: Thanks to the first response, I see that the data is received when I use the console. How do I receive the same data when making the request from a Python script? Do I have to mimic the browser, or is there a way to just use urllib?
They must be validating the source of the request. If you go to the site, open the console, and run this:
$.get('http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food/opinions?page=2&role=comments',{},function(data){console.log(data);});
you will see the data come back.
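To get the same data from Python, one common approach is to mimic the headers the browser's XHR sends. A minimal sketch with requests, assuming the server only checks the X-Requested-With/Referer headers and any cookies set by the petition page (it may require more than this):

import requests

petition = "http://www.change.org/petitions/tell-usda-to-stop-using-pink-slime-in-school-food"

with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    s.get(petition)                          # pick up any cookies the site sets
    r = s.get(petition + "/opinions",
              params={"page": 2, "role": "comments"},
              headers={"X-Requested-With": "XMLHttpRequest",
                       "Referer": petition})
    print(r.status_code)
    print(r.text[:500])                      # HTML fragment or JSON, depending on the endpoint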
