Logging and potentially blocking XHR requests made by JavaScript, using Selenium (Python)

I have a single-page application that composes XHR requests on the fly. The requests implement pagination for a list of links I want to click on using Selenium.
The page only provides a "Go to next page" link. When that link is clicked, a JavaScript function creates an XHR request and updates the page content.
When I click one of the links in the list, I get redirected to a new page (again through JavaScript with obfuscated request generation). That is exactly the behaviour I want, but when I go back to the previous page I have to start over from the beginning (i.e. start at page 0 and click through to page n).
A few solutions come to mind:
Block the second XHR request made when clicking a link in the list, store it, and replay it later. This way I can skim through the pages but keep the links for later replay.
Somehow "inject" the first XHR request, the one that does the pagination, to save myself from clicking through all the pages again.
I also tried some simple proxies, but HTTPS is causing trouble for me, and I was wondering whether there is a simple solution I might have missed.

BrowserMob Proxy integrates easily with Selenium and will allow you to capture all the requests made. It should also allow you to block certain calls from returning.
It does sound like you are scraping a site, so it might also be worth inspecting the XHR calls themselves and mimicking them directly; a sketch of the proxy approach is below.
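A minimal sketch of that setup (Selenium 4 with Firefox assumed; the proxy binary path, URLs, link text, and blacklist pattern are placeholders you would adapt):

# Sketch: capture and selectively block requests with browsermob-proxy + Selenium.
# Paths and URLs below are placeholders, not the asker's actual site.
from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.common.by import By

server = Server("/path/to/browsermob-proxy")   # path to the BrowserMob Proxy binary
server.start()
proxy = server.create_proxy()

options = webdriver.FirefoxOptions()
options.proxy = proxy.selenium_proxy()          # route browser traffic through the proxy
driver = webdriver.Firefox(options=options)

proxy.new_har("pagination", options={"captureContent": True})   # start recording requests
# Optionally block matching calls from returning real content (empty 200 response here):
proxy.blacklist("https://example\\.com/some/xhr/endpoint.*", 200)

driver.get("https://example.com/list")
driver.find_element(By.LINK_TEXT, "Goto next page").click()

# Every captured request/response is now available for inspection or later replay:
for entry in proxy.har["log"]["entries"]:
    print(entry["request"]["method"], entry["request"]["url"])

driver.quit()
server.stop()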

Related

Intercepting AJAX calls while web-scraping

I am trying to scrape an app store, in particular the reviews for every app. The issue is that the reviews are dynamic and the client side makes an AJAX POST call to fetch that data. The payload for this call is also generated dynamically for every page. I was wondering whether there is any way to intercept this request through code to get the payload, with which I could then make the call using requests or any other library. I am able to make this call through Postman using the parameters I get from the browser's Network Activity inspector.
I could use Selenium to scrape the finally rendered page, but it waits for the entire page to load, which is wasteful since I don't need the whole page.
import requests

payload = "<This is dynamically created for every page and is constant for that given page>"
header = {"Content-Type": "application/x-www-form-urlencoded"}
url = 'https://appexchange.salesforce.com/appxListingDetail'
r = requests.post(url=url, data=payload, headers=header)  # works from Postman with the same parameters
I was wondering if its possible to get this payload through the scraper which can intercept all the AJAX calls made when it tries to scrape the base web-page.
If you want to scrape a page that contains scripts and dynamic content, I would recommend using Puppeteer.
It is a headless scriptable browser (it can also be used in non-headless mode); it loads the page exactly like a browser, with all of its dynamic content and scripts. So you can simply wait for the page to finish loading and then read the rendered content.
You can also use Selenium.
Also check this out: a list of almost all headless browsers.
UPDATE
I know it's been some time, but I will post the answer to your comment here, in case you still need it or just for the sake of completeness:
In Puppeteer you can asynchronously wait for a specific DOM node to appear on the page, or for some other event to occur, and then continue with your work without caring about the rest of the page (it will keep loading in the background, but you don't need to wait for it).
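The same idea works in Selenium (Python), which the question already uses. A minimal sketch, assuming the reviews live in a hypothetical element with id "reviews":

# Sketch: wait only for the element you care about instead of the whole page.
# The URL and the "reviews" id are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://appexchange.salesforce.com/appxListingDetail")

# Block only until the reviews container is present, not until every resource has loaded.
reviews = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "reviews"))
)
print(reviews.text)
driver.quit()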

Scraping a site that uses AJAX

I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user has to press "Show more" to get 10 more reviews (which also appends #add10 to the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that in my spider I get only the first 10 reviews with site_url#add1000, just as with site_url, so this approach doesn't work.
I also can't find a way to construct a Request that imitates the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y', and I have tried all of the following:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})
But every time I get a 404 error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since plain request modules cannot handle AJAX (they do not execute JavaScript).
One such tool is Selenium.
For example:
driver.find_element_by_id("show more").click() # This is just an example case
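A slightly fuller sketch of that approach (the URL and the link text are made up for illustration):

# Sketch: drive a real browser and keep clicking "Show more" until it disappears.
# The URL and locator are placeholders for the actual site.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("https://example.com/reviews")

while True:
    try:
        driver.find_element(By.LINK_TEXT, "Show more").click()
        time.sleep(1)   # give the AJAX call time to append the next 10 reviews
    except NoSuchElementException:
        break           # no more reviews to load

html = driver.page_source   # now contains the full review list
driver.quit()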
Normally, when you scroll down the page, AJAX sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL behind this JSON/XML file. In Firefox, open Tools → Web Developer → Web Console, monitor the network activity, and you can easily catch this JSON/XML request.
Once you find this URL, you can parse the reviews directly from its responses (I recommend the Python modules requests and bs4 for this) and save a huge amount of time; a sketch is below. Remember to vary clients and IPs, be nice to the server, and it won't block you.
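A minimal sketch of calling such an endpoint directly (the endpoint and parameters echo the question's 'domain/ajaxlst?par1=x&par2=y'; the JSON layout and field names are hypothetical):

# Sketch: call the AJAX endpoint found in the Network tab directly.
# Endpoint, parameters, and JSON field names below are placeholders.
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})   # look like a normal browser

resp = session.get("https://domain/ajaxlst", params={"par1": "x", "par2": "y"})
resp.raise_for_status()

data = resp.json()                          # assuming the endpoint returns JSON
for review in data.get("reviews", []):      # hypothetical field name
    print(review)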

Google serves its homepage to urllib2 when a local search is made

When a local search is done on Google and the user clicks the 'More ...' link below the map, the user is brought to a page such as this.
If the URL:
https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl
is copied out and pasted back into a browser, one arrives, as expected, at the same page. Likewise, when a browser is opened with WebDriver, directly accessing the URL brings WebDriver to the same page.
When an attempt is made, however, to request the same page with urllib2, Google serves its home page (google.com) instead, which means, among other things, that lxml's extraction capabilities cannot be used.
While urllib2 is not the culprit here (perhaps Google does the same with all headless requests), is there any way of getting Google to serve the desired page? A quick test with the requests library indicates the same issue.
I think the big hint here is in the URL:
https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl
Do you notice the hash character (#) in there? Everything following the hash (the fragment) is never actually sent to the server, so the server can't process it. This indicates (in this case) that the page you are seeing in WebDriver and in your browser is the result of client-side scripting.
When you load the page, your browser sends a request for https://www.google.com/ncr and Google returns the home page. The home page contains JavaScript that analyses the component after the hash and uses it to generate the page that you expect to see. The browser and WebDriver can do this because they process the JavaScript. If you disable JavaScript in your browser and go to that link, you'll see that the page isn't generated either.
urllib2, however, does not process JavaScript. All it sees is the HTML the website initially sent, along with the JavaScript source, but it can't run the JavaScript that actually generates the page you are expecting.
Google is serving the page you're asking for, but the problem is that urllib2 is not equipped to render it. To fix this, you'll have to use a scraping framework that supports JavaScript. Alternatively, in this particular case, you could simply use the non-JavaScript version of Google for your scraping.
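A quick way to see this from Python (a minimal sketch using only the standard library; Python 3's urllib.parse is shown, the URL is the one from the question):

# Sketch: the fragment after '#' is kept client-side and never sent to the server.
from urllib.parse import urlparse

url = "https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl"
parts = urlparse(url)

print(parts.path)       # '/ncr' -> this is what the server actually receives
print(parts.fragment)   # 'q=chiropractors%2B...' -> only the browser's JavaScript ever sees this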

Python mechanize wait and click

Using mechanize, how can I wait for some time after page load (some websites have a timer before links appear, like in download pages), and after the links have been loaded, click on a specific link?
Since it's an anchor tag and not a submit button, will browser.submit() work? (I got errors when trying that.)
Mechanize does not offer JavaScript functionality, so you will not see dynamic content (like a timer that turns into a link).
As for clicking a link, you have to find the element and then call click_link on it. See the Finding Links section of the mechanize documentation.
If you are looking for something to handle such sites, a good option is PhantomJS. It is scripted in JavaScript and runs on the WebKit engine, allowing you to parse dynamic content. If you have your heart set on Python, using Selenium to programmatically drive a real browser may be your best bet.
If it's an anchor tag, then just GET/POST whatever URL it points to.
The timer before links appear is generally done in JavaScript; some sites you are attempting to scrape may not be usable without JavaScript, or may require a token generated client-side in JavaScript.
Depending on the site, you can either extract the wait time (in milliseconds or seconds) and time.sleep() for that long, or you'll have to use something that can execute JavaScript, as sketched below.
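A minimal sketch of the sleep-then-follow approach with mechanize (the URL, wait time, and link text are placeholders; this only works if the final link is already in the HTML rather than generated by JavaScript):

# Sketch: mechanize can't run the JavaScript timer, but if the target link is already
# in the page source you can wait the advertised time and then follow it.
import time
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://example.com/download-page")

time.sleep(10)                         # wait roughly as long as the page's countdown timer

link = br.find_link(text="Download")   # locate the anchor tag by its text
response = br.follow_link(link)        # GET the link's target; no submit() needed
print(response.geturl())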

Urllib Counts as WebPage Hit?

I'm studying opening pages through Python (3.3).
import urllib.request

url = 'http://www.google.com'
page = urllib.request.urlopen(url)
Does the above code count that as one hit to Google or does this?
import os
os.system('start chrome.exe google.com')
The first one fetches the page, while the second one actually opens the page in a browser. I was just wondering whether it makes a difference hit-wise.
The two do very different things.
Using urllib.request.urlopen makes a single HTTP request.
Your second example will do the same, and then the browser will parse the document it receives and request subsequent resources (images/JavaScript/CSS/whatever). So loading google.com in your browser will trigger many hits.
Try it yourself by looking at your browser's developer tools (usually the Network section) while you load a page.
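A small sketch to illustrate the difference (the urlopen call is the single hit; counting the referenced tags is only a rough proxy for what a browser would additionally request):

# Sketch: urlopen issues exactly one HTTP request; a browser would go on to fetch
# every image/script/stylesheet referenced by that HTML as additional requests.
import re
import urllib.request

with urllib.request.urlopen('http://www.google.com') as page:   # one hit
    html = page.read().decode('utf-8', errors='replace')

# Rough count of the extra resources a real browser would request afterwards.
extra = re.findall(r'<(?:img|script|link)\b[^>]*\b(?:src|href)=', html, re.IGNORECASE)
print("resources a browser would also fetch:", len(extra))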
