I am trying to scrape an app store, specifically the reviews for every app. The issue is that the reviews are dynamic and the client side makes an AJAX POST call to get that data. The payload for this call is also generated dynamically for every page. I was wondering if there is any way to intercept this request through code to get the payload, which I could then use to make the call with requests or any other library. I am able to make this call through Postman using the parameters I get from the Network activity panel of the browser's developer tools.
I could use Selenium to scrape the fully loaded page, but it waits for the entire page to load, which is highly unoptimized since I don't need to wait for the whole page.
payload = "<This is dynamically created for every page and is constant for that given page>"
header = {"Content-Type": "application/x-www-form-urlencoded"}
url = 'https://appexchange.salesforce.com/appxListingDetail'
r = requests.post(url=url, data=payload, headers=header)
I was wondering if it's possible to get this payload through the scraper by intercepting all the AJAX calls made while it scrapes the base web page.
If you want to scrape a page that contains scripts and dynamic content, I would recommend using puppeteer.
It is a scriptable headless browser (it can also be used in non-headless mode) that loads the page exactly like a real browser, with all its dynamic content and scripts. So you can simply wait for the page to finish loading and then read the rendered content.
You can also use Selenium.
Also check this out: a list of almost all headless browsers
UPDATE
I know it's been some time, but I will post the answer to your comment here, in case you still need it or just for the sake of completeness:
In Puppeteer you can asynchronously wait for a specific DOM node to appear on the page, or for some other event to occur, and then carry on with your work without caring about the rest of the page (it will still load in the background, but you don't need to wait for it).
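Since you're already working in Python, the same idea maps onto Selenium with an explicit wait. Here is a minimal sketch, where the listing URL and the CSS selector for the review container are placeholders you'd replace with the real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/app-listing")  # placeholder for the real listing page

# Wait only until the review container shows up, instead of the whole page.
# ".review-card" is an assumed selector -- substitute whatever the site uses.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".review-card"))
)

html = driver.page_source
driver.quit()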
Hello, is there a way to use two different website URLs and switch between them?
I mean, I have two different websites, like:
import requests

session = requests.Session()

firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")

secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it with Requests, so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium is a browser-automation tool that drives a real (optionally headless) browser and fully simulates a user. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve javascript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON formatted data (with no HTML styling, or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object, and the connection is closed.
Selenium allows you to traverse through complex, javascript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything that your browser does except displaying the HTML (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the connection remains open. This allows you to navigate through a complex site where you would need the full URL of the final page to use Requests.
Because of this distinction, it makes sense that Selenium would have a switch_to_window method, but Requests would not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from Stack Overflow, secondPage contains the response from YouTube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a full browser.
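To illustrate, both responses remain available at the same time through their variables; there is nothing to "switch" between:

import requests

session = requests.Session()

firstPage = session.get("https://stackoverflow.com")
secondPage = session.get("https://youtube.com")

# Each response object is independent and stays usable after the next request.
print(firstPage.status_code, len(firstPage.text))    # Stack Overflow HTML
print(secondPage.status_code, len(secondPage.text))  # YouTube HTML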
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using BeautifulSoup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct li element and class. The code runs, but it returns an empty list '[ ]'. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_ = "jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As #baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is what you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').
The page you see in your browser is the result of many, many more requests made after that initial one.
The reason your list is empty is that the HTML you get back is very minimal. You can print it to the console and compare it to what the browser shows.
To make things easier, instead of using requests you can use Selenium, which is essentially a library for programmatically controlling a browser. Selenium will make all those requests for you like a normal browser and let you access the page source the way you expected it to look.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed it up, like running in headless mode, which means the page isn't rendered graphically, but it still won't be as fast as figuring out how to do it on your own with requests.
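A minimal sketch of that approach (the headless flag may differ slightly between Chrome versions, you may still need explicit waits for the job cards to render, and LinkedIn may gate results behind a login):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # don't render the page graphically

driver = webdriver.Chrome(options=options)
driver.get("https://www.linkedin.com/jobs/search/?keywords=security%20engineer")

# page_source now contains the markup built by JavaScript, not just the initial response
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

jobs = soup.find_all("li", class_="jobs-search-results__list-item occludable-update p0 relative ember-view")
print(len(jobs))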
If you want to do it using requests, you're going to need to do a lot of snooping through the requests, maybe using a tool like Postman, and work out how to simulate the necessary steps to get the data from whatever page.
For example, some websites have a handshake process when logging in.
A website I've worked on goes like this (a rough code sketch follows the steps):
(Step 0 really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included
Fetch the initial HTML and get a unique key from a hidden element in a <form>
Using this key, make a POST request to the URL from that form
Get a session id key from the response
Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools
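A rough sketch of that flow with requests; every URL, form field, and key name below is made up, since the real ones differ per site:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # step 0: the site ignores requests without a User-Agent
session = requests.Session()

# step 1: fetch the initial HTML and pull the unique key out of the hidden <form> element
login_page = session.get("https://example.com/login", headers=headers)
soup = BeautifulSoup(login_page.text, "lxml")
form = soup.find("form")
hidden_key = form.find("input", {"type": "hidden"})["value"]

# step 2: POST that key to the URL from the form; step 3: read a session id from the response
resp = session.post("https://example.com" + form["action"],
                    data={"key": hidden_key}, headers=headers)
session_id = resp.json().get("sessionId")  # where the id lives depends on the site

# step 4: POST username, password and session id to the login endpoint found in the devtools
session.post("https://example.com/auth",
             data={"user": "me", "password": "secret", "sessionId": session_id},
             headers=headers)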
So really, I stick with Selenium if the site is too complicated and I'm only getting the data once or not very often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope any of this made sense to you. Happy scraping!
I have a single-page application of sorts that composes XHR requests on the fly. It is used to implement pagination for a list of links I want to click on using Selenium.
The page only provides a "Go to next page" link. When I click the next-page link, a JavaScript function creates an XHR request and updates the page content.
Now when I click on one of the links in the list I get redirected to a new page (again through JavaScript with obfuscated request generation). Although this is exactly the behaviour I want, when going back to the previous page I have to start over from the beginning (i.e. start at page 0 and click through to page n again).
A few solutions came to my mind:
Block the second XHR request when clicking on the links in the list, store it, and replay it later. This way I can skim through the pages but keep my links for replay later.
Somehow 'inject' the first XHR request which does the pagination, to save myself from clicking through all the pages again.
I was also trying out some simple proxies, but HTTPS is causing trouble for me, and I was wondering if there is any simple solution I might have missed.
browsermobproxy integrates easily and will allow you to capture all the requests made. It should also allow you to block certain calls from returning.
It does sound like you are scraping a site, so it might be worth working out what data the XHR calls send and mimicking them.
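A rough sketch of that setup with the browsermob-proxy Python bindings and Selenium; the path to the browsermob-proxy binary and the target URL are placeholders:

from browsermobproxy import Server
from selenium import webdriver

server = Server("/path/to/browsermob-proxy/bin/browsermob-proxy")  # point this at your install
server.start()
proxy = server.create_proxy()

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server={proxy.proxy}")
options.add_argument("--ignore-certificate-errors")  # helps with the https trouble mentioned above

driver = webdriver.Chrome(options=options)
proxy.new_har("pagination", options={"captureContent": True})
driver.get("https://example.com/list")  # placeholder URL

# ... click through the pagination here; every request ends up in the HAR ...
for entry in proxy.har["log"]["entries"]:
    print(entry["request"]["method"], entry["request"]["url"])

driver.quit()
server.stop()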
I've read some relevant posts here but couldn't figure an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and a user has to press "Show more" to get 10 more reviews (which also appends #add10 to the end of the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that in my spider, site_url#add1000 returns only the first 10 reviews, just like site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I have tried all of this:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers}, cookies={all_cookies})
But every time I'm getting a 404 Error. Can anyone explain what I'm doing wrong?
What you need here is a headless browser, since the requests module cannot handle AJAX-rendered content well.
One such option is Selenium, which can drive a browser in headless mode.
For example:
driver.find_element(By.ID, "show more").click()  # this is just an example case
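For instance, with Selenium you can keep clicking the link until it disappears; the URL, link text and review selector below are guesses you'd replace with the real ones:

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/reviews")  # placeholder URL

while True:
    try:
        driver.find_element(By.LINK_TEXT, "Show more").click()  # assumed link text
        time.sleep(1)  # give the AJAX call time to append the next 10 reviews
    except NoSuchElementException:
        break

reviews = driver.find_elements(By.CSS_SELECTOR, ".review")  # assumed selector
print(len(reviews))
driver.quit()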
Normally, when you scroll down the page, AJAX sends a request to the server, and the server then responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL linked to this JSON/XML file. Normally, you can open Firefox, go to Tools / Web Developer / Network, monitor the network activity, and easily catch this JSON/XML request.
Once you find this request, you can parse the reviews directly from its response (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
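Once you have spotted that request in the network monitor, fetching it usually boils down to something like this; the endpoint and parameter names are placeholders for whatever you actually see in the devtools:

import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints expect this header
}

# copy the real endpoint and parameters from the network monitor
resp = requests.get("https://example.com/ajaxlst",
                    params={"par1": "x", "par2": "y"},
                    headers=headers)

data = resp.json()  # or parse resp.text with bs4 if the server returns HTML/XML
for review in data.get("reviews", []):
    print(review)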
I'm studying opening pages through Python (3.3).
import urllib.request

url = 'http://www.google.com'
page = urllib.request.urlopen(url)
Does the above code count as one hit to Google, or does this?
import os

os.system('start chrome.exe google.com')
The first one scrapes the page while the second one actually opens the page in a browser. I was just wondering if it makes a difference page-hit-wise.
Both do very different things.
Using urllib.request.urlopen makes a single HTTP request.
Your second example will do the same, and then the browser will parse the document it receives and request subsequent resources (images/JavaScript/CSS/whatever). So loading google.com in your browser will trigger many hits.
Try it yourself by looking at your browser's developer tools (usually the Network section) while you load a page.