While this web page is loading, the browser makes many requests.
I need the URL of one particular request made during loading (e.g. one that starts with 'http://api.le.com/mms/out/video/playJson?'). That request is only made when the Adobe Flash Player NPAPI plugin is enabled. Is there any way to get the URL?
P.S. Better to show some code, I am new to this area.
Scrapy doesn't see the requests a page makes while it is loading. Either you know exactly which URL you want and request it directly,
or
you use something like scrapy-splash, which can return a HAR file listing all the requests made while loading the page. The only downside is that Splash doesn't return the contents of each request, only the headers =(
If you absolutely need the contents of the request, it's best to use Selenium with browsermob. If you find a better solution, please do tell.
EDIT
It seems Splash can now return the bodies as well; see @Mikhail's comment.
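Whichever tool produces the HAR, picking out the request you need comes down to filtering the entries by URL prefix. A minimal sketch, assuming the HAR has already been loaded into a Python dict in the standard HAR layout (log, entries, request, url). The helper name `find_request_urls` and the sample entries are made up for illustration:

```python
def find_request_urls(har, prefix):
    """Return every request URL in a HAR dict that starts with `prefix`."""
    entries = har.get("log", {}).get("entries", [])
    return [
        entry["request"]["url"]
        for entry in entries
        if entry.get("request", {}).get("url", "").startswith(prefix)
    ]


# Hand-written sample in HAR layout; a real HAR would come from the tool.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "http://www.le.com/index.html"}},
            {"request": {"url": "http://api.le.com/mms/out/video/playJson?id=1"}},
        ]
    }
}

PREFIX = "http://api.le.com/mms/out/video/playJson?"
print(find_request_urls(sample_har, PREFIX))
```

Note that this only finds requests the rendering tool actually made; if the request depends on a browser plugin, the tool has to load the page under the same conditions.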
Hello, is there a way to use two different website URLs and switch between them?
I mean I have two different websites, like this:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! im in first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! im in second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it in Requests. Is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium drives a real browser (optionally in headless mode), so it fully simulates a user. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve javascript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON formatted data (with no HTML styling, or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object, and the connection is closed.
Selenium allows you to traverse through complex, javascript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything that your browser does except displaying the HTML (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the connection remains open. This allows you to navigate through a complex site where you would need the full URL of the final page to use Requests.
Because of this distinction, it makes sense that Selenium would have a switch_to.window method, but Requests would not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from stackoverflow, secondPage contains the response from youtube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a browser.
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
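To make the "never in a page" point concrete: with Requests, each call returns its own response object, and you simply work with whichever one you need. A minimal sketch; `fetch_pages` and `main` are illustrative names, and `main` needs the third-party `requests` package plus network access:

```python
def fetch_pages(session, urls):
    """GET each URL through the same session and return {url: response}."""
    return {url: session.get(url, timeout=10) for url in urls}


def main():
    # Third-party import kept local so fetch_pages has no dependencies.
    import requests

    urls = ["https://stackoverflow.com", "https://youtube.com"]
    with requests.Session() as session:
        pages = fetch_pages(session, urls)

    # Both responses stay available; there is nothing to "switch" between.
    for url, response in pages.items():
        print(url, response.status_code, len(response.text))
```

Calling `main()` fetches both sites; afterwards you read from `pages[url]` for whichever site you want, in any order.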
I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using Beautiful Soup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' element and its class. The code runs, but it returns an empty list '[ ]'. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_ = "jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page on your browser, the browser renders the visuals, does extra network calls, and runs javascript. The first thing it does is load the initial response, which is what you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer')
The page you see in your browser is the result of many, many more requests.
The reason your list is empty is that the HTML you get back is very minimal. You can print it to the console and compare it with what the browser shows.
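One way to confirm this without eyeballing the output is to count how many elements in the HTML actually carry the class you are searching for. A minimal sketch using only the standard library's `html.parser` (not BeautifulSoup); `count_tags_with_class` is an illustrative helper, and the two sample strings merely stand in for the server response and the browser-rendered DOM:

```python
from html.parser import HTMLParser


class _ClassCounter(HTMLParser):
    """Counts start tags of one type whose class attribute includes a token."""

    def __init__(self, tag, class_token):
        super().__init__()
        self.tag = tag
        self.class_token = class_token
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_token in classes:
            self.count += 1


def count_tags_with_class(html, tag, class_token):
    parser = _ClassCounter(tag, class_token)
    parser.feed(html)
    return parser.count


# Stand-ins: the bare response vs. what the browser builds after running scripts.
raw_html = "<html><body><div id='app'></div></body></html>"
rendered_html = "<ul><li class='jobs-search-results__list-item p0'>Job</li></ul>"

print(count_tags_with_class(raw_html, "li", "jobs-search-results__list-item"))
print(count_tags_with_class(rendered_html, "li", "jobs-search-results__list-item"))
```

Running the same check on `requests.get(...).text` tells you whether the elements are in the initial response at all.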
To make things easier, instead of using requests you can use Selenium which is essentially a library for programmatically controlling a browser. Selenium will do all those requests for you like a normal browser and let you access the page-source as you were expecting it to look.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed things up, like running in headless mode, which means the page isn't rendered graphically, but it won't be as fast as figuring out how to do it on your own with requests.
If you want to do it using requests you're going to need to do a lot of snooping through the requests, maybe using a tool like postman, and see how to simulate the necessary steps to get the data from whatever page.
For example some websites have a handshake process when logging in.
A website I've worked on goes like this:
1. (Step 0, really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included
2. Fetch the initial HTML and get a unique key from a hidden element in a <form>
3. Using this key, make a POST request to the URL from that form
4. Get a session id key from the response
5. Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools
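The handshake above can be sketched roughly as follows. Since the site is not named, every specific here is hypothetical: the parameter names, the `session_id` JSON key, and the assumption that the handshake response is JSON at all. `session` is expected to behave like a `requests.Session`:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Some sites ignore requests that lack a User-Agent header.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}


class _LoginFormParser(HTMLParser):
    """Collects the first form's action URL and any hidden input values."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value")


def login(session, login_page_url, auth_url, username, password):
    # Fetch the initial HTML and read the hidden key from the form.
    page = session.get(login_page_url, headers=HEADERS)
    form = _LoginFormParser()
    form.feed(page.text)

    # POST the key to the URL named in the form's action attribute.
    form_url = urljoin(login_page_url, form.action)
    handshake = session.post(form_url, data=form.hidden, headers=HEADERS)

    # Read the session id from the response (assumed here to be JSON).
    session_id = handshake.json()["session_id"]

    # POST username, password, and session id to the URL that was found
    # with the browser's network inspector.
    credentials = {
        "username": username,
        "password": password,
        "session_id": session_id,
    }
    return session.post(auth_url, data=credentials, headers=HEADERS)
```

Passing a real `requests.Session()` keeps the cookies from each step, which handshakes like this usually depend on.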
So really, I work strictly with Selenium if it's too complicated and I'm only getting the data once or not so often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope any of this made sense to you. Happy scraping!
I am using Python requests to ping a site that uses the ASP.NET framework. One of the URLs is giving me trouble: the response is the exact same page that I posted to, but the browser does not behave this way with the same URL. It refreshes with a new URL and all (though I do not think it is technically redirecting). What are some ways I can try to troubleshoot this? I would provide code and links, but it is a secured website and requires authentication/subscription.
I am trying to scrape data using scrapy from https://www.ta.com/portfolio/business-services, however the response is NULL. I am looking to scrape the href values inside div.tiles.js-portfolio-tiles using the code response.css("div.tiles.js-portfolio-tiles a::attr(href)").extract()
I think this has something to do with the ::before that appears just before this, but maybe not. How do I go about extracting this?
The elements that you are interested in retrieving are loaded by your browser using javascript. By default scrapy is not able to load elements using javascript as it is not a browser, it simply retrieves the raw HTML.
Scrapy shell is an invaluable tool for inspecting what is available in the response that scrapy receives.
This set of commands will open the response in your default web browser:
$ scrapy shell
>>> fetch("https://www.ta.com/portfolio/business-services")
>>> view(response)
As you can see, the js-portfolio-tiles are not visible because they have not been loaded.
I have had a look at the AJAX requests in the network panel of the developer tools, and it appears that the information you require may be available in an XHR request. If it is not, you will need additional software to run the JavaScript, namely scrapy-splash or Selenium. I would advise exploring the AJAX (XHR) request first, though, as this will be much faster and easier.
See this question for additional details on using your browser's dev tools to inspect AJAX requests.
I'm trying to parse the HTML of a page with infinite scrolling. I want to load all of the content so that I can parse it all. I'm using Python. Any hints?
Those pages update their HTML with AJAX. Usually you just need to find the new AJAX requests sent by the browser, work out the meaning of the AJAX URL parameters, and fetch the data from the API.
API servers may validate the User-Agent, Referer, cookies, oauth_token, and so on of the AJAX request, so keep an eye on them.
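As a rough illustration of that approach: once the AJAX endpoint is known, the "scrolling" becomes a loop that requests page after page until the API returns nothing. A minimal sketch using only the standard library; the endpoint URL, the `page` parameter, the header values, and the JSON shape (a list of items per page) are all assumptions for illustration:

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://example.com/api/items"  # hypothetical endpoint
HEADERS = {
    # Values a server might check; copy the real ones from the browser.
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
    "Referer": "https://example.com/",
    "X-Requested-With": "XMLHttpRequest",
}


def build_request(page):
    """Build the request the browser would send for one 'scroll' step."""
    url = API_URL + "?" + urlencode({"page": page})
    return urllib.request.Request(url, headers=HEADERS)


def fetch_page(page):
    """Fetch and decode one page of results (performs a network call)."""
    with urllib.request.urlopen(build_request(page), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_all(fetch=fetch_page, max_pages=1000):
    """Request consecutive pages until one comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:
            break
        items.extend(batch)
    return items
```

Real endpoints often page by offset or cursor instead of a page number, so the loop condition and parameters need to match whatever the browser's network panel shows.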
The data is either loaded in advance, or the page sends a request while you scroll.
You can use HttpFox to find the request and send it yourself.